Search Relevance
One of the major benefits of MarkLogic's search-engine-style features is support for ordering results by relevance. It's often the case that one result is a better match than another, even if both satisfy the constraints. Relevance is based on a mathematical formula that assigns a score to each result, with the highest score considered most relevant.
MarkLogic can use several inputs in the process of determining relevance: frequency of term appearance (more appearances are more relevant), proximity of terms to each other (terms appearing close together are more relevant), document position of terms (words in the title are more relevant than words in the main body), document length (longer documents naturally contain more term appearances, so high counts should matter less), inherent document quality (some documents are naturally more important than others, in the spirit of Google's PageRank), preciseness of term matches (an exact term match might mean more than a fuzzier word match), geographic proximity (weighting items with points closer to a given location), and any hierarchical boolean combination of the above. The programmer controls these knobs and decides which are in effect and with what weightings. The end result: results in beautiful relevance order.
Here's a query where you can see a few relevance knobs in action. We specify four places to look for the match $word, with different weighting on each:
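A minimal sketch of such a query, reconstructed from the description that follows rather than taken from MarkMail's source. The sample values bound to $lists and $types, the type attribute name, and the assumption that message is the document root are all illustrative guesses; the weights, the "unstemmed" option, and the top-100 cutoff come from the text.

```xquery
xquery version "1.0-ml";

(: Assumed sample inputs; the list and type values are hypothetical. :)
let $lists := ("org.example.dev", "org.example.users")
let $types := ("general", "announcement")
let $word  := "release"

let $query :=
  cts:and-query((
    (: the message was sent to one of $lists :)
    cts:element-attribute-value-query(
      xs:QName("message"), xs:QName("list"), $lists),
    (: the message is classified as one of $types; @type is an assumed name :)
    cts:element-attribute-value-query(
      xs:QName("message"), xs:QName("type"), $types),
    (: $word must appear in at least one of four places, each weighted :)
    cts:or-query((
      (: list name: unstemmed, triple weight :)
      cts:element-attribute-word-query(
        xs:QName("message"), xs:QName("list"), $word, "unstemmed", 3.0),
      (: regular text: default options, so stemming applies if the
         database has stemmed searches enabled :)
      cts:element-word-query(xs:QName("subject"),   $word, (), 2.0),
      cts:element-word-query(xs:QName("para"),      $word, (), 1.0),
      cts:element-word-query(xs:QName("quotepara"), $word, (), 0.5)
    ))
  ))

(: cts:search returns results in relevance order by default :)
for $m in cts:search(/message, $query)[1 to 100]
return string($m/subject)
```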
The $query requires that the message was sent to one of the $lists, is classified as one of the given $types, and that the cts:or-query rule is true as well. The cts:or-query looks for $word in four places: in the message/@list attribute, or inside any subject, para, or quotepara element. If it's a list match, it should be unstemmed (usually proper nouns shouldn't be stemmed). If it's in regular text then stemming is allowed. Appearances of the term in the list name are worth triple score, in subject double score, in para regular score, and in quotepara half score.
Besides placement weighting, documents are also implicitly weighted based on quality. In MarkMail we've set it up so more recent documents have a higher quality. Announcement messages also have a higher quality. So in this "top 100" list you'll see announcement mails tend to rise to the top.
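As a rough illustration of how a document quality might be assigned, the sketch below uses xdmp:document-set-quality. The URI and the quality value are hypothetical, and this is not necessarily how MarkMail computes or stores its quality settings.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI and quality value, chosen only for illustration.
   A higher quality nudges the document's relevance score upward. :)
xdmp:document-set-quality("/messages/example-announcement.xml", 10)
```

How strongly quality influences the final score can be tuned per search: cts:search accepts an optional quality weight as its fourth argument.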
In the results you'll see "Released" a lot in the subject. That's because we have a weighting that prefers matches in the subject. You might remember we searched for "release" but are matching "Released" so you can see how the query ran case-insensitive and stemmed.
Try changing "release" in the query to a phrase like "thank you" which rarely appears in a subject line (I'm not sure what that says about open source mailing lists). You can then confirm the query prefers subject line matches but doesn't require them.
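One way to check this is to print each result's score next to its subject. The sketch below is a reduced version of the query, keeping only the subject and para clauses; the assumption that message is the document root and the cutoff of 10 results are illustrative choices.

```xquery
xquery version "1.0-ml";

(: A multi-word string is matched as a phrase. :)
let $word := "thank you"
let $query :=
  cts:or-query((
    cts:element-word-query(xs:QName("subject"), $word, (), 2.0),
    cts:element-word-query(xs:QName("para"),    $word, (), 1.0)
  ))
for $m in cts:search(/message, $query)[1 to 10]
(: cts:score reports the relevance score of a node returned by cts:search :)
return <hit score="{cts:score($m)}">{string($m/subject)}</hit>
```

Messages that match only in a para still appear in the output, just with lower scores than any that match in the subject.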