Search Relevance

One of the major benefits of MarkLogic's search-engine-style features is support for ordering results by relevance. One result is often a better match than another, even when both satisfy the constraints. Relevance is computed by a scoring formula that assigns a score to each result, and the result with the highest score is considered the most relevant.

MarkLogic can use several inputs when determining relevance: frequency of term appearance (more appearances are more relevant), proximity of terms to each other (terms appearing close together are more relevant), document position of terms (words in the title are more relevant than words in the main body), document length (longer documents naturally contain more term appearances, so high counts should matter less), inherent document quality (some documents are naturally more important than others, much as with Google's PageRank), preciseness of term matches (an exact term match might mean more than a fuzzier word match), geographic proximity (weighting items with points closer to a given location), and any hierarchical boolean combination of the above. The programmer controls these knobs and decides which are in effect and with what weightings. The end result: results in beautiful relevance order.

Consider a query that puts a few of these relevance knobs into action. It looks for the match $word in four places, with a different weight on each.

The $query requires that the message was sent to one of the $lists and is classified as one of the given $types, and that the cts:or-query rule is true as well. The cts:or-query looks for $word in four places: in the message/@list attribute, or inside any subject, para, or quotepara element. A match in the list name must be unstemmed (proper nouns usually shouldn't be stemmed). A match in regular text may be stemmed. Appearances of the term in the list name are worth triple score, in subject double score, in para regular score, and in quotepara half score.
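The query described above can be sketched roughly as follows. This is a reconstruction from the description, not necessarily the exact query MarkMail runs: the sample values bound to $lists, $types, and $word are illustrative, and the message/@type attribute is an assumption about where the classification is stored. The explicit "stemmed" option also assumes stemmed searches are enabled on the database.

```xquery
xquery version "1.0-ml";

(: Illustrative values only; the text does not give the real ones. :)
let $lists := ("org.example.dev", "org.example.users")
let $types := ("announcements", "development")
let $word  := "release"

let $query :=
  cts:and-query((
    (: The message was sent to one of the lists. :)
    cts:element-attribute-value-query(
      xs:QName("message"), xs:QName("list"), $lists),
    (: The message is one of the given types (@type is an assumption). :)
    cts:element-attribute-value-query(
      xs:QName("message"), xs:QName("type"), $types),
    (: The word appears in at least one of four places. :)
    cts:or-query((
      (: List name: unstemmed, triple weight. :)
      cts:element-attribute-word-query(
        xs:QName("message"), xs:QName("list"), $word, "unstemmed", 3.0),
      (: Subject: stemmed, double weight. :)
      cts:element-word-query(xs:QName("subject"), $word, "stemmed", 2.0),
      (: Body paragraph: stemmed, regular weight. :)
      cts:element-word-query(xs:QName("para"), $word, "stemmed", 1.0),
      (: Quoted paragraph: stemmed, half weight. :)
      cts:element-word-query(xs:QName("quotepara"), $word, "stemmed", 0.5)
    ))
  ))

(: cts:search returns results in relevance order; keep the top 100. :)
return cts:search(/message, $query)[1 to 100]
```

The final argument to each word query is its weight. The two value queries leave it at the default of 1.0.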

Besides placement weighting, documents are also implicitly weighted based on quality. In MarkMail we've set things up so that more recent documents have a higher quality. Announcement messages also have a higher quality. So in this "top 100" list you'll see that announcement mails tend to rise to the top.
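As a minimal sketch of how a quality value can be attached to a document, the built-in xdmp:document-set-quality function takes a document URI and an integer quality. The URI and the value 5 below are hypothetical, and the text does not say how MarkMail actually computes its quality values.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI and quality value, for illustration only. :)
(: A higher quality raises the document's relevance score. :)
xdmp:document-set-quality("/messages/example-announcement.xml", 5)
```

How strongly quality influences the final score can be tuned per search through the optional quality-weight argument of cts:search.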

In the results you'll see "Released" a lot in the subject. That's because we have a weighting that prefers matches in the subject. You might remember that we searched for "release" but are matching "Released", which shows that the query ran case-insensitive and stemmed.

Try changing "release" in the query to a phrase like "thank you", which rarely appears in a subject line (I'm not sure what that says about open source mailing lists). You can then confirm that the query prefers subject line matches but doesn't require them.
