Text Search

All the examples above show MarkLogic Server acting like a database: a database of documents, sure, but still just doing the core job of matching, retrieving, sorting, and counting. MarkLogic can also act like a search engine, with rich support for human language and the unique challenges involved in written text. This section digs into those features.

Let me start with a question: Should a query for hello world match a document containing the text "Hello, Worlds!"? Maybe, but maybe not. The capitalization is different, the punctuation is different, and one is singular while the other is plural. Close enough? MarkLogic gives you, the programmer, the ability to decide, with flags to control case-sensitivity, punctuation-sensitivity, diacritic-sensitivity, stemming, and thesaurus expansion in matches.

The basic building block of text searches is a cts:query object. The cts namespace stands for "core text search". Here's a basic example:
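A minimal sketch of such a query, assuming the loaded documents contain <subject> elements (the //subject path is illustrative and may need adjusting for your data):

```xquery
xquery version "1.0-ml";

(: First argument: the scope of the search (assumed here to be //subject).
   Second argument: the cts:query that each result must satisfy. :)
cts:search(//subject, cts:word-query("release"))
```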

This searches for <subject> elements that have within them the word "release". The first argument to cts:search() dictates the scope of the search. The second argument dictates the match constraint. A simple cts:word-query just tries to match the given word (or phrase).

Because you didn't specify any options, the cts:word-query uses some sensible defaults. It's case-insensitive because the term was all lowercase, which implies no preference for case; had it used any capital letters, the query would've been case-sensitive. It also runs stemmed because we have the stemming index enabled and that's a good default for searching text. Because it's case-insensitive and stemmed, you'll see "RELEASE" and "Released" as valid matches. We can control the options with a second argument:
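As a sketch, here is the same search with a single option string. The choice of "unstemmed" is an assumption for illustration; any option string that cts:word-query accepts could go in its place:

```xquery
xquery version "1.0-ml";

(: "unstemmed" turns off stemming, so "Released" no longer matches,
   while "RELEASE" still does because the query stays case-insensitive. :)
cts:search(//subject, cts:word-query("release", "unstemmed"))
```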

The second argument accepts a sequence of strings. Up above we passed a single string. Is that legal? Yes, in XQuery. In XQuery there's no difference between a single value and a sequence of length one containing that value. The following query passes a sequence of strings to also require case sensitivity in the matches:
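A sketch of that query, again assuming "unstemmed" as the first option and //subject as the scope:

```xquery
xquery version "1.0-ml";

(: The options are passed as a parenthesized sequence of strings.
   With both set, only the exact lowercase word "release" matches. :)
cts:search(
  //subject,
  cts:word-query("release", ("unstemmed", "case-sensitive"))
)
```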

That might be easier to read using a FLWOR:
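One way to write it, as a sketch with hypothetical variable names ($options and $query), is:

```xquery
xquery version "1.0-ml";

(: Same search as before, with each piece bound to a named variable. :)
let $options := ("unstemmed", "case-sensitive")
let $query := cts:word-query("release", $options)
return cts:search(//subject, $query)
```

Binding the query to a variable also makes it easy to reuse or to combine with other queries.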

There are dozens of cts:query constructors. Here's one that uses a boolean constructor along with some query constructors that specify in which element or element-attribute location the match has to be found:
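The sketch below shows the general shape. The element and attribute names (message, list, year-month) and the value "2008-05" are hypothetical stand-ins for a mailing-list archive; substitute the names used in your own documents:

```xquery
xquery version "1.0-ml";

(: All three sub-queries must match for the and-query to match.
   The message, list, and year-month names are assumptions. :)
let $query := cts:and-query((
  (: word containment inside the list attribute :)
  cts:element-attribute-word-query(
    xs:QName("message"), xs:QName("list"), ("httpd", "firefox")),
  (: full value match on the year-month attribute :)
  cts:element-attribute-value-query(
    xs:QName("message"), xs:QName("year-month"), "2008-05"),
  (: word containment inside subject elements :)
  cts:element-word-query(xs:QName("subject"), "OT")
))
return cts:search(/message, $query)
```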

The $query variable is a cts:and-query object containing three other queries, all of which have to be satisfied for the whole cts:and-query to be satisfied. The first is a cts:element-attribute-word-query. This says we're looking inside a given element-attribute for a particular word, in this case "httpd" or "firefox". We're limiting our view to lists that have those words in their names. The second is a cts:element-attribute-value-query. By changing "word-query" to "value-query" we're saying we're not looking for word containment but full value matching. The third is a cts:element-word-query saying we're looking inside subject elements for the word "OT". In list parlance that means the post is knowingly Off Topic. It's OK to search for 2-character words in MarkLogic. This also shows the value of a word match (like a search engine), compared to a simple character substring match (like a relational database).

Change "OT" to any word or phrase and have fun. For example:
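For instance, a sketch that passes several words and phrases at once. The search terms and the /message scope are made up for illustration:

```xquery
xquery version "1.0-ml";

(: A sequence of words or phrases in place of the single "OT" term.
   The terms and the /message path are hypothetical. :)
let $query := cts:element-word-query(
  xs:QName("subject"),
  ("security", "bug fix", "release notes")
)
return cts:search(/message, $query)
```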

This allows matches for any of those subjects. In many cases in XQuery you can pass a single item or a sequence to a function. With cts:query objects a sequence typically means to OR the options. You can specify multiple QNames or multiple values by just passing sequences.


Search Relevance
