Text Search
All the examples above show MarkLogic Server acting like a database: A database of documents, sure, but still just doing the core job of matching, retrieving, sorting, and counting. MarkLogic can also act like a search engine, with rich support for human language and the unique challenges involved in written text. This section digs into those features.
Let me start with a question: Should a query for hello world match a
document containing the text "Hello, Worlds!"? Maybe, but maybe not. The
capitalization is different, the punctuation is different, and while one is
singular the other is plural. Close enough? MarkLogic gives you as the
programmer the ability to decide, with flags to control case-sensitivity,
punctuation-sensitivity, diacritic-sensitivity, stemming, and thesaurus
expansion in matches.
The basic building block of text searches is a cts:query object. The cts
namespace stands for "core text search". Here's a basic example:
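As a minimal sketch, assuming the documents hold <subject> elements in no namespace (the //subject path is an assumption about the data, not something the server requires), the call might look like this:

```xquery
xquery version "1.0-ml";

(: Scope: every <subject> element.
   Constraint: the element contains the word "release". :)
cts:search(//subject, cts:word-query("release"))
```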
This searches for <subject> elements that have within them the word "release".
The first argument to cts:search() dictates the scope of the search. The
second argument dictates the match constraint. A simple cts:word-query just
tries to match the given word (or phrase).
Because you didn't specify any options, the cts:word-query uses some sensible
defaults. It's case-insensitive because the term was lower-case, which implies
no preference for case; had it used any capital letters the query would've
been case-sensitive. It also runs stemmed because we have the stemming index
enabled and that's a good default for searching text. Because it's
case-insensitive and stemmed, you'll see "RELEASE" and "Released" as valid
matches. We can control the options with a second argument to cts:word-query:
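A sketch of what passing one option might look like. The choice of "unstemmed" here is an assumption made for illustration; it is one of the documented cts:word-query option strings, and //subject is the same assumed path as before:

```xquery
xquery version "1.0-ml";

(: "unstemmed" asks for the word as written, so other forms
   such as "Released" should no longer count as matches.
   The query stays case-insensitive: the term is all lower-case. :)
cts:search(//subject, cts:word-query("release", "unstemmed"))
```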
The second argument accepts a sequence of strings. Up above we passed a single string. Is that legal? Yes. In XQuery there's no difference between a single value and a sequence of length one containing that value. The following query passes a sequence of strings to also require case sensitivity in the matches:
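A sketch under the same assumptions (the //subject path and the "unstemmed" option are illustrative), now with two options in a parenthesized sequence:

```xquery
xquery version "1.0-ml";

(: Two options passed as one sequence: match the word exactly as
   written, and treat upper and lower case as different. :)
cts:search(
  //subject,
  cts:word-query("release", ("unstemmed", "case-sensitive"))
)
```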
That might be easier to read using a FLWOR:
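One possible FLWOR form, sketched with the same assumed path and options. The variable names $options and $query are arbitrary:

```xquery
xquery version "1.0-ml";

(: Name each piece so the search call itself stays short. :)
let $options := ("unstemmed", "case-sensitive")
let $query := cts:word-query("release", $options)
for $subject in cts:search(//subject, $query)
return $subject
```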
There are dozens of cts:query constructors. Here's one that uses a boolean
constructor along with some query constructors that specify in which element
or element-attribute location the match has to be found:
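A sketch of such a combined query follows. The constructor functions are real cts functions, but the data model is assumed: a <message> root element carrying a list attribute and a year-month attribute are hypothetical names chosen for illustration, as is the value "2007-10". Only the <subject> element and the words "httpd", "firefox", and "OT" come from the discussion here.

```xquery
xquery version "1.0-ml";

let $query :=
  cts:and-query((
    (: Word match inside an attribute: the (hypothetical) list
       attribute must contain the word "httpd" or "firefox". :)
    cts:element-attribute-word-query(
      xs:QName("message"), xs:QName("list"), ("httpd", "firefox")),
    (: Value match: the (hypothetical) year-month attribute must
       equal the given string as a whole. :)
    cts:element-attribute-value-query(
      xs:QName("message"), xs:QName("year-month"), "2007-10"),
    (: Word match inside the <subject> element. :)
    cts:element-word-query(xs:QName("subject"), "OT")
  ))
(: /message is an assumed document root; adjust to the actual data. :)
return cts:search(/message, $query)
```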
The $query variable is a cts:and-query object containing three other queries,
all of which have to be satisfied for the whole cts:and-query to be
satisfied. The first is a cts:element-attribute-word-query. This says we're
looking inside a given element-attribute for a particular word, in this case
"httpd" or "firefox". We're limiting our view to lists that have those words
in their names. The second is a cts:element-attribute-value-query. By
changing "word-query" to "value-query" we're saying we're not looking for word
containment but full value matching. The third is a cts:element-word-query
saying we're looking inside subject elements for the word "OT". In
list parlance that means the post is knowingly Off Topic. It's OK to search
for 2-character words in MarkLogic. This also shows the value of a word match
(like a search engine), compared to a simple character substring match (like a
relational database), which would also hit words such as "NOT".
Change "OT" to any word or phrase and have fun. For example:
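A sketch of one such variation, passing several words and phrases at once. The particular terms are made up for illustration, and /message is an assumed document root:

```xquery
xquery version "1.0-ml";

(: A sequence of words or phrases in place of the single "OT". :)
let $query :=
  cts:element-word-query(
    xs:QName("subject"),
    ("OT", "off topic", "humor"))
return cts:search(/message, $query)
```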
This allows matches for any of those subjects. In many cases in XQuery you
can pass a single item or a sequence to a function. With cts:query objects a
sequence typically means to OR the options. You can specify multiple QNames
or multiple values by just passing sequences.
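The same idea applied to element names, as a sketch. The <body> element and the /message root are hypothetical names, used only to show a sequence of QNames:

```xquery
xquery version "1.0-ml";

(: A match in either element satisfies the query. :)
let $query :=
  cts:element-word-query(
    (xs:QName("subject"), xs:QName("body")),
    "release")
return cts:search(/message, $query)
```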
Facets
Search Relevance