Query, Search, and Indexing

Because MarkLogic is built on search-engine-style indexing techniques, a MarkLogic database query is fundamentally no different than a search. To say that again: a search is a query is a search.

To begin with, we’ll discuss some simple queries. We’ll move on to a brief introduction to search and a discussion of how richer queries can be evaluated efficiently using MarkLogic indexes. We’ll conclude by walking through the evaluation of a complex query, touching on how indexes are used to evaluate the query efficiently.

Simple Queries

Lookup by URI

Each document in MarkLogic is uniquely identified by its URI. Querying by URI is both trivial and fast. For example, via REST, you can do a simple GET passing the URI as a parameter. To find a document with URI, /my-document, you’d do:

curl -X GET --anyauth --user username:password \
  'https://myhost:port/v1/documents?uri=/my-document'

Lookup by Key Value

Finding documents based on the value of a key inside the document is also simple. For JSON documents, you can specify a property and for XML documents you can specify an Element or Attribute. Again, via the REST API, you can get a list of documents that have the JSON property color with a value of red, by:

curl -X GET --anyauth --user username:password \
  'https://myhost:port/v1/keyvalue?key=color&value=red'

To find documents with an Element, <size> and value of large, you’d do:

curl -X GET \
  --anyauth --user username:password \
  'https://myhost:port/v1/keyvalue?element=size&value=large'

Richer Queries

Beyond these simple lookups, queries can be expressed as:

A simple string (as with a search engine)
A structured query, using JSON or XML syntax
A stored procedure using a language called XQuery (think: like XPath on steroids)

Rather than describing the specific syntax of richer queries, we introduce search (remember: a query is a search). This enables us to review how MarkLogic uses its indexes to evaluate queries efficiently. Our introduction starts with a thought experiment.

Searching for Words

Imagine I give you ten documents printed out. I tell you I’m going to provide you with a word and you’ll tell me which documents have the word. What will you do to prepare? If you think like a search engine, you’ll create a list of all words that appear across all the documents and for each word keep a list of which documents have that word. This is called an inverted index, inverted because instead of documents having words its words having document identifiers. Each entry in the inverted index is called a term list. A term is just a technical name for something like a word. No matter which word I give you, you can quickly give me the associated documents by finding the right term list. This is how MarkLogic resolves simple word queries. (NB: To be clear, we call these, term lists, which might sound like we’re describing a list of terms. In actuality, a term list is a list of document ids associated with a specific term.)

Now let’s imagine I’m going to ask you for documents that contain two different words. That’s not hard. You can use the same data structure. Find all document ids with the first word, all document ids with the second, and intersect the lists. That results in the set of documents with both words. If you sort the term lists in advance and use a “skip list” implementation, the intersection can be done efficiently, and in logarithmic time. This is how MarkLogic resolves simple boolean queries.

It works the same with negative queries. If I ask for documents containing one word but excluding those with another, you can use the indexes to find all document ids with the first word, all document ids with the second word, and do a list subtraction.

Searching for Phrases

Now let’s say I’m going to ask you to find documents with a two-word phrase. You have three choices in how to resolve that query.

First, you can use the same data structure you have already: find documents that contain both words, and then actually look inside the candidate documents to determine if the two words appear together in a phrase.

Second, you can add word position information to each of your term list entries. That way you know at what location in the document each word appears. By looking for term hits with adjoining positions, you can find the answer using just indexes. This avoids having to look inside any documents.

Third, you can create new entries in your inverted index where instead of a simple word as the lookup key you put a two-word phrase. You make the two-word phrase the “term”. When later given a two-word phrase you can find the term list for that two-word phrase and immediately know which documents contain that phrase without further processing.

Which approach does MarkLogic take? It has the ability to take each of these approaches. Exactly which approach it chooses to take happens at runtime based on your database index settings.

If you have the fast phrase searches option enabled (the switch set to on), MarkLogic will incorporate two-word terms into its inverted index. With this index enabled, MarkLogic can use the third approach listed above. This makes phrase queries very efficient, at the cost of slightly larger indexes on disk and slightly slower insert performance to maintain the extra indexes.

If you have the word positions option enabled (another switch), MarkLogic will use position information to resolve the phrase, the second approach listed above. This index resolution isn’t as efficient as fast phrase searches because it requires position-matching work, but word positions can also support proximity queries that look for words near each other but not necessarily next to each other.

If neither the fast phrase searches or word positions index is enabled, then MarkLogic will use the simple word indexes as best it can and rely on filtering the results. Filtering is what MarkLogic calls the act of opening a document and checking if it’s a true match. Having the indexes off and relying on filtering keeps the indexes smaller, but will slow queries proportional to how many candidate documents aren’t actually matching documents. That is, how many have to be read as candidates but thrown away during filtering.

It’s important to note, the query results produced are the same in each case; it’s just a matter of index choices and performance tradeoffs. Because the fast phrase searches index doesn’t expand the inverted index very much, it’s a good choice when phrase searches are common.

Searching based on Document Structure

Everything up to this point is pretty much standard search engine behavior. (Well, except that traditional search engines, because they don’t hold the source data, can’t actually do the filtering and always return results unfiltered from their indexes.) Let’s now look at how MarkLogic goes beyond simple search to index document structure.

Say I’m going to ask you to find me documents that have a <title> element within them. What would you do? If you’re MarkLogic, you’d create a term list for element <title>, the same way you do for a word. You’d have a term list for each element, and no matter what element name gets requested you can deliver a fast answer.

Not many queries ask for documents containing just a named XML element, of course. So let’s say the question is more complex. Let’s try to find documents matching the XPath /book/metadata/title and for each return the title node. That is, we want documents having a <book> root element, with a <metadata> child, and a <title> subchild. (We can build equivalent queries for JSON data.) What would you do? With the simple element term list from above you can find documents having <book>, <metadata>, and <title> elements. That’s good, but it doesn’t know their hierarchical relationship. It’s like a phrase query that looks just for words.

Now imagine a parent-child index, which tracks parent-child element hierarchies. It’s just like a fast phrase searches index except, instead of word relationships, it indexes element relationships. With it you can find documents having a book/metadata relationship (that is, a <book> as a parent of a <metadata>) and a metadata/title relationship also. That produces an even better set of candidate documents. Now you can see how the XPath /a/b/c can be resolved very much like the phrase “a b c”.

The parent-child index lets you search against an XPath even when you don’t know the path in advance. The parent-child index is so useful with XML indexing that it’s one MarkLogic always maintains; there’s no configuration option to turn it off.

Note that even with the parent-child index there’s still the small potential for documents to be in the candidate set that aren’t an actual match. Knowing that somewhere inside a document there’s a <book> parent of <metadata> and a <metadata> parent of <title> doesn’t mean it’s the same <metadata> between them. That’s where filtering comes in: MarkLogic confirms each of the results by looking inside the document before the programmer sees them. While the document is open MarkLogic also extracts any nodes that match the query.

Remember: The goal of index resolution is to make the candidate set so small and accurate that there’s very little filtering needed.

Searching for Specific Values

Now let’s expand the query. Let’s start with the same XPath but require the title equal the phrase “Good Will Hunting”. How do we resolve this efficiently?

If we limited ourselves to the text and element indexes discussed so far, we would use indexes to find candidate documents that: have a book root element, have a book parent of metadata, have a metadata parent of title, have the phrase “Good Will”, have the phrase “Will Hunting”, and maybe have “Good”, “Will”, and “Hunting” located together positionally also if we have word positions enabled. That’s a great list but it pretty much ignores the need that this phrase has to appear as an exact match within a certain element. Not taking that restriction into consideration might lead to lots of filtering work as we find appearances of “Good Will Hunting” in places other than a title.

What you want to do, thinking back to the paper-based approach, is maintain a term list for each element value. In other words, you can track a term list for documents having a <title> of “Good Will Hunting”, as well as any other element name with any other value you find during indexing. Then, for any query asking for an element with a particular value, you can immediately resolve which documents have that element value direct from indexes.

Can an element-value index be stored efficiently? Yes, thanks to hashing. Instead of storing the full element name and text, you can hash the element name and the text down to a succinct integer and use that as the term list lookup key. Then no matter how long the element name and text string, it’s actually a small entry in the index. MarkLogic behind the scenes uses hashes to store all term list keys, element-value or otherwise, for sake of efficiency. The element-value index has proven to be so efficient that it’s always and automatically enabled within MarkLogic.

Searching for Text inside Structure

What if we’re not looking for a title with a specific value but simply one containing a word or phrase, like “Good Will”? The element-value index above isn’t any use, because it only matches full values. We need an index for knowing what words or phrases appear within what elements. That index option is called fast element word searches. When enabled it adds a term list for tracking element names along with the individual words within the element. In other words, it adds a term list entry when it sees a <title> with the word “Good”, another for a <title> with the word “Will”, and another for a <title> with the word “Hunting”. Using these indexes we can additionally confirm out of indexes that the words “Good” and “Will” are both present directly under a <title> element, which should improve our index resolution.

The indexes don’t know if they’re together in a phrase yet. If we want indexes to resolve at that level of granularity, MarkLogic has the fast element phrase searches and element word positions index options. The fast element phrase searchesoption, when enabled, maintains a term list for every element and pair of words within the element. It will have a term list for times when a <title> has the phrase “Good Will” in its text. The element word positions maintains a term list for every element and its contained words, and tracks the positions of all the contained words. Either or both of these indexes can be used to ensure the words are together in a phrase under the <title> element. Whether those indexes provide much benefit depends on how often “Good Will” (or any other queried phrase) appears in a place other than the <title>.

Whatever indexing options you have enabled, MarkLogic will automatically make the most of them in resolving queries.

Other Concerns and Knobs

Searching JSON

So far, we’ve really only discussed XML documents, even though MarkLogic also supports JSON. MarkLogic stores documents as compressed trees, based on the well-known XPath Data Model. Fortunately, this model is sufficiently featured to represent all sorts of documents, including JSON (and plain-text docs as well). In fact, both JSON and XML are represented using the same compressed tree structure (as of MarkLogic 8). Queries for JSON data are specified using JSON properties, in precisely the same way the examples above specify XML elements.

Reindexing

What happens if you decide to change your index settings after loading content? Just make the changes, and MarkLogic manages the update in the background. It’s a key advantage of having all the data in a transactional repository. When the background reindexing runs it doesn’t even stop the system from handling queries or updates. The system just background reloads and reindexes all the fragments affected by the change.

If you add a new index option, it’s not available to support requests until the reindexing has completely finished, because it’s only partially available until then. If you remove an index, it stops being used right away.

You can monitor the reindexing via the Administration UI or the Management API. You can also control the “Reindexer Throttle” from 5 (most eager) to 1 (least eager) or just turn it off temporarily when you don’t want any background work happening.

Relevance

When performing a full text query it’s not enough to just find documents matching the given constraint. The results have to be returned in relevance order. Relevance is a mathematical construct whose idea is simple. Documents with more matches are more relevant. Shorter documents that have matches are more relevant than longer documents having the same number of matches. If the search includes multiple words, some common and some rare, appearances of the rare terms are more relevant than appearances of the common terms. The math behind relevance can be very complex, but with MarkLogic you never have to do it yourself; it’s done for you when you choose to order by relevance. You also get many control knobs to adjust relevance calculations when preparing and issuing a query.

Evaluation of a Query

We’ve covered a lot so far. Now let’s walk through the step-by-step evaluation of a real query to bring home the topics we’ve discussed. We’ll perform a query that finds the top ten documents most relevant to a set of criteria:

The Document is an article
It was published in the year 2010
Its description contains the phrase “pet grooming”
The phrases “cat” and “puppy dog” appear within 10 words of each other in the document
The keyword section must not contain the word “fish”.

MarkLogic uses term lists to perform the query. Exactly what term lists it uses depends on the indexes you’ve enabled on your database. For simplicity in our explanation here, let’s assume all possibly useful indexes are enabled. In that case this search uses the following term lists:

A) Root element of <article>.
B) Element <article> with an attribute “year” equal to 2010.
C) Element <description> containing “pet grooming”.
D) Word “cat”.
E) Word “puppy dog”.
F) Element <keyword> containing “fish”.

MarkLogic performs set arithmetic over the term lists:

(A intersect B intersect C intersect D intersect E) subtract F

This produces a set of document ids. No documents outside this set could possibly be matches, and all documents within this set are likely matches.

The query has a positional constraint also. MarkLogic applies it against the candidate documents using the positional indexes. That limits the candidate document set even further to those documents where “cat” and “puppy dog” appear within 10 words of each other. MarkLogic then sorts the document ids based on relevance score, calculated using term frequency data held in the term lists and document frequency (rarity of the term) held in the database. This produces a score-sorted set of candidate document ids. At this point MarkLogic starts iterating over the sorted document ids. MarkLogic opens the highest scoring document and filters it by looking inside it to make sure it’s a match. If it’s not a match, it gets skipped. If it is, it continues on. And that’s it.

Written Tutorial