How indexing makes XPath fast

by Evan Lenz

Last year, I built my first MarkLogic app (RunDMC, on which this site is based). I used a lot of XSLT and a dash of XQuery and everything worked just as I expected it would. It also performed reasonably well. When I built the app, I was blissfully ignorant about how I might optimize the code to make it perform even better. But lately that's what I've been learning about, and it's all starting to make sense. A lot of this is obvious to people who have already been working with the technology for a while. But for beginners, I thought it might help to share some of these insights.

Whenever MarkLogic evaluates an XPath expression (which is what both XSLT and XQuery use to extract information from XML documents), two steps typically take place:

  1. Get a list of documents ("fragments" actually, but I'll leave that for another blog post)
  2. Open those documents to extract relevant results

Consider this expression:

collection()/Article/Title

A naïve implementation would grab all available documents, open each one, and get the <Title> child of <Article>, if present. In fact, that's the only option for an XQuery (or XSLT) implementation that doesn't use a database. Saxon, for example, can look in all the XML documents in the current file system directory using collection("?select=*.xml")/Article/Title. Since these documents have not been indexed or pre-loaded, there's no way of knowing whether any of them even have an <Article> element without first opening them up to see. This works fine up to a point. If you try to load too many documents at once, you will run out of memory. Also, having to read the file and parse the XML at the time the expression is evaluated is inherently slower than if the documents were somehow pre-loaded. Sometimes this is unavoidable, such as when the XML you're processing is generated on the fly. But when XML is how you actually store your content, and when you have a lot of it, an indexed implementation like MarkLogic comes in handy.
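
To see what that naive evaluation amounts to, here is the same expression spelled out as an explicit loop. This is just ordinary XQuery, runnable with a processor like Saxon, with the two steps from above called out as comments:

for $doc in collection("?select=*.xml")  (: step 1: get (and parse) every document :)
return $doc/Article/Title                (: step 2: open each one and extract results :)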

One way MarkLogic makes XPath evaluation faster is through the use of indexing. For our purposes here, you can conceptually think of the documents as sitting on the file system (we'll ignore for now the fact that documents are pre-parsed, compressed, and cached in memory to make reads much faster). Let's call this the "document store." With a non-database-backed XQuery or XSLT implementation, that's all you have: the document store. But MarkLogic stores its data in two places:

  • the document store, and
  • the index

The index does not contain every piece of information about the documents. For example, you cannot necessarily reconstruct an entire document from the index alone. In that sense, you can think of it like a back-of-the-book index: it lets you look up information in the book more quickly, but it does not substitute for the book itself. Let's carry the analogy further with an example. Say I'm reading a cookbook, and I want to look up every recipe that has ginger in it. If my cookbook didn't have an index, I'd have to flip through the entire book and visually scan each page for the word "ginger." Broken into two steps, the process might be:

  1. Get a list of pages
  2. Scan each page for "ginger," filtering out the ones that don't have it

Without an index, the answer to step 1 is "all the pages in the cookbook." That will take some time. But if my cookbook has an index, then I will see a listing that looks something like this:

Ginger, 15, 32, 49, 50, 57, 65, 66

Now I no longer have to scan the whole book. I only have to look at those 7 pages to find what I'm looking for. I won't even have to scan those pages to filter out what I don't want: the index alone tells me that each of them contains a recipe with ginger in it. We could say that my request is fully searchable from the index. Not only that, it doesn't require me to do any filtering. (Technically, my filtered results would be identical to my unfiltered results, so filtering would be wasted effort.)
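
If you want to see that in code, here's a toy model of a fully searchable lookup. It uses MarkLogic's map API purely as a stand-in for the real index, with the made-up page numbers from the analogy; nothing about the data is real:

xquery version "1.0-ml";
let $index := map:new(map:entry("ginger", (15, 32, 49, 50, 57, 65, 66)))
return map:get($index, "ginger")  (: the entire answer comes straight from the index; no page is ever opened :)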

But what if I only want ginger recipes that don't require any heating/cooking? It would be great if my cookbook's index included a listing like "Ginger - Raw." But what if it doesn't? I'd have to make do with the index listing I have, so my process would look like this:

  1. Get a list of pages (7 altogether)
  2. Scan each page to see if the recipe requires heating or not

Now I have to do some visual filtering (step 2), but at least I don't have to scan the whole book. In this case, we could say my query is partially searchable from the index.
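
Here's the same toy model extended to the partially searchable case. The local:requires-heating() function is a made-up stand-in for opening a page and reading the recipe, which is exactly the filtering work the index couldn't do for us:

declare function local:requires-heating($page as xs:integer) as xs:boolean
{
  (: stand-in for opening a page and scanning it; pretend that
     only pages 15, 50, and 66 describe raw recipes :)
  not($page = (15, 50, 66))
};

(: step 1: the index hands us 7 candidate pages :)
let $ginger-pages := (15, 32, 49, 50, 57, 65, 66)
(: step 2: each candidate must still be opened and checked :)
return $ginger-pages[not(local:requires-heating(.))]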

One last example: let's say I want to find all recipes that include both ginger and tuna. In that case, I'd look up two different entries in the index—the one for "ginger" and another one for "tuna," which might look like this:

Tuna, 13, 32, 49, 57, 80

Now all I have to do is get the intersection of those two listings. In this case, I can quickly see that pages 32, 49, and 57 contain both "ginger" and "tuna," because those pages appear in both index listings.
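
Note that the intersection itself never touches the book, only the two listings. In plain XQuery, with the page numbers from the analogy:

let $ginger := (15, 32, 49, 50, 57, 65, 66)
let $tuna := (13, 32, 49, 57, 80)
return $ginger[. = $tuna]  (: yields 32, 49, 57 :)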

If you haven't gathered this already, each cookbook page in the analogy corresponds to an XML document (fragment, actually). And just as the back-of-the-book index saves us time when looking for ginger recipes, MarkLogic's index saves the server time when looking for documents. In that sense, the index plays an inherently negative, subtractive role: it lets you ignore a subset of the documents because you already know, for example, that they don't have any <Title> elements. Another way of thinking about it is that, for any given query, the index reduces the size of your document store (while still giving you the same results as if you had scanned all the documents).

I've been throwing around the terms "fully searchable" and "partially searchable." As it turns out, these are technical terms that apply to both XPath expressions and word search queries. MarkLogic Server analyzes each path expression to see how searchable it is. Even if it's not fully searchable, it might be partially searchable (as in the case of our request for raw ginger recipes). In that case, it reduces the number of documents that it has to "filter" (another technical term, describing the scanning process). As you might expect, learning what makes an XPath expression searchable or not will go a long way towards helping you write highly performant code on MarkLogic Server. Stay tuned for another blog article on just this topic.
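
You don't have to take MarkLogic's word for it, either. Something like the following, run from Query Console, turns on tracing for the current query; the trace messages (which, as far as I can tell, end up in the server's log) show how each step of the expression was analyzed:

xquery version "1.0-ml";
xdmp:query-trace(true()),   (: enable query tracing for this request :)
collection()/Article/Title  (: the trace shows how each step gets resolved :)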

Taking this full circle: our original expression collection()/Article/Title is a fully searchable XPath expression, which is to say that each of its three steps (collection(), Article, and Title) is searchable. So even if we have billions of documents in our database, MarkLogic will only open the documents that contain those elements in that arrangement. Of course, I had always assumed this to be the case; I just didn't have a very clear idea of how MarkLogic did it. But now, after having played a bunch with xdmp:query-trace() and after having studied "Inside MarkLogic Server," it's all much clearer to me. And I hope it is for you too!

UPDATE: We've now posted a tutorial on how to write fast queries.
