Understanding the Performance of Term Lists vs. Range Indexes

by Ian Small

Categories: Performance; Indexing
Altitude: 200 feet

As I warned in the first column, the material in Small Changes will cover a pretty big range in terms of technical complexity. While every column will be targeting a technical audience, some columns are going to go a lot deeper than others. Hence the altitude warning at the top of each column. This one is going to fly pretty low, so hang on to your hat...

When we're confronted with trying to meet an interesting performance challenge with MarkLogic Server, we often benefit from a pretty significant selection of rabbits we keep up our sleeve. Why so many rabbits? Mostly because the inner workings of the server are fairly complex, and we have the advantage of knowing most everything there is to know about it. So for us, it turns out that choosing the optimal rabbit is usually the biggest part of the problem.

This column discusses index resolution, and in particular, the differences between how term lists and range indexes are resolved. Understanding this in some detail will equip you with one of your first rabbits, which I hope you will be able to deploy to significant advantage at some time in the future.

Understanding Term Lists

If you've attended a MarkLogic Server technical overview, you will likely be familiar with some of the following statements.

  1. MarkLogic Server is a database built using search engine approaches.
  2. MarkLogic has generalized from term lists to fact lists.
  3. MarkLogic has generalized from documents to parts of documents.

Let's take a look at each one of these in more detail:

MarkLogic Server is a database built using search engine approaches.

When we say this, what we mean is that on the outside, MarkLogic Server looks a lot like a database. It has a general-purpose query language, can perform updates, is transactional, etc. On the inside, however, MarkLogic Server looks a lot like a search engine - rather than performing index resolution using complex B-tree indexes (like an RDBMS), we perform term list intersection (like a search engine). So term lists, and term list intersection, are a core component of how MarkLogic resolves queries using indexes.

MarkLogic has generalized from term lists to fact lists.

Unlike search engines, we don't just use term lists to indicate the presence of a word (a term in search engine parlance) in a given document. We have generalized the approach, using term lists to indicate the presence of arbitrary facts - whether it's the presence of an element or the membership of a document in a given collection. While we still call them term lists, a more accurate descriptor would probably be "fact lists".

MarkLogic has generalized from documents to parts of documents.

Finally, rather than mapping term occurrence to documents, we have generalized that idea so that we can relate specific terms (oops, let's make that specific facts) to specific localities of documents (fragments in MarkLogic-speak). This notion takes advantage of the spatial coherence and locality of related information in most XML documents, and works particularly well with large XML documents.

Hmmm....

So if I put on my wizardly deduction hat, what I might be able to tell from all this is that rather than maintaining a relatively small number of complex indexes (as does an RDBMS), MarkLogic maintains a large number (easily millions, sometimes much more) of simple indexes based on term lists. A term list maps a particular fact to a set of fragment references, each of which refers to a part of some document. And now for the important part: each term list that is involved in resolving a given query needs to be in memory, which means that we need to do an I/O to fetch it from disk (unless that term list was recently used and is still in cache).

To understand how term lists are used in index resolution, consider the following simple request consisting of a single XPath:

  //authors/author[type = "primary"]

To resolve this query, the server needs to figure out what fragments to fetch from disk, fetch them, then navigate them to return the appropriate author elements. For the purposes of today's column, we're most interested in the very first step — figuring out what fragments to fetch from disk.

To do this, the server will analyze the XPath to identify a series of facts that are necessary characteristics of any fragment that might contain a value to be returned. In this case, the server might choose to intersect term lists for the following facts:

element(authors) 12 17 34 51 97 115 119 133 ...
element(author) 5 12 19 34 47 51 85 115 119 121 ...
element(type) = "primary" 12 51 115 119 121 ...

The first term list above captures the fact that the element authors exists in a fragment. The second term list captures the fact that the element author exists in a fragment. And the third term list captures the fact that an element type with value "primary" exists in a fragment.

Note: The rocket scientists in the crowd may already have realized that this example is grossly simplified. In fact, such a simple set of term lists is not sufficient to guarantee that all candidate fragments are found. If you figured this out, well done — bear with me through the example, because it's enough to communicate the main point of the column. If I've now confused you even more, forget that I said anything — it's our job to worry about this on your behalf...

Because the fragment references in the term lists are kept in a consistent order (which we can call database order), the intersection can be performed efficiently using an O(n) algorithm (matching references shown in brackets):

element(authors) [12] 17 34 [51] 97 [115] [119] 133 ...
element(author) 5 [12] 19 34 47 [51] 85 [115] [119] 121 ...
element(type) = "primary" [12] [51] [115] [119] 121 ...

In this way, we combine term lists to yield candidate fragments for retrieval from the database:

candidate fragments 12 51 115 119 ...

Now we can continue evaluating the request by fetching these fragments, and navigating them to return the appropriate author elements.
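To make the mechanics concrete, here is a toy sketch of that O(n) intersection in Python. This is purely for exposition (the server's actual internals look nothing like this); it walks a cursor through each sorted fragment-reference list in lock-step, like a k-way merge:

```python
def intersect(term_lists):
    """Intersect sorted fragment-reference lists in O(n) total work,
    advancing a cursor through each list in lock-step (a k-way merge)."""
    result = []
    cursors = [iter(tl) for tl in term_lists]
    try:
        current = [next(c) for c in cursors]
        while True:
            hi = max(current)
            if min(current) == hi:
                result.append(hi)                  # present in every list
                current = [next(c) for c in cursors]
            else:
                for i in range(len(current)):      # catch lagging cursors up
                    while current[i] < hi:
                        current[i] = next(cursors[i])
    except StopIteration:
        return result                              # any exhausted list ends it

# The three term lists from the example above:
print(intersect([
    [12, 17, 34, 51, 97, 115, 119, 133],           # element(authors)
    [5, 12, 19, 34, 47, 51, 85, 115, 119, 121],    # element(author)
    [12, 51, 115, 119, 121],                       # element(type) = "primary"
]))  # [12, 51, 115, 119]
```

Because every list is already in database order, each fragment reference is examined at most once, no matter how many term lists participate.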

Term Lists Have Their Shortcomings

Term lists are great for a lot of things. They are extremely compact, very fast to navigate, and we can maintain an almost endless supply of them at relatively little cost.

Surprise, surprise, there's no such thing as a free lunch. All of this goodness does come at some cost. One cost is easy to understand. Term lists relate a single fact to a set of fragments. If what you are trying to retrieve matches more than one fact, then you will need more than one term list. If it matches a lot of facts, then you will need a lot of term lists, each of which might require an I/O. Since I/Os are generally the enemy, this is not a good thing.

To understand why this can be painful, imagine trying to use term lists to find all the articles published between July 1, 1999 (that's Canada Day, in case you're interested) and February 14, 2000 (you probably already know that one). Assuming each article includes an element specifying the publication date, a quick examination of my calendar tells me that's 229 different days, meaning 229 I/Os. At about 100 random I/Os per second (a reasonable expectation from a single spindle), that's over 2 seconds. Not good.
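You can sanity-check both the day count and the rough I/O budget in a couple of lines of Python:

```python
from datetime import date

# One term list per distinct publication date means roughly one random
# I/O per day in the range (when nothing is cached).
days = (date(2000, 2, 14) - date(1999, 7, 1)).days + 1   # inclusive count
seconds = days / 100            # at ~100 random I/Os per second
print(days, round(seconds, 2))  # 229 2.29
```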

The second cost is a little more obscure. It turns out that term lists only work one-way: they map a fact to a set of fragments. You can't use a term list to work from a set of fragments back to a fact.

That one hurts when you want to do something like sort a set of documents you've already selected, but only return the first ten. The term lists are great at selecting the documents in the first place - they can identify the 50,000 appropriate documents in a flash. But what you want is the first ten, ordered by author name. Oops! Term lists don't help you map from fragment to fact, in this case from documents to author names. Without that, you need to load all 50,000 documents, examine them to extract the author name, sort by those extracted author names, and only then will you know which ten documents to return first. Using the basic performance math from above, this could easily take minutes!

So what this tells us is that while term lists are great at some things, they're not the be-all and end-all of indexing strategies. To address the shortcomings outlined above, MarkLogic Server integrates another kind of index that we call a range index.

Range Indexes Work Two Ways

Range indexes use a more complex indexing structure than term lists. Instead of mapping a fact to a set of fragments, they map values to fragments (and vice versa) for a given QName. Range indexes are type-specific (e.g., an xs:int is interpreted differently than an xs:dateTime), and only work for simple types. Range indexes can be configured for element values (maintained by what the Admin Interface calls an Element Index) or for element-attribute values (an Element-Attribute Index). To learn more about range indexes, take a look at Chapter 15 of the Administrator's Guide.

For a given QName and type, a range index lets the server map values to fragments, and fragments to values. The former capability is used to support "range predicates" such as:

  /docs[pub-date >= xs:date("1999-07-01") and
        pub-date <= xs:date("2000-02-14")] 

The latter is used to support fast order by operations, so that if a range index is present, the sort operation I described above has a decent shot at avoiding 49,990 unneeded I/Os.
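To illustrate that fragment-to-value direction (again, a toy Python sketch with hypothetical values, not the actual data structure), suppose the range index effectively gives us an in-memory mapping from fragment reference to author name:

```python
def top_n_by_value(selected, frag_to_value, n=10):
    """Order already-selected fragment refs by their indexed value and keep
    the first n - touching only the in-memory index, never the documents."""
    return sorted(selected, key=lambda f: frag_to_value[f])[:n]

# Toy index: fragment ref -> author name (hypothetical values).
author_index = {7: "Adams", 3: "Baker", 9: "Chen",
                1: "Dukas", 5: "Eco", 2: "Frey"}
print(top_n_by_value([9, 1, 7, 2, 5, 3], author_index, n=3))  # [7, 3, 9]
```

Only the winning fragments then need to be fetched from disk; the other selected documents are never loaded at all.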

For the moment, however, let's concentrate on the former capability. To resolve the sample XPath shown above, the server might intersect the indexes below:

element(docs) 9 16 21 35 42 67 91 113 121 139 ...
range(pub-date) ...
1999-06-28:
91 35 183 61
1999-07-19:
118 21 4 241 42 83
1999-07-23:
51 9 139
...

Notice that the fragment references for the range index are not simple fragment reference lists. Instead, they are a hybrid representation that incorporates values and fragment references, storing the references in value order. The first thing we need to do, then, is extract the fragment references from the range index that correspond to any value that matches the original predicate provided:

element(docs) 9 16 21 35 42 67 91 113 121 139 ...
range(pub-date)
(extracted)
(118 21 4 241 42 83) (51 9 139) ...

Progress has been made. At this point, we have lists of fragment references to intersect, but the list extracted from the range index still does not exhibit the convenient characteristic that term lists bring to the table — the references aren't in database order. Rather than attempting a complex intersection algorithm, we simplify the problem by sorting the value-ordered range-index-extracted fragment references into database order:

element(docs) 9 16 21 35 42 67 91 113 121 139 ...
range(pub-date)
(sorted)
4 9 21 42 51 83 118 139 241 ...

Then we perform the intersection as we normally would (matching references shown in brackets):

element(docs) [9] 16 [21] 35 [42] 67 91 113 121 [139] ...
range(pub-date)
(sorted)
4 [9] [21] [42] 51 83 118 [139] 241 ...

to yield the candidate fragments to retrieve from the database:

candidate fragments 9 21 42 139 ...
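The whole extract-sort-intersect pipeline can be sketched in a few lines of illustrative Python, using the values and fragment references from the example above (the real on-disk representation is, of course, nothing this simple):

```python
def range_lookup(range_index, lo, hi):
    """Gather fragment refs for every value in [lo, hi], then sort them
    into database order so an ordinary O(n) intersection can follow."""
    refs = []
    for value, frags in range_index.items():
        if lo <= value <= hi:    # ISO dates compare correctly as strings
            refs.extend(frags)
    return sorted(refs)

pub_date = {
    "1999-06-28": [91, 35, 183, 61],
    "1999-07-19": [118, 21, 4, 241, 42, 83],
    "1999-07-23": [51, 9, 139],
}
docs = [9, 16, 21, 35, 42, 67, 91, 113, 121, 139]   # element(docs) term list

in_range = range_lookup(pub_date, "1999-07-01", "2000-02-14")
print(in_range)                           # [4, 9, 21, 42, 51, 83, 118, 139, 241]
print(sorted(set(docs) & set(in_range)))  # [9, 21, 42, 139]
```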

Gee, range indexes sound great... So why don't we use them for everything? Clearly, we can use them for the fact to fragment mapping that term lists provide.

Ah, Virginia, there really is no free lunch. First of all, range indexes need to be configured in advance of indexing the content. Term lists just work without knowing anything about the content. Second, range indexes are type-specific. If you have a QName that contains an xs:date in one document, an xs:float in another, and a complex type in a third, range indexes aren't much of an option. And third, range indexes are much larger and more complex data structures than term lists - which makes them slower to create and more memory intensive to use.

So like many things, range indexes are great when the application fits their design. I briefly outlined those design criteria above. So if the problem doesn't look like one of those, perhaps range indexes are not the right answer.

Or perhaps they are...

Finding Your First Rabbit

As with many parts of MarkLogic Server, applying features to what they were designed for is only the start of the party. Creative misuse is ever so much more fun.

Because range indexes are large, complex data structures, the server memory maps them to help rapidly access the relevant sections of the range index as needed. While the memory mapping chews up some address space (not a terrible problem in most 64-bit systems), it provides significant benefits from an I/O standpoint.

First of all, a frequently used range index is likely to be fully cached in memory (just like a term list). But when a range index is fully cached in memory, it can provide matches for many predicates (unlike a term list, which is only good for one). And second, when the range index was originally read in, the server probably did so with a single long contiguous read — which is better than having to read in 1,000 different term lists, each of which probably requires a random I/O.

What this tells us is that if your performance requirement is driving you to eliminate I/Os as thoroughly as you can, range indexes can be a useful tool.

For instance, imagine being in possession of the aggregated e-mail archives for Enron Corporation, and wanting to try to resolve the following query:

For every e-mail thread deemed to be interesting, how many times was this e-mail thread contributed to in each month of the years 1999, 2000 and 2001?

  (: First use cts:search to identify
     $threads of interest, then ... :)

  for $thread in $threads
  for $year in (1999, 2000, 2001)
  for $month in (1 to 12)
  return
    xdmp:estimate(
      /email[thread-id = $thread]
            [year = $year]
            [month = $month]
    ) 

Thinking about this query in terms of I/Os, you rapidly come to the conclusion that there are:

  • 3 term lists representing the 3 years
  • 12 term lists representing the 12 months
  • hundreds of thousands of term lists representing the hundreds of thousands of e-mail threads

Each xdmp:estimate() is therefore going to resolve a query by intersecting at least these four indexes:

term list element(email)
term list element(year) = $year
term list element(month) = $month
term list element(thread-id) = $thread

Thinking through this index intersection, the element(email) term list will be cached after the very first xdmp:estimate(). The element(year) = $year and element(month) = $month predicate tests will be completely cached for all values of $year and $month after the first $thread has been completely evaluated. But each $thread will require an I/O to load in the appropriate element(thread-id) = $thread term list. And if a particular line of inquiry involves looking at thousands of threads, then thousands of I/Os will be required.
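The arithmetic here is simple but sobering. For a hypothetical line of inquiry touching 5,000 threads (the thread count is an assumption purely for illustration):

```python
# Hypothetical line of inquiry: 5,000 threads of interest, with the
# element(email), year, and month term lists already cached.
threads = 5000
ios_per_thread = 1    # one uncached element(thread-id) term list each
seconds = threads * ios_per_thread / 100   # ~100 random I/Os per second
print(seconds)  # 50.0
```

Nearly a minute of wall-clock time spent doing nothing but reading term lists off disk.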

This is where range indexes can become interesting. Assuming that the element thread-id is some kind of number (it could just as easily be a string), and assuming that an appropriate range index is configured for the QName thread-id, then the query can be optimized with one change - casting $thread to xs:integer in the first predicate:

  (: First use cts:search to identify
     $threads of interest :)

  for $thread in $threads
  for $year in (1999, 2000, 2001)
  for $month in (1 to 12)
  return
    xdmp:estimate(
      /email[thread-id = xs:integer($thread)]
            [year = $year]
            [month = $month]
    ) 

The optimizer automatically uses the range index (instead of the corresponding term list) if a range index of the appropriate type for the relevant QName is available. So this query will instead be resolved by intersecting these four indexes:

term list element(email)
term list element(year) = $year
term list element(month) = $month
range index xs:integer(thread-id) = $thread

Now, as we discussed above, the fragment references in range indexes are retrieved in value order, not in database order. So performing this intersection actually takes a bit more work than the previous one. First, the appropriate section of the range index (for the value $thread in the range index xs:integer(thread-id)) needs to be identified. Then the fragment references need to be extracted and sorted into database order. Only then can the intersection proceed as it did before.

Remembering that range indexes are memory mapped and can be rapidly cached, the benefit of this approach should become obvious. Unlike the term-list-only approach, an I/O is no longer required for each $thread. Instead, we have to do some fragment reference extraction and some sorting.

The tipping point, then, is when the cost to read in and decompress a term list is the same as the cost to extract and sort the range index fragment references. On a gross basis, take it from me that the former operation is dominated by the I/O, and the latter by the sort time. Now let's assume that (at 100 I/Os per second) reading a term list takes roughly 10 ms, and that a decent processor can perform more than 1M sort ops per second. It doesn't take a Mensa member to figure out that as long as you expect there will be fewer than 10,000 emails per thread on average, range indexes are probably a win. In my experience, there are usually far fewer than 100 emails in a typical e-mail thread, in which case you're winning big time. With this very simple change, odds are that you're making this query run 100x faster.
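The back-of-the-envelope version of that break-even point, using the assumed figures above:

```python
IO_MS = 10              # ~10 ms per random term-list read (100 I/Os per second)
SORT_OPS_PER_MS = 1000  # ~1M sort ops per second on a decent processor

# How many fragment references can we sort in the time one I/O takes?
break_even = IO_MS * SORT_OPS_PER_MS
print(break_even)  # 10000
```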

So yup, there's your rabbit — we've flushed it right out into the open for you!

It's been our experience that this style of query can be quite powerful for analytical applications - whether you're looking at Enron e-mails, evaluating chatter on signal intercepts, or figuring out who's citing whom in technical articles.

Understanding how range indexes can be applied to accelerate certain types of repeated queries can take an application from something you run during your coffee break to something that works so fast you can't do without it.

And that's a great example of how a small change can make a big difference.

Cheers
ian

Comments

  • Hi Ian, I am confused: does a term list contain the Universal Index, Word Index, Range Index, etc.? Can you please clarify.
    • Each of the types of indexes you mentioned makes use of term lists. Those term lists map from a piece of information (which may be a term, or for the Universal Index, a term's presence in an element or an element hierarchy) to a list of fragment identifiers. You might find it helpful to read the Wildcard Search in MarkLogic post (http://developer.marklogic.com/blog/smallchanges/2006-07-05#comment-2416583525), which talks more about how indexes work.