Performance of XML Lexicons Constrained By cts:query Constructors

by Ian Small

Categories: Performance; Feature
Altitude: 1,000 feet

In a previous column, I talked at some length about XML lexicons (XML leprechauns for those of you in the know) and what you can do with them. In that column, I introduced the idea of using a cts:query constructor to constrain the output, and promised some future writings about the performance implications inherent in taking advantage of this capability. Well, now that we've had a chance to examine the ins and outs of term list and range indexes (see my last column), we've laid the necessary foundation to do that promised double-click on performance implications. So here we go...

Step 1: Recipe for Constraining XML Lexicons

To understand the performance implications inherent in constraining the output of any of the XML lexicons using the optional cts:query parameter, you need first to remember that there are fundamentally two types of lexicons: word lexicons and value lexicons. If this is news to you, go back to this column and start reading from there.

The word lexicon accessors (eg. cts:words(), cts:words-match(), cts:element-words(), etc.) perform look-ups into purpose-built data structures containing lists of words. Value lexicons, on the other hand, take advantage of the data structures that are used for range indexes. So while the accessors look the same, the underlying data structures are quite different.

Big deal, you say. Well, in fact this does turn out to be a rather significant, so the words "big deal" are far from misplaced.

Range indexes relate ordered values to fragments that contain those values (look here for a recap). The word lexicon structure, on the other hand, just contains a list of words. Term lists are what relate specific words to fragments that contain those words. To see why this is significant, let's walk through a simplified sequence of steps for constraining XML lexicon results using that optional cts:query constructor parameter.

  1. Evaluate the cts:query constructor to determine the constraining set of fragment references
  2. Advance through the appropriate lexicon data structure containing candidate words or values
  3. For each candidate word (or value), make sure that it matches the accessor criteria (eg. the wildcard pattern, if one was provided)
  4. For each candidate word (or value) passing step 3, verify that it exists in a fragment included in the constraining set of fragment references from step 1
  5. For each candidate word (or value) passing step 4, return the candidate word (or value)
  6. Repeat from step 2 until sufficient candidate words (or values) are returned

Step 2: How Term Lists and Range Indexes Get Used

At first glance, this procedure seems relatively straight-forward. The thorny bit shows up in step 4, "...verify that [the candidate word or value] exists in a fragment included in the constraining set of fragment references...". Let's think about what we would actually do to accomplish this.

If we're talking about a word lexicon (eg. cts:words(), cts:element-words()), then the answer lies in the word or element-word term list for the candidate entry. First we intersect that term list with the constraining set of fragment references from the optional cts:query constructor. An empty result tells us to ignore this candidate entry. On the other hand, if there is at least one fragment reference produced by that intersection, we can conclude that the candidate word or element-word exists within the constraining set.

Of course, to carry out this operation we might first need to go fetch the term list from disk, and decompress it. So there's one of those pesky I/Os showing up again...

On the other hand, if we're talking about a value lexicon, then the range index itself contains the fragment references for the candidate value in question. So as we walk through the range index to determine candidate values, we simply intersect each value's fragment references with the constraining set of fragment references from the optional cts:query constructor. We use the same rules: an empty result means throw out this candidate, an intersection with at least one result means we've got a win.

The intersection itself takes more or less the same amount of work for a range index as it does for a term list. But if we assume that the range index is in frequent use, the fragment references for the values are probably already cached, so we get to avoid the disk fetch we encountered above.

So on a gross basis, you could say that resolving cts:query constraints using a value lexicon is done entirely in memory, whereas resolving cts:query constraints using a word lexicon requires I/Os. But that's a little simple-minded, so let's take this analysis a step further.

Step 3: How Many I/Os Are We Talking About?

I/Os themselves are a fact of life. It's not that requiring an I/O is the world's worst idea, it's just that you want to avoid a lot of them. So if, for instance, you want to display twenty words from a constrained word lexicon, well, then twenty-odd I/Os would be a pretty managable cost function if you've got a decently configured I/O subsystem.

But the challenge is that it's not necessarily just twenty-odd I/Os. The server will do an I/O for each returned word lexicon entry (or potentially more than one, if the database has multiple forests and/or the forests have multiple stands), but it will also have to perform I/Os for any lexicon entries that were ruled out by the cts:query constraint.

Put another way, if the cts:query constraint eliminates 99,980 out of a possible 100,000 lexicon words, and the 20 lexicon words that work are the last entries in the lexicon, then if we start at the beginning of the lexicon, the server will first examine the 99,980 term lists for the 99,980 words that don't work before it makes its way to the last 20. Which could make for a lot of I/Os.

Of course, that example is pretty pathological, and while it's possible, it's unlikely to happen to you. (If it does, repeatedly, you should either examine your XQuery programming skills or go invest in lottery tickets.)

So here's a (more useful) rule of thumb to apply for word lexicons:

If the cts:query constraint selects a very small subset of the database — a statistically unrepresentative sample set — then you may have to examine a lot of words (meaning a lot of I/Os) to find the twenty you're looking for.

On the other hand, if the cts:query constraint selects a statistically representative subset of the entire database, the odds are that most words in the lexicon will exist in the subset. So you are not likely to end up in the pathologically bad edge case I outlined above.

And if it's a value lexicon, well then, rock on... because it will still be reasonably fast even if the subset is quite small.

Okay, that's the basic picture of XML lexicons and performance, at least as far as cts:query constructor constraints are concerned. Next time, a couple more quick hits about other aspects of lexicon performance, and we'll wrap this topic up.