Result Streams from Range Indexes

by Bradley Mann

The other day I needed to write an XQuery script to collect all the values from a range index and group them into "sets" of 1000. I ended up with something like this:

This query performs fine on a small set of values (5000), but when we increase the number of values pulled from the range index, we see that this call

let $group := $values[(($i * $groupsize) + 1) to (($i + 1) * $groupsize)]

quickly becomes the long pole in the tent. In fact, for a sample size of 50,000 values (50 groups), 91% of the execution time is taken by this one call, 2.3 seconds for just 50 calls. Increasing the sample size to values above 1,000,000 and it's clear that this query will no longer even run in a reasonable amount of time. So what's going on here? Shouldn't sequence accesses be lightning fast?

As it turns out, our cts:element-values() call isn't doing exactly what one might initially think. Rather than returning an in-memory sequence of values, it actually returns a stream, which is loaded lazily as needed. This optimization limits memory use in the (common) situations where you don't need the entire sequence. In my case, though, it doesn't help. 50 sequence accesses are actually 50 stream accesses, each time streaming the results from the first item (grabbing items at the end of the stream takes longer).

In order to get around this issue, there's a handy function called xdmp:eager(), which avoid lazy evaluation. But there's another easy "trick" that will reliably ensure you're working with an in-memory sequence rather than a stream. Simply, drop into a "sub-flwor" statement to generate a sequence from the return value:

let $values := for $i in cts:element-values(xs:QName("sample:value")) return $i

Now, $values is no longer a stream, but rather an in-memory sequence. Accessing subsequences within it is now much faster, particularly values at the end of the sequence.

This made my day.

Comments