Result Streams from Range Indexes

by Bradley Mann

The other day I needed to write an XQuery script to collect all the values from a range index and group them into "sets" of 1000. I ended up with something like this:
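A minimal sketch of a query along these lines, built around the $group expression quoted further down, is shown here. It is not the original script: the sample namespace URI, the $numgroups variable, and the <group> wrapper element are assumptions for illustration, and a range index on sample:value is assumed to exist.

```xquery
xquery version "1.0-ml";

(: Hypothetical namespace URI; a range index on sample:value is assumed. :)
declare namespace sample = "http://example.com/sample";

let $groupsize := 1000
(: Pull every value from the range index. :)
let $values := cts:element-values(xs:QName("sample:value"))
(: Number of groups needed, rounding up for a partial last group. :)
let $numgroups := xs:integer(fn:ceiling(fn:count($values) div $groupsize))
for $i in 0 to $numgroups - 1
(: Slice out the i-th block of up to $groupsize values. :)
let $group := $values[(($i * $groupsize) + 1) to (($i + 1) * $groupsize)]
return <group n="{$i}">{ $group }</group>
```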

This query performs fine on a small set of values (5,000), but when we increase the number of values pulled from the range index, we see that this expression

let $group := $values[(($i * $groupsize) + 1) to (($i + 1) * $groupsize)]

quickly becomes the long pole in the tent. In fact, for a sample size of 50,000 values (50 groups), 91% of the execution time is spent on this one expression: 2.3 seconds for just 50 evaluations. Increase the sample size above 1,000,000 values and it's clear that this query will no longer run in a reasonable amount of time. So what's going on here? Shouldn't sequence accesses be lightning fast?

As it turns out, our cts:element-values() call isn't doing exactly what one might initially think. Rather than returning an in-memory sequence of values, it actually returns a stream, which is loaded lazily as needed. This optimization limits memory use in the (common) situations where you don't need the entire sequence. In my case, though, it doesn't help. Each of my 50 sequence accesses is really a stream access that starts over from the first item, so the further into the stream a group sits, the longer it takes to reach.
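For contrast, here is a sketch of the kind of access where lazy loading should pay off: only the first few values are requested, so under the behavior described above the rest of the stream never needs to be loaded. The sample namespace URI is an assumption for illustration.

```xquery
xquery version "1.0-ml";

(: Hypothetical namespace URI; a range index on sample:value is assumed. :)
declare namespace sample = "http://example.com/sample";

(: Only the first ten values are needed, so the stream can stop early. :)
cts:element-values(xs:QName("sample:value"))[1 to 10]
```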

To get around this issue, there's a handy function called xdmp:eager(), which avoids lazy evaluation. But there's another easy "trick" that will reliably ensure you're working with an in-memory sequence rather than a stream. Simply drop into a "sub-FLWOR" statement to generate a sequence from the return value:

let $values := for $i in cts:element-values(xs:QName("sample:value")) return $i

Now, $values is no longer a stream, but rather an in-memory sequence. Accessing subsequences within it is now much faster, particularly values at the end of the sequence.
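The xdmp:eager() route mentioned earlier can be sketched the same way. This is an illustration under assumptions rather than a tested replacement: the sample namespace URI, $numgroups, and the <group> wrapper element are hypothetical, and whether xdmp:eager() matches the sub-FLWOR in every case is not something measured here.

```xquery
xquery version "1.0-ml";

(: Hypothetical namespace URI; a range index on sample:value is assumed. :)
declare namespace sample = "http://example.com/sample";

let $groupsize := 1000
(: Ask for eager evaluation so $values is not left as a lazy stream. :)
let $values := xdmp:eager(cts:element-values(xs:QName("sample:value")))
let $numgroups := xs:integer(fn:ceiling(fn:count($values) div $groupsize))
for $i in 0 to $numgroups - 1
let $group := $values[(($i * $groupsize) + 1) to (($i + 1) * $groupsize)]
return <group n="{$i}">{ $group }</group>
```

Profiling both variants against your own index is the only way to confirm which one behaves better for a given data set.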

This made my day.
