Working with Ranged Buckets

by Dave Cassel

One of my colleagues ran into an interesting problem a while ago. He had data with high and low values for some field, and he wanted to display bucketed facets on those values. Let's take a look at how to implement that.

Note: all the code for this post is available at my ranged-bucket GitHub repo, so you're welcome to clone and follow along.

The Data

To illustrate the point, let's look at some sample data.

<doc>
  <lo>2</lo>
  <hi>9</hi>
  <id>1154</id>
</doc>

This represents a document whose valid values range from 2 to 9. Now suppose we want to get a bucketed facet on these documents, showing how many fall into ranges like 0-4, 5-8, 9-12, etc. The first observation is that this is different from how we usually do facets or buckets. The sample document should be counted in the 5-8 bucket, even though no value from five to eight actually appears in the document.

The next observation is that a document may well fall into more than one bucket. The example document will be represented in three of the buckets we've specified so far.
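The overlap test itself is simple: a document whose range is [lo, hi] belongs in a bucket [min, max] whenever lo <= max and hi >= min. Assuming element range indexes on the "lo" and "hi" elements (as the repo's configuration sets up), a sketch of that test as a cts:query might look like this:

```xquery
(: Sketch: match documents whose [lo, hi] range overlaps the bucket
   [$min, $max]. Assumes xs:int element range indexes on "lo" and "hi". :)
declare function local:bucket-query($min as xs:int, $max as xs:int)
  as cts:query
{
  cts:and-query((
    cts:element-range-query(xs:QName("lo"), "<=", $max),
    cts:element-range-query(xs:QName("hi"), ">=", $min)
  ))
};
```

Note that both comparisons are inclusive, which is why the sample document (2 to 9) matches 0-4, 5-8, and 9-12.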

Generating Data

We need some data to work with, so let's generate some. The repository has a Query Console workspace that you can import, with a buffer to generate sample data with "lo" values ranging from zero to ten (inclusive) and "hi" values ranging from zero to twenty. The high value is computed by adding a random number to the low, so the high is never smaller than the low.
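A sketch of such a generation buffer (document URIs and the document count are illustrative; the actual buffer is in the repo's workspace):

```xquery
(: Sketch: insert sample docs with a random "lo" (0-10) and a "hi" built
   by adding a random offset to the low, so hi is never below lo. :)
for $i in 1 to 1000
let $lo := xdmp:random(10)          (: random integer, 0 through 10 :)
let $hi := $lo + xdmp:random(10)    (: low plus a random offset :)
return
  xdmp:document-insert(
    "/ranged/" || $i || ".xml",
    <doc>
      <lo>{$lo}</lo>
      <hi>{$hi}</hi>
      <id>{$i}</id>
    </doc>
  )
```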

The Code

To implement this, two approaches occurred to me: a custom constraint facet and a UDF. This post shows the custom constraint approach; I'll return to the UDF another time.

Custom Constraint

To implement a custom constraint facet, there are three functions we need to know about. The first is used when someone selects a facet value, or otherwise makes use of a constraint -- the function parses the request and turns it into a cts:query. This function is important for any constraint, whether used as a facet or not.

The text part of the incoming request is expected to look like "5-8", or some other pair of numbers. These are split and used to build an and-query.
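A sketch of such a parse function, using the standard Search API custom-constraint signature (the namespace and function name here are illustrative; the real module is in the repo):

```xquery
(: Sketch of a custom constraint parse function: turn text like "5-8"
   into an and-query over the "lo" and "hi" range indexes. :)
declare function ranged:parse(
  $query-elem as element(),
  $options as element(search:options))
  as schema-element(cts:query)
{
  let $parts := fn:tokenize($query-elem/search:text/fn:string(), "-")
  let $min := xs:int($parts[1])
  let $max := xs:int($parts[2])
  return
    (: wrap and unwrap to serialize the cts:query as an element :)
    <root>{
      cts:and-query((
        cts:element-range-query(xs:QName("lo"), "<=", $max),
        cts:element-range-query(xs:QName("hi"), ">=", $min)
      ))
    }</root>/*
};
```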

To make a custom constraint work as a facet, you need to implement functions to return values and counts. These are split into start-facet and finish-facet functions. The job of the start function is to make the lexicon API calls needed to identify the values; the finish function formats the results as the Search API and REST API expect.

You're not technically required to implement the start function -- you can make the lexicon calls in the finish function if you want. That's actually a simple way to get started. You will get some performance improvement if you split the work properly, however. To illustrate this, I implemented both ways. I'll only show the split code here, but you can see the single-function approach at GitHub.

The full split implementation is in the GitHub repo; here's how the pieces fit together:
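A sketch of the two functions, using the Search API's start-facet and finish-facet signatures. The bucket boundaries, namespace, and function names are illustrative; the start function fires one xdmp:estimate() per bucket, and because the finish function knows the bucket order, it can pair each count with its hi-lo label:

```xquery
(: Illustrative fixed buckets, stored as flat (min, max) pairs. :)
declare variable $ranged:BUCKETS as xs:int* := (0, 4, 5, 8, 9, 12);

declare function ranged:start-facet(
  $constraint as element(search:constraint),
  $query as cts:query?,
  $facet-options as xs:string*,
  $quality-weight as xs:double?,
  $forests as xs:unsignedLong*)
  as item()*
{
  (: one estimate per bucket; results come back in bucket order :)
  for $i in 1 to (fn:count($ranged:BUCKETS) idiv 2)
  let $min := $ranged:BUCKETS[2 * $i - 1]
  let $max := $ranged:BUCKETS[2 * $i]
  return
    xdmp:estimate(
      cts:search(fn:doc(),
        cts:and-query((
          $query,
          cts:element-range-query(xs:QName("lo"), "<=", $max),
          cts:element-range-query(xs:QName("hi"), ">=", $min)))))
};

declare function ranged:finish-facet(
  $start as item()*,
  $constraint as element(search:constraint),
  $query as cts:query?,
  $facet-options as xs:string*,
  $quality-weight as xs:double?,
  $forests as xs:unsignedLong*)
  as element(search:facet)
{
  <search:facet name="{$constraint/@name}">{
    (: $start holds the estimates, in the same order as the buckets :)
    for $count at $i in $start
    let $min := $ranged:BUCKETS[2 * $i - 1]
    let $max := $ranged:BUCKETS[2 * $i]
    return
      <search:facet-value name="{$min}-{$max}" count="{$count}">{
        $min || "-" || $max
      }</search:facet-value>
  }</search:facet>
};
```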

You can see the call to xdmp:estimate() in the start function. The values returned end up in the $start parameter to the finish function. Why split them up this way? Because MarkLogic can do some of this work in the background, allowing for a faster overall response.

Sidebar: why xdmp:estimate() and not fn:count()? Because an estimate is resolved entirely from the indexes, without fetching documents -- and for a query like this one, which the indexes can resolve exactly, the estimate is accurate.

Note that what you return from the start function is important. In my first attempt, my start function constructed elements with attributes for count, hi, and low, then the finish function pulled out what it needed to make the search:facet-value elements. That was (not surprisingly) slower than just doing everything in the finish function. My revised implementation just returns the results of the xdmp:estimate() calls. The finish function already knows what order they will be in, so it's able to map those to the correct hi-lo values to construct the search:facet-values.

It's fair to ask how much difference the one-function versus two-function approach makes. I generated 100,000 sample documents and ran some simple tests on my MacBook Pro (MarkLogic 7.0-4.1). (I should caveat this by saying I didn't trouble to shut everything else down; I just wanted to get an idea about the difference.) I threw in a single-value, standard bucket facet for comparison. Each approach was run as a single facet, making calls through the REST API.

Approach                Median Facet-resolution Time
Two function            0.003622 sec
One function            0.009369 sec
Single-value buckets    0.001049 sec

Take these as rough numbers, but if you'd like to run more precise tests, you can get the full code and configuration from GitHub.