A UDF for Ranged Buckets

by Dave Cassel

Last week I wrote a blog post about working with Ranged Buckets. To summarize the problem, we have data that look like this:

<doc>
  <lo>2</lo>
  <hi>9</hi>
  <id>1154</id>
</doc>

We want to build a facet with buckets like 0-4, 5-8, 9-12, 13-16, and 17-20. The "lo" and "hi" values in the sample document represent a range, so the document should be counted for the 0-4, 5-8, and 9-12 buckets, even though no value from 5-8 appears in the document. 

In my earlier post, I showed how to solve this problem using a normal custom constraint. Today, I took a crack at it with a more involved technique -- a User Defined Function. Also referred to as "Aggregate User Defined Functions", UDFs let MarkLogic application developers write C++ code to implement map/reduce jobs. For me, this took some effort as I haven't written much meaningful C++ since I came to MarkLogic about 5 years ago (the notable exception being the other UDF that I wrote). I got through it, though, and I found some interesting results. (Feel free to suggest improvements to the code.)

Implementation

I'll refer you to the documentation for the general background on UDFs, but essentially, you need to think about four functions.

start

The start function handles any arguments used to customize this run of the UDF. In my case, I needed to pass in the buckets that I wanted to use. I dynamically allocate an array of buckets that I'll use throughout the job. 

map

Two range indexes get passed in -- one for the "lo" element and one for the "hi" element. The map function gets called for each forest stand in the database, examining the values in the input range indexes. When two indexes are passed in, the map function sees the values as tuples. For instance, the values in the sample document above show up as the tuple (2, 9). Always check the frequency of that tuple, in case the same pair occurs in multiple documents. Once this function has been called for a stand, we know the counts for each bucket for the values in that particular stand. 

reduce

The reduce function combines the per-stand counts, aggregating them until a set of values for the entire database is known. My implementation just needed to add the counts for each bucket. 

finish

The last step is to organize the results in a way that they can be sent back to XQuery. The finish function builds a map, using "0-4" as the key for the first bucket and the count as the value. 

Encoding and Decoding

When working in a cluster, the encode and decode functions are important too. For my simple tests, I implemented them but used the UDF on a single MarkLogic instance, so these functions weren't called. 

Deploying

Building the UDF is pretty simple using the Makefile provided by MarkLogic. I customized the two places where the name needed to match my filename, but otherwise left it alone. 

After compiling, I uploaded the UDF to MarkLogic using Query Console. I exported the workspace and that's available on GitHub

You can call a UDF using the /v1/values endpoint, but I decided to wrap it in a custom constraint to provide a straightforward comparison with the custom constraint built in the previous post. After all, the goal is to provide a facet. A custom constraint requires some XML for the search options and some XQuery

The Results

I figured UDFs would be more interesting with multiple forests, as mapping a job to a single forest that has just one stand doesn't gain any parallelism. With that in mind, I bumped my database up to four forests, then to six, and compared my UDF implementation with the two-function approach I described in the previous approach. I tested with the same 100,000 documents used in the previous post. 

Median Seconds 4 forests 6 forests
UDF 0.002898 0.002858
two-function 0.003909 0.004261

The numbers are the median seconds returned in the facet-resolution-time part of the response to /v1/search?options=udf or /v1/search?options=startfinish. A couple of things jump out at me. First, the UDF out-performed the two-function XQuery custom facet. Second, the UDF had a very slight improvement while moving from four forests to six -- slight enough that let's call it even. The two-function approach, however, increased a noticable amount. 

Thoughts on UDFs

When should you reach for a UDF? When your data don't support directly getting your values, it might be worthwhile. For instance, with ranged buckets we can't simply do a facet on "lo" or "hi", because we wouldn't represent the values in between. Writing a UDF is more complicated and more dangerous than other approaches, but appears to have some performance benefits. 

There is usually an alternative. For instance, in this case I could have supplemented my data such that the sample document would have all values from two through nine inclusive, allowing me to use a standard facet. That leads to the tradeoff -- do I want to spend a little more time at ingest and take up a little more space, or do I want to dynamically compute the values I need? The answer to that question is certainly application specific, but UDFs provide a handy (and sharp!) tool to work with. 

 


[1] You'll find the MarkLogic Makefile for UDFs in /opt/MarkLogic/Samples/NativePlugins/ (Linux), ~/Library/MarkLogic/Samples/NativePlugins/ (Mac), or C:\Program Files\MarkLogic\Samples\NativePlugins\ (Windows). 

Comments