An Exploration into Android Gaming, Powered by MarkLogic

by Brent Perry

Last year before I joined MarkLogic, I took it on myself to start the development of an Android-based geo-location game that we’ll call “Contagion”. I had ideas for how it would work, the game design, and all kinds of areas to expand if it were successful. There are many technical challenges to overcome with this kind of project, though. In a matter of days a successful title on Google Play can go from a few users to a few hundred thousand users. Additionally, based on user feedback I would need to modify the software frequently to add features and possibly modify game design elements. All of this added up to some traditionally nasty software requirements: a technical stack with high scalability and flexibility.

Coming from a software background in enterprise java development, my instincts were to go for a classic 3-tiered architecture. The game’s business logic would live in a DropWizard (Jetty, Jersey, Jackson for HTTP, REST, and JSON handling respectively) layer running as a standoff service from MarkLogic. Writing pages of stored-procedures to hold the game’s business logic at the database level sounded like an anti-pattern to say the least. Java was a more appropriate language, I thought, than a niche functional script like XQuery. As the scope of the project grew, however, managing the hand-off between the MarkLogic platform, the service code, and the Android client code became more and more ungainly. State management issues plagued the system, introduced latency, and brought to light serious concerns about scalability.

Once I joined MarkLogic, I discovered many use cases with full enterprise applications running directly against the MarkLogic platform. In some cases tens or hundreds of thousands of concurrent users were hitting scalable MarkLogic clusters with great performance. Complex business logic was implemented directly on the platform further challenging my previously held 3-tier or bust philosophy. I do maintain there are many cases where decoupling the database platform and the business logic provides better separation of concerns and cohesion, but in the case of Contagion I decided to go with the simpler architecture given the limited resources.

I invited Adam Lewis to assist in the project and together we reviewed the existing code and possibilities. This new design would have the Android client contacting REST end-points supplied by the MarkLogic platform. We set off converting the existing Java business logic into XQuery and eventually JavaScript modules as MarkLogic 8 neared release and incrementally added support for JavaScript REST API extensions. These end-points would handle all the basic client interactions and access additional JavaScript modules (hooray for ML8) executing the business logic of the game. This architecture dramatically simplified interactions between the client and server and reduced latency to boot! As most gamers know, latency kills.

When I was using a middle-tier, I wrote dense mathematical Java code calculating great circle distance between actors in the game, calculating bearings and distances traveled. Using the geo-spatial indexes and tools available in MarkLogic, much of this code could now be shelved. Less code to maintain is a major risk reducer for the project so this was a big win.

The game’s progress continues at a measured pace as we both have full time jobs with MarkLogic and families, but it’s a fun side-activity and we are both learning the nuances of the technologies involved. “Contagion” has forced us to tackle many complex details not only at the architectural level, but also in the weeds of the MarkLogic platform itself. As the project scales up, we’ll be examining classic sizing issues like when documents are created, how frequently we are updating them, and what kind of query performance we are getting with hundreds (and hopefully tens of thousands) of concurrent requests.

In addition to using this project to learn more about our product firsthand, my hope is to further explore the ways to exploit MarkLogic to the benefit of mobile platform users. In future posts, I intend to delve into our usage of groovy/grails for package management, our extensions of the REST interface on MarkLogic 8, and how we’re exercising geospatial search.

A UDF for Ranged Buckets

by Dave Cassel

Last week I wrote a blog post about working with Ranged Buckets. To summarize the problem, we have data that look like this:


We want to build a facet with buckets like 0-4, 5-8, 9-12, 13-16, and 17-20. The "lo" and "hi" values in the sample document represent a range, so the document should be counted for the 0-4, 5-8, and 9-12 buckets, even though no value from 5-8 appears in the document. 

In my earlier post, I showed how to solve this problem using a normal custom constraint. Today, I took a crack at it with a more involved technique -- a User Defined Function. Also referred to as "Aggregate User Defined Functions", UDFs let MarkLogic application developers write C++ code to implement map/reduce jobs. For me, this took some effort as I haven't written much meaningful C++ since I came to MarkLogic about 5 years ago (the notable exception being the other UDF that I wrote). I got through it, though, and I found some interesting results. (Feel free to suggest improvements to the code.)


I'll refer you to the documentation for the general background on UDFs, but essentially, you need to think about four functions.


The start function handles any arguments used to customize this run of the UDF. In my case, I needed to pass in the buckets that I wanted to use. I dynamically allocate an array of buckets that I'll use throughout the job. 


Two range indexes get passed in -- one for the "lo" element and one for the "hi" element. The map function gets called for each forest stand in the database, examining the values in the input range indexes. When two indexes are passed in, the map function sees the values as tuples. For instance, the values in the sample document above show up as the tuple (2, 9). Always check the frequency of that tuple, in case the same pair occurs in multiple documents. Once this function has been called for a stand, we know the counts for each bucket for the values in that particular stand. 


The reduce function combines the per-stand counts, aggregating them until a set of values for the entire database is known. My implementation just needed to add the counts for each bucket. 


The last step is to organize the results in a way that they can be sent back to XQuery. The finish function builds a map, using "0-4" as the key for the first bucket and the count as the value. 

Encoding and Decoding

When working in a cluster, the encode and decode functions are important too. For my simple tests, I implemented them but used the UDF on a single MarkLogic instance, so these functions weren't called. 


Building the UDF is pretty simple using the Makefile provided by MarkLogic. I customized the two places where the name needed to match my filename, but otherwise left it alone. 

After compiling, I uploaded the UDF to MarkLogic using Query Console. I exported the workspace and that's available on GitHub

You can call a UDF using the /v1/values endpoint, but I decided to wrap it in a custom constraint to provide a straightforward comparison with the custom constraint built in the previous post. After all, the goal is to provide a facet. A custom constraint requires some XML for the search options and some XQuery

The Results

I figured UDFs would be more interesting with multiple forests, as mapping a job to a single forest that has just one stand doesn't gain any parallelism. With that in mind, I bumped my database up to four forests, then to six, and compared my UDF implementation with the two-function approach I described in the previous approach. I tested with the same 100,000 documents used in the previous post. 

Median Seconds 4 forests 6 forests
UDF 0.002898 0.002858
two-function 0.003909 0.004261

The numbers are the median seconds returned in the facet-resolution-time part of the response to /v1/search?options=udf or /v1/search?options=startfinish. A couple of things jump out at me. First, the UDF out-performed the two-function XQuery custom facet. Second, the UDF had a very slight improvement while moving from four forests to six -- slight enough that let's call it even. The two-function approach, however, increased a noticable amount. 

Thoughts on UDFs

When should you reach for a UDF? When your data don't support directly getting your values, it might be worthwhile. For instance, with ranged buckets we can't simply do a facet on "lo" or "hi", because we wouldn't represent the values in between. Writing a UDF is more complicated and more dangerous than other approaches, but appears to have some performance benefits. 

There is usually an alternative. For instance, in this case I could have supplemented my data such that the sample document would have all values from two through nine inclusive, allowing me to use a standard facet. That leads to the tradeoff -- do I want to spend a little more time at ingest and take up a little more space, or do I want to dynamically compute the values I need? The answer to that question is certainly application specific, but UDFs provide a handy (and sharp!) tool to work with. 


[1] You'll find the MarkLogic Makefile for UDFs in /opt/MarkLogic/Samples/NativePlugins/ (Linux), ~/Library/MarkLogic/Samples/NativePlugins/ (Mac), or C:\Program Files\MarkLogic\Samples\NativePlugins\ (Windows). 

Working with Ranged Buckets

by Dave Cassel

One of my colleagues ran into an interesting problem a while ago. He had data with high and low values for some field, and he wanted to display bucketed facets on those values. Let's take a look at how to implement that.

Note: all the code for this post is available at my ranged-bucket GitHub repo, so you're welcome to clone and follow along.

The Data

To illustrate the point, let's look at some sample data.


This represents a document whose valid values range from 2 to 9. Now suppose we want to get a bucketed facet on these documents, showing how many fall into ranges like 0-4, 5-8, 9-12, etc. The first observation is this is different from how we usually do facets or buckets. The sample document should be counted for the 5-8 bucket, even though no value from five to eight appears in the document.

The next observation is that a document may well fall into more than one bucket. The example document will be represented in three of the buckets we've specified so far.

Generating Data

We need some data to work with, so let's generate some. The repository has a Query Console workspace that you can import, with a buffer to generate sample data with "lo" values ranging from zero to 10 (inclusive) and "hi" values ranging from zero to twenty. The high value is a random number added to the low, ensuring that the high is always greater than the low.

The Code

To implement this, two approaches occurred to me: a custom constraint facet and a UDF. This post shows the custom constraint approach; I'll return to the UDF another time.

Custom Constraint

To implement a custom constraint facet, there are three functions we need to know about. The first is used when someone selects a facet value, or otherwise makes use of a constraint -- the function parses the request and turns it into a cts:query. This function is important for any constraint, whether used as a facet or not.

The text part of the incoming request is expected to look like "5-8", or some other pair of numbers. These are split and used to build an and-query.

To make a custom constraint work as a facet, you need to implement functions to return values and counts. These are split into start-facet and finish-facet functions. The job of the start function is to make the lexicon API calls needed to identify the values; the finish function formats the results as the Search API and REST API expect.

You're not technically required to implement the start function -- you can make the lexicon calls in the finish function if you want. That's actually a simple way to get started. You will get some performance improvement if you split the work properly, however. To illustrate this, I implemented both ways. I'll only show the split code here, but you can see the single-function approach at GitHub.

Here's the split implementation:

You can see the call to xdmp:estimate() in the start function. The values returned end up in the $start parameter to the finish function. Why split them up this way? Because MarkLogic can do some of this work in the background, allowing for a faster overall response.

Sidebar: why estimate and not count?

Note that what you return from the start function is important. In my first attempt, my start function constructed elements with attributes for count, hi, and low, then the finish function pulled out what it needed to make the search:facet-value elements. That was (not surprisingly) slower than just doing everything in the finish function. My revised implementation just returns the results of the xdmp:estimate() calls. The finish function already knows what order they will be in, so it's able to map those to the correct hi-lo values to construct the search:facet-values.

It's fair to ask how much difference the one-function versus two-function approaches makes. I generated 100,000 sample documents and ran some simple tests on my MacBook Pro (MarkLogic 7.0-4.1). (I should caveat this by saying I didn't trouble to shut everything else down, I just wanted to get an idea about the difference.) I threw in a single-value, standard bucket facet for comparison. Each approach was run as a single facet, making calls through the REST API.

Approach Median Facet-resolution Time
Two function 0.003622 sec
One function 0.009369 sec
Single-value buckets 0.001049 sec

Take these as rough numbers, but if you'd like to run more precise tests, you can get the full code and configuration from GitHub.

blogroll Blogroll