SQL's "Group By" the MarkLogic way

by Gabo Manuel

Coming from an RDBMS background, one of the features I looked for in MarkLogic is the “group by” statement, i. e. to run aggregate functions based on particular fields. Let me show how this is done in MarkLogic using both XQuery and Server-side Javascript (SJS).

First, some sample data structure:

Person
first_name John
last_name Doe
age 18
gender M
height 180.2
civil_status Single

We can represent that in MarkLogic as either XML or JSON:

We create the data in our database using the following:

To get a basic gender count like 'male: 22, female: 31', in an RDBMS we do this:

select gender, count(1) from person group by gender

These snippets work but aren’t optimized. XQuery:

... and JavaScript:

fn:distinct-values() and fn:count() always load documents off disk to do their work. Not scalable. Performance gets slower the bigger the data set. The solution is to use MarkLogic extension functions that can run out of pure indexes.

We add an 'element range indexes' for 'gender':

  1. Go to admin console (localhost:8001)
  2. Go to Configure > Databases > myDB > Element Range Indexes
  3. Add an entry for gender as follows:
    • Scalar type: string
    • localname: gender
    • Collation: you could use the collation builder, just take note of the resulting value and pass that as part of the 'options' of 'cts:element-reference'
  1. Use 'cts:values()' and 'xdmp:estimate()' and get a big speed boost.

XQuery

... and JavaScript

This code is index optimized, runs in parallel on large clusters, and is amazingly fast. It's simple to get sub-second answers even against massive data sets. It greatly outperforms typical RDBMS systems.

Next, let’s limit our search to a particular height range.

select gender, count(1) from person where age between 18 and 40 group by gender

In MarkLogic, we represent constraints as 'queries', which can be built up from primitives. We first define an age query constraint as shown below. For each unique 'gender' value in the database, we apply the '$age-query' to limit the results to those genders appearing in a specific age range. To count the matching documents, we combine the existing '$gender-query' with '$age-query'.

XQuery

... and JavaScript

Let’s get fancier and use two different fields for the "group by":

select last_name, gender, count(1) from person group by last_name, gender

Similar to gender, we add an element range index for 'last_name':

  1. Go to admin console (localhost:8001)
  2. Go to Configure > Databases > myDB > Element Range Indexes
  3. Add an entry for last_name as follows:
    • Scalar type: string
    • localname: last_name

XQuery

... and JavaScript

That works, but there’s a better way. It’s more elegant and faster to use co-occurences:

XQuery

... and JavaScript

It’s also possible to do n-way co-occurences. Let’s bring civil_status into the mix as well.

  1. Go to admin console (localhost:8001)
  2. Go to Configure > Databases > myDB > Element Range Indexes
  3. Add an entry for civil_status as follows:
    • Scalar type: string
    • localname: civil_status
    • range value positions: true
  4. Update indexes for 'gender' and 'age' to set 'range value positions' to true.
  5. Use 'cts:value-tuples' which support n-size list of indexes.

XQuery

... and JavaScript

We can even rewrite our initial age/gender query using value-tuples:

XQuery

... and JavaScript

As a final thought, we don't just have to count results. There's a long list of aggregate functions available. For example, this query shows the average height grouped by age and gender:

XQuery

... and JavaScript

Hope this helps you as you do grouped aggregates in MarkLogic.

Comments

  • Great post. Thanks for sharing.
  • Very nice post, thanks! I just used cts:value-tuples for the first time - not for frequencies, but to extract the values themselves, so I could get them from the indexes and didn't have to open the docs. Cool stuff.
  • Cool. Opens up a lot of opportunities in several dimensions. Unless it is intentional, the JSON data in the beginning is 'roe' for john and jane. :D How are the index handled across clusters?
    • The 'roe' is intentional. Within a cluster (group of attached MarkLogic instances), index values related to a particular document are managed by the forest that holds that document. You can read more about clusters in the <a href="http://docs.marklogic.com/guide/admin">Administrator's Guide</a> and the <a href="http://docs.marklogic.com/guide/cluster">Scalability, Availability, and Failover Guide</a>.