Introducing...path range indexes!

by Evan Lenz

I've been secretly excited about a little feature that was added to our latest release (MarkLogic 6). It's not the kind of feature that's listed as one of our sexiest selling points, but I'm excited because I know it will have a positive impact on so many of our customers and users. In particular, it will simplify the development and architecture of MarkLogic applications.

Loading XML "as is" has always been a selling point for MarkLogic. MarkLogic will automatically index any well-formed XML you give it. (It will even clean up malformed documents, store text files and binary files, extract metadata, etc.) It's designed to get you up and running fast with the data that you have. However, as I discussed in "Good XML design and performance," MarkLogic did play less nicely (note the past tense) with some XML design patterns than with others. In particular, it was best to avoid reusing generic names (such as <title> or <name>) in multiple contexts and instead use globally unique element names. The reason for this is that range indexes (i.e., custom, user-defined indexes in MarkLogic) were based on the element's QName alone, without taking into consideration its context within the document (such as its parent element). That remains true for element range indexes today. But in MarkLogic 6, you now have the ability to define path range indexes.

Previously, if your XML had generic element names, you needed to jump through some hoops to get optimal performance. For example, take this simple (DocBook) document:
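A minimal document with that shape might look like the following sketch. The titles and the /book.xml URI are made up for illustration; only the nesting of the <title> elements matters. It is wrapped in xdmp:document-insert() so that it can be loaded as written:

```xquery
xquery version "1.0-ml";

(: Hypothetical sample: a DocBook-style book in no namespace.
   The URI and all titles are placeholders. <title> appears at four
   levels: book, chapter, section, and sub-section. :)
xdmp:document-insert("/book.xml",
  <book>
    <title>A Sample Book</title>
    <chapter>
      <title>Getting Started</title>
      <section>
        <title>Installation</title>
        <section>
          <title>System Requirements</title>
          <para>Placeholder text.</para>
        </section>
      </section>
    </chapter>
  </book>)
```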

The element <title> is reused in four different places in the above example:

  • book <title>
  • chapter <title>
  • section <title>
  • sub-section <title>

Let's say you want to make a range index only on chapter titles (to enable fast retrieval of all chapter names). Before MarkLogic 6, you had to somehow change your data so that chapter titles would be distinguished from other <title> elements. The two basic choices here would be:

  • Rename <title> to <chapterTitle> and create an element range index on "chapterTitle", or
  • Annotate <title> with <title chapterTitle="yes"> and create a corresponding field range index (as described in my answer to a similar question on Stack Overflow).

Either way, you had to modify the data so that the indexer could distinguish between the different <title> elements. But with MarkLogic 6, that's no longer necessary. You can now define a path range index, which lets you identify the element (or attribute) using a (possibly quite complex) XPath expression.

Here's how it's done. Find the "Path Range Indexes" menu item for your database in the Admin UI:

[Screenshot: the database menu in the Admin UI, showing "Path Namespaces" and "Path Range Indexes" listed after "Field Range Indexes".]

Click the "Add" tab, change the scalar type to "string", and enter "chapter/title" as the path expression, leaving everything else at its default:

[Screenshot: the "Add Path Range Indexes" form, with fields for scalar type, path expression, collation, range value positions, and invalid values.]
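If you would rather script this step than click through the Admin UI, the Admin API can be used instead. The following is a sketch, not the only way to do it: it assumes the target database is named "Documents" and that "reject" is the invalid-values setting you want, so adjust both to your setup.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Assumption: the target database is named "Documents". :)
let $config := admin:get-configuration()
let $dbid   := xdmp:database("Documents")

(: Same settings as the form: string type, "chapter/title" path,
   root collation, no range value positions, reject invalid values. :)
let $index := admin:database-range-path-index(
  $dbid,
  "string",
  "chapter/title",
  "http://marklogic.com/collation/",
  fn:false(),
  "reject")

(: Add the index to the in-memory configuration, then save it. :)
let $config := admin:database-add-range-path-index($config, $dbid, $index)
return admin:save-configuration($config)
```

Run it as a user with admin privileges; the database reindexes existing content after the configuration is saved, if reindexing is enabled.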

If, unlike our example, your content uses namespaces (which is likely), then you'd first need to define a prefix/namespace binding using the "Path Namespaces" link in the left-hand Admin menu. This allows you to then use that prefix when configuring the path expression (e.g., my:chapter/my:title).
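The namespace binding can also be scripted with the Admin API. In this sketch the database name "Documents" and the namespace URI are placeholders chosen for illustration; only the "my" prefix comes from the example above.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: Assumptions: database "Documents"; the namespace URI is a placeholder. :)
let $config := admin:get-configuration()
let $dbid   := xdmp:database("Documents")

(: Bind the "my" prefix so that path expressions such as
   my:chapter/my:title can be used in path range indexes. :)
let $ns     := admin:database-path-namespace("my", "http://example.com/my")
let $config := admin:database-add-path-namespace($config, $dbid, $ns)
return admin:save-configuration($config)
```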

Okay, now that we've configured the index, let's look at how to access the data in our queries.

To retrieve all unique chapter titles from our database, we'd run the following query:
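The query takes the following form, matching the call quoted in the comments below. It assumes a string path range index on chapter/title using the root collation:

```xquery
xquery version "1.0-ml";

(: Returns the distinct values stored in the "chapter/title"
   path range index. :)
cts:values(
  cts:path-reference("chapter/title",
    ("collation=http://marklogic.com/collation/")))
```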

The new cts:values() function is a generalization of cts:element-values(), cts:element-attribute-values(), etc. It enables you to extract values from a lexicon regardless of its type. Similar generic functions are provided for extracting ranges (cts:value-ranges), wildcards (cts:value-match), etc. The cts:path-reference() function is a constructor for a cts:reference type, which is how you can now refer to (and pass around) range indexes. A similar cts:*-reference function is provided for each of the kinds of range indexes that MarkLogic supports.
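To make the relationship concrete, here is a sketch that puts an older type-specific call next to its generic equivalent, followed by a wildcard lookup. It assumes a string element range index on a <chapterTitle> element (the hypothetical renaming workaround from earlier) in addition to the chapter/title path range index; the "Intro*" pattern is just an illustration.

```xquery
xquery version "1.0-ml";

let $collation := "collation=http://marklogic.com/collation/"

(: Assumption: a string element range index exists on <chapterTitle>. :)
(: Older, type-specific lexicon call. :)
let $old := cts:element-values(xs:QName("chapterTitle"), (), $collation)

(: Generic call: the same index, identified by a cts:reference. :)
let $new := cts:values(
  cts:element-reference(xs:QName("chapterTitle"), $collation))

(: Wildcard lookup against the chapter/title path range index. :)
let $matches := cts:value-match(
  cts:path-reference("chapter/title", $collation),
  "Intro*")

return ($old, $new, $matches)
```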

Note that range indexes are identified not only by their path but also by their type, and, if it's a string index, by their collation. That's why you need to include the collation URI in the above call to cts:path-reference(). (Either that, or set your application server's default collation to the one your index uses.)
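Both identifying properties can be spelled out in the reference options. This sketch assumes the index was created with scalar type "string" and the root collation; a mismatch in either option would point at an index that does not exist.

```xquery
xquery version "1.0-ml";

(: Name the index by path, type, and collation explicitly. :)
cts:values(
  cts:path-reference("chapter/title",
    ("type=string",
     "collation=http://marklogic.com/collation/")))
```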

Range indexes are good not only for providing a source of lexicon lookups (as with cts:values()), but also for—imagine that—range queries! We'll look at an example of this in part 2.

Comments

  • I tried creating a path range index and loading the options through QueryOptionsManager, but I am getting a NullPointerException. <?xml version="1.0" encoding="UTF-8"?> <options xmlns="http://marklogic.com/appservices/search"> <constraint name="date"> <range collation="http://marklogic.com/collation/" type="xs:string" facet="false"> <path-index xmlns="">/product/date</path-index> </range> </constraint> </options> What's wrong with the above options? Is it because of the xmlns="" on <path-index>?
    • Looks like the problem is the xmlns="" on the path-index. Take that out and let it inherit the default namespace from the options node.
      • Thanks David, below options worked <?xml version="1.0" encoding="UTF-8"?> <options xmlns='http://marklogic.com/appservices/search'> <constraint name='date'> <range collation='http://marklogic.com/collation/' type='xs:string' facet='false'> <path-index>/product/date</path-index> </range> </constraint> </options>
        • I want to retrieve all the values of the path index, just like the cts:values(cts:path-reference("chapter/title", ("collation=http://marklogic.com/collation/"))) call above, but using the Java API. In my case that would be all /product/date values. Is there a link that provides an example?