I've been secretly excited about a little feature that was added to our latest release (MarkLogic 6). It's not the kind of feature that's listed as one of our sexiest selling points, but I'm excited because I know it will have a positive impact on so many of our customers and users. In particular, it will simplify the development and architecture of MarkLogic applications.
Loading XML "as is" has always been a selling point for MarkLogic. MarkLogic will automatically index any well-formed XML you give it. (It will even clean up malformed documents, store text files and binary files, extract metadata, etc.) It's designed to get you up and running fast with the data that you have. However, as I discussed in "Good XML design and performance," MarkLogic did play less nicely (note the past tense) with certain XML design patterns than others. In particular, it was best to avoid reusing generic names (such as <title> or <name>) in multiple contexts and instead use globally unique element names. The reason for this is that range indexes (i.e., custom, user-defined indexes in MarkLogic) were based on the element's QName alone, without taking into consideration its context within the document (such as its parent element). That remains true for element range indexes today. But in MarkLogic 6, you now have the ability to define path range indexes.
Previously, if your XML had generic element names, you needed to jump through some hoops to get optimal performance. For example, take this simple (Docbook) document:
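A minimal sketch of such a document, wrapped in an xdmp:document-insert() call so it can be loaded and queried. The URI and all text content are placeholders made up for illustration; only the nesting of the <title> elements matters.

```xquery
xquery version "1.0-ml";

(: Illustrative only: the URI and all text content are placeholders. :)
xdmp:document-insert(
  "/example/book.xml",
  <book>
    <title>An Example Book</title>
    <chapter>
      <title>First Chapter</title>
      <section>
        <title>First Section</title>
        <section>
          <title>First Sub-section</title>
          <para>Some text.</para>
        </section>
      </section>
    </chapter>
  </book>
)
```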
The element <title> is reused in four different places in the above example:
- book <title>
- chapter <title>
- section <title>
- sub-section <title>
Let's say you want to make a range index only on chapter titles (to enable fast retrieval of all chapter names). Before MarkLogic 6, you had to somehow change your data so that chapter titles would be distinguished from other <title> elements. The two basic choices here would be:
- Rename <title> to <chapterTitle> and create an element range index on "chapterTitle", or
- Annotate <title> with <title chapterTitle="yes"> and create a corresponding field range index (as described in my answer to a similar question on Stack Overflow).
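For comparison, with the first workaround in place, listing the chapter titles would have looked roughly like the sketch below. It assumes the data has already been rewritten to use <chapterTitle> and that a string element range index on that name exists, with the same collation the query runs under.

```xquery
xquery version "1.0-ml";

(: Pre-MarkLogic 6 approach: relies on the renamed element and an
   element range index (scalar type string) on "chapterTitle". :)
cts:element-values(xs:QName("chapterTitle"))
```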
Either way, you had to modify the data so that the indexer could distinguish between the different <title> elements. But with MarkLogic 6, that's no longer necessary. You can now define a path range index, which lets you identify the element (or attribute) using a (possibly quite complex) XPath expression.
Here's how it's done. Find the "Path Range Indexes" menu item for your database in the Admin UI:
Click the "Add" tab, change the scalar type to "string" and enter "chapter/title" as the XPath expression, leaving everything else at its default:
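The same index can also be created from a script rather than through the Admin UI. The following is a sketch using the Admin API, assuming a database named "Documents" (a placeholder) and the root collation http://marklogic.com/collation/. The last two arguments are my assumption about what corresponds to the UI defaults (no range value positions, reject invalid values).

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $db     := xdmp:database("Documents")  (: placeholder database name :)
let $index  := admin:database-range-path-index(
                 $db,
                 "string",                          (: scalar type :)
                 "chapter/title",                   (: path expression :)
                 "http://marklogic.com/collation/", (: collation (assumed) :)
                 fn:false(),                        (: range value positions :)
                 "reject")                          (: invalid values :)
return
  (: add the index definition to the database and persist the change :)
  admin:save-configuration(
    admin:database-add-range-path-index($config, $db, $index))
```

Adding a range index triggers reindexing of existing content, so on a large database you may prefer to run this before loading data.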
If, unlike our example, your content uses namespaces (which is likely), then you'd first need to define a prefix/namespace binding using the "Path Namespaces" link in the left-hand Admin menu. This allows you to then use that prefix when configuring the path expression (e.g.,
Okay, now that we've configured the index, let's look at how to access the data in our queries.
To retrieve all unique chapter titles from our database, we'd run the following query:
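A sketch of that query, assuming the index was created with the root collation http://marklogic.com/collation/ (substitute whatever collation your index actually uses):

```xquery
xquery version "1.0-ml";

(: Returns the distinct values stored in the "chapter/title"
   path range index, in collation order. :)
cts:values(
  cts:path-reference(
    "chapter/title",
    "collation=http://marklogic.com/collation/"))
```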
The new cts:values() function is a generalization of cts:element-values(), cts:element-attribute-values(), etc. It enables you to extract values from a lexicon regardless of its type. Similar generic functions are provided for extracting ranges (cts:value-ranges), wildcards (cts:value-match), etc. The cts:path-reference() function is a constructor for a cts:reference type, which is how you can now refer to (and pass around) range indexes. A similar cts:*-reference function is provided for each of the kinds of range indexes that MarkLogic supports.
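To illustrate how a reference can be bound to a variable and handed to more than one of these generic functions, here is a small sketch. It reuses the "chapter/title" index and assumes the root collation; the wildcard pattern "Intro*" is an arbitrary example.

```xquery
xquery version "1.0-ml";

(: Build the reference once, then pass it around. :)
let $chapter-titles :=
  cts:path-reference(
    "chapter/title",
    "collation=http://marklogic.com/collation/")
return (
  (: every value in the lexicon :)
  cts:values($chapter-titles),
  (: only the values matching a wildcard pattern :)
  cts:value-match($chapter-titles, "Intro*")
)
```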
Note that a range index is identified not only by its path but also by its type and, if it's a string index, by its collation. That's why you need to include the collation URI in the above call to cts:path-reference(). (Either that, or set your application server's default collation to the one your index uses.)