Getting Acquainted with XML Lexicons in 3.1

by Ian Small

Categories: Feature; 3.1
Altitude: 1,000 feet

Hmmm... I can see there's some discipline involved in getting from idea to actual written article on a regular basis. It took a little longer to get around to this one than I would have thought, even though I knew what I wanted to write about. I'll see what I can do about getting back on the "timely train" for the next one.

One of my favorite new features in 3.1 is XML lexicons. At this year's User Conference, one of the attendees arrived halfway through the opening session and thought that we kept saying XML "leprechauns". For the life of him, he couldn't figure out why we had incorporated a team of mischievous male elves from Ireland into the product. Every time I think about it, it still makes me laugh.

I can tell that lexicons are popular with you as well - if only by the volume of technical questions I see moving around our consulting organization. It's easy to understand why: the more time you spend with lexicons, the more you realize that we've put a Swiss army knife right into the middle of the server.

From Query to Analytics

As content applications have grown in complexity, we've seen their functionality spread from content query, through content processing, and now to content analytics. With content query, you already know what you're looking for - you just need a powerful system that enables you to locate and extract the relevant pieces of content and assemble them into the most effective response.

With content analytics, you're trying to figure out what you should be interested in looking for. You want the server to help characterize the content - whether that's at the element, the document or the database level. Where content query builds on search, content analytics builds on discovery.

Content analytics comes in lots of flavours - from help characterizing a document or an element (using, for instance, an XML classification engine) to dictionaries that you can use to find out what words and values are in use in the database.

Dictionaries are where XML lexicons come in. You can use them for word browsing. You can use them for auto-completion of search terms. You can use them to create drill downs and query refinement interfaces. And you can use them as the basis for numerical analysis across a search result set.

A Simple Idea, But Lots of Power

Lexicons come in two flavours: word lexicons and value lexicons.

Word lexicons are dictionaries of the word tokens used throughout the database. Word lexicons can span the entire content of the database, or can be constrained to the content found within a particular element or element-attribute pair.

Value lexicons are dictionaries of the complete values (rather than the individual word tokens) used across the entire database. Because a value always belongs to something identified by a QName, each value lexicon is tied to a specific element or element-attribute pair.
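As a rough sketch of how these scopes map onto function calls, here is one call per kind of lexicon. The QNames (title, book, isbn, author) are made up for illustration, and each call assumes the matching lexicon has already been configured on the database.

```xquery
(: Hypothetical QNames throughout; each call assumes the corresponding
   lexicon is enabled in the database configuration. :)

(: word lexicon spanning the entire database :)
cts:words(),

(: word lexicon constrained to the content of <title> elements :)
cts:element-words(xs:QName("title")),

(: word lexicon constrained to the isbn attribute of <book> elements :)
cts:element-attribute-words(xs:QName("book"), xs:QName("isbn")),

(: value lexicon for <author> elements :)
cts:element-values(xs:QName("author"))
```

On a database of any size, you would normally wrap each call in fn:subsequence() rather than pull back the whole dictionary.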

In many ways, lexicons are a great demonstration of how a query language can make 1+1=10. Lexicons themselves are deceptively simple. After all, what could be easier than a list of words or values? But being able to programmatically access these dictionaries from the middle of an XQuery module unleashes a powerful combinatoric effect: because it's easy to combine lexicon access with the other features of MarkLogic Server, you end up with an endless variety of use cases.

Lexicons come equipped with two different data access methods. With the first access method (e.g. cts:words(), cts:element-words(), cts:element-values(), etc.), you can retrieve the contents of a given lexicon, starting at a particular point in the lexicon and going backwards or forwards from there. This lets you easily start at the letter "A", the letter "q", or the word "football", whichever you prefer.
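Here is a minimal sketch of the first access method. It assumes a database-wide word lexicon plus a value lexicon on a hypothetical author element, and it uses the "ascending" and "descending" option names as I understand them from the API documentation.

```xquery
(: ten words starting at "football" and moving forwards through the lexicon :)
fn:subsequence(cts:words("football", "ascending"), 1, 10),

(: ten words starting at "football" and moving backwards :)
fn:subsequence(cts:words("football", "descending"), 1, 10),

(: ten values of the hypothetical <author> element, starting at "q" :)
fn:subsequence(cts:element-values(xs:QName("author"), "q"), 1, 10)
```

The starting point does not have to be an entry in the lexicon; it simply marks where in the ordering the walk begins.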

Using the second access method (e.g. cts:word-match(), cts:element-word-match(), etc.), you can specify a wildcard pattern, and only words or values that match that pattern and are contained within the specified lexicon will be returned. Just as with the first access method, you can get that list returned forwards or backwards. Whereas the first access method would let you start at the letter "A" and run through to the end of the lexicon, this access method makes it easy to get only the words or values that begin with the letter "A" (e.g. using the wildcard pattern "A*"), and no others.
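A sketch of the second access method, again assuming a database-wide word lexicon. The second expression shows how this might back an auto-completion box; $typed is a hypothetical stand-in for whatever the user has entered so far.

```xquery
(: only the words that match the wildcard pattern "A*" :)
cts:word-match("A*"),

(: auto-completion sketch: up to ten completions for a typed prefix :)
(
  let $typed := "foot"  (: hypothetical user input :)
  return fn:subsequence(cts:word-match(fn:concat($typed, "*")), 1, 10)
)
```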

Both access methods come with a flurry of options for whether matching should pay attention to case and/or diacritics. Both access methods can pull from a single lexicon or from more than one lexicon at the same time.
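To make that concrete, here is a hedged sketch using the pattern-matching functions. The option names are the ones I'd expect from the API documentation, and title and abstract are hypothetical element names that would each need an element word lexicon configured.

```xquery
(: ignore both case and diacritics when matching the pattern :)
cts:word-match("resum*", ("case-insensitive", "diacritic-insensitive")),

(: pull from two element word lexicons at once (hypothetical QNames) :)
cts:element-word-match(
  (xs:QName("title"), xs:QName("abstract")),
  "a*",
  "case-insensitive")
```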

But what's really cool is that with both access methods you can specify a cts:query constructor - effectively, a search expression - that is used to constrain the results.

Combinatorics at Work

"Hmmm," you say, "that sounds pretty interesting. What can I do with that?"

First, a quick review. As I've outlined above, lexicons give you immediate access to lists of words or values from across the database. By configuring specific lexicons on a QName basis, you can specify the element or element-attribute for which you want words or values maintained in a lexicon.

But what if I want to get a list of words or values for a specific subset of the database? For instance, perhaps I want the words or values in use in a given collection or directory. Or perhaps the values used in documents returned by a specific search.

By providing a cts:query constructor - whether it's a simple cts:word-query() or a highly complex search tree incorporating proximity, boolean logic, and the whole set of cts:query constructors - you can specify a set of fragments from which the lexicon entries are to be drawn. In this case, before any lexicon entry (word or value) is returned to you, the server checks to see if that entry is used in at least one of the fragments covered by the cts:query constructor. This will work with either access method discussed above.
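A sketch of what that looks like with each access method. The search terms and the author QName are made up for illustration, and the empty sequences simply skip the start and options parameters.

```xquery
(: words matching "a*" that occur in at least one fragment
   containing the word "lexicon" :)
cts:word-match("a*", (), cts:word-query("lexicon")),

(: values of the hypothetical <author> element, drawn only from
   fragments that satisfy a compound query with proximity :)
cts:element-values(
  xs:QName("author"),
  (),   (: no starting value :)
  (),   (: no options :)
  cts:and-query((
    cts:word-query("xml"),
    cts:near-query(
      (cts:word-query("content"), cts:word-query("analytics")),
      10)
  )))
```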

You can use this parameter to constrain lexicon results to specific collections, directories, documents, search results, or just about any combination thereof. The only requirement is that you be able to express your constraint using cts:query constructors.
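For example, the constraints below limit lexicon results to a collection, a directory tree, and a combination of collection and search. The collection name, directory URI, and author QName are all hypothetical.

```xquery
(: <author> values in use within one collection :)
cts:element-values(
  xs:QName("author"), (), (),
  cts:collection-query("press-releases")),

(: words in use anywhere beneath one directory :)
cts:words((), (),
  cts:directory-query("/articles/2007/", "infinity")),

(: words from documents in the collection that also mention "merger" :)
cts:words((), (),
  cts:and-query((
    cts:collection-query("press-releases"),
    cts:word-query("merger"))))
```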

Of course, there's no such thing as a free lunch, so this constraint functionality does come with some performance cost. Performance implications vary based on the type of lexicon, the density of the constraint and (to some extent) the complexity of the constraint. In the best scenarios, the performance cost is negligible. In the worst, well, let's just say it can get ugly... I'll dig into the ins and outs of this particular subject in an upcoming column.

In the meantime, that's the end of our introductory tour of XML lexicons. If you want to know more about them, take a look at Chapter 21 of the Developer's Guide and, of course, the relevant sections of the API documentation.

Have fun!