10,000 Range Indexes

by Dave Cassel

There was a recent discussion on an internal mailing list asking whether you could set up 10,000 range indexes on a database. When faced with a question like this, we should step back and consider the problem we're trying to solve. The data set in question has about 1,000 entities, with the expectation that an average of 10 fields per entity would need to be indexed. This leads to the question about having 10,000 range indexes.

At first blush, this line of thought suggests relational thinking -- this is natural; that's what most of us learned first. Of course, every index has a cost, regardless of whether the database is MarkLogic, an RDBMS, or another NoSQL database. 10,000 range indexes isn't a good idea in MarkLogic, but if you find yourself considering that many, there's probably a better solution.

Universal Index

The first question we should consider is whether we actually need range indexes for those 10,000 fields (elements). MarkLogic's Universal Index may already provide what's needed: it indexes the terms and structure of all documents. Through the Universal Index, we can do full-text searches on any ingested content, even scoping them to particular document sections if we want. In many cases, this means we don't need to set up specific indexes to provide rapid access to particular content.
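As a rough sketch, both of the following searches resolve out of the Universal Index alone -- no range index is required. The `<title>` element name is an assumption for illustration:

```xquery
xquery version "1.0-ml";

(: Full-text search across all ingested content :)
cts:search(fn:doc(), cts:word-query("marklogic")),

(: The same search, scoped to a particular document section :)
cts:search(fn:doc(),
  cts:element-word-query(xs:QName("title"), "marklogic"))
```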

Range Indexes

The Universal Index provides immediate access to text and structure. When do we need range indexes? In a search context, we use range indexes for data-type specific inequalities, such as "find me all articles published since Jan 1, 2012". By having a date range index on the publication date, we can build a greater-than-or-equal-to query. We can also use range indexes to get lists of values, enabling us to build facets. Jason Hunter's Inside MarkLogic Server lists other range index benefits. 
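For example, assuming a date range index has been configured on a `<published>` element (the element name is an assumption), the inequality and the facet-value lookup described above might look like this:

```xquery
xquery version "1.0-ml";

(: Inequality: all articles published since Jan 1, 2012 :)
cts:search(fn:doc(),
  cts:element-range-query(xs:QName("published"), ">=",
    xs:date("2012-01-01"))),

(: Facets: distinct values pulled straight from the range index,
   without touching the documents themselves :)
cts:element-values(xs:QName("published"))
```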

In typical applications, we want to search across many (or all) fields, but we don't need inequality comparisons or to generate thousands of facets. This means that for most applications, we'll get much of our search capability from the Universal Index and supplement with a small number of range indexes. 


Fields

In MarkLogic, a field is a structure that lets us refer to the contents of multiple elements by the same name. When we merge data from different sources, we sometimes get multiple elements that represent the same thing, but with different names. For instance, consider two book databases, where one has "published-date" and one has "pub-date". At first glance, these appear to be two separate types of data, suggesting separate range indexes. However, with MarkLogic's field feature, a single name can refer to the contents of both elements, with one type-specific index pulling values from all the elements. This is another way that the number of indexes can be reduced.
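Continuing the book example, a sketch of querying such a field: assume a field named "publication-date" has been configured (through the Admin UI or Admin API) to include both the "published-date" and "pub-date" elements, with a single date range index on the field itself. The field name here is an assumption:

```xquery
xquery version "1.0-ml";

(: One field range index covers both source elements :)
cts:search(fn:doc(),
  cts:field-range-query("publication-date", ">=",
    xs:date("2012-01-01")))
```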


Triples

Sometimes you really do want to do range queries across a wide variety of fields. In an extreme case, MarkLogic lets you represent everything as triples, allowing for inequality queries using SPARQL's FILTER or the cts:triples() function. MarkLogic's own history monitoring is built entirely with triples. More commonly, triples are used in combination with documents to produce a powerful hybrid. 
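A sketch of the SPARQL approach, with hypothetical predicate IRIs: the FILTER clause gives us an inequality comparison over triple values without configuring a per-element range index.

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Find books published since 2012, modeled as triples :)
sem:sparql('
  PREFIX ex: <http://example.org/>
  SELECT ?book ?date
  WHERE {
    ?book ex:pub-date ?date .
    FILTER (?date >= "2012-01-01"^^<http://www.w3.org/2001/XMLSchema#date>)
  }
')
```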

Why Not 10,000 Range Indexes?

Having looked at some alternatives to setting up 10,000 range indexes, let's come back to the original question. It turns out that the answer is no: you should not attempt anything on the order of 10,000. A practical ceiling for range indexes is about 100, and the vast majority of applications need far fewer than that. Each forest stores the indexes that relate to the content in that forest, and each forest is broken into one or more stands. Each stand manages its range indexes in two memory-mapped files per index. We commonly see 12 forests on a host (six master, six replica) totaling about 100 stands; multiply 100 stands by 10,000 range indexes at two files each, and we'd have around two million open file handles.


Sometimes the transition from the relational model to the document + triples model doesn't click for a person right away, which can lead to a question like this one. If you find yourself planning to make thousands (or even hundreds) of range indexes, it's probably worth stepping back and rethinking how the data will be represented. The Universal Index is really powerful -- let it do what it does best! Then, for the cases the Universal Index doesn't satisfy, apply fields, range indexes, and triples as needed.

Monitoring MarkLogic History

by Eric Bloch

In MarkLogic 7, we introduced new History Monitoring APIs and a dashboard to help you visualize performance over time. These features are described nicely in our docs.

But if you're still running MarkLogic 6, we've also made available a tool we've used in house called mlstat.

mlstat is a command-line tool that monitors various aspects of MarkLogic Server performance on Linux. It runs on the MarkLogic node itself and is modeled on classic Unix tools like vmstat and mpstat. It is designed to be always on, running in the background and redirecting its output to a file. It can tag each line of output with an epoch time or timestamp so the data can be correlated with an event.


Range Index Scoring

by Adam Fowler

A new feature of MarkLogic 7's Search API is range index scoring: affecting relevancy based on a value within a document. Here I detail a couple of use cases.

Range index scoring allows you to determine relevancy by values in a document, rather than matching values against a term exactly.

A good use case for this is ratings: a higher rating should show nearer the top of search results.

A second use case is distance from the centre point of a geospatial query, just like you get on hotel search websites.

We can now do these directly in MarkLogic without any special voodoo from a developer. Just set up the search options and perform a query. Easy!
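A rough sketch of the ratings case, assuming an int range index on a hypothetical `<rating>` element: the score-function option (new in MarkLogic 7) makes the indexed value contribute to relevancy, so higher ratings score higher.

```xquery
xquery version "1.0-ml";

(: Word match and rating both contribute to the relevancy score :)
cts:search(fn:doc(),
  cts:and-query((
    cts:word-query("hotel"),
    cts:element-range-query(xs:QName("rating"), ">=", 0,
      ("score-function=linear"))
  )))
```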

Show me!

Below is the feature in action:


This uses MLJS for rendering results, but the functionality is in core MarkLogic, not MLJS. MarkLogic also calculates a heatmap on the fly, and this calculated data is passed to heatmap-openlayers.js. That is much more efficient than sending lots of raw data to heatmap.js, especially for thousands of visible points.

Note that the MLJS widgets interact with each other – hovering over a marker on the map highlights it in the search results list with a different background colour.

Isn’t this like sorting?

In a word, no.

Sorting is based purely on a value in a document. By changing relevancy scores, you can combine different search terms. For example, you could have a rating, a distance, and a word query all contributing to the relevancy score. A result that is a little further away but has a much higher rating may trump one that's dead centre on the map but has a low rating.

How does it work?

Under the hood, you provide a set of options and a query. I've documented the REST search options I'm using, the search query I'm sending, and the raw results I'm getting back in a Gist. Go have a read; it's pretty straightforward. (I tend to go overkill in setting search options, though!)

In Summary

Ever wanted to tweak relevancy by values in a document? Now you can! Go have a read of this new V7 feature, download MarkLogic today too, and have a play!

This post originally appeared on Adam's blog and we thank him for his permission to re-post it here. The post is related to several of his other posts.
