The Secrets to Wildcard Search in MarkLogic

by Mike Wooldridge

Wildcards are special characters that you can include in your queries to search for variations of a term. A short string with wildcard characters can potentially match thousands terms of in your database, making wildcards an efficient way to search on similarly spelled words. MarkLogic search supports two wildcard characters:

* Matches zero or more non-space characters in a word, with car* matching car, care, and caring (and all other terms that start with car).
? Matches exactly one non-space character in a word, with car? matching care and cart but not car or caring.

You can have wildcards at the beginning, middle, or end of your search terms, or mix them together (e.g., *e*?her?*).

Note that wildcard searches are similar to stemming, which is how MarkLogic automatically considers grammatically similar words when processing a query. With stemming turned on, MarkLogic considers swam and swimming when you search for swim. For more detail, see the Understanding and Using Stemmed Searches section of the Search Developer's Guide.

MarkLogic supports wildcard search with three types of indexes:

  • three-character indexes
  • trailing wildcard indexes
  • lexicons

How and when it uses these indexes depends on the wildcard pattern you're searching on.

Three-Character Indexes

Indexes are the key to regular search in MarkLogic and are also the key to handling wildcards. An index takes character sequences in your database documents and maps those sequences to document IDs. During a typical search, MarkLogic analyzes the query—breaking a complicated query into its component parts—and then consults the indexes to return a list of matching results. By default, MarkLogic indexes all the words in a database as well as structural information like the element and attribute names (for XML documents) and property names (for JSON documents). You can have MarkLogic index additional sequences to support certain features (such as wildcards).

When you turn on a three-character index for your database, MarkLogic will map every unique three-letter sequence to its corresponding documents. If the term cater appears in a document, entries for cat, ate, and ter are added to the index. MarkLogic also includes contextual information about whether each sequence was at the start or end of a word. A three-character index is perfect for handling wildcard searches that include three characters and *—for example *ate, ate*, or *ate*. MarkLogic can retrieve all of the document IDs it needs for the results from that index.

MarkLogic can also leverage the three-character index for longer wildcard terms. It can split a four-character term such as *ater* into *ate* and *ter*, look up the two terms in the three-character index, and combine the results. However, this strategy can result in false positives. For example, a document that contains terminate would be incorrectly flagged as a match.

Such false positives are removed during the filtering part of the search. Filtering is when result documents are opened and double-checked to make sure they actually match the search query. The more filtering MarkLogic has to do, the longer it takes to return results. This means that using the three-character index to handle longer wildcard terms—which can increase filtering time—isn't always the most efficient strategy. (You can turn off filtering, which means you'll get results faster but potentially at the expense of false positives.)

MarkLogic also gives you the option to turn on one- and two-character indexes for databases. You might think that these would be helpful for getting wildcard results for shorter terms (e.g., b* and ba*), but turning these indexes on results in much, much larger indexes, and it also impacts the time it takes to load, index, and merge documents. For these reasons, activating these indexes is discouraged, and the rest of this article assumes they are not turned on.

What about indexing sequences greater than three characters? MarkLogic doesn't offer these indexes since, in most cases, any benefit would be outweighed by the increased index space. Typically, three-character indexes give you the most bang for the buck by efficiently supporting wildcards and search in general.

Trailing Wildcard Index

For quickly handling wildcard searches that have more than three characters and end in *, the trailing wildcard index can be turned on. This will store entries for all the possible prefix combinations for each word greater than or equal to three characters. For cater, MarkLogic stores cat, cate, and cater in the index. The index includes contextual information marking the sequences as occurring at the start of a word. Search for cate*, and MarkLogic will use the trailing wildcard index.

A trailing wildcard index isn't helpful for terms with a leading * (e.g., *ater). Nor does MarkLogic offer an analogous leading wildcard index. MarkLogic is biased toward English in this sense. Searchers for English terms put wildcards at the end of search terms more often than at the beginning, because suffixes are more commonly used in English grammar than prefixes. Hence the preference for a trailing wildcard index.

Three-character and trailing wildcard indexes

Figure 1: Indexes map character sequences that appear in documents to their respective document IDs. The figure above shows how MarkLogic stores sequences from the word cater when different indexes are turned on.

Lexicons

For wildcard searches that involve something more complicated than three-character strings or a trailing wildcard, MarkLogic will try to use lexicons. A lexicon is an ordered list of the unique words in a database. Unlike an index, a lexicon doesn't map words to the documents in which they appear; it's just a big ordered list.

Lexicons enable MarkLogic to quickly retrieve all the potential matches for a wildcard pattern. Let's say you are searching for the wildcarded string r?b*?t. MarkLogic can take that string and compare it to the lexicon word list. Since the lexicon words are stored in order, MarkLogic can scan them efficiently. (We'll talk more about that in a moment.)

In our r?b*?t example, a lexicon scan might return the terms rabbit, rebuilt, and robot. With this information, MarkLogic can rewrite the wildcard query like this:

rabbit OR rebuilt OR robot

It can then find the occurrences of these three words in its indexes and combine what it finds into a set of results.

Lexicons also support type-ahead suggestions in search boxes. When a user types vac in a search box, the application can check a lexicon and display a menu with words that match, like vacation and vacuum. It's similar to performing a search for vac*, but doesn't require checking indexes.

Limits of Lexicons

Using a lexicon to build an expanded OR query doesn't always work. Sometimes a wildcard string will return thousands of matching terms from the lexicon, way too many to construct an effective OR query. There are ways to get around this.

You can cut down on a long list of terms returned by constructing a shorter list of three-character prefixes and postfixes. Let's say a lexicon results list looks like this:

  • staggered
  • staired
  • stammered
  • stared
  • starred

With the prefix/postfix strategy, MarkLogic can rewrite the list as something shorter and more manageable:

  • sta*
  • *red

Then you can use those three-character sequences with the three-character indexes, which is very fast. This can yield false positives (which will need filtering), but it can be a better way to go when a lexicon scan returns too many words.

MarkLogic employs another strategy for single letters with a trailing wildcard—for example, a*. MarkLogic would break this into three separate searches:

  • search for a using the regular word index (there will be one index entry for a, if it exists)
  • search for a? in the lexicon (this should yield a small number of results) and do an expanded OR query
  • search for a??* in the lexicon, using the prefix/postfix strategy (to avoid a long list of matches for words beginning with a)
Wildcard Search Options

Developers have control over how MarkLogic uses lexicons to resolve wildcard searches with the lexicon-expand option. This option can have the following values:

heuristic
Resolve wildcard queries using MarkLogic's set of internal rules (e.g., if a lexicon scan is estimated to return a large number of terms, use alternative strategies). This is the default.
full
Always expand a wildcard query into a set of ORed terms, even if this results in a large number of terms and, potentially, a slow query.
prefix-postfix
Always resolve wildcard queries with prefix and postfix sequences based on matching lexicon terms.
off
Do not use lexicon expansion for wildcard searches. Use indexes instead.

Better Lexicon Scans

The order of entries in a lexicon is determined by the collation type of the lexicon. The codepoint lexicon, which orders lowercase before uppercase, is recommended for faster searching and more accurate results. This presents a challenge when doing case-insensitive searches. For example, a simple linear scan through the lexicon for the wildcard term cat* will check lots of entries that will never match (those starting with d, e, f, etc.) as you scan through the entries starting with c to the entries starting with C.

MarkLogic solves this problem by initially doing a series of binary searches to pick out just the sections of the lexicon that are worth scanning. A binary search works by looking at the middle item of an ordered list, determining if the search term is located before or after that middle item, and then performing a similar search on that half. By continuing to divide and conquer, the search eventually finds what it is looking for—usually much faster than if it searched linearly.

For a wildcard search for cat*, MarkLogic performs binary searches to find the range starting at c* and ending at C*. Then, within that range, it does binary searches to find the subsections from ca* to cA* and from Ca* to CA*. Finally, it finds the subsections for the letters t and T. (In reality, MarkLogic may stop doing binary searches for the characters in a search term once the ranges get small enough.) After the subsections are determined, MarkLogic can scan through them quickly to find the matching entries.

Binary searches minimize what needs to be scanned

Figure 2: MarkLogic uses a series of binary searches to minimize what it needs to scan in a lexicon for a term. The above shows the process when performing a case-insensitive search for the term cat*. The white sections mark the areas that could have matches.

Conclusion

With these techniques, MarkLogic can quickly return results for even very complex wildcard queries. You can learn more about wildcard search and configuring related settings in the MarkLogic documentation.

Thanks to Mary Holstege and Fei Xue for their help with this article.

Comments