Wildcards are special characters that you can include in your queries to search for variations of a term. A short string with wildcard characters can potentially match thousands terms of in your database, making wildcards an efficient way to search on similarly spelled words. MarkLogic search supports two wildcard characters:
||Matches zero or more non-space characters in a word, with
||Matches exactly one non-space character in a word, with
You can have wildcards at the beginning, middle, or end of your search terms, or mix them together (e.g.,
Note that wildcard searches are similar to stemming, which is how MarkLogic automatically considers grammatically similar words when processing a query. With stemming turned on, MarkLogic considers
swimming when you search for
swim. For more detail, see the Understanding and Using Stemmed Searches section of the Search Developer's Guide.
MarkLogic supports wildcard search with three types of indexes:
- three-character indexes
- trailing wildcard indexes
How and when it uses these indexes depends on the wildcard pattern you're searching on.
Indexes are the key to regular search in MarkLogic and are also the key to handling wildcards. An index takes character sequences in your database documents and maps those sequences to document IDs. During a typical search, MarkLogic analyzes the query—breaking a complicated query into its component parts—and then consults the indexes to return a list of matching results. By default, MarkLogic indexes all the words in a database as well as structural information like the element and attribute names (for XML documents) and property names (for JSON documents). You can have MarkLogic index additional sequences to support certain features (such as wildcards).
When you turn on a three-character index for your database, MarkLogic will map every unique three-letter sequence to its corresponding documents. If the term
cater appears in a document, entries for
ter are added to the index. MarkLogic also includes contextual information about whether each sequence was at the start or end of a word. A three-character index is perfect for handling wildcard searches that include three characters and *—for example
*ate*. MarkLogic can retrieve all of the document IDs it needs for the results from that index.
MarkLogic can also leverage the three-character index for longer wildcard terms. It can split a four-character term such as
*ter*, look up the two terms in the three-character index, and combine the results. However, this strategy can result in false positives. For example, a document that contains
terminate would be incorrectly flagged as a match.
Such false positives are removed during the filtering part of the search. Filtering is when result documents are opened and double-checked to make sure they actually match the search query. The more filtering MarkLogic has to do, the longer it takes to return results. This means that using the three-character index to handle longer wildcard terms—which can increase filtering time—isn't always the most efficient strategy. (You can turn off filtering, which means you'll get results faster but potentially at the expense of false positives.)
MarkLogic also gives you the option to turn on one- and two-character indexes for databases. You might think that these would be helpful for getting wildcard results for shorter terms (e.g.,
ba*), but turning these indexes on results in much, much larger indexes, and it also impacts the time it takes to load, index, and merge documents. For these reasons, activating these indexes is discouraged, and the rest of this article assumes they are not turned on.
What about indexing sequences greater than three characters? MarkLogic doesn't offer these indexes since, in most cases, any benefit would be outweighed by the increased index space. Typically, three-character indexes give you the most bang for the buck by efficiently supporting wildcards and search in general.
Trailing Wildcard Index
For quickly handling wildcard
searches that have more than three characters and end in *,
the trailing wildcard index can be turned on. This will store entries for all
the possible prefix combinations for each word greater than or equal to three
cater, MarkLogic stores
cater in the index. The index includes contextual
information marking the sequences as occurring at the start of a word. Search
cate*, and MarkLogic will use the trailing wildcard
A trailing wildcard index isn't
helpful for terms with a leading * (e.g.,
*ater). Nor does MarkLogic offer an analogous leading
wildcard index. MarkLogic is biased toward English in this sense. Searchers for
English terms put wildcards at the end of search terms more often than at the
beginning, because suffixes are more commonly used in English grammar than
prefixes. Hence the preference for a trailing wildcard index.
Figure 1: Indexes map character sequences that appear in documents to their respective
document IDs. The figure above shows how MarkLogic stores sequences from the word
cater when different indexes are turned on.
For wildcard searches that involve something more complicated than three-character strings or a trailing wildcard, MarkLogic will try to use lexicons. A lexicon is an ordered list of the unique words in a database. Unlike an index, a lexicon doesn't map words to the documents in which they appear; it's just a big ordered list.
Lexicons enable MarkLogic to
quickly retrieve all the potential matches for a wildcard pattern. Let's say
you are searching for the wildcarded string
MarkLogic can take that string and compare it to the lexicon word list. Since
the lexicon words are stored in order, MarkLogic can scan them efficiently. (We'll
talk more about that in a moment.)
r?b*?t example, a lexicon scan might return the terms
robot. With this
information, MarkLogic can rewrite the wildcard query like this:
rabbit OR rebuilt OR robot
It can then find the occurrences of these three words in its indexes and combine what it finds into a set of results.
Lexicons also support type-ahead
suggestions in search boxes. When a user types
vac in a
search box, the application can check a lexicon and display a menu with words that
It's similar to performing a search for
doesn't require checking indexes.
Limits of Lexicons
Using a lexicon to build an expanded OR query doesn't always work. Sometimes a wildcard string will return thousands of matching terms from the lexicon, way too many to construct an effective OR query. There are ways to get around this.
You can cut down on a long list of terms returned by constructing a shorter list of three-character prefixes and postfixes. Let's say a lexicon results list looks like this:
With the prefix/postfix strategy, MarkLogic can rewrite the list as something shorter and more manageable:
Then you can use those three-character sequences with the three-character indexes, which is very fast. This can yield false positives (which will need filtering), but it can be a better way to go when a lexicon scan returns too many words.
MarkLogic employs another strategy
for single letters with a trailing wildcard—for example,
MarkLogic would break this into three separate searches:
ausing the regular word index (there will be one index entry for
a, if it exists)
a?in the lexicon (this should yield a small number of results) and do an expanded OR query
a??*in the lexicon, using the prefix/postfix strategy (to avoid a long list of matches for words beginning with
Developers have control over how MarkLogic uses lexicons to resolve wildcard searches with the
lexicon-expand option. This option can have the following values:
- Resolve wildcard queries using MarkLogic's set of internal rules (e.g., if a lexicon scan is estimated to return a large number of terms, use alternative strategies). This is the default.
- Always expand a wildcard query into a set of ORed terms, even if this results in a large number of terms and, potentially, a slow query.
- Always resolve wildcard queries with prefix and postfix sequences based on matching lexicon terms.
- Do not use lexicon expansion for wildcard searches. Use indexes instead.
Better Lexicon Scans
The order of entries in a lexicon is determined by the
collation type of the lexicon.
The codepoint lexicon, which orders lowercase before uppercase, is recommended
for faster searching and more accurate results. This
presents a challenge when doing case-insensitive searches. For example, a
simple linear scan through the lexicon for the wildcard term
cat* will check lots of entries that will never match (those starting with
etc.) as you scan through the entries starting with
c to the entries starting with
MarkLogic solves this problem by initially doing a series of binary searches to pick out just the sections of the lexicon that are worth scanning. A binary search works by looking at the middle item of an ordered list, determining if the search term is located before or after that middle item, and then performing a similar search on that half. By continuing to divide and conquer, the search eventually finds what it is looking for—usually much faster than if it searched linearly.
For a wildcard search for
MarkLogic performs binary searches to find the range starting at
C*. Then, within that range, it does binary
searches to find the subsections from
cA* and from
CA*. Finally, it finds the subsections for the letters
T. (In reality, MarkLogic may stop doing binary
searches for the characters in a search term once the ranges get small
enough.) After the subsections are
determined, MarkLogic can scan through them quickly to find the matching
2: MarkLogic uses a series of binary searches to minimize
what it needs to scan in a lexicon for a term. The above shows the process when
performing a case-insensitive search for the term
cat*. The white sections mark the areas that could
With these techniques, MarkLogic can quickly return results for even very complex wildcard queries. You can learn more about wildcard search and configuring related settings in the MarkLogic documentation.