Working with xdmp:plan
Given the size of MarkLogic's cts:query API, there are often many ways to construct the cts:query parameter passed to cts:search. While all are expected to return only relevant matches, performance can vary (in some cases significantly) depending on the index options currently enabled on the database.
To help understand how MarkLogic constructs responses to search requests, it's important to understand the concepts of E-nodes and D-nodes. MarkLogic server instances are logically segmented based on the operations performed to satisfy requests. Servers can be Evaluators (E-nodes), Data Managers (D-nodes) or combined E/D-nodes (for instance, a lone MarkLogic instance is a combined E/D-node). E-nodes listen on a socket, parse requests, and generate responses. D-nodes hold data along with its associated indexes, and support E-nodes by providing them with the data they need to satisfy requests and process updates. For more information, refer to MarkLogic E-nodes and D-nodes in the MarkLogic Concepts Guide.
cts:search resolves queries in two phases. The first phase performs index resolution on the D-nodes. This initial result may contain false positives, depending on the index configuration and the query. The second phase filters the results on the E-nodes, examining the matched documents and removing false positives. If a query can be resolved completely from the indexes, then filtering is not required.
One goal when optimizing search performance is to configure the database indexes, and construct queries that take advantage of those indexes, in such a way that filtering isn't necessary. Accomplishing this requires analyzing the types of searches an application supports and matching those requirements to the optimal index configuration and associated cts:query constructors. Of course, there are always tradeoffs to consider, primarily between query response times under expected loads, memory requirements, and the on-disk size of the database. In other words, the decision to trade disk space and memory for response times, or response times for disk space and memory, depends entirely on the specific requirements of the application, budgetary constraints, and any Service Level Agreements (SLAs) between the application owners and their users.
Since index resolution takes place in memory and filtering requires reading documents off disk (disk I/O), filtered searches will be slower than searches that can return only relevant results without filtering. Skipping the filtering phase is only safe if the database configuration and the specific cts:query constructors used can ensure that false positives are not included during the index resolution phase.
A simple example
For this discussion we will use a sample database containing 50,000 documents constructed by randomly selecting words from an English language thesaurus, stringing the words together into sentences of random length, and stringing the sentences together into a random number of paragraphs. Documents also include metadata sections containing randomly selected words in <keyword> elements. In addition, one or more choices from a set of known quotations are included in some documents to provide known sequences of words for testing more complex queries.
The generated documents follow this structure:
The first set of tests is run against a database with all index options disabled except for word searches (at minimum, either word searches or stemming is required for searching content). The test consists of executing a simple word search. This search requests that the MarkLogic server select all documents containing the word "ontology" regardless of where the word appears within a document.
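As a rough sketch, a search of this kind might look like the following. The exact expression used for these tests is an assumption here; cts:word-query is one of several constructors that could express it.

```xquery
xquery version "1.0-ml";

(: Assumed form of the simple word search: match any document
   that contains the word "ontology" anywhere in its content. :)
cts:search(fn:doc(), cts:word-query("ontology"))
```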
The query plan for this search can be examined by passing the cts:search expression to xdmp:plan like this:
which returns the following plan:
For this discussion, we're primarily interested in the final-plan, the estimate, and what the plan indicates can be determined during the index resolution phase.
The <qry:annotation> element in the final plan indicates that during query formulation only one assertion about documents was identified:
- The document contains the word "ontology".
This is the only assertion that can be identified during query formulation given the current index configuration of the database and the supplied query. In addition, the plan contains a <qry:result> element with an estimate of the number of matching documents in the database: <qry:result estimate="76"/>. Using information in the index alone, the server estimates that the database contains 76 matching documents.
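The inspection described here can be sketched as follows. This is a minimal illustration that assumes the simple search is written with cts:word-query (an assumption, not confirmed by the text), and it uses namespace wildcards so that it does not depend on the namespace URI bound to the qry prefix.

```xquery
xquery version "1.0-ml";

(: Ask the server how it would resolve the search, without running it. :)
let $plan := xdmp:plan(cts:search(fn:doc(), cts:word-query("ontology")))
return (
  (: The assertions the indexes can test for this query. :)
  $plan//*:final-plan,
  (: The index-only estimate of matching documents. :)
  fn:data($plan//*:result/@estimate)
)
```

Returning only these two pieces keeps the output focused on the parts of the plan discussed in this section.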
Executing a filtered search and counting the results provides a total of 76, which matches the value provided by the estimate. The server is able to accurately retrieve the correct documents using index resolution alone. This search could be performed "unfiltered" with a high degree of confidence that the results do not include false positives. This simple configuration is enough to satisfy many use cases.
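One way to make this comparison is sketched below. It assumes the search is written with cts:word-query, and it uses xdmp:estimate, a built-in not mentioned in the text above, to obtain the index-only estimate without fetching documents; fn:count over the default (filtered) search gives the verified total.

```xquery
xquery version "1.0-ml";

let $query := cts:word-query("ontology")
return (
  (: Index resolution only: no documents are read. :)
  xdmp:estimate(cts:search(fn:doc(), $query)),
  (: Default cts:search is filtered: every candidate is checked. :)
  fn:count(cts:search(fn:doc(), $query))
)
```

When the two numbers agree, as they do in this test, the indexes alone resolve the query accurately.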
Unfortunately, this simple query and minimal database configuration are insufficient to support the wide range of search requirements found in typical search applications. The following sections build on this simple example to illustrate how to analyze more complex queries, and how enabling additional index options affects the server's ability to accurately resolve searches without filtering.
Restricting matches to specific elements
Since MarkLogic indexes both content and structure, it's possible to formulate queries that restrict results not only to documents containing specific words, but to documents containing those words within a specific element. Consider a more specific query executed against the same database using the same index configuration. In this case, the requirement is to retrieve documents containing the word "ontology", but only if the word appears in the "keyword" element. This can be accomplished using a query like this:
cts:search(fn:doc(), cts:element-word-query(xs:QName('keyword'), "ontology"))
Like the previous example, this search requests that the server match documents containing the word "ontology", but in addition, the word must appear in an element named "keyword". The index configuration is unchanged, with only the following option enabled:
- word searches
The final-plan for this query is:
This plan contains two assertions about possible matching documents:
- The document contains the word "ontology".
- The document contains an element named "keyword".
Note that it does not assert that the word "ontology" appears in the element "keyword". This is clearly not enough information to ensure that a document found during the index resolution phase actually matches the query. This is demonstrated by comparing the estimate (75) with a count of the filtered search results (2). The unfiltered search results contain 73 false positive matches. These false positive matches must be removed during the filtering phase to guarantee accurate results.
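The number of false positives can be measured directly, as in the sketch below. It relies on the "unfiltered" and "filtered" options of cts:search; the variable names are illustrative only.

```xquery
xquery version "1.0-ml";

let $query := cts:element-word-query(xs:QName("keyword"), "ontology")
(: Candidates produced by index resolution alone. :)
let $unfiltered := fn:count(cts:search(fn:doc(), $query, "unfiltered"))
(: Matches that survive the filtering phase. :)
let $filtered := fn:count(cts:search(fn:doc(), $query, "filtered"))
(: The difference is the number of false positives. :)
return $unfiltered - $filtered
```

With the index configuration described in this section, the difference should correspond to the 73 false positives reported above.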
Enabling element-based searches
To support resolving queries of this type without resorting to filtering, MarkLogic provides additional index configuration options. The first one to enable is named "fast element word searches". Note that there are tradeoffs to consider for each additional index option enabled. In this case, enabling fast element word searches decreases document ingestion performance and increases the size of the database on disk, because additional index information is captured and persisted when documents are inserted into the database. With this option enabled, the index configuration becomes:
- word searches
- fast element word searches
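The option can be enabled in the Admin Interface, or programmatically with the Admin API. The following is a minimal sketch of the latter; the database name "plan-demo" is hypothetical and stands in for whatever database holds the test documents.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: "plan-demo" is a hypothetical database name. :)
let $db := xdmp:database("plan-demo")
(: Turn on the "fast element word searches" index option. :)
let $config :=
  admin:database-set-fast-element-word-searches($config, $db, fn:true())
return admin:save-configuration($config)
```

Saving the configuration changes the index settings, so existing documents need to be reindexed before the new index can be used to resolve queries.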
Once the server has finished reindexing the documents, executing:
xdmp:plan(cts:search(fn:doc(), cts:element-word-query(xs:QName('keyword'), "ontology")))
results in the following final-plan:
This plan now contains two assertions to be tested:
- The document contains an element keyword containing the word "ontology" and …
- The document contains the word "ontology"
These assertions are now specific enough to match the intent of the cts:query passed to cts:search during the index resolution phase. Note that the value of the estimate has gone from 75 to 2. This matches the number of results actually returned by executing this query in a filtered search and counting the results. The correct set of matching documents can now be determined solely during the index resolution phase of query execution. With fast element word searches enabled, searches of this type can be executed "unfiltered" with high confidence that the results will not contain false positives.
This result demonstrates why xdmp:plan is an essential tool for understanding and optimizing search performance in MarkLogic applications. The insight it provides into the inner workings of MarkLogic's indexing and search capabilities is invaluable in helping application developers deliver the best possible performance to their users.