A goal without a plan is just a wish

by Scott Parnell

Working with xdmp:plan

Given the size of MarkLogic's cts:query API, there are often many ways to construct the cts:query parameter passed to cts:search. While all are expected to return only relevant matches, the performance can vary (in some cases significantly) depending on the current index options enabled on the database.

To help understand how MarkLogic constructs responses to search requests, it's important to understand the concepts of E-nodes and D-nodes. MarkLogic server instances are logically segmented based on the operations performed to satisfy requests. Servers can be Evaluators (E-nodes), Data Managers (D-nodes) or combined E/D-nodes (for instance, a lone MarkLogic instance is a combined E/D-node). E-nodes listen on a socket, parse requests, and generate responses. D-nodes hold data along with its associated indexes, and support E-nodes by providing them with the data they need to satisfy requests and process updates. For more information, refer to MarkLogic E-nodes and D-nodes in the MarkLogic Concepts Guide.

By default, cts:search resolves queries in two phases. The first phase performs index resolution on the D-nodes. This initial result may contain false positives depending on the index configuration and the query. The second phase performs filtering of the results on the E-nodes, which examines the matched documents and removes false positives. If a query can be resolved completely from the indexes, then filtering is not required.

One goal when optimizing search performance is to configure the database indexes, and to construct queries that take advantage of them, so that filtering isn't necessary. Accomplishing this requires analyzing the types of searches an application supports and matching those requirements to the optimal index configuration and associated cts:query constructors. Of course, there are always tradeoffs to consider, primarily between query response times under expected loads, memory requirements, and the on-disk size of the database. In other words, the decision to trade disk space and memory for response times, or response times for disk space and memory, depends on the specific requirements of the application, budgetary constraints, and any Service Level Agreements (SLAs) between the application owners and its users.

Since index resolution takes place in memory and filtering requires reading documents off disk (disk I/O), filtered searches will be slower than searches that can return only relevant results without filtering. Skipping the filtering phase is only safe if the database configuration and the specific cts:query constructors used can ensure that false positives are not included during the index resolution phase.

A simple example

For this discussion we will use a sample database containing 50,000 documents. Each document is constructed by randomly selecting words from an English language thesaurus, stringing the words together into sentences of random length, and stringing the sentences together into a random number of paragraphs. Documents also include metadata sections containing randomly selected words in <keyword> elements. In addition, one or more choices from a set of known quotations are included in some documents to provide known sequences of words for testing more complex queries.

The generated documents follow this structure:

The first set of tests is run against a database with all index options disabled except for word searches (at minimum, either word searches or stemming is required for searching content). The test consists of executing a simple search using cts:search(fn:doc(), 'ontology'). This example requests that the MarkLogic server select all documents containing the word "ontology" regardless of where the word appears within a document.

The query plan for this search can be examined by passing the cts:search function to xdmp:plan like this:

xdmp:plan(cts:search(fn:doc(), 'ontology'))

which returns the following plan:

For this discussion, we're primarily interested in the final-plan, the estimate, and what the plan indicates can be determined during the index resolution phase.
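Those two pieces can also be pulled out of the plan directly rather than read by eye. The following is a minimal sketch, assuming the plan contains the final-plan and result elements described in this article. The `*:` wildcard is used so that the example does not depend on a namespace prefix declaration.

```xquery
xquery version "1.0-ml";

(: Capture the plan once, then select only the parts of interest. :)
let $plan := xdmp:plan(cts:search(fn:doc(), 'ontology'))
return (
  (: *: matches the element name in any namespace :)
  $plan//*:final-plan,
  (: the estimate attribute, converted to a number :)
  $plan//*:result/@estimate/xs:integer(.)
)
```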

The <qry:annotation> element in the final plan indicates that during query formulation only one assertion about documents was identified:

  1. The document contains the word "ontology".

This is the only assertion that can be identified during query formulation given the current index configuration of the database and the supplied query. In addition, the plan contains a <qry:result> element with an estimate of the number of matching documents in the database: <qry:result estimate="76"/>. Using information in the index alone, the server estimates that the database contains 76 matching documents.

Executing a filtered search and counting the results using fn:count(cts:search(fn:doc(), 'ontology')) provides a total of 76, which matches the value provided by estimate. The server is able to accurately retrieve the correct documents using index resolution alone. This search could be performed "unfiltered" with a high degree of confidence that the results do not include false positives. This simple configuration is enough to satisfy many use cases.
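The comparison between the index-only estimate and the filtered count can be scripted. This is a sketch, not a definitive test harness: the <comparison> wrapper and its child element names are invented for illustration, and cts:word-query('ontology') is assumed to be equivalent to passing the bare string 'ontology' to cts:search.

```xquery
xquery version "1.0-ml";

let $query := cts:word-query('ontology')
(: index resolution only; no documents are read :)
let $estimate := xdmp:estimate(cts:search(fn:doc(), $query))
(: filtered search; matched documents are examined :)
let $filtered := fn:count(cts:search(fn:doc(), $query))
return
  (: hypothetical wrapper element, for display only :)
  <comparison>
    <estimate>{$estimate}</estimate>
    <filtered-count>{$filtered}</filtered-count>
    <difference>{$estimate - $filtered}</difference>
  </comparison>
```

A difference of zero suggests that the indexes alone resolve the query. Substituting another cts:query constructor for $query applies the same check to more complex searches.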

Unfortunately, this simple query and minimal database configuration are insufficient to support the wide range of search requirements often found in typical search applications. The following sections build on this simple example to illustrate how to analyze more complex queries and the impact of enabling different sets of additional index options on the server's ability to accurately resolve searches without filtering.

Restricting matches to specific elements

Since MarkLogic indexes both content and structure, it's possible to formulate queries that restrict results not only to documents containing specific words, but to documents containing those words within a specific element. Consider a more specific query executed against the same database using the same index configuration. In this case, the requirement is to retrieve documents containing the word "ontology", but only if the word appears in the "keyword" element. This can be accomplished using a query like this:

cts:search(fn:doc(), cts:element-word-query(xs:QName('keyword'), "ontology"))

Like the previous example, this search requests that the server match documents containing the word "ontology", but in addition, the word must appear in an element named <keyword>.

Database configuration:
  • word searches

The final-plan for this query is:

This plan contains two assertions about possible matching documents:

  1. The document contains the word "ontology".
  2. The document contains an element named "keyword".

Note that it does not assert that the word "ontology" appears in the element "keyword". This is clearly not enough information to ensure that a document found during the index resolution phase actually matches the query. This is demonstrated by comparing the estimate (75) with a count of the filtered search results (2). The unfiltered search results contain 73 false positive matches. These false positive matches must be removed during the filtering phase to guarantee accurate results.

Enabling element-based searches

To support resolving queries of this type without resorting to filtering, MarkLogic provides additional index configuration options. The first one to enable is named "fast element word searches". Note that there are tradeoffs to consider for each additional enabled index option. In this case, enabling fast element word searches results in decreased document ingestion performance and larger database size on disk. This is due to the additional index information captured and persisted on disk when documents are inserted into the database.
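The option can be set in the Admin Interface or scripted with the Admin API. The following is a minimal sketch, assuming a database named "plan-demo" (a hypothetical name, not from the original test setup) and a user with sufficient privileges to change the database configuration.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: "plan-demo" is a hypothetical database name; substitute your own :)
let $db-id := xdmp:database("plan-demo")
let $config :=
  admin:database-set-fast-element-word-searches($config, $db-id, fn:true())
return admin:save-configuration($config)
```

Existing documents only benefit from the new setting once they have been reindexed.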

Database configuration:
  • word searches
  • fast element word searches

Once the server has finished reindexing the documents, executing:

xdmp:plan(cts:search(fn:doc(), cts:element-word-query(xs:QName('keyword'), "ontology")))

results in the following final-plan:

This plan now contains two assertions to be tested:

  1. The document contains an element keyword containing the word "ontology" and …
  2. The document contains the word "ontology"

These assertions are now specific enough to match the intent of the cts:query passed to cts:search during the index resolution phase. Note that the value of the estimate has gone from 75 to 2. This matches the number of results actually returned by executing this query in a filtered search and counting the results. The correct set of matching documents can now be determined solely during the index resolution phase of query execution. With fast element word searches enabled, searches of this type can be executed "unfiltered" with high confidence that the results will not contain false positives.
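An unfiltered search is requested by passing the "unfiltered" option as the third argument to cts:search. This sketch assumes the index configuration listed above is in place. Without it, the same call would return the false positives discussed earlier.

```xquery
xquery version "1.0-ml";

(: Skip the filtering phase and trust index resolution alone. :)
cts:search(
  fn:doc(),
  cts:element-word-query(xs:QName('keyword'), 'ontology'),
  'unfiltered'
)
```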

This result demonstrates why xdmp:plan is an essential tool for understanding and optimizing search performance in MarkLogic applications. The insight it provides into the inner workings of MarkLogic's indexing and search capabilities is invaluable in helping application developers deliver the best possible performance to their users.
