Page over Optic API results

Problem

Your Optic API query returns a large result set. You want to get a stable set of results a page at a time.

Solution

Applies to MarkLogic versions 9+

const op = require('/MarkLogic/optic');

const page = parseInt(xdmp.getRequestField('page', '1'), 10);
const pageSize = 1000;

let timestamp = xdmp.getRequestField('timestamp');
if (timestamp === null) {
  timestamp = xdmp.requestTimestamp();
}

const response = { 
  timestamp: timestamp,
  page: page,
  results: xdmp.invokeFunction(
    function() {
      return op.fromTriples([...])
        .offsetLimit(op.param('offset'), pageSize)
        .result(null, {offset: pageSize * (page - 1)});
    },
    { timestamp: timestamp }
  )
}

response

Required Privileges:

https://marklogic.com/xdmp/privileges/xdmp-timestamp

Discussion

Sometimes a result set is bigger than you want to return in a single request. Paging solves this by having the caller request successive pages until every result has been returned, so no individual response is too big. One challenge with paging is that the underlying data may change between requests, with the result that a row might be skipped or repeated. This recipe works through a large set of triples by calling op.fromTriples, but the same principles apply when calling op.fromLexicons, op.fromLiterals, or op.fromView.

This recipe avoids skipped or repeated rows by using point-in-time queries. If you aren’t familiar with how timestamps are managed in MarkLogic, read over Understanding Point-In-Time Queries in the Application Developer’s Guide.

By using point-in-time queries, we can ask for a batch of results in one request, process them, then ask for the next batch, knowing that the list will not change in between. This recipe is intended to be used as a main module, so the caller can specify both the page and the timestamp. The first request does not send a timestamp; the response reports the timestamp at which the query was run. Subsequent requests include that timestamp to ensure stable results.
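From the caller's side, that exchange might look like the following sketch. The host, port, and module path are hypothetical, and the timestamp value is purely illustrative; only the page and timestamp request fields come from the recipe. The timestamp is kept as a string so it is passed back exactly as received.

```javascript
// Build the URL for one page request to the recipe's main module.
// Assumption: the module is deployed at a hypothetical path on an HTTP app server.
function buildPageUrl(moduleUrl, page, timestamp) {
  const url = new URL(moduleUrl);
  url.searchParams.set('page', String(page));
  // The first request omits the timestamp, so the module falls back to
  // xdmp.requestTimestamp() on the server.
  if (timestamp !== undefined && timestamp !== null) {
    url.searchParams.set('timestamp', String(timestamp));
  }
  return url.toString();
}

const moduleUrl = 'http://localhost:8010/page-triples.sjs'; // hypothetical location

// First request: page 1, no timestamp.
const first = buildPageUrl(moduleUrl, 1);

// Later requests: reuse the timestamp reported by the first response
// (the value here is made up for illustration).
const second = buildPageUrl(moduleUrl, 2, '15712345678901234');
```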

Note that the REST API provides its own ways to manage timestamps. For example, take a look at the POST /v1/rows endpoint, paying attention to the timestamp parameter and the ML-Effective-Timestamp header.
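As a rough sketch of the REST variant, a client could read the effective timestamp from the first response and pin later requests to it. The helper names and base URL below are hypothetical; only the /v1/rows path, the timestamp parameter, and the ML-Effective-Timestamp header are taken from the text above. How the plan and page bindings are sent is left out here; consult the REST API reference for that.

```javascript
// Case-insensitive lookup in a plain object of response headers.
function getHeader(headers, name) {
  const wanted = name.toLowerCase();
  for (const [key, value] of Object.entries(headers)) {
    if (key.toLowerCase() === wanted) return value;
  }
  return null;
}

// Build the /v1/rows URL for a follow-up request, pinned to the timestamp
// that the first response reported. If the header is absent, no timestamp
// parameter is added.
function rowsUrlAt(baseUrl, firstResponseHeaders) {
  const url = new URL('/v1/rows', baseUrl);
  const ts = getHeader(firstResponseHeaders, 'ML-Effective-Timestamp');
  if (ts !== null) {
    url.searchParams.set('timestamp', ts);
  }
  return url.toString();
}
```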

As with any point-in-time query, one caveat is that the caller should finish before MarkLogic’s merge timestamp catches up to the request timestamp. In practice, this is unlikely to be a problem; if it becomes one, you may need to take control of the merge timestamp to ensure the results remain available.

The offsetLimit call refers to op.param('offset') rather than to a literal offset. The offset value could have been written in place; however, parameterizing it allows MarkLogic to cache and reuse the query. MarkLogic analyzes the query and builds up a plan. Because the offset is a parameter, the same plan can be reused for every page, enabling faster execution.
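To illustrate, the only thing that changes between pages is the value bound to offset. The helper below is a hypothetical sketch of computing that binding from the raw page request field. The guard for missing or malformed input is an addition for illustration and is not part of the recipe.

```javascript
const PAGE_SIZE = 1000; // same page size as the recipe

// Turn the raw 'page' field into the bindings object passed to result().
// Hypothetical helper: the recipe computes the offset inline instead.
function bindingsForPage(pageField, pageSize = PAGE_SIZE) {
  const page = parseInt(pageField, 10);
  // Fall back to page 1 when the field is missing, non-numeric, or below 1.
  const safePage = Number.isInteger(page) && page >= 1 ? page : 1;
  return { offset: pageSize * (safePage - 1) };
}
```

In the recipe, an object like this would take the place of the literal bindings in the second argument to result().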

The caller needs to detect the end of the results by watching for an empty result set. While some MarkLogic searches provide an estimate of the total number of results, estimating rows is harder than estimating search results because the pipeline of operations can produce more or fewer output rows than input rows. An estimate would not be an exact count in any case, so iterating until a page comes back empty would still be necessary.
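A minimal sketch of that loop from the caller's side follows. It assumes a hypothetical fetchPage(page, timestamp) function, supplied by the caller, that performs the request and resolves to the recipe's response object with results as an array.

```javascript
// Collect every row by requesting pages until one comes back empty.
// fetchPage is a hypothetical caller-supplied async function; it should
// resolve to { timestamp, page, results }, where results is an array.
async function collectAllRows(fetchPage) {
  const rows = [];
  let timestamp = null; // not sent with the first request
  for (let page = 1; ; page++) {
    const response = await fetchPage(page, timestamp);
    // Reuse the timestamp from the first response for every later page.
    if (timestamp === null) {
      timestamp = response.timestamp;
    }
    // An empty page signals that all results have been returned.
    if (response.results.length === 0) {
      break;
    }
    rows.push(...response.results);
  }
  return rows;
}
```

Accumulating every row in memory is only for illustration; a real caller would more likely process each page as it arrives.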
