Extract from Documents Using Optic API

Problem

You have used TDE to extract data from documents into the row index, but there are additional pieces of information you want to return with a particular query.

Solution

Applies to MarkLogic versions 9+

Consider a schema called content with a view called books. The view has columns title, author, and price. The documents also have an element called summary that you decided not to put in the view, as it won’t be used very often. For this query, however, you want to include it in the results.

const op = require('/MarkLogic/optic');

const docId = op.fragmentIdCol('docId');
op.fromView('content', 'books', null, docId)
  .where(op.gt(op.col('price'), 20))
  .joinDoc('doc', docId)
  .select([
    'title', 'author', 'price', 
    op.as('summary', op.xpath(op.col('doc'), '/doc/summary'))
  ])
  .result();

import module namespace op="https://marklogic.com/optic"
     at "/MarkLogic/optic.xqy";

let $doc-id := op:fragment-id-col('docId')
return 
  op:from-view('content', 'books', (), $doc-id)
    => op:where(op:gt(op:col('price'), 20))
    => op:join-doc('doc', $doc-id)
    => op:select((
      'title', 'author', 'price', 
      op:as('summary', op:xpath(op:col('doc'), '/doc/summary/fn:string()'))
    ))
    => op:result()

Required Indexes:

TDE-extracted rows

Discussion

The Optic API is useful for executing relational operations on values; however, sometimes we need to pull in additional data from the source documents. It makes sense to pull scalar values into the row index, where we can do calculations and aggregations on them, but content with substructure, such as XML with markup, doesn’t really benefit. The example presented here is relatively simple, but supplementing the results with content from documents can be done with much more complex queries, involving joins, aggregates, or other operations.

The op.fromView call identifies the schema and view that we’ll draw data from. We also use op.fragmentIdCol to let Optic know that we want to work with the source documents. In this case, I’ve used the name “docId”, but the name isn’t meaningful as we won’t be using it once we get our results.

Before we actually join the documents to our rows, we should filter down to just those rows we need, in order to avoid reading more documents from disk than is necessary. In the recipe above, only books with prices greater than 20 are included.

Having told Optic what data to work with, .joinDoc tells Optic to use the docId to include the document content in a column called doc.

Finally, the select specifies what columns to return in the result set. The as clause tells Optic to make a summary column based on an XPath statement, run against the document associated with the current row.

When using this technique, we’ll see columns from the row index, as well as the specified additional data from the documents, in this case the summary. It’s important to remember that to get the summary, MarkLogic had to load the entire document, then use XPath to select just a part of it. This is similar to any other search where we return just part of search result. To ensure this performs well, use a where clause to reduce the number of documents that need to be retrieved. Consider using paging if you still get a lot of results. If the documents are large, considering adding the additional data to the row index instead of retrieving it separately.

Written Tutorial

Extract content from retrieved documents using the Optic API

Problem

Solution

Discussion

Learn More

Stay on top of everything Marklogic.

This website uses cookies.