Ingest Aggregate JSON File with Docs

Problem

You need to ingest a large JSON file that contains an array of objects. Each object should be split into its own document in MarkLogic.

Solution

Applies to MarkLogic versions 8+

JavaScript:

declareUpdate();

// Insert URL of zipped JSON file
let url = "REPLACE WITH URL";
// Insert the name of the JSON file within the .zip
let zipFile = "REPLACE WITH FILENAME";
let zip = xdmp.documentGet(url);

let idx = 0;
for (let rec of fn.head(xdmp.zipGet(zip, zipFile)).xpath("./results")) {
  xdmp.documentInsert("/content/rec-" + idx++ + ".json", rec);
}

XQuery:

(: Insert URL of zipped JSON file :)
let $url := "REPLACE WITH URL"
(: Insert the name of the JSON file within the .zip :)
let $zip-file := "REPLACE WITH FILENAME"
let $zip := xdmp:document-get($url)
for $rec at $idx in xdmp:zip-get($zip, $zip-file)/results
return xdmp:document-insert("/content/rec-" || $idx || ".json", $rec)

Discussion

The basic unit of storage in MarkLogic is a document. When we have an input document that describes many entities, we’ll want to split those into one document per entity. It’s not unusual to find aggregate documents in this form.

MarkLogic Content Pump (MLCP) provides ways to split two types of aggregate documents: XML documents, where each child of the root element becomes its own document, and line-delimited JSON documents, in which each line is a separate JSON object. If your input looks like either of those, your best bet is to use MLCP. This recipe covers a different case: one large JSON document that contains an array, where you want to make each item in the array into a document in MarkLogic.
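To make the difference between the two JSON shapes concrete, here is a minimal sketch in plain JavaScript (not MarkLogic or MLCP API) that rewrites an aggregate document in line-delimited form. The function name toLineDelimited and its arrayKey parameter are hypothetical, and the sketch assumes the whole document fits in memory.

```javascript
// Hypothetical helper: convert an aggregate JSON document into
// line-delimited JSON, with one array item per line.
// Assumes the whole document fits in memory.
function toLineDelimited(aggregateText, arrayKey) {
  const aggregate = JSON.parse(aggregateText);
  const items = aggregate[arrayKey];
  if (!Array.isArray(items)) {
    throw new TypeError('Property "' + arrayKey + '" is not an array');
  }
  // JSON.stringify without an indent argument never emits raw newlines,
  // so each item stays on a single line.
  return items.map((item) => JSON.stringify(item)).join("\n");
}

// Tiny stand-in for the real file: an object wrapping an array.
const sample = '{"meta": {"total": 2}, "results": [{"id": 1}, {"id": 2}]}';
console.log(toLineDelimited(sample, "results"));
```

The aggregate form wraps every record in a single top-level object, while the line-delimited form has no wrapper at all, which is why a line-oriented splitter cannot handle the aggregate form directly.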

The sample .zip file I worked with contains a large JSON file that starts off like this:

{
  "meta": {
    "last_updated": "2017-08-29",
    "terms": "https://open.fda.gov/terms/",
    "results": {
      "skip": 0,
      "total": 253355,
      "limit": 100000
    },
    "license": "https://open.fda.gov/license/",
    "disclaimer": "Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data is accurate, you should assume all results are unvalidated. We may limit or otherwise restrict your access to the API in line with our Terms of Service."
  },
  "results": [
    {

Note that xdmp:document-get will accept a URL that starts with https:// or file://—that is, it will pull a file down from the web or read one from the local filesystem.

In either case, the xdmp:zip-get($zip, $zip-file)/results expression reads the JSON document from the zip file, then applies an XPath expression to select the results property. The results property holds an array, each item of which is an object that we want to store in a separate document. We loop on the sequence of results, inserting each as a new document.
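The same split can be sketched in plain JavaScript, outside MarkLogic, to show which part of the document is selected and how the URIs are built. This is only an illustration under stated assumptions: planInserts is a hypothetical name, and the sample object is a cut-down stand-in for the real file.

```javascript
// Plain-JavaScript illustration of the split; not MarkLogic API.
// planInserts is a hypothetical name used only for this sketch.
function planInserts(aggregate, uriPrefix) {
  // Only the top-level "results" array is used. The nested
  // meta.results object is a different property and is ignored.
  return aggregate.results.map((rec, idx) => ({
    uri: uriPrefix + idx + ".json", // zero-based, like the JavaScript solution
    content: rec
  }));
}

// Cut-down stand-in for the sample file shown earlier.
const aggregate = {
  meta: { results: { skip: 0, total: 2, limit: 100000 } },
  results: [{ id: "a" }, { id: "b" }]
};

for (const doc of planInserts(aggregate, "/content/rec-")) {
  console.log(doc.uri, JSON.stringify(doc.content));
}
```

One difference between the two solutions is worth knowing: the XQuery positional variable ($idx) starts at 1, while the JavaScript counter starts at 0, so the two versions number their URIs differently.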

Be aware that if your JSON document is very large, you may get a timeout from attempting to insert all the documents in a single transaction. You can try increasing the timeout on the request (xdmp.setRequestTimeLimit / xdmp:set-request-time-limit). If that doesn’t work, the best approach is likely to do the orchestration (splitting the JSON document and inserting the results) from outside MarkLogic, using the Data Movement SDK.
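The idea behind the external approach is to write the records in batches, so that no single request has to insert everything before the time limit expires. Here is a minimal sketch of just the batching step in plain JavaScript. The helper name toBatches is hypothetical and is not part of DMSDK or any MarkLogic API; the actual write call is left out because it depends on the client you use.

```javascript
// Hypothetical helper: split records into fixed-size batches so that an
// external program could write each batch in its own request.
function toBatches(records, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let start = 0; start < records.length; start += batchSize) {
    // slice() stops at the end of the array, so the last batch may be short.
    batches.push(records.slice(start, start + batchSize));
  }
  return batches;
}

// Seven stand-in records split into batches of three.
const records = Array.from({ length: 7 }, (_, i) => ({ id: i }));
const batches = toBatches(records, 3);
console.log(batches.map((batch) => batch.length));
```

Smaller batches keep each request short at the cost of more round trips; the right size depends on document size and cluster capacity.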

An alternative approach is to use the code above to split the JSON, but perform the document inserts in separate transactions using xdmp.spawn. This may seem appealing, but it comes with risks. The spawned tasks are placed on the Task Server queue to be executed asynchronously. If MarkLogic were to go down (accidentally or due to a deliberate restart), any tasks remaining on the queue would be lost. Worse, it would be difficult to determine whether the tasks had finished, except by checking for the presence of the expected documents.

If the same process is managed outside of MarkLogic and MarkLogic goes down, the external program can report that fact. (And if the external program goes down, it will be clear that an error occurred and that not all documents were processed.) Perhaps the biggest advantage of working externally is that xdmp.spawn can only put tasks on the queue of the host that it’s running on; in a cluster, it can’t take advantage of the other hosts’ capacity. DMSDK makes it much simpler to spread work across a cluster.
