
Building a Search and Export App with the Data Movement SDK

Ashutosh Agarwal
Last updated December 20, 2018

Introduction to DMSDK

The Data Movement SDK (DMSDK) is a set of Java classes, part of the Java Client API in MarkLogic 9+, used for loading and transforming large numbers of documents. The DMSDK is asynchronous and efficiently distributes a (generally long-running) job across a MarkLogic cluster. In addition to reading, transforming, writing, and deleting documents, the DMSDK supports any input source supported by Java, including streams and files, as well as transformations written in XQuery or JavaScript.

While MLCP and the DMSDK are both Java-based tools that can read documents and do transformations, MLCP is a relatively simple command-line tool, designed for bulk loading. The DMSDK, on the other hand, is a development kit for Java developers who want the capability to create highly customized load and transformation jobs, processing data such as Java message queues, a real-time Twitter pipeline, or a workflow where documents are periodically dropped into a directory.

Figure 1: A Java application using DMSDK distributes jobs across a MarkLogic cluster

Following the diagram in Figure 1, a batcher acts as a job controller, encapsulating the characteristics of a job (e.g., threads, batch size, listeners) and controlling the workflow. The sub-interface of the batcher determines the kind of workflow, such as reading or writing.

There are two kinds of batcher jobs:


  1. A write job sends batches of documents to MarkLogic for insertion into a database. You can insert both content and metadata.
  2. A query job creates batches of URIs and dispatches each batch to listeners. The batcher gets URIs either by identifying documents that match a query or from a list of URIs you provide as an Iterator. The action applied to a batch of URIs is dependent on the listener. For example, a listener might read the documents specified by the batch from the database and then export them to the filesystem.
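Both kinds of jobs are created from a `DataMovementManager`. Here is a minimal sketch of creating each batcher type; the host, port, credentials, and collection name are assumptions, so adjust them to your environment:

```java
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.QueryBatcher;
import com.marklogic.client.datamovement.WriteBatcher;
import com.marklogic.client.query.StructuredQueryBuilder;

public class BatcherKinds {
    // Hypothetical collection name used by the query job below.
    static final String COLLECTION = "example";

    public static void main(String[] args) {
        // Connection details are assumptions; adjust to your environment.
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000,
            new DatabaseClientFactory.DigestAuthContext("demo-user", "demo-password"));
        DataMovementManager dmm = client.newDataMovementManager();

        // 1. A write job: batches documents and sends them to MarkLogic for insertion.
        WriteBatcher writeBatcher = dmm.newWriteBatcher();

        // 2. A query job: batches URIs of documents matching a query and
        //    dispatches each batch to listeners (e.g., an ExportListener).
        QueryBatcher queryBatcher = dmm.newQueryBatcher(
            new StructuredQueryBuilder().collection(COLLECTION));

        client.release();
    }
}
```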

Basic DMSDK Classes

Documentation of the classes below is available in the MarkLogic Java Client API.

When using DMSDK, there are a few basic classes used in almost every application to create the required objects and start the job. Let’s review the typical steps below.

  1. Create a database client connection (as in any MarkLogic Java Client API code).
  2. After you have created the client connection, use the DataMovementManager class as the primary DMSDK job-control interface. This object is intended to be long-lived and can manage multiple jobs.
  3. Now create a batcher. The type of batcher you create determines the basic job flow.
  4. For either a write batcher or a query batcher:

    Configure job characteristics such as batch size and thread count, which can be done using batcher.withBatchSize(<count>) and batcher.withThreadCount(<count>).

    Attach one or more listeners to interesting job events. The available events depend on the type of job. Using listeners in a job is shown in the more detailed examples below; follow the comments in the example code surrounded by asterisks (***).

  5. To submit the DMSDK job, use the startJob method.
  6. Once the job has started, it runs asynchronously; startJob is a non-blocking operation.

    Stop the job when you no longer need it; otherwise it will run indefinitely. A graceful shutdown of a job includes waiting for in-progress batches to complete.
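Putting the steps above together, a minimal write job might look like the following sketch. The connection details, URIs, and document contents are assumptions for illustration only:

```java
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.JobTicket;
import com.marklogic.client.datamovement.WriteBatcher;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.StringHandle;

public class MinimalWriteJob {
    // Hypothetical URI scheme for the demo documents.
    static String uriFor(int i) {
        return "/dmsdk-demo/doc-" + i + ".json";
    }

    public static void main(String[] args) {
        // *** 1. Create a database client connection ***
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000,
            new DatabaseClientFactory.DigestAuthContext("demo-user", "demo-password"));

        // *** 2. Create the long-lived DataMovementManager ***
        DataMovementManager dmm = client.newDataMovementManager();

        // *** 3./4. Create a write batcher, configure it, and attach listeners ***
        WriteBatcher batcher = dmm.newWriteBatcher()
            .withBatchSize(100)
            .withThreadCount(4)
            .onBatchSuccess(batch ->
                System.out.println("Wrote " + batch.getJobWritesSoFar() + " documents so far"))
            .onBatchFailure((batch, throwable) -> throwable.printStackTrace());

        // *** 5. Start the job (returns a ticket used to stop it later) ***
        JobTicket ticket = dmm.startJob(batcher);

        // *** 6. The job runs asynchronously; add() does not block ***
        for (int i = 0; i < 1000; i++) {
            batcher.add(uriFor(i),
                new StringHandle("{\"n\": " + i + "}").withFormat(Format.JSON));
        }

        // Flush any partial batch, wait for in-progress batches, then stop gracefully.
        batcher.flushAndWait();
        dmm.stopJob(ticket);
        client.release();
    }
}
```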

Loading and Transforming Documents with DMSDK

The following example has been derived from our Data Integration course; however, our focus here is to provide an example of a DMSDK job for transforming and loading documents.

Suppose you want to create a set of documents that use the envelope pattern, modeling application-specific canonical data about the entity while preserving the original source data as-is for compliance reasons. For example:

Figure 2: Example of envelope pattern

Load your server-side transformation code (e.g., envelope.xqy, as below) into your modules database.

Figure 3: Example of server-side transformation code, e.g. envelope.xqy

For example, if you use the REST endpoint /v1/config/transforms/<module-name> to load envelope.xqy, your curl loading script may look like this:

curl --anyauth --user admin:admin -X PUT -d@"<location of envelope.xqy>" -i -H "Content-type: application/xquery" http://localhost:<port>/v1/config/transforms/envelope

Create a Maven project that uses the Java Client API. The pom.xml file would have a dependency corresponding to marklogic-client-api (version 4.x).

Create the following two classes in any package you want (the examples here use com.ml.mlu):

  1. Utils.java – reads the example properties file and makes a database client connection (see Figure 5)
  2. LoadAndTransform.java – runs a DMSDK job to load and transform documents (see Figure 6)

These two classes use this example.properties file:

Figure 4: Sample example.properties

Figure 5: Example Utils.java

Note that comments surrounded by asterisks in the code sample below reference the steps in the Basic DMSDK Classes section.

Figure 6: Example LoadAndTransform.java
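The key DMSDK-specific part of such a job is applying the installed envelope transform to each batch via ServerTransform. A minimal sketch follows; the connection details and the input-data directory are assumptions, and only the transform name `envelope` comes from the curl command above:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.JobTicket;
import com.marklogic.client.datamovement.WriteBatcher;
import com.marklogic.client.document.ServerTransform;
import com.marklogic.client.io.FileHandle;

public class LoadAndTransformSketch {
    // Name of the transform installed at /v1/config/transforms/envelope.
    static final String TRANSFORM_NAME = "envelope";

    public static void main(String[] args) throws IOException {
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000,
            new DatabaseClientFactory.DigestAuthContext("demo-user", "demo-password"));
        DataMovementManager dmm = client.newDataMovementManager();

        // Apply the server-side envelope transform to every document as it is written.
        WriteBatcher batcher = dmm.newWriteBatcher()
            .withBatchSize(50)
            .withThreadCount(4)
            .withTransform(new ServerTransform(TRANSFORM_NAME))
            .onBatchFailure((batch, throwable) -> throwable.printStackTrace());

        JobTicket ticket = dmm.startJob(batcher);

        // Load every file from a (hypothetical) local input directory.
        try (Stream<Path> files = Files.list(Paths.get("input-data"))) {
            files.forEach(p ->
                batcher.add("/envelope-demo/" + p.getFileName(), new FileHandle(p.toFile())));
        }

        batcher.flushAndWait();
        dmm.stopJob(ticket);
        client.release();
    }
}
```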

Exporting Documents to Filesystem Based on a String Query

In the example in Figure 7, you will see how to extract documents from a database based on a string query and save them on the filesystem in their native format. Remember, comments surrounded by asterisks in the code sample below reference the steps in the Basic DMSDK Classes section.

The example code above uses QueryBatcher and ExportListener to read documents from MarkLogic and save them to the filesystem. The job uses a string query to select the documents to be exported.
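The shape of such a job can be sketched as follows; the search term, output directory, and connection details are assumptions for illustration:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Paths;

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.ExportListener;
import com.marklogic.client.datamovement.JobTicket;
import com.marklogic.client.datamovement.QueryBatcher;
import com.marklogic.client.io.InputStreamHandle;
import com.marklogic.client.query.StringQueryDefinition;

public class ExportBySearch {
    // Derives a file name from a document URI, e.g. "/foo/bar.xml" -> "bar.xml".
    static String fileNameFromUri(String uri) {
        return Paths.get(uri).getFileName().toString();
    }

    public static void main(String[] args) {
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000,
            new DatabaseClientFactory.DigestAuthContext("demo-user", "demo-password"));
        DataMovementManager dmm = client.newDataMovementManager();

        // Select documents with a string query (hypothetical search term).
        StringQueryDefinition query = client.newQueryManager().newStringDefinition();
        query.setCriteria("marklogic");

        QueryBatcher batcher = dmm.newQueryBatcher(query)
            .withBatchSize(100)
            .withThreadCount(4)
            // ExportListener reads each batch of matched documents; the lambda
            // saves each document to the filesystem in its native format.
            .onUrisReady(new ExportListener().onDocumentReady(doc -> {
                try (InputStream in = doc.getContent(new InputStreamHandle()).get();
                     FileOutputStream out =
                         new FileOutputStream("export/" + fileNameFromUri(doc.getUri()))) {
                    in.transferTo(out);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }))
            .onQueryFailure(Throwable::printStackTrace);

        JobTicket ticket = dmm.startJob(batcher);
        batcher.awaitCompletion();
        dmm.stopJob(ticket);
        client.release();
    }
}
```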

If sending the contents of each document as-is to the writer does not meet the needs of your application, you can register an output listener to prepare custom input for the writer. Use the onGenerateOutput method of the ExportToWriterListener class to register such a listener. Each fetched document (and its metadata) is made available to the onGenerateOutput listeners as a DocumentRecord.

In the example in Figure 8, you will see how to create an ExportToWriterListener configured to fetch documents and collection metadata. The onGenerateOutput listener generates a comma-separated string containing the document URI, first collection name, and the document content.

ExportToWriterListener.withRecordSuffix is used to emit a newline after each document is processed. The end result is a three-column CSV file.

Figure 8: Example code using ExportToWriterListener and onGenerateOutput
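The approach in Figure 8 can be sketched as follows; the query criteria, output file path, and connection details are assumptions for illustration:

```java
import java.io.FileWriter;
import java.io.IOException;

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.ExportToWriterListener;
import com.marklogic.client.datamovement.JobTicket;
import com.marklogic.client.datamovement.QueryBatcher;
import com.marklogic.client.document.DocumentManager.Metadata;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.io.StringHandle;
import com.marklogic.client.query.StringQueryDefinition;

public class ExportToCsv {
    // Builds one three-column CSV row: URI, first collection, document content.
    static String csvRow(String uri, String collection, String content) {
        return uri + "," + collection + "," + content;
    }

    public static void main(String[] args) throws IOException {
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000,
            new DatabaseClientFactory.DigestAuthContext("demo-user", "demo-password"));
        DataMovementManager dmm = client.newDataMovementManager();

        StringQueryDefinition query = client.newQueryManager().newStringDefinition();
        query.setCriteria("marklogic");

        try (FileWriter writer = new FileWriter("export.csv")) {
            ExportToWriterListener listener = new ExportToWriterListener(writer)
                .withRecordSuffix("\n")                     // newline after each document
                .withMetadataCategory(Metadata.COLLECTIONS) // fetch collection metadata too
                .onGenerateOutput(record -> {
                    String uri = record.getUri();
                    String collection = record.getMetadata(new DocumentMetadataHandle())
                        .getCollections().iterator().next();
                    String content = record.getContent(new StringHandle()).get();
                    return csvRow(uri, collection, content);
                });

            QueryBatcher batcher = dmm.newQueryBatcher(query)
                .withBatchSize(100)
                .withThreadCount(4)
                .onUrisReady(listener);

            JobTicket ticket = dmm.startJob(batcher);
            batcher.awaitCompletion();
            dmm.stopJob(ticket);
        }
        client.release();
    }
}
```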

Benefits of Data Movement SDK

These were just a few starting examples of what you can do with the DMSDK: moving data into, out of, and even within a MarkLogic cluster.

Java is the most widely used language in the enterprise. It has a mature set of tools and an immense ecosystem. MarkLogic provides Java APIs to make working with MarkLogic in a Java environment simple, fast, and secure. The DMSDK complements the Java Client API by adding an asynchronous interface for reading, writing, and transforming data in a MarkLogic cluster.

Integrating data from multiple systems starts with getting data from those systems. The DMSDK allows you to do this easily, efficiently, and predictably in a programmatic fashion so that it can be integrated into your overall architecture.

Additional Resources

Stack Overflow: Get the most useful answers to questions from the MarkLogic community, or ask your own question.
