Connector for Hadoop

Hadoop is an open-source framework for distributed processing of large data sets across clusters of computers using simple programming models. When used with MarkLogic, Hadoop provides cost-effective batch computation and distributed storage.

The Connector for Hadoop is supported with the Hortonworks Data Platform (HDP) version 2.6, the Cloudera Distribution of Hadoop (CDH) version 5.8, and MapR 5.1. The source is licensed under the commercial-friendly Apache 2.0 license and is freely available for inspection or modification.


Connector zip 3.4 MB
Connector source zip 285 KB



The Connector for Hadoop is a drop-in extension to Hadoop's MapReduce framework that allows you to easily and efficiently communicate with a MarkLogic database from within a MapReduce job. You can use the connector to:

  • Stage raw data in HDFS and prepare, reformat, extract, join, or filter for use in interactive applications in MarkLogic
  • Enrich or transform data in situ in MarkLogic using Java and MapReduce, taking advantage of MarkLogic's fast indexes and security model
  • Age data out of a MarkLogic database into archival storage on HDFS or transfer it in parallel to another system
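A connector job gets its MarkLogic connection details through standard Hadoop configuration properties. The sketch below shows the general shape of such a configuration, using the connector's documented `mapreduce.marklogic.*` property namespace; the host, port, and credential values are placeholders, and the exact property set can vary by connector version:

```
# Example connector job properties (values are placeholders).
# Typically supplied to the job driver via a -conf XML file or -D flags.

# Source database connection
mapreduce.marklogic.input.host=ml-host.example.com
mapreduce.marklogic.input.port=8000
mapreduce.marklogic.input.username=admin
mapreduce.marklogic.input.password=changeme

# Destination database connection
mapreduce.marklogic.output.host=ml-host.example.com
mapreduce.marklogic.output.port=8000
mapreduce.marklogic.output.username=admin
mapreduce.marklogic.output.password=changeme
```

Because these are ordinary Hadoop configuration properties, they compose with the rest of a job's settings and can be kept in a separate file per environment.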

The MarkLogic Connector for Hadoop enables you to run Hadoop MapReduce jobs on data in a MarkLogic Server cluster. You can use the connector to:

  • Leverage existing MapReduce and Java libraries to process MarkLogic data
  • Operate on data as Documents, Nodes, or Values
  • Access MarkLogic text, geospatial, value, and document structure indexes to send only the most relevant data to Hadoop for processing
  • Send Hadoop Reduce results to multiple MarkLogic forests in parallel
  • Rely on the connector to optimize data access (for both locality and streaming IO) across MarkLogic forests
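As a concrete illustration of the points above, here is a sketch of a minimal job driver that reads documents from one MarkLogic database and writes them to another. The class names (`DocumentInputFormat`, `ContentOutputFormat`, `DocumentURI`, `MarkLogicNode`) follow the connector's `com.marklogic.mapreduce` package as documented; treat the details as assumptions that may vary by connector version, and note the connector JARs must be on the classpath for this to compile:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.marklogic.mapreduce.ContentOutputFormat;
import com.marklogic.mapreduce.DocumentInputFormat;
import com.marklogic.mapreduce.DocumentURI;
import com.marklogic.mapreduce.MarkLogicNode;

public class CopyDocuments {
    public static void main(String[] args) throws Exception {
        // Connection details (mapreduce.marklogic.input.host, ...port,
        // ...username, and the output.* equivalents) are normally supplied
        // in a -conf XML file rather than hard-coded here.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "copy documents");
        job.setJarByClass(CopyDocuments.class);

        // Read each document from the source database.
        job.setInputFormatClass(DocumentInputFormat.class);

        // Write results straight back to MarkLogic as document content,
        // keyed by document URI.
        job.setOutputFormatClass(ContentOutputFormat.class);
        job.setOutputKeyClass(DocumentURI.class);
        job.setOutputValueClass(MarkLogicNode.class);

        // With the default identity mapper and no reduce phase, the job
        // simply copies documents between the configured databases.
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In a real job you would replace the identity mapper with your own `Mapper` subclass to transform, enrich, or filter documents in flight.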

The Connector's drop-in set of Java classes includes:

  • MarkLogic-specific implementations of the Hadoop MapReduce interfaces, such as InputFormat and OutputFormat
  • Sample code for a variety of use cases

HDFS Client Bundles

Previously, using HDFS for forest storage required you to assemble a set of Hadoop HDFS JAR files or install Hadoop on each MarkLogic host containing a forest on HDFS (or to install Hadoop in a well-known location).

You can now download a pre-packaged Hadoop HDFS client bundle and install this bundle on your MarkLogic hosts. A bundle is available for each supported Hadoop distribution. Use of one of these bundles is required if you use HDFS for forest storage.

MarkLogic supports MapR-FS, which is a POSIX file system natively compatible with MarkLogic Server. No pre-packaged client bundle is required for MapR compatibility.

Downloads for MarkLogic 9.0-8

Client Bundle for CDH 5.8

Client Bundle for HDP 2.6

Downloads for MarkLogic 8.0-9

Client Bundle for CDH 5.8

Client Bundle for HDP 2.4

Downloads for MarkLogic 8

Connector 2.1.9 zip 1.6 MB
Connector 2.1.9 source zip 250 KB


    Stack Overflow: Get the most useful answers to questions from the MarkLogic community, or ask your own question.
