Hadoop

Connector for Hadoop

Hadoop is an open-source framework for distributed processing of large data sets across clusters of computers using simple programming models. When used with MarkLogic, Hadoop provides cost-effective batch computation and distributed storage.

The Connector for Hadoop is supported against the Hortonworks Data Platform (HDP) version 2.6, the Cloudera Distribution of Hadoop (CDH) version 5.8, and MapR 5.1. The source is licensed under the commercially friendly Apache 2.0 license and is freely available for inspection or modification.

Downloads

Connector 2.3.1 zip 3.5 MB (SHA1)
Connector 2.3.1 source zip 310 KB (SHA1)


The Connector for Hadoop is a drop-in extension to Hadoop's MapReduce framework that allows you to easily and efficiently communicate with a MarkLogic database from within a MapReduce job. You can use the connector to:

  • Stage raw data in HDFS and prepare, reformat, extract, join, or filter it for use in interactive applications in MarkLogic (see the sketch after this list)
  • Enrich or transform data in situ in MarkLogic using Java and MapReduce, taking advantage of MarkLogic's fast indexes and security model
  • Age data out of a MarkLogic database into archival storage on HDFS or transfer it in parallel to another system
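
To make the staging workflow concrete, here is a minimal sketch of a map-only job that reads raw text already staged in HDFS and writes each record into MarkLogic as a text document. It is an illustration, not the connector's shipped sample code: the com.marklogic.mapreduce classes (ContentOutputFormat, DocumentURI) and the mapreduce.marklogic.output.* property names are assumed from the Connector for Hadoop Developer's Guide, and the host, port, credentials, URI scheme, and class names are placeholders.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    import com.marklogic.mapreduce.ContentOutputFormat;
    import com.marklogic.mapreduce.DocumentURI;

    public class StageFromHdfs {

        // Map-only job: each HDFS input line becomes one MarkLogic text document.
        public static class LineMapper
                extends Mapper<LongWritable, Text, DocumentURI, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Hypothetical URI scheme for the staged records.
                context.write(new DocumentURI("/staged/" + offset.get() + ".txt"), line);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Output connection settings (names per the connector documentation);
            // host, port, and credentials below are placeholders.
            conf.set("mapreduce.marklogic.output.host", "localhost");
            conf.set("mapreduce.marklogic.output.port", "8000");
            conf.set("mapreduce.marklogic.output.username", "admin");
            conf.set("mapreduce.marklogic.output.password", "admin");
            conf.set("mapreduce.marklogic.output.content.type", "TEXT");

            Job job = Job.getInstance(conf, "stage raw data into MarkLogic");
            job.setJarByClass(StageFromHdfs.class);
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0])); // HDFS staging directory
            job.setMapperClass(LineMapper.class);
            job.setNumReduceTasks(0);                             // map-only ingest
            job.setOutputFormatClass(ContentOutputFormat.class);
            job.setOutputKeyClass(DocumentURI.class);
            job.setOutputValueClass(Text.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Run it like any other Hadoop job, with the connector JAR on the job classpath (for example, hadoop jar stage-from-hdfs.jar StageFromHdfs /staging/dir, where the JAR name and path are placeholders). The aging-out use case in the last bullet is the mirror image: a MarkLogic input format on the read side and an HDFS output format on the write side.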

The MarkLogic Connector for Hadoop enables you to run Hadoop MapReduce jobs on data in a MarkLogic Server cluster. You can use the connector to:

  • Leverage existing MapReduce and Java libraries to process MarkLogic data
  • Operate on data as Documents, Nodes, or Values
  • Access MarkLogic text, geospatial, value, and document structure indexes to send only the most relevant data to Hadoop for processing (see the input-side sketch after this list)
  • Send Hadoop Reduce results to multiple MarkLogic forests in parallel
  • Rely on the connector to optimize data access (for both locality and streaming IO) across MarkLogic forests
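
The read side mirrors the staging example above. The sketch below illustrates pulling documents out of MarkLogic for a plain word count: DocumentInputFormat hands each map task (DocumentURI, DatabaseDocument) pairs streamed from the forests, and a document selector lets MarkLogic's indexes narrow the input before anything is sent to Hadoop. Again, this is a hedged sketch rather than official sample code: the mapreduce.marklogic.input.* property names, the DatabaseDocument value type, and the getContentAsText() accessor are assumed from the Developer's Guide, while the collection name, connection details, and output path are placeholders.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    import com.marklogic.mapreduce.DatabaseDocument;
    import com.marklogic.mapreduce.DocumentInputFormat;
    import com.marklogic.mapreduce.DocumentURI;

    public class CountWordsFromMarkLogic {

        // Map tasks receive (URI, document) pairs directly from MarkLogic forests.
        public static class DocMapper
                extends Mapper<DocumentURI, DatabaseDocument, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(DocumentURI uri, DatabaseDocument doc, Context context)
                    throws IOException, InterruptedException {
                // getContentAsText() is assumed per the connector javadoc.
                for (String token : doc.getContentAsText().toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Input connection settings (placeholder host and credentials).
            conf.set("mapreduce.marklogic.input.host", "localhost");
            conf.set("mapreduce.marklogic.input.port", "8000");
            conf.set("mapreduce.marklogic.input.username", "admin");
            conf.set("mapreduce.marklogic.input.password", "admin");
            // Only documents matched by this selector leave MarkLogic;
            // "articles" is a hypothetical collection name.
            conf.set("mapreduce.marklogic.input.documentselector",
                     "fn:collection(\"articles\")");

            Job job = Job.getInstance(conf, "word count over MarkLogic documents");
            job.setJarByClass(CountWordsFromMarkLogic.class);
            job.setInputFormatClass(DocumentInputFormat.class);
            job.setMapperClass(DocMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path(args[0])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because the connector generates input splits per forest, the map tasks read from MarkLogic in parallel, which is the behavior behind the locality and streaming optimization noted in the last bullet above.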

The Connector's drop-in set of Java classes includes:

  • MarkLogic-specific implementations of Hadoop's MapReduce interfaces, such as InputFormat and OutputFormat
  • Sample code for a variety of use cases

HDFS Client Bundles

Previously, using HDFS for forest storage required you to assemble a set of Hadoop HDFS JAR files or install Hadoop on each MarkLogic host containing a forest on HDFS (or to install Hadoop in a well-known location).

You can now download a pre-packaged Hadoop HDFS client bundle and install this bundle on your MarkLogic hosts. A bundle is available for each supported Hadoop distribution. Use of one of these bundles is required if you use HDFS for forest storage.

MarkLogic supports MapR-FS, which is a POSIX file system natively compatible with MarkLogic Server. No pre-packaged client bundle is required for MapR compatibility.

Downloads for MarkLogic 10.0-1

Client Bundle for CDH 5.8

Downloads for MarkLogic 9.0-9

Client Bundle for CDH 5.8

Client Bundle for HDP 2.6

Downloads for MarkLogic 8.0-9

Client Bundle for CDH 5.8

Client Bundle for HDP 2.4

Downloads for MarkLogic 9

Connector 2.2.9 zip 3.5 MB
Connector 2.2.9 source zip 312 KB

Downloads for MarkLogic 8

Connector 2.1.9 zip 1.6 MB
Connector 2.1.9 source zip 250 KB

Documentation

Stack Overflow: Get the most useful answers to questions from the MarkLogic community, or ask your own question.