Connector for Hadoop

Hadoop is an open-source framework for distributed processing of large data sets across clusters of computers using simple programming models. When used with MarkLogic, Hadoop provides cost-effective batch computation and distributed storage.

The Connector for Hadoop is supported against the Hortonworks Data Platform (HDP) version 2.4, the Cloudera Distribution of Hadoop (CDH) version 5.8, and MapR 5.1. The source is licensed under the commercially friendly Apache 2.0 license and is freely available for inspection or modification.

Downloads

Connector 2.1.6 zip 1.5 MB
Connector 2.1.6 source zip 188 KB


The Connector for Hadoop is a drop-in extension to Hadoop's MapReduce framework that allows you to easily and efficiently communicate with a MarkLogic database from within a MapReduce job. You can use the connector to:

  • Stage raw data in HDFS and prepare, reformat, extract, join, or filter for use in interactive applications in MarkLogic
  • Enrich or transform data in situ in MarkLogic using Java and MapReduce, taking advantage of MarkLogic's fast indexes and security model
  • Age data out of a MarkLogic database into archival storage on HDFS or transfer it in parallel to another system

The MarkLogic Connector for Hadoop enables you to run Hadoop MapReduce jobs on data in a MarkLogic Server cluster. You can use the connector to:

  • Leverage existing MapReduce and Java libraries to process MarkLogic data
  • Operate on data as Documents, Nodes, or Values
  • Access MarkLogic text, geospatial, value, and document structure indexes to send only the most relevant data to Hadoop for processing
  • Send Hadoop Reduce results to multiple MarkLogic forests in parallel
  • Rely on the connector to optimize data access (for both locality and streaming IO) across MarkLogic forests
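As an illustration of how a MapReduce job is pointed at MarkLogic, the connector is driven by `mapreduce.marklogic.*` job configuration properties, typically supplied in a Hadoop configuration file passed to the job. The sketch below shows a minimal input/output configuration; the host names, port, and credentials are placeholders, and the full property set is described in the connector's developer documentation.

```xml
<configuration>
  <!-- Source MarkLogic instance the map phase reads from (placeholder values) -->
  <property>
    <name>mapreduce.marklogic.input.host</name>
    <value>ml-host.example.com</value>
  </property>
  <property>
    <name>mapreduce.marklogic.input.port</name>
    <value>8000</value>
  </property>
  <property>
    <name>mapreduce.marklogic.input.username</name>
    <value>hadoop-reader</value>
  </property>
  <property>
    <name>mapreduce.marklogic.input.password</name>
    <value>changeme</value>
  </property>

  <!-- Destination MarkLogic instance the reduce phase writes to -->
  <property>
    <name>mapreduce.marklogic.output.host</name>
    <value>ml-host.example.com</value>
  </property>
  <property>
    <name>mapreduce.marklogic.output.port</name>
    <value>8000</value>
  </property>
  <property>
    <name>mapreduce.marklogic.output.username</name>
    <value>hadoop-writer</value>
  </property>
  <property>
    <name>mapreduce.marklogic.output.password</name>
    <value>changeme</value>
  </property>
</configuration>
```

A job then selects the connector's MarkLogic-specific input and output format classes in its driver; the connector uses these properties to split input across forests and to distribute output in parallel.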


The Connector's drop-in set of Java classes includes:

  • MarkLogic-specific implementations of Hadoop's InputFormat and OutputFormat interfaces
  • Sample code for a variety of use cases

HDFS Client Bundles

Previously, using HDFS for forest storage required you either to assemble a set of Hadoop HDFS JAR files yourself or to install Hadoop on each MarkLogic host containing a forest on HDFS (or in a well-known location).

You can now download a pre-packaged Hadoop HDFS client bundle and install this bundle on your MarkLogic hosts. A bundle is available for each supported Hadoop distribution. Use of one of these bundles is required if you use HDFS for forest storage.
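Once a client bundle is installed, a forest's data directory can refer to HDFS directly with an `hdfs://` URI. The namenode host, port, and path below are placeholders for illustration only:

```
hdfs://namenode.example.com:8020/marklogic/forests/Documents-1
```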

Downloads for MarkLogic 8.0-6

Client Bundle for CDH 5.8

Client Bundle for HDP 2.4

Downloads for MarkLogic 8.0-1 and 8.0-2

Connector 2.1-2 zip for Cloudera Distribution of Hadoop 4.3 1.9 MB
Connector 2.1-2 source zip 167 KB

Downloads for MarkLogic 7

Connector 2.0-5.4 zip for Cloudera Distribution of Hadoop 4.3 2.0 MB
Connector 2.0-5.4 source zip 168 KB

Documentation

    Comments

    • Does this connector work well with Apache Hadoop 2.7.0? I don't see vanilla Hadoop in the supported list.
    • Does this connector work only with CDH 5.4, or with 5.4 and later?
    • When will HDP 2.0 be certified?