Hadoop
The Hortonworks Data Platform (HDP) is an open-source data management platform based on Apache Hadoop. MarkLogic has partnered with Hortonworks to combine Hadoop’s massive storage and computing capabilities with MarkLogic’s real-time indexing and security into a unified platform.
Connector for Hortonworks Data Platform
| Release 1.0-2 zip package | 7.6 MB | |
|---|---|---|

With the Connector for HDP, you can:
- Prepare, reformat, extract, join, or filter raw data for use in interactive applications in MarkLogic using MapReduce
- Enrich or transform data in situ in MarkLogic using MapReduce, taking advantage of MarkLogic’s fast indexes and security model
- Age data out of a MarkLogic database into archival storage on HDFS or transfer it in parallel to another system with MapReduce
The Connector for HDP includes:
- Hortonworks Data Platform, an open-source data management platform based on Apache Hadoop
- MarkLogic Connector for Hadoop, an open-source Java API for using a MarkLogic database as a MapReduce input source or output destination
- MarkLogic Content Pump (mlcp), an open-source command-line tool that uses Hadoop to efficiently transfer content between a MarkLogic Server database and HDFS or the native file system, or to copy data between MarkLogic Server databases
We’ve created a command-line installation script that simplifies deploying HDP to one or more host machines and configures the MarkLogic Connector into the default Hadoop environment. To get started, download the Connector and consult the README. The full documentation for installing and using the Connector for HDP is available here.
Both the Connector and HDP are free and open-source. MarkLogic provides commercial support for the combination of MarkLogic and HDP using the MarkLogic Connector. For more information about support for Hadoop, see marklogic.com.
Connector for Apache Hadoop
| Release 1.1-3 zip package | 1.5 MB | |
|---|---|---|
| Release 1.1.3 source zip package | 104 KB | |
| Maven repository | | |
The MarkLogic Connector for Hadoop enables you to run Hadoop MapReduce jobs on data in a MarkLogic Server cluster. You can:
- Leverage existing MapReduce and Java libraries to process MarkLogic data
- Operate on data as Documents, Nodes, or Values
- Access MarkLogic text, geospatial, value, and document structure indexes to send only the most relevant data to Hadoop for processing
- Send Hadoop Reduce results to multiple MarkLogic forests in parallel
- Rely on the connector to optimize data access (for both locality and streaming IO) across MarkLogic forests
The Connector includes:
- MarkLogic-specific implementations of the
  - Hadoop InputFormat class for reading data from MarkLogic
  - Hadoop OutputFormat class for writing data to MarkLogic
- Sample code for a variety of use cases
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
Hadoop is often used for computationally complex bulk processing and cheap offline storage of long-tail data. It provides services complementary to MarkLogic's real-time analytics, full-text search, delivery, and updates.
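
As a rough illustration of the "simple programming model" mentioned above, the following is a toy, single-process sketch of the map, shuffle, and reduce phases in plain Java, using a word count as the job. It does not use the Hadoop or MarkLogic Connector APIs, and the class and method names (MapReduceSketch, map, shuffle, reduce, run) are hypothetical names chosen only for this example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, in-memory illustration of the MapReduce model (NOT the Hadoop API).
public class MapReduceSketch {

    // Map phase: emit one (word, 1) pair for every token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values for one key into a single result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Run the three phases in sequence over a list of input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(emitted).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the word counts for two sample lines.
        System.out.println(run(Arrays.asList("to be or not to be", "to index or not")));
    }
}
```

In a real Hadoop job, the framework distributes the map and reduce calls across the cluster and performs the shuffle itself. With the Connector, the input and output of such a job would come from and go to MarkLogic rather than in-memory lists.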
Documentation
MarkLogic Connector for Hadoop Javadoc
MarkLogic Connector for Hadoop Developer's Guide