This bundle provides an API for a MarkLogic Server content connector for Apache Hadoop MapReduce. The overview covers the following topics:
For detailed information, see the MarkLogic Connector for Hadoop Developer's Guide.
The MarkLogic Connector for Hadoop API allows you to use MarkLogic Server as a Hadoop MapReduce input source, an output destination, or both.
The following classes are provided for defining MarkLogic-specific key and value types for your MapReduce key-value pairs:
You may also use Apache Hadoop MapReduce types such as Text in certain circumstances. See {@link com.marklogic.mapreduce.ValueInputFormat} and {@link com.marklogic.mapreduce.KeyValueInputFormat}.
You may generate input data using MarkLogic Server lexicon functions by subclassing one of the lexicon function wrapper classes in com.marklogic.mapreduce.functions. Use lexicon functions with {@link com.marklogic.mapreduce.ValueInputFormat} and {@link com.marklogic.mapreduce.KeyValueInputFormat}.
The following classes are provided for defining MarkLogic-specific MapReduce input and output formats. Input and output formats need not be the same type.
Configure the connector using the standard Hadoop configuration mechanism. That is, use a Hadoop configuration file to define property values, or set properties programmatically on your Job's {@link org.apache.hadoop.conf.Configuration} object.
The configuration properties available for the connector are described in {@link com.marklogic.mapreduce.MarkLogicConstants}.
When using MarkLogic Server as an input source for MapReduce tasks, you may use either basic or advanced input mode. The default is basic mode. The mode is controlled through the {@link com.marklogic.mapreduce.MarkLogicConstants#INPUT_MODE mapreduce.marklogic.input.mode} property. The following sections describe the input modes briefly. For details, see the MarkLogic Connector for Hadoop Developer's Guide.
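As an illustration of the configuration-file route, the following sketch renders property name/value pairs in the `<configuration>`/`<property>` layout used by Hadoop configuration files. The class `HadoopConfigSketch` and its methods are hypothetical helpers written for this example, not part of the connector API, and the value `advanced` is assumed from the mode names used above. On a real job, the same pair could instead be passed to `Configuration.set(name, value)`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the connector API.
final class HadoopConfigSketch {

    /** Escapes the characters that are unsafe in XML text content. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders name/value pairs in the Hadoop configuration file layout. */
    static String toConfigXml(Map<String, String> props) {
        StringBuilder sb = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append("  <property>\n")
              .append("    <name>").append(escapeXml(e.getKey())).append("</name>\n")
              .append("    <value>").append(escapeXml(e.getValue())).append("</value>\n")
              .append("  </property>\n");
        }
        return sb.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        // Property name taken from the text above; the value "advanced" is an
        // assumption based on the mode names (basic is the default).
        props.put("mapreduce.marklogic.input.mode", "advanced");
        System.out.print(toConfigXml(props));
    }
}
```

Escaping matters here because query-valued properties may contain characters such as `<` and `&` that are not legal in raw XML text.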
In basic mode, you may supply components of an XQuery path expression which the connector uses to generate input data. You may not use this option along with a lexicon function class.
To allow MarkLogic Server to optimize the input query, the path expression is constructed from two components: a {@link com.marklogic.mapreduce.MarkLogicConstants#DOCUMENT_SELECTOR document node selector} and a {@link com.marklogic.mapreduce.MarkLogicConstants#SUBDOCUMENT_EXPRESSION sub-document expression}.
The input split is not configurable in basic mode. The splits are based on a rough count of the number of fragments in each forest. Use advanced input mode for more control over input split generation.
Conceptually, the input data for each task is constructed from a path expression similar to:
$document-selector/$subdocument-expression
Both components of the input path expression are optional. If no document selector is given, fn:collection() is used. If no subdocument expression is given, the document nodes returned by the document selector are used as the input values.
Examples:
document selector: none
subdocument expression: none
=> All document nodes in fn:collection()

document selector: fn:collection("wiki-topics")
subdocument expression: none
=> All document nodes in the "wiki-topics" collection

document selector: fn:collection("wiki-topics")
subdocument expression: //wp:a[@href]
=> All wp:a elements that have an href attribute, in documents in the "wiki-topics" collection

document selector: fn:collection("wiki-topics")
subdocument expression: //wp:a[@href]/@title
=> The title attributes of all wp:a elements that have an href attribute, in documents in the "wiki-topics" collection
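The defaulting rules and examples above can be sketched in a few lines of Java. This is a conceptual illustration only: `PathExpressionSketch` and `compose` are hypothetical names, and the connector's actual query construction is not shown in this overview. The sketch assumes that a subdocument expression starting with `/` is appended directly, so that no extra separator is introduced.

```java
// Hypothetical helper for illustration only; not part of the connector API.
final class PathExpressionSketch {

    /**
     * Builds the conceptual input path expression.
     * A missing document selector defaults to fn:collection();
     * a missing subdocument expression leaves the selector unchanged.
     */
    static String compose(String documentSelector, String subdocumentExpression) {
        String selector = isBlank(documentSelector)
                ? "fn:collection()"
                : documentSelector.trim();
        if (isBlank(subdocumentExpression)) {
            return selector;
        }
        String sub = subdocumentExpression.trim();
        // Assumption: avoid adding a separator when the expression already starts with one.
        return sub.startsWith("/") ? selector + sub : selector + "/" + sub;
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    public static void main(String[] args) {
        String wiki = "fn:collection(\"wiki-topics\")";
        System.out.println(compose(null, null));
        System.out.println(compose(wiki, null));
        System.out.println(compose(wiki, "//wp:a[@href]"));
        System.out.println(compose(wiki, "//wp:a[@href]/@title"));
    }
}
```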
In basic mode, you may gather input data using a MarkLogic Server lexicon function. This option may not be used with the XPath-based configuration properties described above. If both are configured for a job, the lexicon function takes precedence.
To use a lexicon function for input, implement a subclass of one of the lexicon function wrapper classes in com.marklogic.mapreduce.functions. For example, to use cts:element-values, implement a subclass of {@link com.marklogic.mapreduce.functions.ElementValues}. Override the methods corresponding to the function parameters you want to include in the call.
For details, see "Using a Lexicon to Generate Key-Value Pairs" in the MarkLogic Connector for Hadoop Developer's Guide.
In advanced input mode, you must supply an {@link com.marklogic.mapreduce.MarkLogicConstants#SPLIT_QUERY input split query} and an {@link com.marklogic.mapreduce.MarkLogicConstants#INPUT_QUERY input query}.
The split query is used to generate metadata for Hadoop's input splits. This query must return a sequence of triples, each of which includes a forest id, record (fragment) count, and list of host names. The count may be an estimate.
The input query is used to fetch the input data for each map task. This query must return data that matches the configured InputFormat subclass.
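To make the shape of the split query result concrete, the following sketch groups a flat sequence of items into (forest id, record count, host names) triples. Everything here is an assumption made for illustration: the class names are hypothetical, the items are modeled as strings, and the host list is modeled as a comma-separated string. The connector's own handling of the split query result is not described in this overview. Forest ids are read as `BigInteger` on the assumption that they may exceed the range of a signed `long`.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the connector API.
final class SplitTripleSketch {

    /** One triple from the split query result. */
    static final class ForestInfo {
        final BigInteger forestId;
        final long recordCount;   // may be an estimate
        final List<String> hosts;

        ForestInfo(BigInteger forestId, long recordCount, List<String> hosts) {
            this.forestId = forestId;
            this.recordCount = recordCount;
            this.hosts = hosts;
        }
    }

    /** Groups a flat item sequence into triples, three items at a time. */
    static List<ForestInfo> parse(List<String> items) {
        if (items.size() % 3 != 0) {
            throw new IllegalArgumentException(
                    "Expected a multiple of 3 items, got " + items.size());
        }
        List<ForestInfo> result = new ArrayList<>();
        for (int i = 0; i < items.size(); i += 3) {
            BigInteger forestId = new BigInteger(items.get(i).trim());
            long count = Long.parseLong(items.get(i + 1).trim());
            // Assumption: host names arrive as one comma-separated item.
            List<String> hosts =
                    Arrays.asList(items.get(i + 2).trim().split("\\s*,\\s*"));
            result.add(new ForestInfo(forestId, count, hosts));
        }
        return result;
    }

    public static void main(String[] args) {
        List<ForestInfo> forests = parse(Arrays.asList(
                "18446744073709551615", "1200", "host-a, host-b",
                "42", "300", "host-c"));
        for (ForestInfo f : forests) {
            System.out.println(f.forestId + " " + f.recordCount + " " + f.hosts);
        }
    }
}
```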