Package | Description |
---|---|
com.marklogic.contentpump | |
com.marklogic.contentpump.examples | |
com.marklogic.contentpump.test | |
com.marklogic.contentpump.utilities | |
com.marklogic.dom | This package provides W3C DOM-based APIs to the internal on-disk representation of documents and their contents in the expanded tree cache of a MarkLogic database forest. |
com.marklogic.io | |
com.marklogic.mapreduce | MarkLogic Connector for Hadoop core interfaces. |
com.marklogic.mapreduce.examples | Examples of using MarkLogic Server in MapReduce jobs. |
com.marklogic.mapreduce.functions | Interfaces for using a MarkLogic Server lexicon as an input source. |
com.marklogic.mapreduce.test | |
com.marklogic.mapreduce.utilities | |
com.marklogic.tree | |

This bundle provides an API for a MarkLogic Server content connector for Apache Hadoop MapReduce.
For detailed information, see the MarkLogic Connector for Hadoop Developer's Guide.
The MarkLogic Connector for Hadoop API allows you to use MarkLogic Server as a Hadoop MapReduce input source, an output destination, or both.
The following classes are provided for defining MarkLogic-specific key and value types for your MapReduce key-value pairs:
NodePath for keys
DocumentURI for keys
MarkLogicNode for values
You may also use Apache Hadoop MapReduce types such as Text in
certain circumstances. See ValueInputFormat and KeyValueInputFormat.
You may generate input data using MarkLogic Server lexicon functions
by subclassing one of the lexicon function wrapper classes in
com.marklogic.mapreduce.functions. Use lexicon functions
with ValueInputFormat and KeyValueInputFormat.
The following classes are provided for defining MarkLogic-specific MapReduce input and output formats. Input and output formats need not be the same type.
DocumentInputFormat
NodeInputFormat
ValueInputFormat
KeyValueInputFormat
ContentOutputFormat
NodeOutputFormat
PropertyOutputFormat
Configure the connector using the standard Hadoop configuration
mechanism. That is, use a Hadoop configuration file to define
property values, or set properties programmatically on your
Job's Configuration object.
The configuration properties available for the connector are
described in MarkLogicConstants.
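As a rough illustration of the configuration-file route, the sketch below renders key/value pairs in the `<configuration>`/`<property>` layout that Hadoop configuration files use. It depends only on the JDK; the class and method names (`ConnectorConfigSketch`, `toConfigurationXml`, `escape`) are hypothetical helpers and not part of the connector. The one property shown is the input mode property, set to its default value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: renders key/value pairs in the Hadoop configuration file layout. */
final class ConnectorConfigSketch {

    /** Escapes the characters that are special in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds a <configuration> document with one <property> element per map entry. */
    static String toConfigurationXml(Map<String, String> props) {
        StringBuilder xml = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> e : props.entrySet()) {
            xml.append("  <property>\n")
               .append("    <name>").append(escape(e.getKey())).append("</name>\n")
               .append("    <value>").append(escape(e.getValue())).append("</value>\n")
               .append("  </property>\n");
        }
        return xml.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the properties in insertion order.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapreduce.marklogic.input.mode", "basic");
        System.out.print(toConfigurationXml(props));
    }
}
```

The programmatic route sets the same name/value pairs by calling `set(name, value)` on the Job's Configuration object instead of writing them to a file.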
When using MarkLogic Server as an input source for MapReduce
tasks, you may use either basic or advanced input mode. The default
is basic mode. The mode is controlled through
the mapreduce.marklogic.input.mode property. The following sections
describe the input modes briefly. For details, see the
MarkLogic Connector for Hadoop Developer's Guide.
In basic mode, you may supply components of an XQuery path expression which the connector uses to generate input data. You may not use this option along with a lexicon function class.
To allow MarkLogic Server to optimize the input query, the path
expression is constructed from two components: a
document node selector and a sub-document expression.
The input split is not configurable in basic mode. The
splits are based on a rough count of the number of fragments in
each forest. Use advanced input mode for more control
over input split generation.
Conceptually, the input data for each task is constructed from a path expression similar to:
$document-selector/$subdocument-expression
Both components of the input path expression are optional. If no
document selector is given, fn:collection() is used.
If no subdocument expression is given, the document nodes returned
by the document selector are used as the input values.
Examples:
document selector: none
subdocument expression: none
=> All document nodes in fn:collection()
document selector: fn:collection("wiki-topics")
subdocument expression: none
=> All document nodes in the "wiki-topics" collection
document selector: fn:collection("wiki-topics")
subdocument expression: //wp:a[@href]
=> All wp:a elements with an href attribute in the "wiki-topics" collection
document selector: fn:collection("wiki-topics")
subdocument expression: //wp:a[@href]/@title
=> The title attributes of all wp:a elements with an href attribute
in the "wiki-topics" collection
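The defaulting rules above can be sketched as a small string-level function. This is only an illustration of the conceptual path expression, not how the connector actually builds its query; `InputPathSketch` and `compose` are hypothetical names. As an assumption of this sketch, a subdocument expression that already begins with a slash is appended directly, so that the separator is not doubled.

```java
/** Hypothetical illustration of how the two basic-mode components combine conceptually. */
final class InputPathSketch {

    /** Used when no document selector is configured. */
    static final String DEFAULT_SELECTOR = "fn:collection()";

    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /**
     * Combines the optional components into one conceptual path expression.
     * A missing selector falls back to fn:collection(); a missing subdocument
     * expression leaves the selector's document nodes as the input values.
     */
    static String compose(String documentSelector, String subdocumentExpr) {
        String selector = isBlank(documentSelector) ? DEFAULT_SELECTOR : documentSelector.trim();
        if (isBlank(subdocumentExpr)) {
            return selector;
        }
        String sub = subdocumentExpr.trim();
        // Assumption of this sketch: do not add a separator if the expression brings its own.
        return sub.startsWith("/") ? selector + sub : selector + "/" + sub;
    }

    public static void main(String[] args) {
        System.out.println(compose(null, null));
        System.out.println(compose("fn:collection(\"wiki-topics\")", null));
        System.out.println(compose("fn:collection(\"wiki-topics\")", "//wp:a[@href]/@title"));
    }
}
```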
In basic mode, you may gather input data using a MarkLogic Server lexicon function. This option may not be used with the XPath-based configuration properties described above. If both are configured for a job, the lexicon function takes precedence.
To use a lexicon function for input, implement a subclass of
one of the lexicon function wrapper classes in com.marklogic.mapreduce.functions.
For example, to use cts:element-values, implement a
subclass of ElementValues.
Override the methods corresponding to the function parameter values
you want to include in the call.
For details, see "Using a Lexicon to Generate Key-Value Pairs" in the MarkLogic Connector for Hadoop Developer's Guide.
In advanced input mode, you must supply an
input split query and an input query.
The split query is used to generate metadata for Hadoop's input splits. This query must return a sequence of triples, each of which includes a forest ID, a record (fragment) count, and a list of host names. The count may be an estimate.
The input query is used to fetch the input data for each map task. This query must return data that matches the configured InputFormat subclass.
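To make the shape of the split query result more concrete, here is a hypothetical Java model of one triple. It is not a connector class: `SplitTripleSketch`, `ForestSplit`, and `totalRecords` are names invented for this sketch, and the validation rules (a non-negative count, at least one host name) are assumptions rather than documented connector behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of the triples a split query returns. */
final class SplitTripleSketch {

    /** One triple: forest ID, (possibly estimated) record count, host names. */
    static final class ForestSplit {
        final long forestId;
        final long recordCount;
        final List<String> hostNames;

        ForestSplit(long forestId, long recordCount, List<String> hostNames) {
            // Assumed sanity checks; the connector's own checks may differ.
            if (recordCount < 0) {
                throw new IllegalArgumentException("record count must not be negative");
            }
            if (hostNames == null || hostNames.isEmpty()) {
                throw new IllegalArgumentException("at least one host name is required");
            }
            this.forestId = forestId;
            this.recordCount = recordCount;
            this.hostNames = Collections.unmodifiableList(new ArrayList<>(hostNames));
        }
    }

    /** Sums the record counts across all forests. */
    static long totalRecords(List<ForestSplit> splits) {
        long total = 0;
        for (ForestSplit s : splits) {
            total += s.recordCount;
        }
        return total;
    }

    public static void main(String[] args) {
        // Forest IDs and host names below are made-up sample values.
        List<ForestSplit> splits = Arrays.asList(
                new ForestSplit(101L, 2500L, Arrays.asList("host-a")),
                new ForestSplit(102L, 1800L, Arrays.asList("host-b")));
        System.out.println("forests: " + splits.size() + ", records: " + totalRecords(splits));
    }
}
```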
Copyright © 2022 MarkLogic Corporation
Complete online documentation for MarkLogic Server, XQuery and related components may be found at developer.marklogic.com