Type Parameters:
VALUE - Only ForestDocument is currently supported, but types such as Text or BytesWritable are possible candidates to be added.
public class ForestInputFormat<VALUE> extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<DocumentURIWithSourceInfo,VALUE> implements MarkLogicConstants
FileInputFormat subclass for reading documents from a forest using Direct Access.
Direct Access is intended primarily for extracting documents in offline or read-only forests, such as forests containing archived data that are part of a Tiered Storage data management strategy.
This format produces key-value pairs where the key is a DocumentURI (the class signature declares the key type as DocumentURIWithSourceInfo) and the value is a ForestDocument. The type of ForestDocument depends on the underlying document content type: DOMDocument for XML or text, or BinaryDocument for binaries. Binary documents can be further specialized to RegularBinaryDocument or LargeBinaryDocument, depending on size and the database configuration.
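As a rough illustration of the value types described above, the following sketch dispatches on the runtime type of a value. The nested classes are hypothetical stand-ins that only mirror the type names in this description. They are not the connector's classes, and the inheritance shown (both binary types extending BinaryDocument) is an assumption drawn from the wording "further specialized".

```java
import java.util.List;

public class ForestDocumentKinds {
    // Hypothetical stand-ins mirroring the names in the description above.
    // They are NOT the MarkLogic connector classes and hold no content.
    static class ForestDocument {}
    static class DOMDocument extends ForestDocument {}
    static class BinaryDocument extends ForestDocument {}
    static class RegularBinaryDocument extends BinaryDocument {}
    static class LargeBinaryDocument extends BinaryDocument {}

    // Classify a value by its runtime type, checking the most specific
    // binary types before the general BinaryDocument case.
    static String describe(ForestDocument doc) {
        if (doc instanceof LargeBinaryDocument) return "large binary";
        if (doc instanceof RegularBinaryDocument) return "regular binary";
        if (doc instanceof BinaryDocument) return "binary";
        if (doc instanceof DOMDocument) return "XML or text";
        return "unknown";
    }

    public static void main(String[] args) {
        List<ForestDocument> docs = List.of(
                new DOMDocument(),
                new RegularBinaryDocument(),
                new LargeBinaryDocument());
        for (ForestDocument doc : docs) {
            System.out.println(describe(doc));
        }
    }
}
```

In a real job the same kind of check would typically sit in a mapper's map method, applied to the value passed in for each key.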
Modifier and Type | Field and Description |
---|---|
static org.apache.commons.logging.Log | LOG |
Fields inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
DEFAULT_LIST_STATUS_NUM_THREADS, INPUT_DIR, INPUT_DIR_RECURSIVE, LIST_STATUS_NUM_THREADS, NUM_INPUT_FILES, PATHFILTER_CLASS, SPLIT_MAXSIZE, SPLIT_MINSIZE
Fields inherited from interface MarkLogicConstants
ADVANCED_MODE, ASSIGNMENT_POLICY, BASIC_MODE, BATCH_SIZE, BIND_SPLIT_RANGE, COLLECTION_FILTER, CONTENT_TYPE, COPY_COLLECTIONS, COPY_METADATA, COPY_QUALITY, DEFAULT_BATCH_SIZE, DEFAULT_CONTENT_TYPE, DEFAULT_LOCAL_MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE, DEFAULT_OUTPUT_CONTENT_ENCODING, DEFAULT_OUTPUT_XML_REPAIR_LEVEL, DEFAULT_PROPERTY_OPERATION_TYPE, DEFAULT_TXN_SIZE, DIRECTORY_FILTER, DOCUMENT_SELECTOR, EXECUTION_MODE, EXTRACT_URI, INDENTED, INPUT_DATABASE_NAME, INPUT_HOST, INPUT_KEY_CLASS, INPUT_LEXICON_FUNCTION_CLASS, INPUT_MODE, INPUT_PASSWORD, INPUT_PORT, INPUT_QUERY, INPUT_QUERY_LANGUAGE, INPUT_QUERY_TIMESTAMP, INPUT_RESTRICT_HOSTS, INPUT_SSL_OPTIONS_CLASS, INPUT_SSL_PROTOCOL, INPUT_USE_SSL, INPUT_USERNAME, INPUT_VALUE_CLASS, MAX_SPLIT_SIZE, MIN_NODEUPDATE_VERSION, MODE_DISTRIBUTED, MODE_LOCAL, MR_NAMESPACE, NODE_OPERATION_TYPE, OUTPUT_CLEAN_DIR, OUTPUT_COLLECTION, OUTPUT_CONTENT_ENCODING, OUTPUT_CONTENT_LANGUAGE, OUTPUT_CONTENT_NAMESPACE, OUTPUT_DATABASE_NAME, OUTPUT_DIRECTORY, OUTPUT_FAST_LOAD, OUTPUT_FOREST_HOST, OUTPUT_GRAPH, OUTPUT_HOST, OUTPUT_KEY_TYPE, OUTPUT_KEY_VARNAME, OUTPUT_NAMESPACE, OUTPUT_OVERRIDE_GRAPH, OUTPUT_PARTITION, OUTPUT_PASSWORD, OUTPUT_PERMISSION, OUTPUT_PORT, OUTPUT_PROPERTY_ALWAYS_CREATE, OUTPUT_QUALITY, OUTPUT_QUERY, OUTPUT_QUERY_LANGUAGE, OUTPUT_RESTRICT_HOSTS, OUTPUT_SSL_OPTIONS_CLASS, OUTPUT_SSL_PROTOCOL, OUTPUT_STREAMING, OUTPUT_URI_PREFIX, OUTPUT_URI_REPLACE, OUTPUT_URI_SUFFIX, OUTPUT_USE_SSL, OUTPUT_USERNAME, OUTPUT_VALUE_TYPE, OUTPUT_VALUE_VARNAME, OUTPUT_XML_REPAIR_LEVEL, PATH_NAMESPACE, PROPERTY_OPERATION_TYPE, QUERY_FILTER, RECORD_TO_FRAGMENT_RATIO, REDACTION_RULE_COLLECTION, SPLIT_END_VARNAME, SPLIT_QUERY, SPLIT_START_VARNAME, SUBDOCUMENT_EXPRESSION, TEMPORAL_COLLECTION, TXN_SIZE, TYPE_FILTER
Constructor and Description |
---|
ForestInputFormat() |
Modifier and Type | Method and Description |
---|---|
org.apache.hadoop.mapreduce.RecordReader<DocumentURIWithSourceInfo,VALUE> | createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) |
List<org.apache.hadoop.mapreduce.InputSplit> | getSplits(org.apache.hadoop.mapreduce.JobContext job) |
protected List<org.apache.hadoop.fs.FileStatus> | listStatus(org.apache.hadoop.mapreduce.JobContext job) |
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPathRecursively, addInputPaths, computeSplitSize, getBlockIndex, getFormatMinSplitSize, getInputDirRecursive, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, isSplitable, makeSplit, makeSplit, setInputDirRecursive, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
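For orientation on the inherited computeSplitSize helper, here is a standalone sketch of the formula used by the stock Hadoop FileInputFormat: the block size clamped between the configured minimum and maximum split sizes. The class name SplitSizeSketch is hypothetical. Because ForestInputFormat overrides getSplits, this page does not say whether, or how, that override relies on the helper, so treat this as background on the inherited method only.

```java
public class SplitSizeSketch {
    // Re-implementation of the stock FileInputFormat.computeSplitSize formula:
    // clamp blockSize into the range [minSize, maxSize].
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;

        // No effective limits: the split size equals the block size.
        System.out.println(computeSplitSize(128 * mb, 1L, Long.MAX_VALUE));

        // A maximum below the block size caps the split size.
        System.out.println(computeSplitSize(128 * mb, 1L, 64 * mb));

        // A minimum above the block size raises the split size.
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE));
    }
}
```

In stock Hadoop, setMinInputSplitSize and setMaxInputSplitSize set the job configuration values that supply the minimum and maximum used here.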
public org.apache.hadoop.mapreduce.RecordReader<DocumentURIWithSourceInfo,VALUE> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException, InterruptedException
Specified by: createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<DocumentURIWithSourceInfo,VALUE>
Throws: IOException, InterruptedException
protected List<org.apache.hadoop.fs.FileStatus> listStatus(org.apache.hadoop.mapreduce.JobContext job) throws IOException
Overrides: listStatus in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<DocumentURIWithSourceInfo,VALUE>
Throws: IOException
public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext job) throws IOException
Overrides: getSplits in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<DocumentURIWithSourceInfo,VALUE>
Throws: IOException
Copyright © 2020 MarkLogic Corporation
Complete online documentation for MarkLogic Server, XQuery and related components may be found at developer.marklogic.com