To get started with RecordLoader, try the tutorial.
The entry point is the main method in the
com.marklogic.ps.RecordLoader class.
It takes zero or more property files as its arguments.
Any specified system properties will override file-based properties,
and properties found in later files may override properties
specified in earlier files on the command line.
It's also possible to specify properties as VM arguments (-DNAME=value).
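
As a rough illustration of the precedence rules above, the following Java sketch loads property sources in command-line order and then applies overrides on top. It is not RecordLoader's actual code: the class `PropertyPrecedenceSketch` and the `merge` helper are hypothetical names, and the property files are simulated with in-memory strings.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class PropertyPrecedenceSketch {

    // Hypothetical helper (not part of RecordLoader): loads each source in
    // order, so later sources replace keys set by earlier ones, then applies
    // the overrides last so that they always win.
    static Properties merge(List<Reader> sources, Properties overrides)
            throws IOException {
        Properties merged = new Properties();
        for (Reader source : sources) {
            merged.load(source);
        }
        merged.putAll(overrides);
        return merged;
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for two property files named on the command line.
        Reader first = new StringReader("THREADS=1\nID_NAME=MedlineID\n");
        Reader second = new StringReader("THREADS=4\n");

        // Stand-in for a VM argument such as -DID_NAME=@id.
        Properties overrides = new Properties();
        overrides.setProperty("ID_NAME", "@id");

        Properties merged = merge(Arrays.asList(first, second), overrides);
        System.out.println("THREADS=" + merged.getProperty("THREADS")); // THREADS=4
        System.out.println("ID_NAME=" + merged.getProperty("ID_NAME")); // ID_NAME=@id
    }
}
```

In a real run the overrides would come from the JVM's system properties rather than a hand-built object.
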
See
src/recordloader.sh
for a sample shell script.
See src/config/
for sample property files.
No properties are required. If ID_NAME is missing,
then the default value #FILENAME will be used.
| Property | default value | notes |
|---|---|---|
| CONFIGURATION_CLASSNAME | com.marklogic.recordloader.xcc.XccConfiguration | This class will be used to provide configuration information. This class must be an extension of the com.marklogic.recordloader.Configuration class. |
| CONTENT_FACTORY_CLASSNAME | com.marklogic.recordloader.xcc.XccContentFactory |
This class will be used to create new content objects,
which implement com.marklogic.recordloader.ContentInterface.
One alternative implementation is provided, as
When XccModuleContentFactory is used, new documents must fit in memory,
and will be posted to the XQuery main module designated by the
When RecordLoader invokes this module, it will set external variables:
The following XQuery implements an example ContentModule,
which implements a simple transform to lower-case all element names.
Note that the module implements its own versions of the
xquery version "0.9-ml"

define variable $URI as xs:string external
define variable $XML-STRING as xs:string external
define variable $NAMESPACE as xs:string external
define variable $LANGUAGE as xs:string external
define variable $ROLES as xs:string external
define variable $COLLECTIONS as xs:string external
define variable $SKIP-EXISTING as xs:boolean external
define variable $ERROR-EXISTING as xs:boolean external

define function do($list as node()*)
 as node()*
{
  for $n in $list return typeswitch($n)
    (: lower-case element localnames :)
    case element() return element {
      expanded-QName(namespace-uri($n), lower-case(local-name($n)))
    } {
      $n/@*, do($n/node())
    }
    case document-node() return document { do($n/node()) }
    default return $n
}

if ($SKIP-EXISTING and doc($URI)) then ()
else if ($ERROR-EXISTING and doc($URI)) then error('DUPLICATE-URI', $URI)
else xdmp:document-insert(
  $URI,
  do(xdmp:unquote(
    $XML-STRING,
    $NAMESPACE,
    if ($LANGUAGE) then concat('default-language=', $LANGUAGE) else ()
  )),
  (: xdmp:permission takes the role name first, then the capability :)
  for $r in tokenize($ROLES, '[,\s]+')[. ne '']
  return xdmp:permission($r, 'read'),
  tokenize($COLLECTIONS, '[,\s]+')[. ne '']
)
|
| CONNECTION_STRING | xcc://admin:admin@localhost:9000/ | XCC URI, including username, password, host, and port, to use for all queries and inserts. If desired, a database name may also be supplied. Multiple connection strings may be separated with whitespace or commas. |
| DEFAULT_NAMESPACE | null | If present, all XML will default to the supplied namespace uri. |
| DOCUMENT_FORMAT | xml | Document format for all new documents. Valid settings are xml, text, and binary. |
| ERROR_EXISTING | false |
If true, RecordLoader will throw an error
if it finds itself trying to overwrite an existing document uri.
This error may or may not be fatal,
depending on the value of FATAL_ERRORS.
Note that this option requires the server to perform a separate check for each document uri. This can reduce performance.
Note that if using
|
| FATAL_ERRORS | true | If true, RecordLoader will exit with an error upon encountering any non-retryable error. If set to false, RecordLoader will close the current record and continue on to the next. |
| ID_NAME | #FILENAME |
Within each input document or RECORD_NAME element,
the first element called ID_NAME will be used to compose the new document uri.
If ID_NAME starts with '@', an attribute with this local-name
will be used to compose the new document uri.
Note that namespace is ignored: only the local-name is used. The named node must have a simple text value: it may not be empty, and it must not contain any non-text children.
The special value
Note that when the input is standard input,
the default value is
The special value Examples: ID_NAME=MedlineID, ID_NAME=@id |
| IGNORE_FILE_BASENAME | false | If true, RecordLoader will omit the file or zip archive basename when composing new document uris. |
| IGNORE_UNKNOWN | false | If set, RecordLoader will ignore siblings of RECORD_NAME that are not RECORD_NAME elements. Otherwise, this condition causes a fatal error. |
| INPUT_MALFORMED_ACTION | REPORT | Constant value from java.nio.charset.CodingErrorAction (IGNORE, REPLACE, or REPORT), used to determine what happens if there are invalid character sequences in the input XML. |
| INPUT_ENCODING | UTF-8 | The Java Charset encoding (codepage) to use for all input XML. If unset, RecordLoader will use null, which will default to the default Locale's character encoding. Note that MarkLogic Server must receive all XML as UTF-8, so the output encoding is always UTF-8. Example: if the input XML is encoded as windows-1252, use INPUT_ENCODING=Cp1252 to ensure correct conversion. |
| INPUT_FILE_SIZE_LIMIT | 0 | If greater than zero, RecordLoader will skip any input files larger than INPUT_FILE_SIZE_LIMIT bytes. This does not apply to zip archives, nor to the size of their entries. |
| INPUT_HANDLER_CLASSNAME | com.marklogic.recordloader.DefaultInputHandler |
The specified class will be used to marshal loader inputs.
The default class handles INPUT_PATH
as well as command-line arguments.
This property is meant for plug-in classes, which must implement
com.marklogic.recordloader.InputHandlerInterface,
and may extend the com.marklogic.recordloader.AbstractInputHandler
class.
Built-in alternatives:
|
| INPUT_PATH | null | The filesystem path in which to look for XML files or zip archives. If unset, RecordLoader will read XML directly from standard input. |
| INPUT_PATTERN | ^.+\\.[Xx][Mm][Ll]$ | Matching pattern (regex) for files found in INPUT_PATH. The default value matches all filenames ending with .xml, in any letter case. |
| INPUT_STRIP_PREFIX | null | If not null, characters matching this pattern (regex) will be removed from all input URIs. For example, Windows users may wish to set INPUT_STRIP_PREFIX=^[A-Z]: so that document URIs in the database do not include drive-letter prefixes. |
| INPUT_NORMALIZE_PATHS | false | If true, backslashes in input paths will be coalesced and replaced with slashes in all output document URIs. This is useful for Windows users, especially in combination with INPUT_STRIP_PREFIX. With both properties set as suggested, C:\foo\bar\baz.xml on the filesystem becomes /foo/bar/baz.xml in the database. |
| LANGUAGE | null | If set, the value will be passed to XCC ContentCreateOptions.setLanguage(), or to the CONTENT_MODULE external variable $LANGUAGE. Accepted values are documented in XML 1.0 and RFC 3066. If null, the default database language will be used. |
| LOG_LEVEL | INFO | java.util.logging.Level at which to log. |
| LOG_HANDLER | CONSOLE,FILE | java.util.logging log handlers with which to log. |
| OUTPUT_COLLECTIONS | null | One or more collections to apply to every new document. Use whitespace to separate multiple collection uris. |
| OUTPUT_FORESTS | null | If set, all documents will be explicitly placed into the named forests. Use whitespace or the characters ,:; to separate values. |
| READ_ROLES | null | One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have read permission for these roles. If any of the supplied role names do not exist, the first document insert will throw a fatal error. |
| RECORD_NAME | null |
Element name in which each document is found. These may not nest. If no RECORD_NAME is set, the first child element of the first root element will be used for the entire RecordLoader run. If |
| RECORD_NAMESPACE | null | Element namespace in which each document is found. If unset, but RECORD_NAME is set, then the empty namespace is assumed. If unset, and RECORD_NAME is also unset, then the namespace of the first child element of the first root element will be used for the entire RecordLoader run. |
| SKIP_EXISTING | false |
If true, existing document uris will be skipped.
This allows RecordLoader to resume after being interrupted.
This option may be combined with Note that one read I/O is required per skip, so SKIP_EXISTING is slower than using START_ID (below).
Note that if using
|
| START_ID | null | When set, records are skipped until one with an ID_NAME value equal to START_ID is found. This can be used to resume ingestion after interruptions or fatal errors. |
| THREADS | 1 | Number of RecordLoader threads. Note that when using standard input, this value is ignored. Note that RecordLoader uses at most 1 thread per input file or zip entry. |
| THROTTLE_BYTES_PER_SECOND | 0 | If non-zero, all threads will be throttled to the given number of bytes inserted per second. |
| THROTTLE_EVENTS_PER_SECOND | 0 | If non-zero, all threads will be throttled to the given number of inserts per second. |
| URI_PREFIX | null | Prefix used before the ID_NAME value, to compose all document uris. If the prefix does not end in '/', RecordLoader will add a '/' to it. |
| URI_SUFFIX | null | Suffix used after the ID_NAME value, to compose all document uris. |
| USE_FILENAME_COLLECTION | true | If ID_NAME is not #FILENAME, and this property is true, RecordLoader will add an extra collection to each record, built from the filename of the current input file. This can be useful when splitting superfiles. |
| XML_REPAIR_LEVEL | NONE | To what degree should XPP3 and MarkLogic Server compensate for invalid XML? |
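
To make the INPUT_PATTERN, INPUT_STRIP_PREFIX, and INPUT_NORMALIZE_PATHS descriptions in the table above more concrete, here is a minimal Java sketch of how a filesystem path could be filtered and turned into a document URI. It is an illustration only, not RecordLoader's implementation: the class `UriPathSketch` and the helpers `matchesPattern` and `toUri` are hypothetical, and the exact order in which RecordLoader applies these steps is an assumption.

```java
import java.util.regex.Pattern;

public class UriPathSketch {

    // Default INPUT_PATTERN from the table, written as a Java string literal.
    static final String DEFAULT_INPUT_PATTERN = "^.+\\.[Xx][Mm][Ll]$";

    // Hypothetical helper: true if the whole file name matches the pattern.
    static boolean matchesPattern(String fileName, String inputPattern) {
        return Pattern.matches(inputPattern, fileName);
    }

    // Hypothetical helper: removes text matching stripPrefixRegex (if any),
    // then, if requested, coalesces runs of backslashes into a single slash.
    static String toUri(String path, String stripPrefixRegex,
            boolean normalizePaths) {
        String uri = path;
        if (stripPrefixRegex != null) {
            uri = uri.replaceAll(stripPrefixRegex, "");
        }
        if (normalizePaths) {
            uri = uri.replaceAll("\\\\+", "/");
        }
        return uri;
    }

    public static void main(String[] args) {
        System.out.println(matchesPattern("baz.XML", DEFAULT_INPUT_PATTERN));   // true
        System.out.println(matchesPattern("notes.txt", DEFAULT_INPUT_PATTERN)); // false

        // INPUT_STRIP_PREFIX=^[A-Z]: with INPUT_NORMALIZE_PATHS=true
        System.out.println(toUri("C:\\foo\\bar\\baz.xml", "^[A-Z]:", true));
        // prints /foo/bar/baz.xml
    }
}
```

The last call reproduces the example from the INPUT_NORMALIZE_PATHS row: the drive letter is stripped first, and the remaining backslashes become slashes.
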
XmlPullParserException: could not resolve entity named 'foo'.
The XPP implementation used by RecordLoader, xpp3, does not handle unknown references, and does not process DTD-style document declarations. So if your XML includes non-XML character entities, RecordLoader is not for you. Future enhancements could include a plug-in system, allowing the user to substitute an XPP implementation that supports document declarations.
java.util.concurrent.RejectedExecutionException.
If you are using RecordLoader with thousands of files or zipfile entries,
you may need to increase the JVM heap space. Try -Xmx256m
as one of your command-line JVM arguments.
You should see UTF-8 in the output from locale -a:
$ locale -a | grep -i utf
en_CA.UTF-8
en_US.UTF-8
es.UTF-8
es_MX.UTF-8
fr.UTF-8
fr_CA.UTF-8
If no UTF-8 locales are available,
make sure to install the Solaris SUNWeu8os package.