RecordLoader

To get started with RecordLoader, try the tutorial.

Running RecordLoader

The entry point is the main method in the com.marklogic.ps.RecordLoader class. It takes zero or more property files as its arguments. Properties found in later files override properties specified in earlier files on the command line, and any specified system properties override file-based properties. It is also possible to specify properties as VM arguments (-DNAME=value). See src/recordloader.sh for a sample shell script, and src/config/ for sample property files.
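The override order described above can be sketched in a few lines of Java. This is only an illustration, not RecordLoader's own code: the class PropertyPrecedenceSketch and its merge method are hypothetical, and property files are passed as strings so the example stays self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of RecordLoader: shows the override order only.
class PropertyPrecedenceSketch {

    // Merge property sources in command-line order, then apply overrides last.
    static Properties merge(Properties overrides, String... fileContents) {
        Properties merged = new Properties();
        for (String content : fileContents) {
            try {
                // Each load() overwrites keys already set by earlier sources.
                merged.load(new StringReader(content));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        // Values given as -DNAME=value would be applied here, so they win.
        merged.putAll(overrides);
        return merged;
    }

    public static void main(String[] args) {
        Properties overrides = new Properties();
        overrides.setProperty("THREADS", "8");
        Properties p = merge(overrides,
                "THREADS=2\nID_NAME=#AUTO\n",   // first file on the command line
                "ID_NAME=@id\n");               // second file on the command line
        System.out.println(p.getProperty("THREADS")); // prints 8
        System.out.println(p.getProperty("ID_NAME")); // prints @id
    }
}
```

Here THREADS comes from the override and ID_NAME from the later file, which mirrors the precedence rules stated above.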

Required JVM: Sun 1.5 or later

Required libraries:

Required inputs:

None. If ID_NAME is missing, then the default value #FILENAME will be used.

Available properties:

Property    default value    notes
CONFIGURATION_CLASSNAME com.marklogic.recordloader.xcc.XccConfiguration This class will be used to provide configuration information. This class must be an extension of the com.marklogic.recordloader.Configuration class.
CONTENT_FACTORY_CLASSNAME com.marklogic.recordloader.xcc.XccContentFactory

This class will be used to create new content objects, which implement com.marklogic.recordloader.ContentInterface. One alternative implementation is provided, as com.marklogic.recordloader.xcc.XccModuleContentFactory, which creates objects in the class com.marklogic.recordloader.xcc.XccModuleContent.

When XccModuleContentFactory is used, new documents must fit in memory, and will be posted to the XQuery main module designated by the CONTENT_MODULE_URI property. If the SKIP_EXISTING or ERROR_EXISTING features are desired, the module must implement each itself (see below).

When RecordLoader invokes this module, it will set external variables:

  • $URI
  • $XML-STRING
  • $NAMESPACE  (using DEFAULT_NAMESPACE)
  • $LANGUAGE  (using LANGUAGE)
  • $ROLES  (comma-separated values, using READ_ROLES)
  • $COLLECTIONS  (comma-separated values, using OUTPUT_COLLECTIONS)
  • $SKIP-EXISTING  (using SKIP_EXISTING)
  • $ERROR-EXISTING  (using ERROR_EXISTING)

The following XQuery implements an example ContentModule, which implements a simple transform to lower-case all element names. Note that the module implements its own versions of the SKIP_EXISTING and ERROR_EXISTING checks.

xquery version "0.9-ml"

define variable $URI as xs:string external
define variable $XML-STRING as xs:string external
define variable $NAMESPACE as xs:string external
define variable $LANGUAGE as xs:string external
define variable $ROLES as xs:string external
define variable $COLLECTIONS as xs:string external
define variable $SKIP-EXISTING as xs:boolean external
define variable $ERROR-EXISTING as xs:boolean external

define function do($list as node()*)
 as node()*
{
  for $n in $list return typeswitch($n)
  (: lower-case element localnames :)
  case element() return element {
    expanded-QName(namespace-uri($n), lower-case(local-name($n)))
  } {
    $n/@*, do($n/node())
  }
  case document-node() return document { do($n/node()) }
  default return $n
}

if ($SKIP-EXISTING and doc($URI)) then ()
else if ($ERROR-EXISTING and doc($URI)) then error('DUPLICATE-URI', $URI)
else xdmp:document-insert(
  $URI,
  do(xdmp:unquote(
    $XML-STRING,
    $NAMESPACE,
    if ($LANGUAGE) then concat('default-language=', $LANGUAGE) else ()
  )),
  for $r in tokenize($ROLES, '[,\s]+')[. ne '']
  return xdmp:permission('read', $r),
  tokenize($COLLECTIONS, '[,\s]+')[. ne '']
)
CONNECTION_STRING xcc://admin:admin@localhost:9000/ XCC URI, including username, password, host, and port, to use for all queries and inserts. If desired, a database name may also be supplied. Multiple connection strings may be separated with whitespace or commas.
DEFAULT_NAMESPACE null If present, all XML will default to the supplied namespace uri.
DOCUMENT_FORMAT xml Document format for all new documents. Valid settings are xml, text, and binary.
ERROR_EXISTING false If true, RecordLoader will throw an error if it finds itself trying to overwrite an existing document uri. This error may or may not be fatal, depending on the value of FATAL_ERRORS.

Note that this option requires the server to perform a separate check for each document uri. This can reduce performance.

Note that if using CONTENT_FACTORY_CLASSNAME=com.marklogic.recordloader.xcc.XccModuleContentFactory, the module must implement this check itself (see above).

FATAL_ERRORS true If true, RecordLoader will exit with an error upon encountering any non-retryable error. If set to false, RecordLoader will close the current record and continue on to the next.
ID_NAME #FILENAME Within each input document or RECORD_NAME element, the first element called ID_NAME will be used to compose the new document uri. If ID_NAME starts with '@', an attribute with this local-name will be used to compose the new document uri.

Note that namespace is ignored: only the local-name is used. The named node must have a simple text value: it may not be empty, and it must not contain any non-text children.

The special value ID_NAME=#AUTO will cause RecordLoader to automatically generate ids, in sequence, for each input record. Since RecordLoader automatically includes the base filename in each output URI, this is safe.

Note that when the input is standard input, the default value is #AUTO - not #FILENAME.

The special value ID_NAME=#FILENAME will cause RecordLoader to automatically load each input file into a single document per input file, using the file's basename to compose the new document uri. This is the default behavior.

Examples: ID_NAME=MedlineID, ID_NAME=@id

IGNORE_FILE_BASENAME false If true, RecordLoader will omit the file or zip archive basename when composing new document uris.
IGNORE_UNKNOWN false If set, RecordLoader will ignore siblings of RECORD_NAME that are not RECORD_NAME elements. Otherwise, this condition causes a fatal error.
INPUT_MALFORMED_ACTION REPORT Constant values from java.nio.charset.CodingErrorAction, used to determine what happens if there are invalid character sequences in the input XML.
  • REPORT: throws a MalformedInputException
  • REPLACE: replaces invalid sequence with a '?' or similar.
  • IGNORE: skips over the invalid sequence.
INPUT_ENCODING UTF-8 The Java Charset encoding (codepage) to use for all input XML. If unset, RecordLoader will use null, which will default to the default Locale's character encoding.
Note that MarkLogic Server must receive all XML as UTF-8, so the output encoding is always UTF-8.
Example: if the input XML is encoded as windows-1252, use INPUT_ENCODING=Cp1252 to ensure correct conversion.
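The effect of INPUT_ENCODING and the three INPUT_MALFORMED_ACTION values can be tried with the standard java.nio.charset API. The sketch below is not taken from RecordLoader: the class MalformedInputSketch is hypothetical, and applying the same action to unmappable characters is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical demo class, not part of RecordLoader.
class MalformedInputSketch {

    // Decode bytes with the given encoding; returns null if REPORT raises an error.
    static String decode(byte[] input, String encoding, CodingErrorAction action) {
        try {
            return Charset.forName(encoding).newDecoder()
                    .onMalformedInput(action)
                    // Assumption: unmappable characters get the same treatment.
                    .onUnmappableCharacter(action)
                    .decode(ByteBuffer.wrap(input))
                    .toString();
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException.
            return null;
        }
    }

    public static void main(String[] args) {
        // The byte 0xFF can never appear in well-formed UTF-8.
        byte[] bad = { 'a', (byte) 0xFF, 'b' };
        System.out.println(decode(bad, "UTF-8", CodingErrorAction.REPORT));  // prints null
        System.out.println(decode(bad, "UTF-8", CodingErrorAction.IGNORE));  // prints ab
        // REPLACE substitutes the decoder's replacement string (U+FFFD for UTF-8).
        System.out.println(decode(bad, "UTF-8", CodingErrorAction.REPLACE).length()); // prints 3

        // 0xE9 is e-acute in windows-1252, so it decodes cleanly with that encoding.
        byte[] latin = { (byte) 0xE9 };
        System.out.println(decode(latin, "windows-1252", CodingErrorAction.REPORT).equals("\u00E9")); // prints true
    }
}
```

The last call shows why setting the correct input encoding matters: the same byte would be reported as malformed if it were decoded as UTF-8.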
INPUT_FILE_SIZE_LIMIT 0 If greater than zero, RecordLoader will skip any input files larger than INPUT_FILE_SIZE_LIMIT bytes. This does not apply to zip archives, nor to the size of their entries.
INPUT_HANDLER_CLASSNAME com.marklogic.recordloader.DefaultInputHandler The specified class will be used to marshal loader inputs. The default class handles INPUT_PATH as well as command-line arguments. This property is meant for plug-in classes, which must implement com.marklogic.recordloader.InputHandlerInterface, and may extend the com.marklogic.recordloader.AbstractInputHandler class.
Built-in alternatives:
  • com.marklogic.recordloader.svn.SvnInputHandler treats INPUT_PATH as a subversion repository url (EXPERIMENTAL).
INPUT_PATH null The filesystem path in which to look for XML files or zip archives. If unset, RecordLoader will read XML directly from standard input.
INPUT_PATTERN ^.+\\.[Xx][Mm][Ll]$ Matching pattern (regex) for files found in INPUT_PATH. The default value matches all filenames ending with .xml
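To check a candidate INPUT_PATTERN before a long run, the regex can be tested with java.util.regex. This is a sketch under assumptions: the class InputPatternSketch is hypothetical, and whether RecordLoader matches the pattern against the bare filename or the full path is not stated here. In a property file the doubled backslash is read as a single one, so the effective default pattern is ^.+\.[Xx][Mm][Ll]$.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of RecordLoader.
class InputPatternSketch {

    // The default pattern, written as a Java string literal.
    static final Pattern DEFAULT = Pattern.compile("^.+\\.[Xx][Mm][Ll]$");

    // True if the whole name matches the pattern.
    static boolean accepts(Pattern pattern, String name) {
        return pattern.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts(DEFAULT, "medline.xml"));     // prints true
        System.out.println(accepts(DEFAULT, "MEDLINE.XML"));     // prints true
        System.out.println(accepts(DEFAULT, "notes.txt"));       // prints false
        System.out.println(accepts(DEFAULT, "archive.xml.zip")); // prints false
        // ".+" needs at least one character before the dot.
        System.out.println(accepts(DEFAULT, ".xml"));            // prints false
    }
}
```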
INPUT_STRIP_PREFIX null If not null, characters matching this pattern (regex) will be removed from all input URIs. For example, Windows users may wish to set INPUT_STRIP_PREFIX=^[A-Z]: so that document URIs in the database do not include drive-letter prefixes.
INPUT_NORMALIZE_PATHS false If true, backslashes in input paths will be coalesced and replaced with slashes in all output document URIs. This is useful for Windows users, especially in combination with INPUT_STRIP_PREFIX. With both properties set as suggested, C:\foo\bar\baz.xml on the filesystem becomes /foo/bar/baz.xml in the database.
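The combined effect of INPUT_STRIP_PREFIX and INPUT_NORMALIZE_PATHS can be approximated with two regex replacements. The class PathNormalizeSketch below is hypothetical, and the order of the two steps is an assumption; it is only meant to reproduce the C:\foo\bar\baz.xml example.

```java
// Hypothetical demo class, not part of RecordLoader.
class PathNormalizeSketch {

    // Apply the optional prefix pattern, then optionally normalize backslashes.
    static String toUriPath(String path, String stripPrefix, boolean normalize) {
        String out = path;
        if (stripPrefix != null) {
            // Remove the text matched by the prefix pattern, e.g. a drive letter.
            out = out.replaceFirst(stripPrefix, "");
        }
        if (normalize) {
            // Coalesce each run of backslashes into a single forward slash.
            out = out.replaceAll("\\\\+", "/");
        }
        return out;
    }

    public static void main(String[] args) {
        // prints /foo/bar/baz.xml
        System.out.println(toUriPath("C:\\foo\\bar\\baz.xml", "^[A-Z]:", true));
        // A doubled backslash collapses to one slash: prints /foo/baz.xml
        System.out.println(toUriPath("C:\\\\foo\\baz.xml", "^[A-Z]:", true));
    }
}
```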
LANGUAGE null If set, the value will be passed to XCC ContentCreateOptions.setLanguage(), or to the CONTENT_MODULE external variable $LANGUAGE. Accepted values are documented in XML 1.0 and RFC 3066.

If null, the default database language will be used.

LOG_LEVEL INFO java.util.logging.Level at which to log.
LOG_HANDLER CONSOLE,FILE java.util.logging handlers with which to log.
OUTPUT_COLLECTIONS null One or more collections to apply to every new document. Use whitespace to separate multiple collection uris.
OUTPUT_FORESTS null If set, all documents will be explicitly placed into the named forests. Use whitespace or the characters ,:; to separate values.
READ_ROLES null One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have read permission for these roles. If any of the supplied role names do not exist, the first document insert will throw a fatal error.
RECORD_NAME null

Element name in which each document is found. These may not nest. If no RECORD_NAME is set, the first child element of the first root element will be used for the entire RecordLoader run.

If ID_NAME is set to an element or attribute name, or set to #AUTO (including when RecordLoader reads from standard input), then the special value RECORD_NAME=#DOCUMENT will cause RecordLoader to treat every document root element as a record. This mode is slower than ID_NAME=#FILENAME, but useful when the filenames are not appropriate as document URIs.

RECORD_NAMESPACE null Element namespace in which each document is found. If unset, but RECORD_NAME is set, then the empty namespace is assumed. If unset, and RECORD_NAME is also unset, then the namespace of the first child element of the first root element will be used for the entire RecordLoader run.
SKIP_EXISTING false

If true, existing document uris will be skipped. This allows RecordLoader to resume after being interrupted. This option may be combined with START_ID, in case the known value for START_ID already exists.

Note that one read I/O is required per skip, so SKIP_EXISTING is slower than using START_ID (below).

Note that if using CONTENT_FACTORY_CLASSNAME=com.marklogic.recordloader.xcc.XccModuleContentFactory, the module must implement this check itself (see above).

START_ID null When set, records are skipped until one with an ID_NAME value equal to START_ID is found. This can be used to resume ingestion after interruptions or fatal errors.
THREADS 1

Number of RecordLoader threads.

Note that when using standard input, this value is ignored.

Note that RecordLoader uses at most 1 thread per input file or zip entry.

THROTTLE_BYTES_PER_SECOND 0 If non-zero, all threads will be throttled to the given number of bytes inserted per second.
THROTTLE_EVENTS_PER_SECOND 0 If non-zero, all threads will be throttled to the given number of inserts per second.
URI_PREFIX null Prefix used before the ID_NAME value, to compose all document uris. If the prefix does not end in '/', RecordLoader will add a '/' to it.
URI_SUFFIX null Suffix used after the ID_NAME value, to compose all document uris.
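As a rough illustration of how URI_PREFIX and URI_SUFFIX wrap the ID_NAME value, consider the sketch below. The class UriSketch is hypothetical and deliberately simplified: it leaves out the file or zip archive basename that RecordLoader normally includes (see IGNORE_FILE_BASENAME), and the exact composition in RecordLoader may differ.

```java
// Hypothetical demo class, not part of RecordLoader.
class UriSketch {

    // Compose prefix + id + suffix, adding '/' after the prefix when missing.
    static String compose(String prefix, String id, String suffix) {
        StringBuilder uri = new StringBuilder();
        if (prefix != null && !prefix.isEmpty()) {
            uri.append(prefix);
            if (!prefix.endsWith("/")) {
                uri.append('/');
            }
        }
        uri.append(id);
        if (suffix != null) {
            uri.append(suffix);
        }
        return uri.toString();
    }

    public static void main(String[] args) {
        System.out.println(compose("/medline", "12345", ".xml"));  // prints /medline/12345.xml
        System.out.println(compose("/medline/", "12345", ".xml")); // prints /medline/12345.xml
        System.out.println(compose(null, "12345", null));          // prints 12345
    }
}
```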
USE_FILENAME_COLLECTION true If ID_NAME is not #FILENAME, and this property is true, RecordLoader will add an extra collection to each record, built from the filename of the current input file. This can be useful when splitting superfiles.
XML_REPAIR_LEVEL NONE To what degree should XPP3 and MarkLogic Server compensate for invalid XML?
  • NONE: throw an exception (see also: FATAL_ERRORS).
  • FULL: do everything reasonable to ingest the document.

Troubleshooting

XmlPullParserException: could not resolve entity named 'foo'.

The XPP implementation used by RecordLoader, xpp3, does not handle unknown entity references, and does not process DTD-style document type declarations. So if your XML relies on named entities that are declared in a DTD rather than predefined by XML, RecordLoader is not for you. Future enhancements could include a plug-in system, allowing the user to substitute an XPP implementation that supports document type declarations.

java.util.concurrent.RejectedExecutionException.

If you are using RecordLoader with thousands of files or zipfile entries, you may need to increase the JVM heap space. Try -Xmx256m as one of your command-line JVM arguments.

With Solaris, my UTF-8 accents and diacritics are mangled.

You should see UTF-8 in the output from locale -a:

$ locale -a | grep -i utf
en_CA.UTF-8
en_US.UTF-8
es.UTF-8
es_MX.UTF-8
fr.UTF-8
fr_CA.UTF-8

If no UTF-8 locales are available, make sure to install the Solaris SUNWeu8os package.