MarkLogic Content Pump (mlcp)

MarkLogic Content Pump (mlcp) is an open-source, Java-based command-line tool. It provides the fastest way to import, export, and copy data to or from MarkLogic databases, and it is designed for integration and automation in existing workflows and scripts.

The MarkLogic Content Pump is developed in the open on GitHub. Submit tickets and pull requests there to contribute.

MarkLogic Content Pump on GitHub ›

Download

Release 9.0.2 binaries zip package, 30 MB (SHA1)
Release 9.0.2 source zip package, 410 KB (SHA1)
Release 9.0.2 binaries zip package for use with MapR, 31 MB (SHA1)

Maven

Dependencies

Features

Content Pump can:

Data sources and destinations

Content Pump supports moving data between a MarkLogic database and any of the following:

  • Local filesystem
  • HDFS
  • MarkLogic archive
  • Another MarkLogic database

Formats

Content Pump supports

The popular RecordLoader and XQSync projects have served as inspirations for Content Pump. However, mlcp is not designed for compatibility with either of those tools.

Getting Started with MLCP

You may find this free online training course helpful.

To get started moving data with mlcp, download and unpack the binaries. If you are interested in hacking on mlcp or looking at its internals, you can also download the Apache 2.0 licensed source.

To create your first import script, make sure you have an XDBC server attached to your database (port 8006 is used as the example port here). From the command line, run the following, substituting your particulars.
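
The import described above can be sketched roughly as follows. This is a minimal sketch rather than the original example from this page: the host, credentials, and input path are hypothetical placeholders, and it assumes mlcp.sh from the unpacked bin directory is on your PATH. The script prints the command for review and runs it only when mlcp.sh is found.

```shell
# Hypothetical connection details and paths; substitute your own.
ML_HOST=localhost
ML_PORT=8006             # port of the XDBC server attached to the database
ML_USER=admin
ML_PASSWORD=changeme
INPUT_PATH=/data/docs    # file or directory to load

# Collect the arguments once so the same list can be previewed and executed.
set -- import -host "$ML_HOST" -port "$ML_PORT" \
  -username "$ML_USER" -password "$ML_PASSWORD" \
  -input_file_path "$INPUT_PATH" -mode local

IMPORT_PREVIEW="mlcp.sh $*"
echo "$IMPORT_PREVIEW"

# Run only when mlcp.sh is available on the PATH.
if command -v mlcp.sh >/dev/null 2>&1; then
  mlcp.sh "$@"
fi
```

The preview line also echoes the placeholder password, so drop the echo if the output ends up in shared logs.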

To export a subset of that same database into a platform-independent archive:
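
One way such an export might look is sketched below. It is an illustration under assumptions: the subset is selected with -collection_filter and a hypothetical collection name (-directory_filter is another option for selecting by URI directory), and the output directory and credentials are placeholders. As before, the command is printed and only run if mlcp.sh is on the PATH.

```shell
# Hypothetical connection details and paths; substitute your own.
ML_HOST=localhost
ML_PORT=8006
ML_USER=admin
ML_PASSWORD=changeme
ARCHIVE_DIR=/data/export/archive   # where the archive files are written
COLLECTION=my-collection           # hypothetical collection that defines the subset

# -output_type archive writes documents together with their metadata.
set -- export -host "$ML_HOST" -port "$ML_PORT" \
  -username "$ML_USER" -password "$ML_PASSWORD" \
  -output_type archive -output_file_path "$ARCHIVE_DIR" \
  -collection_filter "$COLLECTION" -mode local

EXPORT_PREVIEW="mlcp.sh $*"
echo "$EXPORT_PREVIEW"

# Run only when mlcp.sh is available on the PATH.
if command -v mlcp.sh >/dev/null 2>&1; then
  mlcp.sh "$@"
fi
```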

To import all triples from an N-Triples formatted file named example.nt:
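
A triples import along these lines could be sketched as follows, assuming example.nt sits in the current directory and using placeholder connection details. The -input_file_type rdf option tells mlcp to treat the input as RDF rather than as ordinary documents. The command is printed and only run if mlcp.sh is on the PATH.

```shell
# Hypothetical connection details; substitute your own.
ML_HOST=localhost
ML_PORT=8006
ML_USER=admin
ML_PASSWORD=changeme
TRIPLES_FILE=example.nt   # N-Triples file named in the text above

# -input_file_type rdf loads the file as triples instead of as a plain document.
set -- import -host "$ML_HOST" -port "$ML_PORT" \
  -username "$ML_USER" -password "$ML_PASSWORD" \
  -input_file_path "$TRIPLES_FILE" -input_file_type rdf -mode local

RDF_PREVIEW="mlcp.sh $*"
echo "$RDF_PREVIEW"

# Run only when mlcp.sh is available on the PATH.
if command -v mlcp.sh >/dev/null 2>&1; then
  mlcp.sh "$@"
fi
```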

This is a small sampling of the rich set of capabilities that mlcp provides. For much more information, sample code, and examples, see the documentation.

Documentation


Older versions

MarkLogic 8.0-3+

Release 8.0.7 binaries zip package, 29 MB (SHA1)
Release 8.0.7 source zip package, 391 KB (SHA1)
Release 8.0.7 binaries zip package for use with MapR, 31 MB (SHA1)

Maven

Dependencies

MarkLogic 8.0-2

Release 1.3-2 binaries zip package 29 MB
Release 1.3-2 source zip package 160 KB

MarkLogic 7

Release 7.0-6.4 binaries zip package 29 MB
Release 7.0-6.4 source zip package 160 KB

Comments

  • Will MLCP automatically split a PDF file into XML and XHTML while loading, or not? I once tried loading a file and it was split automatically, but two days later, when I tried the same procedure, it did not split. What mistake might I have made here?
    • I think Stack Overflow (http://stackoverflow.com/questions/ask?tags=marklogic) would be a better place for this question. You'll be able to give more detail about what you tried and more people will see it.
  • Is MLCP version tied to ML version? We are currently running 7.0 of ML and now have the need to install MLCP on a few servers and thought perhaps I could install the latest MLCP ...
    • Jeff, MLCP is built to be backwards compatible (if you find a case where it isn't, that's a bug and we'd like to hear about it). We advise using the latest MLCP.
  • Hi, when loading 65K docs into either a local or remote MarkLogic instance, using -input_file_path . from the relevant directory, following display of the Hadoop library version (2.6) and "Content type is set to MIXED", the program pauses for 4-5 minutes before it actually begins to load, occupying about 20% of the CPU during that time. It does this on both an iMac and an MBP running 10.11.2. I've used mlcp in the past, and there was never any substantial delay prior to contentpump.LocalJobRunner output. Is this normal? Have you tested with El Capitan? Locally it's an ML8, remotely ML7 with an XDBC I set up. I would appreciate help with correcting this as it's a procedure I repeat fairly often. Can I provide you with more info for troubleshooting? Thank you.
    • Shannon, Thanks for trying out mlcp and letting us know the issue you ran into. We don't officially support El Capitan for MarkLogic 8 products, although I don't believe that's the issue. We have filed a bug and will investigate the issue further. When you used a previous release of mlcp, did you run it with a similar number of input files? You are welcome to work with MarkLogic support so that you'll be notified when a fix becomes available. Thanks, Jane
      • Thanks, Jane. In fact last time I used mlcp without any noticeable hanging was with 3x as many files. However, I'm pretty certain I used an options file then. I will try using one for this job and see if it makes any difference. Otherwise yes I might get in touch with Support. Thanks!
  • Is there a way to import metadata (collections and properties) when using -input_file_type forest? For some reason it never does it for me. Using ML7.
  • The options ~> mlcp input ...... -input_file_type aggregates -aggregate_record_element stwtext -aggregate_uri_id @id don't work properly if the "stwtext" root element contains many sub-elements with the same "id" attribute. In this case the last id is used. Or maybe it does work, but how can I point at the attribute of my root element, for example stwtext.id?
  • It is great to read that "mlcp" is an open-source project. I am curious where to find the related GitHub project (I haven't found it under the marklogic organization) and how people might be able to contribute.
    • Better late than never, MLCP is now on github: https://github.com/marklogic/marklogic-contentpump
    • The intent is to move mlcp (and our other open-source projects) to open development on GitHub. We've started this process with our Java and Node.js Client APIs. I don't have a specific timeline, but it wouldn't be before 2015. In the meantime, please submit bug or enhancement requests through the normal support channels or developer mailing list. Product Management and Engineering pay close attention to both. If you're thinking of doing major surgery I'd be happy to get you in touch with some of our Engineers to coordinate, making a potential downstream merge much more feasible.
  • Is there a (binary) distribution of MLCP available, that doesn't include/require the Hadoop libraries? I'm interested in scripting MLCP, but even usage seems to require Hadoop common..
    • I think it does require that jar file, which comes in the distribution. It does not, however, require hadoop unless you are doing the fast load option.
      • I thought there used to be two distributions of MLCP, one with Hadoop support, and one without. I might be mistaken though. The issue is the 'redundant' jar files, some are quite large for jars, particularly if not used. Most notably the hadoop-common..
  • It's a shame we can't set the target database (for any operation). We need a dedicated XDBC port for each content database we want to process. Is there a way to propose a "database" option as a feature request (going with the host and port options)?
    • Update: MLCP now supports a -database option.
    • Stephane, there's an existing RFE for that feature, but I think it's most likely to be implemented when MLCP moves to GitHub and someone in the community adds it.