Creating an MLCP Docker container with pre-loaded data

by Bill Miller

Sometimes it's useful to be able to pass around a known dataset to use in dev, testing, and QA activities. This blog will show you how to create a Docker image with a known dataset and then use it to deploy that data to a MarkLogic instance using the container's MLCP library. Without further ado, let's get started!

Assumptions

Before proceeding with this tutorial, I'm making the following assumptions about your environment.

  • Docker version 1.12+ (Earlier versions of Docker will most likely work as well, but have not been tested.)
  • Basic familiarity with Docker (See Building a MarkLogic Docker Container for an introduction to using MarkLogic with Docker)

Creating the Docker Image

The first thing to do is create the Dockerfile we need for building our image.

Dockerfile

  1. Create a folder to hold all of our files, then open a terminal window and navigate to that directory.
  2. Download the current version of MLCP from MarkLogic and place the zip in the directory you created.
  3. Create a file called "Dockerfile" and copy the code below into the file.

Let's look at what's going on in the Dockerfile. First, we base our image on CentOS 7, update the yum package manager, and install the libraries we need for our environment. Next, we set some environment variables and install the Java JDK. After that, we copy in the required script and the MLCP configuration file (we'll create them next) and give the script the permissions it needs to run. Lastly, we set a few more environment variables that are used inside the script and make that script the "entrypoint" that runs when the container is started.

When you build your image you should update the following environment variables as required:

  • ARG MLCP_VERSION - Set this value to the version of MLCP you downloaded earlier.
  • ENV INPUT_FILE_TYPE - Specify the type of data being imported. [ aggregates, archive, delimited_text, delimited_json, documents, forest, rdf, sequencefile ]
  • ENV INPUT_FILE_PATH - You shouldn't have to modify this unless you change the name of the folder you're copying data into in your container.
  • ENV CONFIG_FILE_PATH - You shouldn't have to modify this unless you change where the mlcp_config.txt file is copied to or you change the name of the config file itself.
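The Dockerfile listing itself is not shown here, so the following is only a rough sketch reconstructed from the description above, not the original file. It is written as a shell heredoc so it can be pasted into the terminal from step 1. The package names, the /opt and /data paths, the 9.0.1 version number, and the assumption that the zip is named mlcp-<version>-bin.zip and unpacks to mlcp-<version>/ are all mine; adjust them to match your download. Note also that CentOS 7 has since reached end of life, so the yum steps may need an archived repository.

```shell
# Write a sketch Dockerfile into the current working directory.
# Package names, paths, and the version number are assumptions; adjust them.
cat > Dockerfile <<'EOF'
FROM centos:7

# Version of the MLCP zip downloaded earlier (placeholder value)
ARG MLCP_VERSION=9.0.1

# Update yum and install the tools the image needs (assumed package set)
RUN yum -y update && \
    yum -y install unzip java-1.8.0-openjdk && \
    yum clean all

# Unpack MLCP (assumes the zip is named mlcp-<version>-bin.zip)
COPY mlcp-${MLCP_VERSION}-bin.zip /tmp/
RUN unzip -q /tmp/mlcp-${MLCP_VERSION}-bin.zip -d /opt && \
    mv /opt/mlcp-${MLCP_VERSION} /opt/mlcp && \
    rm /tmp/mlcp-${MLCP_VERSION}-bin.zip

# Copy the entrypoint script, the MLCP options file, and the data
COPY run_script.sh /opt/run_script.sh
COPY mlcp_config.txt /opt/mlcp_config.txt
COPY data/ /data/
RUN chmod +x /opt/run_script.sh

# Values read by run_script.sh
ENV INPUT_FILE_TYPE=archive \
    INPUT_FILE_PATH=/data \
    CONFIG_FILE_PATH=/opt/mlcp_config.txt

ENTRYPOINT ["/opt/run_script.sh"]
EOF
```

Keeping the version in an ARG means a different MLCP release can be selected at build time with --build-arg MLCP_VERSION=<version> instead of editing the file.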

Scripts

  1. Create a folder called "data" in the one you created earlier.

    This folder will hold the data you want to import into MarkLogic.

  2. Create a file called run_script.sh and copy the code below into the file.

    This file is fairly well commented, so it should be largely self-explanatory. It ensures the container stays running and parses the arguments we pass to the docker run command when we start the container to run MLCP.

  3. Create a file called mlcp_config.txt

    This file can be filled in or left empty. Its purpose is to let whoever builds the initial image enter the MLCP options they know will be required for the type of data stored in the container. Ensure you follow proper formatting for this file: each option name goes on its own line, followed by its value on the next line.

    Just add as many options as you require following this format. Use only options that are valid with the import command, since that's all this container supports.

  4. Last, we need some data to import. For this post, I've used MLCP to export an archive from an existing MarkLogic database and copied it to the data folder. You can use any dataset you like; just put it in the data folder and update the MLCP options in your Dockerfile.
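As an illustration of the options file from step 3, here is a minimal sketch that writes a sample mlcp_config.txt. MLCP options files put each option name on its own line with its value on the line after it. The two options and their values (an "enron" collection and a batch size of 100) are placeholders chosen for this example, not settings from the original post.

```shell
# Write a sample MLCP options file: each option name sits on its own line,
# followed by its value on the next line. The values here are placeholders.
cat > mlcp_config.txt <<'EOF'
-output_collections
enron
-batch_size
100
EOF

# Quick check: print each option next to its value
paste - - < mlcp_config.txt
```

The paste command is only there to make a misplaced line easy to spot: every printed row should start with an option name.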

Your working folder should now contain the Dockerfile, the MLCP zip, run_script.sh, mlcp_config.txt, and the data folder. My own folder also contains a file called mlcp_possible_options.txt; you can ignore it for the purposes of this tutorial. In my example, the data folder contains archives of metadata and binaries I sampled from an existing MarkLogic database.
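If you want to confirm the layout before building, a small check along these lines can help. It is a sketch of my own rather than part of the original scripts: check_layout is a hypothetical helper name, and it assumes the MLCP download still has its mlcp-<version>-bin.zip file name.

```shell
# Report anything missing from the working folder (default: current directory).
# Returns 0 when everything is present, 1 otherwise.
check_layout() {
  dir=${1:-.}
  missing=0
  for f in Dockerfile run_script.sh mlcp_config.txt; do
    [ -f "$dir/$f" ] || { echo "missing file: $f"; missing=1; }
  done
  [ -d "$dir/data" ] || { echo "missing folder: data"; missing=1; }
  # Assumes the MLCP download keeps its mlcp-<version>-bin.zip name
  ls "$dir"/mlcp-*-bin.zip >/dev/null 2>&1 || { echo "missing MLCP zip"; missing=1; }
  return "$missing"
}
```

Run check_layout from the working folder, or pass the folder path as the first argument; no output means nothing is missing.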

We now have the necessary scripts and data, and we've created our Dockerfile. Let's go ahead and build the image. Open a terminal window, make sure you're in the working directory, and enter:

docker build -t <your registry/your image name>:<image tag> .

Note: It may be helpful to use the image tag to identify the data in the container. For example:

docker build -t local/mlcp:enron .

Note: It would probably be beneficial to create a "base" MLCP image with the necessary libraries and then use it as the base for subsequent data-specific images. That way, every time you build the image with new data, all of the required libraries are already part of the image and don't have to be reinstalled each time the image is created. I'll list the steps required to build an MLCP Base image at the end of this blog.

Running the Container

Running the container is easy. Just enter the following:

docker run --rm <image name>:<image tag> --host <target MarkLogic server> --username admin --password password --port <port>

Let's look at the command:

  • --rm - Tells the Docker engine to completely delete the container after it exits.
  • <image name>:<image tag> - Should be the same as the name/tag you used when building the image
  • --host - This is the MarkLogic server you want to import data into
  • --username - Username used to authenticate with your MarkLogic server
  • --password - Password associated with the username
  • --port - The XDBC App server port associated with the database you want to insert data into

Mandatory Options for container

  • --host
  • --port
  • --username
  • --password 

The following are the optional arguments you can provide to this container:

  • --batch_size
  • --database
  • --fastload {takes no value} If passed, assumes true, otherwise false
  • --filename-as-collection
  • --namespace
  • --output_cleandir {takes no value} If passed, assumes true, otherwise false
  • --output_collections
  • --output_directory
  • --output_uri_prefix
  • --output_uri_suffix
  • --tolerate_errors

Note: When passing optional arguments use the --option_name=option_value syntax.

This is by no means an all-inclusive list, but a savvy user could easily modify the script to support additional arguments. Don't forget that there is also an MLCP options file you can use.
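To give a feel for how such a script might handle these arguments, here is a minimal sketch. It is not the original run_script.sh: the function name build_mlcp_args is hypothetical, and the mapping (double-dash container options become single-dash MLCP options, with the options file placed first after the import command) is my assumption about a reasonable design. It reads the INPUT_FILE_TYPE, INPUT_FILE_PATH, and CONFIG_FILE_PATH variables described earlier and only prints the argument list; it does not invoke MLCP.

```shell
# Sketch of the argument handling a script like run_script.sh might do.
# Prints the resulting MLCP arguments one per line; returns 1 on bad input.
build_mlcp_args() {
  host="" port="" username="" password="" extra=""
  nl='
'
  while [ $# -gt 0 ]; do
    case "$1" in
      --host|--port|--username|--password)
        # Mandatory options: the value is the next argument
        [ $# -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
        case "$1" in
          --host)     host=$2 ;;
          --port)     port=$2 ;;
          --username) username=$2 ;;
          --password) password=$2 ;;
        esac
        shift 2 ;;
      --fastload|--output_cleandir)
        # Flags without a value: presence means true
        extra="${extra}-${1#--}${nl}true${nl}"
        shift ;;
      --*=*)
        # Optional arguments: --option_name=option_value
        name=${1%%=*}
        extra="${extra}-${name#--}${nl}${1#*=}${nl}"
        shift ;;
      *)
        echo "unrecognized argument: $1" >&2; return 1 ;;
    esac
  done
  if [ -z "$host" ] || [ -z "$port" ] || [ -z "$username" ] || [ -z "$password" ]; then
    echo "--host, --port, --username and --password are required" >&2
    return 1
  fi
  # Options file first, then the data settings, then the connection details
  printf '%s\n' import -options_file "$CONFIG_FILE_PATH" \
    -input_file_type "$INPUT_FILE_TYPE" -input_file_path "$INPUT_FILE_PATH" \
    -host "$host" -port "$port" -username "$username" -password "$password"
  printf '%s' "$extra"
}
```

Supporting an additional flag is then a matter of adding it to the --fastload pattern; anything passed in --option_name=option_value form is already forwarded unchanged, so MLCP itself will reject options it does not recognize.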

Below is an example output you can expect when running the container.

Summary

That's it! You now have an MLCP Docker container to store pre-defined datasets that you can pass around to use for dev, testing, and QA activities. Enjoy!

Building an MLCP Base Image

Assuming you followed the steps outlined at the beginning of this blog to build the MLCP Image, you can easily modify them for creating a "Base" image. This way, you'll have an image with all the necessary libraries already baked in and the only thing you'll need to do is create a new image based on the Base image and include the data you need in the container with the necessary configurations.

  1. Go ahead and build the image just as described in Creating the Docker Image, but:
    1. Do not add any data to the container; comment out the COPY directive for the data folder.
    2. Make sure the mlcp_config.txt file is empty.
    3. Save the file and then run the docker build command. In this example I've called the image mlcp_base and given it the version tag v1. You can call it whatever you like and use whatever tagging convention makes sense for you. (Don't forget to prefix the image name with your registry if you're publishing to a registry.)

    docker build -t mlcp_base:v1 .

Now that we have a base image created, we can create data-specific MLCP images. Here's how.

  1. Create a new folder to hold your configuration and data to be copied to the container.
  2. In the folder you created, create a new file called Dockerfile and copy the code below into it.
  3. Make sure you update the environment variables as required, in particular ENV INPUT_FILE_TYPE, which should reflect the type of data you're packaging up.
  4. Create a folder inside the one you created in step 1 and then copy whatever data you want to be packaged in your container into this folder.
    1. Create a file to store your data-specific MLCP options (if required) and call it mlcp_config.txt.

      If you don't need any additional options, just comment out the COPY mlcp_config.txt directive.

    2. If you created an options file, modify it as required to support the necessary MLCP Import options you need based on the data you're using for this image and then save it.
    3. Now, build your new data-specific image:
    docker build -t [your registry/][your image name]:[image tag] .

    So for example, you could call it my_registry/mlcp_enron:latest or my_registry/mlcp:enron. You should notice the build process doesn't take as long (depending on the size of the dataset you're copying to the image). That's it! You now have a recipe for creating a data-specific MLCP image.
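For step 2 above, the data-specific Dockerfile could look roughly like the sketch below, again written as a shell heredoc. It is not the original listing: the /data and /opt/mlcp_config.txt destinations are assumptions and must match whatever paths your base image's INPUT_FILE_PATH and CONFIG_FILE_PATH point to, and archive is just an example value for INPUT_FILE_TYPE.

```shell
# Write a sketch of the data-specific Dockerfile into the current directory.
# Destination paths are assumptions; they must match the base image's settings.
cat > Dockerfile <<'EOF'
# Start from the base image built above
FROM mlcp_base:v1

# Copy the dataset and, if needed, the data-specific MLCP options
COPY data/ /data/
COPY mlcp_config.txt /opt/mlcp_config.txt

# Type of data packaged in this image
ENV INPUT_FILE_TYPE=archive
EOF
```

Because the libraries and MLCP already live in the base image, this build only adds the data and configuration layers.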
