When you start developing a new application on MarkLogic, or if you're using MarkLogic for the first time, one of your first tasks will typically be to load content - JSON, XML, Office files, anything really - into a MarkLogic database. MarkLogic supports several options for loading content, and MarkLogic Content Pump (mlcp) provides a number of useful features, including:
- You can load content from text files, delimited text files, aggregate XML files, Hadoop sequence files, and compressed ZIP/GZIP files
- You can apply a transform to modify and enrich content before inserting it into a database
- You can distribute ingestion across all nodes in a MarkLogic cluster via Hadoop
And it's easy to get started with mlcp and its command-line interface - just enter a few configuration parameters, and you're off and running.
However, many ETL processes are far more complex than just loading content from files. You may need to listen to a JMS queue for new messages, or periodically pull data from a relational database, or invoke a method in a third-party JAR file. You may also have complex routing rules that require splitting data records and later aggregating the results back together, or you may wish to send emails or generate notifications when certain events occur during your ETL process.
For scenarios such as these, developers often rely on integration tools such as Apache Camel. Camel provides a messaging and routing framework that implements many of the well-known Enterprise Integration Patterns that define solutions for common integration problems. Just as importantly, Camel provides dozens of components that allow you to easily consume data from and publish data to a variety of interfaces. With Camel, addressing the use cases mentioned above becomes a straightforward task.
So with mlcp providing a rich interface for loading content into MarkLogic, and with Camel providing a flexible integration framework, the question becomes - how can we integrate Camel and mlcp? Fortunately, this integration is fairly simple, as Camel supports writing custom components in Java and mlcp itself is written in Java.
To show a basic example of using a Camel component that can invoke mlcp, I've put together a GitHub repository that demonstrates a "hot folder" - i.e. a directory that Camel watches for new files; when a file shows up, it's ingested into MarkLogic via mlcp. This sample project uses Gradle for declaring a dependency on the sample Camel component and for running Camel via Spring.
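Since we're just watching a folder and then handing files off to mlcp in this example, our Camel routes file is very simple: a single route that consumes files from the fileUri endpoint and sends each one to the mlcpUri endpoint.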
The fileUri and mlcpUri properties are defined in the gradle.properties file - Gradle makes these easy to override. Of course, change the host/port/username/password as needed to match those of the XDBC server that mlcp will talk to. With the default fileUri, Camel will create a directory named "inbox" in your current directory if one does not yet exist.
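To make the hot-folder behavior concrete, here is a minimal plain-Java sketch of a single polling pass: create the inbox if it is missing, hand each file to a handler, then move it aside so it is not picked up again. This is not how Camel implements its file component, which does far more (locking, filtering, error handling). The class name HotFolderSketch, the method pollOnce and the ".done" subdirectory are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration of a "hot folder"; not part of Camel or mlcp.
final class HotFolderSketch {

    // Runs one polling pass over the inbox and returns the files that were handled.
    static List<Path> pollOnce(Path inbox, Consumer<Path> handler) {
        try {
            // Create the inbox (and a subdirectory for handled files) if missing.
            Files.createDirectories(inbox);
            Path done = Files.createDirectories(inbox.resolve(".done"));

            // Snapshot the regular files currently waiting in the inbox.
            List<Path> pending;
            try (Stream<Path> entries = Files.list(inbox)) {
                pending = entries.filter(Files::isRegularFile)
                                 .sorted()
                                 .collect(Collectors.toList());
            }

            List<Path> handled = new ArrayList<>();
            for (Path file : pending) {
                // In the real project this step would be the hand-off to mlcp.
                handler.accept(file);
                // Move the file aside so the next pass does not see it again.
                Files.move(file, done.resolve(file.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
                handled.add(file);
            }
            return handled;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling pollOnce(Paths.get("inbox"), file -> ...) on a timer would approximate the watching behavior; with Camel, the file endpoint takes care of the scheduling and the bookkeeping for you.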
The Camel component for mlcp knows how to parse mlcpUri into a series of arguments that will be passed into mlcp. And it supports all of the import arguments - this example just shows a collection being specified. You'll notice that there's no "input_file_path" parameter in the URI - this will be supplied automatically by the mlcp Camel component.
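As a rough illustration of that parsing step, the sketch below turns an endpoint URI plus a file path into an mlcp-style argument list. It is not the sample component's actual code. The URI shape (mlcp:host:port?name=value&...), the class name MlcpUriSketch and the method toImportArgs are assumptions made for this example. The option names it produces (-host, -port, -username, -password, -output_collections, -input_file_path) are standard mlcp import options.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the real component's parsing may differ.
final class MlcpUriSketch {

    // Converts an assumed "mlcp:host:port?key=value&..." URI into import arguments.
    static List<String> toImportArgs(String endpointUri, String inputFilePath) {
        String prefix = "mlcp:";
        if (!endpointUri.startsWith(prefix)) {
            throw new IllegalArgumentException("Expected an mlcp: URI, got " + endpointUri);
        }
        String rest = endpointUri.substring(prefix.length());
        if (rest.startsWith("//")) {
            rest = rest.substring(2); // tolerate the mlcp://host:port form
        }

        // Split "host:port" from the query string.
        int q = rest.indexOf('?');
        String hostPort = q < 0 ? rest : rest.substring(0, q);
        String query = q < 0 ? "" : rest.substring(q + 1);

        int colon = hostPort.lastIndexOf(':');
        if (colon <= 0 || colon == hostPort.length() - 1) {
            throw new IllegalArgumentException("Expected host:port, got " + hostPort);
        }

        List<String> args = new ArrayList<>();
        args.add("import"); // the mlcp command comes first
        args.add("-host");
        args.add(hostPort.substring(0, colon));
        args.add("-port");
        args.add(hostPort.substring(colon + 1));

        // Each query parameter becomes "-name value".
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected name=value, got " + pair);
            }
            args.add("-" + pair.substring(0, eq));
            args.add(URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8));
        }

        // The file path is supplied by the component, not by the URI.
        args.add("-input_file_path");
        args.add(inputFilePath);
        return args;
    }

    public static void main(String[] ignored) {
        String uri = "mlcp:localhost:8000?username=admin&password=admin"
                + "&output_collections=sample-collection";
        System.out.println(String.join(" ", toImportArgs(uri, "inbox/doc1.xml")));
    }
}
```

Passing every query parameter straight through as an option is what would let such a component support all of the import arguments without knowing about each one individually.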
Finally, we need a simple Gradle task that will fire up Camel. For brevity, I've omitted the dependencies, but you can view the real Gradle file to see everything, which includes an easy way to set up a MarkLogic application for testing.
We can now run "./gradlew camelRun". Camel will watch the "inbox" directory for new files, and every time a file appears, it will be loaded into MarkLogic and added to the "sample-collection" collection. You can customize this as you wish - for example, to load aggregate XML files or delimited text files, you would just add the necessary parameters to the mlcpUri property to configure mlcp appropriately. And now that Camel is able to send files to mlcp, you can utilize all the support that Camel has for Enterprise Integration Patterns and for connecting to a wide variety of interfaces.
I hope this gives you a sense of how straightforward it is to integrate mlcp with a framework such as Camel; similar frameworks exist, and as long as they provide some mechanism for adding new components, the integration should be very similar. The net benefit is that you can both quickly and easily load content into MarkLogic while reusing the tools that you're familiar with for implementing ETL processes.
If you have any stories you'd like to share about similar ETL work, or any other comments, please post them to this blog entry. If you have questions pertaining to ETL and MarkLogic, I encourage you to post those to Stack Overflow with a "marklogic" tag.