A Whirlwind Tour Through a Custom-Built Collector
InformationStudio is now available with MarkLogic 4.2. Want to see it? You'll find it on the Application Services page at http://localhost:8002/ (or, in MarkLogic 5, at http://localhost:8000/appservices/).
InformationStudio provides us with flows, which are essentially reusable pipelines used to load information into MarkLogic Server. A flow consists of three parts: a collector, a set of transformations, and a target database. The collectors and transformers are "pluggable", in that you can write your own. They become part of the InformationStudio UI automatically.
Today we're going to examine some of the finer details you'll want to pay attention to when building a custom collector. Developing InformationStudio collectors is already covered in the InformationStudio Developer's Guide, and a chapter on the Plugin Architecture can be found in the Application Developer's Guide as well. This post is in no way comprehensive, but it gets down to the nitty-gritty of which APIs are used where, so you can build your own custom collector that integrates seamlessly with InformationStudio.
Click 'New Flow' under InformationStudio on the Application Services page to view the Flow Editor.
InformationStudio already ships with two collectors, one for reading files from a directory on the filesystem, and another providing a browser dropbox so you can drag and drop files into your InformationStudio flow. In the Flow Editor, click the 'Change Collector' link in the Collect pane to see which collectors are available.
What if we want to collect information from an atom feed? Well, we can build our own collector and plug it right into the InformationStudio framework.
Spoiler Alert: We've actually provided sample collectors on GitHub to help jumpstart your development. There you'll find collector-feed.xqy for collecting the contents of an Atom feed. Rather than post all the code here, let's use this collector as the basis for our discussion of what to think about when building a custom collector, highlighting the essentials you need to know to be up and running quickly.
Copy collector-feed.xqy to marklogic-dir/Plugins, where marklogic-dir is the directory where MarkLogic is installed. All custom plugins go in this directory. Now go back to the Flow Editor, and once again click 'Change Collector'. You'll find the feed collector available for use. (You may have to refresh your browser.)
Click the Feed Collector button to select it. You'll then see the Feed Configuration page. There are two configuration fields. The first, the URI for a feed, is required. I've entered the URI for a feed from a popular blog I follow. The second is optional. I'm leaving it blank for this demonstration, but if we entered a date, the collector would only collect entries published since that date. Enter a URI and click 'Done'.
We now find ourselves back in the Flow Editor, and the Collect pane is providing the details for our configuration. If we were to click the 'Configure' button, we'd find ourselves back in the Feed Configuration pane.
All that's left to do is start collecting content. But before we do that, remember everything we've seen so far is completely customizable. We've started with the UI, but before we load content, let's dig into our collector code and see how our button in figure 4 and our configuration pane in figure 5 were generated as well as what's going to happen when we click 'Start Loading'.
Plugins in a Nutshell
A Plugin is a registered set of capabilities. A capability is just a handle to a function. InformationStudio uses these capabilities to resolve which function to call for a particular action in its interface, dynamically invoking the right function at runtime to update its UIs and act on the documents we collect. What do these capabilities and functions look like? Let's look at our Plugin module.
Open collector-feed.xqy in your favorite editor.
The first thing you'll notice is this module is declared in the feed namespace.
The namespace is arbitrary, for our own code organization and use. Next we see that we import three modules: plugin, info, and infodev. The plugin module allows us to register the plugin we're creating with the InformationStudio framework. Both info and infodev are sets of APIs provided to help us create custom plugins, as well as write code that accesses InformationStudio functionality programmatically, without having to go through the fancy UI. A great jumpstart introduction to the InformationStudio APIs can be found here.
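The prolog of the module looks roughly like the following sketch. The module namespace URI is our own choice; the import paths and namespaces shown are my recollection of the MarkLogic sample and should be checked against collector-feed.xqy itself:

```xquery
xquery version "1.0-ml";

(: the "feed" namespace is arbitrary — it just organizes our code :)
module namespace feed = "http://marklogic.com/appservices/sample/collector-feed";

import module namespace plugin = "http://marklogic.com/extension/plugin"
  at "/MarkLogic/plugin/plugin.xqy";
import module namespace info = "http://marklogic.com/appservices/infostudio"
  at "/MarkLogic/appservices/infostudio/info.xqy";
import module namespace infodev = "http://marklogic.com/appservices/infostudio/dev"
  at "/MarkLogic/appservices/infostudio/infodev.xqy";
```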
Jumping to the comments, we see the minimum set of capabilities required for any Plugin.
Now, the comment is actually a little misleading: the first required function for any collector is capabilities(), which returns a map containing the capabilities we want to register for our collector plugin. Take a look at the following function:
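Here's a sketch of its shape. The pattern of mapping a capability name to a function handle via xdmp:function() is the important part; the exact map keys are illustrative and should be matched to those used in the sample collector-feed.xqy:

```xquery
declare function feed:capabilities() as map:map
{
  let $capabilities := map:map()
  let $puts := (
    (: each capability is a handle to one of our functions :)
    map:put($capabilities, "model",  xdmp:function(xs:QName("feed:model"))),
    map:put($capabilities, "start",  xdmp:function(xs:QName("feed:start"))),
    map:put($capabilities, "cancel", xdmp:function(xs:QName("feed:cancel"))),
    map:put($capabilities, "view",   xdmp:function(xs:QName("feed:view"))),
    map:put($capabilities, "string", xdmp:function(xs:QName("feed:string")))
  )
  return $capabilities
};
```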
You'll see within feed:capabilities() that, in addition to model(), start(), and string(), we've also registered cancel() and view(). And that's it. With those five functions we have a custom collector. Let's look at each in detail.
The model is where we can save data that we want to use as parameters for our collection process. The child elements under plugin:data are completely up to you. We've named these to reflect how the feed collector is using them. We can add as many elements to our model as we like. Notice that the values here aren't used directly by our collector; we overwrite them in the configuration screen. But you could hard-code values, or add other model elements whose values aren't editable by users. So how was the configuration screen generated? And how was it able to populate the URI for our model?
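A model() sketch might look like this. The child element names (and the example feed URI) are purely illustrative here — the real sample uses its own names, and the exact wrapper element should be taken from collector-feed.xqy:

```xquery
declare function feed:model() as element()
{
  <plugin:data xmlns:plugin="http://marklogic.com/extension/plugin"
               xmlns:feed="http://marklogic.com/appservices/sample/collector-feed">
    <!-- child element names are entirely our own design -->
    <feed:uri>http://example.com/atom.xml</feed:uri>
    <feed:updated-since/>
  </plugin:data>
};
```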
The view is essentially a form. We can update the form contents to create fields for our user input, in our case, the Feed Configuration page. We associate the input with our model through name and id. We can also associate the input with a label. Labels can then be reused throughout our collector UI in InformationStudio. The labels here aren't reused anywhere in our collector example, but you could if you wanted to. Look at the value for the input, and you'll see it's an expression that evaluates our model.
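As a rough sketch, the view is XHTML whose inputs are wired to the model by name and id, with each input's value computed from the model. The function signature and binding convention here are assumptions; the sample collector-feed.xqy shows the real contract:

```xquery
declare function feed:view($model as element()) as element()
{
  <div xmlns="http://www.w3.org/1999/xhtml">
    (: label text is pulled from our string() capability :)
    <span>{feed:string("uri-label")}</span>
    <!-- name/id tie this input back to the uri element in our model -->
    <input type="text" name="uri" id="uri"
           value="{$model//*:uri/string()}"/>
  </div>
};
```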
All labels for display are captured here. Look at the key attribute for each label. You'll find that name and description populated the values for our button in figure 4 above. After we entered our URI for collection, the description was also used for the description in figure 6.
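One way to picture the string() capability: a lookup from a key to display text. The label keys and wrapper element below are illustrative, not the sample's literal markup:

```xquery
declare function feed:string($key as xs:string) as xs:string?
{
  let $labels :=
    <labels>
      <label key="name">Feed Collector</label>
      <label key="description">Collect the entries of an Atom feed</label>
      <label key="uri-label">Feed URI</label>
    </labels>
  (: return the display text registered under $key :)
  return $labels/label[@key eq $key]/string(.)
};
```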
Once we click 'Start Loading' on the flow editor, we'll see the progress of our load, and this button will be replaced with a 'Stop' button. Clicking stop calls our cancel function.
Here we see our first use of an infodev function. Setting the ticket status to "cancelled" cancels our collection, and InformationStudio stops any further processing of documents. This example is very simple, but if you created a much more complex collector, you could do any additional required cleanup here before cancelling your collection run.
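The cancel capability can be as small as a single call. The infodev:ticket-set-status name is the one the sample uses as I recall it; verify the exact signature against infodev.xqy:

```xquery
declare function feed:cancel($ticket-id as xs:string) as empty-sequence()
{
  (: any collector-specific cleanup would go here first :)
  infodev:ticket-set-status($ticket-id, "cancelled")
};
```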
We've seen how view() provided our configuration page and saved values to our model(), as well as how string() captured the labels we want displayed in the InformationStudio UI for our collector. We also saw the supporting function cancel(), which halts our run once it starts. And so here's where the collecting and document processing really begins: in our start() function.
For our collector, collecting is simply an xdmp:http-get() of the $uri supplied by our model. That's it. Well, not entirely. So let's conclude by looking at the particular info and infodev functions that will help us complete our collector.
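The fetch itself can be sketched like this. xdmp:http-get() is a real MarkLogic builtin (its result includes response metadata plus the body); the start() signature and the feed:collect() helper are hypothetical stand-ins for how the sample organizes this work:

```xquery
declare function feed:start($model as element(), $ticket-id as xs:string)
{
  let $uri := $model//*:uri/string()
  (: xdmp:http-get returns response metadata followed by the document body :)
  let $response := xdmp:http-get($uri)
  let $entries := $response[2]//*:entry
  return feed:collect($entries, $ticket-id)  (: hypothetical helper that batches and ingests :)
};
```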
Look back at figure 6. We saw how clicking 'Configure' will take us to a custom form we can build using our collector view() capability. Clicking 'Ingestion' takes us to a page of ingestion settings. That page is not configurable, but we'll want those settings for our collector. The above info and infodev functions get the 'Documents per Transaction' from the ingestion settings page for us.
You'll see in the code that we calculate the number of transactions we want for our collector by dividing the number of feed entries we are going to ingest by the 'Documents per Transaction' count. Next, we set the total documents and total number of transactions for our collector. These are going to help drive the nifty UI that gives us a count of how many out of the total have been loaded in our collector progress bar.
We batch our ingest into the database by transactions so we don't load everything all at once. This gives us a way to track progress in the InformationStudio progress bar, as well as a way to fail gracefully at a certain collection point and identify problem documents. To batch transactions for ingest, we save each transaction's worth of documents to its own map. We then loop through our sequence of transactions, calling infodev:transaction() for each map. The documents in each map are loaded in a single transaction to the database selected for the flow, as defined in the Flow Editor.
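The batching logic can be sketched as a fragment inside start(). This assumes $entries, $ticket-id, and $docs-per-txn (read from the flow's ingestion policy) are already bound; the ticket-total setters are named as I recall them from the sample, and the argument order of infodev:transaction() is a guess — collector-feed.xqy shows the real signatures. The document URIs generated here are also purely illustrative:

```xquery
let $total := fn:count($entries)
let $txns := xs:integer(fn:ceiling($total div $docs-per-txn))
(: these totals drive the progress bar in the Flow Editor :)
let $set := (
  infodev:ticket-set-total-documents($ticket-id, $total),
  infodev:ticket-set-total-transactions($ticket-id, $txns)
)
for $t in 1 to $txns
let $batch := map:map()
let $fill :=
  for $entry in fn:subsequence($entries, ($t - 1) * $docs-per-txn + 1, $docs-per-txn)
  return map:put($batch, fn:concat("/feed/", xdmp:random(), ".xml"), $entry)
(: one database transaction per map of documents :)
return infodev:transaction(xdmp:function(xs:QName("feed:ingest-batch")), $ticket-id, $batch)
```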
You probably noticed that we pass the function that performs the document ingest as a parameter to infodev:transaction(). That callback is defined in our collector-feed.xqy and calls a single function, infodev:ingest().
Why didn't we call infodev:ingest() directly? We could have. But by specifying this callback function, we gave ourselves a way to do additional processing for each document ingested. Before calling infodev:ingest(), we could analyze our feed entries and transform them, or augment them with additional queries based on their contents, before loading the final documents we curate into our database. Callback functions are awesome: very flexible and very useful.
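The callback's shape might look like the sketch below. Both the callback's parameters and infodev:ingest()'s signature are assumptions here (the real infodev:ingest() may take additional arguments such as policy deltas and the ticket id); the point is simply that the callback is our hook for per-document work:

```xquery
declare function feed:ingest-batch($uri as xs:string, $entry as node())
{
  (: transform or augment $entry here before loading, if desired :)
  let $final := $entry
  return infodev:ingest($final, $uri)
};
```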
When collection is complete, we set our ticket status to 'completed'.
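That final step mirrors the cancel case, a single status update (verify the exact function name against the sample):

```xquery
infodev:ticket-set-status($ticket-id, "completed")
```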
The very last thing we do is register our plugin.
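Registration ties the capabilities map to the plugin file. The second argument shown, the plugin's file name, follows my reading of the sample and should be confirmed against collector-feed.xqy:

```xquery
(: hand our capability map to the InformationStudio plugin framework :)
plugin:register(feed:capabilities(), "collector-feed.xqy")
```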
Congratulations! You are now masters of the Collector universe. Go forth and Collect! On GitHub we've provided additional examples for loading a directory of .csv files, as well as extracting .zip files from a directory of zips. We've also provided a simple Twitter collector. They all have the same basic capabilities, though you'll see some don't use transactions, and the Twitter collector has no configuration screen. These are intended to jumpstart custom collector development and spark some ideas about the kinds of things you might want to collect into MarkLogic using InformationStudio.
For more details on custom collectors and the plugin framework be sure to check out:
The Plugin chapter of the Application Developer's Guide.
Creating Custom Collectors and Transformers in the Information Studio Developer's Guide.
For more information on InformationStudio, check out the following: