Jumpstart custom scripting with the Information Studio APIs

Colleen Whitney
Last updated September 28, 2012

Using the friendly graphical interface of Information Studio (accessible from the Application Services page at http://localhost:8000/appservices/), you can quickly create or configure a database, load content, or even transform your documents on the way in.  But behind that friendly interface, you'll find a set of rich APIs that offer even more power and flexibility.

So if a friendly user interface isn't your cup of $beverage, or you need to do some heavy-duty scripting or scheduling, this tutorial is for you.

The possibilities are almost endless, so I'll walk you through three short scripts:

  • Simple load to a new database
  • Repeated loads with customized, extended policy
  • Scheduled loads using an Information Studio flow

Script 1: Simple load to a new database

Let's start with the simplest possible scenario.

You've got a folder containing a few hundred sample XML files you plan to use for application development.  You want to quickly create a database (with indexing optimized for wildcard and position queries) and load them into MarkLogic Server.

Here's a simple script that gets it all done in seconds, without touching the Admin panel. Fire up Query Console and give it a try.

Query

xquery version "1.0-ml";
import module namespace info = "http://marklogic.com/appservices/infostudio"
at "/MarkLogic/appservices/infostudio/info.xqy";

let $db-name := "Sampledata"
let $create-db := info:database-create($db-name)
let $configure-database :=
    info:database-set-feature($db-name,
    <settings xmlns="http://marklogic.com/appservices/infostudio">
        <wildcard>true</wildcard>
        <position>true</position>
        <reverse>false</reverse>
    </settings>)
return info:load("/Users/elenz/work/infoTutorial",(),(),$db-name)

Result

/tickets/ticket/12012448215552290795

The query creates and configures a new database named "Sampledata", and initiates a separate process that loads content into that database.  It returns the URI for a ticket, which is a handle for tracking the progress of the loading process. You can run a second query to check the status of your load, passing in your new ticket id.

Query

xquery version "1.0-ml";
import module namespace info = "http://marklogic.com/appservices/infostudio"
at "/MarkLogic/appservices/infostudio/info.xqy";

info:ticket("/tickets/ticket/12012448215552290795")

Result

<ticket id="/tickets/ticket/12012448215552290795"
  timestamp="2011-11-13T16:05:01.015028-08:00"
  xmlns="http://marklogic.com/appservices/infostudio">
  <status>completed</status>
  <start-time>2011-11-13T16:04:09.959938-08:00</start-time>
  <ticket-expiration>2011-12-13T16:04:09.959938-08:00</ticket-expiration>
  <database>Sampledata</database>
  <total-documents>440</total-documents>
  <total-transactions>5</total-transactions>
  <time-consumed>PT8.974743S</time-consumed>
  <documents-processed>440</documents-processed>
  <transactions-completed>5</transactions-completed>
  <errors>0</errors>
  <percent-complete>100</percent-complete>
  <annotation><directory xmlns="">/Users/elenz/work/infoTutorial</directory>
  </annotation>
</ticket>

This query returns a ticket with status information.  This ticket shows that 440 documents were processed in about 9 seconds, and that the process is complete.  Now, if you refresh your Query Console window, you'll see your new database in the drop-down "Content Source" menu.  You can select it, then run queries or use Query Console's "Explore" button to verify that the documents are there and see how they're stored by default.  Below is a simple query that returns the URI and collections for the first 10 documents; notice how the URIs are formed, and note that each document is marked with the ticket id as a collection.

for $i in collection()[1 to 10]
let $uri := xdmp:node-uri($i)
return <doc uri="{$uri}"
            collections="{xdmp:document-get-collections($uri)}"/>

Quick tips:

  • This load completed with no errors (the ticket reports <errors>0</errors>).  If there are errors, use info:ticket-errors() to access a paginated error report; optional arguments control filtering, sorting, start index and page length, making it possible to assemble customized error reports for this ticket.  In addition, info:ticket-detail() gives access to details for a specific error by id.
  • You can delete the database and its forests using info:database-delete(); an optional second argument controls whether or not to keep the forest data on disk.
  • Other functions for working with tickets are info:tickets() and info:ticket-delete().
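If a load does report errors, a quick error check might look like the sketch below.  The exact argument list for info:ticket-errors() (filter, sort, start index, page length) is an assumption here; check the API documentation for the signature in your version.

xquery version "1.0-ml";
import module namespace info = "http://marklogic.com/appservices/infostudio"
    at "/MarkLogic/appservices/infostudio/info.xqy";

(: Fetch a page of errors for the ticket from Script 1; the trailing
   arguments (filter, sort, start, page length) are assumed here :)
info:ticket-errors("/tickets/ticket/12012448215552290795", (), (), 1, 10)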

Script 2: Repeated loads with customized, extended policy

Imagine a slightly more complicated scenario.

You've stored documents in different folders on your hard drive, each containing sample XML files representing articles from a separate publication you plan to use as content for application development.  You want to load these documents using a standard way of handling errors and assigning URIs, while also annotating each document with a unique collection URI.

Setting a policy

Before I run the loading script, I'll need to create a reusable policy.  I can set policy for many aspects of document loading, including how I want errors to be handled (continue? skip? stop the load?), how long to keep tickets on hand, how to structure URIs of incoming documents, what permissions to assign by default, and how many documents to process per transaction.  I can easily reset that policy at any time, specifying any options I want to customize (and leaving the rest to be filled in with sensible defaults).

So let's start by creating a policy named "articles" that:

  • Skips documents if the URI exists in the database
  • Constructs URIs based on the current date and time
  • Adds a generic "articles" collection to each document
  • Gives permission for users with the "editor" role to read and update each document (note an "editor" role must exist to run this query successfully).

Query

xquery version "1.0-ml";
import module namespace info = "http://marklogic.com/appservices/infostudio"
at "/MarkLogic/appservices/infostudio/info.xqy";

info:policy-set("articles",
<options xmlns="http://marklogic.com/appservices/infostudio">
    <overwrite>skip</overwrite>
    <uri>
        <literal>/articles/</literal>
        <literal>{xdmp:strftime("%Y-%m-%d",current-dateTime())}</literal>
        <literal>/</literal>
        <filename/>
        <dot-ext/>
    </uri>
    <collection>articles</collection>
    <permission>
        <role>editor</role>
        <capability>update</capability>
    </permission>
    <permission>
        <role>editor</role>
        <capability>read</capability>
    </permission>
</options>)

Result

empty sequence

Now we're ready to load documents, referencing the "articles" policy but customizing by adding one more collection at load time.

Query

xquery version "1.0-ml";
import module namespace info = "http://marklogic.com/appservices/infostudio"
at "/MarkLogic/appservices/infostudio/info.xqy";

let $base-directory := "/Users/elenz/"
let $pub := "pub1"
let $policy-deltas :=
    <options xmlns="http://marklogic.com/appservices/infostudio">
        <collection add="true">{$pub}</collection>
    </options>
return info:load(concat($base-directory,$pub),"articles",$policy-deltas,"Sampledata")

Result

/tickets/ticket/1235182800700566532

Below, I ran the same simple query as before to see the URI and collections for the first 10 documents.  Notice that the URIs are now formed using the rule defined in our policy, and that the documents carry both the "articles" and "pub1" collections as well as the ticket id.

for $i in collection()[1 to 10]
let $uri := xdmp:node-uri($i)
return <doc uri="{$uri}"
            collections="{xdmp:document-get-collections($uri)}"/>

Quick tips:

  • In this example, we used a named policy.  Using the policy name "default", you can set a default policy that is used when no policy name is passed in.
  • Other functions for working with policies are info:policy-names(), info:policy(), and info:policy-delete(). 
  • If you try to save an invalid policy (e.g. an invalid element name, or the role referenced in a permission doesn't exist), you'll get an error, with information about how to correct it.
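To confirm what was saved, you can read a policy back by name.  A minimal sketch, assuming info:policy-names() returns the list of saved names and info:policy() takes a single policy name:

xquery version "1.0-ml";
import module namespace info = "http://marklogic.com/appservices/infostudio"
    at "/MarkLogic/appservices/infostudio/info.xqy";

(: List saved policy names, then retrieve the "articles" policy
   we created above :)
(info:policy-names(),
 info:policy("articles"))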

Script 3: Scheduled loads using an Information Studio flow

The loading, ticketing and policy APIs introduced above are solid building blocks for getting started with simple processes that load documents from a directory.  But loading processes are often more complicated, involving pre-processing or transformations to modify or enrich the content as it is loaded.

Information Studio is a graphical tool for building those more complex processes, or flows.  (Take a look at the 5-minute Information Studio Walkthrough if you haven't tried it yet.)  It enables you to configure a pipeline of transformation steps to run on your content, selecting from a palette of common transformation tasks (including custom XSLT or XQuery steps) or writing your own Transformer plug-in.  You can use built-in Collectors to load content from a directory or your desktop, or write a custom Collector plug-in that can pre-process your files before inserting them.

Once you've created a flow, it's easy to imagine wanting to run that process in contexts outside of the Information Studio interface.  For example, you might want to wire up a "start" button in a custom application built for your end users, trigger the load on some other event, or schedule the process to run every day or every hour.   

Let's imagine that you've already created a flow and named it "Myflow".

Information Studio flow

You've decided that you want to run this flow nightly.  We'll start by writing a very simple main module that starts "Myflow", the equivalent of pressing the "Start Flow" button on the Flow Editor.

(: Main module, named start-flow.xqy :)

xquery version "1.0-ml";
import module namespace info = "http://marklogic.com/appservices/infostudio"
    at "/MarkLogic/appservices/infostudio/info.xqy";

let $flow-id := info:flow-id("Myflow")
let $active-tickets :=  info:flow-tickets($flow-id)
return
    if ($active-tickets)
    then ()
    else info:flow-start($flow-id)

To run this script regularly, you'll need a scheduled task to run it (the script assumes that you have the admin role).  Here's an example, adding a daily task that runs at 2 am on all hosts, to be run as the "EditorialAdmin" user.

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $task :=
    admin:group-daily-scheduled-task(
        "/start-flow.xqy",
        "/space/mymodules",
        1,
        xs:time("02:00:00"),
        xdmp:database("Sampledata"),
        0,
        xdmp:user("EditorialAdmin"),
        ()
    )
let $add := admin:group-add-scheduled-task($config,xdmp:group(),$task)
return admin:save-configuration($add)

Quick tips:

  • When you configure a flow, you're configuring a policy for the flow in the background.  You can retrieve that policy using:
    info:policy(info:flow-policy($flow-id)).
  • You can call info:flow-cancel() to cancel all active tickets associated with a flow.
  • See "Scheduling Tasks" in the Administrator's Guide, or the Scripting Administrative Tasks Guide, for more detail on configuring scheduled tasks.
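For example, canceling a flow's in-flight work might look like the sketch below.  I'm assuming info:flow-cancel() takes the flow id, mirroring info:flow-start(); verify the signature in the API documentation.

xquery version "1.0-ml";
import module namespace info = "http://marklogic.com/appservices/infostudio"
    at "/MarkLogic/appservices/infostudio/info.xqy";

(: Cancel all active tickets for "Myflow"; the flow-id argument
   is assumed to match info:flow-start() :)
let $flow-id := info:flow-id("Myflow")
return info:flow-cancel($flow-id)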

A Few Final Hints

To use the Information Studio API, a user must be assigned the infostudio-user role by the administrator.  Using the API, users with this role can create, configure and delete databases, create loading policies, check ticket status, and load content.  This role should be assigned with care.

The info:load() function can traverse and load directories containing millions of documents over many hours.  However, it does not maintain state; if the server restarts mid-load, the ticket is marked "aborted" and does not resume on restart.  The load can be restarted with an overwrite policy of "skip"; you might also choose to set boundaries on the size of the load to reduce risk.
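Restarting an aborted load is just another call to info:load() with an overwrite setting of "skip", so documents that made it in before the restart are left alone.  A sketch, reusing the directory and database from Script 1:

xquery version "1.0-ml";
import module namespace info = "http://marklogic.com/appservices/infostudio"
    at "/MarkLogic/appservices/infostudio/info.xqy";

(: Re-run the load, skipping any document whose URI already exists;
   the deltas override whatever the effective policy specifies :)
let $policy-deltas :=
    <options xmlns="http://marklogic.com/appservices/infostudio">
        <overwrite>skip</overwrite>
    </options>
return info:load("/Users/elenz/work/infoTutorial", (), $policy-deltas, "Sampledata")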

These APIs generate ticket documents and associated progress and error data, all of which are stored in the App-Services database.  By default, these are marked to expire after 30 days.  If you have created any Information Studio flows, a scheduled task is installed for you that will clean up expired tickets nightly.  If you have not created any flows, you may need to do some similar cleanup using info:ticket-delete() on inactive tickets.
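A manual cleanup pass might look like the sketch below.  It assumes info:tickets() returns ticket elements shaped like the one shown earlier (an id attribute and a status child in the infostudio namespace), and that info:ticket-delete() takes a ticket id; confirm both against the API documentation.

xquery version "1.0-ml";
import module namespace info = "http://marklogic.com/appservices/infostudio"
    at "/MarkLogic/appservices/infostudio/info.xqy";

(: Delete tickets for loads that are no longer running; the return
   shape of info:tickets() is assumed here :)
for $t in info:tickets()
where $t/info:status = ("completed", "aborted")
return info:ticket-delete($t/@id)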

Other Resources

For a quick overview of Information Studio, try the 5-minute Walkthrough of Information Studio.

For a deeper dive, read the Information Studio Developer's Guide.

Check out the API documentation.
