The Royal Road to auto-applying XSLT

by Evan Lenz

Last year on the developer mailing list, David Sewell asked:

To paraphrase Euclid, I'm guessing there's no royal road to auto-applying XSLT to a document at load time into the database?

This article is meant to provide a shortcut for implementing exactly this use case. (I'm responding a little late to the original post, but perhaps this will be of help to others of you.) It also provides a simple tutorial for getting CPF up and running. Even if you don't have a need for this right now, I encourage you to work through the tutorial.

MarkLogic customers often need to transform a document on load. Information Studio provides the easiest way to do this. Whether you use the browser-based UI or the infostudio API, you can create a flow, set up a transformation step, and load documents. There are a number of built-in transformation steps to choose from, as shown in the UI's dialog below. You can also perform a custom transformation using arbitrary XSLT or XQuery code. XSLT is often the preferred method, since transformation (the "T" in "XSLT") is where XSLT shines.

[Screenshot: the "Select a Transform" dialog, listing these transformation steps:
  Delete: Delete an element or attribute.
  Filter Documents: Extract text and metadata from binary documents and store them in properties.
  Normalize Dates: Fix dates in an element or attribute.
  Rename: Rename an element or attribute.
  Schema Validation: Validate a document against existing schemas.
  XQuery: Custom XQuery transformation.
  XSLT: Custom XSLT stylesheet.]

So if you're exclusively using InfoStudio, there's your answer: use InfoStudio. No need to read any further.

But what if you want a document to be transformed no matter how you update it? For example, let's say you already have various workflows for loading documents, and Information Studio is only one of them. Other times, documents are inserted directly using xdmp:document-insert(), loaded using XQSync, or dragged and dropped using WebDAV. An InfoStudio transformation only gets applied if the document is loaded by InfoStudio. Is there a way to ensure that a custom transformation gets applied regardless of how you load a document? Yes. You just have to use a different framework that comes with MarkLogic: the Content Processing Framework (CPF).

CPF is a powerful, flexible framework for managing document state changes. It provides mechanisms for running arbitrary code triggered by various state changes such as documents being created, updated, and deleted. Given such flexibility, you have the power to define quite complex pipelines of transformations based on state changes. But often you don't need a lot of complexity. What if all you want is to apply an XSLT transformation whenever a document gets added or changed? Do you have to learn about all the ins and outs of CPF in this case? Wouldn't it be nice to just follow a quick recipe and save learning about CPF for another day? That's what this article is for. (And if you learn something about CPF along the way, so be it.)

Okay, let's get started. Below are the overall steps that you'll need to take. I'll walk you through each of these to make them go as fast as possible:

  1. Install CPF.
  2. Define the CPF domain.
  3. Write and load your XSLT.
  4. Write and load your CPF pipeline.
  5. Attach the pipeline to the domain you configured.
  6. Load documents and see them automatically get transformed.

Install CPF

  1. Go to the admin interface (http://localhost:8001) and navigate to the configuration page for your database (I'll be using the "Documents" database in this example):

    [Screenshot: admin interface navigation tree, with the "Documents" database listed under Databases]

  2. Check to see if a "triggers database" (where CPF will get installed) has been selected for your database. If one has already been selected, then you can skip this step. But if it says "(none)", then select the "Triggers" database (one that has been pre-defined for this purpose) and then click "ok":

    [Screenshot: database configuration page for "Documents", showing the security database ("Security"), schema database ("Schemas"), and triggers database fields]

  3. Navigate to the "Content Processing" menu for your database:

    [Screenshot: navigation tree under the "Documents" database, with the "Content Processing" item selected]

  4. Install CPF into your database by clicking the "Install" tab as instructed:

    [Screenshot: Content Processing summary page, reading "Content Processing is not installed on Documents. To install Content Processing, click the Install tab above."]

  5. On the next screen, choose "false" for the "enable conversion" option and click "install":

    [Screenshot: content processing resource installation form for the "Documents" database, with the "enable conversion" option set to "false"]

  6. Confirm by clicking "ok":

    [Screenshot: confirmation dialog stating that Content Processing will be installed for the "Documents" database without conversion]

Define the CPF domain

Here you'll specify which documents your on-load XSLT transformation will be applied to. These are identified using a domain.

  1. Navigate to the "Default [your-db-name]" menu item in the admin interface, under Content Processing->Domains:

    [Screenshot: navigation tree showing the default domain for "Documents" under Content Processing->Domains]

  2. Here we are going to re-configure the default domain to narrow the scope of documents that get transformed on load. Change the values for the highlighted fields shown below. You can choose to define the domain of documents using a directory or collection (or just one document URI). In this case, every document in the "/docs-to-transform" database directory and all of its sub-directories will be a part of the applicable domain:

    [Screenshot: Domain Configuration form with domain name "Docs to transform", document scope "directory", uri "/docs-to-transform/", depth "infinity", and modules database "Modules"]
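
If you'd like to double-check the domain from Query Console rather than the admin UI, something like the following might work. It is only a sketch: it assumes the CPF domains library module that ships with MarkLogic, that your triggers database is "Triggers", and that you named the domain "Docs to transform" as in this tutorial.

```xquery
xquery version "1.0-ml";

import module namespace dom = "http://marklogic.com/cpf/domains"
  at "/MarkLogic/cpf/domains.xqy";

(: Run with "Content Source" set to the triggers database,
   since that is where CPF keeps its domain definitions.
   The domain name is the one assumed in this tutorial. :)
dom:get("Docs to transform")
```

The result should be the domain's stored configuration, where you can confirm the directory scope you entered.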

Write and load your XSLT

In this step you'll upload your XSLT script to the modules database that is configured for your CPF domain. If you didn't make any changes to the default "modules" field in your domain configuration (see above), that means you'll be loading it to the "Modules" database. First of all, let's assume your stylesheet is named onload.xsl, is located on your local filesystem, and looks like this:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 
  <!-- By default, copy everything unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
 
  <!-- Add a comment to the top of the doc -->
  <xsl:template match="/">
    <xsl:comment>EVAN WAS HERE</xsl:comment>
    <xsl:next-match/>
  </xsl:template>
   
</xsl:stylesheet>

The above script leaves the document unchanged except to insert a comment at the top. In practice, you'll want to do something more useful and specific to your application.
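
If you want to preview what the stylesheet does before wiring it into CPF, you could run it against a throwaway document in Query Console. The following is a minimal sketch, assuming MarkLogic's xdmp:xslt-eval function; the <hello> input document is made up for illustration and is not part of the original setup.

```xquery
xquery version "1.0-ml";

(: The same stylesheet as above, inlined so that nothing needs
   to be loaded into a modules database first. :)
let $stylesheet :=
  <xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- By default, copy everything unchanged -->
    <xsl:template match="@* | node()">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:template>

    <!-- Add a comment to the top of the doc -->
    <xsl:template match="/">
      <xsl:comment>EVAN WAS HERE</xsl:comment>
      <xsl:next-match/>
    </xsl:template>

  </xsl:stylesheet>

(: Hypothetical sample input :)
let $input := document { <hello>world</hello> }

return xdmp:xslt-eval($stylesheet, $input)
```

The result should be the sample document with the comment added ahead of the <hello> element.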

  1. To follow along in this tutorial, save the stylesheet above as onload.xsl on your local filesystem.
  2. Open up Query Console (on port 8000 in your browser). If MarkLogic is running on the same machine, you would navigate to http://localhost:8000/qconsole.
  3. Select the "Modules" database in the "Content Source" drop-down:

    [Screenshot: Query Console "Content Source" drop-down with "Modules" selected]

  4. Copy and paste the following script into Query Console:
     
  5. Edit the "/path/to/onload.xsl" to the location where you saved it on your local filesystem.
  6. First select "Text" and then click the "Run" button:

    [Screenshot: Query Console toolbar showing the "Run" button and the XML, HTML, Text, and Profile options, with "Text" selected]

Assuming the query ran without error, your XSLT module is now ready to go.
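
The script referred to in step 4 does not appear in this copy of the article. A minimal sketch of what such a script could look like follows, assuming MarkLogic's xdmp:document-load function. The target URI "/onload.xsl" is inferred from the module path that the pipeline in the next section uses, together with a modules root of "/"; the returned message string is purely illustrative.

```xquery
xquery version "1.0-ml";

(: Run with "Content Source" set to the Modules database.
   Replace /path/to/onload.xsl with the location of the file
   on the MarkLogic host's filesystem. :)
xdmp:document-load("/path/to/onload.xsl",
  <options xmlns="xdmp:document-load">
    <uri>/onload.xsl</uri>
  </options>),

(: Illustrative message so that the "Text" view has something to show :)
"Loaded /onload.xsl"
```

If CPF will run as a non-admin user, the loaded module may also need suitable permissions; that detail is left out of this sketch.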

Write and load your CPF pipeline

The following pipeline is configured to apply an XSLT stylesheet against a document in the applicable domain whenever that document is added or updated. The XSLT transform is only applied if the document is an XML document (as opposed to text or binary):

<pipeline xmlns="http://marklogic.com/cpf/pipelines">
 
  <pipeline-name>Apply XSLT transform on load</pipeline-name>
  <pipeline-description>XSLT transformation applied to new and updated XML documents</pipeline-description>
  <success-action>
    <module>/MarkLogic/cpf/actions/success-action.xqy</module>
  </success-action>
  <failure-action>
    <module>/MarkLogic/cpf/actions/failure-action.xqy</module>
  </failure-action>
 
  <state-transition>
    <annotation>
      When a document is FIRST INSERTED, apply the XSLT.
    </annotation>
    <state>http://marklogic.com/states/initial</state>
    <on-success>http://marklogic.com/states/done</on-success>
    <on-failure>http://marklogic.com/states/error</on-failure>
    <execute>
      <!-- Only apply XSLT against XML documents -->
      <condition>
        <module>/MarkLogic/cpf/actions/node-type-condition.xqy</module>
        <options xmlns="/MarkLogic/cpf/actions/node-type-condition.xqy">
          <format>xml</format>
        </options>
      </condition>
      <!-- Apply this XSLT -->
      <action>
        <module>/onload.xsl</module>
      </action>
    </execute>
  </state-transition>
 
  <state-transition>
    <annotation>
      When a document is UPDATED, apply the XSLT.
    </annotation>
    <state>http://marklogic.com/states/updated</state>
    <on-success>http://marklogic.com/states/done</on-success>
    <on-failure>http://marklogic.com/states/error</on-failure>
    <execute>
      <!-- Only apply XSLT against XML documents -->
      <condition>
        <module>/MarkLogic/cpf/actions/node-type-condition.xqy</module>
        <options xmlns="/MarkLogic/cpf/actions/node-type-condition.xqy">
          <format>xml</format>
        </options>
      </condition>
      <!-- Apply this XSLT -->
      <action>
        <module>/onload.xsl</module>
      </action>
    </execute>
  </state-transition>
     
</pipeline>

  1. Once again, to follow along, save the pipeline above as pipeline.xml on your filesystem.
  2. In the MarkLogic admin UI, navigate to the "Pipelines" menu item under "Content Processing":

    [Screenshot: navigation tree under the "Documents" database, showing "Domains" and "Pipelines" under "Content Processing"]

  3. Click the "Load" tab at the top of the page:

    [Screenshot: tab bar at the top of the Pipelines page]

  4. Enter the path to the directory on your filesystem where you saved the pipeline.xml file and click "ok":

    [Screenshot: pipeline load form with a required "directory" field (directory containing pipelines), a "filter" field set to ".xml", and a "source" field set to "(file system)"]

  5. You should see your pipeline XML file listed. Confirm that you want to load it by clicking "ok":

    [Screenshot: "Pipeline Load" confirmation, reading "The following pipeline files will be loaded: /path/to/pipeline-dir/pipeline.xml"]

  6. To confirm that the pipeline has been loaded, navigate to the name of your pipeline under "Content Processing"->"Pipelines":

    [Screenshot: navigation tree showing the loaded pipeline listed under Content Processing->Pipelines, alongside "Calais Entity Enrichment Sample"]

Attach your pipeline to your domain

Now that you've configured both the domain (which set of documents you want automatically transformed) and the pipeline (the description of the transform itself), you need to associate the two with each other.

  1. Navigate to the "Pipelines" menu item under your domain ("Docs to transform") in the admin UI:

    [Screenshot: navigation tree showing "Pipelines" under the "Docs to transform" domain in Content Processing->Domains]

  2. Find your pipeline ("Apply XSLT transform on load"), click the checkbox next to it, and then click "ok":

    [Screenshot: Domain Pipeline Configuration page listing pipelines, including "Status Change Handling", "Alerting", "Apply XSLT transform on load", "Calais Entity Enrichment Sample", and "Conversion Processing"]

  3. The pipeline will move up to the "attached" section, indicating that you've successfully attached it to this domain:

    [Screenshot: Domain Pipeline Configuration page with "Apply XSLT transform on load" now listed in the "attached" section]
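
The same attachment can probably be scripted instead of clicked. The following is a sketch rather than a tested recipe: it assumes the CPF domains and pipelines library modules that ship with MarkLogic, and it uses the domain and pipeline names from this tutorial.

```xquery
xquery version "1.0-ml";

import module namespace dom = "http://marklogic.com/cpf/domains"
  at "/MarkLogic/cpf/domains.xqy";
import module namespace p = "http://marklogic.com/cpf/pipelines"
  at "/MarkLogic/cpf/pipelines.xqy";

(: Run with "Content Source" set to the triggers database.
   Looks up the loaded pipeline by name and attaches it, by ID,
   to the domain configured earlier. Both names are the ones
   assumed in this tutorial. :)
dom:add-pipeline(
  "Docs to transform",
  p:get("Apply XSLT transform on load")/p:pipeline-id)
```

Scripting this step is mainly useful if you need to repeat the setup across several environments; for a one-off, the admin UI steps above are just as good.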

Load some documents and watch them get transformed

Everything is all set up now. The only thing that remains is to test it out.

  1. Back in Query Console, change the "Content Source" to your content database ("Documents" in our case):

    [Screenshot: Query Console "Content Source" drop-down]

  2. Enter and run the following query. It inserts a new document into the "/docs-to-transform" database directory, which means it should automatically get transformed:
     
  3. Now let's look at the document by running this query:
  4. You should see that the "EVAN WAS HERE" comment was added to the top of the document:

    [Screenshot: Query Console result pane showing the transformed document]

  5. Now take a look inside this document's properties. Here you'll see that this is where CPF manages your document's state. If something went wrong when it tried to apply the XSLT pipeline, then you would see the error information here. Run the following query:
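
The queries referred to in steps 2, 3, and 5 do not appear in this copy of the article. A minimal sketch of what they could look like follows, assuming the standard xdmp:document-insert, fn:doc, and xdmp:document-properties functions. The document URI and content are hypothetical; any XML document under "/docs-to-transform/" should behave the same way.

```xquery
xquery version "1.0-ml";
(: Step 2: insert a new document into the domain's directory.
   The URI and content are made up for this example. :)
xdmp:document-insert("/docs-to-transform/hello.xml", <hello>world</hello>)
;

xquery version "1.0-ml";
(: Step 3: read the document back to see the transformed result. :)
fn:doc("/docs-to-transform/hello.xml")
;

xquery version "1.0-ml";
(: Step 5: inspect the properties, where CPF records the document's state. :)
xdmp:document-properties("/docs-to-transform/hello.xml")
```

It is best to run these one at a time, as the steps describe: CPF does its work after the insert has committed, so a read issued immediately afterwards may still show the untransformed document.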

That's it!

Now you have a cheat sheet for setting up an auto-applied XSLT stylesheet. If you've made it this far, then you also have some basic familiarity with CPF concepts like "domain" and "pipeline," which should stand you in good stead should you decide to dig deeper into using CPF.
