Getting OOXML into MarkLogic

by Pete Aven

Part 3 in a series on MarkLogic Server and Office 2007

Categories: Office 2007
Altitude: 1,000 feet

In part 1 we introduced Office 2007, the OOXML formats, and constructed a simple Word document using XQuery and MarkLogic Server. In part 2, we continued to generate Office 2007 documents, creating an Excel spreadsheet from the content found within an XHTML table on the web.

A great way to become familiar with OOXML documents is by creating them, and with XQuery and MarkLogic built-ins it's simple to create Office 2007 documents where none existed previously. Most likely, though, you already have Office documents you want to store somewhere, or maybe you're just starting to use Office 2007 and want to ensure all your documents are stored in a repository that lets you analyze your content and keeps it conveniently available for re-use. We've already seen some functions for loading documents one at a time and "as-is", but this week we introduce WebDAV and the Content Processing Framework. WebDAV will provide us with a convenient and intuitive way of saving our Office 2007 documents. The Content Processing Framework will give us a way to manipulate that content as soon as we've saved it. Using both of them, we'll save our Office 2007 documents to MarkLogic Server and ensure we have our documents available to us how and when we need them.

As a refresher, we know that we can load a single document using xdmp:document-load().


     xdmp:document-load("C:\HelloWorld.docx",
                         <options xmlns="xdmp:document-load">
                            <uri>/HelloWorld.docx</uri>
                         </options>) 


We also know that we can then use MarkLogic built-in zip utilities to access the separate XML documents contained within the OOXML package.


     xdmp:zip-get(doc("/HelloWorld.docx"), "word/document.xml")
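
If you're not sure which part names xdmp:zip-get() will accept, a small sketch like the following lists them. It assumes /HelloWorld.docx was loaded as shown above, and it uses the *:part wildcard only so that no namespace declaration is needed.

```xquery
(: List the name of every part in the package's manifest.
   Assumes /HelloWorld.docx is already in the database. :)
for $part in xdmp:zip-manifest(doc("/HelloWorld.docx"))/*:part
return fn:string($part)
```

For a Word document, the list should include word/document.xml, the part retrieved above.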

And we know we can loop through the manifest of an OOXML document and save the XML pieces in MarkLogic Server using xdmp:zip-manifest() and xdmp:zip-get(). But we don't want to have to write an XQuery each time we want to save a document. That would be a pain. Wouldn't it be convenient if we could save directly from Office 2007 into our XML Repository? Or if we could just have a folder on our desktop that we could drag Office documents into, and that would automatically ingest them into our XML Server? Well yes, that would be great, and that's exactly what we're going to do!

Introduction to WebDAV

WebDAV is an abbreviation that stands for Web-based Distributed Authoring and Versioning. It refers to a set of extensions to HTTP. Since MarkLogic Server is also an HTTP Server, we can take advantage of WebDAV to create a folder on our desktop that will allow us to save our Office documents directly into MarkLogic.

Log in to your MarkLogic Server installation. If you've installed with the defaults, open your favorite browser and log in to http://localhost:8001. In the tree on the left, navigate to Groups -> Default -> App Servers. In the display on the right you'll see a list of the App Servers you've currently installed. Make sure you note the ports that your Server is currently using. You'll see that I've already set up an XDBC connection. XDBC allows me to connect to MarkLogic using the XCC Java and .NET APIs. If you'd like to know more about XCC, please see the XCC Developer's Guide. We won't be using any Java for this example; I just wanted to let you know what you are looking at in the screenshot. Maybe we'll explore XCC in a future post. For now, you should be looking at a list of App Servers similar to the one in the screenshot.

Next, click the 'Create WebDAV' tab. On this tab enter the following information:

  • Enter your Server name.
    For the example I'm naming the Server 8004-Documents, to reflect the port and the database we intend to use, but you can name it anything you like.
  • Enter the root document folder.
    I've entered '/', this way any document we save will be saved as '/'+filename. So, as an example, when we save 'test.docx' through WebDAV, we'll find the document in MarkLogic as '/test.docx'.
  • Enter the port number you want WebDAV to be accessed on.
    I've entered 8004 as it was available when we checked for ports on the App Servers page.
  • Select the database that WebDAV should use.
    The default is 'Documents', and that's what we'll use for the example.

The rest of the defaults on the page are fine for this example. Just scroll to the top or bottom of the page and click 'ok'.

Now in Admin, in the tree pane on the left, you'll see our newly created WebDAV Server. Also, if you go to the App Servers page, you'll see our WebDAV Server added to the list. We've opened the WebDAV connection on our Server; now we just need to expose it in Windows so we have a way to use it.

In Windows, go to: Start -> Accessories -> Communications -> Network Connections. On the left of the window that opens you'll see 'My Network Places', go ahead and click that as well.

  • Click 'Add a network place'
  • Click next ('choose another network location...' is selected by default)
  • For the 'Internet or Network Address' type in: http://localhost:8004
  • Click next and give your MarkLogic Server username and password when prompted
  • Type a name for the connection (I've typed in '8004-documents', but you can name it anything you wish) and click next
  • Click finish

You may be prompted for your MarkLogic Server username and password again. Now under 'Network Places', you'll see '8004-documents' under your 'Local Network'. Finally, copy the connection, and paste as a Shortcut to the desktop. Let's test what we've done!

Let's create a test document. Go ahead and open Word. Here are some fun tips you might not know about for quickly populating text into a Word 2007 document. When you open Word to a new document, type '=lorem()' on the first line (without the quotes) and then hit 'enter' on your keyboard. The document will be populated with some random Latin text. You can do something similar by typing '=rand()' followed by 'enter'; rand() returns English text. Both lorem() and rand() will also take a parameter, so if you need to test a larger document, you can type '=rand(100)' and you'll end up with 3 pages or so of random text. This can be useful when you just need some test content. A part of me wonders though: why, with all the nonsense content out there in the world already, did they feel it necessary to implement these functions? :) I keed! I'm glad they did; it is helpful. Anyway, let's continue.

Ok, enter some test content into Word, then 'Save As' so you can pick the destination to save to. You now have the option of A) saving anywhere and dragging the file to your WebDAV folder on the desktop, or B) saving directly to the WebDAV folder, 8004-documents. So save test.docx to your WebDAV folder. If you open the folder, you'll then see the .docx you've created there.

Now in CQ, evaluate the following:


     xdmp:zip-get(doc("/test.docx"), "word/document.xml")

The document.xml for the .docx we just saved is returned. Now, go to your WebDAV folder and delete test.docx. If you evaluate the query in CQ again, you'll get an error, as the document no longer exists in MarkLogic. You could also have deleted it by evaluating the following in CQ.


     xdmp:document-delete("/test.docx")

With WebDAV, we now have a convenient way to quickly save and delete Office 2007 documents from our MarkLogic Server. That was quite easy to set up too. We're showing you examples in Windows as we assume that's what you're using so you can use Office 2007. You can set up WebDAV functionality just as easily in Linux as well. For more information on WebDAV, please see the Administrator's Guide. Pages 33-43 will tell you all about WebDAV, its strengths, and the opportunities it presents.
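
Since the WebDAV folder is simply a view of the database, you can get roughly the same listing from CQ. The following is a minimal sketch, assuming the WebDAV root of '/' chosen earlier; it returns the URIs of the documents stored directly under that root.

```xquery
(: URIs of the documents sitting directly under the WebDAV root "/".
   The depth "1" limits the listing to immediate children. :)
for $doc in xdmp:directory("/", "1")
return xdmp:node-uri($doc)
```

Anything you drag into the WebDAV folder should show up in this list, and anything you delete there should disappear from it.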

Well, having WebDAV set up helps us a lot, but we can still improve upon how we're saving our documents. Currently, we just save the OOXML package. If we want to query any of the XML files within the package, we'll still need to use the zip utilities to access them. Wouldn't it make more sense to just save the package and at the same time save the individual XML files that comprise the package? That approach would be very helpful to us for future queries as well as content re-use. With MarkLogic's Content Processing Framework, we have an easy way to do that too.

Just a quick note: WebDAV and the Content Processing Framework are separate features. We're discussing both as it makes sense for what we want to achieve, but you don't have to have one configured to utilize the other. With WebDAV, we've set up a way to save our documents to our MarkLogic Server in a convenient and intuitive way. With CPF, we'll take action on a document to process its contents as soon as it is loaded to our Server, and it doesn't matter if it's loaded through WebDAV, by using xdmp:document-load(), or by any other utility. Using WebDAV and CPF together is very useful, but we want to make sure you distinguish between the two, as we're moving quickly. It would be easy to think we had to set up one to start working with the other, but you're just getting a two-for-one with today's post. Now back to the action ....

Introduction to CPF (the Content Processing Framework)

The Content Processing Framework is simple, intuitive, flexible, and powerful. I'm just going to give you an in-a-nutshell, high-speed overview, and we'll explore the components needed to create our first pipeline: one that will extract the contents of an OOXML package for us as soon as we save it to our Server. To learn more about CPF, all it has to offer, and the many other possibilities you may want to explore for transforming your content, please refer to the Content Processing Framework Guide.

Content can go through many phases before it's ready for use in an application. It may be transformed from one XML structure to another, or from text to XHTML, and the list goes on. The process of content going from one stage to another is called content processing. Every document has a lifecycle. A lifecycle typically begins when a document is created, then continues through various phases of content processing. Content processing enables a document to move through the phases of its lifecycle. Content processing can be very simple or very complex. Today's example falls into the "very simple" category, but we hope that we pique your curiosity enough that you go on to try more complex processing on your own. If you stick with this series, we're sure you will.

CPF 101 Time! Don't worry, you don't have to understand everything below for what we want to accomplish; just be aware that these components exist. Very briefly, the components of CPF are:

  • Domains
    Defines the scope of documents to process. With domains, you can organize your content so some docs are processed one way, and others are processed in another way.
  • Pipelines
    A pipeline is an XML document that describes a set of content processing steps. It defines the steps that occur during the processing of documents and the actions that occur at each step.
  • XQuery Functions and Modules
    MarkLogic Server includes many XQuery functions and supporting XQuery modules for CPF. An example of this is for document conversion. You can read about these functions and how they can be used by referring to the CPF Guide and the MarkLogic Built-In and Module Functions Reference. These are very helpful in that they will do some work for us already, and also give us the option of re-using components to quickly build our own applications. We could write our own, and sometimes will, but if we have what we need readily available, we'll use it.
  • Triggers
    MarkLogic Server and CPF use triggers to automate processes described by a pipeline. Triggers allow you to capture document and system events and then perform some tasks after the event occurs.
  • Custom Applications
    CPF was built for you to create your own content processing applications, with your own content processing code, and following your own logical and business processes.

Bringing it all together, in-a-nutshell style, we can write XQuery Modules that execute as our document passes through lifecycle phases as defined by a pipeline. Because of the simplicity of the CPF architecture, we end up creating what I like to think of as "XQuery Legos": building blocks of XQuery that allow us to plug the components we want together to process the content however we see fit.

So let's start by writing the XQuery for extracting the individual XML files from an OOXML package.


     declare namespace zip="xdmp:zip"

      let $document-uri := "/test.docx"

      (: store the parts under a directory named after the package :)
      let $directory-uri := fn:concat($document-uri,"/")
      let $zipfile := doc($document-uri)
      let $manifest := xdmp:zip-manifest($zipfile)
      for $part-name in $manifest/zip:part
      (: relationship (.rels) parts are XML, so request them as XML explicitly;
         part names in the manifest are relative, e.g. "_rels/.rels" :)
      let $options := if (fn:ends-with($part-name, ".rels")) then
                            <options xmlns="xdmp:zip-get">
                               <format>xml</format>
                            </options>
                      else
                            <options xmlns="xdmp:zip-get"/>

      let $part := xdmp:zip-get($zipfile, $part-name, $options)
      let $part-uri := fn:concat($directory-uri, $part-name)

      return 
        xdmp:document-insert($part-uri, $part)


We've seen this before. If we have /test.docx in our MarkLogic Server, we can evaluate the above in CQ. We just loop through the parts listed in the manifest and insert each one into our XML Server. So that we can identify which files belong to which OOXML package, we prefix each part's URI with the URI of the package; in effect, we save the parts in a directory named for the file we are saving.
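
Once the parts are stored as documents in their own right, they can be addressed by URI with no zip utilities involved. As a rough sketch, assuming the extraction above has been run against /test.docx, the following returns the text of each paragraph in the main document part. The *:p and *:t wildcards stand in for the WordprocessingML paragraph and text elements so that no namespace declaration is needed.

```xquery
(: One string per paragraph of the extracted main document part.
   Assumes /test.docx/word/document.xml was created by the query above. :)
for $p in doc("/test.docx/word/document.xml")//*:p
return fn:string-join(for $t in $p//*:t return fn:string($t), "")
```

This is the kind of query that would otherwise need an xdmp:zip-get() call on the package every time.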

If the above doesn't execute properly for you, you may need to enable the 'uri lexicon' option in your Server. From Administration, navigate to Databases->Documents. On the 'configure' page, scroll down to 'uri lexicon' and select 'true'. Go to the top or bottom of the page and click 'ok'. Now re-test your XQuery and you should be in business.

We know what XQuery we want to evaluate, and we know we want it evaluated when we first save our document to our repository. Now that we have that, it's a good time to create our pipeline. A pipeline is just a simple XML document. The only rule here is that the document must conform to the pipelines.xsd schema, found at <install-dir>/Config/pipelines.xsd. We provide a sample pipeline below, and if you're interested, the CPF Guide contains another example as well.

Under the /Modules directory of your MarkLogic Server installation, create a folder: custom_pipelines. In that folder, create a new file named ooxml_extract_pipeline.xml, and save the following contents to the file.


     <?xml-stylesheet href="/cpf/pipelines.css" type="text/css"?>
     <pipeline xmlns="http://marklogic.com/cpf/pipelines"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://marklogic.com/cpf/pipelines pipelines.xsd">
       <pipeline-name>ML OOXML Extract</pipeline-name>
       <pipeline-description>Save OOXML files from Office 2007 documents.</pipeline-description>
       <success-action>
           <module>/MarkLogic/cpf/actions/success-action.xqy</module>
       </success-action>
       <failure-action>
           <module>/MarkLogic/cpf/actions/failure-action.xqy</module>
       </failure-action>
       <state-transition>
         <annotation> Extract OOXML files only.</annotation>
         <state>http://marklogic.com/states/initial</state>
         <on-success>http://marklogic.com/states/final</on-success>
         <on-failure>http://marklogic.com/states/error</on-failure>
         <execute>
            <condition>
               <module>/MarkLogic/cpf/actions/mimetype-condition.xqy</module>
               <options xmlns="/MarkLogic/cpf/actions/mimetype-condition.xqy">
                  <mime-type>application/vnd.openxmlformats</mime-type>
               </options>
            </condition>
            <action>
               <module>/MarkLogic/conversion/actions/extract_ooxml_into_ml.xqy</module>
            </action>
         </execute>
       </state-transition>
     </pipeline>

Here we define a pipeline named "ML OOXML Extract" and give it a description. Next come the actions to take on success and failure. For these, CPF has provided us with some XQuery in our installation to handle any exceptions and to advance the lifecycle of the document as we successfully process our content. Next, we have a state-transition. For the transition, we define the action to take on the state "initial", which is the state every document is in the first time we save it to the Server. We then provide the states to transition to in the case of success and failure. Now we come to the interesting part. To make sure we only execute our CPF logic against OOXML documents, we add a condition that evaluates mimetype-condition.xqy, and we pass the MIME type to check for as an option to the module. This is another of those provided modules that we simply re-use; you can check out the file for yourself at the path indicated. If the condition is met, we evaluate extract_ooxml_into_ml.xqy, which extracts the XML files from our package into the Server. We still need to create this last file, but we already have a good portion of it, since it will contain the XQuery we wrote above.
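
When a pipeline condition doesn't fire as expected, it can help to see which MIME type the Server associates with a URI. The sketch below uses xdmp:uri-content-type(), which reports the MIME type derived from the URI's extension. How mimetype-condition.xqy compares that value with the mime-type option is not shown in this post, so treat this purely as a debugging aid and not as a description of the condition's logic.

```xquery
(: The MIME type the Server derives from the extension of this URI.
   /test.docx is only an example URI; the document does not need to exist. :)
xdmp:uri-content-type("/test.docx")
```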

Navigate to /MarkLogic/conversion/actions and create a file named extract_ooxml_into_ml.xqy. In that file save the following contents:


     declare namespace zip="xdmp:zip"

     import module namespace cpf = "http://marklogic.com/cpf" at "/MarkLogic/cpf/cpf.xqy"

     (: supplied by CPF: the document being processed and the transition being run :)
     define variable $cpf:document-uri as xs:string external
     define variable $cpf:transition as node() external

     if (cpf:check-transition($cpf:document-uri,$cpf:transition)) then
     try {
       xdmp:log(fn:concat("DOCUMENT-URI: ", $cpf:document-uri)),
       let $directory-uri := fn:concat($cpf:document-uri,"/")
       let $zipfile := fn:doc($cpf:document-uri)
       let $manifest := xdmp:zip-manifest($zipfile)
       for $part-name in $manifest/zip:part
       (: relationship (.rels) parts are XML, so request them as XML explicitly :)
       let $options := if (fn:ends-with($part-name, ".rels")) then
                             <options xmlns="xdmp:zip-get">
                               <format>xml</format>
                             </options>
                       else
                             <options xmlns="xdmp:zip-get"/>
       let $part := xdmp:zip-get($zipfile, $part-name, $options)
       let $part-uri := fn:concat($directory-uri, $part-name)

       return 
           xdmp:document-insert($part-uri, $part),
       (: called once, after the loop over the parts :)
       cpf:success( $cpf:document-uri, $cpf:transition, () )

     } catch ($e) {
           cpf:failure( $cpf:document-uri, $cpf:transition, $e, () )
     }
     else ()

We see two variables are passed to our CPF module, $cpf:document-uri and $cpf:transition. These variables are passed in by the Server and depend on the document being saved and the transitions defined in our pipeline XML. We check the state of the document and, if it's a candidate for processing, enter the try/catch. We log the URI of the document and execute our XQuery. If our XQuery is successful, we call cpf:success to complete the state transition; if anything goes wrong, we call cpf:failure. That's it: we've got our pipeline and our XQuery, and we have just one step left.

We have to load and attach the pipeline in our Server. Once we've done that, we can test. Log in to the Admin console and navigate to Databases -> Documents -> Content Processing. On the Content Processing Installation page, select false for the 'enable conversion' option, and click Install if you haven't done so already.

Click 'ok'. Under 'Content Processing' in the tree pane you'll now see 'Domains' and 'Pipelines'; select 'Pipelines'. Next, click 'Load' in the display on the right. This page is where you'll load your pipeline.

  • Provide the directory that contains ooxml_extract_pipeline.xml
    ( C:\Program Files\MarkLogic\Modules\custom_pipelines )
  • Set your filter to *.xml
    All xml files in this directory will be processed, so make sure only your pipeline is in this directory.
  • Leave '(file system)' as the source selection
  • Click 'ok'

Click 'ok' again. We now see ML OOXML Extract under pipelines on the summary page. In the tree on the left, click 'ML OOXML Extract' and you can see the details for the pipeline we've just loaded.

Finally, navigate to Content Processing -> Domains -> Default Documents -> Pipelines in the tree on the left of the Admin console. You'll see no pipelines exist for the domain. We just have to attach the pipeline we've loaded, so click on the 'attach' tab. From the first drop down menu, select 'ML OOXML Extract' and click 'ok'. You'll now see our pipeline on the pipelines page with a 'detach' option. That's it! We're in business. Let's try it out.

Open Word, add some test text, and save the document to your WebDAV folder as testCPF.docx. Next, evaluate the following in CQ to validate that all the pieces were extracted correctly.


     let $doc := "/testCPF.docx/"
     let $uris := cts:uris("",(),cts:directory-query($doc,"infinity"))  
     for $i in $uris
      let $dir := fn:substring-after($i,$doc)
      return <piece>{$dir}</piece>

You can also evaluate the following to check the state of the documents you've saved. If they extracted successfully, you'll see "final".


     xdmp:document-properties("/testCPF.docx")
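
The properties fragment returned above holds several elements besides the state. If you only want the state value, a sketch like the following pulls it out. It matches on the local name state with a wildcard so that no namespace declaration is needed, on the assumption that the state element is the only one with that name in the properties.

```xquery
(: Just the CPF state of the package, as a string.
   Assumes the properties contain a single element with local name "state". :)
for $state in xdmp:document-properties("/testCPF.docx")//*:state
return fn:string($state)
```

With the pipeline defined earlier, a successfully extracted document should report the on-success state, http://marklogic.com/states/final.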

If you go to your WebDAV folder and refresh, you'll now see the testCPF.docx we just saved, as well as a folder named /testCPF.docx. If you delete the .docx from the folder, the package will be removed from the Server. If you delete the folder, the directory and all its children will be deleted from the Server. Remember, the WebDAV folder we created is a window into the Server; it is not the filesystem. So make sure you really want to delete the documents from your MarkLogic Server before deleting from your WebDAV folder.

Also, you don't need WebDAV for your Pipeline to execute. Go ahead and test by evaluating xdmp:document-load() with another test document. All the XML files in the OOXML package will be extracted.
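
As a sketch of that test, the following loads a second document the same way we loaded HelloWorld.docx at the start of this post. The file path and URI are hypothetical; point them at any .docx available on the Server's filesystem.

```xquery
(: C:\testLoad.docx is a hypothetical path; substitute a real .docx. :)
xdmp:document-load("C:\testLoad.docx",
    <options xmlns="xdmp:document-load">
       <uri>/testLoad.docx</uri>
    </options>)
```

Once the load has committed and CPF has had a moment to run, re-running the piece-listing query shown earlier with $doc set to "/testLoad.docx/" should show the extracted parts.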

If you want to delete a directory and its children from your MarkLogic Server, you'll find the following helpful.


     let $doc := "/testCPF.docx/"
     return xdmp:directory-delete($doc)

For now, I say save every document you can! As we progress through the series we'll use our saved XML content to explore search, enrichment, and re-use, so at this point, the more content, the merrier.

Conclusion

Again I get to bang the drum: MarkLogic Server, XQuery, and Office 2007 are a powerful combination! I kind of fooled you with today's post, as it was more MarkLogic-y than Office-y, but if you stick with the series and future posts, you'll understand why we explored WebDAV and the Content Processing Framework. For one, I know every time I pick up an XML/XQuery book or read an article on the web, the assumption is that you have the content stored somewhere already. Sometimes figuring out how to get the content into the Server, so you can start exploring what the books or articles are suggesting, can be a chore. So now we know we can use MarkLogic built-ins and WebDAV to quickly and conveniently load our content. If you continue to explore OOXML, you'll find the Content Processing Framework particularly useful. Right now we have just one pipeline, and all it does is extract the contents of our Office 2007 package. Tomorrow we could have a pipeline to enrich the content in our document.xml, or a pipeline to transform our OOXML to ODF, or a pipeline that takes some subset of our OOXML package and puts it in another XML structure for publishing and consumption by another service. We can even have pipelines within other pipelines. The possibilities with CPF are endless! CPF is a powerful ally for content conversion. Next post we take a finer look at our friend WordprocessingML. Until then, enjoy!!
