Office Logic

by Pete Aven

Categories: Office 2007
Altitude: 1,000 feet

With the release of Microsoft Office 2007, there are many new opportunities for desktop documents in the enterprise and online. Every Office document is now a package of XML. Previously, a Word document was a binary object that could only be stored and labeled as a single entity. Storing documents this way made them difficult to repurpose and hard to search within, let alone enrich with other information such as entity types. Now we can store, search, enrich, and repurpose Word documents in ways that were not practical before. With Word 2007 documents stored in an XML Repository, and the ability to manipulate and fetch them using XQuery, we have the flexibility and firepower to take advantage of all the information we've stored in our enterprise. Over the next few weeks, we hope to help you understand how to take advantage of these opportunities using XQuery and MarkLogic.

Introduction to Word 2007

A Word document is now just a bag of XML wrapped up in a zip file.

Figure 1

Above is the set of XML created upon saving a simple document. To create this document, just open up Word, type "Hello World", and save. To inspect the XML, in Windows, take the saved Word 2007 document, change the .docx extension to .zip, right-click on the package and select "Extract All".
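If you'd rather not rename and unzip by hand, any zip library can list the parts of the package. Below is a minimal sketch in Python using only the standard library. The helper names list_parts and build_sample_package are ours, not part of Office or MarkLogic, and the sample package is a tiny stand-in with placeholder content so that the example runs without Word installed.

```python
import io
import zipfile


def list_parts(docx_bytes: bytes) -> list:
    """Return the names of all parts stored in an OOXML package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.namelist()


def build_sample_package() -> bytes:
    """Build a tiny stand-in package (placeholder XML, not a valid document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


if __name__ == "__main__":
    for name in list_parts(build_sample_package()):
        print(name)
```

To inspect a real file, read its bytes first, for example list_parts(open("HelloWorld.docx", "rb").read()).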

This is not the maximum number of files you'll find in a .docx. Depending on the contents of the document, there can be any number of XML files in the package. The main body content is stored in document.xml. Many of the XML files shown above are not needed to create our own Word document. Word 2007 is a consumer of XML files, and we can create the same Word document from scratch with XQuery using just 3 files: [Content_Types].xml, document.xml, and .rels. We'll explore these below.

Microsoft calls the XML they use in Office 2007 OOXML (Office Open XML). You'll find many articles that refer to this acronym on the internet and in documentation. In actuality, a specification exists for each application: WordprocessingML for Word, SpreadsheetML for Excel, and PresentationML for PowerPoint. The OOXML formats cover those 3 applications. Throughout this series we'll touch on some of the specifics of the tags and XML files used, but if you can't wait and want to dive in right now, you can find the specifications for all the OOXML file formats in the Ecma standard ECMA-376.

Examining the Document

If you don't have MarkLogic Server installed yet, download it and install it using a free Community License. To evaluate the queries, you can grab the latest release of CQ. If you wish to explore XQuery beyond our discussion of Office 2007, our developer site has an abundance of useful documentation, downloads, and examples. The API reference is clear and very helpful as well.

First, let's insert the document.

     xdmp:document-load("C:\HelloWorld.docx",
                         <options xmlns="xdmp:document-load">
                            <uri>HelloWorld.docx</uri>
                         </options>)

Now we have the document in our XML Repository, but it's in the zip file format still. Using MarkLogic's built-in zip functions, we can easily view the document.xml. To view the contents of document.xml, evaluate the following.

     xdmp:zip-get(doc("HelloWorld.docx"), "word/document.xml")
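For comparison, here is a rough sketch of the same lookup outside MarkLogic: open the package, read word/document.xml, and pull the text out of the w:t elements. It is written in Python with the standard library only. The function names paragraph_texts and sample_docx are hypothetical, and the sample document is a hand-built stand-in rather than a file saved by Word.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as used in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_texts(docx_bytes: bytes) -> list:
    """Return the concatenated w:t text of each w:p in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


def sample_docx() -> bytes:
    """Build a stand-in package holding only word/document.xml."""
    document = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>World</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document)
    return buffer.getvalue()


if __name__ == "__main__":
    print(paragraph_texts(sample_docx()))
```

Word often splits a sentence across several runs, which is why the sketch joins every w:t inside a paragraph before returning it.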

What if we want to just extract all the XML files in any Office 2007 document and save them to our XML Repository? Well, we can do that too.

     declare namespace zip="xdmp:zip"

     (: Save every part of the package under a directory named for the document :)
     let $doc := "HelloWorld.docx"
     let $directory-uri := fn:concat($doc, "/")
     let $zipfile := fn:doc($doc)
     let $manifest := xdmp:zip-manifest($zipfile)
     for $part-name in $manifest/zip:part
         (: .rels parts are XML but lack a .xml extension, so force the format :)
         let $options := if (fn:ends-with($part-name, ".rels")) then
                              <options xmlns="xdmp:zip-get">
                                <format>xml</format>
                              </options>
                          else
                              <options xmlns="xdmp:zip-get"/>
         let $part := xdmp:zip-get($zipfile, $part-name, $options)
         let $part-uri := fn:concat($directory-uri, $part-name)

         return xdmp:document-insert($part-uri, $part)

You can easily turn the above into a reusable module. It works for any Office 2007 OOXML document, be it Word, Excel, or PowerPoint. All we do is loop through the parts listed in the manifest and insert each one into a directory named for the document we are saving. Since the other files have a .xml extension, MarkLogic will recognize them as XML. To capture a .rels file as XML, we explicitly assign it the XML format. Now that we're storing all our Office 2007 documents as XML, we can search across them and, when we find what we're looking for, grab either the zip (.docx in this example) or just the piece of the document that we're interested in. With that momentum, let's have some fun and step it up a notch.
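To make the URI scheme concrete, here is a small sketch of the same loop outside MarkLogic, in Python with the standard library. It maps each part to a URI of the form document-name/part-name and flags which parts should be treated as XML. The names explode_package, looks_like_xml, and sample_package are hypothetical, a plain dictionary stands in for the repository, and the sample package is a stand-in with placeholder content.

```python
import io
import zipfile


def looks_like_xml(part_name: str) -> bool:
    """Treat .xml and .rels parts as XML, mirroring the format override above."""
    return part_name.endswith((".xml", ".rels"))


def explode_package(doc_uri: str, package_bytes: bytes) -> dict:
    """Map '<doc_uri>/<part name>' to (is_xml, raw bytes) for every part."""
    directory_uri = doc_uri + "/"
    parts = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for info in package.infolist():
            if info.is_dir():  # skip directory entries, keep only real parts
                continue
            data = package.read(info.filename)
            parts[directory_uri + info.filename] = (looks_like_xml(info.filename), data)
    return parts


def sample_package() -> bytes:
    """Build a stand-in package with XML parts and one binary part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("word/document.xml", "<document/>")
        package.writestr("word/media/image1.png", b"\x89PNG")
    return buffer.getvalue()


if __name__ == "__main__":
    for uri, (is_xml, data) in explode_package("HelloWorld.docx", sample_package()).items():
        print(uri, "xml" if is_xml else "binary", len(data))
```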

Creating a Document

As we mentioned previously, a Word 2007 document to be consumed and displayed by Office 2007 has a minimum of 3 parts. Since we're interested in XML, it's easiest to just think of these parts as required nodes. These nodes will be zipped into a single package that we can then open in Word.

The first part of any Office 2007 OOXML document is what Microsoft calls the "start-part". It's the place where the application will start to parse the document contents. Word, Excel, and PowerPoint each have a file in their package considered a start-part. Since the start-part is different for each application, they each serve a different purpose. For Word, the start-part stores the main text and body of the document and is named document.xml. We'll create our start-part with some sample text as follows.

     let $document :=
      <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:body>
          <w:p>
            <w:r>
              <w:t>Have you visited http://www.MarkMail.org ?</w:t>
            </w:r>
          </w:p>
        </w:body>
      </w:document>

The second required part is the [Content_Types].xml file. This file stores the content type for each part inside the package. It does so in two ways: Default elements assign a content type to every part with a given file extension, and Override elements assign a content type to a single part identified by its location within the package. The minimal [Content_Types].xml file for a Word document requires the following.

     let $content-types :=
      <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
        <Default Extension="rels"
         ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
        <Default Extension="xml" ContentType="application/xml" />
        <Override PartName="/word/document.xml" 
         ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
      </Types>
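To see how the two kinds of entries interact, here is a minimal sketch of a lookup, written in Python with the standard library. The function name content_type_for is hypothetical, and the rule it encodes (an Override for the exact part wins, otherwise the Default for the extension applies) is our reading of the description above rather than a full implementation of the packaging specification.

```python
import xml.etree.ElementTree as ET
from typing import Optional

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# The minimal [Content_Types].xml from the article.
CONTENT_TYPES = """<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels"
   ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml"
   ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""


def content_type_for(part_name: str, types_xml: str) -> Optional[str]:
    """Resolve a part's content type: Override first, then Default by extension."""
    root = ET.fromstring(types_xml)
    for override in root.findall(f"{{{CT_NS}}}Override"):
        if override.get("PartName", "").lower() == part_name.lower():
            return override.get("ContentType")
    extension = part_name.rsplit(".", 1)[-1].lower() if "." in part_name else ""
    for default in root.findall(f"{{{CT_NS}}}Default"):
        if default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None  # no entry covers this part


if __name__ == "__main__":
    for name in ("/word/document.xml", "/_rels/.rels", "/word/styles.xml"):
        print(name, "->", content_type_for(name, CONTENT_TYPES))
```

Note that part names in Override entries start with a slash, unlike the names stored in the zip itself.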

The final part we'll need is .rels. The relationships file tells Word where to find the start-part of your package. In Figure 1 above, you saw a document.xml.rels file. It turns out there can be many relationships files (*.rels) within a single Word document, but the package-level _rels/.rels is the one that is crucial for locating the start-part.

     let $rels :=
      <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
        <Relationship Id="rId1" 
         Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
         Target="word/document.xml"/>
      </Relationships>

We're skipping a lot of detail in the examples above, but don't worry about that right now. For now, just see how $content-types defines the content types for the parts of the package we're going to create, how $rels identifies where to find our start-part, and how $document contains the content for the Word document we're creating.
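As a rough illustration of how a consumer might use $rels, the sketch below scans a relationships file for the officeDocument relationship and returns its target. It is written in Python with the standard library. The function name find_start_part is hypothetical, and we are not claiming this is how Word itself resolves the start-part.

```python
import xml.etree.ElementTree as ET
from typing import Optional

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOCUMENT = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
)

# The minimal _rels/.rels from the article.
RELS = f"""<Relationships xmlns="{REL_NS}">
  <Relationship Id="rId1"
   Type="{OFFICE_DOCUMENT}"
   Target="word/document.xml"/>
</Relationships>"""


def find_start_part(rels_xml: str) -> Optional[str]:
    """Return the Target of the officeDocument relationship, if there is one."""
    root = ET.fromstring(rels_xml)
    for relationship in root.findall(f"{{{REL_NS}}}Relationship"):
        if relationship.get("Type") == OFFICE_DOCUMENT:
            return relationship.get("Target")
    return None


if __name__ == "__main__":
    print(find_start_part(RELS))
```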

Now, using XQuery and MarkLogic, we can create our Word document. We can either insert the document into our XML Repository or view it by opening it directly in Word. We've chosen the latter.

     define function generate-docx(
       $content-types as node(),
       $rels as node(),
       $document as node()
     ) as binary()
     {
       let $manifest := <parts xmlns="xdmp:zip">
                           <part>[Content_Types].xml</part>
                           <part>_rels/.rels</part>
                           <part>word/document.xml</part>
                        </parts>
       let $parts := ($content-types, $rels, $document)
       return
         xdmp:zip-create($manifest, $parts)
     }

     let $content-types :=
       <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
       <Default Extension="rels" 
        ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
       <Default Extension="xml" ContentType="application/xml" />
       <Override PartName="/word/document.xml" 
        ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
      </Types>


     let $rels :=
      <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
         <Relationship Id="rId1" 
          Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" 
          Target="word/document.xml"/>
      </Relationships>

     let $document :=
      <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
       <w:body>
        <w:p>
         <w:r>
          <w:t>Have you visited http://www.markmail.org?</w:t>
         </w:r>
        </w:p>
       </w:body>
      </w:document>

     let $package := generate-docx($content-types, $rels, $document)
     let $filename :=  "hello-world.docx"

     let $disposition := concat("attachment; filename=""",$filename,"""")
     let $x := xdmp:add-response-header("Content-Disposition", $disposition)
     let $x := 
      xdmp:set-response-content-type("application/vnd.openxmlformats-officedocument.wordprocessingml.document")
     return
      $package

We've defined a function, generate-docx, that takes our 3 nodes as parameters and returns the zip package. Once it returns, we have a Word 2007 file that we can open in Office 2007. Because we've chosen to open our file instead of saving, we give our file a name, and set the response-header and response-content-type so that the proper application will open our document.
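The same three parts can be zipped up outside MarkLogic as well. Below is a minimal sketch in Python with the standard library, where generate_docx is our own stand-in for the XQuery function of the same name. The XML declarations on each part are an addition of ours, and the sketch only builds the package bytes; the response headers from the XQuery version have no counterpart here.

```python
import io
import zipfile

XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'

CONTENT_TYPES = XML_DECL + """<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels"
   ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml"
   ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

RELS = XML_DECL + """<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1"
   Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
   Target="word/document.xml"/>
</Relationships>"""

DOCUMENT = XML_DECL + """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Have you visited http://www.MarkMail.org ?</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>"""


def generate_docx(content_types: str, rels: str, document: str) -> bytes:
    """Zip the three required parts into a single package and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        # Part names match the manifest used in the XQuery version.
        package.writestr("[Content_Types].xml", content_types)
        package.writestr("_rels/.rels", rels)
        package.writestr("word/document.xml", document)
    return buffer.getvalue()


if __name__ == "__main__":
    with open("hello-world.docx", "wb") as handle:
        handle.write(generate_docx(CONTENT_TYPES, RELS, DOCUMENT))
```

Running the script writes hello-world.docx to the current directory, which should contain the same three parts as the package returned by the XQuery above.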

The 3 files we discussed above are all you need for a simple Word document that can be consumed and used with Office 2007. If you'd like to be able to open the document in Corel WordPerfect X3, you need a minimum of 5 files, not 3. Corel wants 2 more nodes: docProps/app.xml and docProps/core.xml. To see what these look like, go back to Office 2007, create your "hello world" test .docx document, unzip it, and take a look at the docProps/app.xml and docProps/core.xml files you find in there. You just need to add them to the last bit of code we outlined above. Remember to update your [Content_Types].xml and .rels entries as well.
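As a sketch of what updating those entries might involve, the Python below appends an Override and a Relationship for each of the two docProps parts. The function name add_doc_props is hypothetical. The content types and relationship types are the ones Office normally writes for these parts, so treat them as an assumption to verify against the files in your own unzipped .docx. The sketch does not create app.xml or core.xml themselves, and we have not confirmed the result against WordPerfect.

```python
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# (part path, content type, relationship type); assumed values, check your own .docx.
EXTRA_PARTS = [
    ("docProps/app.xml",
     "application/vnd.openxmlformats-officedocument.extended-properties+xml",
     "http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties"),
    ("docProps/core.xml",
     "application/vnd.openxmlformats-package.core-properties+xml",
     "http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"),
]

CONTENT_TYPES = f"""<Types xmlns="{CT_NS}">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

RELS = f"""<Relationships xmlns="{REL_NS}">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""


def serialize(root: ET.Element, namespace: str) -> str:
    """Serialize with the given namespace as the default (unprefixed) one."""
    ET.register_namespace("", namespace)
    return ET.tostring(root, encoding="unicode")


def add_doc_props(types_xml: str, rels_xml: str) -> tuple:
    """Return updated ([Content_Types].xml, .rels) text covering the docProps parts."""
    types_root = ET.fromstring(types_xml)
    rels_root = ET.fromstring(rels_xml)
    used_ids = {rel.get("Id") for rel in rels_root}
    next_id = 1
    for path, content_type, rel_type in EXTRA_PARTS:
        ET.SubElement(types_root, f"{{{CT_NS}}}Override",
                      {"PartName": "/" + path, "ContentType": content_type})
        while f"rId{next_id}" in used_ids:  # pick the next unused relationship Id
            next_id += 1
        rel_id = f"rId{next_id}"
        used_ids.add(rel_id)
        ET.SubElement(rels_root, f"{{{REL_NS}}}Relationship",
                      {"Id": rel_id, "Type": rel_type, "Target": path})
    return serialize(types_root, CT_NS), serialize(rels_root, REL_NS)


if __name__ == "__main__":
    new_types, new_rels = add_doc_props(CONTENT_TYPES, RELS)
    print(new_types)
    print(new_rels)
```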

Conclusion

We've just started to scratch the surface of what we can do with XQuery and Office 2007. We've covered a lot of ground in a short time, but we aim to take you further. In the next few weeks, we'll be posting more ways to get the most out of Office 2007 and MarkLogic. We'll be showing you how to automatically store your Office 2007 documents to MarkLogic, how to enrich your Word documents, how to repurpose content from Word documents as well as other sources, and we'll even explore Excel and PowerPoint. Look for new posts each Monday; it will be a great way to start the week.

For those who use Office 2007, the OOXML formats present a great opportunity for storing, querying, enriching, and repurposing Word 2007 documents in your business or enterprise. As Office 2007 OOXML applications all store their content in XML, we can even reuse that content to fuel and feed other applications. XQuery and MarkLogic provide a powerful combination for managing those documents and making the most out of them.

Comments

  • For xquery version 1.0-ml, the function may need a namespace. The following code works with ML Server version 4.2.

     xquery version "1.0-ml";

     declare namespace oox = "http://developer.marklogic.com/xquery/util";

     declare function oox:generate-docx(
       $content-types as node(),
       $rels as node(),
       $document as node()
     ) as binary()
     {
       let $manifest := <parts xmlns="xdmp:zip">
                           <part>[Content_Types].xml</part>
                           <part>_rels/.rels</part>
                           <part>word/document.xml</part>
                        </parts>
       let $parts := ($content-types, $rels, $document)
       return
         xdmp:zip-create($manifest, $parts)
     };

     let $content-types :=
      <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
        <Default Extension="rels"
         ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
        <Default Extension="xml" ContentType="application/xml"/>
        <Override PartName="/word/document.xml"
         ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
      </Types>

     let $rels :=
      <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
        <Relationship Id="rId1"
         Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
         Target="word/document.xml"/>
      </Relationships>

     let $document :=
      <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:body>
          <w:p>
            <w:r>
              <w:t>Have you visited http://www.MarkMail.org ?</w:t>
            </w:r>
          </w:p>
        </w:body>
      </w:document>

     let $package := oox:generate-docx($content-types, $rels, $document)
     let $filename := "hello-world.docx"

     let $disposition := concat("attachment; filename=""",$filename,"""")
     let $x := xdmp:add-response-header("Content-Disposition", $disposition)
     let $x :=
      xdmp:set-response-content-type("application/vnd.openxmlformats-officedocument.wordprocessingml.document")
     return
      $package