XQuery, Office 2007, and the Open Packaging Convention

by Pete Aven

An alternate approach to serializing Word and PowerPoint documents in Office 2007

Categories: Office 2007
Altitude: 1,000 feet

Happy New Year! We hope you all enjoyed the holidays. We're back with more MarkLogic goodness to share with you. First, a few reminders: if you'd like to try the examples yourself and don't have MarkLogic Server installed yet, download it and install it using a free Community License. To evaluate the queries, you can grab the latest release of CQ. You can also subscribe to the developer blog. OK, let's get to it.

Today we take a look at a second way of serializing the XML within a Word 2007 document by taking advantage of the Open Packaging Convention (OPC). Recall that in our first post we introduced the idea that a Word document in Office 2007 is just a .zip file, a bag containing multiple XML pieces. In that post, we demonstrated a way to save the separate XML pieces from an Open XML package into MarkLogic Server. We also introduced server-side document assembly: we generated a .docx on the server by creating the necessary XML files and then zipping those contents into a .docx using MarkLogic built-in utilities. We finished our example by opening the document directly into Word 2007. In this post we'll expand on these concepts by demonstrating another way to store and retrieve a Word 2007 package: as a single XML document.

After saving the pieces from an Open XML package into MarkLogic, we have several XML documents available for search and content repurposing. Depending on the searches we'd like to perform, defining query constraints based on markup that occurs in multiple pieces within a single .docx package can be a bit cumbersome. We have to manage the relationship of the parts within a single Word document both on the server and within our queries. So if you've worked with OOXML, you've probably asked yourself at some point: Isn't there a way to just serialize all the pieces extracted from an Open XML package into a single XML document? And next you've probably wondered: Can’t that same XML document be consumed by Office 2007 to open our document directly into Word? The answer to both questions is yes, and OPC helps make it happen.

Introduction to OPC and <pkg:package>

Strictly speaking, OPC is the packaging specification behind Open XML files. In this post we use "OPC format" for the XML format that captures an entire Open XML package as a single element (you may also see it called Flat OPC). To see what the XML for a document saved in this format looks like, open Word 2007 and save a test document as .xml. (Type "Hello, World", or some other text, into the document before saving so you have some content to look at.) Click the Office Button, choose 'Save As', and select 'Other Formats'. Next, in the 'Save As' dialog box that appears, choose 'Word XML Document (*.xml)' from the 'Save as type' dropdown. Note: this is the proper selection for WordprocessingML. Don't choose Word 2003 XML, as that's something different.

Now, wherever you've saved your file, you can double-click the XML document and have it open directly in Word.

To examine the XML, just open the file within your favorite XML editor. I've copied the first few lines of our example file below for reference.


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">
  <pkg:part pkg:name="/word/document.xml" 
            pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml">
    <pkg:xmlData>
      <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:body>
          <w:p>
            <w:r>
              <w:t>Hello, World.</w:t>
            </w:r>
          </w:p>
        </w:body>
      </w:document>
    </pkg:xmlData>
  </pkg:part>
  <pkg:part pkg:name="/word/theme/theme1.xml" 
            pkg:contentType="application/vnd.openxmlformats-officedocument.theme+xml">


The first thing we notice is that this file is not a .zip, but just a single XML document. This document is the package, serialized using the OPC format.

Next, we notice the processing instruction:


<?mso-application progid="Word.Document"?>

This instruction is what allows us to double-click the document and have it open in Word. A similar instruction exists for PowerPoint as well.

Continuing, we see that the root element of the document is <pkg:package>.

  • <pkg:package> contains children of type <pkg:part>.
  • Each <pkg:part> element has attributes @pkg:name and @pkg:contentType.
  • @pkg:name defines the name of the XML part and its directory path within the .docx package.
  • @pkg:contentType defines the content type for the part within the package.
  • Each <pkg:part> has a child element <pkg:xmlData>.
  • The child of <pkg:xmlData> is the respective file from the .docx package.
  • All files from the .docx package are serialized here, except for [Content_Types].xml. (Its information is carried by @pkg:contentType on each part instead.)
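
To make the structure above concrete, here is a minimal sketch that lists the name and content type of every part in a package. The tiny $package below is hand-written purely for illustration (its <pkg:xmlData> elements are left empty); with a real document you would bind $package to the <pkg:package> root element instead.

```xquery
xquery version "1.0-ml";
declare namespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";

(: Illustrative package only: two parts with empty payloads. :)
let $package :=
  <pkg:package>
    <pkg:part pkg:name="/_rels/.rels"
              pkg:contentType="application/vnd.openxmlformats-package.relationships+xml">
      <pkg:xmlData/>
    </pkg:part>
    <pkg:part pkg:name="/word/document.xml"
              pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml">
      <pkg:xmlData/>
    </pkg:part>
  </pkg:package>

(: One string per part: "<name> : <content type>". :)
for $part in $package/pkg:part
return fn:concat($part/@pkg:name, " : ", $part/@pkg:contentType)
```

The query returns one string per <pkg:part>, which is a quick way to check which pieces of a package made it into the serialized document.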

Note: I'm using Word for our example, but this same capability exists for PowerPoint as well. You can save a PowerPoint presentation as XML, and it will be saved in the OPC format.

Creating the <pkg:package> with XQuery

If you're using MarkLogic Server 4.0, then you already have the 'Office Open XML Extract' pipeline available in the Content Processing Framework (CPF) to extract XML pieces from Open XML packages when you save them to the Server. The parts will be saved in a directory named for the original file. Within this directory, the individual pieces will maintain the directory structure they had in the originating package. So if we save HelloWorld.docx to the Server and the pipeline is activated, we'd save the original .docx package, as well as the following files:


/HelloWorld_docx_parts/[Content_Types].xml
/HelloWorld_docx_parts/_rels/.rels
/HelloWorld_docx_parts/docProps/app.xml
/HelloWorld_docx_parts/docProps/core.xml
/HelloWorld_docx_parts/word/_rels/document.xml.rels
/HelloWorld_docx_parts/word/document.xml
/HelloWorld_docx_parts/word/fontTable.xml
/HelloWorld_docx_parts/word/settings.xml
/HelloWorld_docx_parts/word/styles.xml
/HelloWorld_docx_parts/word/theme/theme1.xml
/HelloWorld_docx_parts/word/webSettings.xml

For this example, we'll assume we're using MarkLogic Server 4.0. After ensuring that the 'Office Open XML Extract' pipeline has been enabled in CPF, open a new Word document, type in some text, and save it to the Server as 'HelloWorld.docx'. The parts of the package will be saved in a directory named '/HelloWorld_docx_parts'. We're going to take these pieces and serialize them as a single <pkg:package>. You can just cut and paste the code below into CQ to try it out.


xquery version "1.0-ml";
declare namespace ooxml = "http://marklogic.com/openxml";
declare namespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";

declare function ooxml:get-part-content-type($uri as xs:string) as xs:string?
{
   if(fn:ends-with($uri,".rels"))
   then 
       "application/vnd.openxmlformats-package.relationships+xml"
   else if(fn:ends-with($uri,"document.xml"))
   then
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" 
   else if(fn:matches($uri, "theme\d+\.xml"))
   then 
      "application/vnd.openxmlformats-officedocument.theme+xml"
   else if(fn:ends-with($uri,"settings.xml"))
   then 
      "application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"
   else if(fn:ends-with($uri,"styles.xml"))
   then 
      "application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"
   else if(fn:ends-with($uri,"webSettings.xml"))
   then 
      "application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"
   else if(fn:ends-with($uri,"fontTable.xml"))
   then 
      "application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"
   else if(fn:ends-with($uri,"docProps/core.xml"))
   then
        "application/vnd.openxmlformats-package.core-properties+xml"
   else if(fn:ends-with($uri,"docProps/app.xml"))
   then
       "application/vnd.openxmlformats-officedocument.extended-properties+xml"
   else
       ()
    
};

declare function ooxml:get-part-attributes($uri as xs:string) as node()*
{

  let $name := attribute pkg:name{$uri}
  let $contenttype := attribute pkg:contentType{ooxml:get-part-content-type($uri)}
  let $padding := if(fn:ends-with($uri,".rels")) then

                     if(fn:starts-with($uri,"/_rels")) then
                      attribute pkg:padding{ "512" }
                     else    
                      attribute pkg:padding{ "256" }

                  else
                     ()
  
  return ($name, $contenttype, $padding)
};

declare function ooxml:get-package-part($directory as xs:string, $uri as xs:string) as node()?
{
  let $docuri := fn:concat("/",fn:substring-after($uri,$directory))
  let $data := doc($uri)

  let $part := if(fn:empty($data) or fn:ends-with($uri,"[Content_Types].xml")) then () 
               else
                  element pkg:part { ooxml:get-part-attributes($docuri), element pkg:xmlData { $data }}
  return $part 
};

declare function ooxml:make-package($directory as xs:string, $uris as xs:string*) as node()*
{
  let $package := element pkg:package { 
                            for $uri in $uris
                            let $part := ooxml:get-package-part($directory,$uri)
                            return $part }
  return $package
};

 let $directory := "/HelloWorld_docx_parts/"
 let $uris := cts:uris("","document",cts:directory-query($directory,"infinity"))
 return  ooxml:make-package($directory, $uris) 


It may look like there's a lot going on here, but it's actually pretty simple. Most of the code is one large function that assigns content types to the various files. Let's start at the bottom of the example and work our way up to see what's really happening.

First, we get the uris for the files saved in our directory by calling cts:uris constrained by a directory query. (Note that cts:uris requires the URI lexicon to be enabled on the database.) We pass this sequence of uris to a function: ooxml:make-package.

In ooxml:make-package, we have the pkg:package element constructor, which calls a function ooxml:get-package-part for each uri in the package.

Finally, the function ooxml:get-package-part returns the element constructor for pkg:part. If the file is [Content_Types].xml, we ignore it. Within the constructor we call another function, ooxml:get-part-attributes, to set the attributes @pkg:name and @pkg:contentType for the <pkg:part>. If the uri references the XML for a .rels file from within the package, we also have to set the @pkg:padding attribute. The content type for the piece is determined from the part's uri in the function ooxml:get-part-content-type.
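
Once all the parts live under one root, a constraint that spans several parts becomes a single path expression. The following is a minimal sketch, not part of the original example: local:paragraphs-by is a hypothetical helper, the creator name is made up, and the inline $package stands in for the output of ooxml:make-package. It returns the text of each paragraph in /word/document.xml, but only when /docProps/core.xml names the given creator.

```xquery
xquery version "1.0-ml";
declare namespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";
declare namespace w   = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
declare namespace cp  = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
declare namespace dc  = "http://purl.org/dc/elements/1.1/";

(: Hypothetical helper: paragraph text from a package, filtered by creator. :)
declare function local:paragraphs-by($package as element(pkg:package),
                                     $creator as xs:string) as xs:string*
{
  if ($package/pkg:part[@pkg:name = "/docProps/core.xml"]
              /pkg:xmlData/cp:coreProperties/dc:creator = $creator)
  then
    for $p in $package/pkg:part[@pkg:name = "/word/document.xml"]
                      /pkg:xmlData/w:document/w:body/w:p
    (: Join the text runs of one paragraph into a single string. :)
    return fn:string-join($p//w:t, "")
  else ()
};

(: Illustrative stand-in for a package built by ooxml:make-package. :)
let $package :=
  <pkg:package>
    <pkg:part pkg:name="/docProps/core.xml"
              pkg:contentType="application/vnd.openxmlformats-package.core-properties+xml">
      <pkg:xmlData>
        <cp:coreProperties><dc:creator>Jane Author</dc:creator></cp:coreProperties>
      </pkg:xmlData>
    </pkg:part>
    <pkg:part pkg:name="/word/document.xml"
              pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml">
      <pkg:xmlData>
        <w:document><w:body><w:p><w:r><w:t>Hello, World.</w:t></w:r></w:p></w:body></w:document>
      </pkg:xmlData>
    </pkg:part>
  </pkg:package>
return local:paragraphs-by($package, "Jane Author")
```

With the parts stored as separate documents, the same question needs two lookups tied together by directory name. Here both conditions are evaluated against one document.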

OK, that's fun. But what if we'd like to open our XML directly into Word 2007? Just replace the line:
'return ooxml:make-package($directory,$uris)'
at the end of the ooxml:make-package example with the code below. Save the example to a file named HelloWorld.xqy in your /Docs directory on the server, then open your favorite browser and navigate to 'http://localhost:8000/HelloWorld.xqy'. The document will open directly in Word.


 let $filename :=  "HelloWorld.xml"
 let $package :=  (<?mso-application progid="Word.Document"?>,ooxml:make-package($directory, $uris))
 let $disposition := concat("attachment; filename=""",$filename,"""")
 let $x := xdmp:add-response-header("Content-Disposition", $disposition)
 let $x := 
   xdmp:set-response-content-type("application/vnd.openxmlformats-officedocument.wordprocessing.xml")
 return $package

This example is by no means a comprehensive solution for transforming all Word 2007 documents stored in MarkLogic Server to the OPC format. For one thing, there are many other content types to consider. The other content types and their relationships can be found sprinkled throughout Part 1 of the spec; however, a much friendlier summary can be found here. Part 2 of the spec contains all the details on OPC.
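
One way to cover more content types without growing the if/else chain is to read them from the package's own [Content_Types].xml, which lists a Default content type per file extension and an Override per part name. The sketch below is an assumption-laden illustration rather than a drop-in replacement: local:content-type is a hypothetical helper, and the inline $types is a cut-down example of what [Content_Types].xml contains.

```xquery
xquery version "1.0-ml";
declare namespace ct = "http://schemas.openxmlformats.org/package/2006/content-types";

(: Hypothetical helper: an Override for the exact part name wins;
   otherwise fall back to the Default for the file extension. :)
declare function local:content-type($types as element(ct:Types),
                                    $part-name as xs:string) as xs:string?
{
  let $override  := $types/ct:Override[@PartName = $part-name]
  let $extension := fn:lower-case(fn:tokenize($part-name, "\.")[fn:last()])
  let $default   := $types/ct:Default[fn:lower-case(@Extension) = $extension]
  return
    if ($override) then fn:string($override[1]/@ContentType)
    else if ($default) then fn:string($default[1]/@ContentType)
    else ()
};

(: Cut-down, illustrative [Content_Types].xml. :)
let $types :=
  <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
    <Default Extension="rels"
             ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
    <Default Extension="jpeg" ContentType="image/jpeg"/>
    <Override PartName="/word/document.xml"
              ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
  </Types>
return (
  local:content-type($types, "/word/document.xml"),
  local:content-type($types, "/_rels/.rels"),
  local:content-type($types, "/word/media/image1.jpeg")
)
```

In the HelloWorld example, $types could presumably be read from the extracted '/HelloWorld_docx_parts/[Content_Types].xml' document instead of being written inline, and the helper called from ooxml:get-part-attributes in place of ooxml:get-part-content-type.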

We hope the example is helpful and maybe sparks some ideas. We've found the OPC format to be very useful and think you may too. Let us know what you think. But before we go, we have to mention how images are handled.

Images and Binary Parts

A special characteristic of the OPC format is that binary parts are stored as base64-encoded strings. The string must be broken into lines of 76 characters, and there must not be a line break at the beginning or end of the data. It's no big deal; an example of what the XML looks like follows.


   <pkg:part pkg:name="/word/media/image1.jpeg" 
             pkg:contentType="image/jpeg" 
             pkg:compression="store">
       <pkg:binaryData>/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAkGBwgHBgkIBwgKCgkLDRYPDQwMDRsUFRAWIB0iIiAd
                       Q9wT2Ujace5FHrXBNO5+mVJ8R9cGhyavaaZb3dhA22WdmIK8ZySuVOOxxjuOO+Kuw8Z9Ru7u3tzp	
                       FoomlWMSLKzKMnH7FTNdtL8QXdrNotxLpcoa5a0nuYovIfdg+T5JJZcHcQ6nG3PcmlcmmLY9VWWy
                       RyaKKTsB99P9L6bYWquVku5JVDM12wkxnnABGAOatf4Vp3+Atf8AZX/iiinAmUUUUAf/2Q==</pkg:binaryData>
    </pkg:part>

For a binary part, the <pkg:part> element requires the attribute @pkg:compression, and @pkg:contentType must be set to the appropriate content type (image/jpeg here). Instead of the child element <pkg:xmlData>, we use the element <pkg:binaryData>.
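
As a rough sketch of how such a part might be built with the same computed constructors used earlier: local:binary-part is a hypothetical helper, and the short base64 string passed to it is a placeholder, not a real image. The function assumes the encoded string has already been broken into 76-character lines.

```xquery
xquery version "1.0-ml";
declare namespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";

(: Hypothetical helper: wraps an already-formatted base64 string
   in a <pkg:part> for a binary piece of the package. :)
declare function local:binary-part($name as xs:string,
                                   $content-type as xs:string,
                                   $encoded as xs:string) as element(pkg:part)
{
  element pkg:part {
    attribute pkg:name { $name },
    attribute pkg:contentType { $content-type },
    attribute pkg:compression { "store" },
    element pkg:binaryData { $encoded }
  }
};

(: Placeholder data only; a real call would pass the encoded image. :)
local:binary-part("/word/media/image1.jpeg", "image/jpeg", "/9j/4AAQSkZJRg==")
```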

If you're interested in trying this out for yourself, assuming you have a .jpeg stored in your document, the following should help get you started.


 let $img := doc("/imageTest_docx_parts/word/media/image1.jpeg")
 let $binstring :=   xs:base64Binary(xs:hexBinary($img)) cast as xs:string 
 return $binstring

All that's left is to format the string. Some simple recursion and fn:substring can take care of that. One thing we've found useful about encoding images and binary parts in the OPC format is that when we open our XML in Word and save it as a .docx, the parts are materialized in their respective locations within the .docx package.
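
Here is one possible shape for that formatting step, using plain XQuery 1.0 functions. Both function names are hypothetical. local:wrap is the recursive version described above. local:wrap-flat produces the same result without recursion, which may be the safer choice for large images, since the recursive version goes one call deeper for every 76 characters.

```xquery
xquery version "1.0-ml";

(: Recursive version: emit the first $width characters, a line feed,
   then wrap the remainder. No break is added before the first line
   or after the last one. :)
declare function local:wrap($s as xs:string, $width as xs:integer) as xs:string
{
  if (fn:string-length($s) le $width)
  then $s
  else fn:concat(fn:substring($s, 1, $width), "&#10;",
                 local:wrap(fn:substring($s, $width + 1), $width))
};

(: Non-recursive version: cut the string into chunks and join them. :)
declare function local:wrap-flat($s as xs:string, $width as xs:integer) as xs:string
{
  fn:string-join(
    for $i in (0 to (fn:string-length($s) - 1) idiv $width)
    return fn:substring($s, $i * $width + 1, $width),
    "&#10;")
};

(: A short string and a width of 4 keep the result easy to check by eye;
   for OPC binary data the width would be 76. :)
(local:wrap("ABCDEFGHIJ", 4), local:wrap-flat("ABCDEFGHIJ", 4))
```

Both calls return the same three-line string: "ABCD", "EFGH", and "IJ", separated by line feeds. Calling either function on $binstring with a width of 76 gives the text to place inside <pkg:binaryData>.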

Conclusion

So the answer to the age-old question "Can a Word 2007 .docx package be serialized as a single XML document?" is a resounding YES! The OPC format gives us an alternate way to serialize both Word and PowerPoint documents in Office 2007. This can be useful for many reasons. A couple that come to mind are simplifying document management on the server, and simplifying and optimizing our queries for search and content repurposing. If the system of delivery for a Word 2007 document is a .docx package, then maybe zipping up those multiple pieces into a .docx makes sense. But if we're building content applications that deliver our content through a browser, or if we have many rich transformations we want to apply to our documents, then the OPC format may be appropriate for our solutions. Office 2007 is a producer and consumer of XML, but more producers and consumers are out there, opening up whole new worlds of opportunity for document assembly and delivery.

We hope you found this post useful. If you have any questions, comments, or suggestions for these posts, please let us know on the general discussion mailing list.

And now as a refresher, and in an act of shameless self-promotion, if you'd like to know more about MarkLogic Server and Office 2007, our previous posts are listed below. Until next time, cheers!

  1. Office Logic (an intro to Office 2007 and the Open XML formats )
  2. Excel-ing with XQuery
  3. Getting OOXML into MarkLogic
  4. Running (a.k.a. <w:r>-ing) with Word
  5. Enriching Word Documents with <w:customXml>
  6. A Final Word
