A Final "Word"

by Pete Aven

Part 6 in a series on MarkLogic Server and Office 2007

Categories: Office 2007
Altitude: 1,000 feet

Today we look at another way of enriching Word documents in Office 2007. From a user perspective, we'll be looking at Content Controls. From the developer perspective, we'll be discussing the "structured document tag", a.k.a. <w:sdt>. A user of Word will have no insight into what that tag is or what it means, and they don't have to. Users modify their Word documents using Content Controls, and whenever they add a Content Control to a document, a <w:sdt> node is created within the document.xml inside the .docx package. The document we can create by default in Word and the one we can create using XQuery and <w:sdt> tags will differ greatly, though. Using XQuery and MarkLogic Server to create and update our Word documents with <w:sdt> opens up many very exciting opportunities, so let's get started.

This is the final post (for now) of our series on Office 2007 and MarkLogic Server. We do hope you've enjoyed it and that it's been useful to you. As a reward for your faithful viewing, we were going to make all the previous entries available in a limited collector's edition DVD set, but then thought: "What the heck?! Let's just reference them all below for everyone's convenience." So here you go:

  1. Office Logic (an intro to Office 2007 and the Open XML formats)
  2. Excel-ing with XQuery
  3. Getting OOXML into MarkLogic
  4. Running (a.k.a. <w:r>-ing) with Word
  5. Enriching Word Documents with <w:customXml>

Similar to <w:customXml>, we can use <w:sdt> with block-level and inline elements. In Word, <w:customXml> gives us a visualization of our custom nodes within a tree view in the XML Structure Pane and the ability to turn on/off the tag visualization within the document from the same Pane. Structured Document tags won't show up in the pane, and the visualization within Word is slightly different in appearance and access, so let's start our examination from within Office 2007 and drill down into the XML.

Introduction to Content Controls and <w:sdt>

Open Word and go to the Developer tab. There you'll see a "Controls" group. This group contains a palette of Content Controls available to us. We'll be focusing on a text control for our example, so in the Controls group go ahead and click the first "Aa" button in the upper-left corner of the group. A tabbed box will appear in Word prompting you to "Click here to enter text".

Go ahead and enter some text. Next, while the tabbed box is still visible around your text, click the "Properties" button, right under "Design Mode" in the "Controls" group of the Developer tab. In the dialog box that opens, enter values for "Title" and "Tag". Both values will be saved in the XML, but only the Title will be displayed on the Control in Word unless you enter Design Mode. Click "OK", and then click "Design Mode". Our tags will then be highlighted, similar to how we saw <w:customXml> displayed.

If we were to save the document and open the document.xml in the .docx package, we'd see XML similar to the following:


     <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
       <w:body>
         <w:sdt>
           <w:sdtPr>
             <w:alias w:val="this is a test"/>
             <w:tag w:val="first"/>
             <w:showingPlcHdr/>
           </w:sdtPr>
           <w:sdtContent>
             <w:p>
               <w:r>
                 <w:t>Test.</w:t>
               </w:r>
             </w:p>
           </w:sdtContent>
         </w:sdt>
       </w:body>
     </w:document>

The Structured Document node (<w:sdt>) is created when the Content Control is created. The <w:sdt> node has 2 sections, a properties section (<w:sdtPr>) and the content section (<w:sdtContent>).

  • In the properties, we see the alias and tag values. The alias element is used for user reference, and the tag value for developer reference, but since we know they're both here, we could use either for queries once we've saved the document in our MarkLogic Server. Finally, we see a flag saying the content is showing placeholder data (<w:showingPlcHdr>). It's not really, as we saved the text we wanted to save, and if we were creating this document manually, we could leave this element out, but Word puts it there when we save. Why is it there? It has to do with databinding. When used for databinding, the data identified this way will not be stored in a separate XML file within our .docx package (what Microsoft refers to as a "custom business part"). If the data were really intended to be a placeholder, as when we saw "Click here to enter text" when we created our Control, we'd find both the <w:showingPlcHdr> element and a property for the run (<w:rStyle w:val="PlaceholderText"/>).
  • The content for the document tag is specified within the <w:sdtContent> element. Here you could find the placeholder text for empty elements, or the actual data (that we now know might still be identified as placeholder data. Nice!).
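
The point about querying on either value can be sketched in XQuery. This is only a minimal illustration, written in the same dialect as the listings later in this post (in later XQuery dialects the namespace declaration would end with a semicolon). The inline $doc stands in for a document.xml you've saved to MarkLogic Server, and the <control> wrapper element is made up for the example.

```xquery
declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

(: Assumption: $doc stands in for a saved document.xml; in practice
   you would fetch it with doc() from wherever you stored it. :)
let $doc :=
  <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
      <w:sdt>
        <w:sdtPr>
          <w:alias w:val="this is a test"/>
          <w:tag w:val="first"/>
        </w:sdtPr>
        <w:sdtContent>
          <w:p><w:r><w:t>Test.</w:t></w:r></w:p>
        </w:sdtContent>
      </w:sdt>
    </w:body>
  </w:document>

(: Find Controls by their developer-facing tag. Swapping w:tag for
   w:alias would match on the user-facing Title instead. :)
for $sdt in $doc//w:sdt[w:sdtPr/w:tag/@w:val = "first"]
return
  <control title="{ string($sdt/w:sdtPr/w:alias/@w:val) }">
    { string-join(for $t in $sdt/w:sdtContent//w:t return string($t), "") }
  </control>
```

For the sample above, this should return one <control> element carrying the Control's Title and its text.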

I've stripped out some of the extra elements and attributes that were created when we saved this. As always, we suggest trying functionality out in Word for yourself, saving the documents you create, and examining the XML created to better your understanding of the Open XML formats. We're focusing on Text Controls for this example, but there are several controls available to us in the Controls group. Try adding Content Controls of different types, saving, and examining the XML in the .docx package to see what properties and content are added to your documents. Also, we saw other options available in the Content Control Properties dialog box when we assigned the Title and Tag values. You may want to experiment with these as well.

Note that the finest level of granularity for markup using <w:customXml> is the run (<w:r>) and the finest level of granularity for markup using <w:sdt> is text (<w:t>). In addition, we can place Content Controls within other Content Controls. You may want to consider what this last point means for the underlying XML, especially when we discuss databinding.
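
To make the nesting point concrete, here is a hand-written sketch (not output saved from Word) of an inline Control sitting inside a block-level Control, with a query that picks out the nested one. The tag values "outer" and "inner" are invented for the example.

```xquery
declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

(: Assumption: a hand-built body with one Control nested in another. :)
let $body :=
  <w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:sdt>
      <w:sdtPr><w:tag w:val="outer"/></w:sdtPr>
      <w:sdtContent>
        <w:p>
          <w:r><w:t>Dear</w:t></w:r>
          <w:sdt>
            <w:sdtPr><w:tag w:val="inner"/></w:sdtPr>
            <w:sdtContent>
              <w:r><w:t>Pete</w:t></w:r>
            </w:sdtContent>
          </w:sdt>
        </w:p>
      </w:sdtContent>
    </w:sdt>
  </w:body>

(: Select only the Controls that live inside another Control and
   return their tag values. :)
for $sdt in $body//w:sdt[ancestor::w:sdt]
return string($sdt/w:sdtPr/w:tag/@w:val)
```

Only the nested Control's tag comes back. Keep in mind that a path such as //w:t evaluated under the outer Control's <w:sdtContent> also reaches into the inner Control, so a query for the outer Control's text picks up the nested text too.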

Once again, we focus on leveraging the XML within Office 2007 documents to quickly do something useful with it. Office 2007 is now a consumer and publisher of XML. You'll find that the XML you can feed to Word to display a document may not be the same XML it publishes when you save. Also, the XML you can feed to any given Open XML consumer may differ and still get the same document displayed in the application as a result. Bottom line, there's a lot to wrap your head around (trust us, we know), so we aim to keep it simple. When you're ready to dive into the deep end of the Open XML pool, the final word on all the Open XML formats can be found here.

Databinding

Our example thus far has been very simple. In reality, a document that uses Content Controls may have hundreds of Controls present. Controls can be used to cue users to provide data when filling out a form (think of a medical or legal record), or to bind the values to the Controls in document.xml from separate XML files present within our .docx package. Remember, an Office 2007 document is now just a bag of XML. We can actually zip up our own custom XML parts within the package and reference them through Structured Document (<w:sdtPr>) properties using XPath.

So I'm saying we can have a document.xml in our .docx package that actually gets its data from somecustomfile.xml within the .docx package, referenced in document.xml using <w:sdt> and an XPath expression. To which you say: Well, that's the developer perspective, but how can I map the content in my Word document to use these separate XML data islands you speak of from within my Word application? Well, with Word 2007 alone, you can't.

In Word 2007, if we create text Content Controls, we can enter text into the Controls, but the content will always be in document.xml. We can't map the content to a separate custom XML part unless we're using SharePoint. If we don't have SharePoint, there's also the Word 2007 Content Control Toolkit, or the Databinding Toolkit for Word, both available on CodePlex. Since we're creating our Word documents using MarkLogic Server, it's simple to just map the content ourselves. That's what we're going to do, as this is actually pretty sweet.

To show you what this will look like using more than one custom XML part, we'll create 2 simple documents to add to our Word document. Within our .docx package, we'll be creating a new folder, customXML, and we'll be adding the following 2 documents to it as item1.xml and item2.xml. (This example's really going to build off of our first post, so if you have that XQuery available, you can just start adding this to it.)


     let $custompart1 :=
       <pete:testnode1 xmlns:pete="http://foo">
         <pete:messages>
           <pete:message1>Hello World!</pete:message1>
         </pete:messages>
       </pete:testnode1>

     let $custompart2 :=
       <ml:testnode2 xmlns:ml="http://bar">
         <ml:message>     For more information on XQuery and MarkLogic Server, 
                      remember to stay tuned to developer.marklogic.com. 
         </ml:message>
       </ml:testnode2>

The next step is to relate document.xml to the custom parts. The following node will be added as word/_rels/document.xml.rels within the .docx package.


     let $docrels := 
       <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
         <Relationship Id="rId3" 
                       Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" 
                       Target="../customXML/item1.xml"/>
         <Relationship Id="rId4" 
                       Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" 
                       Target="../customXML/item2.xml"/>
       </Relationships>
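
If a package carries more than a couple of custom parts, writing each <Relationship> by hand gets tedious. The sketch below shows one way the same node could be generated. The function name custom-part-rels is hypothetical, and it assumes the parts are named item1.xml, item2.xml, and so on, with Ids starting at rId3 as in the node above.

```xquery
(: Hypothetical helper: builds the content of word/_rels/document.xml.rels
   for $count custom XML parts named item1.xml, item2.xml, ...
   Assumption: Ids start at rId3 to match the hand-written node. :)
define function custom-part-rels($count as xs:integer) as element()
{
  <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  {
    for $i in 1 to $count
    return
      <Relationship Id="{ concat('rId', string($i + 2)) }"
                    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml"
                    Target="{ concat('../customXML/item', string($i), '.xml') }"/>
  }
  </Relationships>
}

(: Two parts gives the same two relationships as $docrels. :)
custom-part-rels(2)
```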

Ok, since we're allowed to store multiple custom XML parts within our package, there can be issues identifying which one to bind <w:sdt> to. Let's say our custom parts above had the same structure. By default, Office 2007 will bind to the first custom XML part it finds that maps to the XPath expression provided in the structured document tag (<w:sdt>). To distinguish the custom pieces and ensure we bind to the data we intend to, we add an identifier called the store item ID. The ID is attached to our custom XML files by using a properties file. The properties file defines two things: the ID of the custom XML part and the XML schema for the part. Since we have 2 custom XML parts, we'll create 2 properties parts. These will be named itemProps1.xml and itemProps2.xml. We'll place them within /customXML inside the .docx package. Also notice how we've escaped the curly braces around each ID by doubling them, so that XQuery treats them as text rather than as an enclosed expression.


     let $datastoreitem1 :=
       <ds:datastoreItem ds:itemID="{{6804B95E-15C1-4294-9FA8-D33AC6EFBA10}}" 
                         xmlns:ds="http://schemas.openxmlformats.org/officedocument/2006/2/customXml">
         <ds:schemaRefs>
           <ds:schemaRef ds:uri="http://foo" />
         </ds:schemaRefs>
       </ds:datastoreItem>

     let $datastoreitem2 :=
       <ds:datastoreItem ds:itemID="{{65AEDF81-0ED8-4401-A00F-72507B473A65}}" 
                         xmlns:ds="http://schemas.openxmlformats.org/officedocument/2006/2/customXml">
         <ds:schemaRefs>
           <ds:schemaRef ds:uri="http://bar" />
         </ds:schemaRefs>
       </ds:datastoreItem>
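
Since the two properties parts differ only in their ID and schema URI, they could also be produced by a small function. This is a sketch under assumptions: the function name properties-part is made up, and it takes the GUID as a bare string without braces. It also shows how literal braces and an evaluated expression can sit side by side in one attribute value.

```xquery
(: Hypothetical helper: builds an itemPropsN.xml node.
   In the itemID attribute, "{{" and "}}" emit literal braces, while
   the inner { $id } is an enclosed expression that gets evaluated. :)
define function properties-part($id as xs:string, $schema-uri as xs:string) as element()
{
  <ds:datastoreItem ds:itemID="{{{ $id }}}"
                    xmlns:ds="http://schemas.openxmlformats.org/officedocument/2006/2/customXml">
    <ds:schemaRefs>
      <ds:schemaRef ds:uri="{ $schema-uri }"/>
    </ds:schemaRefs>
  </ds:datastoreItem>
}

(: Equivalent to the first properties part above. :)
properties-part("6804B95E-15C1-4294-9FA8-D33AC6EFBA10", "http://foo")
```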

Finally, we relate each custom XML part with its properties part. The following nodes will be added to customXML/_rels within the package as item1.xml.rels and item2.xml.rels.


     let $itemrels1 :=
       <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
         <Relationship Id="rId1" 
                       Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXmlProps" 
                       Target="itemProps1.xml"/>
       </Relationships>

     let $itemrels2 :=
       <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
         <Relationship Id="rId1" 
                       Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXmlProps" 
                       Target="itemProps2.xml"/>
       </Relationships>


Putting it all Together

The only thing left to do is reference the custom parts from within document.xml. Check $document below to see what this looks like. You'll find a <w:dataBinding> element within the structured document properties. The @w:storeItemID is actually optional, but remember the default behavior for binding we pointed out above. We're going to create our document from scratch, so we just create our nodes, zip them up, and open in Word.

Place the following in a file named customXml.xqy and place it under /Docs in your MarkLogic Server. Open your favorite browser and navigate to http://localhost:8000/customXml.xqy. If you have Office 2007 installed, this will open directly in the application. Remember that we split some items across lines for readability; you may have to place elements on a single line to get the formatting and output you expect.


     declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
     declare namespace ds="http://schemas.openxmlformats.org/officedocument/2006/2/customXml"

     define function ml-generate-openxml-package(
      $content-types as node(),
      $rels as node(),
      $document as node(),
      $docrels as node(),
      $itemrels1 as node(),
      $itemrels2 as node(),
      $datastoreitem1 as node()*,
      $datastoreitem2 as node()*,
      $custompart1 as node(),
      $custompart2 as node()
     ) as binary()
     {
       let $manifest := <parts xmlns="xdmp:zip"> 
                                  <part>[Content_Types].xml</part>
                                  <part>_rels/.rels</part> 
                                  <part>word/document.xml</part> 
                                  <part>word/_rels/document.xml.rels</part> 
                                  <part>customXML/_rels/item1.xml.rels</part> 
                                  <part>customXML/_rels/item2.xml.rels</part> 
                                  <part>customXML/itemProps1.xml</part> 
                                  <part>customXML/itemProps2.xml</part> 
                                   <part>customXML/item1.xml</part>
                                   <part>customXML/item2.xml</part>
                         </parts>
       let $parts := ($content-types, $rels, $document, $docrels,
                      $itemrels1, $itemrels2, $datastoreitem1,
                      $datastoreitem2, $custompart1, $custompart2)
       return
         xdmp:zip-create($manifest, $parts)
     }

     let $content-types :=
       <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
         <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml" />
         <Default Extension="xml" ContentType="application/xml" />
         <Override PartName="/word/document.xml" 
                   ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" />
       </Types>

     let $rels :=
       <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
         <Relationship Id="rId1" 
                       Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" 
                       Target="word/document.xml"/>
       </Relationships>

     let $itemrels1 :=
       <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
         <Relationship Id="rId1" 
                       Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXmlProps" 
                       Target="itemProps1.xml"/>
       </Relationships>

     let $itemrels2 :=
       <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
         <Relationship Id="rId1" 
                       Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXmlProps" 
                       Target="itemProps2.xml"/>
       </Relationships>


     let $datastoreitem1 :=
       <ds:datastoreItem ds:itemID="{{6804B95E-15C1-4294-9FA8-D33AC6EFBA10}}" 
                         xmlns:ds="http://schemas.openxmlformats.org/officedocument/2006/2/customXml">
         <ds:schemaRefs>
           <ds:schemaRef ds:uri="http://foo" />
         </ds:schemaRefs>
       </ds:datastoreItem>

     let $datastoreitem2 :=
       <ds:datastoreItem ds:itemID="{{65AEDF81-0ED8-4401-A00F-72507B473A65}}" 
                         xmlns:ds="http://schemas.openxmlformats.org/officedocument/2006/2/customXml">
         <ds:schemaRefs>
           <ds:schemaRef ds:uri="http://bar" />
         </ds:schemaRefs>
       </ds:datastoreItem>


     let $custompart1 :=
       <pete:testnode1 xmlns:pete="http://foo">
         <pete:messages>
           <pete:message1>Hello World!</pete:message1>
         </pete:messages>
       </pete:testnode1>

     let $custompart2 :=
       <ml:testnode2 xmlns:ml="http://bar">
         <ml:message>     For more information on XQuery and MarkLogic Server, 
                          remember to check developer.marklogic.com regularly. </ml:message>
       </ml:testnode2>

     let $docrels := 
       <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
         <Relationship Id="rId3" 
                       Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" 
                       Target="../customXML/item1.xml"/>
         <Relationship Id="rId4" 
                       Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" 
                       Target="../customXML/item2.xml"/>
       </Relationships>

     let $document :=
       <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"  
                   xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
         <w:body>
           <w:p>
             <w:sdt>
               <w:sdtPr>
                 <w:alias w:val="pete's message" />
                 <w:tag w:val="message" />
                 <w:dataBinding w:prefixMappings="xmlns:pete='http://foo'"
                                w:xpath="/pete:testnode1/pete:messages/pete:message1[1]"
                                w:storeItemID="{{6804B95E-15C1-4294-9FA8-D33AC6EFBA10}}"/>
                 <w:text/>
               </w:sdtPr>
               <w:sdtContent>
                 <w:r>
                   <w:t>Click here to enter text.</w:t>
                 </w:r>
               </w:sdtContent>
             </w:sdt>
           </w:p>
           <w:p>
             <w:sdt>
               <w:sdtPr>
                 <w:alias w:val="ML message" />
                 <w:tag w:val="message" />
                 <w:dataBinding w:prefixMappings="xmlns:ml='http://bar'" 
                                w:xpath="/ml:testnode2/ml:message[1]"
                                w:storeItemID="{{65AEDF81-0ED8-4401-A00F-72507B473A65}}"/>
                 <w:text/>
               </w:sdtPr>
               <w:sdtContent>
                 <w:r>
                   <w:t>Click here to enter text.</w:t>
                 </w:r>
               </w:sdtContent>
             </w:sdt>
           </w:p>
         </w:body>
       </w:document>

     let $package := ml-generate-openxml-package($content-types, $rels, $document, $docrels, 
                                                 $itemrels1, $itemrels2, $datastoreitem1, $datastoreitem2, 
                                                 $custompart1, $custompart2)
     let $filename := "customXmlExample.docx"
     let $disposition := concat("attachment; filename=""", $filename, """")
     let $x := xdmp:add-response-header("Content-Disposition", $disposition)
     let $x := xdmp:set-response-content-type("application/vnd.openxmlformats-officedocument.wordprocessingml.document")
     return
       $package
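
The two bound paragraphs in $document differ only in their values, so they could also be generated rather than typed out. The following is a sketch, not part of the listing above: bound-control is a hypothetical function name, and it assumes the store item ID is passed as a bare GUID string without braces.

```xquery
declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

(: Hypothetical helper: one paragraph holding a text Content Control
   bound to a custom XML part. :)
define function bound-control(
  $title as xs:string,
  $tag as xs:string,
  $prefix-mappings as xs:string,
  $xpath as xs:string,
  $store-item-id as xs:string
) as element()
{
  <w:p>
    <w:sdt>
      <w:sdtPr>
        <w:alias w:val="{ $title }"/>
        <w:tag w:val="{ $tag }"/>
        <w:dataBinding w:prefixMappings="{ $prefix-mappings }"
                       w:xpath="{ $xpath }"
                       w:storeItemID="{{{ $store-item-id }}}"/>
        <w:text/>
      </w:sdtPr>
      <w:sdtContent>
        <w:r>
          <w:t>Click here to enter text.</w:t>
        </w:r>
      </w:sdtContent>
    </w:sdt>
  </w:p>
}

(: Rebuilds the body of $document from the listing. :)
<w:body>
{
  bound-control("pete's message", "message", "xmlns:pete='http://foo'",
                "/pete:testnode1/pete:messages/pete:message1[1]",
                "6804B95E-15C1-4294-9FA8-D33AC6EFBA10"),
  bound-control("ML message", "message", "xmlns:ml='http://bar'",
                "/ml:testnode2/ml:message[1]",
                "65AEDF81-0ED8-4401-A00F-72507B473A65")
}
</w:body>
```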

Two-Way Databinding

We've just created our first Word document using Custom XML Parts, XQuery, and MarkLogic Server. Using <w:sdt> with databinding is powerful stuff. We can now create a document that's composed entirely of custom XML pieces. But it gets even better. If you've followed us this far, you should see Word open a document whose two Content Controls display the messages from our custom XML parts.

Close Word without saving and unzip the .docx package. Take a look at the custom XML pieces and document.xml in particular. Next, zip the package back up, open the document in Word, and go ahead and edit the text for one of the Controls. Save the document, then unzip the .docx package again. Take a look at the custom XML part for the Control whose value you edited. You'll find the value in the custom XML piece has changed as well. The databinding to Content Controls is two-way. So we have the option of populating the contents of a Word document from custom XML pieces, or likewise, we could prompt users to fill out empty Controls in a Word template, and when they save the document, the values will be saved in any custom XML pieces we've defined within the package.
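
If you'd rather not unzip by hand, the same inspection can be sketched in XQuery, assuming xdmp:zip-manifest and xdmp:zip-get are available in your version of MarkLogic Server. To keep the example self-contained, a one-part package built inline stands in for a saved .docx; with a real document you would pass the binary you loaded into the database instead.

```xquery
(: Assumption: this one-part package stands in for an edited .docx. :)
let $manifest :=
  <parts xmlns="xdmp:zip">
    <part>customXML/item1.xml</part>
  </parts>
let $part :=
  <pete:testnode1 xmlns:pete="http://foo">
    <pete:messages>
      <pete:message1>Hello World!</pete:message1>
    </pete:messages>
  </pete:testnode1>
let $package := xdmp:zip-create($manifest, $part)
return
  (
    (: The name of every part in the package. :)
    for $p in xdmp:zip-manifest($package)/* return string($p),
    (: The custom XML part itself, read back out as XML. :)
    xdmp:zip-get($package, "customXML/item1.xml")
  )
```

Running the same two calls against a package before and after an edit in Word is one way to compare the custom XML part's values.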

Databinding using <w:sdt> gives us the ability to repurpose content as well as create it. This is really exciting, especially when you consider all the tools MarkLogic Server and XQuery give us for making the most out of all this content.

A couple of random notes to consider on what we've done today: First, we assigned storeItemIDs to our pieces and referenced them in our document.xml. If we edit the content and save the .docx, Office 2007 assigns new IDs. You may want to consider that in any solution you're developing. Also, did you notice how the first time we opened the document.xml, the values from our custom XML parts were just referenced through XPath, but after editing and saving the document again, we found the XPath as well as the values from our custom pieces within document.xml? Finally, we named our custom XML pieces item1.xml and item2.xml and referenced them as such. You can name the files anything you'd like to, but the first time you save from within Word, the files will be renamed item1.xml, item2.xml, etc. and all the package references will be updated as well.

Conclusion

Content Controls and <w:sdt> present us with all sorts of exciting opportunities for content creation and re-use. In Word we can enrich our documents using either <w:customXml> or <w:sdt>. You'll find that <w:sdt> is the preferred method, as we can keep any custom XML markup we require directly in our custom XML pieces. But <w:customXml> may still be useful, especially if we want to add IDs or business semantics to portions of our document.xml that we wish to retain control over. For both elements, we've found that the XML we can create and feed to the Open XML consumer Office 2007 is not necessarily the XML we can create from within Word. But using MarkLogic Server and a few lines of XQuery, we can quickly make the most of our Word content and leverage all that the Office 2007 application has to offer us.

Thanks to all those who've followed our posts regularly over the last couple of months. I've enjoyed doing this and I'm happy to hear that people have found these useful. For those who use Office 2007, the Open XML formats present a great opportunity for storing, querying, enriching, and repurposing Word, Excel, and PowerPoint documents. XQuery and MarkLogic Server provide a powerful combination for managing those documents and making the most out of them. We mainly focused on Word in this series, but we're working on the other formats as well, so I wouldn't be surprised if there were more posts coming in the future. Until then, good luck with the solutions you're creating. For any and all XQuery and MarkLogic questions, I suggest signing up for the discussion and announcement mailing lists.

Cheers!
Pete
