Enriching Word Documents with <w:customXml>

by Pete Aven

Part 5 in a series on MarkLogic Server and Office 2007

Categories: Office 2007
Altitude: 1,000 feet

Welcome back! Today we look at how to enrich Word 2007 documents. Since in Office 2007 all Word documents are now just XML, it makes sense that we can easily add value to our documents by just inserting additional, more meaningful tags. This will allow us to add supplemental information as well as help us to identify the content we deem important so we can run the queries and analytics that give us value. But first, those just joining us may want to catch up with our previous posts. It will help to have some understanding of the Open XML formats before we begin. Also, we provide simple, digestible examples, but we're hoping you'll take the individual posts and put them all together to see the really big picture. Then you'll be able to take advantage of all the opportunities available when we unlock the XML content in our Office 2007 documents. Now in the voice of the narrator from "Lost": previously, on developer.marklogic.com...

  1. Office Logic (an intro to Office 2007 and the Open XML formats)
  2. Excel-ing with XQuery
  3. Getting OOXML into MarkLogic
  4. Running (a.k.a. <w:r>-ing) with Word

There are two ways of adding custom tags to elements or text in WordprocessingML: you can use either <w:customXml> or <w:sdt>. Today we're going to focus on <w:customXml>. We'll be discussing Content Controls and <w:sdt> in our next post. When to use which tag where is an "it depends" scenario. As you'll find out, each offers different functionality that can be helpful to you. The two differ in the XML, in how it's stored in the .docx package, and in how it's visualized within Office 2007. We'll point out the major differences throughout, but as always, we focus on how to quickly do something useful using XQuery, Open XML, and MarkLogic Server.

In a nutshell, <w:customXml> will allow us to mark up elements within our document.xml. This can be useful for adding metadata/business semantics (we mark up sections of the document for ingestion by another service or Content Management System) or for adding levels of granularity for search purposes (identifying all Entities within a document for search and/or analytics). The Structured Document tag ( <w:sdt> ) will allow us to mark up our document, but will also allow us to bind data values from within document.xml to separate XML data islands stored within our .docx package. When leveraged for databinding, this gives us the separation of presentation and data. Both tags present opportunities for us but require a certain level of examination to be useful. Today we focus on <w:customXml>.

Intro to <w:customXml>

These days, if we have an XML document and we want to associate some meaningful information with that document or its contents, we just add the information inline. So with a normal XML document, if I want to tag the name MarkLogic with "company" anywhere it appears in a document, it might look something like this.




      <document>
        <paragraph>
           <company>MarkLogic</company> is the provider of the industry's leading XML content server.
           <company>MarkLogic</company> provides solutions for many content processing challenges.
           Being used to solve a wide range of problems across many industries, MarkLogic
           Server lets you load, query, manipulate and render content to unlock its full value.
        </paragraph>
      </document>



So knowing what this paragraph would look like in our document.xml, can we do something like this and still have the document open successfully in Word?




       <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
         <w:body>
           <w:p>
             <w:r>
               <w:t><company>MarkLogic</company> is the provider of the industry's leading XML content server.
                    <company>MarkLogic</company> provides solutions for many content processing challenges.
                    Being used to solve a wide range of problems across many industries, MarkLogic
                    Server lets you load, query, manipulate and render content to unlock its full value.
               </w:t>
             </w:r>
           </w:p>
         </w:body>
       </w:document>



That would be NO. Adding arbitrary XML markup to a document is not allowed in WordprocessingML. There is a special tag, <w:customXml>, that can be used to define custom nodes. To identify our "company" node using this element, an attribute ( @w:element ) will be used. This allows the document to be validated using a schema (somewhat validated, not really though, more on this later). But if we edit the document.xml manually, or update it using XQuery in our MarkLogic Server, we don't need to define any schema to add our custom markup. Time for the plugs: To learn more about the Open XML formats quickly from an introductory perspective, get the free e-book. To dig deeper into the Open XML formats, the ECMA specifications can be found here.

Now, the <w:customXml> element does have one caveat: its finest level of granularity is the run ( <w:r> ). So we can't just replace the <company> tags above with <w:customXml w:element="company"> (with their associated closing tags) to get markup that will be accepted by the Office 2007 consumer. We have to do the following.

Note: the <w:t> tags have been split for readability. If you cut and paste this into an Office document, you may have unexpected formatting if you don't make the node a single line.




     <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:body>
       <w:p>
         <w:customXml w:element="company">
           <w:r>
             <w:t>MarkLogic</w:t>
           </w:r>
         </w:customXml>
         <w:r>
             <w:t xml:space="preserve"> is the provider of the industry's leading XML content server.  </w:t>
         </w:r>
         <w:customXml w:element="company">
           <w:r>
             <w:t>MarkLogic</w:t>
           </w:r>
         </w:customXml>
         <w:r>
             <w:t xml:space="preserve"> provides solutions for many content processing challenges.
                                        Being used to solve a wide range of problems across many industries, MarkLogic
                                        Server lets you load, query, manipulate and render content to unlock its full value.</w:t>
         </w:r>
       </w:p>
      </w:body>
     </w:document>



The <w:customXml> node becomes a sibling of <w:r> and any run it's marking up becomes its child. It may be helpful to remember that the run ( <w:r> ) is considered an inline element in WordprocessingML. Well, that's quite an eyeful, but we're not daunted at all. We can mark this up quickly and easily with just a few lines of XQuery, and that's exactly what we'll do. But first, let's take a look at what this looks like in Office 2007. If you want to play along at home, just take the last XQuery example from our first post and replace the $document node with the document above. The document will open in Word and you'll see ....

Where did our tags go? They are there, we just don't see them yet. To see your tags within the document, go to the Developer tab and hit the "Structure" button in the "XML" group. If you don't have the Developer tab available, click the Office Button, click "Word Options", check the box that says "Show Developer tab in the Ribbon", and click "OK". You'll then have the Developer tab available to you. Once you hit the "Structure" button, a pane will open on the right that shows "Elements in the document". Here we see "company" two times, as that's the only element we used for markup. If you click on either of these, the "MarkLogic" it's associated with in the document will be highlighted. Finally, click the checkbox below that says "Show XML tags in the document".

Now we have a way to visualize our custom markup within Word. Our example is quite simple, and we'll only scratch the surface of the functionality available in the XML Structure pane; we just want you to be aware of its existence so you can start investigating for yourself. If you were to nest <w:customXml> tags, you'd see a tree structure in your pane instead of the list. To see what this looks like, go to the bottom of the pane and uncheck "List only child elements of current element". Our company element appears as an element we can apply to our current selection. Now highlight some text in Word, then click company multiple times. Each time you click, you'll see the tags added to the document, and the tree structure in the pane will develop as well. If you'd like to play around with this some more, you can quickly edit the example and make the @w:element attribute equal something different for each instance of MarkLogic in the document.

With <w:customXml> tags in our document.xml, we get use of the pane for adding markup and the visualization of our nodes within Word. But what if we didn't have <w:customXml> tags already within our document? Well, we wouldn't have any tags available for markup. You can mark up content using <w:customXml> from within Word only if your document already has <w:customXml> tags in it, or if you attach a schema. If either of those conditions is met, you'll then be able to highlight text and mark it up by clicking the element you want applied from the XML Structure pane. Let's attach a schema to show what that looks like.

How to attach a schema in Word 2007

If you open a new document from within Word, and then navigate to the XML Structure pane, you'll see a message informing you that in order to insert your own XML elements, you must first attach a schema. So here's a simple schema. Save this as test.xsd somewhere on your desktop. If you have another schema you'd like to use, but want to check it for validity first, the W3C provides a handy validator here.




     <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="foo" xmlns="foo">
       <xs:element name="country" type="Country"/>
       <xs:complexType name="Country">
         <xs:sequence>
           <xs:element name="location" type="xs:string"/>
           <xs:element name="population" type="xs:decimal"/>
         </xs:sequence>
       </xs:complexType>
     </xs:schema>



To add the schema:

  1. Open the document you wish to use the schema with
  2. Click the "Templates and Add-ins" link in the XML Structure pane, or just click the "Schema" button, right next to the "Structure" button in the Developer tab.
  3. Click the "Add Schema" button
  4. Navigate to the schema file we just saved, and select it
  5. Click the "Open" button
  6. A popup (with the URI from our schema grayed out) asks us to enter an alias. Enter anything here; this is just the name you'll refer to the schema by in your schema library. Once you've added the schema to Word, it will be available for you to enable for all documents until you delete it. You can leave the alias blank and it will default to the namespace (foo) when you click Ok.
  7. Click Ok to close the Schema settings
  8. Click Ok to close the "Templates and Add-ins" window.

So now you'll see "country {foo}" in the lower right of your XML Structure pane. Once you select some text, you can apply the label by clicking it in the pane.
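If you're curious where the "country {foo}" label comes from, here is a minimal Python sketch (standard library only) that reads the schema above and prints a label in the same shape. The function name pane_labels is our own invention for illustration. It only mimics the label format; it makes no claim about how Word builds the pane internally.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# The same schema we saved as test.xsd above.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="foo" xmlns="foo">
  <xs:element name="country" type="Country"/>
  <xs:complexType name="Country">
    <xs:sequence>
      <xs:element name="location" type="xs:string"/>
      <xs:element name="population" type="xs:decimal"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""


def pane_labels(xsd_text):
    """Return 'name {namespace}' for each top-level xs:element in the schema."""
    root = ET.fromstring(xsd_text)
    namespace = root.get("targetNamespace", "")
    # findall() with a bare tag only looks at direct children of xs:schema,
    # so the nested location and population elements are not listed.
    return [f"{el.get('name')} {{{namespace}}}" for el in root.findall(XS + "element")]


print(pane_labels(SCHEMA))
```

Only country is declared at the top level of the schema, which lines up with it being the one element offered in the pane for a fresh selection.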

Notes on applying <w:customXml> using Word

Ok, the XML Structure pane is actually pretty limited in its usefulness. It's not really intended for end users, but it can be handy for developers creating <w:customXml> editing solutions. We think it's worth mentioning so you're aware it exists, and now that you do, here are some points to consider.

  1. As you probably noticed, you don't need to highlight text to apply the markup. This may be useful for creating placeholders when developing a templated solution, or could cause issues if a user tends to apply tags accidentally to non-existent content.
  2. The schema validation is not robust.
    • When you see the yellow-slash-symbol next to your element, that means it's being used incorrectly in your document.
    • If you uncheck "List only child elements of current element" and use those elements for markup, you'll see that child elements don't have a namespace applied (namespaces are identified between the braces, i.e. country{foo}). So we could add the same schema above using a different namespace and mark up country{foo} and country{bar} elements, but we wouldn't know which namespace location and population belong to in our document.xml.
  3. When you mark up content from the pane using an element that has a namespace, the XML in the document.xml will look like <w:customXml w:uri="foo" w:element="country"> , where "foo" is the namespace for the schema. But if we manipulate document.xml ourselves, we can apply namespaces to all <w:customXml> elements using the @w:uri attribute. Doing this, all elements in our XML Structure pane will have an associated namespace, but note:
    • Any use of <w:customXml> means we must have an entry in word/settings.xml in our .docx package for <w:attachedSchema>. If we don't have a namespace, then we still set its @w:val attribute equal to "". Otherwise, we have a <w:attachedSchema> namespace entry for each schema we've added. (If you save our example above and then open the .docx package, you'll find <w:attachedSchema w:val="foo"/> in word/settings.xml.)
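To check the <w:attachedSchema> entries without unzipping a package by hand, something like the following Python sketch may help. It uses only the standard library. The function names attached_schemas and demo_package are hypothetical, and demo_package builds an in-memory stand-in containing nothing but word/settings.xml, so it is not a complete .docx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def attached_schemas(docx):
    """Return the w:val of every w:attachedSchema in word/settings.xml.

    docx may be a path to a .docx file or a file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        settings = ET.fromstring(package.read("word/settings.xml"))
    return [el.get(f"{{{W}}}val") for el in settings.iter(f"{{{W}}}attachedSchema")]


def demo_package():
    """Build an in-memory stand-in package that holds only word/settings.xml."""
    settings = f'<w:settings xmlns:w="{W}"><w:attachedSchema w:val="foo"/></w:settings>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/settings.xml", settings)
    buffer.seek(0)
    return buffer


print(attached_schemas(demo_package()))
```

Pointing attached_schemas at a document you saved from Word (pass the file path) should list one namespace per attached schema, assuming the package contains a word/settings.xml part.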

To really get a feel for what XML is being produced by Office 2007 when saving, we suggest marking up a document using different schemas, saving, and then opening the package up for examination. It's great to have some idea of what Word will do, and even better to know what XML is required for any Open XML consumer if we want to use <w:customXml>, but editing XML from within Word is rather tedious, and we have XQuery to do our heavy lifting for us. Let's get on to the fun stuff!

Applying <w:customXml> using XQuery and MarkLogic Server

We'll stick with our original paragraph for this example. The question is how to go from a plain paragraph, to a paragraph that uses <w:customXml> to identify and label words within that paragraph. In our case, anytime we see "MarkLogic", we want to mark it as a company. We can solve this using CQ. First, let's set a variable equal to our test document.




     let $document :=
       <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
         <w:body>
           <w:p>
             <w:r>
               <w:t>MarkLogic is the provider of the industry's leading XML content server.
                    MarkLogic provides solutions for many content processing challenges.
                    Being used to solve a wide range of problems across many industries, MarkLogic
                    Server lets you load, query, manipulate and render content to unlock its full value.
               </w:t>
             </w:r>
           </w:p>
         </w:body>
       </w:document>



The function that will help us most here is cts:highlight().




     cts:highlight(
       $node as node(),
       $query as cts:query,
       $expr as item()*
     )  as  node()



From the API reference:
This function returns a copy of the node, replacing any text matching the query with the specified expression. You can use this function to easily highlight any text found in a query. Unlike fn:replace and other XQuery string functions that match literal text, cts:highlight matches every term that matches the search, including stemmed matches or matches with different capitalization.

So we'll just pass our $document as the 1st parameter, our query in the 2nd parameter, and return our marked up document using the expression passed in the 3rd parameter.




     let $highlighteddoc:=
       cts:highlight($document, cts:word-query("MarkLogic"),
                     <w:customXml w:element="company"><w:r><w:t>{$cts:text}</w:t></w:r></w:customXml>)



We're doing the example in CQ, with a clean paragraph, so we can focus on our <w:customXml> solution. We know from our last post that we may have to merge the runs to be able to search for and mark up the text we're looking for. For this example though, cts:highlight() will find all instances of "MarkLogic" in the text.

Now add "return $highlighteddoc" (without quotes) to the end of your code in CQ and evaluate. Your output will look like the following:




     <w:document>
       <w:body>
         <w:p>
           <w:r>
             <w:t>
               <w:customXml w:element="company">
                 <w:r>
                   <w:t>MarkLogic</w:t>
                 </w:r>
               </w:customXml>
               is the provider of the industry's leading XML content server.
               <w:customXml w:element="company">
                 <w:r>
                   <w:t>MarkLogic</w:t>
                 </w:r>
               </w:customXml>
               provides solutions for many content processing challenges.
               Being used to solve a wide range of problems across many industries,
               <w:customXml w:element="company">
                 <w:r>
                   <w:t>MarkLogic</w:t>
                 </w:r>
               </w:customXml>
               Server lets you load, query, manipulate and render content to unlock its full value.
             </w:t>
           </w:r>
         </w:p>
       </w:body>
     </w:document>



This has done part of the job for us, but we need to do more. Since we can't mark up text inside <w:t> directly (the markup has to wrap whole runs), we have to transform a single run ( <w:r> ) into potentially multiple runs, with custom node ( <w:customXml> ) siblings that have their own runs as children for the text they mark up. This may sound complex, but it's not really. If you look at the output, the first thing you notice is that the text that isn't a child of <w:customXml> is that element's sibling. If we wrap those strings in <w:r> tags, we'll have runs as siblings to <w:customXml>. Then we just need the XML we've created to no longer be a child of the original <w:t> node (and its enclosing <w:r>), and we'll be in business.

This sounds like we may need to iterate over the entire document and do some transformation. So this is a good place for a typeswitch. For more on typeswitch, please see the Developer's Guide. So let's dispatch our document, and anytime we come across a run, we'll just map our <w:customXml>. Place the "let" following the let for $highlighteddoc. Place functions above your FLWOR in CQ, and remember to add the namespace on the first line.




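     declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

     (: Send every child node of $x back through dispatch() :)
     define function passthru($x as node()*) as node()*
     {
       for $i in $x/node() return dispatch($i)
     }

     (: Runs are rewritten by map(); other elements are rebuilt with their attributes :)
     define function dispatch($x as node()*) as node()*
     {
       typeswitch ($x)
        case element(w:r) return map($x/w:t/node())
        (: copy stray text (e.g. whitespace between elements) instead of failing on an empty name :)
        case text() return $x
        default return  element{fn:name($x)} {$x/@*,passthru($x)}
     }

     let $newdocument := dispatch($highlighteddoc)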



So far, this looks like pretty familiar stuff. Now we just need our map() function. For that we'll use another typeswitch. We want to iterate over the run, and for any text() nodes (NOT <w:t> nodes, but just nodes of type text), create our run. If the node is <w:customXml>, we'll return it as is, and if it's anything else, we'll just return the empty sequence.




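     (: Turn the mixed content of a highlighted w:t into sibling runs and w:customXml nodes :)
     define function map($x as node()*) as node()*
     {
       for $child in $x return
         typeswitch ($child)
          case text() return makerun($child)
          case element(w:customXml) return $child
          default return ()
     }

     (: Wrap a text node in its own run, preserving leading and trailing spaces :)
     define function makerun($x as text()) as element(w:r)
     {
       <w:r><w:t  xml:space="preserve">{$x}</w:t></w:r>
     }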



Putting it all together, place the following in CQ and evaluate.




     declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

     (: Send every child node of $x back through dispatch() :)
     define function passthru($x as node()*) as node()*
     {
       for $i in $x/node() return dispatch($i)
     }

     (: Turn the mixed content of a highlighted w:t into sibling runs and w:customXml nodes :)
     define function map($x as node()*) as node()*
     {
       for $child in $x return
        typeswitch ($child)
         case text() return makerun($child)
         case element(w:customXml) return $child
         default return ()
     }

     (: Runs are rewritten by map(); other elements are rebuilt with their attributes :)
     define function dispatch($x as node()*) as node()*
     {
       typeswitch ($x)
       case element(w:r) return map($x/w:t/node())
       (: copy stray text (e.g. whitespace between elements) instead of failing on an empty name :)
       case text() return $x
       default return  element{fn:name($x)} {$x/@*,passthru($x)}
     }

     (: Wrap a text node in its own run, preserving leading and trailing spaces :)
     define function makerun($x as text()) as element(w:r)
     {
       <w:r><w:t  xml:space="preserve">{$x}</w:t></w:r>
     }

     let $document :=
       <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
         <w:body>
           <w:p>
             <w:r>
               <w:t>MarkLogic is the provider of the industry's leading XML content server.
                    MarkLogic provides solutions for many content processing challenges.
                    Being used to solve a wide range of problems across many industries, MarkLogic
                    Server lets you load, query, manipulate and render content to unlock its full value.
               </w:t>
             </w:r>
           </w:p>
         </w:body>
       </w:document>

     let $highlighteddoc:=
       cts:highlight($document, cts:word-query("MarkLogic"),
                     <w:customXml w:element="company"><w:r><w:t>{$cts:text}</w:t></w:r></w:customXml>)
     let $newdocument := dispatch($highlighteddoc)
     return $newdocument





This is great! With a few lines of XQuery, we were able to markup our document with meaningful metadata and the document is still consumable by Office 2007. If you've read the previous posts, you can just replace $document with the output from CQ and open directly into Word to see what we've accomplished. Otherwise, just cut and paste the output from CQ into document.xml within a simple Word package, zip it up as .docx, and open.
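If you'd rather script the "zip it up as .docx" step, here is a rough Python sketch using only the standard library. The function name build_docx and the sample DOCUMENT_XML are ours, for illustration only. The package contains just the content-types part, the root relationships part, and word/document.xml. It does not add the word/settings.xml entry discussed earlier, and we haven't verified here how any particular version of Word treats such a bare-bones package, so take it as a starting point.

```python
import io
import zipfile

CONTENT_TYPES = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""

# Stand-in for the output you would paste from CQ.
DOCUMENT_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p><w:customXml w:element="company"><w:r><w:t>MarkLogic</w:t></w:r>'
    "</w:customXml></w:p></w:body></w:document>"
)


def build_docx(document_xml):
    """Return the bytes of a minimal .docx package wrapping document_xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", RELS)
        package.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


docx_bytes = build_docx(DOCUMENT_XML)
print(len(docx_bytes), "bytes")
```

To get a file you can double-click, write the bytes out, for example with open("enriched.docx", "wb").write(docx_bytes).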

This is just one way of solving this problem; you may have others. We urge you to experiment and have some fun with XQuery, MarkLogic Server, and the Open XML formats. That's really the best way to learn more about all three. For example, you may want to experiment with running our solution against multiple paragraphs, or adding the @w:uri attribute to the <w:customXml> node, or adding the functions and code to one of your previous .xqy files, etc.
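As one such alternative, here is a minimal Python sketch of the same run-splitting idea, for experimenting when you don't have MarkLogic Server handy. It is not a substitute for cts:highlight(): it matches the literal, case-sensitive string only (no stemming, no word tokenization), and like our XQuery it keeps only the text of each run, discarding any run properties. The names tag_runs, make_run and w are our own.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W)  # serialize with the usual w: prefix


def w(tag):
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


def make_run(text):
    """Wrap text in <w:r><w:t xml:space="preserve">...</w:t></w:r>."""
    run = ET.Element(w("r"))
    t = ET.SubElement(run, w("t"))
    t.set(XML_SPACE, "preserve")
    t.text = text
    return run


def tag_runs(paragraph, term, element_name):
    """Rebuild a w:p so each literal occurrence of term sits in its own w:customXml."""
    new_children = []
    for child in list(paragraph):
        if child.tag != w("r"):
            new_children.append(child)  # leave non-run children untouched
            continue
        text = "".join(t.text or "" for t in child.findall(w("t")))
        # A capturing group makes re.split keep the matched term in the result.
        for piece in re.split(f"({re.escape(term)})", text):
            if not piece:
                continue
            if piece == term:
                custom = ET.Element(w("customXml"), {w("element"): element_name})
                custom.append(make_run(piece))
                new_children.append(custom)
            else:
                new_children.append(make_run(piece))
    for child in list(paragraph):
        paragraph.remove(child)
    paragraph.extend(new_children)
    return paragraph


paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W}"><w:r><w:t>MarkLogic is the provider. '
    "MarkLogic Server helps.</w:t></w:r></w:p>"
)
tag_runs(paragraph, "MarkLogic", "company")
print(ET.tostring(paragraph, encoding="unicode"))
```

The paragraph ends up with <w:customXml> and <w:r> as alternating siblings, which is the same shape our dispatch() and map() functions produce.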

Conclusion

The <w:customXml> element provides us a way of marking up our Word documents right within the document.xml part of our .docx package. This can be useful for adding metadata and business semantics, as well as for search and analytics. Using XQuery and MarkLogic built-ins, we have a simple, fast, and effective way of adding value to our existing Word documents, and they'll still be available to users in Word. In fact, users can remain blissfully unaware of the markup as they make their edits, or we can let them know how to visualize the markup within Word using the utilities available in the Developer tab. For our example, we focused on a single document, but when one thinks of enriching multiple documents, and the multiple queries available to us in MarkLogic Server, the possibilities suddenly get very exciting.

If you've come this far with us, Thanks!!! That's Excellent!!! Your crash course in XQuery, MarkLogic Server, and the Open XML formats is almost complete. Next week we wrap up our series with a look at Content Controls and <w:sdt>. Until then, cheers!

Comments

  • If you have an element like title that is styled in Word.. Can you also style that same element with XML style? .. Or do you have to style / format a given piece of text with one or the other schema ? Thanks