Running (a.k.a. <w:r>-ing) with Word

by Pete Aven

Categories: Office 2007
Altitude: 1,000 feet

Today we take a closer look at Office 2007 Word. More specifically, we'll be looking at the XML that makes up a Word 2007 document. Word uses WordprocessingML, which is one of the OOXML formats. To review, we introduced Word in part 1, Excel in part 2, and looked at ways to save our documents to MarkLogic Server using WebDAV and the Content Processing Framework in part 3. Now that you've created and saved some Office 2007 documents, let's crack open the document.xml for a standard .docx file and take a look at what's going on in there.

Note: We suggest reading part 1 before continuing. We'll be working with the assumption that you know the building blocks of a simple Word document. To test the examples, it will be useful to be able to create your own document using XQuery and MarkLogic Server. Copy the XQuery from part 1 and save it as part3.xqy under the /Docs directory of your MarkLogic Server installation. You can then evaluate it by opening your favorite browser and navigating to http://localhost:8000/part3.xqy. Make edits to $document based on our discussion below and re-evaluate the .xqy to test for yourself. Using this approach you'll be able to see the results of your work in Word 2007. If you're just interested in the XML, you can evaluate all of today's examples in CQ.

This is in no way a comprehensive review of WordprocessingML. Our aim is to give you enough understanding to be able to rip open a .docx and have some idea of what you're looking at. With that understanding you'll know how to create and evaluate your queries, as well as how to manipulate the XML to some degree in order to create the document that you really want. We're also going to point out a particular challenge with WordprocessingML with regard to full-text search and re-use: first we'll expose the issue, then we'll provide a solution to it. For a great introduction to WordprocessingML and the other OOXML formats, we suggest checking out the e-book, OpenXML Explained. For the final word on WordprocessingML, you can find the ECMA specifications here.

Intro to WordprocessingML

We know that for a Word document, the document.xml is the start-part and it contains the main text and body of the document. Previously, we created our start-part with some sample text as follows.

Note: Line breaks are added between some opening and closing tags for readability (see <w:t> below). Remember to place on a single line when actually testing out for yourself otherwise you may find the formatting in Word is not what you expect or the example will not work for you.




     let $document :=
       <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
         <w:body>
           <w:p>
             <w:r>
               <w:t>Doctor Paul Proteus, the man with the highest income in Ilium,
                    drove his cheap and old Plymouth across the bridge to Homestead. </w:t>
             </w:r>
           </w:p>
         </w:body>
       </w:document>



Let's take a closer look at each of the elements.

  • <w:document> is the root element and is required to start defining the document.
  • <w:body> is the child element of <w:document>. In the <w:body> is where we'll store the text that makes up our document.
  • <w:p> signifies a paragraph within the <w:body>.
  • <w:r> represents a run (of text). A paragraph can be split into multiple runs.
  • <w:t> is the text element. There can be multiple <w:t> elements within a <w:r>.

There are 2 main groups for content within the <w:body>: block-level and inline. Block-level content provides the main structure of the document and contains inline content. Examples of block-level content are <w:p> (paragraphs) and <w:tbl> (tables). An example of inline content is <w:r> (a run).
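To make the block-level/inline distinction concrete, here is a minimal sketch you can paste into CQ. The sample document is made up for illustration, and its table is deliberately stripped down (a table saved by Word carries additional property and grid elements). The query lists the block-level children of <w:body>, then counts the runs found anywhere below them.

```xquery
declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

(: Hypothetical sample: one paragraph and one single-cell table :)
let $document :=
  <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
      <w:p><w:r><w:t>A paragraph.</w:t></w:r></w:p>
      <w:tbl>
        <w:tr><w:tc><w:p><w:r><w:t>A cell.</w:t></w:r></w:p></w:tc></w:tr>
      </w:tbl>
    </w:body>
  </w:document>
return (
  (: block-level content: the element children of w:body :)
  for $block in $document/w:body/*
  return fn:name($block),
  (: inline content: runs anywhere below the body :)
  fn:count($document/w:body//w:r)
)
```

For this sample we'd expect the names w:p and w:tbl followed by a run count of 2, since the table cell holds a paragraph with a run of its own.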

OK, here's where the fun starts! Now that we know what a basic paragraph looks like and some of the rules for using the elements, we see we can create the same paragraph above as:




     let $document :=
       <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
         <w:body>
           <w:p>
             <w:r>
               <w:t xml:space="preserve">Doctor Paul Proteus, the man with the highest </w:t>
               <w:t>income in Ilium, drove his cheap and old Plymouth across the bridge to Homestead. </w:t>
             </w:r>
           </w:p>
         </w:body>
       </w:document>



It will look exactly the same when we open it in Word. Word is now a consumer for OOXML and treats the above as equivalent to our first example. Just remember to add the attribute xml:space="preserve" to your text element so the trailing space won't be trimmed by the consumer.
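If you'd like to convince yourself of the equivalence without opening Word, a small query along these lines shows that the text of a run is simply its <w:t> elements read in order. This is only a sketch; the run below is a shortened, made-up version of our sample sentence.

```xquery
declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

(: Hypothetical run with its text split across two w:t elements :)
let $run :=
  <w:r xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:t xml:space="preserve">Doctor Paul Proteus, the man with the highest </w:t>
    <w:t>income in Ilium.</w:t>
  </w:r>
(: join the text elements in document order, with no separator :)
return fn:string-join($run/w:t, "")
```

The joined string reads as one sentence, with the space after "highest" coming from the first text element.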

Now, I think it's good to distinguish between the XML you need to create a Word document, and the XML that will be produced when you save your Word document. (You may want to re-read that sentence, as it will come up again later and it's a good thing to remember.) Though we can split our text elements within a run, I haven't seen it happen too often when I've saved anything in Word, then opened up the package and inspected the document.xml. But it may be helpful to know we can do it, or to understand what's going on when we see it within our <w:document>. More importantly, we can also create the same paragraph by splitting the runs.




     let $document :=
       <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
         <w:body>
           <w:p>
             <w:r>
               <w:t xml:space="preserve">Doctor Paul Proteus, the man with the highest income in Ilium, drove </w:t>
             </w:r>
             <w:r>
               <w:t>his cheap and old Plymouth across the bridge to Homestead. </w:t>
             </w:r>
           </w:p>
         </w:body>
       </w:document>



You're likely to see your text split into multiple runs (when you unzip the .docx package for a file you've saved in Word and inspect the document.xml). There are a couple of reasons for this: 1) a run is the finest level of granularity at which formatting may be applied, and 2) a run is the finest level of granularity at which you'll see the <w:customXml> element applied. (We'll be demonstrating how to use XQuery to enrich your Word document using <w:customXml> in our next blog post.)
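A quick way to see how heavily a document has been split is to count the runs in each paragraph. The sketch below uses a made-up two-paragraph document; against your own content you would bind $document to the document.xml you extracted instead.

```xquery
declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

(: Hypothetical sample: the first paragraph has one run, the second has three :)
let $document :=
  <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
      <w:p><w:r><w:t>One run only.</w:t></w:r></w:p>
      <w:p>
        <w:r><w:rPr><w:i/></w:rPr><w:t xml:space="preserve">Italic, </w:t></w:r>
        <w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">bold, </w:t></w:r>
        <w:r><w:t>and plain.</w:t></w:r>
      </w:p>
    </w:body>
  </w:document>
(: one count per paragraph, in document order :)
for $p in $document/w:body/w:p
return fn:count($p/w:r)
```

For this sample the counts should come back as 1 and 3.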

There are 2 more elements you'll see often within a paragraph: <w:pPr> (paragraph properties) and <w:rPr> (run properties). We're not going to dig too deep into styling our Word document; just know you can set the formatting directly in document.xml using the property elements, and we also have the option to reference a separate styles.xml file from within our property elements. (Remember: another file in our .docx that relates to the start-part means we add a reference in .rels and an entry in our [Content_Types].xml.) There's also the concept of a style hierarchy, so the properties elements can work together and/or override each other depending on the parent-child relationship of the styling elements. Don't worry about styling right now. What's important is being able to identify the property elements when we see them in our document.xml. You'll see <w:pPr> and <w:rPr>, and there'll be other child elements within them, with or without attributes and/or references to styles.xml; both nodes pertain to formatting. In particular, take note of the use of <w:rPr>. We now have a solid reason for seeing split runs in a paragraph <w:p>.




     let $document :=
       <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
         <w:body>
           <w:p>
             <w:pPr>
               <w:pBdr>
                 <w:bottom w:val="single" w:sz="4" w:color="auto" />
               </w:pBdr>
             </w:pPr>
             <w:r>
               <w:rPr>
                 <w:i />
               </w:rPr>
               <w:t xml:space="preserve">Doctor Paul Proteus, the man with the </w:t>
             </w:r>
             <w:r>
               <w:rPr>
                 <w:b />
                 <w:sz w:val="52" />
                 <w:rFonts w:ascii="Cambria" />
               </w:rPr>
               <w:t xml:space="preserve">highest </w:t>
             </w:r>
             <w:r>
               <w:t>income in Ilium, drove his cheap and old Plymouth across the bridge to Homestead. </w:t>
             </w:r>
           </w:p>
         </w:body>
       </w:document>



Above we define a <w:pPr>, which just defines a border for the bottom of our paragraph. This is just to show you how the <w:pPr> element is used, as a child of <w:p> and a sibling of <w:r>. What's more interesting is we have 3 runs, each with different formatting. We emphasize the word "highest", and this is how you may see runs used, where maybe one word or phrase within a paragraph is bold or italicized, etc. Understanding the use of <w:r> and <w:rPr> helps us to see why runs are split multiple times within a paragraph. Each time the formatting of the text changes, you're going to see a different run in the XML. But that's not the whole story. What about this?




     let $document :=
       <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
         <w:body>
           <w:p>
             <w:pPr>
             </w:pPr>
             <w:r>
               <w:rPr>
                 <w:i />
               </w:rPr>
               <w:t>Doctor Paul Pr</w:t>
             </w:r>
             <w:r>
               <w:rPr>
                 <w:i />
               </w:rPr>
               <w:t>oteus, the man with the highe</w:t>
             </w:r>
             <w:r>
               <w:rPr>
                 <w:i />
               </w:rPr>
               <w:t>st income in Ilium, drove his cheap and old Plymouth across the bridge to Homestead. </w:t>
             </w:r>
           </w:p>
         </w:body>
       </w:document>



This is completely valid WordprocessingML that will display properly as a single, italicized sentence in Word. You're likely to see something similar whenever you save a Word 2007 document and then open up the package's document.xml for examination. You'll see runs split, even though they all have the same style, and you'll see the style elements for each run repeated, instead of all text in a single run with a single run properties element. Also notice how we see splits occur in the middle of words. Text is split across elements. Now, granted, the above example is extreme, as we're trying to make a point. Most likely, you'll open the document.xml and similarly styled text will be grouped together in one <w:t> element contained in a single run <w:r>, for the first few paragraphs at least. As you really dig into the XML, and especially with a larger document with all sorts of formatting, you're going to find cases like this regularly. That's fine for Word: it can save the OOXML as it sees fit and will be able to consume the OOXML the next time the file is opened. But we'd like to clean this up a little.

This output presents us with a couple of opportunities for improvement:

1) We'd like to keep any text with the same formatting in a single run element; this way we work our way towards a re-usable paragraph available in a single node. One of the main reasons for storing our content as XML in an XML Server is so we can take advantage of re-use. And sure, we could work some magic with our queries to get users to their re-usable content, but why transform the XML every time we query or go to re-use it? Why not just save the content in a preferable format to begin with? When a user searches for content, we can just return the nodes that contain their search terms and make them available for re-use.

2) Another very important reason for storing our content in an XML Server, especially MarkLogic Server, is search: so we can find what we're looking for, or maybe even what we didn't know we were looking for. What happens when we want to search on "Proteus" for the above example? Any full-text search engine is going to have difficulties with tokens (strings) split across elements. And even if you're using an XML Server that doesn't have full-text capabilities, I think it's a safe bet that you'll be expecting to find single words tokenized within elements, and not split across them.
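The search problem is easy to reproduce in CQ. The sketch below uses plain fn:contains() as a stand-in for a real full-text query (that substitution is ours, purely to keep the example small) and a made-up paragraph with "Proteus" split across two runs.

```xquery
declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

(: Hypothetical paragraph with a word split across two runs :)
let $p :=
  <w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:r><w:t>Doctor Paul Pr</w:t></w:r>
    <w:r><w:t>oteus, the man with the highe</w:t></w:r>
  </w:p>
return (
  (: is there any single text element that contains the whole word? :)
  fn:exists($p/w:r/w:t[fn:contains(., "Proteus")]),
  (: is the word there once the text elements are joined? :)
  fn:contains(fn:string-join($p/w:r/w:t, ""), "Proteus")
)
```

The first test comes back false and the second true: the word is in the paragraph, but no individual text element holds it.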

So if we want to search on "Pr oteus", and we don't mind doing a lot of clean-up work to provide re-use, then this XML is great. Otherwise, it really makes sense to merge similarly formatted runs. Using XQuery and MarkLogic Server, we have a simple way to clean this up. By merging the runs, we'll be able to take advantage of search and re-use within the Server. In the Server, the components of our saved .docx file become other resources for analytics and/or content creation. We can even regenerate the .docx using MarkLogic zip built-ins so the document.xml we're going to fix is in sync with the package stored in the Server, or we can just open the package by zipping up its required parts when needed. Because of the rules for how the elements are used, if we write our XQuery correctly, we can be assured that merging split runs will have no effect on how the content will be displayed the next time the OOXML is consumed by Word.

Merging the Runs

You can take the last document example of XQuery above and cut and paste it into CQ. Just add "return $document" at the bottom, and then you can evaluate. We'll do the example in CQ so we can test as we go. If you don't have CQ installed yet, you can refer to the note at the beginning of post 2. It's very simple to install. Once installed, you just add your XQuery to the main window and click 'XML', 'Text', or 'HTML' to evaluate your XQuery; the results will display in the lower pane. When we have a solution we're confident with, we can save it as an .xqy file or module available for regular re-use.

Let's start by defining a function ml-update-document-xml. We'll pass this function our original <w:document>, and it will return our new and improved <w:document>, with similar runs (<w:r>) merged for each <w:p>. All this function will do is dispatch our document node for processing. We could call dispatch directly, but if you intend to use the XQuery in a module, or to do additional processing, it will be useful to have this separate function with a meaningful name.




     define function ml-update-document-xml($doc as element(w:document)) as element(w:document)
     {
        dispatch($doc)
     }







In dispatch we'll use a typeswitch expression to help us traverse the document. In the case of a paragraph, <w:p>, we'll do some additional processing by calling a function mergeruns(), which will accept the existing paragraph as input and will return the paragraph with the runs merged. If the node is of any other type (not a paragraph), we'll just return the existing element with its existing attributes and passthru() to the next node().




     define function dispatch ($x as node()) as node()
     {
         typeswitch ($x)
          case element(w:p) return mergeruns($x)
          default return (
           element{fn:name($x)} {$x/@*,passthru($x)}
          )
     }
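One thing to keep in mind about dispatch(): its default branch builds an element from fn:name($x), and fn:name() returns an empty string for text and comment nodes. That's fine for our sample, where all text sits inside paragraphs, but it should fail if a text or comment node ever reaches the default branch. Here is a minimal sketch of a more defensive traversal, written in the same dialect as the rest of this post. The name copy-node is hypothetical and not part of the series' code, and the sketch only copies the tree; it does no merging.

```xquery
declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

(: copy-node is a hypothetical name, not a function from this series :)
define function copy-node($x as node()) as node()
{
  typeswitch ($x)
    (: nodes without an element name are returned as they are :)
    case text() return $x
    case comment() return $x
    case processing-instruction() return $x
    (: everything else is treated as an element, as in dispatch() :)
    default return
      element { fn:name($x) } { $x/@*, for $c in $x/node() return copy-node($c) }
}

(: Hypothetical input containing a comment next to a paragraph :)
let $body :=
  <w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <!-- a comment that dispatch() would not expect -->
    <w:p><w:r><w:t>Some text.</w:t></w:r></w:p>
  </w:body>
return copy-node($body)
```

To use the same idea in our solution, you could add the three extra case clauses to dispatch() ahead of its default branch.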



The passthru() function is simple. We just loop through the nodes() and call dispatch() to process the next element in our document.




     (: returns one result per child node, so the return type is a sequence :)
     define function passthru($x as node()) as node()*
     {
       for $i in $x/node() return dispatch($i)
     }



For our mergeruns function, we'll first check for the existence of the paragraph properties node. If it exists, we'll assign its value to a variable. Next we return our new paragraph, with the existing properties node as well as the results of our merged runs, which we'll map() by passing this new function the first run in the paragraph we are transforming.




     define function mergeruns($p as element(w:p)) as element(w:p)
     {
       let $pPrvals := if(fn:exists($p/w:pPr)) then $p/w:pPr else ()
       return element w:p{ $pPrvals, map($p/w:r[1]) }
     }
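Note that, as written, mergeruns() rebuilds the paragraph from its <w:pPr> and its direct <w:r> children only, so any other child of <w:p> is not carried over. Before running the merge on real documents, a check along these lines can tell you whether a paragraph has such children. This is a sketch with a made-up paragraph; <w:hyperlink> is just one example of an element that can wrap runs inside a paragraph.

```xquery
declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

(: Hypothetical paragraph with a hyperlink wrapped around a run :)
let $p :=
  <w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:pPr/>
    <w:r><w:t xml:space="preserve">See </w:t></w:r>
    <w:hyperlink><w:r><w:t>this link</w:t></w:r></w:hyperlink>
  </w:p>
(: children that are neither paragraph properties nor runs :)
return $p/*[fn:not(self::w:pPr) and fn:not(self::w:r)]
```

If the query returns anything (here, the <w:hyperlink> element), the paragraph needs more care than our simple solution gives it.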





We see the brunt of the work is done by our map() function. map() takes the current run as a parameter and returns the merged run, followed by the results for the rest of the paragraph. We first check that the current run, <w:r>, is not empty. If it isn't, we assign the run's properties to a variable. Next, we call a descend() function, passing it the following sibling of the current run as well as the properties node for the current run. The descend() function just checks the current run's properties against those of the following runs. As long as the properties for each sibling run are equal, descend() recursively calls itself, working along the paragraph until it hits a run whose properties aren't equal to the properties of the runs it's been descending. descend() returns the sequence of matching runs, and we take its count so we know how many runs to leap ahead in the paragraph before we start trying to merge the next set of runs. With the $count, we can then return the merged run for the runs that have similar properties. We recursively call map(), passing the following sibling that picks up after the count for those runs we've already merged. For both recursive functions, running out of runs (an empty $r) is the stopping condition.




     define function map($r as element(w:r)?) as element(w:r)*
     {
      (: stopping condition: no run left to process :)
      if (fn:empty ($r)) then ()
      else
       let $rToCheck := $r/w:rPr
       (: following runs whose properties match the current run's :)
       let $matches := descend($r/following-sibling::w:r[1], $rToCheck)
       let $count := fn:count($matches)
       (: merge the text of the current run and its matches into one run :)
       let $this := if ($count) then
                     (element w:r{ $rToCheck,
                          element w:t { fn:string-join(($r/w:t, $matches/w:t),"") } })
                    else $r
       (: continue with the first run after the ones just merged :)
       return  ($this, map( if($count) then ($r/following-sibling::w:r[1 + $count])  else $r/following-sibling::w:r[1]))
     }

     define function descend($r as element(w:r)?, $rToCheck as element(w:rPr)?) as element(w:r)*
     {
      (: stop at the end of the paragraph or at the first run that differs :)
      if(fn:empty($r)) then ()
      else if(fn:deep-equal($r/w:rPr,$rToCheck)) then
        ($r, descend($r/following-sibling::w:r[1], $rToCheck))
      else ()
     }
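One detail worth knowing: the merged <w:t> built in map() is a brand new element, so it doesn't carry the xml:space="preserve" attribute we discussed earlier, and leading or trailing spaces in the merged text may be trimmed by the consumer. Below is a minimal sketch of one way to handle that, in the same dialect as the functions above. The helper name merged-text is hypothetical, and always adding the attribute is our assumption about what you'd want, not something Word requires.

```xquery
declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

(: merged-text is a hypothetical helper, not part of the solution above :)
define function merged-text($runs as element(w:r)*) as element(w:t)
{
  element w:t {
    (: keep leading and trailing spaces in the merged text :)
    attribute xml:space { "preserve" },
    fn:string-join($runs/w:t, "")
  }
}

(: Hypothetical runs to merge; the second one ends with a space :)
let $runs := (
  <w:r><w:t>Doctor Paul Pr</w:t></w:r>,
  <w:r><w:t xml:space="preserve">oteus, the man </w:t></w:r>
)
return merged-text($runs)
```

In map(), the same computed attribute could be added as the first item inside the element w:t { ... } constructor.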







Putting it all Together

Everything we've discussed so far, we can test in CQ. Place the following in CQ and evaluate.




     declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

     (: entry point: returns the document with similar runs merged :)
     define function ml-update-document-xml($doc as element(w:document)) as element(w:document)
     {
       dispatch($doc)
     }

     (: processes each child node of $x :)
     define function passthru($x as node()) as node()*
     {
       for $i in $x/node() return dispatch($i)
     }

     (: merges runs in paragraphs; copies every other element and recurses :)
     define function dispatch ($x as node()) as node()
     {
      typeswitch ($x)
       case element(w:p) return mergeruns($x)
       default return (
         element{fn:name($x)} {$x/@*,passthru($x)}
       )
     }

     (: rebuilds a paragraph from its properties and its merged runs :)
     define function mergeruns($p as element(w:p)) as element(w:p)
     {
       let $pPrvals := if(fn:exists($p/w:pPr)) then $p/w:pPr else ()
       return element w:p{ $pPrvals, map($p/w:r[1]) }
     }

     (: collects the following runs whose properties equal $rToCheck :)
     define function descend($r as element(w:r)?, $rToCheck as element(w:rPr)?) as element(w:r)*
     {
       if(fn:empty($r)) then ()
       else if(fn:deep-equal($r/w:rPr,$rToCheck)) then
        ($r, descend($r/following-sibling::w:r[1], $rToCheck))
       else ()
     }

     (: merges $r with its matching neighbors, then moves on to the next run :)
     define function map($r as element(w:r)?) as element(w:r)*
     {
       if (fn:empty ($r)) then ()
       else
        let $rToCheck := $r/w:rPr
        let $matches := descend($r/following-sibling::w:r[1], $rToCheck)
        let $count := fn:count($matches)
        let $this := if ($count) then
                    (element w:r{ $rToCheck,
                          element w:t { fn:string-join(($r/w:t, $matches/w:t),"") } })
                  else $r
        return  ($this, map( if($count) then ($r/following-sibling::w:r[1 + $count])  else $r/following-sibling::w:r[1]))
     }

     let $document :=
       <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
         <w:body>
           <w:p>
             <w:pPr>
             </w:pPr>
             <w:r>
               <w:rPr>
                 <w:i />
               </w:rPr>
               <w:t>Doctor Paul Pr</w:t>
             </w:r>
             <w:r>
               <w:rPr>
                 <w:i />
               </w:rPr>
               <w:t>oteus, the man with the highe</w:t>
             </w:r>
             <w:r>
               <w:rPr>
                 <w:i />
               </w:rPr>
               <w:t>st income in Ilium, drove his cheap and old Plymouth across the bridge to Homestead. </w:t>
             </w:r>
           </w:p>
         </w:body>
       </w:document>

     return ml-update-document-xml($document)







After evaluating the above, you'll see the runs merged into a single run with a single properties node, and a single text element within the paragraph. If you open in Word, you'll see it looks exactly the same as it would if Word had consumed the XML we'd started with.




     <w:document>
       <w:body>
         <w:p>
          <w:pPr/>
          <w:r>
            <w:rPr>
              <w:i/>
            </w:rPr>
            <w:t>
              Doctor Paul Proteus, the man with the highest income in Ilium,
              drove his cheap and old Plymouth across the bridge to Homestead.
            </w:t>
          </w:r>
         </w:p>
       </w:body>
     </w:document>



To test our solution more, you can try adding more paragraphs and runs, or just change the properties of the runs we've already defined and re-evaluate. For definitions and descriptions of the functions used, please see the API reference. Also, a great explanation on transforming XML structures with the use of a typeswitch expression can be found in the Developer's Guide.
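When you test with your own documents, a simple sanity check is to confirm that merging changed the number of runs but not the text. The sketch below compares two literal paragraphs so it can run on its own; the variable names and the shortened sample text are ours. In practice $after would be the paragraph returned by the solution above rather than a hand-written one.

```xquery
declare namespace w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

(: Hypothetical paragraph before merging: three italic runs :)
let $before :=
  <w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:r><w:rPr><w:i/></w:rPr><w:t>Doctor Paul Pr</w:t></w:r>
    <w:r><w:rPr><w:i/></w:rPr><w:t>oteus, the man with the highe</w:t></w:r>
    <w:r><w:rPr><w:i/></w:rPr><w:t>st income in Ilium.</w:t></w:r>
  </w:p>
(: The same paragraph as we would expect it after merging: one italic run :)
let $after :=
  <w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:r><w:rPr><w:i/></w:rPr><w:t>Doctor Paul Proteus, the man with the highest income in Ilium.</w:t></w:r>
  </w:p>
return (
  (: the joined text should be identical before and after :)
  fn:string-join($before//w:t, "") = fn:string-join($after//w:t, ""),
  (: the number of runs should go down :)
  fn:count($before/w:r),
  fn:count($after/w:r)
)
```

For these two paragraphs the check returns true, followed by the run counts 3 and 1.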

Conclusion

We've covered a lot of ground in a short time. We dug deeper into WordprocessingML, saw an issue, and we were able to repair it swiftly and with a minimal amount of XQuery. Now, there's a lot more to WordprocessingML and the types of tags you'll find within your document.xml. The solution we've discussed is a great start, but will also need to be expanded upon to cover other issues not discussed here. For example, you'll also find runs (<w:r>) embedded in tables. Runs with text can be used for cell values within a table, and you'll find runs split there as well. We kept our examples simple so we could convey basic information and solutions in digestible portions. I really believe in "learning by doing", so I'd suggest creating some test documents, or grabbing some off of the web, and passing those to the XQuery we've created. That will help you identify where modifications to our solution are required.

As we identify issues with OOXML that aren't optimal for how we want to use it, we know we can easily transform it and still have it be valid OOXML. We created our solution in CQ for rapid development and testing, but we know from part 3 we could add this to our pipeline and have the document.xml updated automatically upon saving. You may want to try that too. You just need to add another state for your documents, so they don't advance right to "final" after extracting from the .docx. Create a second processing state, add a condition that checks for the existence of the document.xml and an action that processes document.xml before setting the final state for all the documents within the package to final.

XQuery and MarkLogic Server provide us with a lot of power and flexibility for managing our XML content. We demonstrate the OOXML formats in our series as we're excited about the opportunities those new documents bring to our customers, but all our examples can just as easily be done with other document formats. At the end of the day, it's just XML to us, and with XQuery and MarkLogic Server, we can do anything we want with it. OOXML presents a very interesting and exciting case, as a Google search will show you there are over 3 million Word documents available on the web right now. Imagine if they were all in XML. Just think of the types of queries we could execute! Or the re-use that we'd have available!

Please note, we're going on hiatus for the holidays. We hope you've found the series informative and helpful up to this point. We'll return in the New Year with posts on ways to enrich your Word documents. There are 2 different approaches you can take, and each has its pros and cons. We'll have a post on each method so you can decide which works best for you. Until then, Happy Holidays!
