Document Assembly with MarkLogic Server and Microsoft Word

by Pete Aven

Creating new documents from existing sources using XQuery, MarkLogic Server, and the altChunk element in Word

Categories: Office 2007
Altitude: 1,000 feet

We often create new documents from parts of other documents we've previously authored, repurposing content that we've managed to store in many different places. We may want to reuse a paragraph from one of our blog posts, a text document stored on our filesystem, and of course parts of documents we've saved in MarkLogic Server. Today we'll demonstrate how to use some of the tools MarkLogic Server provides to grab the pieces of content we want to repurpose and deliver the results as a Microsoft Word document.

Word documents are now composed of WordprocessingML. One of the more interesting elements this provides us is <w:altChunk>. We'll explore this element as it provides Word's mechanism for creating a document from sources that include format types that aren't WordprocessingML. While we can use XQuery to transform our content into WordprocessingML, we'll find out how <w:altChunk> enables Word to do a lot of the transformation work for us.

Quick Word 2007 Review

Pop quiz, hotshots!
What are the 3 files required for a basic Word 2007 document? Do you know how to use [Content_Types].xml? How about .rels? Not sure? That's ok, just check our previous introduction to Word 2007.

How about paragraph elements? Do you know what runs are? If not, you may want to review our introduction to WordprocessingML too.

Note: Building a Word doc from scratch can actually be very simple, so we won't spend a lot of time discussing it in this post. If some of the code examples or terminology used below don't make sense, you may find those two posts helpful.

Introduction to <w:altChunk>

We find the <w:altChunk> element in document.xml. It is a sibling of the paragraph (<w:p>) element and can be used anywhere in the markup that <w:p> can be used (within the body, within tables, etc.).

In a nutshell: We place <w:altChunk> in our document.xml at the point in the document where we wish to import content. It is used to import content in a format that's an alternative to WordprocessingML. The @r:id attribute value specifies the id of the chunk of content to be imported. The chunk is imported from a file located within the .docx package. As an example, let's say we want to add a text file to the HelloWorld.docx we created previously. If we add the text file to the .docx package, we can import it into the main document by making 3 simple updates to our package:

  • Add <w:altChunk r:id="someidentifier"/> to our document.xml.
  • Update [Content_Types].xml to specify the content type of the file (if not already present).
  • Update word/_rels/document.xml.rels to refer to the added file by the id.
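
As a rough sketch of those three updates, the fragments below show what each change might look like for a text file. The id "txtChunk" and the file name "notes.txt" are made-up placeholders for illustration; they don't come from the HelloWorld.docx post.

```xquery
xquery version "1.0-ml";

(: Illustrative only: "txtChunk" and "notes.txt" are hypothetical names. :)

(: 1. Goes inside <w:body> in word/document.xml :)
let $alt-chunk :=
  <w:altChunk r:id="txtChunk"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"/>

(: 2. Goes inside <Types> in [Content_Types].xml :)
let $content-type :=
  <Default Extension="txt" ContentType="text/plain"
    xmlns="http://schemas.openxmlformats.org/package/2006/content-types"/>

(: 3. Goes inside <Relationships> in word/_rels/document.xml.rels.
      Target is resolved relative to the word/ folder. :)
let $relationship :=
  <Relationship Id="txtChunk" Target="notes.txt"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk"
    xmlns="http://schemas.openxmlformats.org/package/2006/relationships"/>

return ($alt-chunk, $content-type, $relationship)
```

The one thing that has to line up is the id: the @r:id on <w:altChunk> and the @Id on the Relationship must be the same string.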

<w:altChunk> can import content with the following format types:

  • text/html
    • An HTML document.
  • text/plain
    • A Text document.
  • application/xhtml+xml
    • An XHTML document.
  • application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
    • An existing .docx package in binary form. (That's right, we can import other Word documents.)
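
If we're assembling packages from arbitrary files, a small lookup like the one below can pick a content type for each import. This is only a sketch: the function name local:alt-chunk-content-type and the extension-to-type pairings are our own assumptions, not part of MarkLogic or Word. Note that the listing later in this post registers htm as application/xhtml+xml, so treat the pairings as a starting point.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: maps a file extension to one of the content
   types listed above. The pairings are assumptions for illustration. :)
declare function local:alt-chunk-content-type(
  $extension as xs:string
) as xs:string?
{
  let $ext := fn:lower-case($extension)
  return
    if ($ext = ("htm", "html")) then "text/html"
    else if ($ext = "txt") then "text/plain"
    else if ($ext = ("xhtml", "xht")) then "application/xhtml+xml"
    else if ($ext = "docx") then
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
    else ()  (: not a format we know how to import :)
};

for $ext in ("txt", "docx", "pdf")
return
  fn:concat($ext, " -> ", (local:alt-chunk-content-type($ext), "unsupported")[1])
```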

Creating a Word Document

So let's see this in action. Using MarkLogic built-ins, we'll grab a text file from our filesystem, a paragraph from a web page on Wikipedia, and a .docx we previously saved to the Server to create a new Word document using <w:altChunk>. We'll zip the results up in a .docx and open it directly in Word.


xquery version "1.0-ml";
declare namespace gso = "generate-simple-ooxml-alt";
declare namespace html ="http://www.w3.org/1999/xhtml";

declare function gso:generate-simple-ooxml-alt(
  $content-types as node(),
  $rels as node(),
  $document as node(),
  $documentxmlrels as node(),
  $importedhtml as node(),
  $txt as node(),
  $docx as node() 
) as binary()
{
  let $manifest :=
    <parts xmlns="xdmp:zip">
      <part>[Content_Types].xml</part>
      <part>_rels/.rels</part>
      <part>word/document.xml</part>
      <part>word/_rels/document.xml.rels</part>
      <part>word/import.htm</part>
      <part>word/import.txt</part>
      <part>word/import.docx</part>
    </parts>
  let $parts := ($content-types, $rels, $document, $documentxmlrels, $importedhtml, $txt, $docx)
  return 
    xdmp:zip-create($manifest, $parts)
};

let $content-types :=
  <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
    <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
    <Default Extension="xml" ContentType="application/xml"/>
    <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
    <Default Extension="htm" ContentType="application/xhtml+xml"/>
    <Default Extension="txt" ContentType="text/plain"/>
    <Default Extension="docx" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
  </Types>

let $rels :=
  <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
  </Relationships>

let $document :=
  <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"  xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
   <w:body>
       <w:altChunk r:id="altChunk1" /> 
       <w:altChunk r:id="altChunk2" /> 
       <w:altChunk r:id="altChunk3" /> 
       <w:p><w:r><w:t>Coolest document ever!</w:t></w:r></w:p>
   </w:body>
  </w:document>

let $documentxmlrels :=
  <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Id="altChunk1" TargetMode="Internal" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Target="import.htm"/>
    <Relationship Id="altChunk2" TargetMode="Internal" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Target="import.txt"/>
    <Relationship Id="altChunk3" TargetMode="Internal" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Target="import.docx"/>
  </Relationships>

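(: The .docx to import, previously saved to the Server :)
let $docx := fn:doc("/import.docx")

(: A text file read from the filesystem :)
let $txt := xdmp:document-get("C:\import.txt")

(: Fetch the web page, select paragraphs from the bodyContent div, and tidy them.
   xdmp:tidy returns (status, result), so [2] selects the cleaned-up result. :)
let $html :=
  <html><body>
    <h1>MarkLogic</h1>{
      xdmp:tidy(
        xdmp:http-get("http://en.wikipedia.org/wiki/Mark_Logic",
          <options xmlns="xdmp:document-get">
            <repair>full</repair>
          </options>)//*:div[@id="bodyContent"]//*:p[1]
      )[2]
    }
  </body></html>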
                 
let $package := gso:generate-simple-ooxml-alt($content-types, $rels, $document, $documentxmlrels, $html, $txt/text(), $docx)
let $filename := "hello-world.docx"
let $disposition := concat("attachment; filename=""", $filename, """")
let $x := xdmp:add-response-header("Content-Disposition", $disposition)
let $x := xdmp:set-response-content-type("application/vnd.openxmlformats-officedocument.wordprocessingml.document")
return
  $package


For those who learn by doing and/or want to jump right in: just cut and paste the code above into a file named generate-docx.xqy. Place that file under /Docs in your MarkLogic server. You'll also need to save a .docx file to the Server named import.docx, and a text file to your filesystem named import.txt. You can edit the file names and paths in the code to generate a .docx with your own content. Assuming you have Word installed, you can open generate-docx.xqy from your favorite browser (example: http://localhost:8000/generate-docx.xqy), and the document we've created will open directly in Word.

Examining the code above, we notice we've added 3 <w:altChunk> elements to our document.xml. Each one is identified by its @r:id, which must match the Id of a Relationship in $documentxmlrels. We've also added the necessary format types to $content-types.
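
Since a mismatched id is an easy mistake to make, a quick sanity check can help before zipping. The sketch below uses a hypothetical helper, local:unresolved-chunk-ids, that lists every <w:altChunk> id with no matching Relationship; the sample data deliberately leaves altChunk2 unresolved.

```xquery
xquery version "1.0-ml";
declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
declare namespace r = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
declare namespace rel = "http://schemas.openxmlformats.org/package/2006/relationships";

(: Hypothetical helper: returns every altChunk r:id that has no
   Relationship with the same Id. :)
declare function local:unresolved-chunk-ids(
  $document as element(w:document),
  $rels as element(rel:Relationships)
) as xs:string*
{
  for $id in $document//w:altChunk/@r:id
  where fn:not($id = $rels/rel:Relationship/@Id)
  return fn:string($id)
};

(: Sample data: two chunks, but only one relationship. :)
let $document :=
  <w:document>
    <w:body>
      <w:altChunk r:id="altChunk1"/>
      <w:altChunk r:id="altChunk2"/>
    </w:body>
  </w:document>
let $rels :=
  <rel:Relationships>
    <rel:Relationship Id="altChunk1" Target="import.htm"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk"/>
  </rel:Relationships>
return local:unresolved-chunk-ids($document, $rels)
(: returns "altChunk2" :)
```

An empty result means every chunk in document.xml has a relationship to resolve against.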

The first <w:altChunk> element references an HTML page. We create the page in the .xqy using xdmp:http-get() to fetch a paragraph from the Wikipedia entry for MarkLogic. So we grab content right off the web for use in our document. Very powerful stuff! We've seen use of this function before when we worked with Excel. Matt Turner also provides a great tutorial on how to use this function with xdmp:tidy().

The second <w:altChunk> element references a text document we have on our filesystem. We grab the file using xdmp:document-get().
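
By default the Server decides how to parse the file from its extension. If we'd rather not rely on that, we can ask for text explicitly with the format option, as in this small sketch (the path is the one from the listing above; adjust it for your own system):

```xquery
xquery version "1.0-ml";

(: Sketch: load a filesystem file explicitly as text rather than
   relying on the extension-to-mimetype mapping. :)
let $txt :=
  xdmp:document-get("C:\import.txt",
    <options xmlns="xdmp:document-get">
      <format>text</format>
    </options>)
return $txt/text()
```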

The final <w:altChunk> element references a .docx we have in our MarkLogic Server. A simple fn:doc() returns the binary package to us.
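
If import.docx is missing, or was loaded as something other than a binary, the package we build won't be much use to Word. A defensive version of the lookup might look like the sketch below; the error names MISSING-DOCX and NOT-BINARY are made up for this example.

```xquery
xquery version "1.0-ml";

(: Sketch: fetch the stored .docx and fail early if it is missing
   or is not a binary document. Error names are hypothetical. :)
let $uri := "/import.docx"
let $docx := fn:doc($uri)
return
  if (fn:empty($docx)) then
    fn:error(xs:QName("MISSING-DOCX"), fn:concat("No document at ", $uri))
  else if (fn:empty($docx/binary())) then
    fn:error(xs:QName("NOT-BINARY"),
      fn:concat($uri, " is not stored as a binary document"))
  else
    $docx
```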

Finally, we zip up all our pieces with a call to gso:generate-simple-ooxml-alt. This function uses xdmp:zip-create() to zip up the parts for us and return our binary package. And just like that, we've created a new Word document.

Examining the .docx Contents

Open the .docx we just created in Word, save it, and unzip the saved file somewhere. Now take a look at the contents. Guess what? There's no HTML page, no text file, and no imported .docx in the new package. Why's that? <w:altChunk> is meant for import only. It facilitates a one-time conversion by Word. Whatever documents are imported are transformed into WordprocessingML within a single .docx package, and the imported content is saved as <w:p> elements within document.xml. Go ahead and take a look. The plus side is that Word has done the transformation for us. We can now save our new document back to MarkLogic, and have all those tasty <w:p>s available for immediate search and repurposing in other documents. And the cycle continues!!
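
We don't have to unzip by hand, either. Assuming the document saved from Word has been loaded back into the Server (the URI /saved-from-word.docx below is a placeholder of our own), something like this sketch lists the parts in the package and counts what's left in document.xml:

```xquery
xquery version "1.0-ml";
declare namespace zip = "xdmp:zip";
declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

(: Assumption: the document re-saved from Word lives at this
   hypothetical URI as a binary document. :)
let $package := fn:doc("/saved-from-word.docx")/binary()

(: Names of every part in the zip package :)
let $part-names := xdmp:zip-manifest($package)//zip:part/fn:string(.)

(: Pull the main document part out of the package :)
let $document-xml := xdmp:zip-get($package, "word/document.xml")

return (
  $part-names,
  fn:concat("altChunk elements: ", fn:count($document-xml//w:altChunk)),
  fn:concat("paragraphs: ", fn:count($document-xml//w:p))
)
```

If Word did the conversion, we'd expect no import.* parts in the list and no remaining <w:altChunk> elements; the paragraph count will depend on the content that was imported.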

Conclusion

MarkLogic Server provides simple and powerful mechanisms for grabbing content from locations besides just the Server. We can use XQuery and MarkLogic built-ins to dynamically create new documents from repurposed content we find in the Server, on the filesystem, and on the web. Microsoft Word provides a ubiquitous and familiar interface for users who need to interact with the documents we create. The <w:altChunk> element provides a simple and effective mechanism for importing content of different format types into a .docx package. When Word consumes the package, it transforms the imported content into <w:p> elements within document.xml. Used together, MarkLogic Server and Microsoft Word provide impressive tools and opportunities for content authoring!

Until next time, Cheers!

As always, if you have any questions, comments, or suggestions for these posts, please let us know on the general discussion mailing list.

For those who want to know more about MarkLogic Server and Office 2007, please see the following posts:

  1. Office Logic (an intro to Office 2007 and the Open XML formats)
  2. Excel-ing with XQuery
  3. Getting OOXML into MarkLogic
  4. Running (a.k.a. <w:r>-ing) with Word
  5. Enriching Word Documents with <w:customXml>
  6. A Final Word
  7. XQuery, Office 2007, and the Open Packaging Convention

Comments

  • Any tutorial in going the other way around-- that is to convert .docx to HTML/CSS so that MarkLogic can be used to search it?
    • I know this doesn't fully answer your question, but the quickest way to make a .docx file searchable in MarkLogic 5 is to use the built-in "Filter Documents" Transformer in Information Studio (http://localhost:8000/appservices/), which extracts the content and metadata from a .docx file (or any of many other types of binary files) and inserts it into the document's corresponding properties file. Or you can extract the content and metadata yourself using xdmp:document-filter(), e.g., try running this in Query Console and see what's output: xdmp:document-filter( xdmp:document-get(concat("https://lswiki.byu.edu","/images/9/91/Sample.docx"), <options xmlns="xdmp:document-get"> <format>binary</format> </options>))