Excel-ing with XQuery

Excel-ing with XQuery

by Pete Aven

Part 2 in a series on MarkLogic Server and Office 2007

Categories: Office 2007
Altitude: 1,000 feet

Last week we introduced you to Office 2007 OOXML. We introduced the OOXML formats, discussed the pieces of a simple Word document, and showed you how to create your own Word document using XQuery and MarkLogic Server. This week we discuss creating an Excel 2007 document. Instead of creating the content for a document directly in XQuery, we're going to pull it off of a web page and transform it into XML that Excel 2007 will understand. We'll be moving fast as we want to cover as much ground as possible in a short amount of time. We've assumed you've already read part 1 and have MarkLogic Server and CQ installed. If you haven't, you may want to go read part 1 and install now so you can get the most out of the tutorial.

Editor's Note: If you want to start experimenting and exploring with XQuery, it's very simple to install MarkLogic Server free using a Community License. And if you read the fine print, you'll see the Windows Server 2003 version runs fine on XP for development purposes. The installation guide is 29 pages long, but you might need to refer to maybe 3 of them. It's as simple as downloading and clicking next, next, next.... You just need to create an admin user so you can login and enter your Community Key and Bam! - you're up and running. From there, the Getting Started guide, which is only 18 pages, will show you an interactive page ( installed by default ) you can access from your browser to very quickly load and start querying sample data. For the examples I like to use CQ. For CQ, once you have your Server installed, you just download the latest copy of CQ, unzip, and place the cq folder under /Docs. From there you access at localhost:8000/cq. It's simple, fast, intuitive and free! You'll get much more out of these examples by doing them, so have some fun!

Introduction to Excel 2007

For our example, we'll create an Excel spreadsheet that has one sheet. As discussed in part 1, an OOXML document is just a bag of xml wrapped up in a zip file. To create a simple Word 2007 document, we required only 3 files: [Content_Types].xml, document.xml and .rels. For a simple Excel 2007 Spreadsheet, we'll need 5: [Content_Types].xml, .rels, workbook.xml, sheet1.xml, and workbook.xml.rels. To create a valid Excel 2007 package, you will always require the workbook, and at least one sheet. Both of these have a specific content type which is stored in [Content_Types].xml.

Please note: for all the examples, what's between an open bracket (<) and closing bracket (>) for a node, would all be on one line. We split the lines for readability as some of the Types, Namespaces, etc. are very long.

     let $content-types :=
      <Types
      xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
	<Default Extension="rels"
	 ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
	<Default Extension="xml"
	 ContentType="application/xml"/>
	<Override PartName="/workbook.xml"
	 ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>
	<Override PartName="/sheet1.xml"
	 ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
      </Types>

The .rels file relates the start-parts of the application, which in this case, is the workbook.

     let $rels :=
      <Relationships
      xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
       <Relationship Id="rId1"
	Type=http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument
	Target="workbook.xml"/>
      </Relationships>

The workbook is the start-part, and we know it has 1 sheet, so we relate workbook.xml and sheet1.xml using workbook.xml.rels.

     let $workbookrels :=
      <Relationships
      xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
       <Relationship Id="rId1"
	Type=http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet
	Target="sheet1.xml"/>
      </Relationships>

The start-part keeps track of the worksheets and other components for the workbook. For our example the workbook contains a single sheet which we've named Sheet1.

     let $workbook :=
      <workbook
      xmlns=http://schemas.openxmlformats.org/spreadsheetml/2006/main
      xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
       <sheets>
	<sheet name="Sheet1" sheetId="1" r:id="rId1" />
       </sheets>
      </workbook>

Finally, we define Sheet1. A <row> element defines a new row. A <row> can contain many cells ( <c> ). If the value of a cell is a number, you can set the number as the value of the <v> element.

If you want to use a string for your cell values, you'll have to do something slightly different. In that case you use the @t attribute of the <c> element and set it equal to "inlineStr". Then, instead of using the <v> element for your value, you use the inline string element <is>. Within <is> set your string in the text node <t>.

     let $sheet1 :=
      <worksheet
      xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
       <sheetData>
	 <row>
	  <c>
	    <v>99999</v>
	  </c> <c t="inlineStr">
	    <is>
	      <t>MarkLogic</t>
	    <is>
	  </c>
	 </row>
       </sheetData>
      </worksheet>

Now what we've described above isn't the only way to store strings in Excel. There's also the concept of Shared Strings, which we'll only briefly discuss. If you have rows where multiple cells repeat the same text value, you have the ability to store the string just once, instead of repeating it multiple times in your sheet. You can do this using a Shared String table, which is just another XML file in your .xlsx package. There can be only one Shared String table per spreadsheet, so it will relate to the workbook part, and not the sheet. You add an entry to [Content_Types].xml, an entry in your workbook.xml.rels to define the relationship, the shared string table <sst> stored as SharedStrings.xml in the root of the package, and then reference the Shared String values in your sheet cells by position using the <v> element. To see what this looks like, just open Excel 2007, add some test text values in some cells, and save the file. Next, change the extension of the file to .zip, right-click and select "Extract All". You can then investigate the parts described above for yourself. You'll see a bunch of other interesting XML files in there as well. If you'd like to know more, you can inspect the ECMA specifications here. There's also a very helpful e-book on the OpenXMLDeveloper site. For our purposes though, we're going to store the strings directly in the worksheet - because it's a lot easier to follow what we're doing.

Getting data from a <table>

Ok, so now we have the parts for a simple Excel 2007 spreadsheet. We know from part 1 that we could zip them all together at this point using xdmp:zip-manifest() and xdmp:zip-create(). We could then display our newly created .xlsx package in Excel 2007 and start editing. Before we do that though, instead of defining the cells for our spreadsheet in the XQuery, like we've done above, let's get the data for our spreadsheet from a table on the web. For that we can use xdmp:http-get().

xdmp:http-get() takes the URI from which we want to fetch a web page as a parameter; it also allows us to set some <options>,which can give us the ability to authenticate with the site we're trying to access (if required). The function sends the GET method to the specified URI and returns the HTTP response as well as whatever information is identified by the specified URI (for example, an XHTML document). I've stored an XHTML page, Books.html, that defines a table of some of my favorite books, in the /Docs directory of my MarkLogic Server install. You can do something similar, or alternately, you can try this with any XHTML table out there on the web.

If you do the latter, you may have to make some updates to the XQuery in our example, so I suggest using the table provided to start with if you're not comfortable making those updates. You can try the following and evaluate from CQ as well, just change your URI, username, and password accordingly. If you don't need to authenticate, you can remove the options node.

     xdmp:http-get("http://localhost:8000/Books.html",
		     <options xmlns="xdmp:http">
		      <authentication method="basic">
			<username>usr</username>
			<password>pwd</password>
		      </authentication>
		     </options>
     )[2]

Make sure to note that we use the predicate [2] to specify the second node of the response sequence that the function returns; Remember, xdmp:http-get() returns two things, the first item is the HTTP response, the second is the XHTML we want to transform into OOXML. Now, if you've been paying attention up to this point, you should have noticed two things: 1) this function returns a string, we need XML to achieve our goal and 2) the web page I'm calling XHTML, isn't really, but happens to work for our example. The xdmp:http-get() function will return any content specified by the URI, not just a web page. And when we're using it out there in the wild on the web, how can we transform what is returned by xdmp:http-get() into valid XHTML and ensure it has the proper encoding so we can process it? We can use xdmp:tidy().

We run xdmp:tidy() on a specified HTML document to convert the document to well-formed and clean XHTML. This function also returns two nodes: the first is a status node indicating any errors or warnings from tidy, and the second is an XHTML node containing the cleaned XHTML. Since XHTML is XML, what's returned will work great for us. We just need to pass our call to xdmp:http-get() as the argument to xdmp:tidy() and again specify that we want the second ([2]) node in the sequence returned. For more information on all these functions and others, please see the MarkLogic XQuery API reference available online.

     xdmp:tidy(
       xdmp:http-get("http://localhost:8000/Books.html",
		     <options xmlns="xdmp:http">
		      <authentication method="basic">
			<username>usr</username>
			<password>pwd</password>
		      </authentication>
		     </options>
       )[2]
     )[2]

Next, we'll create a function called getRows() and we'll assume we're going to pass our <table> to the function as an argument. The function will return our rows for Excel.

 
     define function getRows(
       $x as element(table)*
     ) as element(ms:row)*
     {
       for $i in $x//tr return
	<ms:row> {
	  for $d in $i/td return
	   <ms:c t="inlineStr">
	    <ms:is>
	     <ms:t>{$d//text()}</ms:t>
	    </ms:is>
	   </ms:c>
	} </ms:row>
     }
 

For each row in the XHTML table, we create our row for Excel, and for each XHTML table cell in the row, we create the cell we need for our sheet1.xml.

Putting it all Together

We're almost there. We now have all the components we need to create an Excel 2007 spreadsheet. Since we plan to do this regularly and we know what the components of a simple spreadsheet are, let's create a module that will take our nodes and package them up for us. Create a new file in the /Modules directory of your MarkLogic Server installation and name it marklogic-ooxml.xqy. Place the following in the file and save. For more information on modules, please see the Developer's Guide.

 
     module "marklogic-ooxml"
     define function generate-simple-xl-ooxml(
       $content-types as node(),
       $workbook as node(),
       $rels as node(),
       $workbookrels as node(),
       $sheet1 as node()
     ) as binary()
     {
       let $manifest := <parts xmlns="xdmp:zip">
			     <part>[Content_Types].xml</part>
			     <part>workbook.xml</part>
			     <part>_rels/.rels</part>
			     <part>_rels/workbook.xml.rels</part>
			     <part>sheet1.xml</part>
			</parts>
       let $parts := ($content-types, $workbook, $rels,
                      $workbookrels, $sheet1)
       return
	 xdmp:zip-create($manifest, $parts))
     }

This module takes the nodes we create for our Excel spreadsheet as parameters. It creates a manifest and zips up the nodes, returning the package to the caller. In part 1, we created our .docx with a function directly in the main .xqy file we evaluated. You can easily create a separate function in the above module to handle Word documents, or modify the function above to account for the different file contents. For this example, only the above is the required.

Now, under the /Docs directory of your MarkLogic Server installation, create a new file named ExcelTest.xqy. In this file we'll place the nodes and function calls we've defined above. We'll complete it all by calling the module we just created.

     import module "marklogic-ooxml" at "/marklogic-ooxml.xqy"

     default element namespace = "http://www.w3.org/1999/xhtml"

     declare namespace mlooxml = "marklogic-ooxml" 
     declare namespace ms = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"

     define function getRows(
       $x as element(table)*
     ) as element(ms:row)*
     { 
       for $i in $x//tr
       return 
        <ms:row>
        {
         for $d in $i/td return
          <ms:c t="inlineStr">
           <ms:is>
             <ms:t>{$d//text()}</ms:t>
           </ms:is>
          </ms:c>
        }
        </ms:row>
     }

     let $content-types :=
       <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
	<Default Extension="rels" 
         ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
	<Default Extension="xml" ContentType="application/xml"/>
	<Override PartName="/workbook.xml" 
         ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>
	<Override PartName="/sheet1.xml" 
         ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
       </Types>

     let $workbook := 
       <workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" 
                 xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
        <sheets>
          <sheet name="Sheet1" sheetId="1" r:id="rId1" />
        </sheets>
       </workbook>

     let $rels :=
       <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
	<Relationship Id="rId1"
         Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" 
	 Target="workbook.xml"/>
       </Relationships>

     let $workbookrels :=
       <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
	<Relationship Id="rId1" 
         Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" 
	 Target="sheet1.xml"/>
       </Relationships>

     let $page :=
       xdmp:tidy(xdmp:http-get("http://localhost:8000/Books.html",
                                <options xmlns="xdmp:http">
                                 <authentication method="basic">
                                   <username>usr</username>
                                   <password>pwd</password>
                                 </authentication>
                                </options>)[2])[2]

    let $tables := $page//table

    let $sheet1 := 
       <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
        <sheetData>
           {getRows($tables)}
        </sheetData>
       </worksheet>

    let $package := mlooxml:generate-simple-xl-ooxml($content-types, $workbook, $rels, $workbookrels, $sheet1)

    let $filename :=  "ExcelTest.xlsx"
    let $disposition := concat("attachment; filename=""",$filename,"""")
    let $x := xdmp:add-response-header("Content-Disposition", $disposition)
    let $x := xdmp:set-response-content-type("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
    return
      $package

Once we have the $package, we can insert the .xlsx document into our XML Database. We've chosen to open the file into Excel so we can view it right away. Because we've chosen to open our file instead of saving, we give our file a name, and set the response-header and response-content-type so that the proper application will open our document. If you have Office 2007 installed, you can now evaluate the above by opening your favorite browser and directing the URL to http://localhost:8000/ExcelTest.xqy. Excel will open and display the spreadsheet we've created.

You'll notice the first row is empty. In our getRows() function, we didn't account for the <table> headers. If you unzip the .xlsx we've created and inspect Sheet1.xml, you'll find an empty <row/> element after <sheetData>. For each <tr>, we looked at <td>, but forgot about <th>. Updating the function to account for this is left as an exercise for the reader. You can also improve your XQuery skills by aiming xdmp:http-get() at other sites and making updates to your functions as required. Once you really start looking at other people's HTML, you'll see that all sorts of craziness exists out there, but XQuery and MarkLogic built-ins like xdmp:tidy() provide us with a set of simple and very effective tools for getting us the data we really want.

Conclusion

That was just a simple example, but we do hope you're starting so see the possibilities. XQuery, MarkLogic Server, and the new OOXML formats for Office 2007 are a powerful combination. Using a few lines of XQuery and some MarkLogic built-in functions, we were able to pull an Excel spreadsheet practically out of thin air. With the XQuery we've covered so far, we now have the ability to create Word and Excel documents from disparate resources that upon delivery users will understand, appreciate, can find value in, and are comfortable with. With the OOXML formats, our Word and Excel documents can be the final destination for publishing, or just another resource available for re-use in our XML Repository that we can easily query and manipulate with XQuery. Though we didn't cover any business logic, we've been introduced to some powerful tools for creating reports. Until next week, enjoy!

blogroll Blogroll

Comments