[MarkLogic Dev General] Word Document Processing

Pete Aven Pete.Aven at marklogic.com
Thu Aug 9 07:30:04 PDT 2012


Hi Tim,

Wrt 1:

For loading you have a few options:

You can use xdmp:document-load(), set up a WebDAV server in MarkLogic and save your docs through a configured WebDAV client, or use Information Studio to load the .docx from the filesystem.
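
A minimal sketch of the xdmp:document-load() route, assuming the file sits at the hypothetical path C:\docs\File1.docx on the MarkLogic host and should be stored at the URI /File1.docx:

```xquery
xquery version "1.0-ml";

(: The filesystem path and target URI are illustrative assumptions. :)
xdmp:document-load(
  "C:\docs\File1.docx",
  <options xmlns="xdmp:document-load">
    <uri>/File1.docx</uri>
    <format>binary</format>
  </options>
)
```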

For Office 2007/2010, ensure the Office OpenXML Extract pipeline is enabled in MarkLogic.  This will unzip the associated parts for each Office doc and place them in a sibling folder to the source doc, similar to conversion.
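
To check that the pipeline ran, you could list what landed in the parts folder. This sketch assumes the source was stored as /File1.docx and that the folder is named /File1_docx_parts/, by analogy with the _xlsx_parts folder used in the zip example in this message; adjust it to whatever your database actually shows:

```xquery
xquery version "1.0-ml";

(: The folder name is an assumption; verify it in your database. :)
for $part in xdmp:directory("/File1_docx_parts/", "infinity")
return xdmp:node-uri($part)
```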

To download you can do a few things as well.  For the source .docx requested through a browser:

xquery version "1.0-ml";

(: Send the stored .docx back to the browser as a download. :)
let $filename := "File1.docx"
let $disposition := fn:concat('attachment; filename="', $filename, '"')
let $_ := xdmp:add-response-header("Content-Disposition", $disposition)
let $_ := xdmp:set-response-content-type("application/vnd.openxmlformats-officedocument.wordprocessingml.document")
return fn:doc(fn:concat("/", $filename))

Or zip up the extracted parts on demand and save to the filesystem:

xquery version "1.0-ml";

(: cts:uris() requires the URI lexicon to be enabled on the database. :)
let $directory := "/MySpreadsheet1_xlsx_parts/"
let $uris := cts:uris("", "document", cts:directory-query($directory, "infinity"))
(: The parts must be in the same order as the manifest entries. :)
let $parts := for $uri in $uris return fn:doc($uri)
let $manifest :=
  <parts xmlns="xdmp:zip">
  {
    for $uri in $uris
    return <part>{fn:substring-after($uri, $directory)}</part>
  }
  </parts>
let $xlsx := xdmp:zip-create($manifest, $parts)
return xdmp:save("C:\Users\me\Desktop\ExcelChartSample.xlsx", $xlsx)

Or you can do some combination of the above, or just drag the source out of your WebDAV client, or...

Wrt 2:

Office 2003 and earlier Office docs are not natively XML.  For these you'll need to enable the default conversion option.

http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/xml/cpf/default.xml&query=default+conversion

You can return the document as in 1 above (a different response content type may be required, I forget at the moment), with xdmp:save(), or from WebDAV, but there are no extracted parts to zip up, as the formats generated are not native XML formats for Office.
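
On the content type: application/msword is the registered MIME type for the binary .doc format, so a sketch of the browser download might look like this, assuming a hypothetical source stored at /File2.doc:

```xquery
xquery version "1.0-ml";

(: The file name and URI are illustrative assumptions. :)
let $filename := "File2.doc"
let $disposition := fn:concat('attachment; filename="', $filename, '"')
let $_ := xdmp:add-response-header("Content-Disposition", $disposition)
let $_ := xdmp:set-response-content-type("application/msword")
return fn:doc(fn:concat("/", $filename))
```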

Wrt 3:

Default conversion will convert PDF and Office 2003 and earlier docs to XHTML and DocBook Lite.  You could then write your own transform to an Office 2007 format.  That's where the Office Toolkit for Word may be useful.

But note, the default conversion option does not work for Office 2007/2010.  Those documents are handled in their native XML formats. There's currently no conversion option to generate XHTML or DocBook for either of these two formats.

Wrt 4:

Yes. http://docs.marklogic.com/4.2doc/docapp.xqy#display.xqy?fname=http://pubs/4.2doc/xml/cpf/default.xml&query=default+conversion

Hope this helps,
Pete


From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Tim Meagher
Sent: Thursday, August 09, 2012 6:01 AM
To: 'MarkLogic Developer Discussion'
Subject: [MarkLogic Dev General] Word Document Processing

Hi Folks,

I'm new to the idea of storing, converting, and extracting Microsoft Word documents in and from MarkLogic and I have a couple of questions:


1.       How does one go about storing a Microsoft Word 2007/2010 docx document in MarkLogic and then downloading it?  It seems to me that this is pretty straight-forward, but I'm wondering if there are any catches.




2.       How do I do the same for Microsoft Word 97-2003 doc documents?



3.       I have reviewed the marklogic-document-support PDF for ML 5 which includes information about the Conversion option.  Do I understand correctly that with the Conversion option I should be able to load any Mac or Microsoft Word document into MarkLogic, convert it into a common XHTML format which can be parsed (and edited), and further convert it into a desired version (e.g., Microsoft Word 2007 docx) for download?



4.       Is the Conversion option also available for ML 4.2 and if so, where would I get the marklogic-document-support PDF for that?


Thanks for the help!

Tim Meagher
