[MarkLogic Dev General] Word Document Processing

Tim Meagher tim at aaom.net
Thu Aug 9 07:52:14 PDT 2012

Hi Pete,


Thanks for the response - it has been very helpful.


Regarding enabling the default conversion option, does that require a
separate license for 4.2 and 5.0+?




From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Pete Aven
Sent: Thursday, August 09, 2012 10:30 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Word Document Processing


Hi Tim,


Wrt 1:


For loading you have a few options:


You can use xdmp:document-load(), or setup a WebDAV server in MarkLogic,
configure a client, and just save your docs through WebDAV, or use
Information Studio to load the .docx from the filesystem. 
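
As a minimal sketch of the xdmp:document-load() route, assuming the file sits
at a hypothetical path on the MarkLogic host and should land at the URI
/File1.docx (both names are placeholders for illustration):

```xquery
xquery version "1.0-ml";

(: hypothetical filesystem path; it must be readable by the MarkLogic server process :)
xdmp:document-load("C:\Users\me\Desktop\File1.docx",
  <options xmlns="xdmp:document-load">
    <uri>/File1.docx</uri>
    <format>binary</format>
  </options>)
```

The explicit binary format is probably redundant for a .docx, since the
default mimetype mapping should already treat it as binary, but it makes the
intent obvious.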


For Office 2007/2010, ensure the Office OpenXML Extract pipeline is enabled
in MarkLogic.  This will unzip the associated parts for each Office doc and
place them in a sibling folder to the source doc, similar to conversion.
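
To check what the pipeline produced, something like the following should list
the URIs of the extracted parts.  The directory name is a guess modeled on the
_xlsx_parts naming used in the zip example later in this message; adjust it to
whatever your database actually shows.

```xquery
xquery version "1.0-ml";

(: assumed sibling folder for a source doc stored at /File1.docx :)
let $directory := "/File1_docx_parts/"
for $part in xdmp:directory($directory, "infinity")
return xdmp:node-uri($part)
```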


To download you can do a few things as well.  For the source .docx requested
through a browser:


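xquery version "1.0-ml";

declare namespace html = "http://www.w3.org/1999/xhtml";

let $filename :=  "File1.docx"

(: doubled quotes produce a literal " inside an XQuery string :)
let $disposition := fn:concat("attachment; filename=""",$filename,"""")

let $x := xdmp:add-response-header("Content-Disposition", $disposition)

(: the right-hand side of this binding was lost in the archived message;
   setting the .docx content type here is an assumed reconstruction :)
let $x := xdmp:set-response-content-type("application/vnd.openxmlformats-officedocument.wordprocessingml.document")

return   fn:doc(fn:concat("/",$filename))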


Or zip up the extracted parts on demand and save to the filesystem:


xquery version "1.0-ml";


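let $directory := "/MySpreadsheet1_xlsx_parts/"

(: the right-hand side of this binding was lost in the archived message;
   listing the URIs under $directory this way is an assumed reconstruction :)
let $uris := for $d in xdmp:directory($directory, "infinity") return xdmp:node-uri($d)

(: the extracted part documents, in the same order as $uris :)
let $parts := for $i in $uris let $x := fn:doc($i) return  $x

(: one <part> per document, named by its path relative to $directory :)
let $manifest := <parts xmlns="xdmp:zip">
                 {
                              for $i in $uris

                              let $dir := fn:substring-after($i,$directory)

                              let $part :=  <part>{$dir}</part>

                              return $part
                 }
                 </parts>

let $xlsx := xdmp:zip-create($manifest, $parts)

return xdmp:save("C:\Users\me\Desktop\ExcelChartSample.xlsx",$xlsx)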


Or you can do some combination of the above, or just drag the source out of
your WebDAV client.


Wrt 2:


Office 2003 and earlier Office docs are not natively XML.  For these you'll
need to enable the default conversion option.




You can return the document much as in 1 above (a different response content
type may be required, I forget at the moment), or with xdmp:save(), or from
WebDAV, but there are no extracted parts to zip up, as the formats generated
are not native XML formats for Office.
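
A rough sketch of the browser download for a Word 97-2003 file, assuming a
hypothetical document stored at /File2.doc.  application/msword is the
standard MIME type for .doc; whether your setup needs anything beyond that is
not something I've verified here.

```xquery
xquery version "1.0-ml";

(: hypothetical filename, for illustration only :)
let $filename := "File2.doc"

let $disposition := fn:concat("attachment; filename=""", $filename, """")

let $x := xdmp:add-response-header("Content-Disposition", $disposition)

(: standard MIME type for legacy binary Word documents :)
let $x := xdmp:set-response-content-type("application/msword")

return fn:doc(fn:concat("/", $filename))
```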


Wrt 3:


Default conversion will convert PDF and Office 2003 and earlier docs to
XHTML and DocBook Lite.  You could then write your own transform to an Office
2007 format.  That's where the Office Toolkit for Word may be useful.


But note, the default conversion option does not work for Office 2007/2010.
Those formats are worked with in their native XML formats. There's currently
no conversion option to generate XHTML or DocBook for one of these 2


Wrt 4:




Hope this helps,




From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Tim Meagher
Sent: Thursday, August 09, 2012 6:01 AM
To: 'MarkLogic Developer Discussion'
Subject: [MarkLogic Dev General] Word Document Processing


Hi Folks,


I'm new to the idea of storing, converting, and extracting Microsoft Word
documents in and from MarkLogic and I have a couple of questions:


1.       How does one go about storing a Microsoft Word 2007/2010 docx
document in MarkLogic and then downloading it?  It seems to me that this is
pretty straight-forward, but I'm wondering if there are any catches.



2.       How do I do the same for Microsoft Word 97-2003 doc documents?




3.       I have reviewed the marklogic-document-support PDF for ML 5 which
includes information about the Conversion option.  Do I understand correctly
that with the Conversion option I should be able to load any Mac or
Microsoft Word document into MarkLogic, convert it into a common XHTML
format which can be parsed (and edited), and further convert it into a
desired version (e.g., Microsoft Word 2007 docx) for download?


4.       Is the Conversion option also available for ML 4.2 and if so, where
would I get the marklogic-document-support PDF for that?


Thanks for the help!


Tim Meagher

