[MarkLogic Dev General] Word Document Processing

Tim Meagher tim at aaom.net
Thu Aug 9 07:52:14 PDT 2012


Hi Pete,

 

Thanks for the response - it has been very helpful.

 

Regarding enabling the default conversion option, does that require a
separate license for 4.2 and 5.0+?

 

Tim

 

From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Pete Aven
Sent: Thursday, August 09, 2012 10:30 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Word Document Processing

 

Hi Tim,

 

Wrt 1:

 

For loading you have a few options:

 

You can use xdmp:document-load(), or set up a WebDAV server in MarkLogic,
configure a client, and just save your docs through WebDAV, or use
Information Studio to load the .docx from the filesystem.
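
As a rough sketch of the xdmp:document-load() route: the filesystem path and
the target URI below are placeholders rather than values from this thread, and
the path is read on the MarkLogic host, not on the client machine.

```xquery
xquery version "1.0-ml";

(: Hypothetical source path and target URI; adjust both for your setup. :)
xdmp:document-load(
  "C:\Users\me\Desktop\File1.docx",
  <options xmlns="xdmp:document-load">
    <uri>/File1.docx</uri>
    <!-- A .docx is a zip archive, so it is stored as a binary document. -->
    <format>binary</format>
  </options>
)
```

The URI "/File1.docx" is chosen so that it lines up with the download snippet
further down, which reads the document back from that same URI.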

 

For Office 2007/2010, ensure the Office OpenXML Extract pipeline is enabled
in MarkLogic.  This will unzip the associated parts for each Office doc and
place them in a sibling folder to the source doc, similar to conversion.
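
To check that the extract pipeline has run, one option is to list what landed
in the sibling folder. This is only a sketch: the directory name is an
assumption that follows the "/MySpreadsheet1_xlsx_parts/" naming used later in
this message.

```xquery
xquery version "1.0-ml";

(: Assumed parts directory for a source document at /File1.docx. :)
let $directory := "/File1_docx_parts/"
(: Walk the directory recursively and return the URI of every extracted part. :)
for $part in xdmp:directory($directory, "infinity")
return xdmp:node-uri($part)
```

An empty result would suggest that the pipeline is not attached to the domain
the document was loaded into, or that the directory name differs.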

 

To download you can do a few things as well.  For the source .docx requested
through a browser:

 

xquery version "1.0-ml";

let $filename := "File1.docx"
(: Doubled quotes inside the literals produce a quoted filename. :)
let $disposition := fn:concat("attachment; filename=""", $filename, """")
let $header := xdmp:add-response-header("Content-Disposition", $disposition)
let $type := xdmp:set-response-content-type(
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
return fn:doc(fn:concat("/", $filename))

 

Or zip up the extracted parts on demand and save to the filesystem:

 

xquery version "1.0-ml";

let $directory := "/MySpreadsheet1_xlsx_parts/"
(: cts:uris requires the URI lexicon to be enabled on the database. :)
let $uris := cts:uris("", "document", cts:directory-query($directory, "infinity"))
(: Fetch the parts in the same order as the manifest entries below. :)
let $parts := for $i in $uris return fn:doc($i)
let $manifest :=
  <parts xmlns="xdmp:zip">
    {
      for $i in $uris
      return <part>{fn:substring-after($i, $directory)}</part>
    }
  </parts>
let $xlsx := xdmp:zip-create($manifest, $parts)
return xdmp:save("C:\Users\me\Desktop\ExcelChartSample.xlsx", $xlsx)

 

Or you can do some combination of the above, or just drag the source out of
your WebDAV client, or ...

 

Wrt 2:

 

Office 2003 and earlier Office docs are not natively XML.  For these you'll
need to enable the default conversion option.

 

http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/xml/cpf/default.xml&query=default+conversion

 

You can return these docs much as in 1 above (a different response content
type may be required, I forget at the moment), or with xdmp:save(), or from
WebDAV, but there are no extracted parts to zip up, as these formats are not
native XML formats for Office.
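
For the browser download of a legacy .doc, a sketch along the lines of the
snippet in 1 might look like this. The filename is a placeholder;
application/msword is the registered media type for Word 97-2003 files.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI of a Word 97-2003 document stored as binary. :)
let $filename := "File2.doc"
let $disposition := fn:concat("attachment; filename=""", $filename, """")
let $header := xdmp:add-response-header("Content-Disposition", $disposition)
(: Legacy .doc files use application/msword rather than the OpenXML type. :)
let $type := xdmp:set-response-content-type("application/msword")
return fn:doc(fn:concat("/", $filename))
```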

 

Wrt 3:

 

Default conversion will convert PDF and Office 2003 and earlier docs to
XHTML and DocBook Lite.  You could then write your own transform to an Office
2007 format.  That's where the Office Toolkit for Word may be useful.
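
To give a feel for what such a transform involves, here is a minimal sketch
that turns the paragraphs of a converted XHTML document into a bare .docx
package. It does not use the Office Toolkit for Word; the input and output
URIs are assumptions, and only plain paragraph text is carried over (no
styles, tables, or images). A .docx package needs at least
[Content_Types].xml, _rels/.rels, and word/document.xml.

```xquery
xquery version "1.0-ml";

declare namespace html = "http://www.w3.org/1999/xhtml";
declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

(: Assumed URI of the XHTML produced by default conversion. :)
let $xhtml := fn:doc("/File2_doc.xhtml")

(: Declares the content type of each part in the package. :)
let $content-types :=
  <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
    <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
    <Default Extension="xml" ContentType="application/xml"/>
    <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
  </Types>

(: Points the package at its main document part. :)
let $rels :=
  <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
  </Relationships>

(: One WordprocessingML paragraph per XHTML paragraph, text only. :)
let $document :=
  <w:document>
    <w:body>{
      for $p in $xhtml//html:p
      return <w:p><w:r><w:t>{fn:normalize-space($p)}</w:t></w:r></w:p>
    }</w:body>
  </w:document>

let $manifest :=
  <parts xmlns="xdmp:zip">
    <part>[Content_Types].xml</part>
    <part>_rels/.rels</part>
    <part>word/document.xml</part>
  </parts>

let $docx := xdmp:zip-create($manifest, ($content-types, $rels, $document))
(: Hypothetical target URI for the generated package. :)
return xdmp:document-insert("/File2.docx", $docx)
```

A real transform would also need to map headings, lists, and inline
formatting, which is the part the toolkit is meant to help with.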

 

But note, the default conversion option does not work for Office 2007/2010.
Those formats are worked with in their native XML formats. There's currently
no conversion option to generate XHTML or DocBook for either of these two
formats.

 

Wrt 4:

 

Yes.
http://docs.marklogic.com/4.2doc/docapp.xqy#display.xqy?fname=http://pubs/4.2doc/xml/cpf/default.xml&query=default+conversion

 

Hope this helps,

Pete

 

 

From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Tim Meagher
Sent: Thursday, August 09, 2012 6:01 AM
To: 'MarkLogic Developer Discussion'
Subject: [MarkLogic Dev General] Word Document Processing

 

Hi Folks,

 

I'm new to the idea of storing, converting, and extracting Microsoft Word
documents in and from MarkLogic and I have a couple of questions:

 

1.       How does one go about storing a Microsoft Word 2007/2010 docx
document in MarkLogic and then downloading it?  It seems to me that this is
pretty straightforward, but I'm wondering if there are any catches.

 

 

2.       How do I do the same for Microsoft Word 97-2003 doc documents?

 

3.       I have reviewed the marklogic-document-support PDF for ML 5 which
includes information about the Conversion option.  Do I understand correctly
that with the Conversion option I should be able to load any Mac or
Microsoft Word document into MarkLogic, convert it into a common XHTML
format which can be parsed (and edited), and further convert it into a
desired version (e.g., Microsoft Word 2007 docx) for download?

 

4.       Is the Conversion option also available for ML 4.2 and if so, where
would I get the marklogic-document-support PDF for that?

 

Thanks for the help!

 

Tim Meagher

 
