[MarkLogic Dev General] Word Document Processing

Pete Aven Pete.Aven at marklogic.com
Thu Aug 9 11:39:27 PDT 2012


Hi Tim,

Yes, for 4.2 and 5.0, Default Conversion is not actually enabled by default, and requires additional licensing.

If you wish to try Conversion out for evaluation purposes, you can use the Express license.  I believe it's enabled there, but if not, you can contact MarkLogic at more at marklogic.com and we can possibly enable it for you for evaluation to see if it meets your requirements.

The Content Processing Framework IS available, so the pipeline I mentioned for Office 2007/2010 will still work without the additional conversion requirement.  But for Office 2003 and earlier as well as PDFs, you require the additional licensing for default conversion.

Hope this helps,
Pete

From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Tim Meagher
Sent: Thursday, August 09, 2012 10:52 AM
To: 'MarkLogic Developer Discussion'
Subject: Re: [MarkLogic Dev General] Word Document Processing

Hi Pete,

Thanks for the response - it has been very helpful.

Regarding enabling the default conversion option, does that require a separate license for 4.2 and 5.0+?

Tim

From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Pete Aven
Sent: Thursday, August 09, 2012 10:30 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Word Document Processing

Hi Tim,

Wrt 1:

For loading you have a few options:

You can use xdmp:document-load(), or set up a WebDAV server in MarkLogic, configure a client, and just save your docs through WebDAV, or use Information Studio to load the .docx from the filesystem.
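
As a minimal sketch of the xdmp:document-load() route: the filesystem path and the target URI below are placeholders I made up, so adjust them to your environment. Storing the .docx as a binary keeps the original file intact for later download.

```xquery
xquery version "1.0-ml";

(: Hypothetical source path and target URI; change both for your setup. :)
xdmp:document-load(
  "C:\Users\me\Desktop\File1.docx",
  <options xmlns="xdmp:document-load">
    <uri>/File1.docx</uri>
    <format>binary</format>
  </options>
)
```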

For Office 2007/2010, ensure the Office OpenXML Extract pipeline is enabled in MarkLogic.  This will unzip the associated parts for each Office doc and place them in a sibling folder to the source doc, similar to conversion.
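
To check what the pipeline extracted, something like the following should list the part URIs. This is only a sketch: it assumes the source doc was stored at /File1.docx and that the sibling folder follows the same naming pattern as the /MySpreadsheet1_xlsx_parts/ directory used further down, so verify the actual directory name in your database.

```xquery
xquery version "1.0-ml";

(: Assumed directory name, based on the <name>_<ext>_parts pattern. :)
for $doc in xdmp:directory("/File1_docx_parts/", "infinity")
return xdmp:node-uri($doc)
```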

To download you can do a few things as well.  For the source .docx requested through a browser:

xquery version "1.0-ml";
(: Suggest a download filename to the browser. :)
let $filename := "File1.docx"
let $disposition := fn:concat("attachment; filename=""", $filename, """")
let $x := xdmp:add-response-header("Content-Disposition", $disposition)
let $x := xdmp:set-response-content-type("application/vnd.openxmlformats-officedocument.wordprocessingml.document")
(: Return the stored .docx as the response body. :)
return fn:doc(fn:concat("/", $filename))

Or zip up the extracted parts on demand and save to the filesystem:

xquery version "1.0-ml";

(: Directory holding the parts extracted from the source document. :)
let $directory := "/MySpreadsheet1_xlsx_parts/"
(: cts:uris requires the URI lexicon to be enabled on the database. :)
let $uris := cts:uris("", "document", cts:directory-query($directory, "infinity"))
let $parts := for $i in $uris return fn:doc($i)

(: One <part> per document, named by its path relative to the directory. :)
let $manifest :=
  <parts xmlns="xdmp:zip">{
    for $i in $uris
    return <part>{fn:substring-after($i, $directory)}</part>
  }</parts>
let $xlsx := xdmp:zip-create($manifest, $parts)
return xdmp:save("C:\Users\me\Desktop\ExcelChartSample.xlsx", $xlsx)

Or you can do some combination of the above, or just drag the source out of your WebDAV client, or...

Wrt 2:

Office 2003 and earlier Office docs are not natively XML.  For these you'll need to enable the default conversion option.

http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/xml/cpf/default.xml&query=default+conversion

You can return the document in a similar way to 1 above (a different response content type may be required, I forget at the moment), or with xdmp:save(), or from WebDAV.  But there are no extracted parts to zip up, as the formats generated by conversion are not native XML formats for Office.
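
For a Word 97-2003 file, a sketch of the browser download might look like this. It assumes the source .doc was stored as a binary at the hypothetical URI /File2.doc, and it uses application/msword, the registered MIME type for .doc files.

```xquery
xquery version "1.0-ml";

(: /File2.doc is a hypothetical URI; use the URI of your stored document. :)
let $filename := "File2.doc"
let $disposition := fn:concat('attachment; filename="', $filename, '"')
return (
  xdmp:add-response-header("Content-Disposition", $disposition),
  xdmp:set-response-content-type("application/msword"),
  fn:doc(fn:concat("/", $filename))
)
```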

Wrt 3:

Default conversion will convert PDF and Office 2003 and earlier docs to XHTML and DocBook Lite.  You could then write your own transform to an Office 2007 format.  That's where the Office Toolkit for Word may be useful.

But note, the default conversion option does not work for Office 2007/2010.  Those formats are worked with in their native XML formats. There's currently no conversion option to generate XHTML or DocBook for either of these two formats.

Wrt 4:

Yes. http://docs.marklogic.com/4.2doc/docapp.xqy#display.xqy?fname=http://pubs/4.2doc/xml/cpf/default.xml&query=default+conversion

Hope this helps,
Pete


From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Tim Meagher
Sent: Thursday, August 09, 2012 6:01 AM
To: 'MarkLogic Developer Discussion'
Subject: [MarkLogic Dev General] Word Document Processing

Hi Folks,

I'm new to the idea of storing, converting, and extracting Microsoft Word documents in and from MarkLogic and I have a couple of questions:


1.       How does one go about storing a Microsoft Word 2007/2010 docx document in MarkLogic and then downloading it?  It seems to me that this is pretty straightforward, but I'm wondering if there are any catches.

2.       How do I do the same for Microsoft Word 97-2003 doc documents?

3.       I have reviewed the marklogic-document-support PDF for ML 5 which includes information about the Conversion option.  Do I understand correctly that with the Conversion option I should be able to load any Mac or Microsoft Word document into MarkLogic, convert it into a common XHTML format which can be parsed (and edited), and further convert it into a desired version (e.g., Microsoft Word 2007 docx) for download?

4.       Is the Conversion option also available for ML 4.2 and if so, where would I get the marklogic-document-support PDF for that?

Thanks for the help!

Tim Meagher
