[MarkLogic Dev General] Converting MS Office documents

Pete Aven Pete.Aven at marklogic.com
Fri Mar 27 03:50:29 PDT 2015


That should work.  I just tried on 8.0-1.1 on Windows and got the expected results.

If you're using CPF, confirm that you have the following pipelines enabled:

Status Change Handling
Office OpenXML Extract

For Office 2007 and later (docs ending with a .docx, .pptx, or .xlsx extension) the file format is a zip package of XML parts, so once you've extracted the contents using the Office OpenXML Extract pipeline you can work with the native OpenXML format directly.

Once inserted, the original doc will be saved in MarkLogic as:
/myDoc/UtilizationReport_xlsx              //the original doc

Once the original doc has been processed by Office OpenXML Extract, you should see the extracted parts in MarkLogic as well:
/myDoc/UtilizationReport_xlsx_parts   //with a bunch of .xml here in SpreadsheetML format
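
The naming pattern above can be sketched as a small helper. This is only an illustration: partsPrefixFor is a hypothetical name, not a MarkLogic API, and it assumes the convention shown here (the final "." in the URI becomes "_", then "_parts" is appended).

```javascript
// Hypothetical helper, not part of MarkLogic: derives the URI prefix under
// which the extracted parts are expected, assuming the naming shown above
// (the last "." in the URI becomes "_", then "_parts" is appended).
function partsPrefixFor(uri) {
  const dot = uri.lastIndexOf(".");
  const slash = uri.lastIndexOf("/");
  // No extension in the final path segment: just append the suffix.
  if (dot <= slash) {
    return uri + "_parts";
  }
  return uri.slice(0, dot) + "_" + uri.slice(dot + 1) + "_parts";
}

// URI taken from the original question.
const prefix = partsPrefixFor("/myDoc/UtilizationReport.xlsx");
```

In Query Console the resulting prefix could then be used to look for the part documents, for example with a lookup along the lines of cts.uriMatch(prefix + "/*"), assuming the URI lexicon is enabled on the database.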

The cpf state on the .xlsx will be:  http://marklogic.com/states/extracted
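
A rough sketch of that check, written as plain JavaScript so it can be tried anywhere. checkExtractState is a hypothetical helper name; the state value passed in would come from the cpf:state element of the document's properties.

```javascript
// State expected on the .xlsx after Office OpenXML Extract has processed it.
const EXTRACTED_STATE = "http://marklogic.com/states/extracted";

// Hypothetical helper: compares a document's cpf:state value with the
// expected post-extraction state and returns a short hint.
function checkExtractState(state) {
  if (state === EXTRACTED_STATE) {
    return "extracted: look for the _parts documents";
  }
  return "not extracted (state is " + state + "): check which pipelines are enabled";
}

// The state shown in the properties from the original question.
const hint = checkExtractState("http://marklogic.com/states/converted");
```

With the state from the original question this takes the "not extracted" branch, which is consistent with a pipeline other than Office OpenXML Extract having handled the document.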

If you already have those two pipelines enabled, you may want to disable the others and see whether you get the expected results, to ensure that no pipelines are conflicting with each other in their attempts to process the document.

Hope this helps,
Pete



From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Javier Lizarraga
Sent: Thursday, March 26, 2015 7:51 PM
To: General at developer.marklogic.com
Subject: [MarkLogic Dev General] Converting MS Office documents

Hello Developers,

I want to load an MS Excel file with filename.xlsx into a MarkLogic database (using ML8).  I want to be able to access the contents of the MS Excel document.
I enabled the triggers for the database and installed and enabled Content Processing.  I followed the ML document below:
http://docs.marklogic.com/guide/cpf/default

Loaded:
declareUpdate();
xdmp.documentLoad("C:\\Users\\jlizarraga\\Documents\\UtilizationReport.xlsx",
    {
      "uri" : "/myDoc/UtilizationReport.xlsx",
      "permissions" : xdmp.defaultPermissions()
    })

When I load my UtilizationReport.xlsx file I can see the associated properties in Query Console:
<?xml version="1.0" encoding="UTF-8"?>
<prop:properties xmlns:prop="http://marklogic.com/xdmp/property">
  <cpf:processing-status xmlns:cpf="http://marklogic.com/cpf">done</cpf:processing-status>
  <cpf:property-hash xmlns:cpf="http://marklogic.com/cpf">d41d8cd98f00b204e9800998ecf8427e</cpf:property-hash>
  <cpf:last-updated xmlns:cpf="http://marklogic.com/cpf">2015-03-26T16:24:16-07:00</cpf:last-updated>
  <cpf:state xmlns:cpf="http://marklogic.com/cpf">http://marklogic.com/states/converted</cpf:state>
  <cpf:self xmlns:cpf="http://marklogic.com/cpf">/myDoc/UtilizationReport.xlsx</cpf:self>
</prop:properties>

It appears to me that it was successful but I do not see any other associated documents besides the UtilizationReport.xlsx file reference.

I was expecting to see:
UtilizationReport.xlsx  (Original Document)
UtilizationReport_xlsx.xml
UtilizationReport_xlsx.xhtml
A Directory called UtilizationReport_xlsx_Parts

I don't see any errors.  Any help would be greatly appreciated.

Thanks,

Javier