[MarkLogic Dev General] MarkLogic PDF content handling

Colleen Whitney Colleen.Whitney at marklogic.com
Tue Jan 13 10:48:29 PST 2009



Sent from my iPhone

On Jan 13, 2009, at 10:35 AM, "Michael Blakeley" <michael.blakeley at marklogic.com> wrote:

> Sundeep,
>
> The error code XDMP-DOCUTF8SEQ suggests that MarkLogic Server sees  
> the pdf document as text or XML, rather than binary. There are  
> several ways to fix this, but in XCC I would specify that the  
> content is binary.
>
> The XCC "overview" section at
> http://developer.marklogic.com/pubs/4.0/javadoc/index.html
> includes sample code to insert content. In this API, the preferred
> way to build a ContentCreateOptions object representing a binary
> load is:
>
>  ContentCreateOptions options =
>    ContentCreateOptions.newBinaryInstance();
>
> While the above is the preferred technique, you could also use the
> ContentCreateOptions() constructor, then call cco.setFormatBinary()
> or cco.setFormat(DocumentFormat.BINARY).
>
> I hope that helps. I believe it's best to discuss one question at a  
> time, so I'm only going to comment on your pdf ingestion issue in  
> this email.
>
> -- Mike
>
> On 2009-01-13 01:38, Sundeep_Raikhelkar wrote:
>> Hi,
>> I am evaluating MarkLogic for its content processing capabilities.
>> I have chosen a simple use case for the evaluation: PDF upload, PDF
>> search, and PDF generation.
>>
>>  1.  PDF load: This works fine when the file is loaded in binary
>> format, but with content processing turned on, I am not able to
>> upload any PDF. The error I get is "XDMP-DOCUTF8SEQ: Invalid UTF-8
>> escape sequence at /cpf/pdf/xcc.pdf". I tried to upload using the
>> XCC API, XDMP load, and WebDAV. All three modes give the same error.
>> When I specified ISO-8859-1 as the encoding for the XCC API and
>> XDMP load, I got the error "XDMP-STARTTAGCHAR: Unexpected character
>> "<" in start tag at /cpf/pdf/xcc.pdf line 2". We have also tried
>> providing the repair level.
>>
>>     File file = new File("E:\\marklogicTech\\xcc.pdf");
>>     ContentCreateOptions cco = new ContentCreateOptions();
>>     cco.setEncoding("ISO-8859-1");
>>     cco.setRepairLevel(DocumentRepairLevel.FULL);
>>     String uriUpload = "/cpf/pdf/xcc.pdf";
>>     Content content = ContentFactory.newContent(uriUpload, file, cco);
>>     session.insertContent(content);
>>
>> I have tried uploading MS-Word and MS-Excel documents. They upload
>> fine, and the corresponding XHTML and XML files are generated. Can
>> you please tell me whether this has anything to do with the
>> encoding of xcc.pdf (the file I am uploading) or with my MarkLogic
>> database server settings?
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://xqzone.com/mailman/listinfo/general
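
The two errors quoted above are what one would expect if the PDF bytes are being decoded as text. The following is a minimal sketch, not taken from the thread and using only the Java standard library rather than XCC, that checks whether a byte sequence is well-formed UTF-8. The class name Utf8Check and the sample bytes (an ASCII PDF header followed by a line of high bytes, as PDFs commonly contain) are illustrative assumptions. A strict UTF-8 decoder rejects such bytes, while ISO-8859-1 maps every byte to a character, which would be consistent with the failure moving from XDMP-DOCUTF8SEQ to an XML parse error once the encoding was changed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    // Hypothetical sample: "%PDF-1.4", newline, then a comment line of
    // high bytes that does not form a valid UTF-8 sequence.
    public static final byte[] PDF_LIKE = {
        '%', 'P', 'D', 'F', '-', '1', '.', '4', '\n',
        '%', (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, '\n'
    };

    // Returns true only if the bytes decode as well-formed UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // With a path argument, check that file; otherwise use the sample.
        byte[] bytes = args.length > 0
            ? Files.readAllBytes(Paths.get(args[0]))
            : PDF_LIKE;
        System.out.println("valid UTF-8: " + isValidUtf8(bytes));
        // ISO-8859-1 never fails to decode: each byte becomes one char.
        String latin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 chars: " + latin1.length());
    }
}
```

Running it against the actual file (for example, java Utf8Check xcc.pdf) only shows whether the bytes could pass as UTF-8 text. It does not change how the server treats the document; for that, the binary format option described in the quoted reply is still what applies.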

