[MarkLogic Dev General] MarkLogic PDF content handling
Colleen Whitney
Colleen.Whitney at marklogic.com
Tue Jan 13 10:48:29 PST 2009
On Jan 13, 2009, at 10:35 AM, "Michael Blakeley" <michael.blakeley at marklogic.com> wrote:
> Sundeep,
>
> The error code XDMP-DOCUTF8SEQ suggests that MarkLogic Server sees
> the pdf document as text or XML, rather than binary. There are
> several ways to fix this, but in XCC I would specify that the
> content is binary.
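
To illustrate why treating a PDF as text fails, here is a minimal sketch using only the Java standard library. It is not MarkLogic code and does not claim to reproduce the server's own validation; it only shows that typical PDF bytes are not valid UTF-8, while ISO-8859-1 accepts any byte (so switching the encoding moves the failure elsewhere rather than fixing it). The class name, the helper names, and the sample header bytes are all made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class PdfBytesDemo {

    // Hypothetical helper: returns true only if every byte decodes
    // cleanly under the given charset (no replacement, no skipping).
    public static boolean decodesStrictly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Made-up sample resembling the start of a PDF file: a version
    // line, then a comment line holding four bytes above 0x7F.
    public static byte[] samplePdfHeader() {
        return new byte[] {
            '%', 'P', 'D', 'F', '-', '1', '.', '4', '\n',
            '%', (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, '\n'
        };
    }

    public static void main(String[] args) {
        byte[] header = samplePdfHeader();
        // 0xE2 starts a multi-byte UTF-8 sequence, but 0xE3 is not a
        // continuation byte, so strict UTF-8 decoding is rejected.
        System.out.println("UTF-8: "
            + decodesStrictly(header, StandardCharsets.UTF_8));
        // ISO-8859-1 maps every byte value to a character, so decoding
        // succeeds, but the result is still not text or XML.
        System.out.println("ISO-8859-1: "
            + decodesStrictly(header, StandardCharsets.ISO_8859_1));
    }
}
```

Run as written, the UTF-8 check reports false and the ISO-8859-1 check reports true. That pattern is consistent with the two errors described in the original question, which is why declaring the content as binary is the cleaner fix.
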
>
> The XCC "overview" section at http://developer.marklogic.com/pubs/4.0/javadoc/index.html
> includes sample code to insert content. In this API, the preferred
> way to build a ContentCreateOptions object representing a binary
> load is:
>
> ContentCreateOptions options =
> ContentCreateOptions.newBinaryInstance();
>
> While the above is the preferred technique, you could also use the
> ContentCreateOptions() constructor and then call setFormatBinary()
> or setFormat(DocumentFormat.BINARY) on the resulting object.
>
> I hope that helps. I believe it's best to discuss one question at a
> time, so I'm only going to comment on your pdf ingestion issue in
> this email.
>
> -- Mike
>
> On 2009-01-13 01:38, Sundeep_Raikhelkar wrote:
>> Hi,
>> I am evaluating MarkLogic for its content processing capabilities.
>> I have chosen a simple use case for the evaluation: PDF upload, PDF
>> search, and PDF generation.
>>
>> 1. PDF load: This works fine when the PDF is loaded in binary
>> format, but with content processing turned on, I am not able to
>> upload any PDF. The error I get is "XDMP-DOCUTF8SEQ: Invalid UTF-8
>> escape sequence at /cpf/pdf/xcc.pdf". I tried uploading with the XCC
>> API, XDMP load, and WebDAV; all three give the same error. When I
>> set the encoding to ISO-8859-1 for the XCC API and XDMP load, I
>> instead get the error "XDMP-STARTTAGCHAR: Unexpected character "<"
>> in start tag at /cpf/pdf/xcc.pdf line 2". I have also tried setting
>> the repair level.
>>
>> File file = new File("E:\\marklogicTech\\xcc.pdf");
>> ContentCreateOptions cco = new ContentCreateOptions();
>> cco.setEncoding("ISO-8859-1");
>> cco.setRepairLevel(DocumentRepairLevel.FULL);
>> String uriUpload = "/cpf/pdf/xcc.pdf";
>> Content content = ContentFactory.newContent(uriUpload, file, cco);
>> session.insertContent(content);
>>
>> I have also tried uploading MS Word and MS Excel documents. They
>> upload fine, and the corresponding XHTML and XML files are
>> generated. Can you please tell me whether the problem has anything
>> to do with the encoding of xcc.pdf (the file I am uploading) or
>> with my MarkLogic database server settings?