[MarkLogic Dev General] Experience loading PDFs into ML
alan.darnell at utoronto.ca
Thu May 28 17:18:04 PDT 2009
I'm wondering if anyone has experience they could share on loading PDFs into ML and indexing them for text retrieval while leaving the PDFs in the database for users to download.
Do you use the CPF to extract text from the PDF and store that as a new text document in ML?
If so, how do you link the PDF to the text document? Through a common URL scheme?
Do you extract XMP-encoded metadata from the PDFs and use that to populate properties or create a new XML document associated with the PDF?
It would be great to display snippets from the PDF based on the pages that match the user query (like Google Book Search does). Is there a way to extract text from the PDF that retains its page and position information so you can go back to the PDF to generate a snippet image?
Does maintaining the PDFs in the database have a negative impact on index sizes or performance?
Thanks in advance,