[MarkLogic Dev General] Experience loading PDFs into ML
alan.darnell at utoronto.ca
Fri May 29 19:55:54 PDT 2009
Thanks Tony. This gives me some ideas for handling our content.
On 5/29/09 11:49 AM, "Apuzzo, Tony" <Tony.Apuzzo at flatironssolutions.com> wrote:
We haven't done the exact process described by the OP. What we're doing to load ~4 million pages of PDF is:
* Use the iText library to split incoming PDFs into separate pages
* Store the split PDF pages on an external web server (We use WebLogic with a REST front-end, but a static HTTP server would work too.)
* Deliver the PDF files into MarkLogic CPF to do text conversion for full-text search.
* Create a top-level "DocBook-like" XML asset that contains the converted text plus URL references to the page-split PDFs.
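The linking step above can be sketched roughly as follows. This is an illustrative Python sketch, not the poster's actual code (they use iText and WebLogic); the server URL, element names, and per-page URL scheme are all assumptions:

```python
# Illustrative sketch: build a top-level "DocBook-like" XML asset that pairs
# per-page extracted text with URLs of the page-split PDFs stored on an
# external web server. Names and layout here are hypothetical.
import xml.etree.ElementTree as ET

PDF_SERVER = "http://pdf-store.example.com"  # hypothetical external server

def build_asset(doc_id, page_texts):
    """doc_id: source PDF identifier; page_texts: extracted text, one per page."""
    root = ET.Element("document", id=doc_id)
    for n, text in enumerate(page_texts, start=1):
        # Each split page lives on the web server under a predictable URL,
        # so a search hit on the text can link straight to the right page.
        page = ET.SubElement(
            root, "page",
            number=str(n),
            url=f"{PDF_SERVER}/{doc_id}/page-{n}.pdf",
        )
        page.text = text
    return root

asset = build_asset("report-1234", ["first page words", "second page words"])
print(ET.tostring(asset, encoding="unicode"))
```

Because the page URL is derivable from the document id and page number, the XML asset never needs updating when pages are re-served from a different static host; only `PDF_SERVER` changes.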
Our input PDFs do not have any embedded metadata, so we aren't trying to extract anything from them, but we could use iText to extract the PDF document properties if we needed to.
The performance is very good using this scheme and we don't have to worry about using ML for BLOBs. We don't have the requirement to do snippet highlighting, but we do get to the correct page(s) very easily.
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Mary Holstege
Sent: Friday, May 29, 2009 9:16 AM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Experience loading PDFs into ML
On Thu, 28 May 2009 17:18:04 -0700, Alan Darnell
<alan.darnell at utoronto.ca> wrote:
> I'm wondering if anyone has experience they could share on loading PDFs
> into ML and indexing these for text retrieval while leaving the PDF in
> the database for users to download.
> Do you use the CPF to extract text from the PDF and store that as a new
> text document in ML?
> If so, how do you link up the PDF and the text document - a common URL?
> Do you extract XMP encoded metadata from the PDFs and use that to
> populate properties or create a new XML document associated with the PDF?
> It would be great to display snippets from the PDF based on the pages
> that match the user query (like Google Book Search does). Is there a
> way to extract text from the PDF that retains its page and position
> information so you can go back to the PDF to generate a snippet image?
> Does maintaining the PDFs in the database have a negative impact on
> index sizes or performance?
> Thanks in advance,
The default CPF PDF conversion will create a new XHTML version of
the PDF. If you just want the extracted text for searching and not for
rendering, one of the alternative pipelines just extracts the text of each
page and sticks it as a bag of words in a "page" element. Some metadata
is extracted in each case as well. Properties on the documents
connect the source and the conversion products.
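The "bag of words in a page element" shape Mary describes can be illustrated roughly like this. This is a Python sketch of the output structure only, not actual CPF pipeline code; the wrapper element name is an assumption:

```python
# Sketch of the per-page "bag of words" output shape: each page's extracted
# text is reduced to its words and stored inside a "page" element, which is
# enough for full-text search without preserving layout for rendering.
import re
import xml.etree.ElementTree as ET

def pages_to_bag_of_words(page_texts):
    root = ET.Element("extracted")  # hypothetical wrapper element
    for n, text in enumerate(page_texts, start=1):
        page = ET.SubElement(root, "page", number=str(n))
        # Keep only the words; punctuation and layout don't matter for search.
        page.text = " ".join(re.findall(r"\w+", text.lower()))
    return root

doc = pages_to_bag_of_words(["Hello, PDF world!", "Page two: more text."])
print(ET.tostring(doc, encoding="unicode"))
```

Keeping the page number on each element is what lets a query result point back to the correct page of the stored PDF, which is the linkage the original question asks about.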
General mailing list
General at developer.marklogic.com