[Corona] PUTs of binary data

Colleen Whitney Colleen.Whitney at marklogic.com
Wed Dec 7 15:33:38 PST 2011


Fantastic, I think either way works.

________________________________________
From: Scott Conroy [conroys at avalonconsult.com]
Sent: Wednesday, December 07, 2011 1:02 PM
To: Colleen Whitney
Cc: Ryan Grimm; Corona Email List; sissi.malek.ctr
Subject: Re: [Corona] PUTs of binary data

Having control over what happens when a doc is created or deleted
seems pretty necessary, but using the existing triggering mechanisms
or CPF seems reasonable.  It would be an interesting enhancement to be
able to control the transform (similar to the applyTransform for the
Get functionality).  I guess it would have to be a "suggested"
transform since the client shouldn't really be able to tell the server
what to do...

I do have some thoughts on some further enhancements.  Should I use
the mailing list or open some issues?

On Wed, Dec 7, 2011 at 3:41 PM, Colleen Whitney
<Colleen.Whitney at marklogic.com> wrote:
> Scott, would it be better if metadata extraction was optional, rather than automatic?
>
> Or would it be useful to reference a "plugin" (user-defined stylesheet or code) for input processing to override the default behavior?
>
> Just thinking aloud.
>
> --Colleen
> ________________________________________
> From: corona-bounces at developer.marklogic.com [corona-bounces at developer.marklogic.com] On Behalf Of Scott Conroy [conroys at avalonconsult.com]
> Sent: Wednesday, December 07, 2011 12:37 PM
> To: Ryan Grimm
> Cc: Corona Email List; sissi.malek.ctr
> Subject: Re: [Corona] PUTs of binary data
>
> I think I'm using Thursday's code, but will verify.  I know there have
> been a couple commits since then...
>
> The PDF I'm using...is the MarkLogic install.pdf file.  It just
> happened to be close by.
>
> One piece of metadata enhancement I'm doing is changing from the "meta
> name content" to friendlier elements.  <contentType>MS
> Word</contentType> and such.  The other metadata enhancement is really
> app specific - I'm looking for regex patterns in the content (e.g.
> email addresses) and toggling some metadata if there are matches.
> Nothing too earth-shattering, but not really something applicable to
> Corona.  I'm certainly up for relying on the transformation
> functionality built into Corona, but will have to follow up with some
> triggered functionality.  I'll see if I can adjust my trigger and
> avoid my problem.
>
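
A minimal sketch of the app-specific enhancement described above: scan the extracted text for a pattern such as an email address and toggle a metadata flag on a match. This is Python for illustration only, not the trigger code from the thread; the function name `enhance_metadata`, the `containsEmail` key, and the deliberately simple regex are all assumptions.

```python
import re

# Deliberately simple pattern (an assumption); real email matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def enhance_metadata(text, metadata):
    """Return a copy of metadata with a hypothetical flag that is True
    when the extracted text contains something that looks like an email."""
    enhanced = dict(metadata)  # leave the caller's dict untouched
    enhanced["containsEmail"] = EMAIL_RE.search(text) is not None
    return enhanced


print(enhance_metadata("Contact admin@example.com", {"contentType": "MS Word"}))
```
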
> On Wed, Dec 7, 2011 at 3:20 PM, Ryan Grimm <grimm at xqdev.com> wrote:
>> Hi Scott,
>>
>> Corona might actually be doing what you need already.  When you insert a binary document using Corona, that document is run through xdmp:document-filter (if available) and by default whatever metadata and content are present in the document are extracted.  The extracted content can be searched by using the wordInBinary structured query constructor (https://github.com/marklogic/Corona/wiki/wordInBinary-%28Structured-Query%29).
>>
>> Would it be possible for you to email me the pdf file that you're testing against so I can try to reproduce the behavior you're seeing?  Also, have you pulled the latest version of Corona from GitHub in the last few days?
>>
>> I'm also curious what other metadata enhancements you plan on adding to the process.  We have thoughts on ways to enhance documents on insert but some real world use cases would be very helpful.
>>
>> Thanks.
>>
>> --Ryan
>>
>>
>> On Dec 7, 2011, at 11:24 AM, Scott Conroy wrote:
>>
>>> Well, I'm actually creating a new file as part of my
>>> xdmp:document-filter.  xxx.pdf ends up as xxx.xhtml.  I'm not married
>>> to that solution at all, since I don't really need the binary file in
>>> MarkLogic at this point.  I'm just looking to make the binary files
>>> searchable.  After the document-filter, I need to do a bit more
>>> metadata enhancement.  I can try with the original pdf or docx
>>> extension and will report back.
>>>
>>> On Wed, Dec 7, 2011 at 2:11 PM, Geert Josten <geert.josten at dayon.nl> wrote:
>>>> Hi Scott,
>>>>
>>>> The URL you describe contains a uri request parameter ending in .xml. Is that
>>>> intentional?
>>>>
>>>> Kind regards,
>>>> Geert
>>>>
>>>> -----Original Message-----
>>>> From: corona-bounces at developer.marklogic.com
>>>> [mailto:corona-bounces at developer.marklogic.com] On Behalf Of Scott Conroy
>>>> Sent: Wednesday, December 7, 2011 19:28
>>>> To: Corona Email List; sissi.malek.ctr
>>>> Subject: [Corona] PUTs of binary data
>>>>
>>>> I'm able to insert documents using Corona:
>>>>
>>>> curl --upload-file install.pdf "http://admin:admin@localhost:9004/store?uri=/foo/bar.xml&contentType=binary&contentType=binary"
>>>>
>>>> The binary file makes it into MarkLogic fine.
>>>>
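
A minimal Python sketch of how the /store URL from the curl command above might be assembled, with the uri parameter given an extension that matches the uploaded file and contentType passed only once. The helper name `build_store_url` is hypothetical; only the host, port, /store path, and the two parameter names come from the original command. The thread does not establish whether the .xml extension contributed to the empty conversion, so treat the .pdf URI as something to try rather than a confirmed fix.

```python
from urllib.parse import urlencode


def build_store_url(base, uri, content_type="binary"):
    """Hypothetical helper: assemble a Corona /store URL like the one
    used in the curl command, with each parameter passed exactly once."""
    # safe="/" keeps the slashes in the document URI readable instead of %2F.
    query = urlencode({"uri": uri, "contentType": content_type}, safe="/")
    return f"{base}/store?{query}"


print(build_store_url("http://localhost:9004", "/foo/bar.pdf"))
```
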
>>>> However, I'm attempting to convert the document on load using a
>>>> trigger and xdmp:document-filter.  At this point, the conversion
>>>> happens but I end up with a content type of "application/octet-stream"
>>>> instead of PDF and the body of the converted file is empty
>>>> (<body><p></p></body>).
>>>>
>>>> By the way, this also happens if I copy a file into MarkLogic via
>>>> WebDAV.  The triggered conversion works fine if I just load a PDF via
>>>> Query Console.  Any suggestions on this?
>>>>
>>>> At a later point, I'll be doing more than just converting to xhtml, so
>>>> I need to keep this pattern of triggering when binary content is added
>>>> via Corona.  Unless you can suggest an alternate/better way?
>>>> _______________________________________________
>>>> Corona mailing list
>>>> Corona at developer.marklogic.com
>>>> http://developer.marklogic.com/mailman/listinfo/corona
>>

