[MarkLogic Dev General] Using Information Studio to split uploaded files..
Geert Josten
geert.josten at dayon.nl
Fri Jan 13 10:10:51 PST 2012
David, Justin,
I have various cases in which I'd like to split. Most come down to what
you suggest: chunking an aggregated file. But I also remember a case where
I split a hierarchical taxonomy of some sort.
In my current case I have files of up to roughly 25 MB that are
aggregates of tweets. My own import script applies an XSLT to convert the
format, then chunks the result with a simple FLWOR into batches of 1000
docs, which I spawn. In the spawned task the tweets get enriched and
inserted.
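
[Illustration] The batching step described above can be sketched outside
MarkLogic as follows. This is not the actual import script (which uses an
XQuery FLWOR and spawned tasks); it is a minimal Python sketch of the same
idea, and the function name `batches` and its `size` parameter are
assumptions made for the example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int = 1000) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        # Pull the next `size` items; an empty list means the input is done.
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Each yielded batch would then be handed to whatever does the enrichment
and insertion, one task per batch.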
I could do it with xmlsh, and I will certainly remember its streaming
capabilities (very nice!), but I was kinda experimenting with how far I
can get with just MarkLogic Server itself. It is not unusual for customers
to have data in aggregated form, and being able to split it with
Information Studio and generate a nice search app with Application
Builder, with only very limited coding, would be very convincing. (It
already impressed my colleagues quite a bit without the splitting.. ;)
Kind regards,
Geert
PS: @Justin, thnx for the offlist suggestion, I'll look into it..
-----Original Message-----
From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of David Lee
Sent: Friday, January 13, 2012 18:24
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Using Information Studio to split
uploaded files..
You might consider using xmlsh as a front end for data loading.
(www.xmlsh.org)
Depending on your needs, it can work very well.
For example, the "xsplit" command (http://www.xmlsh.org/CommandXsplit) is
fully streaming, so it can take a GB-sized input XML document and split it
into pieces that MarkLogic likes. Combined with the MarkLogic extension,
these files can then be efficiently pushed to the DB in bulk, or perhaps
Information Studio can then load them. This all happens in the same
process, so millions of documents can be handled without the overhead of
process invocation.
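
[Illustration] To make the idea of a streaming split concrete, here is a
minimal Python sketch using the standard library's incremental XML parser.
It is not xsplit and does not reflect how xsplit is implemented; the
function name `split_records` and the `record_tag` parameter are
assumptions, and the sketch assumes the records are direct children of the
root element.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def split_records(source: BinaryIO, record_tag: str) -> Iterator[str]:
    """Stream `source` and yield each <record_tag> element as its own string."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            elem.tail = None  # do not carry inter-record whitespace along
            yield ET.tostring(elem, encoding="unicode")
            # Drop already-processed children so memory use stays bounded.
            root.clear()
```

Because the input is read incrementally and finished records are
discarded, memory use depends on the size of a single record rather than
on the size of the whole file.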
I have not experimented with how this would work with Information Studio
... that's something I want to work with in the future.
--------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
dlee at marklogic.com
Phone: +1 650-287-2531
Cell: +1 812-630-7622
www.marklogic.com
This e-mail and any accompanying attachments are confidential. The
information is intended solely for the use of the individual to whom it is
addressed. Any review, disclosure, copying, distribution, or use of this
e-mail communication by others is strictly prohibited. If you are not the
intended recipient, please notify us immediately by returning this message
to the sender and delete all copies. Thank you for your cooperation.
-----Original Message-----
From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Justin
Makeig
Sent: Friday, January 13, 2012 11:26 AM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Using Information Studio to split
uploaded files..
Geert,
Information Studio is currently designed for single document in, single
document out transformations. Your best bet for splitting a document today
is to do this as part of the collection step.
Can you tell me a little more about the data you'd like to split? Is it
aggregated XML that you're splitting on an XPath-like match expression?
Text separated by line breaks? Something else? I'm interested in figuring
out if and how we might make splitting easier and better integrated into
the product.
Justin
Justin Makeig
Senior Product Manager
MarkLogic Corporation
justin.makeig at marklogic.com
Phone: +1 650 655 2387
www.marklogic.com
On Jan 13, 2012, at 6:35 AM, Geert Josten wrote:
> Hi,
>
> Is Information Studio intended to allow splitting of uploaded files? If
> so, what is the best way of handling that?
>
> I was experimenting with a custom XSLT, and a simple xsl:result-document,
> but that is giving funny results. Mostly
> http://marklogic.com/states/appservices/distribute-error messages in the
> errorlog, not sure what they exactly mean, but I can imagine it is because
> CPF handling is 'violated' or something..
>
> Any suggestions?
>
> Kind regards,
> Geert
>
> drs. G.P.H. (Geert) Josten
> Senior Developer
>
>
>
> Dayon B.V.
> Delftechpark 37b
> 2628 XJ Delft
>
> T +31 (0)88 26 82 570
>
> geert.josten at dayon.nl
> www.dayon.nl
>
> The information sent in or with this e-mail message originates from
> Dayon BV and is intended solely for the addressee. If you have received
> this message unintentionally, we ask you to delete it. No rights can be
> derived from this message.
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general
_______________________________________________
General mailing list
General at developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general