[MarkLogic Dev General] RE: tail-recursion with xdmp:spawn
Kelly Stirman
Kelly.Stirman at marklogic.com
Wed Apr 7 06:50:43 PDT 2010
Hi Mike,
Yes, that's a good approach overall for one-off processing. It doesn't provide the robustness of CPF, but it can be easier to set up. It allows for a multi-threaded approach to processing your documents by configuring the number of threads on the task server.
Your termination condition could use properties. When you have completed an update on a document, add a property flag. Then your processing can look for documents that do not have the property in place.
/foo[not(property::bar)] is fast.
/foo[not(property::bar = "baz")] is also fast.
cts:properties-query() is another option.
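As a rough sketch of the property-flag idea (not tested against any particular release): the /foo root element and the bar property come from the expressions above, while the batch size and the "done" value are just assumptions for illustration.

```xquery
xquery version "1.0-ml";

(: Assumed for illustration: a batch size of 100 and a <bar> property
   whose mere presence marks a document as already processed. :)
declare variable $BATCH-SIZE := 100;

(: Select only documents that do not carry the flag yet. :)
for $doc in (/foo[not(property::bar)])[1 to $BATCH-SIZE]
let $uri := xdmp:node-uri($doc)
return (
  (: ... apply the real update to $doc here ... :)

  (: Flag the document so that later runs skip it. :)
  xdmp:document-set-property($uri, <bar>done</bar>)
)
```

Because the flag is written in the same transaction as the update, a document is either both updated and flagged or neither, which keeps the "not yet processed" test reliable.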
Hope this helps.
Also, I think there is new functionality coming in 4.2 that you will appreciate. :-) Hope to see you at the user conference.
Kelly
Date: Wed, 07 Apr 2010 09:26:06 -0400
From: Mike Sokolov <sokolov at ifactory.com>
Subject: [MarkLogic Dev General] tail-recursion with xdmp:spawn
To: General Mark Logic Developer Discussion
<general at developer.marklogic.com>
Message-ID: <4BBC87EE.9060305 at ifactory.com>
Perhaps this won't be news to others on the list, but I was so excited
to finally stumble on a solution to a problem I have been struggling
with for years, that I just had to share.
The problem: how to process a large number of documents using only XQuery?
This can't be done easily: if all the work is done in a single
transaction, it eventually runs out of time and space, and XQuery
modules don't provide an obvious mechanism for flow control across
multiple transactions.
In the past I've done this by writing an "outer loop" in Java, and more
recently I tried using CPF. The problem with Java is that it's
cumbersome to set up and requires some configuration to link it to a
database. I had some success with CPF, but I found it to be somewhat
inflexible since it requires a database insert or update to trigger
processing. It also requires a bit of configuration to get going.
Often I find I just want to run through a set of existing documents and
patch them up in some way or another (usually to clean up some earlier
mistake!).
Finally I hit on the solution: I wrote a simple script that fetches a
batch of documents to be updated and processes the updates. Then, in a
new statement after a ";" (which starts a separate transaction), it logs
some indication of progress and re-spawns the same script if there is
more work to be done. Presto - an iterative processor. This technique
can run away into an infinite loop if you're not careful about the
termination condition, but it has many advantages over the other
methods.
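For anyone who wants to try it, here is a minimal sketch of what such a script might look like. It is not my exact code: the module path, the batch size, and the "not yet processed" test (a missing bar property on /foo documents) are all placeholders chosen for illustration.

```xquery
xquery version "1.0-ml";

(: Statement 1: update one batch of documents in its own transaction.
   The batch size and the selection test are placeholders. :)
declare variable $BATCH-SIZE := 100;

for $doc in (/foo[not(property::bar)])[1 to $BATCH-SIZE]
return (
  (: ... apply the real update to $doc here ... :)

  (: Flag the document; without this the loop never terminates. :)
  xdmp:document-set-property(xdmp:node-uri($doc), <bar>done</bar>)
)
;

xquery version "1.0-ml";

(: Statement 2: runs after statement 1 has committed, so it sees the
   flags set above. The module path is hypothetical and must point at
   this same script under the task server's modules root. :)
declare variable $MODULE := "/batch/patch-docs.xqy";

if (exists((/foo[not(property::bar)])[1])) then (
  xdmp:log("patch-docs: batch finished, more work remains, re-spawning"),
  xdmp:spawn($MODULE)
)
else
  xdmp:log("patch-docs: nothing left to process")
```

The important detail is that the check and the xdmp:spawn call sit after the ";", so each spawned task commits its batch before deciding whether to queue another one.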
What do you think?
Michael Sokolov
Engineering Director
www.ifactory.com
@iFactoryBoston
PubFactory: the revolutionary e-publishing platform from iFactory