[MarkLogic Dev General] CORB: Sleep during configurable hours and process 1 forest at a time
Hartwig, Brent (CL Tech Sv)
Brent.Hartwig at cengage.com
Wed Sep 24 04:36:41 PDT 2008
Hi, Ian,
Quite a lot to chew on - thank you! I understand your process is able to run continuously yet, to keep the site running smoothly, the process takes short breaks and imposes a size limit on the merges. That size limit requires you to initiate the residual merge during a low usage period.
Do you believe the size of the documents being updated would impact this approach? Our files can be quite large. It is common for folders to include multiple 10 - 20 MB files. We do have files approaching the 300 MB limit.
-Brent
________________________________
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Ian Small
Sent: Tuesday, September 23, 2008 2:52 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] CORB: Sleep during configurable hours and process 1 forest at a time
hi -
while we don't use corb to do it, we do in fact do large-scale in-place modifications of the markmail.org production content set. we take a similar approach to yours:
- only work on one forest (in our case, per D node) at once
- we manage the concurrency of the work to make sure there are lots of cores available for user queries
- we pause in between small bursts of reprocessing
- we manage monster merges manually so that they happen during our low usage time (we have global users, so this is between about 6pm and 2 am pacific)
we do all this because, like you, we are working around live load on the server and need to maintain response time while all this is going on
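The "small bursts with pauses in between" idea can be sketched roughly as follows. This is only an illustration, not Ian's actual tooling: the function names, the burst size, and the pause length are all assumptions, and `update` stands in for whatever call actually rewrites one document.

```python
import time
from typing import Callable, Iterator, List, Sequence


def bursts(items: Sequence[str], burst_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most burst_size items."""
    if burst_size < 1:
        raise ValueError("burst_size must be at least 1")
    for start in range(0, len(items), burst_size):
        yield list(items[start:start + burst_size])


def reprocess_in_bursts(
    uris: Sequence[str],
    update: Callable[[str], None],   # hypothetical: rewrites one document
    burst_size: int = 50,            # assumed value, tune by experiment
    pause_seconds: float = 2.0,      # assumed value, tune by experiment
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Update every URI, pausing briefly after each burst except the last."""
    done = 0
    for batch in bursts(uris, burst_size):
        for uri in batch:
            update(uri)
            done += 1
        if done < len(uris):
            # Pause between bursts (not between single updates) so user
            # queries get a turn while the update stream stays steady.
            sleep(pause_seconds)
    return done
```

Pausing per burst rather than per document is a deliberate choice here, since the thread notes that pausing between every operation can backfire if the pause is long enough. The right values can only be found by experiment on the actual cluster.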
some things to keep note of:
- pausing between every operation can backfire - because if the pause is long enough, it can "trick" the server into thinking that there are no more updates coming, which can cause an in-memory stand to be flushed out to disk. the result of this is that a bunch of really small in-memory stands can get shot out to disk, requiring more merges - although those merges will be incredibly fast and incredibly lightweight. so we tend to keep our pauses short enough to make sure we give other processing some time to get through. so you may want to experiment on this front a little bit.
- we NEVER turn off merges - this is essentially playing russian roulette, and committing to pull the trigger 12 times while waiting to see what happens. what we do is limit the large merges (where large is compared to our forest size). in our case, with 200 GB forests, we might stick our limit at 75-100 GB, for instance. that generally leaves us with a forest with 2-4 stands in it, which we can then merge manually in low times.
- we start the manual "all done" merges using the forest admin pages
in general, we take this approach because we plan our reprocessing sufficiently in advance that we can have it take days, sometimes even 10 days. we haven't had to be in a crash program to have to rework the content set so i can't share any real-world experience there.
ian
________________________________
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Hartwig, Brent (CL Tech Sv)
Sent: Tuesday, September 23, 2008 8:03 AM
To: general at developer.marklogic.com
Subject: [MarkLogic Dev General] CORB: Sleep during configurable hours and process 1 forest at a time
Hello,
Has anyone extended Corb to sleep during configurable periods or process one forest at a time?
We need to modify every object in our ML instance. Multiple merges are saturating the IO channel. To keep production stable and usable, we intend to put the job to sleep during peak hours and only process one forest at a time. Each processed URI will go into a collection, allowing us to verify all are processed. Preliminary approaches are described below. Your thoughts and experience are welcome. Thank you in advance.
Sleep: Nothing too concerning here (but tried & true is always better). We're planning to work around backups, peak hours and allow time for system resources to recover before peak hours resume.
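A minimal sketch of the sleep check, assuming a single configurable quiet window given as start and end times of day. The function names and the example hours are illustrative only and are not part of Corb. The sketch uses naive local datetimes, so time zones and daylight-saving shifts are not handled.

```python
from datetime import datetime, time, timedelta


def in_window(now: time, start: time, end: time) -> bool:
    """True if now falls in [start, end); the window may wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def seconds_until_window_ends(now: datetime, start: time, end: time) -> float:
    """Return 0.0 outside the window, else the seconds left until it ends."""
    if not in_window(now.time(), start, end):
        return 0.0
    resume = datetime.combine(now.date(), end)
    if resume <= now:
        # The window wraps past midnight, so it ends tomorrow.
        resume += timedelta(days=1)
    return (resume - now).total_seconds()
```

A job loop could call `seconds_until_window_ends` before each batch and sleep for the returned number of seconds when it is non-zero. Backups and a recovery margin before peak hours could be modelled as further windows checked the same way.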
Forest: Corb can obtain a list of forests from the specified database via Session.getContentbaseMetaData().getForestIds() and iterate in serial. The queue would be populated once per forest by substituting the forest ID within the provided URIS-MODULE. The initial implementation may impose some usage constraints.
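The serial, per-forest control flow might look roughly like the sketch below. It is written in Python purely to show the loop and is not Corb or XCC code. The `{FOREST_ID}` placeholder, `fetch_uris`, and `process_uri` are hypothetical stand-ins for the URIS-MODULE substitution, the query that fills the queue, and the per-document transform.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical token that the URIs query template is assumed to contain.
FOREST_PLACEHOLDER = "{FOREST_ID}"


def uris_query_for_forest(template: str, forest_id: int) -> str:
    """Substitute one forest ID into the URIs query template."""
    if FOREST_PLACEHOLDER not in template:
        raise ValueError("template has no forest placeholder")
    return template.replace(FOREST_PLACEHOLDER, str(forest_id))


def process_forests_serially(
    forest_ids: Iterable[int],
    template: str,
    fetch_uris: Callable[[str], List[str]],   # hypothetical: runs the query
    process_uri: Callable[[str], None],       # hypothetical: transforms a doc
) -> Dict[int, int]:
    """Populate and drain the queue once per forest, one forest at a time."""
    counts: Dict[int, int] = {}
    for forest_id in forest_ids:
        query = uris_query_for_forest(template, forest_id)
        uris = fetch_uris(query)
        for uri in uris:
            process_uri(uri)
        # Record how many URIs each forest contributed, for later checks.
        counts[forest_id] = len(uris)
    return counts
```

The per-forest counts returned here could be compared against the size of the "processed" collection to confirm that nothing was skipped.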
-Brent