[MarkLogic Dev General] RE: Fragmentation planning
karl.erisman at gmail.com
Fri Dec 18 14:35:40 PST 2009
At this point, it's tough to be confident in estimating an upper bound
on document size because we only have restricted samples of data to
work from. However, the following figures are from the current sample I'm
working with:
sample size: 500
max doc size: ~1 MB
avg doc size: 208 KB
Regarding updates, we can adjust the rate, but it will eventually
depend on data currency requirements per client.
Of course, it's the average case that we're interested in when reasoning
about how overall activity affects performance.
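As a rough back-of-the-envelope check, the sample sizes above can be combined with the ballpark figure of about 1 MB/sec/CPU quoted below. This is only a sketch: it assumes every update rewrites the whole document, that 1 MB = 1024 KB, and that the quoted rate applies to our data. The function name is made up for illustration.

```python
# Rough update-throughput estimate; all inputs are approximations.
KB_PER_MB = 1024  # assumption: binary units


def docs_per_sec_per_cpu(doc_size_kb, throughput_mb_per_sec=1.0):
    """Full-document updates one CPU could process per second."""
    return throughput_mb_per_sec * KB_PER_MB / doc_size_kb


avg_rate = docs_per_sec_per_cpu(208)     # average sample doc (208 KB)
worst_rate = docs_per_sec_per_cpu(1024)  # largest sample doc (~1 MB)
print(f"avg: {avg_rate:.1f} docs/sec/CPU, worst: {worst_rate:.1f} docs/sec/CPU")
```

Under those assumptions, that works out to roughly five whole-document updates per second per CPU at the average size and about one per second at the largest size seen so far.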
I appreciate the feedback. I'm hearing that options (1) and (3) are
both reasonable. The choice should be motivated by document sizes and
update patterns. For now, I plan to go with (1), the simplest. As we
get more data, I'll monitor performance and consider switching to
option (3) as a backup strategy.
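If we do end up on option (3), the recombination step would amount to intersecting the per-section result sets on patient id. Here is a minimal sketch of that idea in Python rather than XQuery, with hard-coded lists standing in for the URIs that the separate cts: searches would return. The helper names are hypothetical, and it assumes the patient id is the last path segment, as in /demographics/10291004.

```python
def patient_id(uri):
    """Extract the patient id from a URI such as /lab-results/10291004."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def intersect_by_patient(*uri_lists):
    """Return the patient ids present in every per-section result list."""
    id_sets = [{patient_id(u) for u in uris} for uris in uri_lists]
    if not id_sets:
        return set()
    return set.intersection(*id_sets)


# Stand-ins for the URIs returned by two separate section searches.
demographic_hits = ["/demographics/10291004", "/demographics/10291005"]
lab_hits = ["/lab-results/10291004", "/lab-results/10291777"]

matches = intersect_by_patient(demographic_hits, lab_hits)
print(sorted(matches))
```

Only patients that matched in every section survive the intersection, which is the behavior cts:and-query() gives for free in option (1).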
On Fri, Dec 18, 2009 at 2:09 PM, Kelly Stirman
<Kelly.Stirman at marklogic.com> wrote:
> It sounds like the only downside to approach 1 is the assumption that updates will be slow. Generally speaking, MarkLogic processes updates very quickly, on the order of 1 MB/sec/CPU. So, could you tell us more about how large these documents might become, and the volume of updates to be processed per day?
> If your infrastructure can accommodate the volume of updates, I think approach 1 is the best option.
> Message: 2
> Date: Thu, 17 Dec 2009 18:39:35 -0600
> From: Karl Erisman <karl.erisman at gmail.com>
> Subject: [MarkLogic Dev General] Fragmentation planning
> To: General Mark Logic Developer Discussion
> <general at developer.marklogic.com>
> <ff31d1360912171639u1b670eb9p160af642662ce53a at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> My recent discovery that cts:and-query() does not span fragments (and
> helpful input from list contributors) raises a new issue, one that we
> all face: fragmentation.
> A recent thread ("Creating Collections") dealt with this issue and
> contained some interesting ideas, especially many reasons to avoid
> using fragments. However, there must be times when fragmentation is
> necessary, but the thread seems to have ended without a clear
> resolution on when and how to employ fragments, and when to avoid
> them.
> Kelly Stirman cautions against the use of fragments:
> "...I think you'll find you have more options, the server is easy
> to use, it will be more difficult to make a false step, and you'll have more in
> common with other developers if you don't use fragmentation and instead load
> your nodes as individual documents. You may not have run into any limitations
> thus far, but in my experience you will eventually."
> -- http://email@example.com/msg03478.html
> That's useful information (!), but loading new data in ML must not be
> as simple as avoiding fragmentation. The decision involves many
> factors, including these well-known ones:
> (A) average document size
> (B) optimizing query performance (in my case, I'm interested in a
> related factor that I'd call "query strategy")
> (C) optimizing update performance
> As promised, I'll describe my situation at a high level. Consider it
> a mini case study. I can follow up with more detail if necessary.
> We store "patient charts" in documents, currently one document per
> patient. Included in each chart is data that should be updated
> frequently (e.g. demographic info, clinical visit history, and lab
> results). See factor (C). Also, we need to search for documents by
> specifying criteria from multiple sections. See factor (B).
> Currently, our fragmentation policy has each of the sections (e.g.
> demographics, visit history, and lab results) as a fragment root. I
> think this was predominantly motivated by factor (C) -- for example,
> adding a lab result (a frequent event) would merely add a fragment.
> However, I ran into trouble when using cts:and-query() to specify
> search criteria from multiple fragments (cts:and-query() doesn't span
> fragments).
> Alternative solutions:
> (1) If we simply stop using fragmentation, the queries work as
> desired. But isn't that a bad idea since sections in the documents
> will need frequent updating?
> (2) If I change nothing about the fragmentation, I can run each
> sub-query independently instead of using cts:and-query(), then take
> the intersection (which may span fragments). But I'm reading that I
> should try to avoid fragmentation, so...
> (3) Another option would be to break up the document, creating
> separate docs for each section, so if a document currently has (for
> example) demographics and lab results, it would be split into two
> documents. Directories could be used to group sets of documents by
> section type (/demographics/10291004, /lab-results/10291004). The
> cts: searches that specify demographic and lab criteria would be
> performed separately, then recombined for a final result set as
> described in (2).
> That seems like enough detail for now. Does (3) sound reasonable?
> Any alternative suggestions?