[MarkLogic Dev General] Fragmentation planning

Karl Erisman karl.erisman at gmail.com
Thu Dec 17 16:39:35 PST 2009


My recent discovery that cts:and-query() does not span fragments (and
helpful input from list contributors) raises a new issue, one that we
all face: fragmentation.

A recent thread ("Creating Collections") dealt with this issue and
contained some interesting ideas, especially many reasons to avoid
using fragments.  However, there must be times when fragmentation is
necessary, and the thread seems to have ended without a clear
resolution on when and how to employ fragments and when to avoid
them.

Kelly Stirman cautions against the use of fragments:

"...I think you'll find you have more options, the server is easy
to use, it will be more difficult to make a false step, and you'll have more in
common with other developers if you don't use fragmentation and instead load
your nodes as individual documents. You may not have run into any limitations
thus far, but in my experience you will eventually."

source:
    http://www.mail-archive.com/general@developer.marklogic.com/msg03478.html

That's useful information (!), but loading new data into MarkLogic
surely is not as simple as avoiding fragmentation.  The decision
involves many factors, including these well-known ones:
(A) average document size
(B) optimizing query performance (in my case, I'm interested in a
related factor that I'd call "query strategy")
(C) optimizing update performance

As promised, I'll describe my situation at a high level.  Consider it
a mini case study.  I can follow up with more detail if necessary.

We store "patient charts" in documents, currently one document per
patient.  Included in each chart is data that should be updated
frequently (e.g. demographic info, clinical visit history, and lab
results).  See factor (C).  Also, we need to search for documents by
specifying criteria from multiple sections.  See factor (B).

Currently, our fragmentation policy has each of the sections (e.g.
demographics, visit history, and lab results) as a fragment root.  I
think this was predominantly motivated by factor (C) -- for example,
adding a lab result (a frequent event) would merely add a fragment.
However, I ran into trouble when using cts:and-query() to specify
search criteria from multiple fragments (cts:and-query doesn't span
fragments).
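
To make the problem concrete, here is a toy model in plain Python.  It
is not the MarkLogic API, and the term strings ("gender:F",
"flag:high", and so on) are made up for illustration.  It assumes only
what is described above: each fragment is matched on its own, so an
AND of criteria succeeds only when a single fragment satisfies all of
them.

```python
# Toy model only: plain Python, not the MarkLogic API.
# Assumption: each fragment is matched independently, so an AND of
# terms succeeds only if one fragment contains every term.

# One patient chart, split into two fragments (hypothetical terms).
chart_fragments = {
    "demographics": {"gender:F", "age:42"},
    "lab-results": {"test:HbA1c", "flag:high"},
}


def and_query_per_fragment(fragments, terms):
    """Return the names of fragments that contain all of the terms."""
    return [name for name, found in fragments.items() if terms <= found]


def and_query_whole_document(fragments, terms):
    """Treat the chart as a single unit: pool every fragment's terms first."""
    all_terms = set().union(*fragments.values())
    return terms <= all_terms


# Criteria drawn from two different sections of the chart.
criteria = {"gender:F", "flag:high"}
per_fragment_hits = and_query_per_fragment(chart_fragments, criteria)
whole_doc_hit = and_query_whole_document(chart_fragments, criteria)
```

In this model the chart satisfies the criteria as a whole, but no
single fragment does, which is the behavior I ran into.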

Alternative solutions:
(1) If we simply stop using fragmentation, the queries work as
desired.  But isn't that a bad idea since sections in the documents
will need frequent updating?
(2) If I change nothing about the fragmentation, I can run each
sub-query independently instead of using cts:and-query(), then take
the intersection (which may span fragments).  But I'm reading that I
should try to avoid fragmentation, so...
(3) Another option would be to break up the document, creating
separate docs for each section, so if a document currently has (for
example) demographics and lab results, it would be split into two
documents.  Directories could be used to group sets of documents by
section type (/demographics/10291004, /lab-results/10291004).  The
cts: searches that specify demographic and lab criteria would be
performed separately, then recombined for a final result set as
described in (2).
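
The recombination step in (2) and (3) could be sketched as follows.
This is plain Python, not MarkLogic code: the URI lists stand in for
the results of two independent searches, and I am assuming that the
last path segment of each URI is the patient identifier.  The helper
names and the extra IDs (10291005, 10291007) are hypothetical.

```python
# Sketch only: the URI lists below stand in for the results of two
# separate searches; nothing here calls MarkLogic.

def patient_id(uri):
    """Take the last path segment, e.g. '/lab-results/10291004' -> '10291004'."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def intersect_by_patient(*result_sets):
    """Return the sorted patient IDs that appear in every result set."""
    id_sets = [{patient_id(uri) for uri in uris} for uris in result_sets]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))


# Hypothetical hits from a demographics search and a lab-results search.
demographic_hits = ["/demographics/10291004", "/demographics/10291005"]
lab_hits = ["/lab-results/10291004", "/lab-results/10291007"]

matching_patients = intersect_by_patient(demographic_hits, lab_hits)
```

Only patients that appear in both lists survive the intersection, so
the final result set is keyed by patient rather than by fragment or
section document.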

That seems like enough detail for now.  Does (3) sound reasonable?
Any alternative suggestions?

Thanks,
Karl

