"Fragment"-ed thoughts

by Evan Lenz

If you've been working with MarkLogic for any amount of time, you've probably come across the term "fragment." For example, when running a query trace on your code, the result says things like "Selected 5 fragments to filter." What exactly is a fragment? And why is it called that?

The simple answer is that, unless you've specifically enabled a feature called "fragmenting", a fragment is the same thing as a document, the basic unit of storage in MarkLogic. In this case, when you see xdmp:plan() report that your query "selected 100 fragments," it means that 100 candidate documents were found to match your query. (I say "candidate" because the final result set, unless you're running an unfiltered search, may exclude some of those candidates after the filtering process.) So if you're not using fragmenting, then you can just think "document" when you see the word "fragment".
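To make the "candidate" distinction concrete, here is a minimal sketch that compares the number of fragments selected from the indexes with the number of results that survive filtering. The phrase being searched for is only a placeholder; whether the two numbers differ depends on your content and index settings.

```xquery
xquery version "1.0-ml";

(: Hypothetical query: substitute any cts:query that fits your content. :)
let $query := cts:word-query("basic unit of storage")
return (
  (: Candidate fragments selected from the indexes alone (no filtering) :)
  xdmp:estimate(cts:search(collection(), $query)),
  (: Results that remain after each candidate fragment has been filtered :)
  count(cts:search(collection(), $query))
)
```

If the first number is larger than the second, the difference is the set of candidates that filtering threw out.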

The general and consistent advice I've heard is to avoid using the fragmenting feature except when absolutely necessary. (Sometimes it may be useful to use fragmenting for the content of large books, e.g. broken up into chapters.) It's much simpler to let fragmenting occur at the document level—in other words, to not break your documents into fragments. Under this generally advised scenario, you have one document fragment per document. Sometimes this means chopping up larger documents into smaller ones as a pre-processing step. With MarkLogic, it's generally better to have a large number of small documents as opposed to a small number of large documents. The "Documents are Like Rows" section of Jason Hunter's Inside MarkLogic paper gives two reasons for this:

"First, locks are managed at the document level. A separate document for each item avoids lock contention. Second, all index, retrieval, and update actions happen at the fragment level. When finding an item, retrieving an item, or updating an item, that means it's best to have each item in its own fragment. The easiest way to accomplish that is to put them in separate documents."
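As a rough illustration of the pre-processing step mentioned above, the following sketch splits one large document into one document per chapter. Everything specific here is an assumption made for the example: the source URI, the <book>/<chapter> structure, and the target URI scheme.

```xquery
xquery version "1.0-ml";

(: Assumptions: a hypothetical /books/big-book.xml whose root <book>
   element contains <chapter> children; the target URIs are made up too. :)
for $chapter at $i in doc("/books/big-book.xml")/book/chapter
return
  (: Each chapter becomes its own document, and therefore its own fragment :)
  xdmp:document-insert(
    concat("/books/big-book/chapter-", $i, ".xml"),
    $chapter
  )
```

In practice you might do this splitting outside the database at load time instead; the point is only that each item ends up in its own document.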

As it turns out, "document fragment" is just one kind of fragment—the kind you'll most often be encountering. For example, try running the following query:

xdmp:plan(collection())

xdmp:plan is a pseudo-function (really a special form) that does not evaluate the expression you pass to it. Instead, it checks to see if the expression is searchable and, if so, constructs a query plan against index terms, runs the (unfiltered) query, and shows you an XML representation of the plan and how many fragments were selected. If you run the above query against a database with 100 documents (with fragmenting not enabled), then you'll see this in the output:

<qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
  ...
  <qry:info-trace>Selected 100 fragments</qry:info-trace>
  <qry:result estimate="100"/>
</qry:query-plan>

The estimate is the same number you get when calling xdmp:estimate() (another special form) against the same expression.
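One way to check this for yourself is to ask for both numbers in a single query. This is a small sketch that relies on the plan having the shape shown above, that is, a qry:result element carrying an estimate attribute.

```xquery
xquery version "1.0-ml";
declare namespace qry = "http://marklogic.com/cts/query";

let $plan := xdmp:plan(collection())
return (
  (: The estimate computed directly from the indexes :)
  xdmp:estimate(collection()),
  (: The estimate reported inside the query plan :)
  xs:integer($plan//qry:result/@estimate)
)
```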

As I alluded to above, there are other kinds of fragments: document properties and document locks. These also have their own XML representation, just like a normal document fragment. They are also associated with the document to which they apply by having the same URI. The difference is that they are accessed using different APIs. Whereas collection() and doc() return document fragments, they do not return document properties or locks. For those, you need to call other functions. For example, the following query will tell you how many document properties fragments are in your database:

xdmp:estimate(xdmp:document-properties())

Whereas this query will tell you how many document locks are currently in the database:

xdmp:estimate(xdmp:document-locks())

And this query will return the given document, its properties fragment, and its lock fragment (if the document is currently locked):

let $uri := "/testDir/test.xml" return
<result>{
  doc($uri),
  xdmp:document-properties($uri),
  xdmp:document-locks($uri)
}</result>

Here's the result I get from my database:

<result>
  <test>This is my document.</test>
  <prop:properties xmlns:prop="http://marklogic.com/xdmp/property">
    <prop:last-modified>2011-12-21T13:57:22-08:00</prop:last-modified>
  </prop:properties>
  <lock:lock xmlns:lock="http://marklogic.com/xdmp/lock">
    <lock:lock-type>write</lock:lock-type>
    <lock:lock-scope>exclusive</lock:lock-scope>
    <lock:active-locks>
      <lock:active-lock>
        <lock:depth>0</lock:depth>
        <lock:owner>Evan is editing this document</lock:owner>
        <lock:timeout>120</lock:timeout>
        <lock:lock-token>http://marklogic.com/xdmp/locks/191a5677cb8bc042</lock:lock-token>
        <lock:timestamp>1324504932</lock:timestamp>
        <sec:user-id xmlns:sec="http://marklogic.com/xdmp/security">7071164303237443533</sec:user-id>
      </lock:active-lock>
    </lock:active-locks>
  </lock:lock>
</result>

The first child of <result> is a copy of my document itself (the document fragment). The second child is a copy of the properties fragment. And the third is a copy of the lock fragment (which I previously acquired for the heck of it using xdmp:lock-acquire()). As you can see, all three of these are represented using XML. This means you can process them using the same functions and operators as you'd use on regular documents. Moreover, all three fragments have the same URI ("/testDir/test.xml"). This may seem strange, but it works out; the way you access the three kinds of fragments is different, so there's no conflict.
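Because properties and locks are plain XML, ordinary path expressions work on them. Here is a minimal sketch that pulls two values out of the fragments shown above; the <summary> wrapper and its child element names are made up for the example.

```xquery
xquery version "1.0-ml";
declare namespace prop = "http://marklogic.com/xdmp/property";
declare namespace lock = "http://marklogic.com/xdmp/lock";

let $uri := "/testDir/test.xml"
return
  <summary>
    <!-- Last-modified timestamp, read from the properties fragment -->
    <modified>{ xdmp:document-properties($uri)//prop:last-modified/string() }</modified>
    <!-- Lock owner, read from the lock fragment (empty if the document is not locked) -->
    <owner>{ xdmp:document-locks($uri)//lock:owner/string() }</owner>
  </summary>
```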

There are of course many other functions for accessing and manipulating properties and lock fragments. The point here is that they exist, they're stored as fragments, and they are accessed using different queries and functions than normal document fragments.
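To give a flavor of those functions, here is a short sketch in three separate statements (transactions), separated by semicolons. The <reviewed> property is a hypothetical name chosen for illustration, and the lock arguments simply mirror the values visible in the lock fragment above.

```xquery
xquery version "1.0-ml";
(: Statement 1: add or replace a hypothetical <reviewed> property. :)
xdmp:document-set-property("/testDir/test.xml", <reviewed>true</reviewed>);

xquery version "1.0-ml";
(: Statement 2: read that property back by name from the properties fragment. :)
xdmp:document-get-properties("/testDir/test.xml", xs:QName("reviewed"));

xquery version "1.0-ml";
(: Statement 3: acquire an exclusive lock with depth "0" and a 120-second timeout. :)
xdmp:lock-acquire("/testDir/test.xml", "exclusive", "0",
  "Evan is editing this document", xs:unsignedLong(120))
```

The update and the read are in separate statements because an update is not visible to the statement that makes it.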

What about directories? Are they represented as fragments? Well, yes and no. There's no separate type of "directory fragment." However, directories are represented using none other than properties fragments! To prove this, all you have to do is get the properties fragment whose URI is a directory URI you know exists:

xdmp:document-properties("/testDir/")

The result has an empty <prop:directory/> element which is a flag representing the fact that this properties fragment is actually a directory:

<prop:properties xmlns:prop="http://marklogic.com/xdmp/property">
  <prop:directory/>
</prop:properties>
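Since the flag is just an element, you can test for it like any other XML. This is a minimal sketch that checks whether a properties fragment at the given URI carries the directory flag:

```xquery
xquery version "1.0-ml";
declare namespace prop = "http://marklogic.com/xdmp/property";

(: True if a properties fragment at this URI contains <prop:directory/> :)
exists(xdmp:document-properties("/testDir/")//prop:directory)
```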

If directories are just properties fragments (and not document fragments), does that mean you could create a regular document using the same URI as a directory? Yes. But it's not recommended.

Comments

  • This article glosses over the distinction between directories and directory fragments. (Truth be told, I didn't realize the difference at the time.) A directory is just a series of one or more steps in a slash-separated document URI. Directories are always indexed to support directory-related functions such as xdmp:directory() and cts:directory-query(). Directory fragments (as described near the end of this article) are an additional feature used to support WebDAV. They aren't really necessary for anything else, and they will only exist if you create them or have your database configured to create them for you. (They also hamper scalability.) Check out Michael Blakeley's excellent blog article explaining the distinction and nuances of each: https://blakeley.com/blogofile/2012/03/19/directory-assistance/