Introduction to the MarkLogic Data Model

Documents

The basic unit of organization in MarkLogic is a document. Documents can be encoded in JSON or XML, as in the following two examples:

{
    "name": "Asahi Draft Beer",
    "brewer": {
        "name": "Asahi",
        "country": "Japan"
    },
    "calories": 41,
    "alcohol": "5.21"
}

<beer>
    <name>Asahi Draft Beer</name>
    <brewer>
        <name>Asahi</name>
        <country>Japan</country>
    </brewer>
    <calories>41</calories>
    <alcohol>5.21</alcohol>
</beer>

The set of JSON keys, objects, and arrays, or XML elements and attributes you use in your documents is up to you. MarkLogic does not require adherence to any schemas.
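
To make the equivalence of the two encodings concrete, here is a minimal Python sketch that parses both forms of the beer document and reads the same field from each. It uses only the Python standard library and is not a MarkLogic API; the variable names are illustrative, and the JSON is written with double-quoted keys and strings so that a strict parser accepts it.

```python
import json
import xml.etree.ElementTree as ET

# The beer document from above, once as JSON and once as XML.
json_text = """
{
    "name": "Asahi Draft Beer",
    "brewer": {"name": "Asahi", "country": "Japan"},
    "calories": 41,
    "alcohol": "5.21"
}
"""

xml_text = """
<beer>
    <name>Asahi Draft Beer</name>
    <brewer>
        <name>Asahi</name>
        <country>Japan</country>
    </brewer>
    <calories>41</calories>
    <alcohol>5.21</alcohol>
</beer>
"""

beer_json = json.loads(json_text)
beer_xml = ET.fromstring(xml_text.strip())

# The same nested value, reached by key lookup in JSON and by path in XML.
country_from_json = beer_json["brewer"]["country"]
country_from_xml = beer_xml.findtext("brewer/country")
```

Note that JSON keeps the number 41 as a number, while the XML parser returns element content as text; any typing of XML values is left to the application in this sketch.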

MarkLogic also supports binary documents (for example, image files, Word, Excel, and PowerPoint files, and executables) and plain-text documents. We refer to this encoding (JSON, XML, text, or binary) as the document’s Format.

URIs

A document’s URI is a key that you choose when you insert the document into the database. Each document has a unique URI, which you use to retrieve or refer to the document later. Typically, document URIs begin with a slash, as in /beer.

Beyond the URI, MarkLogic maintains additional metadata for each document, including properties, permissions, and quality.

Organization

How does MarkLogic organize documents in the database? Logically, MarkLogic provides two concepts: collections and directories. You can think of collections as unordered sets of documents; if you are familiar with tags, that analogy may help. A collection can hold many documents, and a document can belong to many collections.

Directories are similar in concept to the notion of directories or folders in file systems. They are hierarchical and membership is implicit based on the path syntax of URIs.
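
The relationship between URIs, collections, and directories can be sketched with a toy in-memory model. This is an illustration in plain Python, not a MarkLogic API: the ToyDatabase class, its method names, and the example URIs are all hypothetical. It shows a URI acting as a unique key, collections as named sets of URIs, and directory membership derived purely from the URI path.

```python
class ToyDatabase:
    """In-memory stand-in for illustration only; not a MarkLogic API."""

    def __init__(self):
        self.documents = {}      # URI -> document content
        self.collections = {}    # collection name -> set of URIs

    def insert(self, uri, content, collections=()):
        # URIs are unique keys: inserting at an existing URI replaces the document.
        self.documents[uri] = content
        for name in collections:
            self.collections.setdefault(name, set()).add(uri)

    def in_collection(self, name):
        # Collections are unordered sets; sort only to get stable output.
        return sorted(self.collections.get(name, set()))

    def in_directory(self, directory):
        # Membership is implied by the URI path; nothing is stored per directory.
        return sorted(uri for uri in self.documents if uri.startswith(directory))


db = ToyDatabase()
db.insert("/beer/asahi.json", {"name": "Asahi Draft Beer"}, collections=["beer", "japan"])
db.insert("/beer/example-stout.json", {"name": "Example Stout"}, collections=["beer"])
db.insert("/wine/example-red.json", {"name": "Example Red"}, collections=["japan"])

japan_uris = db.in_collection("japan")
beer_directory_uris = db.in_directory("/beer/")
```

Here /beer/asahi.json belongs to two collections at once, while its directory /beer/ follows from the URI alone and was never declared separately.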

Compressed Trees

Under the covers, MarkLogic stores documents as compressed trees, based on the well-known XPath Data Model. This model is rich enough to represent all sorts of documents, including plain text and JSON. To look a little deeper, take the following XML document:

<doc><title>My doc</title><body>Some text.</body></doc>

It is represented in MarkLogic as a tree structure:

   document-node
         |
       <doc>
       /   \
 <title>   <body>
    |         |
"My doc"    "Some text."

The advantage of storing documents in XML format is that you can query the tree structure, using XPath expressions such as /doc/title. And with MarkLogic, you can perform searches that are aware of and can be qualified by this tree structure.
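
As a rough illustration of querying by tree structure, the following Python sketch parses the same document with the standard library's ElementTree and selects nodes by path. ElementTree supports only a subset of XPath and stands in here for a real XPath engine; it is not how MarkLogic evaluates queries.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<doc><title>My doc</title><body>Some text.</body></doc>")

# ElementTree evaluates paths relative to the root element <doc>,
# so the XPath /doc/title corresponds to the relative path "title" here.
title = root.findtext("title")
body = root.findtext("body")

# The children of <doc>, in document order, mirror the tree diagram above.
child_tags = [child.tag for child in root]
```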

So, how is a text document, like this one, represented?

I'm a text document.

Well, since every document is a tree, this one is too (albeit a simple one):

   document-node
          |
"I'm a text document."

Binary documents are also (trivial) trees, like text documents. In this case, however, MarkLogic extends the XPath model to include binary nodes. A binary node is special: it can occur only as the sole child of a document node. For example, the storage of a JPG file would look like this:

document-node
      |
 binary-node

MarkLogic stores binary data as is (without additional compression) and also provides a mechanism for storing binary data externally, outside of the database.

What about JSON? When you insert JSON into MarkLogic via the REST or Java APIs, the JSON is converted to an XML representation that is designed to be indexed for efficient search. In general, writing queries or working with JSON documents doesn’t require you to know any of the details of this XML representation.

Rows, not Tables

When modeling data for MarkLogic, think of documents as rows rather than tables. In other words, if you have a thousand items, model them as a thousand separate documents, not as a single document holding a thousand child elements. There are two reasons for this:

Locks are managed at the document level. A separate document for each item avoids lock contention.
All index, retrieval, and update actions happen at the document level. Finding, retrieving, or updating a single item therefore works best when each item lives in its own document.
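
The modeling advice above can be sketched as a small transformation: take one large document that holds many items and turn it into one document per item, each with its own URI. This is a plain-Python illustration, not a MarkLogic loading tool; the split_into_documents helper, the numbered URI scheme, and the sample item names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A single "table-like" document holding several items as child elements.
catalog = ET.fromstring(
    "<beers>"
    "<beer><name>Asahi Draft Beer</name></beer>"
    "<beer><name>Example Stout</name></beer>"
    "<beer><name>Example Lager</name></beer>"
    "</beers>"
)


def split_into_documents(parent, directory):
    """Return a {uri: xml_string} dict with one entry per child element."""
    documents = {}
    for index, item in enumerate(parent, start=1):
        uri = f"{directory}{index}.xml"  # hypothetical URI scheme
        documents[uri] = ET.tostring(item, encoding="unicode")
    return documents


documents = split_into_documents(catalog, "/beer/")
```

Each resulting entry is a self-contained "row" that could be inserted, locked, indexed, and updated independently of the others.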

Of course, MarkLogic documents can be more complex than simple relational rows. One document can often describe an entity (a manifest, a legal contract, an email) completely.
