What’s on disk in MarkLogic? The answer is quite simple: databases, forests, and stands. A database is the maximum query unit within MarkLogic. A system can have multiple databases, and generally will, but each request executes against a particular database.
A database consists of one or more forests. A forest is a collection of documents (mostly compressed trees, thus the name), implemented as a physical directory on disk. Each forest holds a set of documents and all their indexes. A single machine may manage several forests.
Forests can be queried in parallel, so placing more forests on a multi-core server can help with concurrency. If the content is updated on a regular basis, each forest should ideally have two CPUs (or cores) behind it. That allows for concurrent operations, such as one core servicing merges while another core is servicing queries. For example, a dual processor/dual core machine should be able to host two forests with a maximum size of 256GB each, and a quad processor/dual core system should be able to host four forests with a maximum size of 256GB each. In a clustered environment, you can have a set of servers, each managing its own set of forests, all unified into a single database.
Each forest holds zero or more stands. A stand (like a stand of trees) holds a subset of the forest data and exists as a physical subdirectory under the forest directory. It contains a set of compressed binary files with names like TreeData, IndexData, Qualities, and such. This is where the actual compressed data (in TreeData) and indexes (in IndexData) can be found.
A forest might contain a single stand, but it’s more common to have multiple stands, because multiple stands are what allow MarkLogic to ingest data efficiently, as we’ll see next.
The main operations MarkLogic supports can be described as document CRUD. Yep, that’s Create, Read, Update, and Delete. We use the term Insert (or sometimes ingest) instead of Create, but that doesn’t keep us from saying CRUD for fun.
To understand how MarkLogic manages data inside forests, we need to see how each of these operations works.
Let’s start with an empty database having a single forest that (because it has no documents) has no stands. At some point a new document is inserted into MarkLogic. MarkLogic puts this document in an in-memory stand and writes the action to an on-disk journal to maintain transactional integrity in case of system failure.
As new documents are inserted, the same happens to them; they’re placed in the in-memory stand. A read request at this point will see all the data on disk (technically nothing yet) as well as everything in the in-memory stand (our small set of documents). The read request won’t be able to tell where the data is, but will see the full view of data loaded at this point in time.
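The insert-then-read behavior just described can be sketched as a toy model. The class and method names here are illustrative only, not MarkLogic APIs: each insert is journaled first for recoverability, then placed in the in-memory stand, and a read sees the union of the in-memory stand and any on-disk stands.

```python
# Toy model of the insert path: journal the action, then place the
# document in the in-memory stand. Reads see all data, wherever it lives.

class ToyForest:
    def __init__(self):
        self.journal = []          # append-only record of actions
        self.in_memory_stand = {}  # uri -> document
        self.on_disk_stands = []   # list of {uri: document} dicts

    def insert(self, uri, doc):
        self.journal.append(("insert", uri))  # journal first, for recovery
        self.in_memory_stand[uri] = doc

    def read(self, uri):
        # A read can't tell (and doesn't care) where the data resides.
        if uri in self.in_memory_stand:
            return self.in_memory_stand[uri]
        for stand in self.on_disk_stands:
            if uri in stand:
                return stand[uri]
        return None

forest = ToyForest()
forest.insert("/docs/a.xml", "<a/>")
print(forest.read("/docs/a.xml"))  # -> <a/>
```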
After enough documents are inserted, the in-memory stand will fill up and be flushed to disk, written out as an on-disk stand. Each new stand gets its own subdirectory under the forest directory, with names that are monotonically increasing hexadecimal numbers. The first stand gets the lovely name 00000000. That on-disk stand contains all the data and indexes for the documents loaded thus far. It’s written from the in-memory stand out to disk as a sequential write for maximum efficiency. Once it’s written, the in-memory stand’s allocated memory is freed.
As more documents are inserted, they go into a new in-memory stand. At some point this in-memory stand fills up as well, and it gets written as a new on-disk stand, probably named 00000001 and about the same size as the first. Sometimes under heavy load you have two in-memory stands at once, when the first is still being written to disk as a new stand is created for additional documents. At all times an incoming request can see all the data across all the stands.
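The flush cycle and the stand-naming scheme can be sketched like this. The capacity threshold is an arbitrary stand-in (real in-memory stands are sized in memory, not document count), and nothing here is a MarkLogic API; only the eight-digit hexadecimal naming follows the text.

```python
# Sketch of the flush cycle: when the in-memory stand fills, it is
# written out as a new on-disk stand whose directory name is the next
# eight-digit hexadecimal number (00000000, 00000001, ...).

def stand_name(n):
    return f"{n:08x}"  # monotonically increasing hex, zero-padded

class FlushingForest:
    CAPACITY = 2  # illustrative threshold; real stands are sized in memory

    def __init__(self):
        self.in_memory = {}
        self.on_disk = []  # list of (name, documents) tuples

    def insert(self, uri, doc):
        self.in_memory[uri] = doc
        if len(self.in_memory) >= self.CAPACITY:
            self.flush()

    def flush(self):
        name = stand_name(len(self.on_disk))
        self.on_disk.append((name, dict(self.in_memory)))  # sequential write
        self.in_memory.clear()                             # free the memory

forest = FlushingForest()
for i in range(5):
    forest.insert(f"/doc/{i}", f"<d>{i}</d>")
print([name for name, _ in forest.on_disk])  # -> ['00000000', '00000001']
```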
The mechanism continues with in-memory stands filling up and writing to on-disk stands. As the total number of on-disk stands grows, an efficiency issue threatens to emerge. To read an index, MarkLogic must read the index data from each individual stand and unify the results. To keep the number of stands to a manageable level where that unification isn’t a performance concern, MarkLogic runs merges in the background.
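The per-stand unification cost can be illustrated with a toy index. Here each stand’s “index” is just a map from a term to the set of URIs containing it — a stand-in for the real index structures, used only to show why every extra stand adds work to every lookup.

```python
# To resolve a term, the server must consult each stand's index and
# union the results; more stands means more lookups per query.

def lookup(stands, term):
    """Union the posting lists for `term` across every stand."""
    result = set()
    for index in stands:
        result |= index.get(term, set())
    return result

stands = [
    {"marklogic": {"/a.xml"}, "forest": {"/a.xml", "/b.xml"}},
    {"forest": {"/c.xml"}},
    {"stand": {"/d.xml"}},
]
print(sorted(lookup(stands, "forest")))  # -> ['/a.xml', '/b.xml', '/c.xml']
```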
A merge takes some of the stands on disk and creates a new singular stand out of them, coalescing and optimizing the indexes and data, as well as removing any previously deleted documents, a topic we’ll discuss shortly. After the merge finishes and the new on-disk stand has been fully written, and after all the current requests using the old on-disk stands have completed, MarkLogic deletes the old on-disk stands.
MarkLogic uses an algorithm to determine when to merge, based on the size of each stand. In a normal server running under constant load you’ll usually see a few large stands, a few more mid-sized stands, and several more small stands. Over time the smaller stands get merged into ever-larger stands. Merges do use CPU and I/O resources, so you do have control over when merges can happen via system administration. However, it’s generally best to let MarkLogic manage this process. Think of it as constant background housekeeping.
Each forest has its own in-memory stand and set of on-disk stands. A new document gets assigned to a forest algorithmically, unless you override the selection as part of the insert. Inserting documents and indexing their content is a largely parallelizable activity, so splitting the effort across forests, and potentially across machines in a cluster, can help scale the ingestion work.
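The assignment algorithm itself is internal to MarkLogic. As a purely hypothetical illustration of the idea, this sketch hashes the document URI across the available forests and lets the caller override the choice, as the insert option does:

```python
# Hypothetical forest assignment: deterministic hash of the URI spread
# across the available forests, with an optional explicit override.

import hashlib

def choose_forest(uri, forests, override=None):
    if override is not None:
        return override  # caller forced a particular forest
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return forests[int.from_bytes(digest[:4], "big") % len(forests)]

forests = ["forest-1", "forest-2", "forest-3"]
print(choose_forest("/docs/a.xml", forests))              # deterministic pick
print(choose_forest("/docs/a.xml", forests, "forest-2"))  # -> forest-2
```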
What happens if you delete or change a document? If you delete a document, MarkLogic marks the document as deleted but does not immediately remove it from disk. The deleted document will be ignored by queries based on its deletion markings, and the next merge of the stand holding the document will bypass the deleted document when writing the new stand.
If you change a document, MarkLogic marks the old version of the document as deleted in its current stand and creates a new version of the document in the in-memory stand. MarkLogic distinctly avoids modifying the document in place. If you consider the updates to an index required to reflect the changes in a single document, you will see that updates in place are an entirely inefficient proposition. So, instead, MarkLogic treats any changed document like a new document, and treats the old version like a deleted document.
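This update-as-delete-plus-insert behavior can be captured in a minimal sketch, with illustrative names: the old version is only marked deleted, and the new version is appended as if it were a new document. Nothing is modified in place.

```python
# Each stored version carries a creation time and a deletion time
# (infinity until the version is marked deleted).

INFINITY = float("inf")

class Version:
    def __init__(self, uri, body, created):
        self.uri, self.body = uri, body
        self.created = created
        self.deleted = INFINITY  # not yet deleted

def update(versions, uri, new_body, timestamp):
    for v in versions:
        if v.uri == uri and v.deleted == INFINITY:
            v.deleted = timestamp  # mark the old version deleted in place...
    versions.append(Version(uri, new_body, timestamp))  # ...and append new

docs = [Version("/a.xml", "<a>old</a>", created=1)]
update(docs, "/a.xml", "<a>new</a>", timestamp=5)
print([(v.body, v.created, v.deleted) for v in docs])
# -> [('<a>old</a>', 1, 5), ('<a>new</a>', 5, inf)]
```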
This approach is known in database circles as MVCC, which stands for Multi-Version Concurrency Control. It has several advantages, including the ability to run lock-free queries, as we’ll explain.
In an MVCC system changes are tracked with a timestamp number that increments for each transaction as the database changes. Each document gets its own creation-time (the timestamp at which it was created) and deletion-time (the timestamp at which it was marked as deleted, starting at infinity for documents not yet deleted). On disk you’ll see these timestamps in the Timestamps file. Trivia buffs will note it’s the only file in the stand directory that’s not read-only.
For a request that doesn’t modify data (sometimes we say query, or read, as opposed to an update, or write, that might make changes), the system gets a performance boost by skipping the need for any URI locking. The query is viewed as running at a certain timestamp, and throughout its life it sees a consistent view of the database at that timestamp, even as other (update) requests continue forward and change the data.
MarkLogic does this by adding to the normal query constraints two extra constraints: first, that any documents returned have to have been “created at or before the request timestamp” and second, that they have to have been “deleted after the request timestamp”. It’s easy to create from these two primitives what’s in essence a new implicit list of documents in existence at a certain timestamp. This constraint is implicitly added to every read. It’s a high-performance substitute for having to acquire locks.
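The two constraints reduce to a single visibility predicate, sketched here: a version is visible at a query timestamp if it was created at or before that timestamp and deleted after it.

```python
# Visibility under MVCC: created at or before the request timestamp,
# and deleted strictly after it.

INFINITY = float("inf")

def visible(created, deleted, query_ts):
    return created <= query_ts and deleted > query_ts

# A document version created at timestamp 3 and marked deleted at 7:
print(visible(3, 7, 2))         # -> False (not yet created)
print(visible(3, 7, 5))         # -> True
print(visible(3, 7, 7))         # -> False (not deleted *after* 7)
print(visible(3, INFINITY, 9))  # -> True (never deleted)
```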
A request that does an insert or update, because it isn’t read-only, has to use read/write locks to maintain system integrity while making changes. Locks are taken on individual document URIs. This lock behavior is implicit and not under the control of the user. Read-locks block for write-locks; write-locks block for both read- and write-locks. An update has to obtain a read-lock before reading a document and a write-lock before changing (adding, deleting, modifying) a document. Lock acquisition is ordered, first-come first-served, and locks are released automatically at the end of a request.
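The lock compatibility rules stated above can be expressed as a tiny predicate (illustrative only): read-locks are shared, so a held read-lock blocks only write-lock requests, while a held write-lock is exclusive and blocks everything.

```python
# Lock compatibility on a single document URI:
# read-locks block for write-locks; write-locks block for both.

def blocks(held, requested):
    """Does a held lock block a newly requested lock on the same URI?"""
    if held == "read":
        return requested == "write"  # readers block only writers
    return True                      # a write-lock blocks everything

print(blocks("read", "read"))    # -> False: many readers may share
print(blocks("read", "write"))   # -> True
print(blocks("write", "read"))   # -> True
print(blocks("write", "write"))  # -> True
```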
In any lock-based system you have to worry about deadlocks, where two or more updates are each stalled waiting on locks held by the other. In MarkLogic, deadlocks are automatically detected with a background thread. When the deadlock happens between updates on the same host in a cluster, the update farthest along (the one holding the most locks) wins and the other update gets restarted. When it happens on different hosts, because lock count information isn’t (at least yet) in the wire protocol, both updates start over.
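The same-host victim selection can be sketched as follows; the structure and names are hypothetical, but the rule follows the text: the update holding the most locks wins, and the other is restarted.

```python
# Same-host deadlock resolution: the update farthest along (most locks
# held) wins; the other update is restarted.

def resolve_deadlock(update_a, update_b):
    """Return (winner, restarted) by lock count; ties go to update_a."""
    if update_a["locks"] >= update_b["locks"]:
        return update_a["name"], update_b["name"]
    return update_b["name"], update_a["name"]

u1 = {"name": "txn-1", "locks": 12}
u2 = {"name": "txn-2", "locks": 3}
print(resolve_deadlock(u1, u2))  # -> ('txn-1', 'txn-2')
```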
All reads are made against a read-consistent view and all inserts/updates (writes) are part of a transaction. All index updates happen in the same transaction as the data update. And you can even do a manual rollback in your program.
Locks are acquired during an insert or update, yet the actual commit work only happens once the write finishes successfully. If the write exits with an error, all pending changes that were part of that write are discarded. Each statement is its own auto-commit transaction.