Optimizing Cost and Access with Tiered Storage

by Mike Wooldridge

Not all storage media are created equal. Fast, expensive storage, such as solid-state drives (SSDs), is great for high-value documents that are accessed often. Slower, disk-based storage, including storage in the cloud, can be a good fit for older data that is rarely accessed but must be kept for historical or auditing purposes.

MarkLogic's Tiered Storage lets you automatically store database documents on the most appropriate media. Documents can be saved to different locations based on range-index properties, for example the date the documents were created or last modified. This contrasts with how documents are typically assigned, which is by algorithms that spread documents evenly across all forests. By storing database documents according to access needs, Tiered Storage can help users get better performance at lower cost. Note that you must have a Tiered Storage license to use this feature in production.

You set up Tiered Storage by organizing a database's forests into partitions. Each partition is associated with a start and end value from the range index as well as a storage location. Imagine a tiered-storage database that uses the last-modified date as the index. (You can have MarkLogic automatically track this value for documents with the "maintain last modified" database setting.) You could define a "Newest" partition for the most recently modified data, setting the range to the years 2014 to 2015 and the location to a directory on an SSD. An "Intermediate" partition might handle data from 2010 to 2013 and be set to a regular directory on a local disk. An "Archival" partition could handle older data from 2000 to 2009 and be set to cloud storage.

Figure 1: Each Tiered Storage partition has one or more forests and an assigned range (e.g., a date range), and is associated with a storage location. A document is assigned to a partition based on a range-index value.

When you insert a document, it is mapped to a partition based on its range-index value. In our example, a document last modified on January 12, 2014, would be assigned to a forest in the "Newest" partition and saved to the SSD. If the partition includes multiple forests, the document is assigned to the forest in the partition that has the fewest documents.
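The assignment rule just described can be modeled in a few lines of Python. This is only an illustrative sketch, not MarkLogic code: the Partition class, the forest names, the document counts, and the convention of an inclusive lower bound and exclusive upper bound are all assumptions made for the example, as is raising an error when no partition covers a value.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class Partition:
    """Toy model of a partition: a date range, a location, and its forests."""
    name: str
    lower: date      # inclusive lower bound (assumed convention)
    upper: date      # exclusive upper bound (assumed convention)
    location: str    # illustrative label, not a real storage path
    forests: Dict[str, int] = field(default_factory=dict)  # forest -> doc count


def find_partition(partitions: List[Partition], value: date) -> Partition:
    """Return the partition whose range contains the range-index value."""
    for partition in partitions:
        if partition.lower <= value < partition.upper:
            return partition
    raise ValueError(f"no partition covers {value}")


def assign_forest(partitions: List[Partition], value: date) -> str:
    """Pick the partition by range, then its forest with the fewest documents."""
    partition = find_partition(partitions, value)
    forest = min(partition.forests, key=partition.forests.get)
    partition.forests[forest] += 1  # record the newly inserted document
    return forest


# Hypothetical partitions mirroring the example in the text.
PARTITIONS = [
    Partition("Newest", date(2014, 1, 1), date(2016, 1, 1), "ssd",
              {"newest-1": 120, "newest-2": 95}),
    Partition("Intermediate", date(2010, 1, 1), date(2014, 1, 1), "local-disk",
              {"intermediate-1": 4000}),
    Partition("Archival", date(2000, 1, 1), date(2010, 1, 1), "cloud",
              {"archival-1": 90000}),
]

print(assign_forest(PARTITIONS, date(2014, 1, 12)))  # prints "newest-2"
```

The document dated January 12, 2014, falls in the "Newest" range, and "newest-2" is chosen because it holds fewer documents than "newest-1" in this made-up data.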

Tiered Storage offers various operations to maintain your data over time:

  • As documents age and you want to move them to lower-cost tiers, you can migrate partitions to different storage locations. Built-in functions and REST endpoints make this easy to do, even between local and shared storage locations.
  • As the database grows, you can add forests to partitions and MarkLogic will rebalance the data within the forests to keep them performing efficiently. (For more on rebalancing, see Finding the Right Balance with MarkLogic.)
  • Similarly, you can retire forests from partitions if your storage needs decline. Retired forests have their documents redistributed to other forests in the partition.
  • You can redefine ranges for your partitions. This results in rebalancing of the documents within and between partitions.
  • You can take partitions online and offline. Offline forests are excluded from queries, updates, and most other operations. You can take a partition offline to archive data and save RAM, CPU, and network resources.

Note that when moving documents to different tiers, such as when you age out documents to less expensive storage, migrating is more efficient than rebalancing. MarkLogic's migration operations copy forests in a partition all at once; updates that involve rebalancing (redefining a partition range, for instance) move documents in much smaller batches, which is computationally more expensive.
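To make the difference concrete, here is a small Python model of the two operations. It is a sketch under stated assumptions rather than MarkLogic's implementation: the function names, the document URIs, the location labels, and the inclusive-lower, exclusive-upper range convention are all hypothetical. Migration only changes where a whole partition lives, while redefining ranges forces a per-document check of which partition each document now belongs to.

```python
from datetime import date
from typing import Dict, List, Tuple

Range = Tuple[date, date]  # (inclusive lower, exclusive upper): assumed convention


def partition_for(ranges: Dict[str, Range], value: date) -> str:
    """Return the name of the partition whose range contains the value."""
    for name, (lower, upper) in ranges.items():
        if lower <= value < upper:
            return name
    raise ValueError(f"no partition covers {value}")


def migrate(locations: Dict[str, str], partition: str, new_location: str) -> None:
    """Migration: the whole partition changes location; no document is re-mapped."""
    locations[partition] = new_location


def documents_to_rebalance(docs: Dict[str, date],
                           old: Dict[str, Range],
                           new: Dict[str, Range]) -> List[str]:
    """Range redefinition: every document whose partition changes must move."""
    return sorted(uri for uri, value in docs.items()
                  if partition_for(old, value) != partition_for(new, value))


# Hypothetical ranges: "Newest" shrinks to 2015 only, "Intermediate" absorbs 2014.
OLD = {"Newest": (date(2014, 1, 1), date(2016, 1, 1)),
       "Intermediate": (date(2010, 1, 1), date(2014, 1, 1))}
NEW = {"Newest": (date(2015, 1, 1), date(2016, 1, 1)),
       "Intermediate": (date(2010, 1, 1), date(2015, 1, 1))}
DOCS = {"/a.xml": date(2014, 1, 12),
        "/b.xml": date(2015, 3, 2),
        "/c.xml": date(2011, 7, 9)}

print(documents_to_rebalance(DOCS, OLD, NEW))  # prints ['/a.xml']
```

In this toy data only "/a.xml" changes partition, but finding that out required examining every document, which hints at why redefining ranges costs more than migrating a partition wholesale.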

Also note that partitioning documents with a range index is different from setting a Fast Data Directory for a database, which is another way to leverage fast, expensive storage such as SSDs. Defining a Fast Data Directory is optional when setting up your database; a Fast Data Directory allows MarkLogic to leverage fast storage for critical operations and to improve overall system performance. For details, see the documentation about Fast Data Directory on Forests.

Super-Databases

Another part of MarkLogic's Tiered Storage strategy is the super-database. A super-database lets you organize your documents into separate databases stored on different storage media but still query those documents as a single unit. The sub-databases can be on the same cluster as the super-database or on different clusters. For queries to work, the super- and sub-databases must have the same configuration. Querying multiple data sources at once and then aggregating the results is also known as federated search.

For example, say you have a Database A of active documents on fast, expensive storage and a Database B of archival documents on slower, lower-cost storage. You can create a super-database and associate it with Database A and Database B. (The super-database itself can also contain forests of documents, although that is not recommended since those forests cannot be queried independently of the associated sub-databases.) With this arrangement, you can query just the high-value documents on Database A, just the archival documents on Database B, or the entire corpus by querying the super-database.

Figure 2: You create a super-database by associating a database with one or more sub-databases (e.g., A and B above). Querying the super-database distributes the query to any sub-databases.

There are limitations on super-databases. You cannot perform updates on documents that reside in a sub-database via a super-database; you must perform the updates directly on the sub-database where the documents reside. You also cannot have a super-database with more than one level of sub-databases beneath it. This is to avoid circular references—i.e., two databases referencing one another.
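The behavior described in this section, including the two limitations, can be sketched with a toy Python class. Everything here is hypothetical and for illustration only: the Database class, its attach, query, and update methods, and the sample documents are not MarkLogic APIs, and real queries are far richer than a substring test.

```python
from typing import Callable, Dict, List, Optional


class Database:
    """Toy database: a name, its own documents, and optional sub-databases."""

    def __init__(self, name: str, docs: Optional[Dict[str, str]] = None) -> None:
        self.name = name
        self.docs: Dict[str, str] = dict(docs or {})
        self.subs: List["Database"] = []
        self.is_sub = False

    def attach(self, sub: "Database") -> None:
        """Associate a sub-database, allowing only one level of nesting."""
        if sub is self or sub.subs or self.is_sub:
            raise ValueError("only one level of sub-databases is allowed")
        sub.is_sub = True
        self.subs.append(sub)

    def query(self, predicate: Callable[[str], bool]) -> List[str]:
        """Search this database plus every sub-database and merge the results."""
        hits = [uri for uri, text in self.docs.items() if predicate(text)]
        for sub in self.subs:
            hits.extend(sub.query(predicate))
        return sorted(hits)

    def update(self, uri: str, text: str) -> None:
        """Updates apply only to documents that reside in this database."""
        if uri not in self.docs:
            raise ValueError(f"{uri} does not reside in {self.name}")
        self.docs[uri] = text


# Hypothetical databases mirroring Database A and Database B in the text.
db_a = Database("A", {"/active/1.xml": "invoice open"})
db_b = Database("B", {"/archive/9.xml": "invoice closed"})
super_db = Database("Super")
super_db.attach(db_a)
super_db.attach(db_b)

print(super_db.query(lambda text: "invoice" in text))
# prints ['/active/1.xml', '/archive/9.xml']
```

Querying db_a or db_b alone returns only that database's documents, while calling super_db.update on a document that lives in Database B raises an error in this model, reflecting the rule that updates must go directly to the sub-database.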

HDFS and Amazon S3

A key to Tiered Storage is the ability to store less-active data on lower-cost storage media. Two low-cost options are HDFS (Hadoop Distributed File System) and Amazon S3 (Simple Storage Service).

HDFS is the storage part of Hadoop, which is a framework for processing large data sets in parallel across many hosts. Because HDFS is open source and runs on commodity hardware, it can store data less expensively than traditional shared storage solutions such as SAN (storage area network) and NAS (network-attached storage). Once you configure MarkLogic to use HDFS, you can specify HDFS as a storage directory for a database using the "hdfs:" prefix.

Amazon S3 is a cloud-based storage service offered through Amazon Web Services. S3 customers can interact with documents stored on the service through various interfaces, including REST. S3 storage is inexpensive because of its enormous scale. In 2013, Amazon reported that more than two trillion objects were being stored on S3. To set up S3 on a MarkLogic cluster, you submit your S3 credentials, which are stored in the Security database. Then you can specify an S3 location as a storage directory for a database using the "s3:" prefix.

Although using S3 for storage on MarkLogic can be cheap, there is a performance tradeoff. Data stored on S3 is "eventually consistent," meaning that after you update a document there may be a latency period before you can read the document and see the changes. For this reason, S3 can't be relied upon for document updates or journaling, nor can it be used for shared-disk failover. S3 is recommended for use with read-only forests, which don't require journaling. For Tiered Storage, this can be appropriate for archival partitions containing documents that are rarely accessed and never updated. S3 is also a good option for backup storage.
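As a rough illustration of these guidelines, the following Python sketch classifies a storage directory by its prefix and flags forest settings that conflict with the S3 limitations described above. It is not a MarkLogic API: the function names, the read_only and journaling flags, and the example paths are assumptions made for this example; only the "hdfs:" and "s3:" prefixes come from the text.

```python
from typing import List

HDFS_PREFIX = "hdfs:"  # prefixes as described in the article
S3_PREFIX = "s3:"


def storage_kind(data_directory: str) -> str:
    """Classify a storage directory by its prefix."""
    if data_directory.startswith(HDFS_PREFIX):
        return "hdfs"
    if data_directory.startswith(S3_PREFIX):
        return "s3"
    return "local"


def check_forest(data_directory: str, read_only: bool, journaling: bool) -> List[str]:
    """Return warnings for forest settings that conflict with the S3 guidance."""
    warnings: List[str] = []
    if storage_kind(data_directory) == "s3":
        if not read_only:
            warnings.append("S3 is recommended only for read-only forests")
        if journaling:
            warnings.append("S3 cannot be relied upon for journaling")
    return warnings


# Hypothetical paths; the bucket and directory names are placeholders.
print(check_forest("s3://example-bucket/archive", read_only=True, journaling=False))
# prints []
print(check_forest("s3://example-bucket/active", read_only=False, journaling=True))
# prints both warnings
```

An archival partition that is read-only and does not journal passes the check, which matches the use case the article recommends for S3.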

See the documentation to learn more about Tiered Storage and how to set it up on MarkLogic.

Thanks to Jane Chen and Arthur Tsoi for their help with this article.
