(This is a repost of Denise's 2011 writing)
I’ve been itching to write up a post about the noSQL (“not only SQL”) category of technologies because there’s such a dearth of practical and specific information on this topic, and because so many people are unclear about how MarkLogic compares to these technologies.
This post is targeted at folks who have already come to the realization that a relational database won’t meet their needs, and who are trying to figure out which of the available alternatives is a plausible option. It’s also meant to be a general tutorial for anyone curious about this emerging space.
The noSQL “movement” was born from a growing recognition that there are certain types of data-based problems that are quite difficult, or inefficient, or impossible-within-practical-constraints to tackle using relational database technology. The most common factors behind these problems are unstructured data and scalability. Why?
- Unstructured data doesn’t fit neatly into rows and columns.
- Expensive and time-consuming efforts are required for data modeling activities aimed at mapping pieces of unstructured data to table cells. Inevitably the process results in some amount of fidelity loss.
- Business impedance results from being tied to a rigid schema based architecture, because unstructured data evolves quickly.
- Relational databases scale vertically, not horizontally.
- There gets to be a point where even the most exotic hardware can’t handle the complexity and/or load of these types of problems.
- Today’s data centers and cloud vendors are built around scaling out on commodity hardware.
- There is no efficient way to distribute RDBMS managed data across a growing (or shrinking) cluster of servers.
The noSQL technologies (I don’t include MarkLogic in this category) are not databases “in the traditional sense”, meaning none of them provide both transactional integrity and real-time results, and some of them provide neither.
Each resulted from an effort to alleviate specific limitations found in the RDBMS world which were preventing their architects from completing a specific type of task, and they all made trade-offs to get there. Whether it was tackling “hot row” lock contention, horizontal scale, sparse data performance problems, or single schema induced rigidity, they are much more narrowly focused in purpose than their RDBMS predecessors. They weren’t designed to be enterprise platforms for building ACID transactional, real-time applications.
Data consistency is the most prominent noSQL trade-off made in order to achieve horizontal scale, so it’s worth understanding generally why and generally what the ramifications are. There is an obvious data consistency challenge with cache-based noSQL options (such as memcached) so I’ll focus on the issues with persistent stores here.
The concept is relatively straightforward--figure out how to partition your data such that it will be most evenly distributed across a cluster of commodity servers. This requires some upfront understanding about what kinds of questions you’ll be asking, such that all of the results aren’t clumped together within the cluster. (If you simply rely on the default consistent hash algorithm provided by some of the noSQL technologies, you’ll not be fully optimized because that technique is not content-aware). If the questions change, repartitioning might be needed to maintain performance. In order to avoid “hot spots”, (which introduce lock-contention based performance problems), multiple copies of the data exist within the cluster. With some noSQL options, joins are not possible, so data cannot be normalized, and again you end up with multiple copies of data. Either way, you have a data consistency issue, because when updates occur, for some window of time some of the nodes in the cluster will have stale data. This is known as “eventual consistency”, a model used by Amazon Dynamo and copied by many. In short, updates are treated as “always” writeable, with the complexity of potential conflict resolution left to the read operation. None of the noSQL options have anything other than rudimentary conflict resolution functionality, so this is left as an exercise for the application/business-process layers.
Obviously I cannot provide exhaustive descriptions of each noSQL subcategory here, but since many people expect noSQL options to be “databases”, I’ve tried to highlight those characteristics that might be least expected.
- Essentially a distributed hashmap for simple store and retrieve only
- Achieves flexibility by not requiring a single schema; achieves horizontal scalability through simplified key-based access, key-based partitioning, and eventual consistency
- No joins, no search, no global indexes, no complex queries
- Can be in-memory or persistent
- Examples: memcached, Amazon SimpleDB, Tokyo Tyrant, Redis, Berkeley DB, Voldemort, Oracle Coherence, Gigastore
- Achieves flexibility by not requiring a single schema; addresses sparse data problems by replacing “rows” with “documents”, where documents can have embedded documents, allowing for complex hierarchies to be stored in a single record; achieves horizontal scalability through key-based partitioning and eventual consistency
- Allows for more complex queries than key-value stores
- No joins
- No complex multi-document transactions
- Weak full text search functionality. All keywords need to be placed in a single field. The application code needs to handle tokenization and stemming.
- You need to configure indexes explicitly, and there are limitations on how many indexes you can specify. (e.g. in MongoDB you are limited to 64 indexes per collection). It’s recommended to configure the minimum number of indexes possible, in order to preserve performance.
- Examples: MongoDB, CouchDB
Map Reduce Batch Processing
- It’s a framework for distributed computing on large datasets, not a database.
- Unlike the other noSQL technologies, map reduce (in combination with a distributed filesystem) is fundamentally targeted at addressing the limitation of disk read performance rather than RDBMS shortcomings.
- Good for writing once, reading many times.
- Good for offline batch processing—results come back in minutes, hours or days, not instantly.
- Relies on dedicated hardware in a single data center with very high aggregate bandwidth interconnects.
- Data is partitioned based on the basic questions you are asking
- No search or retrieval
- No global indexes
- No transactions
- Example: Hadoop and HDFS
- Multi-dimensional key-value store which provides richer data model than straight key-value pairs via facility to organize columns
- Addresses sparse data problems by storing data based on columns instead of rows; achieves horizontal scalability through master/chunk-server partitioning (with the exception of Cassandra which follows the Amazon Dynamo model, ie eventual consistency)
- Well-suited for computing aggregates over very large sets of similar data
- Often require data to be of uniform type
- Cassandra, HBase, Google BigTable, HyperTable
You’ll notice that full text search capability isn’t a strong suit of noSQL technologies. One basic reason is that most partition the data across a cluster of commodity servers based on a key used for retrieval and don’t maintain global indexes. To conduct a full text search against a cluster, you have to run the search on every node in the cluster. As a result, throughput is limited by the slowest machine in the cluster, not the size of the cluster. Moreover, if the target of your full text search query is not well aligned with your partitioning, then an i/o bottleneck is introduced if data needs to be copied to different locations to facilitate computations such as joins.
The noSQL crowd expects folks to rely on full text search engines such as Lucene, Solr, or Sphinx. But that’s not ideal either. You may have turned to noSQL for it’s horizontal scale capabilities, but having to scale two point solutions at once is not trivial. Full text search engines have other drawbacks as well:
- The indexes do not reflect updates to the database in real time. They are batched for commits and the commits invalidate the distributed caches relied upon for horizontal scale, so they are very expensive from a performance point of view.
- They can only return references to documents that contain a match, they can’t return more granular results, except for those fields which you have explicitly identified at the outset and must commit to or else reload and reindex later.
- Search engines lie to you all the time in ways that are not always obvious because they need to take shortcuts to make performance targets. In other words, they don’t provide for a way to guarantee accuracy.
- If you change your indexing model, you need to reload all of the content from the external source and reindex it.
- They don’t provide transactional integrity, again preventing the possibility for real time search.
…to name a few.
The last thing for consideration when surveying alternatives to an RDBMS solution is what I’ll call “enterprise worthiness”. You can read more about what I mean by that in an earlier post.
Ok, finally how does MarkLogic compare? I’ll try to be as brief as possible—you’ve found your way to the MarkLogic website already, where there are plenty of materials to provide details. The important take-aways are that, unlike noSQL technologies, MarkLogic is proven to scale horizontally on commodity hardware up to petabyte range:
- With full ACID transactional integrity (ie no compromise on consistency)
- While delivering real-time updates, search, and retrieval results
- While imposing no up-front constraints on data ingestion. (You don’t need to know or decide anything about your data model before you load it into MarkLogic.)
- While allowing for unlimited schemas (or no schemas)
- While automatically indexing every piece of text and document structure upon ingestion while maintaining MBs/sec/core ingestion rates
- While leveraging document structure for targeted search and retrieval
- While delivering any level of document granularity
- All (and much more) with enterprise worthy administration tools and world wide 7x24 support
MarkLogic wasn’t built to solve a specialized problem. It was architected from the ground up to be an enterprise class platform for building applications at any scale which rely on sophisticated real-time search and retrieval functionality. If you’re looking for something that is as reliable as your trusty RDBMS, but is better suited for unstructured data and horizontal scale, then MarkLogic is the first place to look.
So if MarkLogic is not really suitably grouped with the noSQL technologies, where does it fit? It’s in a next-generation database class of it’s own. Here’s how I see it:
E.F. Codd's vision of the relational database was revolutionary because it separated database organization from the physical storage. Rather than worrying about data structures and retrieval procedures, developers could simply employ declarative methods for specifying data and queries. Oracle capitalized on this model and built a business essentially selling agility.
Christopher Lindblad's vision of the unstructured database is revolutionary because it separates data ingestion from data organization, and because it combines search and retrieval. Rather than worrying about schema (or data partition) maintenance, developers can simply employ expressive methods for searching and retrieving information in real-time. MarkLogic is building a business essentially selling agility.