Using Many Collections

by Christopher Lindblad
On an internal discussion list, a question came up recently about a customer who was using a very large number of collections. MarkLogic Founder Chris Lindblad chimed in with the following explanation of how the collection lexicon works and whether using a large number of collections is good or bad. He agreed that sharing his answer here would be useful for others.

I think having a large number of collections is a great way to organize documents. The collection mechanism in MarkLogic is very scalable: you can easily have as many collections as documents. I encourage their use rather than discourage it.

Collections are implemented as if there is a hidden <collection> element in each document for every collection that the document belongs to. So if a document belongs to ten collections, it is as if there are ten hidden <collection> elements in that document, holding the names of the ten collections it belongs to. The fundamental database cost for collection metadata is therefore the number of collections each document belongs to, times the number of documents. Fundamentally, having one collection with a million documents costs about the same as having a million collections, each with one document.
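To make that arithmetic concrete, here is a minimal Python sketch. It is only a toy model of the "hidden element" analogy, not MarkLogic code: the function name collection_entries, the URIs, and the collection names are all made up for illustration. It counts one entry per (document, collection) membership and compares the two extremes described above.

```python
def collection_entries(doc_collections):
    """Count hidden <collection> entries: one per (document, collection) membership.

    doc_collections maps a document URI to the set of collection names
    that document belongs to (a hypothetical in-memory stand-in for a database).
    """
    return sum(len(names) for names in doc_collections.values())


n = 1000  # kept small so the sketch runs instantly; the argument is the same at a million

# Case A: a single collection that contains every document.
one_big = {f"/doc/{i}.xml": {"all"} for i in range(n)}

# Case B: one collection per document.
one_each = {f"/doc/{i}.xml": {f"col-{i}"} for i in range(n)}

print(collection_entries(one_big), collection_entries(one_each))
```

Both cases produce the same number of memberships, which is the quantity the paragraph above identifies as the fundamental cost. What differs is the number of distinct collection names (1 versus n).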

Collection lexicons are implemented as if there is a string range index defined for the hidden <collection> element. So for the collection lexicon the cost of a distinct collection name is no more than the cost of a distinct value in an element for which you have defined a string range index. The URI lexicon works the same way. The only difference between the URI lexicon and the collection lexicon is that a document has only one URI, but can be in many collections.
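The lexicon analogy can be sketched the same way. The Python below is a toy model under stated assumptions, not how MarkLogic stores its indexes: build_lexicon and values_with_prefix are hypothetical helpers, and a sorted list of distinct values stands in for a string range index. It shows that the collection lexicon and the URI lexicon can be built by the same routine, differing only in how many values each document contributes.

```python
from bisect import bisect_left


def build_lexicon(doc_values):
    """Toy lexicon: sorted (value, set of URIs) pairs, one pair per distinct value."""
    index = {}
    for uri, values in doc_values.items():
        for value in values:
            index.setdefault(value, set()).add(uri)
    # Values are distinct dict keys, so sorting only ever compares the strings.
    return sorted(index.items())


def values_with_prefix(lexicon, prefix):
    """Return the distinct values starting with prefix, using the sorted order."""
    keys = [value for value, _ in lexicon]
    matches = []
    for key in keys[bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


# Hypothetical documents and the collections they belong to.
docs = {
    "/a.xml": ["news", "news/2024"],
    "/b.xml": ["news", "sports"],
    "/c.xml": ["sports"],
}

# A document can contribute many collection values...
collection_lexicon = build_lexicon(docs)
# ...but exactly one URI value.
uri_lexicon = build_lexicon({uri: [uri] for uri in docs})

print(values_with_prefix(collection_lexicon, "news"))
```

In this model, each additional distinct collection name adds one more entry to the sorted list, exactly as an additional distinct value would in any other lexicon.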

We use large numbers of collections to implement features in MarkLogic. The bitemporal feature uses a distinct collection for each temporally-managed document. Every version of a temporally-managed document exists as a separate database document in that collection. So a bitemporal database with billions of temporally-managed documents would have billions of collections.

Collections are no less and no more expensive than having an extra element in your documents for each collection a document belongs to. Having many collections costs no less and no more than having many distinct values in that extra element. The collection lexicon costs no less and no more than a string range index on that extra element.

Another contributor pointed out that the collection lexicon is off by default and needs to be turned on only if you want to use cts:collections() or cts:collection-match(). As with any range index, keep an eye on your memory consumption.
