Let's Get Organized

by Dave Cassel

MarkLogic stores information as documents. Often it's useful to organize those documents, which you can do with directories or collections. MarkLogic offers three types of collections, one of which is new and one of which isn't well known.

Standard Collections

If you've worked with MarkLogic, you are probably familiar with collections. A standard collection logically groups documents together. A document can be in zero or more collections, and collections have no connection to each other. (By contrast, directories are hierarchical and a document can only be in one directory. A document's presence in a directory is reflected in its URI.)

Collections make it easier to search or delete groups of content.

One interesting aspect of collections is that a collection comes into existence when a document is put into it and ceases to exist when the last document is removed from it -- they do not exist independently of the documents that they hold. You can specify collections for a document when you insert it, or add, set, or remove a document's collections once it is in the database.
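As a minimal sketch of the operations just described, the XQuery below assigns collections at insert time and then adds, removes, and sets them. The document URI and the collection names are hypothetical. The statements are separated by semicolons so that each runs as its own transaction and sees the updates made by the previous one.

```xquery
xquery version "1.0-ml";
(: Insert a document into two collections at creation time.
   The URI and collection names are made up for illustration. :)
xdmp:document-insert(
  "/articles/organized.xml",
  <article><title>Getting organized</title></article>,
  xdmp:default-permissions(),
  ("articles", "drafts"))
;

xquery version "1.0-ml";
(: Add one more collection, keeping the existing ones. :)
xdmp:document-add-collections("/articles/organized.xml", "featured")
;

xquery version "1.0-ml";
(: Remove the document from a single collection. :)
xdmp:document-remove-collections("/articles/organized.xml", "drafts")
;

xquery version "1.0-ml";
(: Replace the document's full set of collections in one step. :)
xdmp:document-set-collections("/articles/organized.xml", "published")
;

xquery version "1.0-ml";
(: Report which collections the document is in now. :)
xdmp:document-get-collections("/articles/organized.xml")
```

If this were the only document in the database, "drafts" would stop existing as soon as the document was removed from it, since nothing else holds it open.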

You can ask for all documents in a collection, or, if you have the collection lexicon enabled, you can get a list of the current collections.
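Both lookups can be sketched in a couple of lines. The collection name "articles" is a hypothetical example, and the second statement assumes the collection lexicon is enabled on the database.

```xquery
xquery version "1.0-ml";
(: Every document in the (hypothetical) "articles" collection. :)
fn:collection("articles")
;

xquery version "1.0-ml";
(: Collection URIs read from the collection lexicon.
   This fails if the lexicon is not enabled on the database. :)
cts:collections()
```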

Standard collections are formally referred to as "unprotected collections", to distinguish them from the protected kind.

Protected Collections

Protected collections are a different animal, and not as well known. As you might guess from the name, protected collections add security to the collection itself. For instance, to insert a document into a protected collection, a user must have insert permission on that collection. Moving a document into or out of a protected collection requires permissions on both the collection and the document. A protected collection, unlike a standard collection, exists independently of whether there are any documents in it.
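For illustration, here is one way a collection might be protected from XQuery, using the security library module. Treat it as a sketch under assumptions: the collection name "confidential" and the roles "curator" and "reader" are hypothetical, the roles must already exist, and the query has to be evaluated against the Security database by a sufficiently privileged user.

```xquery
xquery version "1.0-ml";

import module namespace sec = "http://marklogic.com/xdmp/security"
  at "/MarkLogic/security.xqy";

(: Evaluate this against the Security database.
   "confidential", "curator", and "reader" are hypothetical names. :)
sec:protect-collection(
  "confidential",
  (xdmp:permission("curator", "insert"),
   xdmp:permission("curator", "update"),
   xdmp:permission("reader", "read")))
```

Once this has run, the collection exists as a security object even though no document has been placed in it yet.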

Methods of reading a document that depend on a protected collection are different, too. Anyone with read permission on a document could use fn:doc($uri) or find the document by searching. However, accessing a document using an XPath that starts with fn:collection() or by a search that includes a cts:collection-query() would not work if the specified collection was protected and the user didn't have permission on the collection itself.

This capability can be useful in a workflow situation. Consider an application with documents that get entity enrichment (finding people, organizations, and so on). Suppose we have a set of users who are allowed to edit the results (entity enrichment isn't perfect), but someone else has to approve a document before it becomes generally available. The application could use two protected collections called "refine" and "searchable". Editors would be able to update a document, but not change its collections. When an approver, who has insert and update permissions on the collections, decides that a document is ready, he or she could move it from the "refine" collection to "searchable". All users would have read permission on the "searchable" collection, and the application would use cts:collection-query("searchable") with all searches.
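A rough sketch of the two halves of that workflow follows. The first statement is what an approver's request might run to move a document from "refine" to "searchable"; the second is the kind of search the application would issue for everyone else. The document URI and the search term are hypothetical, and the sketch assumes both collections have already been protected with suitable permissions.

```xquery
xquery version "1.0-ml";
(: Run as the approver. Swap "refine" for "searchable" in a single
   update, leaving any other collections on the document untouched.
   The URI is hypothetical. :)
let $uri := "/articles/organized.xml"
let $kept := xdmp:document-get-collections($uri)[. ne "refine"]
return xdmp:document-set-collections($uri, ($kept, "searchable"))
;

xquery version "1.0-ml";
(: Application search: only approved documents are candidates. :)
cts:search(
  fn:doc(),
  cts:and-query((
    cts:collection-query("searchable"),
    cts:word-query("MarkLogic"))))
```

Setting the collections in one call keeps the move atomic, so the document is never visible in both collections or in neither.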

You can read more about protected and unprotected collections in the Search Developer's Guide.

Temporal Collections

Protected collections are used to implement one of the new features in MarkLogic 8 -- bitemporal queries. A fuller bitemporal overview is available elsewhere, but for now, know that bitemporal queries are used in highly regulated industries to find out what was known as of some particular time.

A temporal collection "is a logical grouping of temporal documents that share the same axes with timestamps defined by the same range indices." Inserting a document into a temporal collection requires using the Temporal API, through JavaScript, XQuery, or the REST API -- an attempt to insert a document into a temporal collection using the standard xdmp:document-insert() will fail.
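As a hedged sketch of what an insert through the Temporal API might look like in XQuery: the example assumes a temporal collection named "trades" has already been created, with system and valid axes backed by range indexes on the four timestamp elements shown. The collection name, URI, and element names are all hypothetical.

```xquery
xquery version "1.0-ml";

import module namespace temporal = "http://marklogic.com/xdmp/temporal"
  at "/MarkLogic/temporal.xqy";

(: "trades", the URI, and the element names are hypothetical. They
   must match how the temporal collection and its axes were set up. :)
temporal:document-insert(
  "trades",
  "/trades/1001.xml",
  <trade>
    <systemStart/>
    <systemEnd/>
    <validStart>2015-01-01T00:00:00Z</validStart>
    <validEnd>2015-12-31T23:59:59Z</validEnd>
    <amount>100</amount>
  </trade>)
```

The system-axis elements are left empty in this sketch on the assumption that MarkLogic fills in the system timestamps itself, while the valid-axis values come from the application.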

To ensure the ability to query the previous state of the database ("what did we know at this time?"), deleting a document from a temporal collection does not really delete the document. Rather, MarkLogic records the deletion as a new state of the document across the system and valid axes. The use of protected collections supports this controlled access.
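A temporal delete can be sketched the same way. The collection name "trades" and the URI are hypothetical, and the second statement simply lists what remains in the temporal collection afterwards, so you can check the result on your own system.

```xquery
xquery version "1.0-ml";

import module namespace temporal = "http://marklogic.com/xdmp/temporal"
  at "/MarkLogic/temporal.xqy";

(: Record the deletion of a (hypothetical) temporal document. :)
temporal:document-delete("trades", "/trades/1001.xml")
;

xquery version "1.0-ml";
(: Inspect what is still stored in the temporal collection. :)
fn:collection("trades")
```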