Managed triples are triples that are inserted into a database by MarkLogic automatically. Unmanaged means you take care of inserting them yourself. But what does that mean exactly, and when does it make sense to use unmanaged triples?
Triples vs Documents
When should you use managed triples, and when unmanaged? Well, to put it simply:
- Managed triples are for using MarkLogic as a triple store: load large numbers of triples, and have MarkLogic figure out how to store them.
- Unmanaged triples are for using MarkLogic as a document store, while embedding triples in those documents.
That doesn't really tell the full story, though. Before getting to that, let's take a look under the hood first...
MarkLogic introduced its Triple index and SPARQL support in version 7. It currently supports SPARQL 1.1 (pretty much entirely), which includes SPARQL Update. SPARQL code is primarily evaluated against this Triple index. This index is not very different from the other indexes. It looks for certain constructs in documents, and puts those in an index. There are no additional settings to configure, you just enable it or not.
Once enabled, it will look within any fragment for anything that matches a triple construct. The indexer currently supports triple constructs in XML format and JSON format. Below is an example of each:
Triple expressed as XML:
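A minimal sketch of the sem:triple XML serialization (the subject, predicate, and object values below are made up for illustration):

```xml
<sem:triple xmlns:sem="http://marklogic.com/semantics">
  <sem:subject>http://example.org/person/1</sem:subject>
  <sem:predicate>http://xmlns.com/foaf/0.1/name</sem:predicate>
  <sem:object datatype="http://www.w3.org/2001/XMLSchema#string">John Doe</sem:object>
</sem:triple>
```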
Triple expressed as JSON:
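The equivalent JSON shape, again with made-up values:

```json
{
  "triple": {
    "subject": "http://example.org/person/1",
    "predicate": "http://xmlns.com/foaf/0.1/name",
    "object": {
      "value": "John Doe",
      "datatype": "http://www.w3.org/2001/XMLSchema#string"
    }
  }
}
```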
You can directly load triples stored in most of the common RDF serializations. MarkLogic can parse the following RDF formats out of the box:
- RDF/XML
- Turtle
- N-Triples
- N-Quads
- N3
- TriG
- RDF/JSON
All these get parsed into internal sem:triple objects, which can be persisted in a MarkLogic database in the two formats mentioned before.
The support for all those formats opens a multitude of ways to get hold of RDF data. There are a lot of Linked (Open) Data sources on the web. Think of DBPedia, Geonames, and Open Calais, but also governmental sites like data.gov.uk, and many more. Most of them have exports readily available for download, but some also allow running ad hoc SPARQL queries against them to retrieve specific data.
Next to these sources there are many tools that can enrich your data, and which can return the enrichment information as RDF. Of particular interest are semantic tools. The Open Calais API includes a semantic enrichment service, which is a nice example. It is a public endpoint that can be used for free (with some limitations). You can post any piece of text, and in return you get RDF/XML describing the semantic enrichments found by Open Calais.
With managed triples, you load the triples yourself, but leave it to MarkLogic to wrap them in documents. MarkLogic does that by inserting XML documents into the target database, where each document contains about 100 triples. One way to do that is by loading RDF data with the function sem:rdf-load, for instance like this:
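A sketch of such a call, assuming a Turtle file at a hypothetical server file path:

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: load a Turtle file from the server file system; the path is made up :)
sem:rdf-load("/tmp/persons.ttl", "turtle")
```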
If you look into the database after running the above code, you will find XML documents with URIs starting with /triplestore/, in the collection "http://marklogic.com/semantics#default-graph". The content looks something like:
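Roughly like this (values made up; the actual documents hold batches of sem:triple elements):

```xml
<sem:triples xmlns:sem="http://marklogic.com/semantics">
  <sem:triple>
    <sem:subject>http://example.org/person/1</sem:subject>
    <sem:predicate>http://xmlns.com/foaf/0.1/name</sem:predicate>
    <sem:object datatype="http://www.w3.org/2001/XMLSchema#string">John Doe</sem:object>
  </sem:triple>
  <!-- ... up to roughly 100 sem:triple elements per document ... -->
</sem:triples>
```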
You can also use SPARQL Update as of MarkLogic 8:
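For instance along these lines, using sem:sparql-update with a hypothetical IRI and value:

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: insert one managed triple via SPARQL Update; subject and value are made up :)
sem:sparql-update('
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  INSERT DATA {
    <http://example.org/person/1> foaf:name "John Doe" .
  }
')
```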
This will result in a similar kind of XML document being written to the target database.
There is no magic in triple constructs. Insert them in any XML or JSON document, and they will get indexed. Even XML triples inside document properties will get indexed. As soon as you insert a triple in a document or property yourself, we speak of unmanaged triples, as in: not managed by MarkLogic automatically.
It doesn’t matter in which kind of document or property you insert them. It could be a large book file, with triple data embedded inline, or at the end. It could be a record-style document produced by loading delimited text with MLCP and adding some triple data into it. It could be a small document property containing just one triple, or a large one containing many triples. It makes no difference to the Triple index.
What does matter, though, is document collections. MarkLogic uses collections to represent the notion of graphs in SPARQL. Graphs are very useful for addressing subsets of triples. You could, for instance, use them to distinguish triples from different sources, triples about different topics, or triples with different quality measures. These are just a few of the many ways in which you could use graphs. Document collections serve the very same purpose, but for documents. Since all triples are persisted in documents in the database, using document collections for graphs makes a lot of sense.
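A sketch of a SPARQL query scoped to one graph (the graph IRI and predicate are hypothetical); under the hood, MarkLogic resolves the graph to the corresponding document collection:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE {
  GRAPH <http://example.org/graphs/curated> {
    ?person foaf:name ?name .
  }
}
```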
If, on the other hand, you don't use graphs in your SPARQL queries, then you don't need to worry about document collections (nor graphs); MarkLogic will simply evaluate against all triples by default, managed or unmanaged, in any graph or collection.
Some people talk about embedded triples. These are triples embedded in documents that don't have the sem:triples element as root. It is possible to manually insert documents with sem:triples as root, for instance as part of migrating triple data. However, we recommend against constructing such documents yourself, or touching those that have been created automatically. Using the semantic functions to create triple XML is less error-prone.
Additionally, MarkLogic treats any document with sem:triples as root as if it contains managed triples. This matters in particular for SPARQL Update, which only affects managed triples. Any custom changes inside sem:triples documents can get lost when MarkLogic touches those triples via SPARQL Update.
The motivation is that if you're using MarkLogic as a triple store, triples get loaded as managed triples, and can therefore be updated using SPARQL Update. If, on the other hand, you are embedding triples inside documents, you wouldn't expect your documents to be changed by SPARQL Update, and MarkLogic will not allow that. Use the document update APIs in that case.
So, don't create or touch sem:triples documents yourself. Effectively, the terms embedded and unmanaged triples are synonyms.
Managed or Unmanaged
Now that we have learned what managed and unmanaged really mean, we come to the key question: how do you store your RDF data? As triples, of course, but managed or unmanaged?
Logically it makes sense to keep information close together if it belongs together. Take for instance triples with semantic enrichment info about a particular document in your MarkLogic database. For such triples it makes a lot of sense to embed them either inside the document itself, or in its properties, meaning storing them as unmanaged triples.
This also makes it very easy to maintain the information. If you delete the document, the triples will get deleted along with it automatically, so you don’t need to worry about that.
For RDF data that comes from an entirely different source than your other data, and stands on its own, it makes a lot of sense to store it separately, as managed triples.
Or, to put it differently: as mentioned before, if you use MarkLogic as a pure triple store, you would probably use managed triples only, and have the full capabilities of SPARQL at your disposal. If you use MarkLogic as a pure document store, you would embed triples in your documents, and use SPARQL not at all, or only in a limited way.
This distinction, however, isn't always as clear-cut as you might want. The RDF data could be a mixture of generic information and document-specific information, particularly if it comes from one source. In that case you might want to embed only the document-specific triples, and store the other triples separately, probably as managed triples.
Besides, there is a lot to gain by deliberately mixing the two worlds. MarkLogic is perfectly happy having plain documents, documents with embedded triples, and managed triples all sitting next to each other, and running queries across all of them. Of particular interest are so-called combination queries.
When you have both documents (with or without embedded triples) and managed triples living next to each other within MarkLogic, you could run a search or lookup against one of the two, and use the outcome as input for a search or lookup in the second set. That is how you would perform joins in MarkLogic with plain documents as well.
This is perfectly fine. If tuned properly, each search takes less than 1/100th of a second, so doing several searches and lookups to perform some joins will hardly be noticed by end users, provided you execute all of them in one request on the server side.
However, you can also combine triple and SPARQL queries with document queries directly. These are called combination queries. The REST API endpoint to run SPARQL (/v1/graphs/sparql), as well as the internal functions to run SPARQL (sem:sparql and related), all take extra parameters to constrain the SPARQL code to documents (with triples) matching those queries.
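A minimal sketch of such a combination query, assuming a hypothetical "articles" collection; the fourth argument of sem:sparql restricts the SPARQL evaluation to triples in matching documents:

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: run SPARQL only against triples in documents that are in the
   "articles" collection and contain the word "president" :)
sem:sparql('
  SELECT ?s WHERE { ?s ?p ?o }
',
  (),
  (),
  cts:and-query((
    cts:collection-query("articles"),
    cts:word-query("president")
  ))
)
```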
The SPARQL engine simply excludes the documents that don't match the document queries, and only uses the triples from the documents that are left. That builds on top of how MarkLogic combines query terms already, so it requires very little overhead. This is ideal for embedded triples.
You can also do it the other way around, and include a so-called cts:triple-range-query within a more traditional search across documents. Note, however, that this query only filters on individual triples; it does not, for instance, take a full SPARQL statement to filter search results. It also does not apply inference rules, and only includes materialized triples.
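A sketch of mixing a word query with a triple constraint in one document search (the IRIs are made up):

```xquery
xquery version "1.0-ml";

(: find documents that contain the word "crash" AND hold a triple
   with the given predicate and object; IRIs are hypothetical :)
cts:search(fn:collection(),
  cts:and-query((
    cts:word-query("crash"),
    cts:triple-range-query(
      (),                                        (: any subject :)
      sem:iri("http://example.org/pred/topic"),
      sem:iri("http://example.org/topic/MH17"),
      "="
    )
  ))
)
```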
Also worth mentioning, though less efficient, is that you can use cts:contains within the FILTER part of SPARQL, basically allowing you to do full-text searching inside SPARQL with the full power of MarkLogic's capabilities.
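For instance along these lines, filtering SPARQL results with a word query (predicate and search term are made up for illustration):

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?s ?name
WHERE {
  ?s foaf:name ?name .
  FILTER cts:contains(?name, cts:word-query("Obama"))
}
```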
Best of both worlds
Such combination queries can take you beyond what you could achieve with only one kind of query at a time. They also allow for much more efficient calculation of search and query results.
Imagine RDF data with a time angle: "tell me what we knew about the MH17 plane crash a year ago", a perfect case for bi-temporal triples.
Or what about RDF data curated for quality: "show me all data about Barack Obama from LOD sources, but validated by approved curators", a good case for triples annotated with curation details.
Or documents with semantic enrichments as triples, with supplementary information as (potentially) managed triples: "search across all documents mentioning a US president born between 1900 and 2000".
Less obvious, but very powerful, is the fact that you can apply document permissions on triples. For managed triples you do that via graphs. Access to unmanaged triples is controlled via the document permissions on the document in which they are embedded.
More examples and details on embedding triples can be found in the Semantics Developer’s Guide.
One other aspect to consider is faceted search. MarkLogic comes with built-in functionality that can return top-values with frequency counts very fast. This leans on the document approach however, and works best with denormalized data.
The idea is that you select a set of documents: your search result. For that search result, MarkLogic can pull up values sorted on frequency directly from range indexes that you define on elements, properties, paths, etc.
With the same kind of effort MarkLogic can also pull up value combinations, also known as co-occurrences or value-tuples. For this it is important that data that belongs together, lives together in one document (or more accurately in one fragment).
Unfortunately, with managed triples you are never sure which document a triple will end up in; document boundaries don't really carry meaning for managed triples, nor is a triple necessarily stored together with other triples about the same topic. So, that won't work. That is the benefit of embedding triples: you have the opportunity to keep related triples together, and embed them in the same fragment as the other data they relate to.
It is possible to build facets on managed triples by leveraging the Triple index with a custom facet. Inside a custom facet you could run SPARQL code, or do counts on cts:triples calls. With MarkLogic 8 you could even use SPARQL aggregate functions like COUNT. Keep in mind, though, that the Triple index and SPARQL are about triples, not documents, while facets are focused on documents. What meaning will selecting such a facet value have with regard to your search result? With triples embedded inside documents, functions like sem:database-nodes will have a much clearer meaning. Also keep in mind that generating facet information using SPARQL will likely be less performant.
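A sketch of such a facet-like aggregation, counting triples per predicate with a SPARQL aggregate:

```sparql
# count triples per predicate: a facet-like breakdown over the Triple index
SELECT ?p (COUNT(?s) AS ?count)
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY DESC(?count)
```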
Embedding in documents versus properties
At the beginning of this article I mentioned that the Triple index looks for triple constructs everywhere: in both documents and properties. Storing triples in properties comes with some costs. It requires a second database fragment for each document, meaning extra storage overhead.
Constraining document searches with a properties-query also takes a slight performance hit, since MarkLogic will need to join between document fragments and properties fragments. Showing results might also mean you have to pull information from two places, which could be more cumbersome than having your triples and document content in one fragment.
The benefit, though, is that you get a clean separation between document and triples automatically. And if you are handling binary or plain-text documents, for instance, you don't have the option to embed triples other than in properties.
As soon as you start embedding your triples inside documents (or properties), you have unmanaged triples. Unmanaged triples come with a few downsides, like not being able to use SPARQL Update on them, but they open a lot of interesting possibilities that are unique to MarkLogic. No other database allows querying XML, text, JSON, binary, and RDF data in a single query statement.
Special thanks to Patrick McElwee, Eric Poilvet, Dave Cassel, John Snelson, and Stephen Buxton for their feedback and contributions!