Managed vs Unmanaged Triples

by Geert Josten

Managed triples are triples that you hand to MarkLogic, which then stores them in documents for you automatically. Unmanaged means you take care of inserting them into documents yourself. But what does that mean exactly, and when does it make sense to use unmanaged triples?

Triples vs Documents

When should you use managed triples, and when unmanaged? Well, to put it simply:

  • Managed triples are for using MarkLogic as a triple store: load large numbers of triples, and have MarkLogic figure out how to store them.
  • Unmanaged triples are for using MarkLogic as a document store, while embedding triples in those documents.

That doesn't really tell the full story though. Before getting to that, let's first have a look under the hood...

Triple Index

MarkLogic introduced its Triple index and SPARQL support in version 7. It currently supports SPARQL 1.1 (pretty much entirely), which includes SPARQL Update. SPARQL code is primarily evaluated against this Triple index. This index is not very different from the other indexes. It looks for certain constructs in documents, and puts those in an index. There are no additional settings to configure; you either enable it or you don't.

Once enabled, it will look within any fragment for anything that matches a triple construct. The indexer currently supports triple constructs in XML format and JSON format. Below is an example of each:

Triple expressed as XML:
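
As a minimal sketch, assuming the sem:triple element layout in the http://marklogic.com/semantics namespace, the following Python builds one such XML triple. The helper name triple_xml and the example subject, predicate, and object values are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SEM_NS = "http://marklogic.com/semantics"

def triple_xml(subject, predicate, obj, datatype=None):
    """Serialize one triple as a sem:triple element (hypothetical helper)."""
    # A typed literal carries a datatype attribute; an IRI object does not.
    datatype_attr = f" datatype={quoteattr(datatype)}" if datatype else ""
    return (
        f'<sem:triple xmlns:sem="{SEM_NS}">'
        f"<sem:subject>{escape(subject)}</sem:subject>"
        f"<sem:predicate>{escape(predicate)}</sem:predicate>"
        f"<sem:object{datatype_attr}>{escape(obj)}</sem:object>"
        "</sem:triple>"
    )

xml_triple = triple_xml(
    "http://example.org/people/alice",          # made-up subject IRI
    "http://xmlns.com/foaf/0.1/name",           # FOAF name predicate
    "Alice",                                    # literal object
    datatype="http://www.w3.org/2001/XMLSchema#string",
)
ET.fromstring(xml_triple)  # raises ParseError if the result is not well-formed
print(xml_triple)
```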

Triple expressed as JSON:
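
A comparable sketch for JSON, assuming the layout in which a "triple" property holds "subject", "predicate", and an "object" with "value" and "datatype". The helper name triple_json and the example values are illustrative, and the "xs:string" datatype notation is an assumption to check against the Semantics Developer's Guide.

```python
import json

def triple_json(subject, predicate, value, datatype):
    """Build a JSON triple with a typed literal as object (hypothetical helper)."""
    return {
        "triple": {
            "subject": subject,
            "predicate": predicate,
            "object": {"value": value, "datatype": datatype},
        }
    }

json_triple = triple_json(
    "http://example.org/people/alice",   # made-up subject IRI
    "http://xmlns.com/foaf/0.1/name",    # FOAF name predicate
    "Alice",
    "xs:string",                         # assumed datatype notation
)
print(json.dumps(json_triple, indent=2))
```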

Loading Triples

You can directly load triples stored in most of the common RDF serializations. MarkLogic can parse the following RDF formats out of the box:

  • RDF/XML
  • Turtle
  • RDF/JSON
  • N3
  • N-Triples
  • N-Quads
  • TriG

All these get parsed into internal sem:triple objects, which can be persisted in a MarkLogic database in the two formats mentioned before.

The support for all those formats opens a multitude of ways to get hold of RDF data. There are a lot of Linked (Open) Data sources on the web. Think of DBpedia, GeoNames, and Open Calais, but also governmental sites like data.gov.uk, and many more. Most of them have exports readily available for download, but some also allow running ad hoc SPARQL queries against them to retrieve specific data.

Next to these sources there are many tools that can enrich your data, and which can return the enrichment information as RDF. Of particular interest are semantic tools. The Open Calais API includes a semantic enrichment service, which is a nice example. It is a public endpoint that can be used for free (with some limitations). You can post any piece of text, and in return you get RDF/XML describing the semantic enrichments found by Open Calais.

Managed Triples

With managed triples, you load triples yourself, but leave it to MarkLogic to wrap them in documents. MarkLogic does that by inserting XML documents into the target database, where each document will contain about 100 triples. One way to do that is by loading RDF data with the function sem:rdf-load.

If you look into the database after such a load, you will find XML documents with database URIs starting with /triplestore/, in the collection "http://marklogic.com/semantics#default-graph". The content looks something like:
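
To give a rough idea of that layout, here is a small Python simulation that wraps triples in sem:triples documents of at most 100 triples each. It only models the batching described above; it is not MarkLogic's actual implementation, and the function name and example data are made up.

```python
import xml.etree.ElementTree as ET

SEM_NS = "http://marklogic.com/semantics"
ET.register_namespace("sem", SEM_NS)

def batch_into_documents(triples, batch_size=100):
    """Wrap triples in sem:triples documents of at most batch_size triples."""
    documents = []
    for start in range(0, len(triples), batch_size):
        root = ET.Element(f"{{{SEM_NS}}}triples")
        for subject, predicate, obj in triples[start:start + batch_size]:
            triple = ET.SubElement(root, f"{{{SEM_NS}}}triple")
            ET.SubElement(triple, f"{{{SEM_NS}}}subject").text = subject
            ET.SubElement(triple, f"{{{SEM_NS}}}predicate").text = predicate
            ET.SubElement(triple, f"{{{SEM_NS}}}object").text = obj
        documents.append(ET.tostring(root, encoding="unicode"))
    return documents

# 250 made-up triples end up in three documents: 100 + 100 + 50.
triples = [
    (f"http://example.org/item/{i}", "http://example.org/ns/index", str(i))
    for i in range(250)
]
docs = batch_into_documents(triples)
print(len(docs))
print(batch_into_documents(triples[:1])[0])  # a document holding one triple
```

The same arithmetic explains why loading tens of thousands of managed triples touches only a few hundred documents.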

You can also use SPARQL Update as of MarkLogic 8:
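
A minimal sketch of such an update, prepared in Python as an HTTP request for the REST endpoint /v1/graphs/sparql. The host, port, and example IRIs are assumptions, authentication is left out, and the request is only built, not sent.

```python
from urllib.request import Request

# Assumed host and port of a MarkLogic REST instance; adjust to your setup.
ENDPOINT = "http://localhost:8000/v1/graphs/sparql"

# SPARQL 1.1 Update statement inserting one made-up triple.
update = """\
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
INSERT DATA {
  <http://example.org/people/alice> foaf:knows <http://example.org/people/bob> .
}
"""

request = Request(
    ENDPOINT,
    data=update.encode("utf-8"),
    headers={"Content-Type": "application/sparql-update"},
    method="POST",
)
# urllib.request.urlopen(request) would send it; that needs a running
# server and credentials, so this sketch stops at building the request.
print(request.get_method(), request.full_url)
```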

This will result in a similar kind of XML document being written to the target database.

Unmanaged Triples

There is no magic in triple constructs. Insert them in any XML or JSON document, and they will get indexed. Even XML triples inside document properties will get indexed. As soon as you insert a triple in a document or property, we talk about unmanaged triples, as in, not managed by MarkLogic automatically.

It doesn’t matter in which kind of document or property you insert them. It could be a large book file, with triple data embedded inline, or at the end. It could be a record-style document produced by loading delimited text with MLCP and adding some triple data into it. It could be a small document property containing just one triple, or a large one containing many triples. It makes no difference to the Triple index.
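
As an illustration, the sketch below builds an ordinary XML document with one triple embedded in it. The article structure, element names other than the sem: ones, and the IRIs are invented for the example; only the sem:triple construct itself matters to the Triple index.

```python
import xml.etree.ElementTree as ET

SEM_NS = "http://marklogic.com/semantics"
ET.register_namespace("sem", SEM_NS)

def embed_triple(parent, subject, predicate, obj):
    """Append a sem:triple element to any parent element (hypothetical helper)."""
    triple = ET.SubElement(parent, f"{{{SEM_NS}}}triple")
    for name, value in (("subject", subject), ("predicate", predicate), ("object", obj)):
        ET.SubElement(triple, f"{{{SEM_NS}}}{name}").text = value
    return triple

# A made-up document: regular content plus a metadata section with a triple.
article = ET.Element("article")
ET.SubElement(article, "title").text = "Managed vs Unmanaged Triples"
ET.SubElement(article, "body").text = "Document content goes here."
metadata = ET.SubElement(article, "metadata")
embed_triple(
    metadata,
    "http://example.org/articles/1",
    "http://purl.org/dc/terms/subject",
    "http://example.org/topics/semantics",
)

document = ET.tostring(article, encoding="unicode")
print(document)
```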

What does matter is document collections. MarkLogic collections are used to represent the notion of Graphs in SPARQL. Graphs are very useful to address subsets of triples. You could for instance use them to distinguish triples from different sources, or triples about different topics, or triples with different quality measures. These are just a few of the many ways in which you could use Graphs. Document collections have the very same purpose, but for documents. Since all triples are persisted in documents in the database, using document collections for graphs makes a lot of sense.

If, on the other hand, you don’t use Graphs in your SPARQL queries, then you don’t need to worry about document collections (nor Graphs); MarkLogic will simply evaluate against all triples by default, managed or unmanaged, and in any graph or collection.
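
The relation between collections and graphs can be pictured with a toy in-memory model. This is not how MarkLogic evaluates SPARQL; the Document class, the select_triples function, and the curated graph IRI are all hypothetical.

```python
from dataclasses import dataclass

DEFAULT_GRAPH = "http://marklogic.com/semantics#default-graph"
CURATED_GRAPH = "http://example.org/graphs/curated"  # made-up graph IRI
KNOWS = "http://xmlns.com/foaf/0.1/knows"

@dataclass
class Document:
    uri: str
    collections: frozenset  # a document's collections double as its graphs
    triples: tuple          # (subject, predicate, object) tuples

def select_triples(documents, predicate, graph=None):
    """Match a predicate in all documents, or only in those of one graph."""
    return [
        triple
        for doc in documents
        if graph is None or graph in doc.collections
        for triple in doc.triples
        if triple[1] == predicate
    ]

documents = [
    # Stands in for a document holding managed triples.
    Document("/triplestore/example.xml", frozenset({DEFAULT_GRAPH}),
             (("http://example.org/alice", KNOWS, "http://example.org/bob"),)),
    # Stands in for a document with an embedded (unmanaged) triple.
    Document("/articles/1.xml", frozenset({CURATED_GRAPH}),
             (("http://example.org/bob", KNOWS, "http://example.org/carol"),)),
]

everything = select_triples(documents, KNOWS)               # no graph: all triples
curated = select_triples(documents, KNOWS, CURATED_GRAPH)   # one graph only
print(len(everything), len(curated))
```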

Embedded triples

Some people talk about embedded triples. These are triples embedded in documents that don't have the sem:triples element as root. It is possible to manually insert documents with sem:triples as root, for instance as part of migrating triple data. However, we recommend against constructing such documents yourself, or touching those that have been created automatically. Using semantic functions to create the triple XML is less error-prone.

Additionally, MarkLogic will treat any sem:triples document as if it contains managed triples. This applies in particular to SPARQL Update, which only affects managed triples. That does include any document with sem:triples as root. Any custom changes inside sem:triples documents can get lost when MarkLogic touches triples via SPARQL Update. 

The motivation is that if you're using MarkLogic as a triple store, triples get loaded as managed triples, and therefore can be updated using SPARQL Update. On the other hand, if you are embedding triples inside documents, you wouldn't expect your documents to be changed by SPARQL Update, and MarkLogic will not allow that. Use the document update APIs in that case.

So, don't create or touch sem:triples documents yourself. Effectively, the terms embedded and unmanaged triples are synonyms.

Managed or Unmanaged

Now that we have learned what managed and unmanaged really mean, we come to the key question: how do you store the RDF data? As triples of course, but managed or unmanaged?

Logically it makes sense to keep information close together if it belongs together. Take for instance triples with semantic enrichment info about a particular document in your MarkLogic database. For such triples it makes a lot of sense to embed them either inside the document itself, or in its properties, meaning storing them as unmanaged triples.

This also makes it very easy to maintain the information. If you delete the document, the triples will get deleted along with it automatically, so you don’t need to worry about that.

For RDF data that comes from an entirely different source than your other data, and stands on its own, it makes a lot of sense to store that separately, as managed triples.

Or to put it differently, as mentioned before: if you use MarkLogic as a pure Triple Store, you would probably use managed triples only, and have the full capabilities of SPARQL at your disposal. If you use MarkLogic as a 'pure' Document Store, you would embed triples in your documents, and not use SPARQL (or only to a very limited extent).

This distinction, however, isn't always as clear-cut as you might want. The RDF data could be a mixture of generic information and document-specific information, particularly if it comes from one source. In that case you might want to embed only the document-specific triples, and store the other triples separately, probably as managed triples.

Besides, there is a lot to gain by deliberately mixing the two worlds. MarkLogic is perfectly happy with having plain documents, documents with embedded triples, and managed triples all sitting next to each other, and with running queries across all of them. Of particular interest are so-called combination queries.

Combination Queries

When you have both documents (with or without embedded triples) and managed triples living next to each other within MarkLogic, you could run a search or lookup against one of the two, and use the outcome as input for a search or lookup in the second set. That is how you would perform joins in MarkLogic with plain documents as well.

This is perfectly fine. If tuned properly, each search would take less than 1/100th of a second, so doing several searches and lookups to perform some joins would hardly be noticed by end users, provided you execute all of them in one request on the server side.

However, you can also combine triple and SPARQL queries with document queries directly. These are called combination queries. The REST API endpoint to run SPARQL (/v1/graphs/sparql) as well as the built-in functions to run SPARQL (sem:sparql and related) all take extra parameters to constrain the SPARQL code to documents (with triples) matching those queries.

The SPARQL engine simply discards the documents that don’t match the document queries, and only uses the triples from the documents that are left. That builds on top of how MarkLogic already combines query terms, so it requires very little overhead. This is ideal for embedded triples.
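
Conceptually, a combination query works in two steps, which the toy Python model below mimics in memory. It is not the MarkLogic API; the function name, the dictionary layout of the documents, and the sample data are invented, and a simple word match stands in for a real document query.

```python
ABOUT = "http://purl.org/dc/terms/subject"

# Made-up documents, each with some text and embedded triples.
documents = [
    {
        "uri": "/articles/1.xml",
        "text": "MarkLogic stores documents and triples",
        "triples": [("http://example.org/articles/1", ABOUT,
                     "http://example.org/topics/semantics")],
    },
    {
        "uri": "/articles/2.xml",
        "text": "A recipe for apple pie",
        "triples": [("http://example.org/articles/2", ABOUT,
                     "http://example.org/topics/cooking")],
    },
]

def combination_query(documents, word, predicate):
    """Keep documents containing the word, then match triples from those only."""
    # Step 1: the document query drops every document that does not match.
    matching = [
        doc for doc in documents
        if word.lower() in doc["text"].lower().split()
    ]
    # Step 2: the triple pattern only sees triples from the remaining documents.
    return [
        triple
        for doc in matching
        for triple in doc["triples"]
        if triple[1] == predicate
    ]

results = combination_query(documents, "triples", ABOUT)
print(results)
```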

You can also do it the other way around, and include a so-called cts:triple-range-query within a more traditional search across documents. However, that query only filters on individual triples, and does not, for instance, take a full SPARQL statement to filter search results. It will also not apply inference rules, and only include materialized triples.

Also worth mentioning, but less efficient, is the fact that you can use cts:contains within the FILTER part of SPARQL, basically allowing you to do Full-Text searching inside SPARQL with the full power of MarkLogic’s capabilities.

Best of both worlds

Such combination queries could get you beyond where you could go if you could use only one kind of query at a time. They also allow for a much more efficient calculation of search and query results.

Imagine RDF data with a time angle: "tell me what we knew about the MH17 plane crash a year ago", a perfect case for bi-temporal triples.

Or what about RDF data curated for quality: "show me all data about Barack Obama from LOD sources, but validated by approved curators", a good case for triples annotated with curation details.

Or documents with semantic enrichments as triples, with supplementary information as (potentially) managed triples: "search across all documents mentioning a US president born between 1900 and 2000".

Less obvious, but very powerful, is the fact that you can apply document permissions on triples. For managed triples you do that via GRAPHs. Access to unmanaged triples is controlled via the document permissions on the document in which they are embedded.

More examples and details on embedding triples can be found in the Semantics Developer’s Guide.

Faceted search

One other aspect to consider is faceted search. MarkLogic comes with built-in functionality that can return top-values with frequency counts very fast. This leans on the document approach however, and works best with denormalized data.

The idea is that you select a set of documents: your search result. For that search result, MarkLogic can pull up values sorted on frequency directly from range indexes that you define on elements, properties, paths, etc.

With the same kind of effort MarkLogic can also pull up value combinations, also known as co-occurrences or value-tuples. For this it is important that data that belongs together, lives together in one document (or more accurately in one fragment).
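
What facets and co-occurrences compute can be sketched with plain Python counters. MarkLogic resolves these from range indexes rather than by scanning documents; the field names and sample search result here are made up.

```python
from collections import Counter
from itertools import product

# A made-up search result: each entry stands for one document (fragment).
search_result = [
    {"uri": "/articles/1.xml", "author": ["Alice"], "topic": ["semantics", "search"]},
    {"uri": "/articles/2.xml", "author": ["Alice"], "topic": ["semantics"]},
    {"uri": "/articles/3.xml", "author": ["Bob"], "topic": ["search"]},
]

def facet(documents, field):
    """Count how often each value of a field occurs in the result set."""
    return Counter(value for doc in documents for value in doc.get(field, []))

def co_occurrences(documents, first, second):
    """Count value pairs that occur together within the same document."""
    return Counter(
        pair
        for doc in documents
        for pair in product(doc.get(first, []), doc.get(second, []))
    )

topic_facet = facet(search_result, "topic")
author_topic = co_occurrences(search_result, "author", "topic")
print(topic_facet.most_common())
print(author_topic.most_common())
```

The pair counts only make sense because each author and topic value sits in the same document, which is exactly why data that belongs together has to live together.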

Unfortunately, with managed triples you can never be sure in which document a triple will end up, because that has no real meaning for managed triples, nor whether it will be stored together with other triples about the same topic. So co-occurrences won't work there. That is the benefit of embedding triples: you have the opportunity to keep related triples together, and to embed them in the same fragment as the other data they relate to.

It is possible to build facets on managed triples leveraging the Triple index with a custom facet. Inside a custom facet you could run SPARQL code, or do counts on cts:triples calls. With MarkLogic 8 you could even use SPARQL aggregate functions like count. Keep in mind though that the Triple index and SPARQL are about triples, not documents, whereas facets are focused around documents. What meaning will selecting such a facet value have regarding your search result? With triples embedded inside documents, functions like sem:database-nodes will have a much clearer meaning. Also keep in mind that generating facet information using SPARQL will likely be less performant.

Embedding in documents versus properties

At the beginning of this article I mentioned that the Triple index looks for triple constructs everywhere. It will look for those in both documents and properties. Storing triples in properties comes with some costs. It requires a second database fragment for each document, meaning extra storage overhead.

Constraining document searches with a properties-query also takes a slight performance hit, since MarkLogic will need to join between document fragments and properties fragments. Showing results might also mean you have to pull information from two places, which could be more cumbersome than having your triples and document content in one fragment.

The benefit though is that you have clean separation between document and triples automatically. And if you are handling binary or plain text documents for instance, you don’t have the option to embed triples other than by embedding them in properties.

Conclusion

As soon as you start embedding your triples inside documents (or properties) you will have unmanaged triples. Unmanaged triples come with a few downsides, like not being able to use SPARQL Update on them, but they open a lot of interesting possibilities that are unique to MarkLogic. No other database allows querying XML, Text, JSON, Binary, and RDF data in a single query statement.

Special thanks to Patrick McElwee, Eric Poilvet, Dave Cassel, John Snelson, and Stephen Buxton for their feedback and contributions!

Comments

  • Yet another aspect of managed vs. unmanaged triples is insertion speed. We have a use case where every now and then ten thousands of triples are added. Each of the triples semantically "belongs" to a different document in the database. With unmanaged triples we would update ten thousands of documents. By using managed triples only a few hundred documents (100 triples per managed triple document) are inserted into the database. Thus MarkLogic has to do less (indexing) and adding the triples is finished faster.
  • As someone with a long background in RDF, I consider this an abomination. "Logically it makes sense to keep information close together if it belongs together." "Close together" hahahaha this is so twisted and confused, please make something simple complicated and patently WRONG.
    • I hear this (initial) reaction quite often - from both sides! People who live in the RDF Universe tend to baulk at the idea of mixing triples with documents - why not do everything with triples? Similarly, people who live in the documents Universe say "why not do everything with JSON [or XML]"? And the answer is - because each data model has its strengths and weaknesses. RDF is extremely flexible; it's easy to combine datasets; you can do rich queries with SPARQL, including graph traversal; and you can use the power of inference. But it's hard to express metadata about a triple or set of triples; and every query does many joins to find what you want (filter), then many joins to (re-)constitute the entities you want to return (project). This is expensive both for the developer writing the queries, and for the engine that's executing the queries. Of course you can use indexes that specialize in joins, but a join is an inherently expensive operation, so indexes will only get you so far. They say "the proof of the pudding is in the eating". I've seen many projects get bogged down because developers want to do absolutely everything with triples. And I've seen many successful projects where the developers take a pragmatic approach and use documents for entities, and triples for facts and relationships. For example, the BBC sports web site is often held up as an example of a successful project using triples. But they used triples only for relationships ("Jamie Vardy plays for Leicester", "Leicester is in the Premier League", and so on) and documents for entities (match reports, player bios, league tables, and so on). This hybrid approach - documents + triples - is making semantic technologies mainstream (finally!) If you'd like to learn more, take a look at http://www.marklogic.com/resources/marklogic-semantics-overview/ If you have questions/comments on the presentation, drop me a line at stephen dot buxton at marklogic dot com
  • Another aspect of managed vs. unmanaged triples is document size. By using unmanaged triples your documents could become really large if you added lots of triples to one document. This would slow down opening (or filtering) that document.
    • Interesting idea. It depends somewhat on what you mean by "lots". In this hybrid model, triples embedded in a document are typically either:
        * metadata about the document or
        * facts/relationships gleaned from the document or
        * relationships between this document (this entity) and other documents (entities)
      So typically you wouldn't have billions or even millions of triples embedded in a single document. If you have dozens or hundreds of embedded triples in a document, the time taken to read that document should be very small. If you want to filter by some triple values, that's all handled in the index, and so isn't affected by overall document size.
      • By "lots" I mean a number of triples such that the document size exceeds the recommended 100KB. Thus I wouldn't put more than a few hundred triples inside one document - as you have similarly indicated. When using managed triples, MarkLogic puts no more than 100 triples inside one managed document. By "filtering" I meant the filter step of a search. I've experienced that filtering larger documents (eg 5MB) is taking some time (several hundred milliseconds).
        • Thanks for the clarification, Andreas. Embedding hundreds of triples should work well, plus I don't see a use case where you'd want to embed more triples than that. If you have one, I'd be interested to see it. RE: filtering - OK, that makes sense. I was thinking of "filtering" in the SQL sense - the stuff in the WHERE clause. Yes, if you're doing a filtered search then the filter step will take longer on bigger documents.