Questions about Semantics

Stephen Buxton
Last updated September 29, 2016

Semantic triples are less widely understood than some other data models, and combining them with documents is a capability unique to MarkLogic. This leads to some questions. Happily, Stephen Buxton has answers.

How does inferencing work in MarkLogic?

The Semantics Guide describes inferencing.

I've attached a Query Console workspace that does "Hello World" inferencing and steps you through using one of the built-in rulesets (RDFS); creating and using your own; and combining rulesets. You can do this via Java or Jena or REST too.

Query Console is an interactive web-based query development tool for writing and executing ad-hoc queries in XQuery, JavaScript, SQL and SPARQL. Query Console enables you to quickly test code snippets, debug problems, profile queries, and run administrative XQuery scripts. A workspace lets you import a set of pre-written queries. See instructions to import a workspace.

Inference is a bit tricky to get your head around - you need data (triples); an ontology that tells you about the data (also triples); and rules that define the ontology language (rulesets). It may help to watch this video of Stephen's talk at MarkLogic World in San Francisco (start at 18:50).
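To make the three ingredients concrete, here is a toy forward-chaining sketch in plain Python. It is not how MarkLogic implements inference, and it does not use any MarkLogic API. The `ex:` names, the two hand-coded rules (loosely modeled on the RDFS subclass rules), and the `infer` helper are all assumptions made up for illustration.

```python
# Toy illustration only: data + ontology + rules -> inferred triples.
# All IRIs (ex:...) and the infer() helper are hypothetical.

TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Data: facts about things.
data = {("ex:fido", TYPE, "ex:Dog")}

# Ontology: facts about the data's vocabulary (also triples).
ontology = {
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
}


def subclass_transitivity(triples):
    """If A subClassOf B and B subClassOf C, then A subClassOf C."""
    return {
        (a, SUBCLASS, c)
        for (a, p1, b) in triples if p1 == SUBCLASS
        for (b2, p2, c) in triples if p2 == SUBCLASS and b2 == b
    }


def type_propagation(triples):
    """If x is of type A and A subClassOf B, then x is of type B."""
    return {
        (x, TYPE, b)
        for (x, p1, a) in triples if p1 == TYPE
        for (a2, p2, b) in triples if p2 == SUBCLASS and a2 == a
    }


def infer(triples, rules):
    """Apply the rules repeatedly until no new triples appear (a fixpoint)."""
    known = set(triples)
    while True:
        new = set().union(*(rule(known) for rule in rules)) - known
        if not new:
            return known - set(triples)  # only the inferred triples
        known |= new


inferred = infer(data | ontology, [subclass_transitivity, type_propagation])
```

In this sketch `inferred` ends up with three triples: `ex:Dog` is a subclass of `ex:Animal`, and `ex:fido` is both an `ex:Mammal` and an `ex:Animal`. None of them was stated in the data; they follow from the data, the ontology, and the rules together.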

Is inference expensive?

In general, yes: inference is expensive no matter what tool you use. When considering inference, you should:

  • run inference over as small a set of triples as possible
    • don’t query for { ?s ?p ?o } with inference
    • paradoxically, more complicated queries will often run faster, because you're working the index to produce a smaller set of results
  • run inference with only the rules you need
    • including all possible rulesets is the inference equivalent of querying for { ?s ?p ?o }
    • this infers lots of triples that may not be useful
    • note that applying all rulesets together is more work than just applying each ruleset in turn - rulesets may be recursive, and they may interact (that is, a ruleset may infer triples that "trigger" more triples from another ruleset)
    • MarkLogic is very flexible here -- it lets you specify rulesets on a per-query basis
  • consider alternatives to inference
    • the attached workspace shows how to use property paths instead of inference
    • the downside to this is that inference is now encoded in the query, rather than in a centrally-managed ruleset
    • the upside is that the property path query is generally much more efficient, since it's very specific

(Many users of Triple Stores start off with very complex inferencing, and whittle it down as they move toward production.)
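The property-path alternative mentioned above can be illustrated with a toy sketch in plain Python. It answers one specific question ("what are the types of this subject?") by walking only the relevant edges, roughly what the SPARQL property path rdf:type/rdfs:subClassOf* expresses. The data and helper names are hypothetical, and this says nothing about how MarkLogic evaluates property paths internally.

```python
# Toy comparison: answer one question by walking a path, instead of
# inferring every triple. All names here are hypothetical.

TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

triples = {
    ("ex:fido", TYPE, "ex:Dog"),
    ("ex:tom", TYPE, "ex:Cat"),
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Cat", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
}


def objects(subject, predicate):
    """All o such that (subject, predicate, o) is in the triple set."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


def types_via_path(subject):
    """Mimics the property path rdf:type/rdfs:subClassOf* for one subject."""
    found = set()
    frontier = objects(subject, TYPE)
    while frontier:
        cls = frontier.pop()
        if cls not in found:
            found.add(cls)
            frontier |= objects(cls, SUBCLASS)
    return found


fido_types = types_via_path("ex:fido")
```

The traversal never looks at `ex:tom` or `ex:Cat`, which is the point: the work is proportional to the part of the graph the question touches. The trade-off is the one described above: the subclass logic now lives in the query code rather than in a shared ruleset.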

What can I do with a combination query?

A combination query brings together queries about triples and documents in a single search.

In MarkLogic you can do a SPARQL query and restrict the results according to the context document (the document the triples are embedded in). See the Semantics Guide for an example.

You can also search documents, and restrict the results according to the triples embedded in them.

The biggest difference between these two approaches is that the first returns solutions (the things that SPARQL returns) while the second returns documents, or parts of documents. (Side note: many people assume SPARQL returns triples. A SPARQL query returns solutions -- that is, a sequence of "rows" according to what you specify in the SELECT clause).
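The distinction between solutions and triples is easier to see in a toy pattern matcher. The sketch below is plain Python, not a SPARQL engine; the data, the `match` and `select` helpers, and the query in the comment are all hypothetical.

```python
# Toy illustration of "solutions": rows of variable bindings, not triples.
# The data, the patterns, and the select() helper are hypothetical.

triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:alice", "ex:knows", "ex:carol"),
    ("ex:carol", "ex:name", "Carol"),
]


def match(pattern, triple, bindings):
    """Try to extend bindings so that pattern matches triple; None on failure."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):            # a variable
            if result.setdefault(term, value) != value:
                return None                  # conflicts with an earlier binding
        elif term != value:                  # a constant that does not match
            return None
    return result


def select(variables, patterns):
    """Evaluate a basic graph pattern, then project the chosen variables."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            if (extended := match(pattern, triple, bindings)) is not None
        ]
    return [tuple(s[v] for v in variables) for s in solutions]


# Roughly: SELECT ?name WHERE { ex:alice ex:knows ?friend . ?friend ex:name ?name }
rows = select(["?name"], [
    ("ex:alice", "ex:knows", "?friend"),
    ("?friend", "ex:name", "?name"),
])
```

Each row is built from two different triples, and only the projected variable survives. That is why there is no single "source triple" behind a solution.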

For more examples of combination queries and inference, see the materials for the half-day workshop on MarkLogic semantics, including data, a setup script, and Query Console workspaces.

If you delete a Named Graph, will all the triples be deleted too?

It depends.

Here's another place where MarkLogic supports the standards around Triple Stores, AND provides a document store, AND provides a bridge between the two.

If you treat MarkLogic like a Triple Store, then a triple can only belong to one Named Graph; when you DROP that graph (using SPARQL Update), then all the triples in that graph will be deleted. You can also create permissions on the Named Graph, which will apply to all triples in that Named Graph.

If you treat MarkLogic like a Document Store, then Named Graphs map to MarkLogic collections. If the document containing the triple is in collection-A, then you can query Named Graph <collection-A> and find that triple. A document can be in any number of collections, and so triples can be in any number of Named Graphs. If you do an xdmp:collection-delete(), all the documents in that collection will be deleted, even if those documents belong to other collections too. See workspace collections.xml.
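The document-store behavior can be modeled in a few lines of plain Python. This is a toy model of the semantics described above, not MarkLogic code; the document URIs, collection names, and the `graph` and `collection_delete` helpers are hypothetical.

```python
# Toy model of the document-store view: named graphs map to collections.
# Document URIs, collection names, and helpers are hypothetical.

documents = {
    "/doc1.xml": {
        "collections": {"collection-A", "collection-B"},
        "triples": [("ex:s1", "ex:p", "ex:o1")],
    },
    "/doc2.xml": {
        "collections": {"collection-B"},
        "triples": [("ex:s2", "ex:p", "ex:o2")],
    },
}


def graph(name):
    """Triples visible through the named graph <name>."""
    return [
        t
        for doc in documents.values()
        if name in doc["collections"]
        for t in doc["triples"]
    ]


def collection_delete(name):
    """Delete every document in the collection, whatever else it belongs to."""
    for uri in [u for u, d in documents.items() if name in d["collections"]]:
        del documents[uri]


before = graph("collection-B")   # triples from both documents
collection_delete("collection-A")
after = graph("collection-B")    # the triple from /doc1.xml is gone too
```

Deleting collection-A removes /doc1.xml entirely, so its triple also disappears from the graph for collection-B.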

Would we ever delete a Named Graph?

A Named Graph is a convenient way to partition triples when using MarkLogic as a Triple Store only. In that case, you may well want to DROP a graph and all its contents.

Document collections are more flexible, but have slightly different semantics (see above).

Can we get the equivalent query power of Named Graphs in SPARQL in other ways? Is this as efficient as using Named Graphs?

You can get the equivalent query power of Named Graphs by doing a combination query (SPARQL + a document query), where the document query restricts results to some collection. This is exactly as efficient as querying by Named Graph in SPARQL, but more flexible.

When using SPARQL to query unmanaged triples from documents, how can we determine which document(s) those triples came from in order to retrieve the original document(s)?

Remember, SPARQL queries don’t return triples, they return solutions. So it doesn't make sense to "return the documents that the resulting triples came from". You can filter the results according to some document query with a combination query (see above). You can also go the other way and search for the documents that contain triples matching some graph pattern, using cts:triple-range-query in a document search (the second kind of combination query described above).

Can the SPARQL query return the body of the context document, without storing it as a literal in a triple, or does it require a combined query?

It requires some kind of combination query. There's no way to express in SPARQL "… and return the context document", especially since a SPARQL query returns solutions rather than triples.

However, every document in MarkLogic is addressed via a unique URI -- the "name" of the document. These URIs can be subjects or objects in triples. SPARQL can certainly return document URIs, which you can then de-reference using fn:doc().

Which MarkLogic Application API do we want to use? Java, REST, XQuery, etc?

This depends on the overall architecture of the system you are building. All of the ways of interacting with MarkLogic have access to the semantic functionality. If you are using a two-tier architecture, you'll work with XQuery or JavaScript. If you are using a three-tier architecture with the REST API, your calls will go through the semantics endpoints, possibly using the Java or Node.js Client API. If you're working with Java, you may want to use the Jena library.

Even with a three-tier architecture, at some point you may want to write some server-side code (much the way you'd write PL/SQL code in Oracle) -- then you should choose between XQuery and JavaScript, which are equivalent in terms of expressive power. If you want to access that server-side code via REST, you can write a REST extension.

With a SPARQL query, can I determine which returned triples are inferred vs not inferred?

Since you can specify the kind of inference on a per-query basis, you can run the same query with and without inference and examine the difference.
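The "run it twice and compare" approach amounts to a set difference. In the plain-Python sketch below, `run_query` is a hypothetical stand-in for however you execute the SPARQL query (Query Console, REST, Java, and so on); it is faked with canned rows so the example is self-contained.

```python
# Sketch: find inferred-only solutions by diffing two runs of one query.
# run_query is a hypothetical stand-in for a real query call; here it
# simply returns canned rows.

def run_query(with_inference):
    """Hypothetical query runner returning solutions as tuples."""
    asserted = {("ex:fido", "ex:Dog")}
    inferred = {("ex:fido", "ex:Mammal"), ("ex:fido", "ex:Animal")}
    return asserted | inferred if with_inference else asserted


def split_by_origin(run):
    """Return (asserted, inferred_only) solutions for the same query."""
    plain = set(run(with_inference=False))
    full = set(run(with_inference=True))
    return plain, full - plain


asserted_rows, inferred_rows = split_by_origin(run_query)
```

Note that the comparison is between solutions, not triples, and that a solution which is both asserted and derivable by inference shows up on the asserted side.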

How do you do a SPARQL query to get everything (of a specific type) that is somehow linked to another specific thing in the graph? Using property paths, you must know all predicates that might exist in that path. What if you don't know or don't care about those predicates?

You should look at DESCRIBE Queries. Also, take a look at sem:transitive-closure -- this is an XQuery library function (which lives in $MARKLOGIC/Modules/MarkLogic/semantics/sem-impl.xqy). If it doesn't do exactly what you want, you can copy it and make changes.
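The idea of following links without naming the predicates can be sketched as a breadth-first traversal in plain Python. This is a toy over an in-memory list, not the implementation of sem:transitive-closure; the data and helper names are hypothetical.

```python
# Toy traversal: everything of a given type reachable from a start node,
# following any predicate. Data and helper names are hypothetical.
from collections import deque

TYPE = "rdf:type"

triples = [
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:acme", "ex:locatedIn", "ex:boston"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:livesIn", "ex:paris"),
    ("ex:boston", TYPE, "ex:City"),
    ("ex:paris", TYPE, "ex:City"),
    ("ex:rome", TYPE, "ex:City"),   # not linked to ex:alice
]


def reachable(start):
    """Breadth-first search over outgoing edges, ignoring the predicate."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for s, p, o in triples:
            if s == node and p != TYPE and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen - {start}


def linked_of_type(start, cls):
    """Reachable nodes that are asserted to have the given type."""
    return {n for n in reachable(start) if (n, TYPE, cls) in triples}


cities = linked_of_type("ex:alice", "ex:City")
```

On a real graph an unbounded traversal like this can touch a very large number of nodes, so in practice you would want to cap the depth or restrict the predicates once you know more about the data.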

What are the implications of Faceted Search?

Faceted Search lets you search over documents and display a value+count alongside the search results, the way a product search on amazon.com shows you the facets for brand, color, price band, and so on. You can build semantics-driven facets by writing a custom constraint.

Should I use MarkLogic as a Triple Store only?

Yes, MarkLogic works well as a Triple Store. It supports all the major standards - SPARQL 1.1, SPARQL 1.1 Update, Graph Protocol - so it can be used anywhere a regular Triple Store is used. In addition, MarkLogic has Enterprise features such as security, ACID transactions, scale-out, HA/DR, and so on which most Triple Stores don't have. And many people find that they start out using MarkLogic as "just a Triple Store" and over time they move much of their data - the data that represents entities in the real world - into documents. It's nice to have that option!

How do I decide what to model as documents versus triples?

Data is often grouped into entities (such as Person or Article). Consider modeling most entity data as documents and modeling only some of the "attributes" of your entities as triples -- those attributes where you need to query across a graph, or up and down a hierarchy, or you need inference. You should also model the semantics of the data as triples -- for example, you may want an ontology that indicates "CA" is a state in the USA, and it's the same as "California"; that "CA" is part of the address; and so on.

For additional perspectives, you can watch David Gorbet's Escape the Matrix MarkLogic World keynote or Pete Aven and Mike Bower's Multi-Model Data Integration in the Real World.

Do you have additional questions about Semantics best practices? Ask away!
