Improving Recall with a Semantic Constraint

Micah Dubinko
Last updated December 22, 2014

If you haven't seen it, the keynote at MarkLogic World 2013 is worth a look. I was on stage demonstrating new Semantics features built into MarkLogic Server. Two of the three demos were based on MarkMail, a database of some 60 million messages, with enhanced search capabilities driven by semantics. (The third demo was a semantic application built from the ground up.)

Semantic Constraint

One of the demos showed how semantics can improve search results. I ran a search constrained with affiliation:IBM to find messages from people who, based on specific facts outside of the messages themselves, were known to be employed by IBM. This is a powerful technique, because even when a message itself bears no sign that it came from a particular company, it can still be queried as such. This is known as semantic querying.

An Example

So, in this posting, let's recreate something analogous to the 'affiliation' search. We'll use the Oscars sample application that ships with the product. To get started, create an Application Builder sample project and deploy it. We'll refer to the relevant databases as 'oscar' and 'oscar-modules' throughout. Since Application Builder ships with only a small amount of data, you may also want to run the sample Information Studio collector, which fetches the rest of the dataset.

Setup

Before we can query, we need to actually turn on the semantics index. The easiest place to do this is on the page at http://localhost:8000/appservices/. Select the oscar database and hit configure. On the page that comes up, tick the box for Semantics and wait for the yellow flash.
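
If you would rather script this step than click through the UI, the following is a minimal sketch using the Admin API from Query Console. It assumes the database is named 'oscar' as above and that you are running as a user with admin rights; it is offered as an alternative, not as the way Application Services does it internally.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumption: the content database is named "oscar" :)
let $config := admin:get-configuration()
let $config := admin:database-set-triple-index(
                 $config, xdmp:database("oscar"), fn:true())
(: Persist the change; the database reindexes in the background :)
return admin:save-configuration($config)
```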

The earlier infopanel tutorial included details on the triple data, and this demo uses the same data. Grab the triples from here and put them somewhere on your local system. Then load them via Query Console: point the target database to 'oscar' and run this:

import module namespace sem="http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";
sem:rdf-load("/path/to/oscartrips.ttl")
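
To confirm that the load worked, a quick sanity check along these lines can be run in Query Console against the 'oscar' database. This is only a sketch: it pulls back a handful of arbitrary triples, and an empty result would suggest that either the load or the index setting needs another look.

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Return a few triples as a smoke test; which ones come back is arbitrary :)
sem:sparql("select ?s ?p ?o where { ?s ?p ?o } limit 5")
```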

Application Builder: Custom Constraints

A quick refresher is in order. Application Builder has a number of built-in constraint types, for example a range constraint (using a range index) or a geo constraint (using a geo index). When an application needs to go beyond the built-in types, as this one does, it's possible to define a custom constraint that calls code you provide in order to create the query that powers the search. Defining a custom constraint requires three things: 1) adding the configuration of the custom constraint, 2) writing an XQuery module that implements the constraint, and 3) putting the XQuery module in the right place. In this tutorial, we'll do all three.

First, the constraint definition. We need to add to the query options that are defined as part of the REST API for the oscars app. Fortunately, Application Builder makes this easy to do. On the 'Search' tab, there is a button at the bottom labeled 'Custom XML Options'.

Click this and enter the following code:

<search:constraint name="workedwith">
 <search:custom facet="false">
   <search:parse apply="parse-structured" ns="http://marklogic.com/example/semconstr" 
                 at="/ext/semconstr.xqy"/>
 </search:custom>
</search:constraint>

Re-deploy the app to put the changes in place.

Second, put the following code in a new module named semconstr.xqy:

xquery version "1.0-ml";
module namespace semconstr = "http://marklogic.com/example/semconstr";
declare namespace search = "http://marklogic.com/appservices/search";
import module namespace sem = "http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";

declare default function namespace "http://www.w3.org/2005/xpath-functions";

declare function semconstr:parse-structured(
 $query-elem as element(),
 $options as element(search:options))
as cts:query
{
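   (: The constraint value typed by the user, e.g. "Alfred_Hitchcock" :)
   let $use := $query-elem/cts:text/string()
   (: Map the value directly onto a DBpedia resource IRI :)
   let $dir := sem:iri("http://dbpedia.org/resource/" || $use)
   (: ?dir is bound from outside the query; select the names of people
      starring in films by that director :)
   let $sparql := "
       prefix foaf: <http://xmlns.com/foaf/0.1/>
       prefix dbpedia: <http://dbpedia.org/ontology/>
       select ?name where {
           ?film dbpedia:director ?dir .
           ?film dbpedia:starring ?person .
           ?person foaf:name ?name.
       }"
   let $results := sem:sparql-values($sparql, map:entry("dir", $dir))
   (: Pull the ?name binding out of each solution :)
   let $names := $results ! string(map:get(., "name"))
   let $qn := fn:QName("http://marklogic.com/wikipedia", "name")
   return
       (: Match documents whose name element equals any of those names :)
       cts:element-range-query($qn, "=", $names)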
};

The structure of this code is similar to that in the infopanel tutorial. Starting with a director name, it finds the films directed by that person and the actors who starred in those films. It then passes the actors' names to a cts:element-range-query on the name element, which already has a range index in place, so the query resolves quickly.
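
When experimenting, it can help to run the SPARQL portion on its own before wiring it into the constraint. The sketch below reuses the query and binding style from the module above and can be pasted into Query Console with 'oscar' as the content source. The director IRI is hard-coded purely for illustration.

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Hard-coded example director; the module builds this IRI from user input :)
let $dir := sem:iri("http://dbpedia.org/resource/Alfred_Hitchcock")
let $sparql := "
    prefix foaf: <http://xmlns.com/foaf/0.1/>
    prefix dbpedia: <http://dbpedia.org/ontology/>
    select ?name where {
        ?film dbpedia:director ?dir .
        ?film dbpedia:starring ?person .
        ?person foaf:name ?name .
    }"
(: One string per solution: the actor names that would feed the range query :)
for $row in sem:sparql-values($sparql, map:entry("dir", $dir))
return fn:string(map:get($row, "name"))
```

If this returns nothing, the constraint will match no documents, so it is a useful first place to look when a workedwith search comes back empty.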

Lastly, you need to put this module into place where it can be executed, specifically in the Modules database used by this application. In MarkLogic 7, there exists an endpoint made specifically for this purpose. Run the following from the command line in the same directory as the semconstr.xqy file, substituting the correct admin name and password and port:

curl -X PUT --data-binary @semconstr.xqy --digest --user "admin:admin" \
    -H "Content-type: application/xquery"  "http://localhost:8008/v1/ext/semconstr.xqy"
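
One way to check that the module landed where the constraint expects it is to look for the document in the modules database. The sketch below assumes that Query Console's content source is set to 'oscar-modules' and that the module is stored under the URI /ext/semconstr.xqy, matching the at attribute in the constraint definition.

```xquery
xquery version "1.0-ml";

(: Assumption: run with the content source set to "oscar-modules".
   True if a document exists at the URI the constraint's "at" attribute names. :)
fn:doc-available("/ext/semconstr.xqy")
```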

As soon as everything is in place (and you have redeployed the app), the semantic constraints will be active. For example, you can run the following query from the searchbox:

workedwith:Alfred_Hitchcock

The underscore is required, since this application uses a simple mapping from entered constraint value to triple IRI. As with the earlier tutorial, I hope this provides a base upon which many developers can play and experiment. There's a lot of room for expansion in these techniques.
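
As one small example of such an expansion, the mapping from constraint value to IRI could normalize spaces itself. The sketch below uses a hypothetical helper, local:to-resource-iri, which is not part of the module above. It assumes the value reaches the parse function with its spaces intact, which would likely mean quoting it in the search box.

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Hypothetical helper: build a DBpedia resource IRI string from a name,
   collapsing runs of whitespace and turning single spaces into underscores :)
declare function local:to-resource-iri($value as xs:string) as xs:string
{
  "http://dbpedia.org/resource/" ||
    fn:replace(fn:normalize-space($value), " ", "_")
};

sem:iri(local:to-resource-iri("Alfred Hitchcock"))
```

In semconstr.xqy, the same expression could replace the plain concatenation that builds $dir.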
