Template Driven Extraction

Stephen Buxton and Dave Cassel
Last updated April 26, 2017

Introduction

In MarkLogic 9 you can define a relational lens over your document data, so you can query parts of your data using SQL or the new Optic API. Templates let you specify which parts of documents make up rows in a view. You can also use templates to define a semantic lens, specifying which values from a document should be added to the index as triples.

Note that the template is entirely independent of the documents – a template doesn't change your documents in any way, it just changes how the indexes (row index and triple index) are populated based on values in the documents.

Templates let you look at the data in your documents in several ways, without changing the documents themselves. A relational lens is useful when you want to let SQL-savvy users access your data and when users want to create reports and visualizations using tools that communicate using SQL. It is also useful when you want to join entities and perform aggregates across documents. A semantic lens is useful when your documents contain some data that is naturally represented as triples and queried in SPARQL.

System Requirements and Installation

For the examples in this section you need MarkLogic 9. We assume you are going to run these examples in the Documents database, using Query Console.

Example 1 – Simple Document, Relational Lens

Insert a simple document; create and validate a simple template; create a view; show the view and its rows; query using SQL

The examples are provided in both JavaScript/JSON and XQuery/XML. If you're running them choose just one -- running both will lead to duplicate results in some of the queries, because you'll have duplicated templates stored at different URIs.

We start with a text-rich XML or JSON document (your choice) with some metadata. To begin with this first example, run this in Query Console:

You can do interesting queries and searches to find this document using XQuery, Server-Side JavaScript, or one of the MarkLogic client-side APIs (REST, Java, or Node.js). See Getting Started With MarkLogic Server for some of the ways you can search in MarkLogic.

This is a rich text document, with lots of text and some structure. But it's easy to imagine a relational lens through which you could see a view of a part of this document -- say, just the metadata.

Let's create a Template that defines a view that includes the id, docUri, match-date, and league. We'll create it in-memory (as a variable), and validate it, then apply it to our match report document to see what the rows look like.

In the template, notice that there are two aspects that define what documents it should apply to. The first is the "context" property or element. This is an XPath expression that defines the lookup node that is used as a starting point for extracting values. If a document doesn't match this path, the template won't be applied.

The second is the collection property or element. The template engine will not apply this template to a document unless the document is in the right collection (or set of collections). This is a useful way to ensure that templates intended for different data sources get applied only to the correct documents. In this case, we're requiring documents to be in the source1 collection, which we applied to our sample document when we inserted it. We'll return to this idea later.

Now that you know the template is a valid one, run the template against a document to test that it extracts the rows you expect.

The result is below. Note that the return type from tde:node-data-extract is a map:map.

Now you know what rows will look like when you apply the template. If you're happy with the result, insert the template. The convenience function tde.templateInsert() / tde:template-insert() will validate the template, insert it into the Schemas database, and assign the special collection "http://marklogic.com/xdmp/tde". This will create a view, and it will cause MarkLogic to index rows from any documents matching the context.

Note that tde.templateInsert() / tde:template-insert() ensures that only valid templates are inserted, and that your template is inserted into the correct database (the Schemas database) with the correct collection ("http://marklogic.com/xdmp/tde").

Now you can see the rows extracted from the match report without specifying a Template. Just leave out the Template parameter, and MarkLogic will apply all the Templates in scope – all the templates in the Schemas database where the context matches.

Result:

When you inserted the Template in the Schemas database, you created a view. To see the view definition, run this: tde.getView("soccer", "matches").

Result (the IDs in your results will be different):

When you created the view, you told the indexer to index rows whenever a document matches the Template. So you can now insert (and update and delete) documents, and new row entries will be created in or removed from the index (which, in case you hadn't already guessed, is based on the Triple Index).

You can also query the new view using SQL. Switch to SQL mode in Query Console and run this:

select * from matches

Result:

Data Integration with Templates

Before we leave this example, let's look at what you can do when a new document arrives with the same data, but in a different format — say, with different element names, or in a different structure.

Perhaps this match report came from a different application —- it also has the information we want in the view (id, document, date, league) but in different places in the document. Notice that we put it in a different collection, to reflect that it came from a different source.

To address this new format you need a new template that specifies the same view, but with different paths to the information. Create/validate/test my-second-TDE, and insert it.

Now show all rows again.

xdmp:sql('select * from matches')

You should see rows from both documents. Congratulations! You have integrated data from two sources, in two different formats, into a single view that you can query using SQL or a composable JavaScript or XQuery API. You can even make this view available to a BI tool using ODBC.

Note: Templates that overlap

Once you get beyond your first Template, you may create Templates that overlap – you can create many Templates that apply to the same document (because the context is the same or overlapping); and you can create many Templates that populate the same view (because they apply to documents of different shapes).

Many Templates apply to one document

In the examples above, both templates have a context of "/match". The first template expects to find the ID directly under the match. The second looks for it as a property under "meta" (JSON) or as an attribute (XML). We used separate collections to ensure that the templates were applied to the right documents. What would have happened if we hadn't? By default, the template processor would have thrown an error trying to apply the first template to the second document, because it wouldn't find the data it needed. This behavior can be changed by specifying invalid-values as "ignore" for each part of the template -- but leaving it with the default "reject" value makes it easier to find errors.

Tip: when you are testing/debugging Templates, avoid <invalid-values>ignore</invalid-values> since <invalid-values>reject</invalid-values> gives you helpful error messages when something is missing or invalid.

Many Templates populate the same view

In the examples above, both $my-first-TDE and $my-second-TDE populate the view soccer.matches. In these examples, both Templates specify all columns of that view. If you create a new Template that specifies only some of those columns, it will create rows cells with the missing columns set to NULL. For that to work, the missing columns must be defined as nullable in all Templates.

Similarly, if you create a new Template that specifies columns that are not mentioned in other Templates for the same view, then all Templates must either define those columns as nullable (which requires you to peek into the future when creating Templates) or must define the view as <view-layout>sparse</view-layout>. <view-layout>sparse</view-layout> says "I don't know what new columns I may specify for this view in future Templates. Allow future Templates to define new columns, and behave as if I had defined those columns as nullable in this Template."

Summary: When you create a Template that defines a view, if you want to be able to create Templates in the future that define a different set of columns than the current (first) Template, then you must:

  1. Define the view as <view-layout>sparse</view-layout> in all Templates for that view.
    The default for view-layout is identical, which means that all Templates for that view must define an identical set of columns.
  2. If there are any columns in the first Template that may not be defined in future Templates, those columns must be defined as <nullable>true</nullable> in the first Template.
  3. If there are any columns not defined in the first Template that will be defined in a future Template, those columns must be defined as <nullable>true</nullable> in the future Template.
Query using SQL

Now that you've "extracted" rows, they are accessible via SQL, in all the usual ways – from Query Console in SQL mode; from XQuery via the xdmp:sql() function; from Server-side JavaScript using the xdmp.sql() function; and via ODBC using some tool that uses ODBC (such as a BI tool). We'll run through simple examples of all these except the last.

SQL Query: from Query Console in SQL mode
select * from matches
SQL Query: from XQuery via the xdmp:sql() function

This is the simplest possible SQL query:

xdmp:sql('select * from matches')

Note that you can write an ad hoc SQL query in the argument to xdmp:sql(). MarkLogic 9.0-1 supports SQL92 plus some popular extensions.

But there's more: you can also do a combination query — that is, you can run an ad hoc SQL query and only return rows where the originating document matches some cts.query. For example, if you want the id and league of every document that mentions "Dimitri Payet", you can get it like this:

soccer.matches.id   soccer.matches.league
1234567             Premier

That cts:query restriction can be any cts:query — so you could restrict the contributing documents by metadata values, by a full-text search, bitemporal attributes, collection, geospatial — anything that you can express in a cts:query.

Example 2 -- Making Triples

We can also use templates to index triples. Take a look at our next template:

There are a couple things to notice here. First, we're specifying both the "source1" and "source2" collections. That's because the information we care about (for this template) is found in the same place in both formats. In this case, I could have left the collection out (thus applying to all collections), but by including them I'm future-proofing the template for the day when we get a new, non-conforming format.

Instead of specifying rows, this template specifies <triples>. Looking at the <val> elements, you'll see that we have some XQuery source code (yes, we're using XQuery code for the <val> element in both the JavaScript and XQuery versions of the templates). This allows us to calculate interesting values for the triples. In this case, we're comparing the scores to record the result of the match. This ability to do calculations means that we can record facts that are not explicitly mentioned in the document. Note that we can calculate values with XQuery code for rows as well.

We've also introduced vars in this template. These place holders let us do more complex things once, then reuse the values. Here we've constructed variables with the IRIs of the home and away teams, rather than copying the string manipulation code into each part of the triple where it's needed.

As a best practice, you should validate your templates and ensure that they retrieve the values you expect before inserting them. I've done that already and gone straight to the insert here.

Having inserted the template, let's run the data extract command again:

This command specifies a document to run against, but not a template to run, so any matching template will be applied. The results give both row and triples results, showing what will go into the index.

Template Driven Extraction lets you put content directly into the indexes, without having to modify your document structure. To learn more, check out the following resources:

See Also

Comments