In MarkLogic 9 you can define a relational lens over your document data, so you can query parts of your data using SQL or the new Optic API. Templates let you specify which parts of documents make up rows in a view. You can also use templates to define a semantic lens, specifying which values from a document should be added to the index as triples.
Note that the template is entirely independent of the documents – a template doesn't change your documents in any way, it just changes how the indexes (row index and triple index) are populated based on values in the documents.
Templates let you look at the data in your documents in several ways, without changing the documents themselves. A relational lens is useful when you want to let SQL-savvy users access your data and when users want to create reports and visualizations using tools that communicate using SQL. It is also useful when you want to join entities and perform aggregates across documents. A semantic lens is useful when your documents contain some data that is naturally represented as triples and queried in SPARQL.
For the examples in this section you need MarkLogic 9. We assume you are going to run these examples in the Documents database, using Query Console.
In this first example we will: insert a simple document; create and validate a simple template; create a view; show the view and its rows; and query it using SQL.
We start with a text-rich XML or JSON document (your choice) with some metadata. To begin with this first example, run this in Query Console:
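The original code sample is not reproduced in this excerpt. A minimal sketch of such an insert might look like the following; the URI, element names, date, and report text are illustrative assumptions (the id and league values echo those that appear in the query results later):

```xquery
(: Hypothetical match report -- structure and values are illustrative.
   The fourth argument puts the document in the "source1" collection,
   which the relational-lens template will require. :)
xdmp:document-insert(
  "/match-reports/report1.xml",
  <match>
    <id>1234567</id>
    <date>2016-09-10</date>
    <league>Premier</league>
    <report>West Ham's Dimitri Payet lit up the match with a stunning
      free kick as the home side fought back for a draw...</report>
  </match>,
  xdmp:default-permissions(),
  "source1"
)
```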
This is a rich text document, with lots of text and some structure. But it's easy to imagine a relational lens through which you could see a view of a part of this document -- say, just the metadata.
Let's create a Template that defines a view that includes the match metadata, such as the league. We'll create it in-memory (as a variable), validate it, then apply it to our match report document to see what the rows look like.
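The template itself isn't reproduced in this excerpt. A sketch of what it might look like, assuming a /match root with id, date, and league children (the variable name $my-first-TDE, the scalar types, and the paths are illustrative assumptions; the soccer/matches schema and view names match those used later in the article):

```xquery
import module namespace tde = "http://marklogic.com/xdmp/tde"
  at "/MarkLogic/tde.xqy";

(: Sketch of a relational-lens template; exact paths and types are assumptions. :)
let $my-first-TDE :=
  <template xmlns="http://marklogic.com/xdmp/tde">
    <context>/match</context>
    <collections>
      <collection>source1</collection>
    </collections>
    <rows>
      <row>
        <schema-name>soccer</schema-name>
        <view-name>matches</view-name>
        <columns>
          <column>
            <name>id</name>
            <scalar-type>long</scalar-type>
            <val>id</val>
          </column>
          <column>
            <name>date</name>
            <scalar-type>date</scalar-type>
            <val>date</val>
          </column>
          <column>
            <name>league</name>
            <scalar-type>string</scalar-type>
            <val>league</val>
          </column>
        </columns>
      </row>
    </rows>
  </template>
return tde:validate($my-first-TDE)
```

tde:validate reports whether the template is well-formed and usable before you commit it to the Schemas database.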
In the template, notice that there are two aspects that define what documents it should apply to. The first is the "context" property or element. This is an XPath expression that defines the lookup node that is used as a starting point for extracting values. If a document doesn't match this path, the template won't be applied.
The second is the collection property or element. The template engine will not apply this template to a document unless the document is in the right collection (or set of collections). This is a useful way to ensure that templates intended for different data sources get applied only to the correct documents. In this case, we're requiring documents to be in the
source1 collection, which we applied to our sample document when we inserted it. We'll return to this idea later.
Now that you know the template is a valid one, run the template against a document to test that it extracts the rows you expect.
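For example, continuing in the same query where the template is bound to an in-memory variable (the variable name $my-first-TDE and the document URI are hypothetical):

```xquery
import module namespace tde = "http://marklogic.com/xdmp/tde"
  at "/MarkLogic/tde.xqy";

(: Apply the in-memory template to one document and inspect the rows
   it would produce, without touching any index. :)
tde:node-data-extract(
  fn:doc("/match-reports/report1.xml"),
  $my-first-TDE
)
```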
The result is below. Note that the return type from tde:node-data-extract is a map (map:map), with one entry per input document.
Now you know what rows will look like when you apply the template. If you're happy with the result, insert the template. The convenience function tde:template-insert() validates the template, inserts it into the Schemas database, and assigns the special collection "http://marklogic.com/xdmp/tde". Inserting the template creates the view and causes MarkLogic to index rows from any documents matching the context. Using tde:template-insert() ensures that only valid templates are inserted, and that your template lands in the correct database with the correct collection.
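A sketch of the insert call (the template URI is illustrative, and $my-first-TDE stands for the in-memory template variable from the earlier step):

```xquery
import module namespace tde = "http://marklogic.com/xdmp/tde"
  at "/MarkLogic/tde.xqy";

(: Validate and insert in one step: the template goes into the Schemas
   database under the "http://marklogic.com/xdmp/tde" collection. :)
tde:template-insert("/templates/my-first-TDE.xml", $my-first-TDE)
```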
Now you can see the rows extracted from the match report without specifying a Template. Just leave out the Template parameter, and MarkLogic will apply all the Templates in scope – every template in the Schemas database whose context (and collections) match the document.
When you inserted the Template in the Schemas database, you created a view. To see the view definition, run tde:get-view("soccer", "matches").
Result (the IDs in your results will be different):
When you created the view, you told the indexer to index rows whenever a document matches the Template. So you can now insert (and update and delete) documents, and new row entries will be created in or removed from the index (which, in case you hadn't already guessed, is based on the Triple Index).
You can also query the new view using SQL. Switch to SQL mode in Query Console and run this:
select * from matches
Data Integration with Templates
Before we leave this example, let's look at what you can do when a new document arrives with the same data, but in a different format — say, with different element names, or in a different structure.
Perhaps this match report came from a different application. It also has the information we want in the view (id, document, date, league), but in different places in the document. Notice that we put it in a different collection, to reflect that it came from a different source.
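A hypothetical second-source report might carry the same facts in a different shape; here it is JSON with the metadata nested under "meta", and it goes into the "source2" collection (all names, values, and the URI are illustrative assumptions):

```xquery
(: Same kind of data, different shape, different source collection. :)
xdmp:document-insert(
  "/match-reports/report2.json",
  xdmp:unquote('{
    "match": {
      "meta": { "id": 7654321, "date": "2016-09-17", "league": "Premier" },
      "report": "Another rich-text match report..."
    }
  }'),
  xdmp:default-permissions(),
  "source2"
)
```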
To address this new format you need a new template that specifies the same view, but with different paths to the information. Create/validate/test my-second-TDE, and insert it.
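A sketch of what such a second template might look like, under the assumption of the JSON shape above (paths like meta/id, the template URI, and the variable name are illustrative; the view name must match the first template's):

```xquery
import module namespace tde = "http://marklogic.com/xdmp/tde"
  at "/MarkLogic/tde.xqy";

(: Same soccer.matches view, but paths adjusted for the "source2" shape. :)
let $my-second-TDE :=
  <template xmlns="http://marklogic.com/xdmp/tde">
    <context>/match</context>
    <collections>
      <collection>source2</collection>
    </collections>
    <rows>
      <row>
        <schema-name>soccer</schema-name>
        <view-name>matches</view-name>
        <columns>
          <column>
            <name>id</name>
            <scalar-type>long</scalar-type>
            <val>meta/id</val>
          </column>
          <column>
            <name>date</name>
            <scalar-type>date</scalar-type>
            <val>meta/date</val>
          </column>
          <column>
            <name>league</name>
            <scalar-type>string</scalar-type>
            <val>meta/league</val>
          </column>
        </columns>
      </row>
    </rows>
  </template>
return (
  tde:validate($my-second-TDE),
  tde:template-insert("/templates/my-second-TDE.xml", $my-second-TDE)
)
```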
Now show all rows again.
xdmp:sql('select * from matches')
Note: Templates that overlap
Once you get beyond your first Template, you may create Templates that overlap – you can create many Templates that apply to the same document (because the context is the same or overlapping); and you can create many Templates that populate the same view (because they apply to documents of different shapes).
Many Templates apply to one document
In the examples above, both templates have a context of "/match". The first template expects to find the ID directly under the match. The second looks for it as a property under "meta" (JSON) or as an attribute (XML). We used separate collections to ensure that the templates were applied to the right documents. What would have happened if we hadn't? By default, the template processor would have thrown an error trying to apply the first template to the second document, because it wouldn't find the data it needed. This behavior can be changed by specifying invalid-values as "ignore" for each part of the template, but leaving it with the default "reject" value makes it easier to find errors.
<invalid-values>reject</invalid-values> gives you helpful error messages when something is missing or invalid.
Many Templates populate the same view
In the examples above, both Templates populate the view soccer.matches, and both specify all columns of that view. If you create a new Template that specifies only some of those columns, the rows it produces will have the missing columns set to NULL. For that to work, the missing columns must be defined as nullable in all Templates.
Similarly, if you create a new Template that specifies columns that are not mentioned in other Templates for the same view, then all Templates must either define those columns as nullable (which requires you to peek into the future when creating Templates) or must define the view as sparse. <view-layout>sparse</view-layout> says "I don't know what new columns I may specify for this view in future Templates. Allow future Templates to define new columns, and behave as if I had defined those columns as nullable in this Template."
Summary: When you create a Template that defines a view, if you want to be able to create Templates in the future that define a different set of columns than the current (first) Template, then you must:
- Define the view as <view-layout>sparse</view-layout> in all Templates for that view. (The default for view-layout is "identical", which means that all Templates for that view must define an identical set of columns.)
- If there are any columns in the first Template that may not be defined in future Templates, those columns must be defined as <nullable>true</nullable> in the first Template.
- If there are any columns not defined in the first Template that will be defined in a future Template, those columns must be defined as <nullable>true</nullable> in the future Template.
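The rules above can be illustrated with a row fragment that uses both settings (the column name and path are hypothetical; only view-layout and nullable are the point here):

```xquery
(: Illustrative row definition combining sparse layout with a nullable column. :)
<row xmlns="http://marklogic.com/xdmp/tde">
  <schema-name>soccer</schema-name>
  <view-name>matches</view-name>
  <view-layout>sparse</view-layout>  <!-- allow future Templates to add columns -->
  <columns>
    <column>
      <name>attendance</name>
      <scalar-type>int</scalar-type>
      <val>attendance</val>
      <nullable>true</nullable>      <!-- other Templates need not supply it -->
    </column>
  </columns>
</row>
```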
Query using SQL
Now that you've "extracted" rows, they are accessible via SQL in all the usual ways: from Query Console in SQL mode; from XQuery via the xdmp:sql() function; and via ODBC from any tool that speaks ODBC (such as a BI tool). We'll run through simple examples of all of these except the last.
SQL Query: from Query Console in SQL mode
select * from matches
SQL Query: from XQuery via the xdmp:sql() function
This is the simplest possible SQL query:
xdmp:sql('select * from matches')
Note that you can write an ad hoc SQL query in the argument to xdmp:sql(). MarkLogic 9.0-1 supports SQL92 plus some popular extensions.
But there's more: you can also do a combination query — that is, you can run an ad hoc SQL query and only return rows where the originating document matches some cts.query. For example, if you want the id and league of every document that mentions "Dimitri Payet", you can get it like this:
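A sketch of such a combination query, assuming xdmp:sql()'s optional trailing cts:query argument (the empty options and bindings arguments are placeholders):

```xquery
(: Restrict the SQL result to rows whose source documents match a
   cts:query -- here, a word query for "Dimitri Payet". :)
xdmp:sql(
  "select id, league from matches",
  (),                               (: options :)
  (),                               (: no parameter bindings :)
  cts:word-query("Dimitri Payet")   (: restrict contributing documents :)
)
```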
soccer.matches.id    soccer.matches.league
1234567              Premier
That cts:query restriction can be any cts:query, so you could restrict the contributing documents by metadata values, full-text search, bitemporal attributes, collection, or geospatial criteria: anything you can express in a cts:query.
We can also use templates to index triples. Take a look at our next template:
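The template isn't reproduced in this excerpt. A sketch of what a semantic-lens template of this kind might look like, assuming the match documents carry home and away team names (the IRIs, predicate, element paths, and variable names are all illustrative assumptions):

```xquery
import module namespace tde = "http://marklogic.com/xdmp/tde"
  at "/MarkLogic/tde.xqy";

(: Sketch of a triples template: vars build the team IRIs once,
   then the triple references them. :)
let $triples-TDE :=
  <template xmlns="http://marklogic.com/xdmp/tde">
    <context>/match</context>
    <collections>
      <collection>source1</collection>
      <collection>source2</collection>
    </collections>
    <vars>
      <var>
        <name>home-iri</name>
        <val>sem:iri("http://example.org/team/" || fn:replace(home, " ", "-"))</val>
      </var>
      <var>
        <name>away-iri</name>
        <val>sem:iri("http://example.org/team/" || fn:replace(away, " ", "-"))</val>
      </var>
    </vars>
    <triples>
      <triple>
        <subject><val>$home-iri</val></subject>
        <predicate><val>sem:iri("http://example.org/played")</val></predicate>
        <object><val>$away-iri</val></object>
      </triple>
    </triples>
  </template>
return tde:validate($triples-TDE)
```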
There are a couple of things to notice here. First, we're specifying both the "source1" and "source2" collections, because the information we care about (for this template) is found in the same place in both formats. In this case, we could have left the collections out (thus applying the template to documents in any collection), but including them future-proofs the template for the day when we get a new, non-conforming format.
Instead of specifying rows, this template specifies <triples>. Looking at the <val> elements, you'll see some XQuery source code (yes, the values are computed with XQuery expressions). We've also introduced vars in this template. These placeholders let us do more complex work once, then reuse the values. Here we've constructed variables holding the IRIs of the home and away teams, rather than copying the string-manipulation code into each part of the triple where it's needed.
As a best practice, you should validate your templates and ensure that they retrieve the values you expect before inserting them. I've done that already and gone straight to the insert here.
Having inserted the template, let's run the data extract command again:
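With no template argument, tde:node-data-extract applies every matching template in the Schemas database (the document URI below is hypothetical):

```xquery
import module namespace tde = "http://marklogic.com/xdmp/tde"
  at "/MarkLogic/tde.xqy";

(: No template argument: all matching templates in the Schemas database
   are applied, so the result shows both rows and triples. :)
tde:node-data-extract(fn:doc("/match-reports/report1.xml"))
```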
This command specifies a document to run against, but not a template to run, so any matching template will be applied. The results give both row and triples results, showing what will go into the index.
Template Driven Extraction lets you put content directly into the indexes, without having to modify your document structure. To learn more, check out the following resources: