Template Driven Extraction (TDE) lets you create a mapping (or lens) from parts of your documents to indexed rows (for querying via SQL/Optic) and/or triples (for querying via SPARQL/Optic).
To use TDE:
Not in 9.0-1.1. It’s on the Product Management team’s radar. If you’d like to communicate your thoughts to the PM team, you can go through your Account Executive or write to email@example.com.
Note that Entity Services will generate Templates given a model. Many people will use Entity Services as their starting point, where a Template is just an artifact of the model.
There is no CREATE VIEW action. When a Template is inserted, if the view doesn’t exist then it’s created.
The context element plays two important roles:
For #1 above, it’s important that the path in the context element ends with something other than a wildcard.
If you have documents with repeating elements that represent more than one row, it’s often useful to set the context to what would be the “primary key” of the rows you’re extracting.
Note that you can begin a path in <val> with “..”, so the context doesn’t need to be the root of the tree you’re extracting from.
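As a minimal sketch of that advice, the Template below sets the context to the key of a repeating element and reaches sideways and upward with “..”. The document shape (/order/line-item/sku), the schema name, and the view name are hypothetical and only there for illustration.

```xml
<!-- Sketch only: document shape, schema name, and view name are assumptions. -->
<template xmlns="http://marklogic.com/xdmp/tde">
  <!-- One row per repeating line-item; the context is the item's key. -->
  <context>/order/line-item/sku</context>
  <rows>
    <row>
      <schema-name>sales</schema-name>
      <view-name>order_lines</view-name>
      <columns>
        <column>
          <name>sku</name>
          <scalar-type>string</scalar-type>
          <!-- The context node itself -->
          <val>.</val>
        </column>
        <column>
          <name>quantity</name>
          <scalar-type>int</scalar-type>
          <!-- A sibling of the context node -->
          <val>../quantity</val>
        </column>
        <column>
          <name>order_id</name>
          <scalar-type>string</scalar-type>
          <!-- A child of the enclosing order, two levels above the context -->
          <val>../../order-id</val>
        </column>
      </columns>
    </row>
  </rows>
</template>
```

A document with three line-item elements would be expected to yield three rows, each carrying the same order_id.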
Your Template will only be applied to documents that contain some path matching the <context> element. You can further scope the effect of your Template by adding collection and/or directory scope.
We recommend using collection scoping, to avoid unintentionally indexing some data, and also to make indexing more efficient.
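A minimal sketch of collection scoping follows. The collection name, context path, schema name, and view name are hypothetical; only documents in the named collection that also match the context would be indexed by this Template.

```xml
<!-- Sketch only: collection, path, schema, and view names are assumptions. -->
<template xmlns="http://marklogic.com/xdmp/tde">
  <context>/customer/id</context>
  <!-- Limit the Template to documents in the "customers" collection. -->
  <collections>
    <collection>customers</collection>
  </collections>
  <rows>
    <row>
      <schema-name>main</schema-name>
      <view-name>customers</view-name>
      <columns>
        <column>
          <name>customer_id</name>
          <scalar-type>string</scalar-type>
          <val>.</val>
        </column>
      </columns>
    </row>
  </rows>
</template>
```

Directory scoping is expressed in a similar way, with a directories element containing one or more directory elements.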
The Template is stored in the Schemas database, with a special collection. It’s visible via e.g. Query Console.
Note: you can make a Template active (that is, make it take part in indexing) by adding it to your Schemas database using xdmp:document-insert() with the appropriate collection. However, we recommend you use the helper function tde:template-insert(): it’s simpler, and it validates the Template on insert, so you know only valid Templates exist in your Schemas database.
You can delete a Template using
Before deleting, you should disable the Template and wait for reindexing to complete, since the indexer makes use of the contents of the disabled Template to do proper cleanup. If you delete the Template without disabling, you may be left with some wasted space. See Deleting Templates in the Application Developer’s Guide.
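As a sketch of the “disable first” step, a Template can be marked as disabled with the enabled element and then re-inserted at the same URI. The context, schema, and view names below are hypothetical, and the position of the enabled element within the Template is an assumption.

```xml
<!-- Sketch only: names are hypothetical. Re-insert this at the Template's
     existing URI, wait for reindexing to finish, then delete the Template. -->
<template xmlns="http://marklogic.com/xdmp/tde">
  <context>/customer/id</context>
  <!-- Take the Template out of indexing without removing it. -->
  <enabled>false</enabled>
  <rows>
    <row>
      <schema-name>main</schema-name>
      <view-name>customers</view-name>
      <columns>
        <column>
          <name>customer_id</name>
          <scalar-type>string</scalar-type>
          <val>.</val>
        </column>
      </columns>
    </row>
  </rows>
</template>
```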
Every Template potentially applies to every document. But, since the semantics of paths vary subtly between XML and JSON, we recommend you write a separate Template for each (one for XML and one for JSON) and control them via collections and collection scoping.
It’s important to note that when you protect some information with MarkLogic built-in security – role-based document-level security or the new (in MarkLogic 9) Element-Level Security – users that don’t have access to that information via document queries cannot access it via any method. That includes projecting the data out with Templates into rows or triples and querying with SQL or SPARQL or Optic.
That said, security over rows and triples from Templates may be more restrictive than you expect – that is, there may be some information that you expect a user to be able to access from SQL or SPARQL or Optic which in fact they cannot.
The triple index (which underlies both triples and rows) does not implement ELS. So if any part of a row or triple that a Template wants to project is protected via ELS, that row or triple will not be visible, and so no unauthorized user will see that row or triple.
There’s an exception when the security at the document-level is stronger than the ELS security. In that case the Template will cause the row or triple to be indexed, and user access will be governed by document-level security.
Note that an XPath in a <val> element of the Template may go higher in the tree than the context node. So, set the context to a high-in-the-tree property such as /id and reference the rest of the properties using ../caller, ../customer and so on.
You can do some limited transformations over values in a document as you project them into the index. For example, you can find the day, month, and year at separate paths; concatenate them; and cast the result to a date. This is limited by the data, functions, and language available to you (see below).
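The date example could look roughly like the sketch below. It assumes a hypothetical document with /event/year, /event/month, and /event/day, and it assumes month and day are already zero-padded so that the concatenated string is a valid xs:date lexical value.

```xml
<!-- Sketch only: document shape, schema, and view names are assumptions. -->
<template xmlns="http://marklogic.com/xdmp/tde">
  <context>/event</context>
  <rows>
    <row>
      <schema-name>main</schema-name>
      <view-name>events</view-name>
      <columns>
        <column>
          <name>event_date</name>
          <scalar-type>date</scalar-type>
          <!-- Build "YYYY-MM-DD" from three separate paths, then cast to a date. -->
          <val>xs:date(fn:concat(year, '-', month, '-', day))</val>
        </column>
      </columns>
    </row>
  </rows>
</template>
```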
Generally, any built-in MarkLogic function that is side-effect-free. See Template Dialect and Data Transformation Functions for a definitive list.
Any data in the current document and its metadata. So for example you can call xdmp:node-uri() to access the database URI of this document.
A subset of XQuery. See Template Dialect and Data Transformation Functions for a description of the “Template dialect”. The dialect does not include loops, but it does include conditional expressions, so you could write a <val> that populates a cell or triple-part with some default value.
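A conditional default might be written as in the following sketch. The status property, the 'unknown' fallback, and the schema and view names are hypothetical; the expression relies only on an if/then/else over a relative path.

```xml
<!-- Sketch only: property, schema, and view names are assumptions. -->
<template xmlns="http://marklogic.com/xdmp/tde">
  <context>/ticket/id</context>
  <rows>
    <row>
      <schema-name>main</schema-name>
      <view-name>tickets</view-name>
      <columns>
        <column>
          <name>ticket_id</name>
          <scalar-type>string</scalar-type>
          <val>.</val>
        </column>
        <column>
          <name>status</name>
          <scalar-type>string</scalar-type>
          <!-- Use the sibling status if present, otherwise a fixed default. -->
          <val>if (../status) then ../status else 'unknown'</val>
        </column>
      </columns>
    </row>
  </rows>
</template>
```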
No. The Template dialect is a subset of XQuery.
Any *indexable path expression* is valid inside the context element. *indexable path expression* is described at Understanding Path Range Indexes.
*indexable path expression* is defined normatively by the BNF at Grammar for Index Path Expressions (which you can see and navigate graphically using https://www.bottlecaps.de/rr/ui).
To test whether some path expression is an indexable path expression, use
Yes! You can insert two Templates that populate the same view, just by specifying the same schema name and view name. For example, if I have some documents with the customer ID at /customer/ID and others with the customer ID at /record/cust_ID, create two Templates with the same schema name and view name but with different column definitions. Manage which Templates apply to which documents via the context element or, better, via collection scoping.
See the TDE tutorial for an example.
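A rough sketch of the customer ID case described above: two separate Template documents that share a schema name and view name but read the ID from different paths. The collection names and the schema and view names are hypothetical, and each Template would be inserted as its own document.

```xml
<!-- Template 1 (its own document): documents shaped /customer/ID -->
<template xmlns="http://marklogic.com/xdmp/tde">
  <context>/customer/ID</context>
  <collections>
    <collection>customers-a</collection>
  </collections>
  <rows>
    <row>
      <schema-name>main</schema-name>
      <view-name>customers</view-name>
      <columns>
        <column>
          <name>customer_id</name>
          <scalar-type>string</scalar-type>
          <val>.</val>
        </column>
      </columns>
    </row>
  </rows>
</template>

<!-- Template 2 (its own document): documents shaped /record/cust_ID -->
<template xmlns="http://marklogic.com/xdmp/tde">
  <context>/record/cust_ID</context>
  <collections>
    <collection>customers-b</collection>
  </collections>
  <rows>
    <row>
      <schema-name>main</schema-name>
      <view-name>customers</view-name>
      <columns>
        <column>
          <name>customer_id</name>
          <scalar-type>string</scalar-type>
          <val>.</val>
        </column>
      </columns>
    </row>
  </rows>
</template>
```

Both Templates define the same single column, so they are compatible with the default view layout.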
Yes! For each document, every Template with a matching context element and directory/collection scope will apply.
Once you get beyond your first Template, you may create Templates that overlap – you can create many Templates that apply to the same document (because the context is the same or overlapping); and you can create many Templates that populate the same view (because they apply to documents of different shapes).
In the examples at the TDE tutorial, you’ll see two Templates with the context “/match”, so both Templates will be applied to each of the sample documents. But one Template looks for an ID at the <id> child of <match> (which doesn’t exist in all documents) and another looks for an ID at the id attribute of <match> (which doesn’t exist in all documents). Why don’t you get any errors? Because you specified in the Template that the Template processor should not throw an error if some cell in some column could not be computed – rather, you told it to just ignore the row for that cell and carry on. You did that by adding <invalid-values>ignore</invalid-values> to the definition of each column. If you had specified <invalid-values>reject</invalid-values> for the column id, the Template processor would have thrown an error and stopped re-indexing when it couldn’t find a value for id. Reject is the default behavior.
In the examples at the TDE tutorial, both Templates populate the view soccer.matches. In these examples, both Templates specify all columns of that view. If you create a new Template that specifies only some of those columns, it will create rows with cells in the missing columns set to NULL. For that to work, the missing columns must be defined as nullable in all Templates.
Similarly, if you create a new Template that specifies columns that are not mentioned in other Templates for the same view, then all Templates must either define those columns as nullable (which requires you to peek into the future when creating Templates) or must define the view as <view-layout>sparse</view-layout>. <view-layout>sparse</view-layout> says “I don’t know what new columns I may specify for this view in future Templates. Allow future Templates to define new columns, and behave as if I had defined those columns as nullable in this Template.”
Summary: when you create a Template that defines a view, if you want to be able to create Templates in the future that define a different set of columns than the current (first) Template, then you must:
Set <view-layout>sparse</view-layout> in all Templates for that view. The default for view-layout is identical, which means that all Templates for that view must define an identical set of columns.
Set <nullable>true</nullable> in the first Template.
Set <nullable>true</nullable> in the future Template.
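As an illustration of those settings, the sketch below declares a sparse view and marks a column that other Templates might not define as nullable. The context, collection, schema, view, and column names are all hypothetical.

```xml
<!-- Sketch only: all names are assumptions. -->
<template xmlns="http://marklogic.com/xdmp/tde">
  <context>/customer/id</context>
  <collections>
    <collection>customers-v2</collection>
  </collections>
  <rows>
    <row>
      <schema-name>main</schema-name>
      <view-name>customers</view-name>
      <!-- Allow other Templates for this view to define different columns. -->
      <view-layout>sparse</view-layout>
      <columns>
        <column>
          <name>customer_id</name>
          <scalar-type>string</scalar-type>
          <val>.</val>
        </column>
        <column>
          <name>loyalty_tier</name>
          <scalar-type>string</scalar-type>
          <val>../loyalty-tier</val>
          <!-- Rows from Templates that lack this column will hold NULL here. -->
          <nullable>true</nullable>
        </column>
      </columns>
    </row>
  </rows>
</template>
```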
TDE is a general solution to the challenge of extracting triples and rows from parts of documents. It doesn’t entirely cover either RDF/XML or JSON-LD.
However, if you know how the RDF/XML or JSON-LD will be presented (i.e. if you know your input won’t use every possible syntax) you may be able to create a Template.
Try this for RDF/XML:
<template xmlns="http://marklogic.com/xdmp/tde">
  <collections>
    <collection>RDFXML</collection>
    <collection>dbpedia</collection>
  </collections>
  <path-namespaces>
    <path-namespace>
      <prefix>rdf</prefix>
      <namespace-uri>http://www.w3.org/1999/02/22-rdf-syntax-ns#</namespace-uri>
    </path-namespace>
    <path-namespace>
      <prefix>si</prefix>
      <namespace-uri>https://www.w3schools.com/rdf/</namespace-uri>
    </path-namespace>
  </path-namespaces>
  <context>/rdf:RDF/rdf:Description/*</context>
  <triples>
    <triple>
      <subject>
        <val>sem:iri(../@rdf:about)</val>
      </subject>
      <predicate>
        <val>sem:iri( xs:string( fn:node-name(.) ) )</val>
        <invalid-values>reject</invalid-values>
      </predicate>
      <object>
        <val>sem:iri(.)</val>
      </object>
    </triple>
  </triples>
</template>
There isn’t a mechanism for handling PREFIXes specifically, but the <var> element can be used to store values to save you some typing in the Template. Set a <var> for your PREFIX value, then concatenate them to attach the PREFIX to the postfix to form an IRI. See the second example in the TDE tutorial.
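A minimal sketch of the <var> approach follows. The variable name, the example.org IRIs, and the document shape are hypothetical; the point is only that the prefix is written once and concatenated where needed.

```xml
<!-- Sketch only: variable name, IRIs, and document shape are assumptions. -->
<template xmlns="http://marklogic.com/xdmp/tde">
  <context>/person/id</context>
  <vars>
    <var>
      <!-- Plays the role of a PREFIX declaration. -->
      <name>ex</name>
      <val>"http://example.org/people/"</val>
    </var>
  </vars>
  <triples>
    <triple>
      <subject>
        <!-- Prefix + local id forms the subject IRI. -->
        <val>sem:iri(fn:concat($ex, .))</val>
      </subject>
      <predicate>
        <val>sem:iri(fn:concat($ex, 'name'))</val>
      </predicate>
      <object>
        <val>xs:string(../name)</val>
      </object>
    </triple>
  </triples>
</template>
```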
Redaction is designed to work on an export of the data. If you want BI tool users to see a redacted view of your data, you should export that data with redaction rules; import the redacted version; and create a Template to work against the redacted copy.
No. But keep in mind that Templates are applied at indexing time, so more Templates means more expensive ingestion.
No. But keep in mind that more columns means more expensive queries. For best performance, keep the number of columns to a couple of dozen per view.
The underlying index for rows is the triple index. Each cell is indexed as a triple under-the-covers – the subject is the view/row; the predicate is the column; and the object is the cell value. All these row-related triples are available to SPARQL, and you’ll see them if you do a SELECT * with no restrictions.
We recommend that you never do a SELECT * with no restrictions!
You should manage triples in collections/named graphs, so every query should include at least a collection/named graph restriction.
It depends on the following settings in your Template:
<invalid-values> = “ignore” or “reject”
<nullable> = “true” or “false”
<default> may specify a default value
It also depends on the reason the path cannot be mapped to a cell. The value at that path may be “Invalid” (such as a string that cannot be cast to an integer); or it may simply be “Missing” (the path doesn’t exist in this document). Here’s the effect:
| Invalid Values | Nullability | Default | Invalid Input | Missing Input |
|---|---|---|---|---|
| ignore | nullable | no default | skip cell | skip cell |
| ignore | non-nullable | no default | skip row | skip row |
| reject | nullable | no default | rejected | skip cell |
See Columns in the SQL Data Modeling Guide.
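To make the settings concrete, the sketch below shows two columns that both use <invalid-values>ignore</invalid-values> with no default: one nullable, one not. The document shape and the schema, view, and column names are hypothetical; the comments restate the behavior described above rather than anything verified separately.

```xml
<!-- Sketch only: all names are assumptions. -->
<template xmlns="http://marklogic.com/xdmp/tde">
  <context>/reading/id</context>
  <rows>
    <row>
      <schema-name>main</schema-name>
      <view-name>readings</view-name>
      <columns>
        <column>
          <name>reading_id</name>
          <scalar-type>string</scalar-type>
          <val>.</val>
          <!-- ignore + non-nullable: a bad or missing value skips the whole row. -->
          <nullable>false</nullable>
          <invalid-values>ignore</invalid-values>
        </column>
        <column>
          <name>temperature</name>
          <scalar-type>int</scalar-type>
          <val>../temperature</val>
          <!-- ignore + nullable: a bad or missing value skips only this cell. -->
          <nullable>true</nullable>
          <invalid-values>ignore</invalid-values>
        </column>
      </columns>
    </row>
  </rows>
</template>
```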
A Template is in scope if it’s valid, sits in the Schemas database, belongs to the TDE collection (http://marklogic.com/xdmp/tde), and is not disabled.
Each Template that’s in scope will get applied to every document that matches the Template’s collection and directory scoping, and has a path that matches the Template’s context element, on insert and update (and delete).
Note: inserting a new Template may cause large scale re-indexing!
Before inserting and enabling a Template, test it (using e.g.
For best reindexing and ingestion performance, scope the Template using a restrictive <context> and/or collection and directory scoping.
xdmp:forest-counts( xdmp:forest("Documents"), ("reindex-tde-templates", "reindex-deleted-tde-templates") )