
Returning Lexicon Values using XPath Expressions

by Gary Vidal

I am often asked: how can I evaluate an XPath expression to return a lexicon of values, such as cts:uris, without pulling each document from disk? This often arises in scenarios for bulk processing documents using tools like corb or MarkLogic's Task Server to spawn processing across your cluster. When performing bulk operations, you need to ensure you can process documents that meet, or do not meet, a specific condition. Additionally, you must ensure that if processing fails, you can continue where you left off without reprocessing all documents.

For this article we will focus on a specific problem: how do I find the URIs of documents that do not have a deeply nested structure? In general, you will find this problem is not easily solvable using pure cts:query constructs ... until now.

Consider the following code, which inserts two documents with similar nested structures; the second is missing the /p:parent/p:outer/p:last element.

declare namespace p = "p";
let $doc1 :=
  <p:parent>
    <p:outer>
      <p:first>pf1</p:first>
      <p:last>pl1</p:last>
    </p:outer>
    <p:child>
      <p:inner>
        <p:first>cf1</p:first>
        <p:last>cl1</p:last>
      </p:inner>
    </p:child>
  </p:parent>

let $doc2 :=
  <p:parent>
    <p:outer>
      <p:first>pf2</p:first>
    </p:outer>
    <p:child>
      <p:inner>
        <p:first>cf2</p:first>
        <p:last>cl2</p:last>
      </p:inner>
    </p:child>
  </p:parent>

return (
  xdmp:document-insert("doc1",$doc1),
  xdmp:document-insert("doc2",$doc2)
)

Determining which URIs have this structure is very simple using an XPath expression, as below:

doc()[p:parent/p:outer/p:last]/xdmp:node-uri(.)
[Returns]
doc1

The problem with this approach is that the XPath expression runs "filtered", which requires fragments to be pulled from disk to return the URI. While this works for a small database with a few thousand records, at some point you will hit the dreaded XDMP-EXPNTREECACHEFULL error. In essence, this means you have tried to return more documents than would fit into memory for the transaction. So, putting on your MarkLogic black belt, you construct a complex cts:query using nested cts:element-query calls to simulate a path structure such as:

cts:uris((),(),
  cts:element-query(xs:QName("p:parent"),
    cts:element-query(xs:QName("p:outer"),
      cts:element-query(xs:QName("p:last"), cts:and-query(()))
  ))
)

WOW, that is complicated, and also incorrect, as it returns both doc1 and doc2. So why did this happen? The short answer is that MarkLogic resolves the cts:query "unfiltered", relying on indexes in memory rather than on the fragments themselves. To resolve this correctly, the index would have to determine that p:parent is a parent element of p:outer, which in turn has a child element of p:last. Sure, you could try to tinker with positions and proximity, but even then it may not yield the correct result. So how come we can do this in XPath, but not perform the same thing using cts:query? To answer this question, we will look deeper into a handy function called xdmp:plan. The documentation for xdmp:plan states the following:

xdmp:plan(
   $expression as item()*,
   [$maximum as xs:double?]
) as element()

Returns an XML element recording information about how the given expression will be processed by the index. The information is a structured representation of the information provided in the error log when query trace is enabled. The query will be processed up to the point of getting an estimate of the number of fragments returned by the index.

So let's dig a bit deeper into what is inside the plan by wrapping our XPath expression with the xdmp:plan function.

xdmp:plan(/p:parent/p:outer/p:last)

The output is an XML Fragment with the following information:

<qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
  <qry:info-trace>xdmp:eval("declare namespace p = &quot;p&quot;;&#10;xdmp:plan(/p:parent/p:o...", (), &lt;options xmlns="xdmp:eval"&gt;&lt;database&gt;14817900035712326498&lt;/database&gt;&lt;root&gt;c:\users\gvidal\w...&lt;/options&gt;)</qry:info-trace>
  <qry:info-trace>Analyzing path: fn:collection()/p:parent/p:outer/p:last</qry:info-trace>
  <qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
  <qry:info-trace>Step 2 is searchable: p:parent</qry:info-trace>
  <qry:info-trace>Step 3 is searchable: p:outer</qry:info-trace>
  <qry:info-trace>Step 4 is searchable: p:last</qry:info-trace>
  <qry:info-trace>Path is fully searchable.</qry:info-trace>
  <qry:info-trace>Gathering constraints.</qry:info-trace>
  <qry:info-trace>Executing search.</qry:info-trace>
  <qry:final-plan>
    <qry:and-query>
      <qry:term-query weight="0">
        <qry:key>4523426088818201359</qry:key>
        <qry:annotation>descendant(doc-root(element(p:parent),doc-kind(document)) )</qry:annotation>
      </qry:term-query>
      <qry:term-query weight="0">
        <qry:key>11698328636857559070</qry:key>
        <qry:annotation>descendant(element-child(p:parent/p:outer))</qry:annotation>
      </qry:term-query>
      <qry:term-query weight="0">
        <qry:key>17573168699309579415</qry:key>
        <qry:annotation>element-child(p:outer/p:last)</qry:annotation>
      </qry:term-query>
    </qry:and-query>
  </qry:final-plan>
  <qry:info-trace>Selected 1 fragment</qry:info-trace>
  <qry:result estimate="1"/>
</qry:query-plan>

Now, if you notice from the output above, the query is fully resolvable from indexes, as denoted by the following lines:

<qry:info-trace>Analyzing path: fn:collection()/p:parent/p:outer/p:last</qry:info-trace>
<qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
<qry:info-trace>Step 2 is searchable: p:parent</qry:info-trace>
<qry:info-trace>Step 3 is searchable: p:outer</qry:info-trace>
<qry:info-trace>Step 4 is searchable: p:last</qry:info-trace>
<qry:info-trace>Path is fully searchable.</qry:info-trace>
<qry:info-trace>Gathering constraints.</qry:info-trace>
<qry:info-trace>Executing search.</qry:info-trace>

Well, this is all well and good, but how does it resolve my issue? The simple answer is that it doesn't. But what is returned afterwards does. Once each step in the plan is resolvable, the end result is the query plan itself. Notice in the excerpt below that qry:final-plan expresses a series of qry:term-query elements, each of which defines a qry:key.

<qry:final-plan>
    <qry:and-query>
      <qry:term-query weight="0">
        <qry:key>4523426088818201359</qry:key>
        <qry:annotation>descendant(doc-root(element(p:parent),doc-kind(document)) )</qry:annotation>
      </qry:term-query>
      <qry:term-query weight="0">
        <qry:key>11698328636857559070</qry:key>
        <qry:annotation>descendant(element-child(p:parent/p:outer))</qry:annotation>
      </qry:term-query>
      <qry:term-query weight="0">
        <qry:key>17573168699309579415</qry:key>
        <qry:annotation>element-child(p:outer/p:last)</qry:annotation>
      </qry:term-query>
    </qry:and-query>
  </qry:final-plan>

These keys actually resolve to term keys in the Universal Index. Terms within the Universal Index cover both words and structure. Each term-query's annotation describes what its key represents. You will notice that the first key, descendant(doc-root(element(p:parent), doc-kind(document)) ), represents the doc() axis down to the p:parent element; the next key, descendant(element-child(p:parent/p:outer)), represents the relationship between the p:parent and p:outer elements; and the final key, element-child(p:outer/p:last), completes the path step between the p:outer and p:last elements.

Okay, this is getting more interesting, but we still have not seen how to resolve the problem. So now we are going to go into undocumented territory and hack the plan FTW.

A little-known feature outside of MarkLogic's walls is a function called cts:term-query(xs:unsignedLong), which resolves a query based on a term key. Now, if we take the keys from the plan above, we can craft a cts:query that combines all of those term keys into a single composable query. Since the result of the plan is XML, this is as simple as the following statement:

cts:uris((),(),
  cts:and-query(
    xdmp:plan(/p:parent/p:outer/p:last)//*:key/cts:term-query(.)
  )
)
[Returns]
doc1

Whoa! Is that for real? Indeed it is. So, if that is true, what else can we query using this method?

How about finding all URIs for a given root element?

cts:uris((),(),
  cts:and-query(
    xdmp:plan(/p:parent)//*:key/cts:term-query(.)
  )
)
[Returns]
doc1
doc2

What about all binary documents?

cts:uris((),(),
  cts:and-query(
    xdmp:plan(/binary())//*:key/cts:term-query(.)
  )
)
[Returns all binary document URIs]

What about all documents that don't have the /p:parent/p:outer/p:last path?

cts:uris((),(),
  cts:not-query(cts:and-query(
    xdmp:plan(/p:parent/p:outer/p:last)//*:key/cts:term-query(.)
  ))
)
[Returns]
All documents in the database except 'doc1'

Why did this not work? We wanted all documents with a p:parent that did not have the p:outer/p:last element. The short answer is that the not-query inverts the query to return all documents that did not resolve to every step in the plan, including documents that do not have a p:parent element at all. So, head scratching aside, how can we fix this?

This brings us to another neat and little-known feature: a structure called map:map. Maps are mutable key/value structures that perform extremely fast hashed insert/lookup operations. The map:map structure has been available for quite some time (since MarkLogic 5), and most lexicon functions (cts:uris, cts:element-x-values) support maps as an alternative output to flat sequences. But what is less known about these structures is that they support operators such as (+, -, *, div, mod) to mutate and combine maps together. Again, I am not giving due justice to this topic, but will revisit it in upcoming blog posts.
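As a minimal sketch of map arithmetic (the keys here are just illustrative), the - operator computes the key-wise difference of two maps:

```xquery
let $m1 := map:map()
let $m2 := map:map()
let $_ := (
  map:put($m1, "doc1", fn:true()),
  map:put($m1, "doc2", fn:true()),
  map:put($m2, "doc1", fn:true())
)
(: "-" yields a map holding the keys present in $m1 but not in $m2 :)
return map:keys($m1 - $m2)  (: returns "doc2" :)
```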

So, for the purposes of solving our original problem, we will use maps to compute the difference (map - map) of two cts:uris calls. Revisiting our original example, we wanted to return all p:parent documents that did not have the p:outer/p:last element. The solution is returned using the following code:

map:keys(
  cts:uris((), ("map"), cts:element-query(xs:QName("p:parent"), cts:and-query(())))
  -
  cts:uris((), ("map"),
    cts:and-query(xdmp:plan(/p:parent/p:outer/p:last)//*:key ! cts:term-query(.)))
)
[Returns]
doc2

Which translates to:

cts:uris((), ("map"), cts:element-query(xs:QName("p:parent"),cts:and-query(())))

Return all URIs that match /p:parent, as a map:map

- (: Notice the minus sign :)
cts:uris((), ("map"),
   cts:and-query(xdmp:plan(/p:parent/p:outer/p:last)//*:key ! cts:term-query(.)))

Return the difference (-): all URIs that match /p:parent/p:outer/p:last, as a map:map

map:keys($map1 - $map2)

The outer map:keys flattens the map back into a sequence of URI values.

Well, that was quite a lot to digest, and I am exposing quite a bit of juju and dark magic, but you can see that this provides a powerful tool in your arsenal for using MarkLogic in ways never possible before. Good luck and happy coding.

DISCLAIMER

(BTW, the techniques in this article, including the use of cts:term-query(), may or may not be sanctioned by MarkLogic and are subject to change in the product. So use at your own RISK! But hey, "No Risk, No Reward".)

MarkLogic Version Manager

by Dave Cassel

If you do development work with different versions of MarkLogic, you've probably set up virtual machines. This has the advantage of complete separation among the different versions, but it can be a hassle. Matt Pileggi, part of MarkLogic's Vanguard team, put together the MarkLogic Version Manager. mlvm is an open source tool he uses to switch among versions of MarkLogic he has installed on his laptop, without using virtual machines. 

Matt was inspired to write mlvm after using the Node Version Manager, which solves the same problem for working with multiple versions of Node.js. Right now, mlvm is a Mac-only tool, but Matt would welcome contributors to help with this or other improvements.

Matt's mlvm is a development tool that's off to a good start -- check it out on GitHub! 

10,000 Range Indexes

by Dave Cassel

There was a recent discussion on an internal mailing list asking whether you could set up 10,000 range indexes on a database. When faced with a question like this, we should step back and consider the problem we're trying to solve. The data set in question has about 1,000 entities, with an expectation that an average of 10 fields per entity would need to be indexed. This leads to the question about having 10,000 range indexes.

At first blush, this line of thought suggests relational thinking -- which is natural; that's what most of us learned first. Of course, every index has a cost, regardless of whether the database is MarkLogic, an RDBMS, or another NoSQL database. 10,000 range indexes isn't a good idea in MarkLogic, but if you find yourself planning that many, know that there's probably a better solution. 

Universal Index

The first question we should consider is whether we actually need range indexes for those 10,000 fields (elements). MarkLogic's Universal Index may provide what's needed already: indexing the terms and structure of all documents. Through the Universal Index, we can do full-text searches on any ingested content, even scoping it to particular document sections if we want. In many cases, this means we don't need to set up specific indexes to provide rapid access to particular content. 
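For example, the Universal Index alone can resolve a full-text search scoped to a particular section of a document. A sketch, where the abstract element name is purely hypothetical:

```xquery
(: Word search scoped to a hypothetical "abstract" section,
   resolved from the Universal Index with no extra indexes :)
cts:search(fn:doc(),
  cts:element-query(xs:QName("abstract"),
    cts:word-query("semantic")))
```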

Range Indexes

The Universal Index provides immediate access to text and structure. When do we need range indexes? In a search context, we use range indexes for data-type specific inequalities, such as "find me all articles published since Jan 1, 2012". By having a date range index on the publication date, we can build a greater-than-or-equal-to query. We can also use range indexes to get lists of values, enabling us to build facets. Jason Hunter's Inside MarkLogic Server lists other range index benefits. 
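That publication-date query might be sketched as follows, assuming a date range index has been configured on a hypothetical publication-date element:

```xquery
(: Inequality query backed by a date range index;
   the "publication-date" element name is an assumption :)
cts:search(fn:doc(),
  cts:element-range-query(xs:QName("publication-date"), ">=",
    xs:date("2012-01-01")))
```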

In typical applications, we want to search across many (or all) fields, but we don't need inequality comparisons or to generate thousands of facets. This means that for most applications, we'll get much of our search capability from the Universal Index and supplement with a small number of range indexes. 

Fields

In MarkLogic, a field is a structure that lets us refer to the contents of multiple elements by the same name. When we merge data from different sources, we sometimes get multiple elements that represent the same thing, but with different names. For instance, consider two book databases, where one has "published-date" and one has "pub-date". At first glance, these appear to be two separate types of data, suggesting separate range indexes. However, with MarkLogic's field feature, a single name can refer to the contents of both elements, with one type-specific index pulling values from all the elements. This is another way that the number of indexes can be reduced.
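A sketch of how that might look, assuming a field named book-date has been configured over both published-date and pub-date, with a date range index on the field:

```xquery
(: One query covering both element names via the hypothetical "book-date" field :)
cts:search(fn:doc(),
  cts:field-range-query("book-date", ">=", xs:date("2000-01-01")))
```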

Triples

Sometimes you really do want to do range queries across a wide variety of fields. In an extreme case, MarkLogic lets you represent everything as triples, allowing for inequality queries using SPARQL's FILTER or the cts:triples() function. MarkLogic's own history monitoring is built entirely with triples. More commonly, triples are used in combination with documents to produce a powerful hybrid. 
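A sketch of an inequality query over triples using SPARQL's FILTER; the predicate IRI and threshold here are purely illustrative:

```xquery
import module namespace sem = "http://marklogic.com/semantics"
  at "/MarkLogic/semantics.xqy";

(: Find subjects whose (hypothetical) price predicate exceeds 20 :)
sem:sparql('
  SELECT ?book
  WHERE {
    ?book <http://example.org/price> ?price .
    FILTER (?price > 20)
  }')
```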

Why Not 10,000 Range Indexes?

Having looked at some alternatives to setting up 10,000 range indexes, let's come back to the original question. It turns out that the answer is no: you should not attempt anything on the order of 10,000. A target cap for range indexes is about 100, and the vast majority of applications require far fewer than that. Each forest stores the indexes that relate to the content in that forest; each forest is broken into one or more stands. Each stand manages its indexes in two memory-mapped files per index. We commonly see 12 forests on a host (six master, six replica) with about 100 stands; multiply 100 stands by two files by 10,000 range indexes and we'd have two million open file handles. 

Wrap

Sometimes the transition from the relational model to the document + triples model doesn't click for a person right away, which can lead to a question like this one. If you find yourself planning to make thousands (or even hundreds) of range indexes, it's probably worth stepping back and rethinking how the data will be represented. The Universal Index is really powerful -- let it do what it does best! Then, for cases the Universal Index doesn't satisfy, apply fields, range indexes, and triples as needed. 
