Learning with xdmp:query-trace()

by Evan Lenz

One of the things I love to do is learn and help other people learn. I'm still relatively new to MarkLogic, so there's a lot I can't really write about, since I haven't learned it yet. But as long as I keep my learning one step ahead of my writing, then I (and you readers!) should be safe and not wildly misled.

One tool I've been using to learn how MarkLogic evaluates queries is the xdmp:query-trace() function. It has helped me understand why a query runs fast (or slow), and I've also used it to make sure I'm not smokin' crack as I'm about to write a response to the Developer discussion list or write another blog post.

As an example, in my last post on Good XML design and performance, I claimed that the following query would run fast, by leveraging MarkLogic's Universal Index:

//group[@type eq 'widget']

It sure seems like it should be fast, and based on what I had read about MarkLogic internals, it certainly sounds like it would. But just to be extra paranoid, I ran a test in CQ. The first step was to generate some sample data. I did that with the following query:

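(: Insert 300 small documents, cycling through three @type values :)
for $n in (1 to 300) return
xdmp:document-insert(concat("/group",$n,".xml"),
  document {
    (: $pos cycles through 1, 2, 3, so each type gets 100 documents :)
    let $pos := ($n mod 3) + 1
    let $type := ("widget","person","place")[$pos] return
    <group type="{$type}">stuff</group>
  }
)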

A third of the documents will contain a <group> with type="widget", a third with type="person", and a third with type="place". After loading the documents, I ran my test query in conjunction with xdmp:query-trace():

xdmp:query-trace(true()),
//group[@type eq 'widget']

Passing true() to xdmp:query-trace() tells the server to write information to the error log about how it plans to evaluate any searchable expressions in the code that follows: specifically, which constraints it uses and how many fragments it selects from the index for filtering. I wanted to make sure that MarkLogic would retrieve only the documents I was interested in. If it selected 300 fragments (all the docs I loaded), it would have to look inside each one before filtering out two-thirds of them (the ones whose @type value is something other than "widget"). The number I wanted to see was 100 (just the "widget" ones). Looking in the error log, this is what I saw (not including the timestamp and line number info):

Analyzing path: fn:collection()/descendant::group[@type eq "widget"]
Step 1 is searchable: fn:collection()
Step 2 is searchable: descendant::group[@type eq "widget"]
Path is fully searchable.
Gathering constraints.
Comparison contributed hash value constraint: group/@type = "widget"
Step 2 predicate 1 contributed 1 constraint: @type eq "widget"
Comparison contributed hash value constraint: group/@type = "widget"
Step 2 predicate 1 contributed 1 constraint: @type eq "widget"
Step 2 contributed 2 constraints: descendant::group[@type eq "widget"]
Executing search.
Selected 100 fragments to filter

Fortunately, from this I could tell that the index magic was indeed doing its job, since it only selected 100 fragments (documents)—the ones that contain "widget". And I could see that the XPath predicate, @type eq 'widget', is successfully interpreted as a constraint that can be resolved from the index. Yay! I could write with confidence.
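For contrast, it can help to trace a predicate that is less obviously resolvable from the index. The following is only a sketch: the string-length() predicate is a hypothetical example that is not from my original test, and I am not showing log output for it. The thing to compare is the "Selected N fragments to filter" line. With only the sample data loaded, the query itself returns 200 elements ("widget" and "person" are both six characters long), so a selected count above 200 would mean the filter had to discard fragments the index could not rule out. The sketch also passes false() to turn tracing back off afterward.

```xquery
xquery version "1.0-ml";

(: Turn tracing on just for the expression under test. :)
xdmp:query-trace(true()),

(: Hypothetical contrast case: the predicate calls a function on the
   attribute value instead of comparing the value directly. :)
//group[string-length(@type) eq 6],

(: Turn tracing back off so later expressions are not logged. :)
xdmp:query-trace(false())
```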

A similar question came up for me on the Developer list today. Again, being paranoid, I used xdmp:query-trace() to make sure what I was about to say was correct. Here's the query I used to generate some sample data (very similar to the above one):

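(: Insert 300 small documents, cycling through three @host values :)
for $n in (1 to 300) return
xdmp:document-insert(concat("/logfile",$n,".xml"),
  document {
    (: $pos cycles through 1, 2, 3, giving host1, host2, host3 :)
    let $pos := ($n mod 3) + 1
    let $host := concat("host",$pos) return
    <logfile host="{$host}"/>
  }
)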

Here's the test query:

xdmp:query-trace(true()),
//logfile[@host eq 'host1']

And here's the line I saw (and was hoping to see) at the end of the Error Log:

Selected 100 fragments to filter

Because of the tiny data size, the above two examples would be fast regardless of what constraint I used (resolvable from the index or not). But when I'm dealing with millions of documents, I want to make sure that I'm effectively using the index. Using a small test data set with xdmp:query-trace() is one way to find out whether the index is being leveraged effectively, and thus whether my queries will scale.
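Another quick check along the same lines is to compare the index-only count with the fully evaluated count. This is a minimal sketch that assumes the sample logfile documents above are the only matching data in the database. xdmp:estimate() reports how many fragments the index selects without filtering them, while fn:count() evaluates the expression in full. If the two numbers agree, that suggests the index alone is resolving the predicate.

```xquery
xquery version "1.0-ml";

(: First item: fragments selected from the index, with no filtering pass.
   Second item: results after the expression is fully evaluated. :)
(
  xdmp:estimate(//logfile[@host eq 'host1']),
  count(//logfile[@host eq 'host1'])
)
```

Note that xdmp:estimate() counts fragments rather than matching nodes, so the comparison is only this direct when each document holds a single matching element, as in the sample data.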

Experimenting with xdmp:query-trace() (and the related xdmp:plan() function) is a great way to learn from the "bottom up". For "top-down" learning, I highly recommend Jason Hunter's paper "Inside MarkLogic". Also, if you want to learn some more from direct experience, check out this tutorial on how to write fast queries.
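Since xdmp:plan() gets only a passing mention above, here is a minimal sketch of how it might be used with the first test query. Rather than writing to the error log, it returns an XML element describing how the index will process the expression, so the result shows up directly in the query console. I have not reproduced its output here, since the details depend on the server version and index settings.

```xquery
xquery version "1.0-ml";

(: Returns an XML description of the query plan as the result,
   instead of logging trace messages to the error log. :)
xdmp:plan(//group[@type eq 'widget'])
```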

What about you? What functions or tools have you found helpful for learning MarkLogic? (Feel free to comment below.)
