
Result Streams from Range Indexes

by Bradley Mann

The other day I needed to write an XQuery script to collect all the values from a range index and group them into "sets" of 1000. I ended up with something like this:
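
(A minimal sketch of the shape of that script, assuming a range index on an element named sample:value; the <group> wrapper is just a stand-in for whatever consumes each set.)

    xquery version "1.0-ml";

    declare namespace sample = "http://example.com/sample";  (: assumed namespace :)

    let $groupsize := 1000
    let $values := cts:element-values(xs:QName("sample:value"))
    let $total := fn:count($values)
    for $i in (0 to xs:integer(fn:ceiling($total div $groupsize)) - 1)
    let $group := $values[(($i * $groupsize) + 1) to (($i + 1) * $groupsize)]
    return <group n="{$i + 1}">{ $group }</group>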

This query performs fine on a small set of values (5000), but when we increase the number of values pulled from the range index, we see that this call

let $group := $values[(($i * $groupsize) + 1) to (($i + 1) * $groupsize)]

quickly becomes the long pole in the tent. In fact, for a sample size of 50,000 values (50 groups), 91% of the execution time is spent in this one call: 2.3 seconds for just 50 calls. Increase the sample size beyond 1,000,000 and it's clear that this query will no longer run in a reasonable amount of time. So what's going on here? Shouldn't sequence accesses be lightning fast?

As it turns out, our cts:element-values() call isn't doing exactly what one might initially think. Rather than returning an in-memory sequence of values, it actually returns a stream, which is loaded lazily as needed. This optimization limits memory use in the (common) situations where you don't need the entire sequence. In my case, though, it doesn't help: 50 sequence accesses are actually 50 stream accesses, and each one streams the results from the first item, so grabbing items near the end of the stream takes longer.

To get around this issue, there's a handy function called xdmp:eager(), which avoids lazy evaluation. But there's another easy "trick" that will reliably ensure you're working with an in-memory sequence rather than a stream: simply drop into a "sub-FLWOR" statement to generate a sequence from the return value:

let $values := for $i in cts:element-values(xs:QName("sample:value")) return $i

Now, $values is no longer a stream, but rather an in-memory sequence. Accessing subsequences within it is now much faster, particularly values at the end of the sequence.
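
For completeness, the xdmp:eager() route mentioned above looks something like this and should materialize the sequence in the same way:

    let $values := xdmp:eager(cts:element-values(xs:QName("sample:value")))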

This made my day.

Recursive Descent in XQuery

by David Cassel

This post covers a technique that’s an oldie but a goodie, with some thoughts on how it applies with today’s MarkLogic features. I reviewed this with my team recently and we thought it would make a good reference. The post will cover both some available implementations and the raw technique itself, and when to use each of them.

It’s a common problem: you’ve got some XML in one format and you need to change it. With MarkLogic, you can make updates to stored documents using functions like xdmp:node-insert-child(), but when you want to update nodes that you have in memory, you need to turn to a different technique.

XQuery versus XSLT

Some of you reading this will be thinking, “that’s easy, just apply an XSL transform” — and you’re right, that’s a good way to do it, if you know XSLT. Personally, I learned XQuery first and never learned XSLT. There’s an XQuery wikibook where others have written up their thoughts on how XQuery compares to XSLT; I’ll refer rather than rehash. For me, it’s a simple matter of already having a tool that does the job well, so I spent my time learning other stuff.

Typeswitch

The essential tool for transforming XML with XQuery is a recursive function built around a typeswitch. If you haven’t encountered it before, a typeswitch is XQuery’s version of the switch statement we know from Java and other languages. The XQuery wikibook has a pretty good page showing the technique.

typeswitch.xqy shows the essence of the technique. Note that it doesn’t yet make any changes; this is just showing the mechanics. The cases in the typeswitch check for some type of node. When there’s a match, it returns something. To transform an element, we create a new element with the desired change, and then (typically) recursively call the function on the element’s children. For any node that doesn’t match (such as a text node), we just return the node.
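
Here’s a minimal sketch along those lines (the change-me input is just sample data):

    xquery version "1.0-ml";

    declare function local:change($node as node()) as node()
    {
      typeswitch ($node)
        case element() return
          (: rebuild the element, copy its attributes, recurse into children :)
          element { fn:node-name($node) } {
            $node/@*,
            for $child in $node/node()
            return local:change($child)
          }
        default return
          (: text nodes, comments, PIs: pass through unchanged :)
          $node
    };

    let $input :=
      <root>
        <change-me>some text</change-me>
      </root>
    return local:change($input)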

With this approach, we can change namespaces and local names, add or remove children, and change the text content of an element. Note that we can’t write a case to match an attribute, so we make attribute changes by matching the element the attribute belongs to.
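
For instance, a case like the following (element and attribute names invented) rewrites an attribute by matching the element that carries it:

    case element(item) return
      element item {
        attribute status { "updated" },   (: set the new attribute value :)
        $node/@* except $node/@status,    (: keep all the other attributes :)
        for $child in $node/node()
        return local:change($child)
      }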

Adding a Namespace

Let’s make this more interesting by adding a namespace to the change-me element.
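
(A sketch of what that looks like; the namespace URI here is made up.)

    declare function local:change($node as node()) as node()
    {
      typeswitch ($node)
        case element(change-me) return
          (: same rebuild, but placing the element in a new namespace :)
          element { fn:QName("http://example.com/new", fn:local-name($node)) } {
            $node/@*,
            for $child in $node/node()
            return local:change($child)
          }
        case element() return
          element { fn:node-name($node) } {
            $node/@*,
            for $child in $node/node()
            return local:change($child)
          }
        default return $node
    };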

We’ve added a case that targets the change-me element. Note that order matters: the first matching case wins. The new case is similar to our general element() case, but we’re modifying the namespace as we create the new element.

This highlights an important aspect of the technique: we are creating a whole new XML node, a modified copy of the original. We are not modifying in place. More on why that’s important in a bit.

Enter FunctX

FunctX is a collection of XQuery functions covering a range of common needs. A copy of the library is distributed with MarkLogic, so here’s the same change as above, using an off-the-shelf implementation:
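
(A sketch of that call, assuming the copy MarkLogic ships at the path below; verify it against your installation. Note that functx:change-element-ns-deep() moves the entire subtree it is given into the new namespace, so here we hand it just the change-me element and rebuild the wrapper around it.)

    import module namespace functx = "http://www.functx.com"
      at "/MarkLogic/functx/functx-1.0-nodoc-2007-01.xqy";

    let $input :=
      <root>
        <change-me>some text</change-me>
      </root>
    return
      element root {
        functx:change-element-ns-deep($input/change-me,
          "http://example.com/new", "")
      }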

That was easy. So if we have an existing function that does what we need, why bother looking at the recursive descent code?

First, it’s useful to understand the implications of what the libraries you are using are doing. That lets you make informed decisions about when to use them.

Second, building on the first, suppose you have a much bigger XML node to start with, and you’re going to be running the transform on a lot of them. So far, you’ll still want to use FunctX. But now suppose you need to make multiple changes that can’t be handled by a single one of those functions. You’ll want to consolidate them into one pass through the XML structure.

Multiple Changes

Here’s another version of our local:change function, this time adding a count increase and redaction (no, I can’t think of why you’d update a count in a non-persisted transformation; work with me here):
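
(A sketch of such a version; the count and secret element names, and what redaction means here, are invented for illustration.)

    declare function local:change($node as node()) as node()
    {
      typeswitch ($node)
        case element(change-me) return
          element { fn:QName("http://example.com/new", fn:local-name($node)) } {
            $node/@*,
            for $child in $node/node()
            return local:change($child)
          }
        case element(count) return
          (: hypothetical: increment the count as we copy :)
          element count { $node/@*, xs:integer($node) + 1 }
        case element(secret) return
          (: hypothetical: redact the element's content :)
          element secret { $node/@*, "[REDACTED]" }
        case element() return
          element { fn:node-name($node) } {
            $node/@*,
            for $child in $node/node()
            return local:change($child)
          }
        default return $node
    };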

This illustrates making multiple changes to the XML structure using a single descent through the XML.

In-Memory Update Library

This review would be incomplete without mentioning another library that comes with MarkLogic: /Modules/MarkLogic/appservices/utils/in-mem-update.xqy. This library module contains five functions that are analogous to the xdmp:node-* functions, but act on in-memory XML nodes instead of in-database documents. For easy reference, the five are:

  • mem:node-insert-child()
  • mem:node-insert-before()
  • mem:node-insert-after()
  • mem:node-replace()
  • mem:node-delete()

Note that these functions use the recursive descent approach as well (see mem:_process()). You can use these functions in your code after importing them:
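
(A sketch of the import and a single call; the module namespace shown is the one the library declares, so verify it against the file in your installation.)

    import module namespace mem = "http://xqdev.com/in-mem-update"
      at "/MarkLogic/appservices/utils/in-mem-update.xqy";

    let $doc :=
      <root>
        <child>old value</child>
      </root>
    return mem:node-replace($doc/child, <child>new value</child>)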

Where To Use

You can use this technique anytime you want to transform a block of XML. Commonly, you’d use it during ingest (putting data into a better format before storing it) or for display (formatting, supplementing, or redacting data before showing it to the user). With the MarkLogic REST API, applying transforms during ingest and display has become a common pattern.

A member of my team recently had a project in which Office 2007 documents were to be stored, but needed to have internal links updated to reflect the documents’ new home in the database. The documents were opened up using ooxml:package-parts(), surgically adjusted using the transformation method above, then rebuilt as zips using xdmp:zip-create(). No need to create temporary docs in the database; no need to worry about transactions.

This article first appeared as a post on David's blog.

The Art of the Possible: MarkLogic and Tableau Together in Tableau Public

by Sara Mazer

Imagine having a BI dashboard that is not only interactive, with support for full-text and complex searches on unstructured data, but also intelligent and customized with data from the semantic web. A hybrid, if you like, of BI, search, and semantics on both structured and unstructured data. The combination of MarkLogic and Tableau provides this today, and we have an example for you here on Tableau Public.

The Data Set

A little background: MarkLogic provides a free public service to the technical community called MarkMail.org, a collection of almost 9,000 mailing list archives spanning more than 12 years. Currently we have over 65 million messages that you can search, including email attachments. These listserv posts, or emails, which have both structured and unstructured components (the email body), are a perfect use case for showing how we can integrate with a BI tool such as Tableau.

We used structured components such as the “to” and “from” fields as values for some of our BI “dimensions” (categories to group data by, similar to range indexes and facets in MarkLogic). In addition, we used entity extraction on the message body to pull out more terms that we might also want to use as dimensions. While MarkLogic is an enterprise NoSQL database, we can create a “view” in MarkLogic that looks like a table to a BI tool like Tableau. Through our ODBC driver, we can convert Tableau’s SQL queries to queries MarkLogic can understand. And with MarkLogic 7, we can go even further. Tableau allows users to type in custom SQL, and MarkLogic has enhanced the SQL MATCH operator to support our complex enterprise search features such as Boolean operators, word proximity, and fielded search. This is in addition to support for full-text search on entire documents, with stemming, tokenization, and all that good stuff (NOT just a grep on a relational database column)! To demonstrate our use case, we created a sample Tableau dashboard using a subset of MarkMail data.

(A note about MarkLogic’s ODBC connection. In order to provide this demo to Tableau Public, we have extracted the data from MarkLogic into a TDE file -- but you could easily run live data from MarkLogic into Tableau for real-time analysis on your own server.)

If you want to know more about the details of how all of this came together, you can refer to the “Analytics, NoSQL, and Visualization” webinar that MarkLogic and Tableau presented through Data Science Central.

What Exactly Am I Looking At?

What you see on the Tableau Public site is a Tableau dashboard loaded with MarkMail data. We’ve supplied some charts and graphs for you to play with, and we’ve built a dashboard with some of them. We’ve included a search bar built on a Tableau parameter, using a new Tableau 8 feature that supports parameters in custom SQL. That parameter is just the right-hand side of a SQL MATCH query, and it can be anything you’d expect from a powerful search engine like MarkLogic: Booleans, proximity search, fielded search, and so on. The entire email is searched, not just a column like you’d get with a relational database! Then, because Tableau Server and MarkLogic are both HTTP servers, you can easily embed one application or widget into another. You could have a Tableau dashboard inside a MarkLogic application or, as in the Tableau Public case, a MarkLogic application inside Tableau. That way, you can drill right into your textual results. Try it! Click on an email snippet and a new window pops up showing the entire document (in this case, an email).

But Wait, There’s More: Semantics-Enriched Dashboards with SPARQL and RDF Triples

We’ve really just scratched the surface of what can be done, but if you type in “Hadoop” with a capital H (SPARQL is case sensitive), you will see the right-hand side of the dashboard fill with new data. This isn’t just any data. It’s not in a database per se; it’s coming from the Web. We’re using your search term to also search MarkLogic 7’s triple store indexes to see “what else do we know?”, like the infoboxes you sometimes see with Google searches. The Hadoop logo isn’t stored in our database, but facts about Hadoop (including a link to the image) are stored using our new RDF triple store. When you search, Tableau converts your search to SQL and sends it to MarkLogic; at the same time, we also send the search term to MarkLogic through REST and convert the search to SPARQL. You could have an infobox on your dashboards, or customized pages for every user, and have results change in real time based on data that changed because someone on the other side of the world updated a wiki page. This is business intelligence with the power of the semantic web, enhanced by billions or trillions of facts. Think about dashboards that show data from your organization, enhanced by interconnected pages and facts from millions of other people, and how that might help your organization.
