Constraints

Let's go back to the challenge of selecting messages from the database. The following query uses a key-value constraint to limit the retrieval to mails posted to the Apache Maven announce mailing list:

The predicate [@list = "org.apache.maven.announce"] says only to retrieve mails that have a list attribute with that value. How does MarkLogic execute this query? It doesn't exhaustively look at documents trying to find matches. It has indexes that can find these mails extremely fast. How fast? Let's check:

This counts the mails (there's 152 of them) and returns how long it took to do the work. Yes, you need the comma. As I said earlier, everything in XQuery is an expression and at the outermost level you can write either a singular expression or a sequence of expressions. The comma makes it a sequence.

I'm seeing an execution time of 0.1 milliseconds. That's the raw time for MarkLogic indexes to isolate these 152 messages from the corpus of 5,000,000 messages.

Here's another less optimal way to get the count:

This uses count() instead of xdmp:estimate(). Both calls use indexes, but count() goes one step further and checks the results by loading the documents off disk into memory and confirming they're a match. This takes extra time -- more than a second with caches cold, and 5 milliseconds with caches warm. This is why I showed you xdmp:estimate() in the first query, so you wouldn't get bored waiting for the result. So why would anyone use count()? It's useful if your query isn't fully resolvable from indexes, such as if you request a case-sensitive query but haven't enabled the case-sensitive index. But even then it's only useful against a small set of documents because of the disk overhead. In normal programs count() is almost never used.

Now let's make the constraints more complex. The following query includes a list constraint as well as classification type constraint. It returns the results as an HTML list:

This query uses something new: let variable binding. We assign $lists to a sequence of two strings, and $type to a single string. Then we use those two variables within the XPath expression.

Why is there a return in the middle of a query? Because it's a core part of the FLWOR (pronounced "flower") expression that drives much of XQuery. The initials stand for for, let, where, order by, and return. FLWOR expressions let you do looping, variable assignment, conditional logic, sorting, and result generation. The rule is that a FLWOR needs one or more "for" or "let" subclauses, in any order. You can write just lets, just fors, or any combination. These subclauses generate "tuples" (which are ordered sets of values, a fancy way of saying a set of variable bindings). The tuples are then passed through the optional "where" clause and, if they survive, get sorted by the optional "order by" clause. Finally there's a mandatory "return" clause that indicates what to do with each surviving tuple. In the query above we bind two variables, and use a "return" to generate a new <ul> node using them. Inside the enclosed expression we use a "for" to iterate and a "return" to generate a new node. The key thing to remember is that a "return" doesn't return control to the caller like in a procedural language. It's behaves more like a "do". Why didn't the W3C in defining XQuery name it "do"? Who would want to use FLWOD expressions?

Let's check how fast this query operates. We can test the raw index-resolution performance with this simplified query:

I see a time of 0.2 milliseconds.

Searching attachments is a key requirement for robust email searching. In our XML model, attachments are represented by <attachment> elements. The following query shows one at random:

Each attachment element uses attributes to hold various metadata fields, and stores the attachment's content details within. The above attachment happens to be a patch file. It might be more interesting to find a PowerPoint attachment, so let's find some using the @extension attribute.

That's more interesting, and it reveals a bit about how MarkMail handles Office attachments. They're stored raw as a binary document (linked to from the file attribute). They're also stored in a converted-to-PDF format (linked to from the <attachment-pdf> child element). From the PDF there's a binary screen-shot image taken on every page, in both large and thumbnail formats (linked to with yet more child elements). The main XML document maintains links to all the separate binary documents. The XML also tracks, internally, what text resides on every page of the attachment. This lets MarkMail search inside attachments and know which pages have the hits.

You might be wondering where attachments fit into the message structure. We can find out. We just need to write a query for messages having an attachment. The following query does that by finding a PowerPoint attachment element, then asking for its root element:

Someone more familiar with XPath would write this (the dot is important; it roots the internal path to the message not the whole database):

Someone exercising their FLWOR skills would write this:

They all get the job done.

Formatting Results

Facets

Stack Overflow iconStack Overflow: Get the most useful answers to questions from the MarkLogic community, or ask your own question.