Collection constraints are cool

by Evan Lenz

With the recent Developer Community site launch, we introduced a new feature on the Search page that lets you narrow your search results by page category. In this post, I'm going to call out some of the MarkLogic features that enabled us to do this. MarkLogic's Search API makes it very easy to support search constraints, which are name/value pairs that a user can enter into a search field, like this:

[Screenshot: search box containing the text "xquery cat:tutorial"]

The above search text will look for the word "xquery" in all tutorials, which is to say, all documents that are in the "tutorial" category. I've defined "cat" (short for "category") as a constraint. That's what a constraint is: a name ("cat") that can be paired with a value ("tutorial"), to constrain your search in some way. The default search grammar uses a colon, although even that can be overridden.

Constraints in and of themselves are nice, but how are users supposed to know how to use them, or even that they exist? One way to weave them naturally into your search UI is to use faceted navigation. With faceted navigation, the links that carry the constraints are provided automatically, so users don't have to type them in. On this site, we have just one constraint, with several possible values. If you run a search on this site for "xquery" by itself, you'll see the current breakdown by category of documents that mention XQuery:

[Screenshot: faceted navigation menu listing each category with its result count]

  • All categories [1675]
  • Function pages [1294]
  • XCC Connector Javadocs [128]
  • XCC Connector .Net docs [72]
  • Blog posts [53]
  • Open-source projects [35]
  • Miscellaneous pages [29]
  • User guides [21]
  • Tutorials [21]
  • Events [11]
  • News items [8]

We see from the above that 21 tutorials mention XQuery, and if we click that link, we'll see that the search constraint is automatically added to the query at the top of the resulting page:

[Screenshot: search box containing the text "(xquery) AND cat:tutorial"]

The above UI shows that not only is "category" a constraint but it also functions as a facet. When I first came across these terms, "constraint" and "facet," I wasn't sure what the difference was. After learning a bit more, I realized that they're slightly different in this regard: Every facet is a constraint, but not every constraint is a facet.

For a constraint to also function as a facet, you have to be able to retrieve all its values (in the case of the "category" facet: "function", "xcc", "tutorial", etc.). In other words, all of the unique values present in the database for a given constraint must be stored in a lexicon. That's what allows us to quickly generate the breakdown by category. It also lets us get a quick count of documents for each value (e.g., "21" in the case of the "tutorial" value).

With those basic definitions out of the way, how do we actually implement constraints and facets? The Search API, and the search:search() function in particular, makes it convenient to retrieve the facet values and counts in its resulting XML. The simplest call to search:search() is to just give it some query text. First, we need to import the Search API library:

import module namespace search="http://marklogic.com/appservices/search"
       at "/MarkLogic/appservices/search/search.xqy";

Then make a simple call:

search:search("xquery")

Running the above query in Query Console will give you a <search:response> element, listing the first 10 results for documents in your database containing the word "xquery". In this case, all of the Search API's default options are in effect. These options determine how the query text is interpreted, how many results to return, what format to return them in, etc.
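
To get a feel for that response, here is a minimal sketch that pulls a couple of pieces out of it. It assumes only what the default options return: a total attribute on <search:response> and a uri attribute on each <search:result>.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
       at "/MarkLogic/appservices/search/search.xqy";

let $response := search:search("xquery")
return (
  (: estimated number of documents matching the query text :)
  xs:integer($response/@total),
  (: URIs of the results on the first page (10 by default) :)
  $response/search:result/@uri/string()
)
```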

To customize the behavior, we need to pass in an <options> node. And to make this look more like production code, we'll make the query text dependent on an HTTP request parameter ("q"), so our final call to search:search() will look like this:

declare variable $q := xdmp:get-request-field("q","");
declare variable $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    ...
  </options>;
 
search:search($q, $options)

Now let's drill down into the <options> node to define our constraint, leaving all the other options at their defaults:

declare variable $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="cat">
      <!-- constraint type element goes here -->
    </constraint>
  </options>;

Each constraint is defined by a <constraint> element, and the type of constraint, i.e. how the data behind the constraint is represented, is determined by what you put inside that <constraint> element. There are several choices here. The following table summarizes the different constraint types, what element you use to represent them, and whether or not the constraint can also function as a facet:

| Constraint type element (child of <constraint>) | Type of constraint | Can function as facet? |
| ----------------------------------------------- | ------------------ | ---------------------- |
| <value> | value of a specific element, attribute, or field | No |
| <word> | word in a specific element, attribute, or field | No |
| <collection> | collection URI | Yes |
| <range> | value or range of values in a specific element, attribute, or field | Yes |
| <element-query> | word query restricted to the specified element | No |
| <properties> | word query restricted to the properties document | No |
| <geo-elem-pair>, <geo-attr-pair>, <geo-elem> | geospatial queries | Yes, if it has a <heatmap> child too |
| <custom> | custom XQuery-defined mapping between constraint value and underlying XML | Yes, if you have an appropriate lexicon |

(Look at the documentation for the search:search() function for the full details on each of these.)

For the Developer Community website, we were faced with the above choices. Which one to pick? Well, I knew we needed facets and I knew this wasn't going to be a geospatial query, so that narrowed the list down to three choices of constraint type:

  • <range>,
  • <collection>, or
  • <custom>

Range constraints are probably the most common basis for faceted navigation. However, they require your documents to have some resemblance to each other. Specifically, every document must share a common element, a common element/attribute pair, or content covered by a common field definition. Range constraints also require that the constraint's value(s) appear directly in the document. If we had planned all of the Developer Community's content from the start with faceted navigation in mind, then we probably would have created something like a <category> element in every document, and then created a range index on it so we could provide faceted navigation based on that element.
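
For comparison, here is a rough sketch of what that road not taken might have looked like. The <category> element (in no namespace) is hypothetical, and the sketch assumes a string element range index on it whose collation matches the one named in the options.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
       at "/MarkLogic/appservices/search/search.xqy";

(: Assumption: every document contains a <category> element, and the database has an
   xs:string element range index on it using the collation given below. :)
declare variable $range-options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="cat">
      <range type="xs:string" collation="http://marklogic.com/collation/">
        <element ns="" name="category"/>
      </range>
    </constraint>
  </options>;

search:search("xquery cat:tutorial", $range-options)
```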

But as it happens, we had a number of heterogeneous document types, none of which explicitly lists a category value. I was also curious how we might accomplish our goal without making broad database-wide updates to document content. My first thought was that I needed to use a <custom> constraint. That way, I could customize the exact mapping between a constraint query (such as "cat:function") and the query against my underlying XML representation. I learned that this also required creating a lexicon so that all the values ("function", "tutorial", "blog", etc.) could be quickly extracted, i.e. so my custom constraint could function as a facet. But since I didn't want to put those values in the document content, I thought of storing them in collection URIs. Before I knew it, I was reinventing collection constraints!

So I finally had a look at <collection> constraints. How easy they were in comparison! Here's the final $options node:

declare variable $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="cat">
      <collection prefix="category/"/>
    </constraint>
  </options>;
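
One way to see what this constraint does is to ask the Search API to parse some query text without running it. The following is a small sketch using search:parse(); with the options shown, the cat:tutorial term should come back as a collection query on "category/tutorial" alongside the word query for "xquery".

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
       at "/MarkLogic/appservices/search/search.xqy";

declare variable $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="cat">
      <collection prefix="category/"/>
    </constraint>
  </options>;

(: Returns the parsed query as XML rather than search results :)
search:parse("xquery cat:tutorial", $options)
```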

The only step now was to associate all my documents with collection URIs. Here's the XQuery function I wrote for doing that:

(: Maps a document to its category name, based on its URI or its root element. :)
declare function ml:category-for-doc($doc) as xs:string {
       if (contains(base-uri($doc), "/javadoc/")) then "xcc"
  else if (contains(base-uri($doc), "/dotnet/" )) then "xccn"
  else if ($doc/api:function-page               ) then "function"
  else if ($doc/*:guide                         ) then "guide"
  else if ($doc/ml:Announcement                 ) then "news"
  else if ($doc/ml:Event                        ) then "event"
  else if ($doc/ml:Article                      ) then "tutorial"
  else if ($doc/ml:Post                         ) then "blog"
  else if ($doc/ml:Project                      ) then "code"
                                                  else "other"
};

As you can see, I had various ways of mapping documents to their category, sometimes based on the document element name and sometimes based on the document URI. This has evolved somewhat, and the beauty of it is that I can use any arbitrary expression to determine the document category. I don't have to do a bunch of hard thinking about how to alter my document's structure.

My invoking code then associates the given document (using xdmp:document-add-collections()) with the appropriate collection URI: "category/xcc", "category/event", "category/tutorial", etc. The Search API has special support for this practice of using a common prefix for collection URIs. The prefix ("category/") identifies the collections that belong to the constraint, and everything after the prefix (e.g., "tutorial") acts as the constraint's value. Under the covers, the Search API calls cts:collection-match("category/*") to efficiently retrieve all the values for my given constraint, which means the collection lexicon must be enabled on the database. Cool stuff!
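
You can try the same lookup by hand. This is only a sketch of the idea, not the Search API's actual internals: it lists each category value found in the collection lexicon together with an estimated document count.

```xquery
xquery version "1.0-ml";

(: Assumes the collection lexicon is enabled on the database :)
for $uri in cts:collection-match("category/*")
return
  <category name="{substring-after($uri, "category/")}"
            count="{xdmp:estimate(cts:search(doc(), cts:collection-query($uri)))}"/>
```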

The only catch, if there is one, is that these collection URI associations need to be maintained. There are several ways of doing this. One is to ensure that every time a document is updated through our homegrown admin UI, the above function is called to correctly (re-)associate it with the right category. Another is to have a global script that does a brute-force update of all documents (we have that too). Finally, possibly the best approach (although we haven't implemented this yet) is to set up a CPF pipeline so that documents automatically get their category reassigned whenever they're updated.
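
The brute-force option might look something like the sketch below. It is not our actual script: local:category-for-doc() is a cut-down stand-in for the category function shown earlier (only the URI-based rules, so no namespace declarations are needed), and local:assign-category() is a hypothetical helper name.

```xquery
xquery version "1.0-ml";

(: Simplified stand-in for the real category function; URI-based rules only :)
declare function local:category-for-doc($doc as document-node()) as xs:string {
       if (contains(base-uri($doc), "/javadoc/")) then "xcc"
  else if (contains(base-uri($doc), "/dotnet/" )) then "xccn"
                                                  else "other"
};

(: Hypothetical helper: tag one document with its "category/..." collection :)
declare function local:assign-category($doc as document-node()) as empty-sequence() {
  xdmp:document-add-collections(
    base-uri($doc),
    concat("category/", local:category-for-doc($doc)))
};

(: Brute-force pass over every document in the database, in a single transaction :)
for $doc in doc()
return local:assign-category($doc)
```

Note that this only ever adds collections. If a document's category can change, a fuller version would also need to remove the stale "category/" collection, and a large database would probably call for batching the work rather than doing it all in one transaction.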

As a finale, here's the relevant portion of the <search:response> that search:search() returns for us and that makes it easy to generate the faceted navigation menu:

<search:facet name="cat">
  <search:facet-value name="blog" count="53">blog</search:facet-value>
  <search:facet-value name="code" count="35">code</search:facet-value>
  <search:facet-value name="event" count="11">event</search:facet-value>
  <search:facet-value name="function" count="1294">function</search:facet-value>
  <search:facet-value name="guide" count="21">guide</search:facet-value>
  <search:facet-value name="news" count="8">news</search:facet-value>
  <search:facet-value name="other" count="29">other</search:facet-value>
  <search:facet-value name="tutorial" count="21">tutorial</search:facet-value>
  <search:facet-value name="xcc" count="128">xcc</search:facet-value>
  <search:facet-value name="xccn" count="72">xccn</search:facet-value>
</search:facet>

As you can see, each of the relevant category values is returned, along with a count of how many matching documents are in the given category. Did I mention that the Search API is cool?
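
To give an idea of how little is left to do, here is a rough sketch of turning that facet into a list of links. The "/search" path and the HTML shape are assumptions for illustration, not the site's actual markup; the link text mirrors the "(xquery) AND cat:tutorial" form shown earlier.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
       at "/MarkLogic/appservices/search/search.xqy";

declare variable $q := xdmp:get-request-field("q", "");
declare variable $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="cat">
      <collection prefix="category/"/>
    </constraint>
  </options>;

let $response := search:search($q, $options)
return
  <ul>{
    for $value in $response/search:facet[@name eq "cat"]/search:facet-value
    let $name  := string($value/@name)
    (: Append the constraint to the user's query text, if there is any :)
    let $qtext := if ($q eq "") then concat("cat:", $name)
                  else concat("(", $q, ") AND cat:", $name)
    (: Largest categories first :)
    order by xs:integer($value/@count) descending
    return
      <li><a href="/search?q={encode-for-uri($qtext)}">{$name} [{string($value/@count)}]</a></li>
  }</ul>
```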

Just getting started? Try out the 5-minute Guide to the Search API. It's how I first got up-to-speed, and I highly recommend it.

Comments

  • how can I make the counts return even if the value is 0?