5-minute Guide to the Search API

Colleen Whitney and Micah Dubinko
Last updated April 1, 2015

This guide will take you through the basics of the Search API, which makes it easy to do flexible Google-style searches. The Search API consists of fewer than a dozen functions (in MarkLogic 8+, you can import these XQuery functions into Server-side JavaScript). These functions act as a buffer between busy developers with jobs to do and a granular, powerful, and complex set of underlying APIs for building search applications. In this series, you'll learn what it does, and why.

You can use Query Console at http://localhost:8000/qconsole/ to run the code in this tutorial. We've also provided a Query Console workspace that you can download and import directly into your Query Console.

Data setup

The Search API only covers searching content, not putting it in place or setting up a database. So for this walkthrough, you'll need to do some setup by hand. In particular, we'll need to create a database named barbecue and load some content into it.

MarkLogic provides several interfaces to create databases, but one of the simplest is the UI in the Information Studio tool:

  1. Point your browser at http://localhost:8000/appservices (Replace localhost with your server host name as needed.)
  2. Click the "+ Database" button and create one named 'barbecue'

Once you've created the database, copy and paste the following XQuery script into Query Console, point it at the new 'barbecue' database, and run it to load some data. Be sure the QC buffer is using XQuery as the Query Type. (You will also find a copy of this script in the attached Query Console workspace. NB: the script will return the empty sequence on success.)

You are, of course, encouraged to play with all kinds of data using the Search API, but for the purposes of this walkthrough, we'll be working with documents that look like this:

 <entry xmlns="http://example.com"
        date="2007-10-31T14:17:42.125-07:00">
   <title>Sally's Southern BBQ</title>

   <abstract>A classic southern recipe</abstract>
   <flavor-descriptor>cayanne</flavor-descriptor>
   <flavor-descriptor>molasses</flavor-descriptor>
   <flavor-descriptor>smoky</flavor-descriptor>

   <scoville>800</scoville>
   <rating>3.0</rating>
 </entry>

Out of the box

Try a simple query:

xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

search:search("BBQ smoky")

The two terms are implicitly ANDed together. Likewise, you could have used the built-in AND operator:

xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

search:search("BBQ AND smoky")

Out of the box, search:search() parses the search string, understanding "Google-style" grammar including AND, OR, quote marks for exact phrases, a minus sign for negation, and parentheses for grouping, allowing complex queries like [(hot OR spicy) AND "southern style" -pepper].

The prior query returns search results in a convenient format, like this:

<response total="1" start="1" page-length="10"
    xmlns="http://marklogic.com/appservices/search">
  <result index="1" uri="http://bbqdocs/entry1"
      path="doc(&quot;http://bbqdocs/entry1&quot;)"
      score="124" confidence="0.520534" fitness="0.666994">
    <snippet>
      <match path="doc('http://bbqdocs/entry1')/*:entry/*:title">Sally's Southern <highlight>BBQ</highlight></match>
      <match path="doc('http://bbqdocs/entry1')/*:entry/*:flavor-descriptor[3]"><highlight>smoky</highlight></match>
    </snippet>
  </result>
  <qtext>BBQ smoky</qtext>
  <metrics>
    <query-resolution-time>PT0.004577S</query-resolution-time>
    <facet-resolution-time>PT0.000059S</facet-resolution-time>
    <snippet-resolution-time>PT0.001947S</snippet-resolution-time>
    <total-time>PT0.006831S</total-time>
  </metrics>
</response>

The parsed query runs against the entire database and returns an XML response element with the key information a typical search application needs.

  • @total and @start on the root element give the estimated total number of results and the starting offset for these results, to support pagination.
  • Each returned result has a result index. Each result contains useful information like @uri, @path, @score, @confidence, and @fitness. Path values are in a format suitable for calling xdmp:unpath().
  • A snippet element for each result, summarizing the matching portion(s) of the document.
  • The original query text.
  • Useful metrics for how long various stages of evaluation took to complete.
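
As a rough sketch of how an application might read these values, the following runs the same query and pulls out the estimated total and the URI of each result. The start and page-length arguments are optional parameters of search:search(); the values used here (first page, ten results) are only illustrative, and the empty sequence is passed in place of an options node.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Illustrative paging values: start at result 1, ten results per page :)
let $start := xs:unsignedLong(1)
let $page-length := xs:unsignedLong(10)

(: The empty sequence means "no custom options" :)
let $response := search:search("BBQ smoky", (), $start, $page-length)
return (
  (: estimated total number of matches, useful for building a pager :)
  fn:data($response/@total),
  (: URI of each result on this page :)
  for $result in $response/search:result
  return fn:string($result/@uri)
)
```

To fetch the next page, an application would typically add the page length to the start value and call search:search() again.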

Using Options for Customization

That simple search:search() call is pretty powerful. The server responds with a list of the most relevant documents in the database containing the terms "BBQ" and "smoky", in a format that's easy for a search application to transform and use on a results page.

But MarkLogic Server, and your information, are capable of much more. What if you specifically want to find a document describing a recipe with "BBQ" in the title and a "smoky" flavor, and you don't want it to be too mild or too hot? What if you want to restrict the query to a particular collection or directory within the database? What if you want the results sorted in some other way?

To accommodate the need for more specific and powerful queries on structured content, and for customization of the results that come back, most of the functions in the Search API accept an <options> node. These options offer developers very fine-grained control over how searches are parsed and evaluated, and how the results are returned.

A Few Key Options

In this part of the walkthrough, we'll highlight a few key options, and how they work. By default, responses consist of just a few elements (result, facet, qtext and metrics). Each of these can be independently turned on or off with a boolean option:

  • return-results
  • return-facets
  • return-qtext
  • return-metrics (returns performance timings on various parts of query parsing and resolution)

A few other potentially useful features are turned off by default, but can be easily turned on with similar boolean options:

  • debug (to see additional information useful in a debugging context)
  • return-query (returns the XML representation of the parsed query)
  • return-constraints (returns the original constraints used to define the query)
  • return-similar (return similar documents to a given result)

Other useful options include:

  • page-length (unsignedInt)
  • searchable-expression (a string like "//p" used on the left-hand-side of the cts:search() call), by default "fn:collection()".
  • sort-order (XML)
  • search-option (an options string to pass into cts:search(); multiple search-option elements are allowed, each with one string option)

Most of the options we've listed here are simple to use and understand.

For example, here's an options node that disables metrics, adds the query to the response element, and makes the query unfiltered:

<options xmlns="http://marklogic.com/appservices/search">
  <return-metrics>false</return-metrics>
  <return-query>true</return-query>

  <search-option>unfiltered</search-option> 
</options>
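
A minimal sketch of passing that options node to search:search() might look like the following. The query text is simply reused from the earlier example; nothing else is assumed beyond the options shown above.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <!-- leave the timing information out of the response -->
    <return-metrics>false</return-metrics>
    <!-- include the XML form of the parsed query in the response -->
    <return-query>true</return-query>
    <!-- passed through to cts:search() as an option string -->
    <search-option>unfiltered</search-option>
  </options>

(: The options node is the second argument :)
return search:search("BBQ smoky", $options)
```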

Constraints

One of the most powerful options in the Search API toolkit is a constraint, which gives the Search API information about indexed structures in your content, and how you want to expose those structures to users in your application. Constraints are extremely powerful, and are a little more complex than the options we highlighted above. We'll pause here to take a closer look at constraints, and how they relate to the types of queries your application requires.

Constraints make it possible for users to:

  • Find a document containing <flavor-descriptor>smoky</flavor-descriptor> with the query string [flavor:smoky]
  • Find a document containing <rating>5.0</rating> with the query string [rating:5]
  • Find a document containing <scoville>1000</scoville> with the query string [heat:moderate]
  • Find a document containing <entry date="2009-04-07T14:44:27.550-07:00"> with the query string [date:today] (if today is April 7, 2009)
  • Find a document marked with the collection URI "http://bbq.com/contributor/BigTex" with the query string [contributor:BigTex]
  • Find a document containing <title>Four little pigs</title> with the query string [intitle:pigs]
  • Find a document like the one below with the query string [summary:Louisiana AND summary:sweet]:
<entry xmlns="http://example.com"
      date="2009-03-03T13:18:58.225-07:00">
   <title>Louisiana Bayou Mild</title>
   <abstract>Straight from New Orleans, mild, sweet</abstract>

   <flavor-descriptor>sweet</flavor-descriptor>
   <flavor-descriptor>vinegar</flavor-descriptor>
   <scoville>750</scoville>
   <rating>3.0</rating>

</entry>

Let's take a look at sample constraint definitions that enable each of these possibilities.

Value

The first example, [flavor:smoky] is a simple value constraint based on the text value of a particular element. The definition looks like:

<constraint name="flavor">
  <value>
    <element ns="http://example.com" name="flavor-descriptor"/>
  </value>
</constraint>

Note that the constraint name and the name of the element involved are separate. This simple query does not require an Element Range Index to be configured; equality is based on the string value of the element.

Range (type-aware) Value

The second example, [rating:5] uses a type-aware comparison, so that "5" in the query text can still match "5.0" in the document. Under the hood, this is a range-query with an operator of "=". (Inequalities like "GT" (greater than) are also supported.) The definition looks like:

<constraint name="rating">
  <range type="xs:decimal">
    <element ns="http://example.com" name="rating"/>
  </range>
</constraint>

A few details:

  • The attribute @name (here "rating") is the part that appears on the left-hand side of the constraint expression.
  • The child element of constraint tells what kind of constraint it is. In this case, a <range> constraint.
  • The values are taken from the element rating (in the http://example.com namespace) with a datatype of xs:decimal, for which our script created a range index.
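
The sketch below exercises this constraint with both an equality and an inequality. It assumes the xs:decimal range index on rating already exists, and that the default grammar writes the inequality operator with surrounding spaces (rating GT 3); treat that query syntax as an assumption to verify against your server version.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="rating">
      <range type="xs:decimal">
        <element ns="http://example.com" name="rating"/>
      </range>
    </constraint>
  </options>
return (
  (: type-aware equality: "5" is compared as a decimal, so it can match 5.0 :)
  search:search("rating:5", $options),
  (: inequality (assumed syntax): ratings greater than 3 :)
  search:search("rating GT 3", $options)
)
```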

Bucketed (absolute)

The constraint definition for the third example, [heat:moderate] is a little more complex, because it spells out what we mean by "moderate". Each of the bucket definitions is used to create corresponding range queries when a keyword is recognized, so "moderate" is translated to an element range query for items where the value of <scoville> is between two specified values (the attribute @lt means 'less than', and @ge means 'greater than or equal').

<constraint name="heat">
  <range type="xs:int">
    <element ns="http://example.com" name="scoville"/>
    <bucket name="mild" lt="500">Mild (less than 500)</bucket>
    <bucket name="moderate" ge="500" lt="2500">Moderate (500 - 2500)</bucket>
    <bucket name="hot" ge="2500" lt="8000">Hot (2500-8000)</bucket>
    <bucket name="extra-hot" ge="8000">Extra Hot (8000+)</bucket>
  </range>
</constraint>

Notice some differences in how a bucketed constraint is put together:

  • The values are grouped into buckets
  • The attributes @ge (greater than or equal) and @lt (less than) set the boundaries of the bucket. Note that the values of these attributes on a bucket element MUST be valid instances of the declared datatype, here xs:int.
  • Note that the name of the constraint and the name of the element can be different, as is the case here.
  • The name of the bucket (here "moderate") appears on the right-hand side of the constraint expression, like this: [heat:moderate].
  • This example uses the Element Range Index on the <scoville> element with a datatype of xs:int, configured by our script.

Note: for best performance, arrange all buckets in ascending order (so @ge is optional on the first bucket and @lt is optional on the last), and make the @lt on one bucket exactly match the @ge on the next. If you don't follow these guidelines, facets will still work, but more slowly.

Bucketed (relative)

The fourth example, [date:today], is a bucketed range constraint, but with an added twist: the bucket definitions are calculated on the fly, rather than being predetermined, as indicated by the element name of <computed-bucket>.

<constraint name="date">
  <range type="xs:dateTime" facet="true">
    <element ns="http://example.com" name="entry"/> 
    <attribute ns="" name="date"/>

    <computed-bucket lt="-P1Y" anchor="start-of-year" name="older">Older</computed-bucket>
    <computed-bucket lt="P1Y" ge="P0Y" anchor="start-of-year" name="year">This Year</computed-bucket>
    <computed-bucket lt="P1M" ge="P0M" anchor="start-of-month" name="month">This Month</computed-bucket>
    <computed-bucket lt="P1D" ge="P0D" anchor="start-of-day" name="today">Today</computed-bucket>
    <computed-bucket ge="P0D" anchor="now" name="future">Future</computed-bucket>
    
    <facet-option>descending</facet-option>
  </range>
</constraint>

Note the differences from the last example:

  • The constraint definition spells out what we mean by "today": a value range calculated at query time, relative to the @anchor defined for the computed bucket.
  • The attribute facet="true" is present on range. Because 'true' (return facets) is the default, it appears here only for illustration. Setting it to 'false' would stop this particular constraint from generating facets.
  • A <computed-bucket> (unlike a regular bucket) MUST have an @anchor attribute with a valid anchor name; built-in anchor names are 'now', 'start-of-day', 'start-of-month', and 'start-of-year', all based on the current system time.
    • Additionally, a <computed-bucket> has relative values in @lt and @ge. Here the datatype of the constraint is xs:dateTime, but the values in @ge and @lt are durations (such as P1D or P1Y) measured from the anchor.
    • Remember, a regular <bucket> MUST NOT have a @anchor attribute.
  • A <facet-option> is used to specify a "descending" sort order.
  • This example uses the Element Attribute Range Index, configured on the date attribute of the entry element, that we created in the loading script with a datatype of xs:dateTime.
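
As a sketch, a trimmed-down version of this constraint (only two of the buckets, chosen for brevity) could be queried as follows. It assumes the xs:dateTime attribute range index described above is in place. The sample entries shown in this guide are dated 2007 and 2009, so they would be expected to fall under "older" rather than "today".

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="date">
      <range type="xs:dateTime">
        <element ns="http://example.com" name="entry"/>
        <attribute ns="" name="date"/>
        <!-- boundaries are durations relative to the named anchor -->
        <computed-bucket lt="-P1Y" anchor="start-of-year" name="older">Older</computed-bucket>
        <computed-bucket lt="P1D" ge="P0D" anchor="start-of-day" name="today">Today</computed-bucket>
      </range>
    </constraint>
  </options>
return (
  (: entries dated within the current day :)
  search:search("date:today", $options),
  (: entries dated before the start of last year :)
  search:search("date:older", $options)
)
```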

Collection constraints

The fifth example, [contributor:BigTex], is a collection constraint, based on collection URI. Many applications use collections to create flexible, non-hierarchical groups of documents. Collections may be assigned at ingest time or to support applications like tagging. So this query is looking for documents in the "http://bbq.com/contributor/BigTex" collection. The constraint definition looks like:

<constraint name="contributor">

  <collection prefix="http://bbq.com/contributor/"/>
</constraint>

A few details:

  • This example uses the collection lexicon that was set up in the loading script.
  • Note the specification of a URL prefix that is not repeated in the query text.
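
A short sketch of this constraint in use follows. The Search API prepends the prefix to the text after the colon, so conceptually the query targets the full collection URI http://bbq.com/contributor/BigTex.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="contributor">
      <!-- prefix + "BigTex" gives the full collection URI -->
      <collection prefix="http://bbq.com/contributor/"/>
    </constraint>
  </options>
return search:search("contributor:BigTex", $options)
```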

Word constraints

The sixth example, [intitle:pigs], is based on a word constraint. Under the covers, this equates to an element word query: the query targets documents with the word "pigs", but only if it occurs in the title element in namespace http://example.com. The constraint definition is very simple:

<constraint name="intitle">
  <word>
    <element ns="http://example.com" name="title"/>
  </word>
</constraint>

No special indexes are required in order to evaluate this query, although the "fast element word searches" index, on by default, will provide better performance. (Note that if you want to use this constraint to define a source for suggestions, you will need to configure an element word lexicon on 'title'.)

Field constraints

The final example, [summary:Louisiana AND summary:sweet], uses a field constraint, based on a field defined in the database. A field allows a developer to logically group selected structures within a document for focused indexing and queries. This query, then, is looking for documents in which the 'summary' field (including title and abstract elements) contains the words "Louisiana" and "sweet".

<constraint name="summary">
  <word>
    <field name="summary"/>
  </word> 
</constraint>

This example requires that the database configuration include a field named "summary", specifying inclusion of the portions of the document with <title> and <abstract> elements.

Putting it all together

So going back to our simple example, how could the Search API make it easy to write a query for our moderately hot BBQ recipe with smoky flavor?

Define an options node that includes the following constraints, and pass it into the call to search:search():

xquery version "1.0-ml";
 
import module namespace
  search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

let $options := 
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="flavor">
      <value>
        <element ns="http://example.com" name="flavor-descriptor"/>
      </value>
    </constraint>

    <constraint name="intitle">
      <word>
        <element ns="http://example.com" name="title"/>
      </word>
    </constraint>

    <constraint name="heat">
      <range type="xs:int">
        <element ns="http://example.com" name="scoville"/>
        <bucket name="mild" lt="500">Mild (less than 500)</bucket>
        <bucket name="moderate" ge="500" lt="2500">Moderate (500 - 2500)</bucket>
        <bucket name="hot" ge="2500" lt="8000">Hot (2500-8000)</bucket>
        <bucket name="extra-hot" ge="8000">Extra Hot (8000+)</bucket>
      </range>
    </constraint>
  </options>

return search:search("intitle:BBQ flavor:smoky heat:moderate", $options)

And the result set (listed in the <search:response> doc you get back) will include this matching document:

<entry xmlns="http://example.com"
      date="2007-10-31T14:17:44.425-07:00">
  <title>Sally's Southern BBQ</title>

  <abstract>A classic southern recipe</abstract>
  <flavor-descriptor>cayanne</flavor-descriptor>
  <flavor-descriptor>molasses</flavor-descriptor>
  <flavor-descriptor>smoky</flavor-descriptor>

  <scoville>800</scoville>
  <rating>3.0</rating>
</entry>

How does this work?

Based on the "Google-style" grammar defined in the default options in the Search API, the query parser recognizes the ":" operator, and looks for constraints with names matching the string preceding the colon. The parser then constructs specialized queries that utilize indexes on document structure.
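
To see what the parser builds, one option is to call search:parse() instead of search:search(). This sketch reuses two of the constraints from the example above; it returns the parsed query rather than running it, so you can inspect how each constraint was translated.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="flavor">
      <value>
        <element ns="http://example.com" name="flavor-descriptor"/>
      </value>
    </constraint>
    <constraint name="intitle">
      <word>
        <element ns="http://example.com" name="title"/>
      </word>
    </constraint>
  </options>

(: Parse only: nothing is searched, the structured query is returned :)
return search:parse("intitle:BBQ flavor:smoky", $options)
```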

Doing more with constraints: Faceting

Constraints are the mechanism for enabling precise, powerful queries of your structured content. But constraints also provide the foundation for faceted navigation, a key feature in modern search applications.

By default, all range and collection constraints will produce facets on search:search() calls. (You can disable faceting on a particular constraint using the @facet attribute, as in the computed bucket example above.) Let's see how those facets look, using the following query:

xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

let $options := 
  <options xmlns="http://marklogic.com/appservices/search">

    <constraint name="heat">
      <range type="xs:int">
        <element ns="http://example.com" name="scoville"/>
        <bucket name="mild" lt="500">Mild (lt 500)</bucket>
        <bucket name="moderate" ge="500" lt="2500">Moderate (500 - 2500)</bucket>
        <bucket name="hot" ge="2500" lt="8000">Hot (2500-8000)</bucket>
        <bucket name="extra-hot" ge="8000">Extra Hot (8000+)</bucket>
      </range>
    </constraint>
    <constraint name="contributor">
      <collection prefix="http://bbq.com/contributor/"/>
    </constraint>
  </options>

return search:search("",$options)

==>

<!-- this example includes prefixes as they might be returned from the API -->
<search:response total="5" start="1" page-length="10" 
    xmlns:search="http://marklogic.com/appservices/search">
   ... (omitting result nodes) ...
  <search:facet name="heat">
    <search:facet-value name="moderate" count="3">Moderate (500 - 2500)</search:facet-value>
    <search:facet-value name="extra-hot" count="2">Extra Hot (8000+)</search:facet-value>
  </search:facet>

  <search:facet name="contributor">
    <search:facet-value name="AuntSally" count="1">AuntSally</search:facet-value>
    <search:facet-value name="BigTex" count="2">BigTex</search:facet-value>
    <search:facet-value name="Dorothy" count="1">Dorothy</search:facet-value>
    <search:facet-value name="Dubois" count="1">Dubois</search:facet-value>
  </search:facet>

  <search:qtext/>
  <search:metrics>
    <search:query-resolution-time>PT0.003869S</search:query-resolution-time>
    <search:facet-resolution-time>PT0.015655S</search:facet-resolution-time>
    <search:snippet-resolution-time>PT0.003892S</search:snippet-resolution-time>
    <search:total-time>PT0.023962S</search:total-time>
  </search:metrics>
</search:response>

The empty query string, which by default is configured to match all documents, produces facet counts across every document in the database, much like a histogram.
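
An application usually needs to turn those facet values into links or labels. The following sketch, which assumes the same heat constraint and index as above, turns off the result list with return-results and formats each facet value as a simple string. For the response shown above, this would produce strings such as moderate (3) and extra-hot (2).

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <!-- only facets are needed here, so skip the result list -->
    <return-results>false</return-results>
    <constraint name="heat">
      <range type="xs:int">
        <element ns="http://example.com" name="scoville"/>
        <bucket name="mild" lt="500">Mild (less than 500)</bucket>
        <bucket name="moderate" ge="500" lt="2500">Moderate (500 - 2500)</bucket>
        <bucket name="hot" ge="2500" lt="8000">Hot (2500-8000)</bucket>
        <bucket name="extra-hot" ge="8000">Extra Hot (8000+)</bucket>
      </range>
    </constraint>
  </options>
let $response := search:search("", $options)

(: One string per facet value: bucket name followed by its count :)
for $value in $response/search:facet[@name = "heat"]/search:facet-value
return fn:concat($value/@name, " (", $value/@count, ")")
```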

Hints for Working With Options

The options you pass in are merged with a set of default options, so it's only necessary to pass in options that you want to change. You can access and examine the default options using search:get-default-options() if you are curious.

Options can get complicated, so we encourage you to use the search:check-options() function during development. It examines your candidate options XML and reports any differences from what the API expects, and optionally checks the options against available indexes to see if any are missing. Set the <debug> option to true to get additional information.
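
As a sketch of that workflow, the following checks a candidate options node and then prints the defaults for comparison. Passing true as the second argument asks for the stricter check against available indexes; any problems found come back as report elements in the result.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="rating">
      <range type="xs:decimal">
        <element ns="http://example.com" name="rating"/>
      </range>
    </constraint>
  </options>
return (
  (: strict check: also verifies that the range index exists :)
  search:check-options($options, fn:true()),
  (: the defaults that your own options are merged with :)
  search:get-default-options()
)
```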

For further information

Documentation on the Search API can be found at:

Comments

  • I need a programmer to help us create advanced and accurate search engine for ICD-10. There are 68,000 codes with alpha text descriptions.
  • Where can I find more info on facets? This was a good introduction but I want more details.
    • Concepts and how to apply them are covered in "Constrained Searches and Faceted Navigation" in the Search Developer's Guide (http://docs.marklogic.com/guide/search-dev/search-api#id_77403). If you're using the Java or Node.js Client API, search for "facet" in those documents to see how to apply the concepts with those tools.
  • excellent introduction