Paginated Search for Web Applications

Michael Blakeley
Last updated September 28, 2012

This article was written prior to the release of the Search API, which handles all of the lower-level details such as optimized sorting, pagination, etc. for you. See:

MarkLogic Server provides a rich set of APIs for content query and full-text search. Presentation of results, however, is largely left up to developers. In this tutorial, we will learn how to build a paginated search interface.

When a user search returns millions of hits, it is impractical to simply list them all on a web page. The most-relevant results should be at the top of the list, and most users will find the answers they need within the first 10-30 items. At the same time, it is time-consuming for the server to fetch and present millions of hits. The classical answer to both of these problems is query-side pagination.

With query-side pagination, every user-submitted search request includes information about the page that should be returned. For example, if a user searches for "product cycle", the web form would automatically include the pagination arguments, asking the server to search for "product cycle" and return results 1-10. In response, the server will build a response that includes the actual start and stop position of the results, the estimated number of items remaining in the results, and any other meta-info needed by the display code.

Note that we have separated two concerns within the server code: we use a query to build a results-page structure, and we use separate code to transform that results-page structure into the end-user display format. That display code could be another XQuery function, an XSL transform, etc. The final display format could also vary: XHTML, plain-text, WML, etc.

Simple Search and its Shortcomings

We can start by defining a very basic search module, without pagination, and extend it until it does everything we need. We could place this query in a library module or in an invokable main module. For this tutorial, we will use an invokable main module: this allows us to use the same code in both a pure XQuery HTTP application, or in an XCC application via Java, .NET, or libmlxcc.

(: page-request.xqy :)
declare variable $RAW-QUERY as xs:string external;

cts:search(doc(), $RAW-QUERY)

This query will return search results, but it has a number of flaws.

  • The user query is always treated as a simple phrase.
  • All results are returned: there is no pagination. A query that returns 1 million results may drive up to 1 million read I/O operations. The query will resolve quickly, but the results retrieval will feel slow.
  • There is no highlighting of search terms within the displayed results.
  • There is no summarization of the displayed results: the entire document is returned for each result. This generates a lot of network traffic, which can be avoided.

Let's fix these problems. We won't do very much about the user query: we can process rich queries by parsing the user query, but that is a little too complex for this tutorial. Instead, we will simply split the query into words.

At the same time, we can add some very basic pagination. We will expect an external variable $START, which will start at 1 for new searches. We will return 10 results for each page, simply by adding a predicate to the cts:search() expression.

(: page-request.xqy :)
declare namespace pr =
  "http://developer.marklogic.com/2006-09-paginated-search"

declare variable $RAW-QUERY as xs:string external;

declare variable $START as xs:integer external;

let $stop := 9 + $START
let $query := cts:and-query(tokenize($RAW-QUERY, '\s+')[. ne ''])
return cts:search(doc(), $query)[ $START to $stop ]

This still looks very simple, and it gives us basic pagination. Before we add more features, though, note that we calculated $stop in a let expression, rather than writing [ $START to 9 + $START ]. This turns out to be important: in XPath, the predicate expression is evaluated for every item in the current context. Adding two integers is cheap, but the query will run faster if we don't have to perform that math for every single search result.

There is still a serious problem with the results of this query, though. When we return these results to the display layer, that display layer doesn't have enough information to tell the user if there is another page after this one, or not. We can fix that, by adding some meta-information around the search results. As we do that, we will introduce our own element namespace for the metadata.

(: page-request.xqy :)
declare namespace ps =
  "http://developer.marklogic.com/2006-09-paginated-search"

declare variable $RAW-QUERY as xs:string external;

declare variable $START as xs:integer external;

let $stop := 9 + $START
let $query := cts:and-query(tokenize($RAW-QUERY, '\s+')[. ne ''])
let $results := cts:search(doc(), $query)[ $START to $stop ]
return element ps:results-page {
  attribute remainder { cts:remainder($results[1]) },
  attribute start { $start },
  $results
}

Now the display layer can look at the ps:results-page root element, read its start and remainder attributes, and know exactly where the current page is within the complete result set. The display layer can use this information to build a pagination widget. At the same time, we haven't locked the display layer into any particular pagination style.

Self-Correcting Estimates

There's a risk, though, in giving the display layer all of this information. The cts:remainder() documentation specifically warns us that it returns an estimate. This happens because cts:remainder() counts the number of unfiltered index matches for the query. However, some of these index matches may be weeded out by the filtering process. So our remainder attribute may be higher than the final count of search results (in some circumstances the remainder may also be lower, but we don't have to worry about that situation here: the Search Developer's Guide has more information about how estimates and remainders are calculated).

This can be very confusing for the user: the web page reads "1-10 of 1423178 results", but there are only two results on the page. Or the user clicks on the "next" button and there are no more results to display. We can mitigate the user confusion by combining a simple look-ahead technique with user-interface cues.

The user-interface cues should be easy to implement: when the display layer gets an estimated remainder, it should tell the user that there are "about 12430" results. When the display layer is given a remainder that is known to be accurate, it should tell the user that there are "exactly 32" results. But our query has to give that information to the display layer. We will do that by adding a boolean "estimated" attribute, which will be true for estimates and false for counts, and we will set that boolean value by looking a page or two ahead in the result sequence.

(: page-request.xqy :)
declare namespace ps =
  "http://developer.marklogic.com/2006-09-paginated-search"

declare variable $RAW-QUERY as xs:string external;

declare variable $START as xs:integer external;

declare variable $SIZE as xs:integer external;

declare variable $LOOKAHEAD-PAGES as xs:integer := 1;

let $page-stop := $START + $SIZE - 1
let $stop := 1 + $page-stop + ($SIZE * $LOOKAHEAD-PAGES)
let $query := cts:and-query(tokenize($RAW-QUERY, '\s+')[. ne ''])
let $results := cts:search(doc(), $query)[ $START to $stop ]
let $count := count($results)
let $remainder :=
  if (exists($results)) then cts:remainder($results[1]) else 0
let $estimated := $remainder gt $count
return element ps:results-page {
  attribute estimated { $estimated },
  attribute remainder { max(( $remainder, $count )) },
  attribute start { $start },
  $results[1 to $page-stop]
}

Along with the estimated attribute, we snuck in an extra feature in this revision: the caller can now set $SIZE to indicate the desired page-size. This allows the user to specify 10, 20, or 30 results per page. Note that cts:remainder(()) throws an error, so we have to make sure that the search returned some hits before we ask for the remainder.

Also, note that this approach will only change from "about" to "exactly" when the current page is within 1 page (plus 1 item) of the end. We can reduce the look-ahead to 0, so that we only look to see if there is a next page, or we can increase the look-ahead to a larger number of pages, to provide more accuracy. Note that a lookahead of more than 1-3 pages is usually counterproductive, since most users don't click past the first few pages.

There's another risk remaining with this approach. If we allow the user random access to the pages, instead of only providing "next" and "previous" actions, then the user may step off of the end of the result set. We can let this happen, or we can handle this situation explicity through recursion.

(: page-request.xqy :)
declare namespace ps =
  "http://developer.marklogic.com/2006-09-paginated-search"

declare variable $RAW-QUERY as xs:string external;

declare variable $START as xs:integer external;

declare variable $SIZE as xs:integer external;

declare variable $LOOKAHEAD-PAGES as xs:integer := 1;

declare function page-results(
 $query as cts:query, $start as xs:integer)
 as element(ps:results-page)
{
  let $page-stop := $start + $SIZE - 1
  let $stop := 1 + $page-stop + ($SIZE * $LOOKAHEAD-PAGES)
  let $results := cts:search(doc(), $query)[ $start to $stop ]
  return
    (: if we stepped off of the end, recurse to the previous page :)
    if (empty($results) and ($start - $SIZE) gt 1)
    then page-results($query, $start + $SIZE)
    else
      let $count := count($results)
      let $remainder :=
        if (exists($results)) then cts:remainder($results[1]) else 0
      let $estimated := $remainder gt $count
      return element ps:results-page {
        attribute estimated { $estimated },
        attribute remainder { max(( $remainder, $count )) },
        attribute start { $start },
        $results[1 to $page-stop]
      }
};

declare function get-query($raw as xs:string)
 as cts:query?
{
  let $words := tokenize($RAW-QUERY, '\s+')[. ne '']
  where $words
  return cts:and-query($words)
};

let $query := get-query($RAW-QUERY)
where $query
return page-results($query, $START)

Now we have moved the actual search into a user function, allowing recursion. This makes our pagination quite robust: we tell the display layer where we are in the result set, and we correct the search estimates using a lookahead. We also handle the case where the user steps off the end of the result set: note that this will require some awareness on the display side. The display layer needs to know that the "start" it requests might not be the "start" that comes back: if the user stepped off the end, the query will return the last available page of results.

Some application developers object to pagination in the query, arguing that users do not want to wait for a server request every time they click the "next" button. You can address this concern without changing any of the above code: simply request a "superpage" that is 2-3 user-visible pages. With this approach, your client-side JavaScript can let the user experience fast local pagination for the first few pages, then perform a server request for the next few pages. You could also use asynchronous server requests (AJAX) to pre-fetch the next page or pages.

We also snuck in another enhancement: we abstracted the query-parser into its own function. This will make it easier to build a sophisticated query parser, later on (for a more complex example of this technique, see Example: Creating a cts:query Parser in the Search Developer's Guide). We also added a couple of checks to make sure that we don't end up with cts:and-query(()), which would match all the documents in the database.

Summarization and Highlighting

Now that we have a robust pagination technique, let's talk about summarization. Our goal here is to provide the display layer with all the information it needs, while minimizing the bytes returned. This keeps the network running smoothly. At the same time, we can include metadata related to each result item: its document URI, or its relevance score for the user's query. It's best if we can hard-code the elements that we will return, and we will use that technique here. Naturally, your elements will be different: your document root element may not be called record, etc.

At the same time, we can use the original query to highlight any matching terms in each result summary. We can do this using cts:highlight().

(: page-request.xqy :)
declare namespace ps =
  "http://developer.marklogic.com/2006-09-paginated-search"

declare variable $RAW-QUERY as xs:string external;

declare variable $START as xs:integer external;

declare variable $SIZE as xs:integer external;

declare variable $LOOKAHEAD-PAGES as xs:integer := 1;

declare function page-results(
 $query as cts:query, $start as xs:integer)
 as element(ps:results-page)
{
  let $page-stop := $start + $SIZE - 1
  let $stop := 1 + $page-stop + ($SIZE * $LOOKAHEAD-PAGES)
  let $results := cts:search(doc(), $query)[ $start to $stop ]
  return
    (: if we stepped off of the end, recurse to the previous page :)
    if (empty($results) and ($start - $SIZE) gt 1)
    then page-results($query, $start + $SIZE)
    else
      let $count := count($results)
      let $remainder :=
        if (exists($results)) then cts:remainder($results[1]) else 0
      let $estimated := $remainder gt $count
      let $page := element ps:results-page {
        attribute estimated { $estimated },
        attribute remainder { max(( $remainder, $count )) },
        attribute start { $start },
        for $i in $results[1 to $page-stop]
        return element ps:result {
          attribute confidence { cts:confidence($i) },
          attribute uri { xdmp:node-uri($i) },
          $i/record/title,
          $i/record/publication-date,
          $i/record/authors
        }
      }
      return cts:highlight(
        $page, $query,
        <span class="highlight">{ $cts:text }</span> )
};

declare function get-query($raw as xs:string)
 as cts:query?
{
  let $words := tokenize($RAW-QUERY, '\s+')[. ne '']
  where $words
  return cts:and-query($words)
};

let $query := get-query($RAW-QUERY)
where $query
return page-results($query, $START)

Now the display layer has the information it needs in order to display highlighted, confidence-scored summaries of each search result on the page. In this example, the appearance of the highlighting will be determined by a CSS stylesheet. If that isn't suitable, we could add another external variable or two, to specify the highlight element name and attributes.

Customized Sorting

So far, we have only talked about relevance-ranked display of results. Some web applications have a requirement to allow sorting on title, or on publication date. Note that we must have a custom range index on these elements, so that sorting does not slow down our queries. For more information about range indexes and sorting, see Optimizing Order By Expressions With Range Indexes.

Given these custom indexes, we can refactor our query to allow sorting. This is more complex than simply adding an order by clause. If we tried that, we would simply sort the current page, not the entire result set. Instead, we need to move the pagination predicate outside the expression in which we sort the results. If done properly, this expression is optimized to use the range indexes.

(: page-request.xqy :)
declare namespace ps =
  "http://developer.marklogic.com/2006-09-paginated-search"

declare variable $RAW-QUERY as xs:string external;

declare variable $START as xs:integer external;

declare variable $SIZE as xs:integer external;

declare variable $SORT-KEY as xs:string external;

declare variable $LOOKAHEAD-PAGES as xs:integer := 1;

declare function page-results(
 $query as cts:query, $start as xs:integer)
 as element(ps:results-page)
{
  let $page-stop := $start + $SIZE - 1
  let $stop := 1 + $page-stop + ($SIZE * $LOOKAHEAD-PAGES)
  let $results := (
    for $i in cts:search(doc(), $query)
    order by
      (: handle custom sort keys :)
      if ($SORT-KEY eq 'title')
      then $i/record/title
      else if ($SORT-KEY eq 'publication-date')
      then $i/record/publication-date
      else ()
    return $i
  )[ $start to $stop ]
  return
    (: if we stepped off of the end, recurse to the previous page :)
    if (empty($results) and ($start - $SIZE) gt 1)
    then page-results($query, $start + $SIZE)
    else
      let $count := count($results)
      let $remainder :=
        if (exists($results)) then cts:remainder($results[1]) else 0
      let $estimated := $remainder gt $count
      let $page := element ps:results-page {
        attribute estimated { $estimated },
        attribute remainder { max(( $remainder, $count )) },
        attribute start { $start },
        for $i in $results[1 to $page-stop]
        return element ps:result {
          attribute confidence { cts:confidence($i) },
          attribute uri { xdmp:node-uri($i) },
          $i/record/title,
          $i/record/publication-date,
          $i/record/authors
        }
      }
      return cts:highlight(
        $page, $query,
        <span class="highlight">{ $cts:text }</span> )
};

declare function get-query($raw as xs:string)
 as cts:query?
{
  let $words := tokenize($RAW-QUERY, '\s+')[. ne '']
  where $words
  return cts:and-query($words)
};

let $query := get-query($RAW-QUERY)
where $query
return page-results($query, $START)

We end up processing our search results twice: first to sort them correctly, and then to render the result summaries. This is not a performance problem, though, since we never have more than 21 results to process. Note that the $SORT-KEY is required: set it to "" for relevance-ranked sorting.

Conclusion

We hope this tutorial was useful and informative. You should be able to customize the last query, above, to suit your web application needs. For more information about the MarkLogic Server search API, visit developer.marklogic.com.

Comments