Learning to use Search Clustering

Clark D. Richey, Jr.
Last updated September 28, 2012

Overview

This tutorial walks you through adding search clustering to your applications. To do this, we'll modify an application created with MarkLogic Application Builder, but the fundamental techniques shown here will work with other types of MarkLogic applications as well.

What is Search Clustering?

Simply put, search clustering is the ability of MarkLogic Server to dynamically group similar documents (technically, nodes) into clusters. The server labels each cluster using terms found in the documents within that cluster. These terms are the server's best guess at the subject of the documents.

Getting Started

In this tutorial we will add search clustering functionality to the Oscars sample application that you can build with MarkLogic Application Builder. To create that sample database, connect to the MarkLogic Application Builder main page at http://localhost:8000/appservices/ (or http://localhost:8002/ in MarkLogic Server 4.x). From there, simply click on "New Example Application". After providing a name for the database and application to be created (I chose Oscars for both), you can accept the defaults on all the other screens. After launching the sample application, I recommend that you view the application and click the link at the bottom of the first page to load additional content from the Internet. This additional content will make your application more interesting and should give you better clustering results.

The sample Oscars application is stored in a database inside MarkLogic Server. To modify this application, we need to create a WebDAV server that points to the modules database used by our application. The name of this database is the name of your application followed by a hyphen and then "modules". My modules database is therefore named Oscars-modules. This is the database to which we want to point our WebDAV server.

To create a WebDAV server, go to the App Servers menu located under Groups / Default / App Servers in the tree menu on the left side of the admin screen. Once you are there, click on the "Create WebDAV" tab on the top right of the screen. Give your WebDAV server a name (I used Oscars-WebDav), choose a free port (I used 8011), set the database to the modules database for your sample Oscars application, and set the root to "/application/". Everything else can retain the default values. Once you're done, click on the "OK" button to create the WebDAV server. Now simply connect to this WebDAV server using your favorite WebDAV client and you should see the source code for the Oscars sample application. With all of that out of the way, we're finally ready to start writing some code!

Changing the Sidebar

For this example, I want to show the search clustering results in the left sidebar, just below the existing facets. To do that, we need to change the way the sidebar is rendered. The function that renders the sidebar is named asc:sidebar and is located in the file standard.xqy under the lib directory. We could edit this file directly, but if we ever used MarkLogic Application Builder to make changes to our application, our changes would be lost when the file was regenerated. A better choice is to override this function by editing the appfunctions.xqy file located in the custom directory. Changes made to this file will not be lost if we use MarkLogic Application Builder to modify our application.

Open appfunctions.xqy in your editor of choice. Scroll down until you find the commented-out function called app:sidebar. Uncommenting this function causes it to override the default asc:sidebar function. So, let's replace that function with the function below to get things started. Note that all we did was manually add another div that will be rendered after the facets are generated.

Let the Clustering Begin!

Go ahead and save the file and then look at your Oscars application in your browser. You should now see an empty Topic section on the bottom left-hand side of the page. This is where we will put the results of clustering. Let's edit our app:sidebar function again, this time adding the clustering. While the code required to generate the clusters isn't complicated, it does take up a little bit of room. Additionally, it is something that we might want to reuse elsewhere in our application at a later date, so we won't simply inline the code in our app:sidebar function. Instead, we'll create a new function that does the clustering and call that function from app:sidebar. The app:sidebar function should now look like this:

Don't bother trying to view your changes in the browser yet, as we haven't created the function we are calling. Let's do that by adding this function directly under the app:sidebar function:

You can now save the file and view the results in your browser. Let's take a look at what is happening in the new app:cluster function. In the very first line of the function, we decide to execute the actual clustering only if we have query results to cluster. If we don't have anything to cluster, such as when the application's splash page is being viewed, then we simply return an empty <li></li>.

If there are result nodes, then we first load the full documents for each item in the results list. Note that if you have potentially large documents you may not want to do this. However, in this case we know all the documents are small, and we get much better clustering results if we use the entire document for clustering rather than the result snippets. We then call cts:cluster on those documents, passing in an options node in which we explicitly specify which of several possible algorithms we want MarkLogic to use when determining the clusters. Finally, we iterate over the resulting nodes and grab the label and the document count to form our list items.
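The steps just described can be sketched roughly as follows. This is a minimal stand-alone sketch under stated assumptions, not the exact app:cluster function from the application: the helper name local:cluster-items, the choice of k-means, and the max-clusters value are hypothetical and chosen only for illustration. It also assumes that each cluster element returned by cts:cluster carries label and count attributes, so check the cts:cluster documentation for your server version before relying on it.

```xquery
xquery version "1.0-ml";

declare namespace search = "http://marklogic.com/appservices/search";

(: Hypothetical helper for illustration; not the tutorial's app:cluster listing. :)
declare function local:cluster-items(
  $response as element(search:response)?
) as element(li)*
{
  (: Load the full document behind each search result rather than the snippet. :)
  let $docs :=
    for $uri in $response/search:result/@uri
    return fn:doc($uri)
  return
    if (fn:empty($docs)) then
      (: Nothing to cluster, for example on the splash page. :)
      <li/>
    else
      (: Assumed option values; the algorithm is named explicitly. :)
      let $options :=
        <options xmlns="cts:cluster">
          <algorithm>k-means</algorithm>
          <max-clusters>5</max-clusters>
        </options>
      (: The *:cluster wildcard avoids hard-coding the result namespace. :)
      for $cluster in cts:cluster($docs, $options)//*:cluster
      return
        <li>{ fn:concat($cluster/@label, " (", $cluster/@count, ")") }</li>
};
```

In an Application Builder module you would pass in whatever variable holds the current search:response and place the returned list items inside the Topic div. Loading documents by URI keeps the sketch independent of how the snippets were generated, at the cost of one fn:doc call per result.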

Summary

Clustering is a powerful and interesting capability, yet as with most such capabilities, it should be used with care. The accuracy of your clusters and terms will vary according to many factors, including your content and the algorithm you choose. There is also a trade-off between accuracy and performance. If you need more accurate results but don't want to increase page load times, consider using AJAX calls to load the cluster information. This tutorial has provided you with enough information to get started using the new clustering feature, but there are many options left to explore. Have fun and happy clustering!
