An Introduction to MarkLogic Server and XQuery

Jason Hunter, Micah Dubinko and Eric Bloch
Last updated 2012-09-07

This document gives you an informal introduction to MarkLogic Server, the XQuery language, and the developer resource site dedicated to both.

Getting Your Bearings

The first stop on our tour today is MarkLogic Server.

MarkLogic Server is an Enterprise NoSQL database.

It is a document-centric, transactional, search-centric, structure-aware, schema-agnostic, XQuery- and XSLT-driven, high-performance, clustered database server.

MarkLogic Server fuses together database internals, search-style indexing, and application server behaviors into a unified system. It uses XML documents as its data model, and stores the documents within a transactional repository. It indexes the words and values from each of the loaded documents, as well as the document structure. And, because of its unique Universal Index, MarkLogic doesn't require advance knowledge of the document structure (its "schema") nor complete adherence to a particular schema. Through its application server capabilities, it's programmable and extensible.

MarkLogic Server clusters on commodity hardware using a shared-nothing architecture and differentiates itself in the market by supporting massive scale and fantastic performance — customer deployments have scaled to hundreds of terabytes of source data while maintaining sub-second query response time.

It's probably easiest to understand MarkLogic Server with a demonstration. At MarkMail.org you'll find a web-based application that allows you to explore some 50 million messages from public mailing lists focused on technology and open source. You can drill down into the database based on search terms (including stemming, where searching for 'win' also matches 'wins' and 'won'), or on specific data facets like author, mailing list, date, attachment type, or message type.

Getting MarkLogic Server

The best way to understand what you can do with MarkLogic is to get a copy and play with it. Under the Developer License, individual developers can use a free copy of MarkLogic Server for development purposes. You'll find a big button on the front page of the developer site, which points to http://developer.marklogic.com/download/. Check the information on that page for details of the license and specific system requirements. There's one binary download for each platform.

For all platforms, there is a single shared installation guide, which you can find on the Documentation page and which we highly recommend reading. We'll pause the tour now while you follow the guide's instructions and install the database. It takes just a couple minutes.

Administering MarkLogic Server

The install guide should have walked you through the process of browsing to the admin interface at http://localhost:8001 to enter the license key. Now you can go to the same web address to administer the server. The admin interface lets you control the creation, management, and configuration of databases, forests, servers, and hosts. The left navigation bar contains the "nouns". Use it to select the item you want to act upon. The top right tabs contain the "verbs". Select the verb after selecting the noun. Under the tab is a data entry area for making changes.

The main thing you need to understand when using the admin pages is the database topology. Documents are stored in forests. One or more forests are gathered together to form a database. Databases are logical units to which you can assign HTTP, WebDAV and XDBC (for XCC Java and .NET connectivity) servers and on which you can set various runtime configuration options. The name "forest" comes from the fact that XML documents are tree structures, and a collection of trees is a forest. Databases exist as a logical abstraction because in a distributed environment it can be useful to have the same logical database spread across different hosts, perhaps one host with two forests and another with three.
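If you'd rather see this topology from code than from the admin pages, here is a minimal sketch you can run once you reach Query Console later in the tour. It assumes the default "Documents" database exists on your server, and it simply lists the names of the forests attached to that database.

```xquery
xquery version "1.0-ml";

(: Assumption: a database named "Documents" exists on this server. :)
(: xdmp:database() looks up the database ID by name,               :)
(: xdmp:database-forests() returns the IDs of its forests,         :)
(: and xdmp:forest-name() turns each forest ID into a name.        :)
for $forest-id in xdmp:database-forests(xdmp:database("Documents"))
return xdmp:forest-name($forest-id)
```

On a fresh single-host install you would expect just one forest name to come back; a clustered database may return several.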

There's a full Administrator's Guide document available on the Documentation page which you can use as your guide through the port 8001 administration pages.

Build Your First App in Minutes

MarkLogic ships with Application Services, which includes the MarkLogic Application Builder, running at http://localhost:8000/appservices/. The main use of the Application Builder is to quickly build an app around your own content, though it does include a demonstration app and some sample content, which can be loaded into a database of your choosing.

From the Application Builder main page click the "New Example Application" button, and on the dialog that appears, provide a name and select "New Database" (also giving it a name). Click the "Create Application" button and you will be presented with a wizard of six screens, each of which controls one aspect of the app that it will ultimately generate:

  1. Appearance
  2. Search
  3. Sorting
  4. Results
  5. Content
  6. Deploy

You can experiment with the settings on these pages as much as you like. When ready to run the app, click on the Deploy tab, enter in a port number for the to-be-created App Server, and hit the "Deploy" button. The app will launch on the port of your choosing.

Check out the rest of the tutorials and the official documentation for more details on the Application Builder and how it can be used to rapidly get your apps off the ground.

Writing XQuery

Now that you have a database installed and have had a chance to poke around the admin screens and the Application Builder, I'll bet you're itching to dive in deeper and write your own XQuery code. There's more than one way to do it: in the browser or by creating files on the file system, to name two.

MarkLogic ships with a tool called Query Console. It's a web-based application useful for writing quick queries.

Visit http://localhost:8000/qconsole to see a form where XQuery can be entered and evaluated. A drop-down list on the page lets you choose which database and App Server to use when evaluating your XQuery.

Note: Query Console is a powerful tool. Use it with care.

Try running the following query in Query Console:

(: This is welcome.xqy :)
<big xmlns="http://www.w3.org/1999/xhtml">
Welcome to { xdmp:product-name() }
           { xdmp:version() }
           { xdmp:product-edition() } Edition!
</big>

Understanding XQuery

XQuery is a potentially huge subject, but as the prior section showed, it can be very easy to get started. In Query Console, you can load a simple demo built from the XQuery Use Cases specification. Click the "Workspace" link (towards the upper right corner of Query Console) and select "Import Workspace...". Navigate to <marklogic-dir>/Samples/w3c-use-cases.xml, where <marklogic-dir> is the directory in which MarkLogic is installed on your machine (so the Samples directory is, e.g., c:/Program Files/MarkLogic/Samples on Windows, /opt/MarkLogic/Samples on Linux, or ~/Library/MarkLogic/Samples on OS X). Then click the "Import" button. You should see something like this on your screen:

[Screenshot: the W3C Use Cases workspace in Query Console]

To get started:

  1. Choose the database you want to load the sample documents into, e.g., the default "Documents" database.
  2. Select the "Load Source XML" query at the top of the list on the right-hand side of the window, if it's not already selected.
  3. Click the "Run" button.

This will load the sample documents into the database. Now, select another sample query from the list on the right-hand side of the window, and click "Run" to see its results (in your choice of XML, HTML, or Text format) at the bottom of the screen.

You can use these examples to get a taste for what XQuery code looks like. You can also enter your own custom query into the textarea. Here's an example:

(: Try this in the textarea :)
for $i in collection() return document-uri($i)

Running this gives you an unsorted listing of all the documents held within the database. The collection() function returns a sequence of document nodes while document-uri($i) returns the URI (the identifier) for document $i. This query might time out when run against larger databases--more efficient means to iterate URIs exist for larger data sets.
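One such means is the URI lexicon. The sketch below is a hedged illustration, not the only approach: it assumes the URI lexicon is enabled on the database (cts:uris() raises an error if it is not), and the limit of 10 is an arbitrary number chosen for the example.

```xquery
xquery version "1.0-ml";

(: Assumption: the URI lexicon is enabled for the current database. :)
(: The empty first argument means "start at the beginning";         :)
(: "limit=10" asks for at most ten URIs (an arbitrary choice).      :)
cts:uris((), ("limit=10"))
```

Because the URIs come straight from the lexicon, the server doesn't need to fetch each document just to report its name.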

For some explanation on the XQuery Use Cases, I'll point you toward the Getting Started with MarkLogic Server document available at http://developer.marklogic.com/pubs/.

XQuery is a language designed to efficiently query large collections of XML data. Examples include medical records, textbook content, office documents, or web pages. In this model, you store the documents directly into the XQuery database -- possibly going through a conversion to XML. Then you query the documents to extract the bits and pieces deemed important. Increasingly XQuery is getting used for application logic as well, and thus becoming a one-stop-shopping language for building Web Apps.
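To make "extract the bits and pieces" concrete, here is a small self-contained sketch in the spirit of the W3C use cases. The book data is made up for illustration and held in a variable rather than in the database, so you can run it without loading anything.

```xquery
xquery version "1.0-ml";

(: Illustrative in-memory data; not loaded from the database. :)
let $bib :=
  <bib>
    <book year="1994">
      <title>TCP/IP Illustrated</title>
      <publisher>Addison-Wesley</publisher>
    </book>
    <book year="2000">
      <title>Data on the Web</title>
      <publisher>Morgan Kaufmann Publishers</publisher>
    </book>
  </bib>
(: Keep only Addison-Wesley books published after 1991, :)
(: and return just the year and title of each.          :)
for $b in $bib/book
where $b/publisher = "Addison-Wesley" and $b/@year > 1991
order by $b/title
return <book year="{ $b/@year }">{ $b/title }</book>
```

Swap the $bib variable for a doc() call and the same for/where/return shape works against documents stored in the database.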

Loading Documents

As many people have pointed out, vanilla XQuery leaves certain areas underspecified. An additional specification called "XQuery Update" provides a way to put things into the database, though it's not yet widely implemented. Plain old XQuery also lacks built-in support for efficient full-text search, though the W3C is working on that too. MarkLogic addresses these gaps with numerous built-in functions. This section explains a few of the methods you need to understand in order to make the most of MarkLogic Server.

Watch out! If you're new to XQuery and skipped over the Getting Started links above, you're going to find the XQuery code in this section a little heavy. That's OK. I'll just assume you're having such a great time here that you can't wait to continue. Learn what you can. You can always come back.

When it comes to getting your content into the database, the most important MarkLogic Server built-in function is xdmp:document-load(). The first argument points to a local file to load, and by default also forms the database URI used to store the file. But if you desire the document to reside at a different database URI, you can pass in a second parameter with options (see the online docs for an example). Here's a simple example:

xdmp:document-load("/tmp/bib.xml")

This loads the file /tmp/bib.xml to the database under the name /tmp/bib.xml.
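If you want the document stored under a different URI, the options element is where you say so. In this sketch the target URI /docs/bib.xml is just a made-up example; pick whatever URI suits your application.

```xquery
xquery version "1.0-ml";

(: Load the local file /tmp/bib.xml but store it at a different  :)
(: database URI. "/docs/bib.xml" is a hypothetical target name.   :)
xdmp:document-load("/tmp/bib.xml",
  <options xmlns="xdmp:document-load">
    <uri>/docs/bib.xml</uri>
  </options>)
```

After this runs, doc("/docs/bib.xml") returns the loaded document.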

The xdmp:document-load() call returns the empty sequence on success and throws an error in case of problems. To print "Loaded" after a load, use the following trick:

xdmp:document-load("/tmp/bib.xml"),
    "Loaded"

When you start writing code like this you'll know you're an XQuery master. This bit of code evaluates as a sequence of two items, the empty sequence (the output from the xdmp:document-load() function) followed by a string ("Loaded"). Put together, the result is the simple string "Loaded". In case of error, the xdmp:document-load() call errors out and the trailing "Loaded" gets ignored. To handle errors, you can use try/catch (another extension to the language):

try {
  xdmp:document-load("/tmp/bib.xml"),
      "Loaded"
}
catch ($e) {
  (: * below matches the element without need of declaring the namespace :)
  <span>Problem loading { $e/*:message/text() }.</span>
}

The caught error is an XML node with elements like <message> that explain the reason for the error.
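As a rough sketch of what you can pull out of that node, the following variation declares the error namespace explicitly instead of using the * wildcard, and reports both the error code and the message. The file name is hypothetical and chosen so that the load fails.

```xquery
xquery version "1.0-ml";

(: Namespace of the elements inside the caught error node. :)
declare namespace error = "http://marklogic.com/xdmp/error";

try {
  (: Hypothetical path, assumed not to exist, so the load fails. :)
  xdmp:document-load("/tmp/no-such-file.xml"),
  "Loaded"
}
catch ($e) {
  <problem>
    <code>{ $e/error:code/text() }</code>
    <message>{ $e/error:message/text() }</message>
  </problem>
}
```

Returning $e itself from the catch block is a handy way to see everything the error node carries.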

To view the content of a loaded document, use the standard doc() function:

doc("/tmp/bib.xml")

This returns the document node associated with the given URI. To view a list of all loaded documents:

for $i in collection() return document-uri($i)

You saw this query earlier when you typed it into the use-cases textarea. Bringing the two queries together lets you produce a "list and view" script:

let $uri := xdmp:get-request-field("uri")
return

if (empty($uri) or $uri eq "") then
  (
    xdmp:set-response-content-type("text/html"),
    <ul>
    {
      for $i in collection()
      let $doc := document-uri($i)
      return
        <li><a href=
          "view.xqy?uri={xdmp:url-encode($doc)}"
            >{$doc}</a></li>
    }
    </ul>
  )
else
  (
    xdmp:set-response-content-type("text/xml"),
    if (empty(doc($uri)))
    then <error>No content</error>
    else doc($uri)
  )

To give this query a spin, paste it into Query Console or save it in a file (named view.xqy, since that is where the generated links point) where it will be served by an app server.

When a client requests a query file using the special extension .xqy the server executes the query file content and returns the result. It's basically CGI for XQuery. And because XQuery so easily constructs dynamic XHTML output, it's an amazingly convenient development and deployment model. There's no need to use something like Java classes in processing the result (although you can, as we'll see later).

You'll see a <ul> listing of all the documents in the database. Each is clickable, and when you click on the document you see its raw content. (Because the script doesn't have any throttle support, be careful not to use it with long listings or large documents. Browsers don't always like showing <ul> lists of more than a thousand items or XML files of more than a megabyte.)

The script first fetches the uri query string parameter. If it's empty, then it treats it as a request for a listing. If it's not empty, then it's a request for the given URI to be displayed. To handle listings, we set the content type to text/html and print every document-uri() linking to itself. To handle a document view, we set the content type to text/xml and print the doc($uri) result or give a polite error note if the document couldn't be found for any reason.

Guru Tip: The parentheses (notice they're not curly braces) are required because the expression within a then or else clause has to be a single expression, and parentheses make the multiple items into a single, comma-separated sequence.
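Here is the same idea boiled down to a toy example. The variable and the strings are invented for illustration; the point is only that the then branch returns two items as one parenthesized sequence.

```xquery
xquery version "1.0-ml";

let $n := 3
return
  if ($n gt 2)
  (: The parentheses turn two items into a single sequence expression. :)
  then ("big", $n)
  else "small"
```

Drop the parentheses around "big", $n and the query no longer parses as intended, because the comma would end the if expression early.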

Searching Documents

Text search forms the core of a database. Search is the process of selecting, from a collection of elements, those "relevant" to some search condition. Starting with version 4.1, MarkLogic Server includes Application Services, which has a slick component called the Search API that makes it simple to perform powerful "Google-like" searches.

A simple search can be quite straightforward:

xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search" at
"/MarkLogic/appservices/search/search.xqy";

search:search("for sale")

This returns the top ten most relevant mentions of "for" and "sale" across all documents.

By passing in an options element node as the second argument you can get as sophisticated as your searching needs require. You can find more details about the Search API in the documentation or separate tutorials.
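As a small taste, this sketch passes an options node that trims the result page to five hits and turns off facet calculation. The particular settings are arbitrary choices for the example, not recommendations.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search" at
"/MarkLogic/appservices/search/search.xqy";

(: Example settings only: five results per page, no facets. :)
search:search("for sale",
  <options xmlns="http://marklogic.com/appservices/search">
    <page-length>5</page-length>
    <return-facets>false</return-facets>
  </options>)
```

The options element lives in the same namespace as the search module, which is why the xmlns declaration matches the import.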

Hint: by default searches are case insensitive for all-lowercase tokens but case sensitive for tokens containing any uppercase characters. The logic is, if you bothered enough to capitalize, you probably meant it. To override this behavior, you can pass in settings via an options element node.
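A sketch of such an override, assuming you want every term matched without regard to case, might look like the following. The search text is made up; the term-option value is passed through to the underlying word query.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search" at
"/MarkLogic/appservices/search/search.xqy";

(: "For Sale" contains capitals, so by default it would be matched   :)
(: case sensitively. The term-option below asks for case-insensitive :)
(: matching instead.                                                  :)
search:search("For Sale",
  <options xmlns="http://marklogic.com/appservices/search">
    <term>
      <term-option>case-insensitive</term-option>
    </term>
  </options>)
```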

Java-Based and .NET-Based Queries with XCC

While it's easy to develop complete apps from within MarkLogic, there are times when you want to directly connect to the database from a separate application. For this, the database exposes an interface to Java and .NET clients called XDBC, and a client library in both Java and .NET languages called XCC. To get it working you need just a few things:

  • The appropriate XCC client-side package files, downloadable from http://developer.marklogic.com/download/.
  • The server configured to listen for XDBC connections. Use the admin pages to set this up.
  • Java or .NET code written against XCC that connects to the server, executes your query, and (optionally) iterates the result.

It's that easy. The full details are explained in the XCC Developer's Guide. You'll find Javadocs and the .NET documentation for the XCC classes included in the distribution and also online on the developer network. For connecting back to Java, see the tutorial and documentation for the MLJAM library.

Continuing On

Well, our tour's coming to an end. Let me leave you with one piece of advice: Join the developer network mailing list.

When new content is posted and new releases come out, the list is where the releases are announced. If you have questions, it's where you ask them. And if you have answers, it's where you share them. Here's the link:

http://developer.marklogic.com/discuss/

Hope to see you around!
