[Editor's Note: My thanks to James for this guest tutorial. James is a friend of MarkLogic who was instrumental in getting the site up on its feet. James continues to lend a hand as his busy schedule permits.]
MarkLogic Server is a powerhouse of XML database processing capability. Part of what makes MarkLogic Server so compelling is that its native interface is the web. By writing XQuery Pages (files that end in .xqy), you can expose any kind of view onto your XML database that you want.
As convenient as it is to use the built-in webserver, however, there are times when you need more flexibility. For example, you might need to set up virtual hosts or do URL rewriting. For such tasks, nothing beats the most used webserver on the planet: Apache httpd.
When we set up the developer network we faced exactly this problem. We needed to set up Subversion with SSL as well as use URL rewriting — both of which required the use of the Apache webserver. And we wanted to present it all within the scope of a single URL namespace to you, the readers of the site. By applying a little bit of Apache configuration magic, we were able to combine the best of both worlds. This tutorial will show you how.
How exactly do you combine two webservers into one when they weren't designed to work together? The answer is that you use the very same HTTP protocol that web browsers use to communicate with web servers. HTTP has built-in support for proxying: one server can accept a request and forward it on to another server, as shown in the following figure.
There are actually two ways in which HTTP requests can be proxied: forward and reverse. A forward proxy is the kind that you may be used to using to access the internet from inside a corporate firewall. It supports requests from multiple clients through a single server for security, caching, or filtering. A reverse proxy, on the other hand, is used to redirect client requests on to another server, possibly hidden from the original client.
Apache httpd can act as both kinds of proxy. It's the second kind of proxy, the reverse proxy, that we are interested in. In addition, using the Apache mod_rewrite module can let us be very exact about which requests get proxied. In our case, we want to send all requests for .xqy pages onward to MarkLogic Server while letting Apache handle all the other requests for images and other data, as shown in the following figure.
Even if you don't want to use Apache's advanced features for URL rewriting or its other capabilities, you may want to use SSL to secure communication over the open internet. Currently, MarkLogic Server doesn't support SSL, but Apache does and you can use it to provide an SSL front end, as shown in the following figure.
In short, the client browser only sees one server, but multiple servers are actually being used to satisfy the various requests made of the server.
Now that you know the theory, let's look at how to put it into practice.
First off, start out with the latest version of MarkLogic Server. When you install it and start it up, the admin server is set up to run on port 8001. For this article, we're going to assume you have set up an app server on port 8005 as your document server.
The next step is to set up the Apache httpd 2.0 webserver. It's important to note that we do mean version 2.0 and not version 1.3.x which is still widely used. The features in Apache that we're going to use didn't fully mature until the 2.0 version was released.
If you don't already have the Apache httpd 2.0 webserver set up on your machine, you can download it from the Apache httpd download page. Once downloaded, build and install it. If you need a little bit of help doing so, use the Apache installation guide. Depending on the operating system you're using, Apache httpd 2.0 may be available as a binary install package.
When you build or install Apache, you'll need to make sure that the mod_proxy and mod_rewrite modules are available.
If you want to enable SSL, make sure that mod_ssl is built in as well. You choose which modules are built in when you run configure on the httpd source tree before building it. For example, we use the following to configure our httpd build:
$ ./configure --prefix=/usr/local/httpd --with-berkeley-db=/usr/local/BerkeleyDB.4.2 --enable-dav --enable-so --enable-proxy --enable-rewrite --enable-ssl --with-ssl=/usr/local/ssl
If you want to verify which modules are compiled into a copy of Apache httpd, you can execute the httpd command with a -l flag, as follows:
$ /usr/local/httpd/bin/httpd -l
Compiled in modules:
As long as mod_proxy and mod_rewrite (as well as mod_ssl if you want SSL support) appear in the list, you're ready to go on to the next step.
Note: The order in which modules are activated in the Apache httpd config file can be significant. If you're using other modules that can affect how a request is handled, such as mod_alias or an application server connector like mod_jk, then it may be necessary to explicitly stack them in the proper order. Generally, the last module activated in the config file is the first to see and act on a request.
Now, we come to the real magic. By default, Apache is running on port 80 and MarkLogic Server is running on port 8005. To join them at the hip (or, more accurately, to have Apache forward requests for .xqy pages to MarkLogic), you'll need to add a few lines to your httpd.conf file, inside the Directory block where you'd like the forwarding to be active. The following example shows these lines:
RewriteEngine On
RewriteRule ^(.*xqy)$ http://localhost:8005/$1 [L,P]
These lines turn on the mod_rewrite functionality that is built into Apache httpd and tell the server to forward any URL that ends with the .xqy extension to the MarkLogic Server running on the local machine's port 8005 via Apache's proxy module (mod_proxy). Note that the proxy can forward requests to localhost as illustrated above, or to any other server that is visible to the machine that httpd is running on (which may include hosts behind the firewall that are not visible to the requesting client).
One side effect of this configuration is that all other content — images, CSS style sheets, and the like — is handled by Apache httpd. This is usually what you want as Apache httpd is tuned to serve files well and this will in turn keep load off of your MarkLogic installation. If, for some reason, you want to forward on requests for other document types, you can easily do so. See the Apache URL Rewriting Guide for ideas as to how to approach different URL rewriting tasks with mod_rewrite.
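As a rough illustration of forwarding additional document types, the pattern can be widened with an alternation. This is only a sketch: the .xml extension is an assumed example, not something this site is known to proxy, and the lines are meant to sit in the same Directory block as the rule shown earlier.

```apache
# Sketch only: forward .xqy requests and, as an assumed example, .xml requests.
RewriteEngine On

# $1 is the outer group, i.e. the whole matched path including its extension.
# [P] hands the request to mod_proxy; [L] stops further rule processing.
RewriteRule ^(.*\.(xqy|xml))$ http://localhost:8005/$1 [L,P]
```

The escaped dot makes the rule match only names that really end in the listed extensions, while everything else is still served directly by Apache httpd.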
It is important to note that the MarkLogic Server installation that you are forwarding requests to does not need to be visible to the original client. In fact, you can set up Apache httpd behind a firewall that only permits port 80 requests through and that, in turn, proxies for a MarkLogic installation running on port 8005 which isn't visible outside the firewall. This can be considered a security benefit as well, since Apache httpd is much more thoroughly tested in the shark-infested waters of the internet.
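The SSL front end mentioned earlier follows the same pattern: Apache terminates SSL and passes requests to MarkLogic over plain HTTP. Below is a minimal sketch, assuming mod_ssl and mod_proxy are built in and the app server listens on localhost port 8005 as in this article. The server name and certificate paths are placeholders, and here every request is proxied rather than only .xqy pages.

```apache
# Hypothetical SSL front end; server name and certificate paths are placeholders.
Listen 443
<VirtualHost *:443>
    ServerName www.example.com

    # Terminate SSL in Apache (requires mod_ssl).
    SSLEngine on
    SSLCertificateFile /usr/local/httpd/conf/ssl.crt/server.crt
    SSLCertificateKeyFile /usr/local/httpd/conf/ssl.key/server.key

    # Act only as a reverse proxy, never as an open forward proxy.
    ProxyRequests Off

    # Pass every request to MarkLogic over plain HTTP (requires mod_proxy).
    ProxyPass / http://localhost:8005/
    # Rewrite Location headers in redirects so clients stay on the SSL host.
    ProxyPassReverse / http://localhost:8005/
</VirtualHost>
```

With a setup along these lines, only port 443 needs to be reachable from outside; port 8005 can stay closed at the firewall.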
If you have Apache httpd and MarkLogic running on the same machine, you can use the same document root for both. For example, if you are using the document root of /space/Docs, you could set the DocumentRoot directive, and its accompanying Directory directive, to the following:
DocumentRoot "/space/Docs"
<Directory "/space/Docs">
    RewriteEngine On
    RewriteRule ^(.*xqy)$ http://localhost:8005/$1 [L,P]
</Directory>
As we pointed out, you can have MarkLogic Server installed onto a separate machine from Apache httpd. This can help in situations where you need to do a bit of load balancing. You can also use the strategy presented in this article to do more intensive load-balancing for greater scalability. For example, you could have different URL prefixes be directed to separate machines running MarkLogic. Or, you could build a round-robin script that runs on Apache to distribute the load evenly across a whole rack of MarkLogic Servers.
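Both ideas can be sketched with mod_rewrite alone. The fragment below is an assumption-laden illustration, not the configuration used on this site: the host names, URL prefixes, and map file path are hypothetical. It uses the rnd: map type, which picks a back end at random rather than in strict round-robin order. RewriteMap is not allowed inside a Directory block, so these lines belong in the main server or virtual host configuration, where patterns match the full URL path including the leading slash.

```apache
# Sketch only: host names, prefixes, and file paths below are hypothetical.
RewriteEngine On

# 1. Send different URL prefixes to different MarkLogic machines.
RewriteRule ^/catalog/(.*\.xqy)$ http://ml1.internal.example.com:8005/catalog/$1 [L,P]
RewriteRule ^/search/(.*\.xqy)$ http://ml2.internal.example.com:8005/search/$1 [L,P]

# 2. Spread all remaining .xqy requests across a pool of servers.
#    The map file holds one key whose value lists the back ends, separated by "|":
#
#      xqy ml1.internal.example.com:8005|ml2.internal.example.com:8005
#
#    The rnd: map type returns one of the listed values at random per lookup.
RewriteMap mlservers rnd:/usr/local/httpd/conf/mlservers.map
RewriteRule ^/(.*\.xqy)$ http://${mlservers:xqy}/$1 [L,P]
```

Random selection evens out load only on average. A true round-robin or load-aware scheme would need an external program map or a dedicated load balancer.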
The possibilities are endless and every site that needs to explore multiple machine configurations for scalability will arrive at a different solution. Now that you know how to use Apache httpd's mod_rewrite, however, you've got one more tool in your tool-chest for handling such problems.
Once you have this up and running, there are all kinds of custom configurations that you can do. You can run CGI scripts, PHP scripts, Servlets and JSPs, and much more all in the same URL namespace.
Apache logs all accesses in a common format in one place, and any of the commonly available web log tools can be used to analyze those logs. You can also centralize all authentication using Apache to implement single sign-on across all of the applications running on your server (or server cluster).
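In its simplest form, centralized authentication could look like the following sketch, which puts HTTP Basic authentication in front of everything Apache serves or proxies. The realm name and password file path are placeholders, and a real single sign-on setup would likely use a more capable authentication module than this.

```apache
# Sketch only: realm name and password file location are placeholders.
<Location />
    # Require a login before any content is served or proxied.
    AuthType Basic
    AuthName "Developer Site"
    # Password file created with the htpasswd utility that ships with Apache.
    AuthUserFile /usr/local/httpd/conf/htpasswd
    Require valid-user
</Location>
```

Because Apache checks access before the request is handed to mod_proxy, the same credentials cover static files and proxied .xqy pages. Basic authentication sends passwords nearly in the clear, so it is best combined with SSL.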
The sky is the limit. Just remember to make small changes to your configuration and to test things every step of the way. Apache configuration can be tricky if you're not a Zen Master.
You'll want to refer to the following resources if you get serious about running Apache and MarkLogic Server together.