Hi. Unknown Blogger here. In this space I'll be reporting bits and pieces of information about Content Interaction Server (I'm not supposed to use the acronym CIS), XQuery and/or MarkLogic.
What I write here is "officially unofficial", which is to say it comes from the people who know but it's not an official communication of MarkLogic Corporation.
Who am I? That's hard to say. Sometimes I'm one person. Sometimes I'm another. Sometimes I'm several people. Pay no attention to that blogger behind the curtain.
Preview of Content Interaction Server 2.2
This is my first blog entry so I figure it has to be a good one. I'm going to give you a peek inside the super-secret MarkLogic Research Laboratory for a preview of what's coming in Content Interaction Server version 2.2, which is due to ship in early September 2004.
There are a lot of really cool (and anxiously awaited) new features in 2.2. The server team has been hard at work over the last few months to get all this stuff in place (and they look very sharp in their starched white lab coats). I heard that the 2.2 release was delayed for a bit so that even more good stuff could be crammed in. So, let's get to it.
Wildcards in Searches
The search features in Content Interaction Server have always been amazing (not too surprising since the company founders came from InfoSeek and Google). In 2.2 they've added wildcard searches to the bag of tricks. There are two special wildcard characters: "?" which matches exactly one non-space character, and "*" which matches zero or more non-space characters.
You can control whether or not to use wildcards on a per-query basis and there can be multiple wildcards in a search word or phrase.
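To make the two wildcard rules concrete, here's a small sketch in Python (not XQuery, and not the server's API) that turns a wildcard pattern into a regular expression. The function names are made up for illustration, and the matching is case-sensitive only because that keeps the sketch short; it says nothing about how the server treats case.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard pattern into a compiled regular expression.

    "?" matches exactly one non-space character.
    "*" matches zero or more non-space characters.
    Every other character matches itself literally.
    """
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(r"\S")
        elif ch == "*":
            parts.append(r"\S*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def wildcard_match(pattern: str, word: str) -> bool:
    """Return True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None
```

With this, `wildcard_match("run*", "running")` and `wildcard_match("r?n", "ran")` both succeed, while `wildcard_match("r?n", "rn")` fails because "?" insists on exactly one character.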
Wildcards make use of another new feature: character-level indexes. Indexing by character is a database-level option. If you've enabled it for a given database then you'll likely get higher performance from fn:contains(), fn:starts-with() and fn:ends-with() in most cases as an added bonus.
With 2.2 you can use "stemming" in your searches. Stemming allows you to do searches that match words against their root words, not simply as character strings. If you need to do searches where word meaning is more important than character sequences this will make your results far more relevant. For example, if you want to search for occurrences of the word "run", a stemming search will also match on "runs", "running" and "ran" because they all stem from the same root word.
Stemming is enabled or disabled at the database level, but you can turn it off for specific queries if you don't want it. As of 2.2, stemming is available only for English.
One or more word dictionaries can be installed into the server as of 2.2. This will let you query to determine if a particular word is spelled correctly. If it isn't, you can request a list of suggested spellings. Needless to say, this can have a wide range of uses.
MarkLogic is making three sample English dictionaries available for download from the developer network site. These dictionaries come in three sizes: small, medium and large. The smaller the dictionary the more "strict" the spelling checks will appear to be. For each query you can specify which dictionary to check against.
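Here's a rough illustration of why a smaller dictionary feels "stricter". It's plain Python with two tiny made-up word lists, and it uses the standard library's `difflib` as a stand-in for whatever suggestion algorithm the server really uses. None of the names below come from the server API.

```python
import difflib

# Two tiny, made-up dictionaries standing in for the "small" and "large" samples.
SMALL = {"receive", "run", "running"}
LARGE = SMALL | {"ran", "recipe", "relieve", "snicker", "guffaw"}


def is_correct(word, dictionary):
    """A word is 'spelled correctly' if the chosen dictionary contains it."""
    return word.lower() in dictionary


def suggest(word, dictionary, limit=3):
    """Return up to `limit` dictionary words that look similar to `word`."""
    return difflib.get_close_matches(
        word.lower(), sorted(dictionary), n=limit, cutoff=0.6
    )
```

A perfectly good word like "guffaw" passes against `LARGE` but gets flagged against `SMALL`, which is the "strictness" effect. A misspelling like "recieve" is rejected by both, and `suggest("recieve", SMALL)` includes "receive".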
Continuing the theme of working with words rather than strings, you can also install one or more thesauri (collections of synonyms) into the server.
You can specify which thesaurus you want to use for a given query and you can specify whether you want all or a subset of the returned synonyms. This can be very useful for presenting "what's similar" search options to a user.
A sample thesaurus is being made available for free download from developer network to get you started.
One of the zippy-coolest new features in 2.2 is support for WebDAV (Web-based Distributed Authoring and Versioning). In rough terms, this means that a Content Interaction Server instance can look like a filesystem.
This means that you can drag-and-drop files from the Windows desktop (which speaks WebDAV) directly into or out of the server. With WebDAV-aware authoring tools you can view and edit server-resident documents as if they were just another file on your system. This will make it much easier to manage stored content. The server security policy will control who can see and/or modify documents through WebDAV.
There is now a concept of directories in Content Interaction Server. This was added to support WebDAV but should prove generally useful. Directories are similar in concept to collections. But unlike a collection, which is a group of documents that you explicitly declare to be related and that can have arbitrary, unrelated URIs, a directory groups documents into a hierarchy described by their URI values.
Once a directory exists, documents may be added to it. Depending on your database settings, directories can be created automatically when documents are loaded or you may need to create them explicitly using API calls. If two documents are in the same directory, their URI prefixes will be the same.
For example, if directories are enabled, you can tell that the documents /foo/bar/snicker.xml and /foo/bar/guffaw.xml are in the same directory because the path prefix in their URIs leads to the same directory /foo/bar. The document /foo/blah/snort.xml is in a directory named /foo/blah. Both /foo/blah and /foo/bar are in the directory /foo. Directories are very similar (intentionally) to directory structures you're familiar with in regular filesystems.
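The directory example above can be sketched in a few lines of Python. This only models the "directory is the URI prefix" idea using the standard `posixpath` module. The function names are hypothetical, and it writes directory names without a trailing slash simply because that's how they appear in this post.

```python
import posixpath


def parent_directory(uri):
    """Return the directory that directly contains the document."""
    return posixpath.dirname(uri)


def ancestor_directories(uri):
    """List every directory above the document, nearest first (root excluded)."""
    ancestors = []
    current = posixpath.dirname(uri)
    while current and current != "/":
        ancestors.append(current)
        current = posixpath.dirname(current)
    return ancestors


def same_directory(uri_a, uri_b):
    """Two documents share a directory when their URI prefixes are equal."""
    return parent_directory(uri_a) == parent_directory(uri_b)
```

So `same_directory("/foo/bar/snicker.xml", "/foo/bar/guffaw.xml")` is true, and `ancestor_directories("/foo/blah/snort.xml")` gives `["/foo/blah", "/foo"]`.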
One thing to note: the WebDAV specs use the term "collection" to refer to what are termed "directories" in Content Interaction Server. A WebDAV collection is not the same thing as a Content Interaction Server collection.
You can now explicitly set exclusionary locks at the document level. This is another feature that was added in support of WebDAV but which can be useful in other contexts. When WebDAV clients access documents in the server, document locking/unlocking is done automatically.
Locking semantics in 2.2 are aligned with the WebDAV specification. They're mandatory for updates, which means that if a document is locked and a user other than the lock holder attempts an update to the document, an exception will result. Reading a document is not affected by whether it's locked or not.
Locks can be shared or exclusive. Exclusive locks can only be granted if there are no other locks on a document. While an exclusive lock is held no other lock of either type can be granted. Shared locks allow any number of users to hold locks simultaneously on a document. Any lock holder can make updates but non-lock holders will receive an error if they try to make updates to the document.
Locks in 2.2 do not have blocking semantics. If you request a lock that can't be granted, because someone else already holds a lock on the document, it will generate an exception rather than blocking until the lock can be granted.
Document locks are persistent and survive the current execution thread and even server restarts. This makes it important to manage your locks carefully to avoid orphaned locks. If you use the new locking APIs, think about wrapping the calls in try/catch/finally blocks.
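If the lock rules are easier to follow as code, here's a toy in-memory model of them in Python. It is not the server's locking API: `LockTable`, `LockError` and the method names are all invented for this sketch, and real locks are persistent where this one lives only in memory. What it does show is the shared/exclusive rules, the "fail right away instead of blocking" behavior, and the try/finally habit that keeps locks from being orphaned.

```python
class LockError(Exception):
    """Raised when a lock can't be granted or an update isn't permitted."""


class LockTable:
    """Toy model of non-blocking shared/exclusive document locks."""

    def __init__(self):
        # uri -> (mode, set of holders); mode is "shared" or "exclusive"
        self._locks = {}

    def acquire(self, uri, user, mode):
        if mode not in ("shared", "exclusive"):
            raise ValueError("mode must be 'shared' or 'exclusive'")
        held = self._locks.get(uri)
        if held is None:
            self._locks[uri] = (mode, {user})
            return
        held_mode, holders = held
        # An exclusive request needs an unlocked document, and an existing
        # exclusive lock rules out every other lock. Fail instead of waiting.
        if mode == "exclusive" or held_mode == "exclusive":
            raise LockError("lock on %s cannot be granted to %s" % (uri, user))
        holders.add(user)

    def release(self, uri, user):
        held = self._locks.get(uri)
        if held is None or user not in held[1]:
            raise LockError("%s holds no lock on %s" % (user, uri))
        held[1].discard(user)
        if not held[1]:
            del self._locks[uri]

    def check_update(self, uri, user):
        """Updates on a locked document are allowed for lock holders only."""
        held = self._locks.get(uri)
        if held is not None and user not in held[1]:
            raise LockError("%s may not update locked document %s" % (user, uri))


# Usage: always release in a finally block so a failure can't orphan the lock.
locks = LockTable()
locks.acquire("/foo/bar/snicker.xml", "alice", "exclusive")
try:
    locks.check_update("/foo/bar/snicker.xml", "alice")  # alice holds the lock
finally:
    locks.release("/foo/bar/snicker.xml", "alice")
```

Reads don't appear in the model at all, which matches the rule that reading a document is never affected by locks.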
BLOBs and CLOBs
Prior to 2.2, every document stored in the database had to be valid XML. But there are many reasons to store non-XML files in there as well. Now it's possible to store arbitrary text and binary data in the database.
Non-XML text files will be stored and indexed. You obviously can't apply XPath expressions to these files, but you will be able to search them by character, word or phrase. With the new dictionary and thesaurus features described earlier there's a lot you can do with plain old text documents in 2.2.
Binary files are opaque blobs to the server. File types are deduced from the file name extension (.pdf, .jpg, .txt, etc.), which maps to a MIME type. The mapping between MIME type and document format is configured cluster-wide but you can override it on a per-query basis.
The mapping of extensions to mime types lets you drag and drop any file into the server via WebDAV and have it automatically handled properly. There are also new API functions that allow developers to work with text and binary documents.
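As a rough picture of how an extension-driven mapping with a per-query override might behave, here's a Python sketch. The table contents, the fallback to binary for unknown extensions, and the `classify` function are all assumptions made for illustration. The real mapping lives in the server's cluster configuration.

```python
import posixpath

# Hypothetical cluster-wide table: extension -> (MIME type, document format).
DEFAULT_MIMETYPES = {
    ".xml": ("text/xml", "xml"),
    ".txt": ("text/plain", "text"),
    ".pdf": ("application/pdf", "binary"),
    ".jpg": ("image/jpeg", "binary"),
}

# Assumption: anything unrecognized is treated as an opaque binary.
FALLBACK = ("application/octet-stream", "binary")


def classify(uri, overrides=None):
    """Return (MIME type, format) for a document URI.

    `overrides` plays the role of a per-query override of the shared table.
    """
    table = dict(DEFAULT_MIMETYPES)
    if overrides:
        table.update(overrides)
    extension = posixpath.splitext(uri)[1].lower()
    return table.get(extension, FALLBACK)
```

For example, `classify("/books/cover.JPG")` yields `("image/jpeg", "binary")`, and passing `overrides={".txt": ("text/plain", "binary")}` changes how a single call treats `.txt` files without touching the shared table.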
This is an important addition. It means that you can now store all the files that are logically related in one place. For example, text and illustrations of books, website content, application data, you name it.
Explicit Document Placement
When you load a document, you can now specify which forest it should be placed in. This can be useful for load-balancing in demanding operational environments.
Backup and Restore
It used to be that backups were done per-forest. You could back up an individual forest without taking it offline, but that was not enough if you needed a consistent snapshot of all forests in the database.
As of 2.2, you can do a hot backup of the entire database and get a snapshot that's consistent across all forests. The backup can also save the schemas, modules and security databases so you have an all-in-one archive. You'll be able to restore individual forests from the database-level backup if needed.
Several enhancements have been made to XDBC to support the new features in 2.2. There have also been a bunch of bug fixes, performance enhancements and optimizations.
- External Variables
- It's now possible to set external variables that will be passed to the server and be made available to the query when it runs. This is roughly equivalent to JDBC Prepared Statements. External variables can also be used with xdmp:eval() apart from XDBC.
- Get Reader
- You can now get a Reader for the current result item without advancing the cursor. This fixes an annoyance where you had to decide before advancing the cursor whether you wanted a Reader or not.
- You can now pass in an instance of your own logger so that XDBC messages will go to your common log file.
- It's now possible to set distinct user/password information on multiple connections concurrently. This was always implied by the API, but the implementation didn't properly support it until 2.2.
- Binary Streams
- Now that you can store BLOBs in the database, you need a way to pull them out via XDBC. It's now possible to get an InputStream instance from a ResultSequence that will let you read out a binary document.
HTTP POST Body
And lastly, it's now possible to retrieve the body of an HTTP POST submitted to the server. This is handy for situations where you need to gain access to the "payload" in the body of the request. An example of this is the SOAP protocol which uses POST to submit an XML envelope containing a message to be processed.
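To show the general idea of pulling the "payload" out of a POST, here's a Python analogy (again, not the server's API). It reads the raw body from a WSGI-style request environment and then digs the first child out of a SOAP 1.1 Body element. The helper names are made up, and the namespace is the standard SOAP 1.1 envelope namespace.

```python
import io
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def read_post_body(environ):
    """Read the raw request body from a WSGI-style environ dictionary."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    return environ["wsgi.input"].read(length)


def soap_body_payload(raw):
    """Parse a SOAP envelope and return the first element inside its Body."""
    envelope = ET.fromstring(raw)
    body = envelope.find("{%s}Body" % SOAP_NS)
    if body is None or len(body) == 0:
        raise ValueError("no SOAP Body payload found")
    return body[0]


# Usage: fake a POST request carrying a tiny SOAP message.
message = (
    b'<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    b"<soap:Body><ping>hello</ping></soap:Body>"
    b"</soap:Envelope>"
)
environ = {
    "REQUEST_METHOD": "POST",
    "CONTENT_LENGTH": str(len(message)),
    "wsgi.input": io.BytesIO(message),
}
payload = soap_body_payload(read_post_body(environ))
```

Here `payload` is the `ping` element from the envelope, ready for whatever processing the message calls for.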
If you'll be upgrading from 2.1 to 2.2, the upgrade will update the forests on disk. Among other things, that means you'll need to do a clean shutdown before upgrading. Be sure to read and follow the installation instructions to avoid any problems.
As with 2.1, 2.2 complies with the W3C May 2, 2003 Working Group Draft Recommendations for XQuery and XPath. We don't plan to change this until the final draft is approved (sometime next year, probably). See Mary's Standards Issues update for the current status of the standards.
Well, that's it for now. I hope this informal preview has been helpful to you. If you have any questions or comments, please post to the developer network General Mailing list (sign up here). Once 2.2 is released, drop us a line and let us know what you think and how you're putting the jazzy new 2.2 features to use.