Document formats in MarkLogic Server, part 1

by Evan Lenz

MarkLogic Server supports many different document formats, including text, XML, image files, Word, Excel, PowerPoint, and executables. But when you stand back and look at how these are represented, there are really only three document formats:

  • XML
  • text
  • binary

For example, Word docs can be represented as binary (.doc) or XML (.docx components), image files and executables are binary, and text files are, well, text.

But after you load a document, how can you tell what its format is? Is it stored as metadata somewhere outside the document? No, the answer lies in the document itself. We can generalize even further: every document in MarkLogic Server is represented as a tree, regardless of its format.

XML documents

Take the following XML document:

<doc><title>My doc</title><body>Some text.</body></doc>

It is represented in MarkLogic as a tree structure, according to the XPath 2.0 data model:

   document-node
         |
       <doc>
       /   \
 <title>   <body>
    |         |
"My doc"    "Some text."

You may be wondering why my original XML document doesn't include any line breaks or indents. The answer is that every bit of whitespace beneath the root element would have to be considered significant, and the resulting tree structure would challenge my ASCII art skills too much, so I decided to keep it simple.

Notice that the root of the tree is not an element. It's a "document node." That's XPath-speak for the invisible container that every document has. The advantage of storing documents in XML format is that you can query the tree structure using XQuery/XPath expressions such as /doc/title. And with MarkLogic, you can perform searches that are aware of, and can be qualified by, this tree structure.

How can you tell if a document is in XML format? You look in the tree itself. If the document node contains an element, then it's in XML format:

exists(doc("my-doc-uri")/*)

which is equivalent to the more explicit version:

exists(doc("my-doc-uri")/element())
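To see the tree from the inside, here is a minimal sketch that builds the sample document in memory with a document constructor rather than loading it from a database (that substitution is my assumption, made only so the example is self-contained). It checks that the root is a document node and not an element, walks a path into the tree, and then applies the XML test:

```xquery
xquery version "1.0-ml";

(: Build the sample document in memory instead of reading it with doc(). :)
let $doc := document { <doc><title>My doc</title><body>Some text.</body></doc> }
return (
  $doc instance of document-node(),  (: true: the root is a document node :)
  $doc instance of element(),        (: false: the root is not an element :)
  name($doc/*),                      (: "doc": the single element child :)
  string($doc/doc/title),            (: "My doc" :)
  exists($doc/*)                     (: true: so this is an XML document :)
)
```

With a stored document, you would replace the constructor with doc("my-doc-uri") and the rest of the expressions would stay the same.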

Text documents

How is a text document, like this one, represented?

I'm a text document.

Well, since every document is a tree, this one is too (albeit a simple one):

     document-node
            |
  "I'm a text document."

The XPath data model allows document nodes to contain any node that an element can contain (text, elements, comments, and processing instructions). In other words, the data model can represent documents that aren't well-formed XML, such as text documents. In a text document, the tree is always simple: a document node containing exactly one text node.

How can you tell if a document is in text format? You look in the tree itself. If it contains a text node child, then it's a text document:

exists(doc("my-doc-uri")/text())
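As a quick illustration, the following sketch builds the same one-node tree in memory (again using a document constructor as a stand-in for a stored document, which is my assumption) and confirms that the document node has a single text child and no element child:

```xquery
xquery version "1.0-ml";

(: A document node whose only child is a text node. :)
let $doc := document { text { "I'm a text document." } }
return (
  count($doc/node()),   (: 1: exactly one child :)
  exists($doc/text()),  (: true: the child is a text node :)
  exists($doc/*),       (: false: there is no element child :)
  string($doc)          (: "I'm a text document." :)
)
```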

Binary documents

Binary documents are also (trivial) trees, like text documents. But in this case, the XPath 2.0 data model by itself does not suffice. For that reason, MarkLogic Server extends the data model to include binary nodes. These are special and can occur only as the sole child of a document node. For example, the storage of a JPG file would look like this:

document-node
      |
 binary-node

The "1.0-ml" flavor of XQuery (as opposed to "1.0") supports access to these nodes via its binary() node test. To test whether a document is a binary document, you test for the presence of a binary node:

exists(doc("my-doc-uri")/binary())
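The three tests can be combined into one helper. This is only a sketch: the function name local:format, the labels it returns, and the order of the checks are my own choices, not part of any MarkLogic API.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: reports a document's format by inspecting
   the children of its document node. :)
declare function local:format($doc as document-node()?) as xs:string
{
  if (empty($doc)) then "missing"                 (: no document at that URI :)
  else if (exists($doc/binary())) then "binary"   (: binary() requires "1.0-ml" :)
  else if (exists($doc/element())) then "xml"
  else if (exists($doc/text())) then "text"
  else "unknown"
};

(
  local:format(document { <doc><title>My doc</title></doc> }),  (: "xml" :)
  local:format(document { text { "I'm a text document." } }),   (: "text" :)
  local:format(doc("my-doc-uri"))  (: depends on what is stored at that URI :)
)
```

The first two calls use documents constructed in memory so the results are predictable; the last one applies the same check to a stored document.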

Stay tuned for part 2, on how you can determine what format your documents will be loaded as.
