MarkLogic Server can support many different document formats (such as text, XML, image files, Word, Excel, PowerPoint, executables, etc.). But when you stand back and look at how these are represented, there are really only three document formats: XML, text, and binary.
For example, Word docs can be represented as binary (.doc) or XML (.docx components), image files and executables are binary, and text files are, well, text.
But after you load a document, how can you tell what its format is? Is it stored as metadata somewhere outside the document? No, the answer lies in the document itself. We can generalize even further: every document in MarkLogic Server is represented as a tree, regardless of its format.
Take the following XML document:
It is represented in MarkLogic as a tree structure, according to the XPath 2.0 data model:
You may be wondering why my original XML document doesn't include any line breaks or indents. The answer, of course, is that every bit of whitespace beneath the root element would have to be considered significant, and the resulting tree structure would overtax my ASCII art skills, so I decided to keep it simple.
Notice that the root of the tree is not an element. It's a "document node." That's XPath-speak for the invisible container that every document has. The advantage of storing documents in XML format is that you can query the tree structure, using XQuery/XPath expressions such as /doc/title. And with MarkLogic, you can perform searches that are aware of and can be qualified by this structure.
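To make the document-node point concrete, here is a minimal sketch in standard XQuery. It builds a small in-memory document and checks what sits at the root of the tree. The element names (doc, title, para) and their content are illustrative assumptions, chosen only so that the /doc/title path has something to match.

```xquery
xquery version "1.0";

(: Hypothetical in-memory document; names and text are illustrative only. :)
let $d := document { <doc><title>Hello</title><para>World</para></doc> }
return (
  $d instance of document-node(),  (: the root of the tree is a document node :)
  $d instance of element(),        (: ...and it is not an element :)
  $d/* instance of element(doc),   (: the document node's only child is the doc element :)
  $d/doc/title/string()            (: an XPath query over the tree structure :)
)
```

Under these assumptions the four expressions evaluate to true, false, true, and the string "Hello".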
How can you tell if a document is in XML format? You look in the tree itself. If the document node contains an element, then it's in XML format:
which is equivalent to the more explicit version:
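A minimal sketch of such a check, assuming a hypothetical URI (/example.xml) for a document that has already been loaded. The first expression uses the shorthand wildcard step and the second spells out the same test with the element() kind test. For any given document, both should return the same boolean.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI; substitute a document that exists in your database. :)
let $uri := "/example.xml"
return (
  exists(doc($uri)/*),         (: shorthand: any element child of the document node :)
  exists(doc($uri)/element())  (: explicit: the same test written with element() :)
)
```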
How is a text document, like this one, represented?
Well, since every document is a tree, this one is too (albeit a simple one):
The XPath data model allows document nodes to contain any node that an element can contain (text, elements, comments, and processing instructions). In other words, it can model documents that aren't well-formed XML (such as text documents). In a text document, the tree is always simple: a document node containing exactly one text node.
How can you tell if a document is in text format? You look in the tree itself. If it contains a text node child, then it's a text document:
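As a rough illustration, the sketch below first models a text document in memory, then applies the same kind of test to a stored document. The sample text and the URI /example.txt are assumptions made for the example.

```xquery
xquery version "1.0-ml";

(: In-memory model of a text document: a document node with one text node child. :)
let $t := document { text { "This is a text document." } }
(: Hypothetical URI; substitute a text document that exists in your database. :)
let $uri := "/example.txt"
return (
  count($t/node()),                (: exactly one child :)
  $t/node() instance of text(),    (: and that child is a text node :)
  exists($t/*),                    (: no element child, so not XML format :)
  exists(doc($uri)/text())         (: the same text test against a stored document :)
)
```

The first three expressions should yield 1, true, and false. The last one depends on what is stored at the URI you supply.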
Binary documents are also (trivial) trees, like text documents. But in this case, the XPath 2.0 data model by itself does not suffice. For that reason, MarkLogic Server extends the data model to include binary nodes. These are special: a binary node can occur only as the sole child of a document node. For example, the storage of a JPG file would look like this:
And the "1.0-ml" flavor of XQuery (as opposed to "1.0") supports access via its binary() node test. To test whether a document is a binary document, you test for the presence of a binary node child of the document node.
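Putting the three tests together, a small helper can report a document's format by inspecting the child of its document node. This is only a sketch: local:format is a hypothetical function name, /example.jpg is a hypothetical URI, and the helper returns the empty sequence for a missing document or for any node kind not covered here.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: classify a stored document by looking inside its tree. :)
declare function local:format($uri as xs:string) as xs:string?
{
  let $root := doc($uri)
  return
    if (empty($root)) then ()                         (: no document at this URI :)
    else if (exists($root/binary())) then "binary"    (: binary() requires "1.0-ml" :)
    else if (exists($root/*)) then "xml"              (: element child: XML format :)
    else if (exists($root/text())) then "text"        (: lone text node: text format :)
    else ()                                           (: anything not covered here :)
};

(: Hypothetical URI; substitute a document that exists in your database. :)
local:format("/example.jpg")
```

The binary() test is the piece that needs the "1.0-ml" dialect. The element and text tests are plain XPath.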
Stay tuned for part 2, on how you can determine what format your documents will be loaded as.