Document formats, part 2: Loading different formats

by Evan Lenz

In part 1, we saw how MarkLogic Server supports three basic document formats (XML, text, and binary), how each of them is modeled as a tree, and how you can determine the format of a document that has already been loaded into the database. This short blog installment shows how you can control the format when first constructing and loading the document.

There are several functions relating to constructing documents and inserting them into the database:

  • xdmp:document-get() fetches a resource external to the database and constructs a document node (tree)
  • xdmp:document-insert() inserts a given document node (tree) into the database
  • xdmp:document-load() combines both of the above (creates a tree and inserts it into the database)

With xdmp:document-insert(), the format of the document is already determined, because the in-memory tree has already been built: the document node exists, and it contains either an element (XML), a text node, or a binary node. But with xdmp:document-get() and xdmp:document-load(), the document node hasn't been constructed yet, so the Server needs some way to decide how to construct the tree (that is, what format to use).
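
As a rough illustration of the xdmp:document-insert() case, the sketch below builds two document nodes in memory and reports the format that each one's child node implies. The local:format-of() helper and the sample content are hypothetical, introduced only for this example; binary() is MarkLogic's node test for binary nodes.

xquery version "1.0-ml";

(: Hypothetical helper: report the format implied by the kind of node :)
declare function local:format-of($node as node()) as xs:string
{
  typeswitch ($node)
    case element() return "xml"
    case text()    return "text"
    case binary()  return "binary"
    default        return "other"
};

(: One document with an element child, one with a text node child :)
for $doc in (document { <greeting>hello</greeting> },
             document { text { "hello" } })
return local:format-of($doc/node())

Passing either of these document nodes to xdmp:document-insert() would store it in the format already implied by its tree, with no format option involved.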

There are two ways this can happen: explicitly or implicitly. First, let's look at the explicit approach.

Explicitly setting the format

Both xdmp:document-get() and xdmp:document-load() provide an explicit "format" option for specifying either "text", "binary", or "xml". For example, this call to xdmp:document-get() tells the Server to construct a document node in XML format (i.e. to use the XML parser) when loading file.html from the disk:

declare variable $doc := xdmp:document-get("c:\myFiles\file.html",
                            <options xmlns="xdmp:document-get">
                              <format>xml</format>
                            </options>);

The above constructs the in-memory tree. To insert it into the database, you could then use xdmp:document-insert():

xdmp:document-insert("/file.html", $doc)

If you want to do both in one step, then you could use the xdmp:document-load() function instead:

xdmp:document-load("c:\myFiles\file.html",
   <options xmlns="xdmp:document-load">
     <uri>/file.html</uri>
     <format>xml</format>
   </options>)

Either way, the <format>xml</format> part determines that the file will be parsed as XML and constructed into an element tree.

Implicitly setting the format

If you don't explicitly set the format, then MarkLogic Server uses the format that's configured for the MIME type corresponding to the file extension of the file being loaded. For example, what happens if we remove the explicit "format" option from our call to xdmp:document-get()?

declare variable $doc := xdmp:document-get("c:\myFiles\file.html");

In this case, the file extension is ".html", which corresponds to the "text/html" MIME type and the document format of "text". The resulting tree structure for $doc in this case will not be an element tree; instead, the document node will contain exactly one text node child.
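
One way to confirm this is to inspect the child of the document node that comes back, and to ask the Server which format its MIME type configuration assigns to the extension. This is a sketch that assumes the same file path as above and the default MIME type settings; the "/file.html" URI passed to xdmp:uri-format() is only a stand-in for something ending in ".html".

declare variable $doc := xdmp:document-get("c:\myFiles\file.html");

(
  (: the format configured for the ".html" extension :)
  xdmp:uri-format("/file.html"),
  (: true if the document node's only child is a text node :)
  $doc/node() instance of text()
)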

If you don't want this to happen, because, for example, you know that the HTML file is well-formed and you want it to be parsed as XML, then you can either explicitly set the format option (as we did above), or you can change the MIME type configuration in the Admin Interface:

[Screenshot: the MIME Types Configuration page in the Admin Interface, listing each MIME type's name, file extensions, and document format]

Scroll down to the "text/html" listing, and change the format from "text" to "xml":

[Screenshot: the "text/html" entry, with extensions "html,htm" and the format set to "xml"]

Do this only if you are sure that all your .htm and .html files will either be well-formed (and so won't cause an XML parsing error) or be repairable on load using MarkLogic's repair options.
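
If you are relying on repair rather than on guaranteed well-formedness, the repair level can also be requested per call. The following is a sketch only, reusing the file path and URI from the earlier examples; it assumes that full repair is enough to fix whatever is wrong with the markup, which will not be true of every HTML file.

xdmp:document-load("c:\myFiles\file.html",
   <options xmlns="xdmp:document-load">
     <uri>/file.html</uri>
     <format>xml</format>
     <repair>full</repair>
   </options>)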

What happens if you need to do this after the fact? Let's say you loaded hundreds of HTML files and didn't realize they were being loaded as "text" until later, when you tried to navigate their elements using XPath. (I'm speaking from experience here.) That's what I'll cover in part 3. Stay tuned.
