Document formats, part 3: Changing formats

by Evan Lenz

I ended the last installment with this question: What happens if you load a bunch of HTML documents as "text" and then realize you want to query them as XML (using the full power of XPath and XQuery)? In this article, I'll show you exactly how I solved this problem (and how you can too).

Before I go into the solution I used, it's worth mentioning the simplest approach: re-load all the documents, this time with the format option you want (<format>xml</format>). xdmp:document-load() will replace the documents you loaded the first time, exchanging the old text documents for the new XML documents. That's the approach I'd generally recommend.
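
As a minimal sketch of such a re-load, assuming the original file is still available on disk (the file path and target URI here are hypothetical placeholders, not the ones I actually used):

```xquery
xquery version "1.0-ml";

(: Hypothetical source path and target URI; substitute your own. :)
xdmp:document-load("/space/docs/dotnet/index.html",
  <options xmlns="xdmp:document-load">
    <uri>/pubs/4.2/dotnet/index.html</uri>
    <format>xml</format>
  </options>)
```

Loading to a URI that already holds a document replaces that document, so no separate delete step is needed.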

But what happens if you don't have the original documents around? Or you don't know where they came from? Or they came from disparate sources and it's been a long time since you loaded them? Wouldn't it be nice to just flip a switch and tell MarkLogic Server to turn these text documents into XML documents? Well, it's not quite as simple as that (because not all text can be trivially converted to XML), but there is a way.

The key thing to remember is that a document's format is not some property that's external to the document itself. The document's format is an emergent property of its content, if you will. In other words, when we talk about a document's format, all we're really talking about is what kind of node the root document node contains: an element node, a text node, or a binary node. If we want to convert a text document into an XML document, we need to parse the string-value of the text node as XML and replace the existing document with the newly parsed element tree.
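
To make that concrete, here is a small sketch that reports a document's format purely by looking at what the document node contains. The URI is a hypothetical example, and the "other" branch is just a catch-all for anything that matches none of the three tests.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI; point this at any document in your database. :)
let $d := doc("/pubs/4.2/dotnet/index.html")
return
  if ($d/*)             then "xml"     (: document node has an element child :)
  else if ($d/text())   then "text"    (: document node has a text node child :)
  else if ($d/binary()) then "binary"  (: document node has a binary node child :)
  else "other"
```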

In my case, all the docs I had loaded were in the same directory, but not all of those files were HTML docs. The first thing I did was list all the file extensions that appear in that directory, using this query:

declare variable $docs := xdmp:directory("/pubs/4.2/dotnet/","infinity");
distinct-values($docs/tokenize(base-uri(.),'\.')[last()])

This yielded the following list:

html
gif
js
hhc
hhk
hhp
chm
css
log

From a quick look at this list, I could see that the only files I was interested in were the ".html" files, so I narrowed $docs down further:

declare variable $docs := xdmp:directory("/pubs/4.2/dotnet/","infinity")
                          [ends-with(base-uri(.),'.html')];

The next step was to figure out how many of these were already XML and how many were still just text documents:

concat("Total: ",count($docs)),
concat("XML: "  ,count($docs[*])),
concat("Text: " ,count($docs[text()]))

This showed that all of them were still text documents:

Total: 651
XML: 0
Text: 651

Thus, my final goal was to get "XML" to say 651 and "Text" to say 0.

If the text of all of these documents happened to contain well-formed XML, then all I needed to do was parse them using xdmp:unquote(), and replace the existing document with the newly parsed XML document. But before I did that, I wanted to make sure this was going to work:

$docs/xdmp:unquote(.)

This yielded an error, complaining about the presence of an undefined entity reference (&nbsp;) in one of my HTML docs. So I knew I was going to need some clean-up or repair. Luckily, MarkLogic Server provides a lot of tools for doing just that. First of all, xdmp:unquote() takes a list of options in its third argument (the second argument is a default namespace you can have applied to the result):

$docs/xdmp:unquote(., "", "repair-full")

The "repair-full" option doesn't repair all kinds of potential well-formedness errors, but it does a good job with things like detecting missing end tags and inserting them as necessary. Here's the final script I used:

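xquery version "1.0-ml";
(: All documents under the directory, at any depth, whose URI ends in .html :)
declare variable $docs := xdmp:directory("/pubs/4.2/dotnet/","infinity")
                          [ends-with(base-uri(.),'.html')];

for $doc in $docs
let $parsed :=
  try {
    (: First attempt: parse the text as is.
       xdmp:log() returns the empty sequence, so $parsed holds only the parsed result. :)
    xdmp:unquote($doc),
    xdmp:log(concat("Well-formed: ", base-uri($doc)))
  }
  catch ($e) {
    (: Fallback: parse again with repair enabled :)
    xdmp:unquote($doc,"","repair-full"),
    xdmp:log(concat("Repaired: ", base-uri($doc)))
  }
return
  (: Store the parsed XML at the same URI, replacing the text document :)
  xdmp:document-insert(base-uri($doc), $parsed)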

Basically, it attempts to parse the text as is. If that fails, it tries again using the "repair-full" option. It then takes the parsed result and stores it in the database at the same URI, replacing the original.

WARNING: Be sure to back up your documents before trying anything like this (programmatically replacing documents). In practice, this involved a lot of trial and error before I no longer had any issues, and on more than one occasion I ended up inadvertently inserting an empty document.
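
One way to guard against the empty-document problem is to check the parsed result before writing it. The following is only a sketch of that idea, not the script I ran: the check (exactly one parsed document, with a root element) and the "Skipped" log message are my own assumptions about what counts as a safe result.

```xquery
xquery version "1.0-ml";
declare variable $docs := xdmp:directory("/pubs/4.2/dotnet/","infinity")
                          [ends-with(base-uri(.),'.html')];

for $doc in $docs
let $uri := base-uri($doc)
let $parsed :=
  try { xdmp:unquote($doc) }
  catch ($e) { xdmp:unquote($doc, "", "repair-full") }
return
  (: Only replace the original if parsing produced a single document
     that actually has a root element. :)
  if (count($parsed) eq 1 and exists($parsed/*))
  then xdmp:document-insert($uri, $parsed)
  else xdmp:log(concat("Skipped (no root element): ", $uri))
```

Documents that get skipped stay in the database as text, so they still show up in the "Text" count and can be inspected by hand.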

The above script was sufficient for some of the directories I needed to update. But I ran into other problems in other directories, where I had to clean up the text with calls to fn:replace() before calling xdmp:unquote(). Another option is xdmp:tidy(), which provides the power of the popular HTML Tidy tool for converting HTML to well-formed XML.
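
Here is a rough sketch of both alternatives applied to a single document. The URI is hypothetical, and the particular replacement shown (turning the undefined &nbsp; entity into a numeric character reference) is just an illustrative guess at the kind of clean-up involved, not a record of the exact calls I made.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI of a document that is still stored as text. :)
let $text := string(doc("/pubs/4.2/dotnet/index.html"))

(: Alternative 1: pre-clean the string, then parse it.
   In an XQuery string literal, "&amp;nbsp;" stands for the characters &nbsp;
   and "&amp;#160;" stands for the characters &#160;. :)
let $cleaned     := replace($text, "&amp;nbsp;", "&amp;#160;")
let $via-replace := xdmp:unquote($cleaned, "", "repair-full")

(: Alternative 2: let HTML Tidy do the clean-up.
   xdmp:tidy() returns a status node first and the cleaned result second. :)
let $via-tidy := xdmp:tidy($text)[2]

return ($via-replace, $via-tidy)
```

Either result could then be passed to xdmp:document-insert() in place of $parsed in the script above, after checking that it looks the way you expect.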

After running the final version of my script, I tested the results to make sure they were all XML now, and no text documents were left:

Total: 651
XML: 651
Text: 0

That's exactly the output I wanted to see! All 651 docs were converted from text to XML.

Comments

  • I thought that once a document was inserted in some format, you needed to delete it first to insert it again with a new format. You are also inserting it again with the original URI, so how does the system know it should take 'xml' as the format (which is usually derived from the URI)?
    • Hi Geert, good questions. There's no need to separately delete a document in order to replace it. The xdmp:document-insert() function does that automatically. Remember: the format is just an aspect of a document's content—a description of what type of node the document node has as its child: element, text, or binary. It's not a separate piece of metadata that has to be updated. Since xdmp:document-insert() takes a document node argument (rather than a string or file name), the format of the document is already determined, because the document has already been constructed. Only the functions that are responsible for constructing documents (loading and parsing a string or file into a tree structure) need be concerned about what format should be used to interpret the input. Those would include xdmp:document-get(), xdmp:document-load() (which combines document-get with document-insert), and xdmp:unquote() (whose format option defaults to "format-xml").
      • Odd, I really thought there was some complication when changing the format of an existing document. Something to do with indexes perhaps? But that might have been a year or two back. Your reasoning makes total sense, and that complication didn't. I tried searching markmail for the thread about it, but couldn't find it. Thanks for the heads up on determination of format, you are totally right. I got things mixed up. :-/