text() is a code smell

by Evan Lenz

In a lot of the XQuery code I've come across, I see the text() node test being used. It is, of course, perfectly legal and valid to use. In some cases, it's essential. But most often I see text() being used where the string() function should be used instead. In these cases, the code is a sitting duck, waiting to be broken. Moreover, since text() is rarely needed in simple queries and, in my experience, its inappropriate uses far outweigh its appropriate uses, I feel perfectly comfortable positing this: text() is a code smell. If you have a habit of using text() a lot, then read on.

The simplest way to explain what I mean is by example. Consider the following query, which is intended to output one <li> element per <item> element in the input:

<ul>{
  for $item in doc("/test.xml")/list/item/text()
  return <li>{$item}</li>
}</ul>

And consider some sample input (at "/test.xml"):

xdmp:document-insert("/test.xml",
<list>
  <item>My first item</item>
  <item>My second item</item>
  <item>My third item</item>
</list>
)

Running the above query yields exactly the desired result:

<ul>
  <li>My first item</li>
  <li>My second item</li>
  <li>My third item</li>
</ul>

So what's the problem? text() did just fine. Isn't it essential here? If we were to leave it out:

<ul>{
  for $item in doc("/test.xml")/list/item(:/text():)
  return <li>{$item}</li>
}</ul>

Then the <item> elements themselves would get copied to the result, and that's not what we want:

<ul>
  <li>
    <item>My first item</item>
  </li>
  <li>
    <item>My second item</item>
  </li>
  <li>
    <item>My third item</item>
  </li>
</ul>

Okay, good point. But now let me throw a wrench into your code. Let's say test.xml now looks like this:

xdmp:document-insert("/test.xml",
<list>
  <item>My first<!--was the second--> item</item>
  <item>My second item</item>
  <item>My third item</item>
</list>
)

Here is the new output from the original query that uses text():

<ul>
  <li>My first</li>
  <li> item</li>
  <li>My second item</li>
  <li>My third item</li>
</ul>

Uh-oh. Now it's doing what the code instructed (output one <li> per text node), but it's not doing what we wanted. The presence of the comment (<!--was the second-->) caused the first <item> to contain not one but two text nodes. No matter, we just need to fix the code. Let's move the text() part to the expression inside the <li> element constructor:

<ul>{
  for $item in doc("/test.xml")/list/item
  return <li>{$item/text()}</li>
}</ul>

Now we get the original desired output. Problem solved, right? Not so fast. I've got another wrench:

xdmp:document-insert("/test.xml",
<list>
  <item>My first item</item>
  <item>My <em>second</em> item</item>
  <item>My third item</item>
</list>
)

Here's what our newly fixed query outputs for the above document:

<ul>
  <li>My first item</li>
  <li>My  item</li>
  <li>My third item</li>
</ul>

Where did the word "second" go? Well, it's not a text node child of <item>, so it didn't get copied through. Only the text node children of <item> were copied through, just as your code instructed.

You may be thinking: give me a break. You changed the data, and your code broke. Big deal. That doesn't mean text() is a code smell. After all, I can fix it again like this (by using //text() instead of /text()):

<ul>{
  for $item in doc("/test.xml")/list/item
  return <li>{$item//text()}</li>
}</ul>

To which I would reply: do you always want to have to set yourself up for chasing down subtle bugs when seemingly innocuous changes are made to your data? There's a more robust way. It's called the string() function:

<ul>{
  for $item in doc("/test.xml")/list/item
  return <li>{string($item)}</li>
}</ul>

The string() function converts its argument to a string. In the case of a node, it returns the string-value of the node. In the case of an element (or a document node), the string-value of the node is the concatenation of the string-values of all its descendant text nodes. Lo and behold, that fits the bill perfectly! Now it doesn't matter if any inline markup constructs (sub-elements, comments, or processing instructions) are added. Our code will continue to work as intended.

If you call string() with no arguments, the context node is taken to be the implicit, default argument. In other words, string() is short for string(.). That means you can use it as a direct replacement for text() in many cases, and it will give you the desired result without being so easily breakable:

<ul>{
  for $item in doc("/test.xml")/list/item
  return <li>{$item/string()}</li>
}</ul>

As I mentioned before, the text() node test (remember it's not a function even though it looks like one) has its perfectly legitimate uses. Out of curiosity, I did a search on the code base for RunDMC (the XQuery and XSLT code that runs this site), and I only found two instances of text() in all of the view-related code. They were both cases where using text() was essential (and string() couldn't be used instead). In contrast, there were many uses of string(). I mention this not in order to suggest that my code is absolutely exemplary (Ha! I only hope you don't look around too much), but that I'm at least practicing what I'm preaching. If you've got a different view, feel free to share it in the comment section below!

Comments

  • great article and yes its an old chestnut(), but I would not totally discount this usage ... as with anything there can be a time and place when putting such code 'smells' in has a purpose. for example, when there is (the rare) requirement of having to run xquery across several different XQuery processors. there is a surprising amount of differences between all the major XQuery implementations in how they atomize values ... sometimes its just bad implementation, but more often then not its because the XQuery spec itself can get vague on things and leaves it up to the implementator to decide upon.
    • Good to know. Of course, getting the string-value shouldn't require atomization (although implementations may do it that way, which could be what causes the problems you're alluding to). My only recommendation in that case would be to heavily comment your code as to why you're doing it that way (essentially a workaround), so it's a clear intention rather than a code smell. :-)
  • Great post! How would you distinguish string() from data()?
    • Good question. data() gives you the atomized value of a node sequence, i.e. each node's typed value; whereas string() gives you the string-value of a node. data() invokes atomization explicitly, whereas atomization often occurs implicitly, e.g. @foo + 1. If you're not using a schema, then the typed value behaves like a string but is annotated as xs:untypedAtomic rather than xs:string, which allows it to be used in situations like @foo + 1 (assuming @foo can be converted to a number). I don't use the data() function very often. I think it's most useful if you're using a schema.
  • An approach I like to take is to use strong typing (see http://blakeley.com/blogofile/archives/518/): <ul>{   for $item as xs:string in doc("/test.xml")/list/item   return <li>{$item}</li> }</ul> It seems to work nicely in my use cases and allows me to better "write what I mean" with my code.
    • Hmm, interesting. I do see that works, even though I wouldn't have expected it too. It's converting the type, not just requiring it, whereas I think "as" expressions (strong typing) are about requiring that the types match. Since they don't match in this case, I'd still recommend using an explicit conversion here, because otherwise you may be relying on implementation-specific behavior.
      • That makes sense.  I guess I was thinking of the type declaration as something like "this is the type I am interpreting this node as - atomize and validate..."  While that may work fine in my current implementation, I could see how it may not be portable.  Maybe it working is just luck?
  • Things start to get even more interesting if you include attributes. I kept wondering countless times why a certain value didn't appear in some HTML table, finding out I inserted the entire attribute instead of just the value..
    • Yes, that's a good one too! In XSLT, curly braces are used only for attribute value templates (so the value is always converted to a string automatically). But in XQuery, that's not true when used in element content. So I've wanted to write this before: &lt;foo>{@bar}&lt;/foo>, not realizing that @bar was getting <em>copied</em> to the result, not just its string-value. So I had to change it to &lt;foo>{string(@bar)}&lt;/foo>.