Should you use namespace wildcards in XPath?

by Evan Lenz

Have you ever wished you could just skip having to deal with namespaces in your content? One way to do this is to avoid using namespaces altogether (i.e., avoid any xmlns or xmlns:* declarations in your XML content). But given that namespaces are in widespread use, both in standard XML vocabularies and in custom application data, that isn't always an option. XPath does provide a convenient feature, namely "local name tests" (or "namespace wildcards"), which let you avoid having to type your content's namespace declaration in your query. In fact, you might be tempted to use them all the time, to save the typing. But I'm here to tell you: don’t. That would be a bad idea. Keep reading if you want to know when it might be safe to use them and when it's not.

What exactly am I talking about? See item #4 in the following table. There are four kinds of name tests in XPath, and three of them are wildcards:

   What it matches                                       Example(s)
1. A specific QName                                      foo, xyz:bar, etc.
2. Any name                                              *
3. Any name in a specific namespace                      xyz:*
4. A specific local name, regardless of namespace        *:foo
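
To make the four kinds concrete, here is a minimal sketch that runs each name test against a small inline document. The abc prefix, its namespace URI, and the sample elements are made up for illustration. The query uses only standard XQuery, so it should behave the same in any XQuery processor.

```xquery
(: Hypothetical namespaces, for illustration only :)
declare namespace xyz = "http://example.com";
declare namespace abc = "http://example.com/abc";

(: Four children: foo in two namespaces, foo in no namespace, and xyz:bar :)
let $doc :=
  <root>
    <xyz:foo>1</xyz:foo>
    <abc:foo>2</abc:foo>
    <foo>3</foo>
    <xyz:bar>4</xyz:bar>
  </root>
return (
  count($doc/xyz:foo),  (: 1. specific QName: 1 :)
  count($doc/*),        (: 2. any name: 4 :)
  count($doc/xyz:*),    (: 3. any name in the xyz namespace: 2 :)
  count($doc/*:foo)     (: 4. local name "foo" in any namespace, or none: 3 :)
)
```

Note that the fourth test also matches the <foo> element that is in no namespace at all, which an unprefixed foo test would match on its own.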

In XPath 1.0 (pre-XQuery), only the first three kinds were supported. If you wanted to select a <foo> element regardless of its namespace, you'd have to write something like this:

*[local-name(.) = 'foo']

One rationale behind this perhaps obvious omission was that such a language feature might encourage some bad practices. The idea of a namespace is that it identifies a distinct set of names. Local names in different namespaces shouldn't necessarily be related to each other (<head> means one thing in HTML and quite another in, say, AnatomyML). Of course, that still didn't prevent people from using namespaces for things like versioning, where each new version of a vocabulary gets a new namespace URI.

In any case, local name tests (or "namespace wildcards") were added to XPath 2.0 (and thus XQuery):

collection()//*:foo

The above query selects all elements with local name "foo", regardless of namespace. Even if you know these elements are in just one namespace, it can be a convenient shortcut. It saves you from having to write out the namespace declaration:

declare namespace xyz="http://example.com";
collection()//xyz:foo

But there are two problems with using namespace wildcards like *:foo. One is that the intentions are unclear. Did you really mean that? Are there really elements named <foo> in more than one namespace? Or were you just being lazy? The other problem is a performance one. MarkLogic indexes elements by QName, not by local name. That means namespace wildcards won't utilize the index and will require a lot of filtering. We can prove this by using our friend xdmp:plan() (or its cousin xdmp:query-trace()):

xdmp:plan(
  collection()//*:foo
)

The output shows how many "fragments" (equivalent to documents, unless you've enabled fragmenting) will have to be read in order to resolve this query. Normally, MarkLogic uses its Universal Index to minimize the number of document reads it has to make. In this case, we can see from the output that the "*:foo" step ("Step 2" below) is problematic:

<qry:info-trace>Analyzing path: fn:collection()/descendant::*:foo</qry:info-trace>
<qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
<qry:info-trace>Step 2 does not use indexes: descendant::*:foo</qry:info-trace>

Looking further down the output, we see the number of fragments that would have to be opened for the filtering stage:

<qry:info-trace>Selected 14944 fragments</qry:info-trace>
<qry:result estimate="14944"/>

This is not the number of documents that have a <foo> element. This is the total number of documents in my database. So obviously, this query is going to run very slowly, because it's forcing all of those fragments to be read from the disk.

In contrast, let's look at the plan for the case where we specify the exact QName:

declare namespace xyz="http://example.com";
xdmp:plan(
  collection()//xyz:foo
)

In this case, we see that the path is "fully searchable." In other words, all the steps contribute index constraints that can be used to narrow down the possible number of matching documents:

<qry:info-trace>Analyzing path: fn:collection()/descendant::xyz:foo</qry:info-trace>
<qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
<qry:info-trace>Step 2 is searchable: descendant::xyz:foo</qry:info-trace>
<qry:info-trace>Path is fully searchable.</qry:info-trace>

And we see further down that MarkLogic knows a priori, from the Universal Index, that no <xyz:foo> elements exist in the database:

<qry:info-trace>Selected 0 fragments</qry:info-trace>
<qry:result estimate="0"/>

So simply by specifying the namespace part of the QName, we've gone from having to read all the documents in the database to none of them.
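
If you only want the number and not the full plan, xdmp:estimate() is the built-in that returns an unfiltered fragment count for an expression. The following is a sketch, assuming the same database and the same example namespace URI as above, and assuming both paths are accepted as (at least partially) searchable. No output is shown because the counts depend entirely on your data.

```xquery
xquery version "1.0-ml";
declare namespace xyz = "http://example.com";

(: Each estimate is resolved from the indexes alone, with no filtering.
   The results count candidate fragments, not matching elements. :)
(
  xdmp:estimate(collection()//*:foo),
  xdmp:estimate(collection()//xyz:foo)
)
```

The two numbers should correspond to the estimate attributes reported in the two plans above.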

To summarize, you should generally avoid namespace wildcards like *:foo for two reasons:

  • performance, and
  • clarity.
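
On the clarity point, one way to find out whether a local name really lives in more than one namespace is to list the namespace URIs it appears under. This is a minimal sketch on a made-up inline sample (the abc prefix and both URIs are assumptions). Since it uses *:foo itself, run this kind of check against a document you already have open, not across the whole database.

```xquery
(: Hypothetical namespaces, for illustration only :)
declare namespace xyz = "http://example.com";
declare namespace abc = "http://example.com/abc";

let $sample :=
  <root>
    <abc:foo/>
    <xyz:foo/>
    <xyz:foo/>
    <xyz:bar/>
  </root>
(: One entry per distinct namespace URI that contains a "foo" element :)
return distinct-values($sample//*:foo/namespace-uri(.))
```

For this sample the result is the two URIs declared at the top. Once you know which URIs are involved, you can declare them and write explicit QNames.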

Are there ever any cases where it's okay to use *:foo? Performance is much less of an issue when you're processing documents that you're already committed to opening. For example, if you're processing a single zip file manifest (the result of xdmp:zip-manifest()), then using *:part because you're too lazy to declare the zip namespace isn't a performance problem: you're not searching among thousands or millions of documents, so the index doesn't even come into play. Still, in production code, it's a good idea to declare the namespace and use zip:part so your intentions are clearly documented. Of course, when you really do intend to select elements with a specific local name in any namespace, you can use *:foo, but, again, not when you're searching across the database. In that case, if possible, you should enumerate all the QNames, so MarkLogic can most effectively narrow down the result set based on what it knows from its indexes:

//(abc:foo|def:foo|xyz:foo)
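
Spelled out in full, with the declarations the path needs, that might look like the following. The abc and def namespace URIs here are placeholders, not real vocabularies. Wrapping the path in xdmp:plan() lets you check how each step is reported before you rely on it.

```xquery
xquery version "1.0-ml";

(: Placeholder URIs: substitute the namespaces your content actually uses :)
declare namespace abc = "http://example.com/abc";
declare namespace def = "http://example.com/def";
declare namespace xyz = "http://example.com";

(: Explicit QNames in place of the *:foo wildcard :)
xdmp:plan(
  collection()//(abc:foo | def:foo | xyz:foo)
)
```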

If you didn't even know namespace wildcards existed in XPath, then you might find it odd that I'm both introducing them to you and recommending against using them in the same article. Consider this just another chance to become familiar with xdmp:plan(), which is much more generally useful. It will help you write fast queries and understand what makes them fast. Do use it.

Comments

  • I'm not lazy, but I do deal with lazy or incompetent people who are sending me data. They're not guaranteed to always use the same namespace abbreviation. They don't really appreciate when I suggest they not be lazy or incompetent.
  • Yes I do have multiple foo elements in different namespaces. I have 126 files including 144 namespace declarations, with gross conflicts of multiple xmlns abbrev * multiple xmlns URI. I'm not lazy. Rather, I'm trying to tidy-up 10 years of other people's lazy.
  • Lazy programmers, or people working around bad language design? XQuery clearly makes the specification of a namespace within xpath expressions more difficult than necessary, leading to complexity and bloat, and this bad design decision within the language creates the issue. It would be trivial to permit a local specification of the fallback namespace (e.g. if no namespace is specified, then assume http://www.w3.org/1999/xhtml ). Equally, declarations could be supported globally for input and output namespaces. If programmers could specify a fallback namespace (and the default namespace was the automatic fallback) then none of these issues would cause people pain, and there would be no 'laziness' since the pointless task of declaring a prefix association and typing h: in front of every single element selector would be unnecessary. These and other changes to XML languages are clearly necessary for them to be adopted in practical use, excepting those specialists who get sufficient benefit to be able to put up with them. Such changes are also trivial to incorporate, requiring no extra fundamental complexity in a planner or processor, but the politics of certain design decisions within the W3C seem to force these kludges to appear and be maintained in a language regardless of common sense.
    • in xquery, it is possible to declare a default function namespace ... declare default function namespace "http://www.w3.org/2005/xpath-functions"; as well as with xml declare default element namespace "http://www.w3.org/1999/xhtml"; (apologies if Disqus munges these) like any language, xquery has its problems ... but some of them are inherited from XML itself ... namespaces are a compromise solution (though MarkLogic works with more than just xml eg json, text, triples, etc). You always have the ability to not use namespaces, which the microxml standard addresses (which is completely valid XML).
  • "Have you ever wished you could just skip having to deal with namespaces in your content?" No. "Have you ever been presented with two bits of content, neither in a namespace, and then struggled with the awkward gymnastics necessary to combine them together in some useful way?" More times than I can count.
    • Namespaces are definitely useful, especially in contexts (like MarkLogic) where names (QNames) tend to be interpreted globally (e.g., when defining range indexes).