Good XML design and performance

by Evan Lenz

MarkLogic has always tried to ensure that well-designed XML performs well "as is" in MarkLogic Server. For example, if your schema uses descriptive, unique element names, your application code will not only be clean and readable; it will be fast too. On the other hand, if your schema contains a lot of generic element names (such as "item") used in multiple ways, your code (in XQuery or XSLT) will be harder to read, and you might also have to do some extra legwork to get the best performance.

For example, consider a schema that has a lot of elements named <group> (or <section> or <item> or some other generic name) but which play very different roles—in this case indicated by the value of an attribute:

<doc>
  <group type="widget">
    <item type="sprocket">...</item>
    ...
  </group>
  <group type="employee">
    <item type="executive">...</item>
    ...
  </group>
  <group type="place">
    <item type="city">...</item>
    ...
  </group>
</doc>

Since MarkLogic indexes elements by their name, it is not automatically going to make a distinction between the various <group> elements you have, because they have the same name. That said, certain queries will still run maximally fast, such as when you want to restrict your results to a particular attribute value, using a simple XPath expression like this: //group[@type eq 'widget']. MarkLogic Server will use its Universal Index to avoid reading any documents that don't have a <group> element whose "type" attribute is equal to "widget". So we're okay so far.
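
To make the indexing point concrete, here is a toy inverted index in Python. It is only a sketch of the general idea and not MarkLogic's actual implementation: the document URIs, the sample content, and the term names ("element", "element-attribute-value") are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical mini-corpus: document URI -> XML text.
DOCS = {
    "a.xml": '<doc><group type="widget"><item type="sprocket">s1</item></group></doc>',
    "b.xml": '<doc><group type="employee"><item type="executive">e1</item></group></doc>',
    "c.xml": '<doc><group type="place"><item type="city">c1</item></group></doc>',
}


def build_index(docs):
    """Map index terms to the set of document URIs that contain them."""
    index = defaultdict(set)
    for uri, text in docs.items():
        for elem in ET.fromstring(text).iter():
            # Term kind 1 (assumed): the element name alone.
            index[("element", elem.tag)].add(uri)
            # Term kind 2 (assumed): element name + attribute name + value.
            for name, value in elem.attrib.items():
                index[("element-attribute-value", elem.tag, name, value)].add(uri)
    return index


index = build_index(DOCS)

# The element name alone cannot tell the groups apart: every document matches.
by_name = index[("element", "group")]

# Adding the attribute value narrows the candidates to a single document,
# so the other two never have to be read.
widget_docs = index[("element-attribute-value", "group", "type", "widget")]

print(sorted(by_name))      # ['a.xml', 'b.xml', 'c.xml']
print(sorted(widget_docs))  # ['a.xml']
```

The lookup by element name returns all three documents, while the lookup that includes the attribute value returns only the one that can match //group[@type eq 'widget'].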

But there are still a few issues here. For one thing, your code will not be very readable. This expression:

//group[@type eq 'widget']/item[@type eq 'sprocket']

is pretty noisy compared to, for example:

//widgets/sprocket

which is what your code would look like if you used more descriptive element names.
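
The two expressions can be tried side by side outside MarkLogic. The sketch below uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath (it writes the comparison as = rather than eq, and paths start with a dot); the sample content (s1, g1, e1) is invented for illustration.

```python
import xml.etree.ElementTree as ET

# The generic design: roles are carried by "type" attributes.
GENERIC = """
<doc>
  <group type="widget">
    <item type="sprocket">s1</item>
    <item type="gear">g1</item>
  </group>
  <group type="employee">
    <item type="executive">e1</item>
  </group>
</doc>
"""

# The descriptive design: roles are carried by element names.
DESCRIPTIVE = """
<doc>
  <widgets>
    <sprocket>s1</sprocket>
    <gear>g1</gear>
  </widgets>
  <employees>
    <executive>e1</executive>
  </employees>
</doc>
"""

# Equivalent of //group[@type eq 'widget']/item[@type eq 'sprocket'].
generic_hits = ET.fromstring(GENERIC).findall(
    ".//group[@type='widget']/item[@type='sprocket']"
)

# Equivalent of //widgets/sprocket.
descriptive_hits = ET.fromstring(DESCRIPTIVE).findall(".//widgets/sprocket")

print([e.text for e in generic_hits])      # ['s1']
print([e.text for e in descriptive_hits])  # ['s1']
```

Both queries select the same sprocket; only the second one reads like the data it describes.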

The other issue is that you may run into some problems when you want to start doing more advanced things, like word search in subsets of your documents. Specifically, if you want to restrict your search results to all group elements except widget groups, that will be challenging. (Fields can help you do the converse, but in that case you may have to enumerate all the ones you are interested in getting results for.)

Another issue with the above design is that, despite the potential benefit of being data-driven and extensible, it is not possible to apply schema constraints that are unique to specific classes of <group> elements (at least in W3C XML Schemas). You can't, for example, restrict the content of a <group> element to <sprocket> and <gear> elements only when its type attribute is "widget". If you want different content models, then you need to use different element names. Starting off with generic <group> elements may lead you down a slippery slope. You'll find yourself using other generic names like "item", and even then you won't be able to effectively restrict the "type" values to only the applicable ones.

Here's what an arguably better (and more readable) schema design would look like:

<doc>
  <widgets>
    <sprocket>...</sprocket>
    ...
  </widgets>
  <employees>
    <executive>...</executive>
    ...
  </employees>
  <places>
    <city>...</city>
    ...
  </places>
</doc>
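
If you already have content in the generic form, a pre-processing step can rename the elements before loading. The following is a minimal sketch in Python, assuming every <group> and <item> carries a "type" attribute; the GROUP_NAMES mapping and the make_descriptive function are hypothetical names chosen for this example, and the sample content is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a group's type value to a container element name.
GROUP_NAMES = {"widget": "widgets", "employee": "employees", "place": "places"}

GENERIC = """
<doc>
  <group type="widget"><item type="sprocket">s1</item></group>
  <group type="employee"><item type="executive">e1</item></group>
  <group type="place"><item type="city">c1</item></group>
</doc>
"""


def make_descriptive(root):
    """Rename generic <group>/<item> elements in place, using their type values.

    Raises KeyError if a type attribute is missing or a group type is unmapped.
    """
    # Take a snapshot of the groups first, since their tags change below.
    for group in list(root.iter("group")):
        for item in group.findall("item"):
            # An item's type value becomes its element name, e.g. <sprocket>.
            item.tag = item.attrib.pop("type")
        group.tag = GROUP_NAMES[group.attrib.pop("type")]
    return root


root = make_descriptive(ET.fromstring(GENERIC))

print([child.tag for child in root])       # ['widgets', 'employees', 'places']
print(root.find("widgets/sprocket").text)  # s1
```

After the rename, the short path widgets/sprocket finds the element that previously needed two attribute predicates.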

To conclude, there are lots of good reasons to use descriptive, unique element names whenever possible. Doing so plays nicely with human readers, XQuery, XSLT, XML Schemas, and MarkLogic Server.

Comments

  • There's quite a strong case for generic naming too.
      - Descriptive, unique element names mean frequent changes to a schema, so they will only be practical in environments where that is realistic.
      - Having descriptive, unique elements means having a very verbose schema. This has disadvantages compared to a compact, intuitive schema that does not require a lot of effort to understand.
      - In application development, creating new elements every time you add a new piece of functionality means you will probably have to write some XSLT / XQuery before the functionality works.
      - I do not agree that descriptive elements make for easier to read or maintain XSLT / XQuery. Using many different XML elements complicates code sharing between nodes. It also complicates some of the inheritance possibilities available in XSLT, for example <xsl:next-match/>.
      - With the introduction of conditional typing in the 1.1 Schema spec, perhaps the Schema constraints argument you gave reveals a shortcoming of the Schema 1.0 specification rather than of XML design in general.
    • The ultimate design question depends a lot on the application. For example, many applications don't use schemas at all. In that case, unique element names will perform optimally in MarkLogic out of the box without any enumeration required (in a schema or elsewhere). On the other hand, using generic names with @class or @type attributes *will* require enumeration (as attribute-value-constrained fields in MarkLogic). I agree that re-use of local, generic element names (such as <title> or <name>) is generally a good thing (for simplifying both schema and application code). In MarkLogic, you need to jump through some hoops to get optimal performance in that case (such as pre-processing of content to rename or add an attribute to ensure uniqueness), but this is something that will become easier in the future.
  • Would someone be able to tell me which programming language works better with MarkLogic, .NET or Java? Is there any link or article available online I can check?
  • On the other hand consider div[contains(concat(' ', @class, ' '), ' widget ')]/..., the idiom common for processing XHTML, a moderately popular XML-based dialect...
    • XHTML? Never heard of it... ;-) Seriously though, you make a good point. You won't always have the option of using this design pattern (hence "whenever possible"). The predicate in the expression you wrote won't be fully resolved from MarkLogic's indexes. All that means is that some "filtering" will be required (checking inside the documents to ensure the constraint is met rather than knowing from the indexes alone). So it will work; it just won't be as automatically fast. What you do from there depends on various factors, including how many documents need to be searched. If it's a relatively small number, then it may not be an issue at all; MarkLogic's caching will make this query much faster than if, say, you were reading and parsing XML documents off the file system. But if you're dealing with millions of documents, you'll probably want to do some content processing to ensure the relevant data is indexed.
      • Probably we should have added an htmlclass() function to avoid the need for the spaces and to make this probably very common case easier both for people to write and for optimisers. The pattern is fine otherwise for other reasons, of course.
        • I think this also works (slightly longer but a bit simpler): div[tokenize(normalize-space(@class),' ') = 'widget']