Using XML Schema with MarkLogic Server

In this tutorial, we will learn how to load an XML Schema into MarkLogic Server, and verify that it is working as expected. This can be somewhat tricky for first-time users, but is easy once you’ve seen how it’s done.

Why use XML Schema?

XML Schema provides a standard way to associate XML nodes with particular data types. XML Schema defines its own datatypes for XML, and also provides for user-defined datatypes. For example, we might use a Schema to tell MarkLogic Server that elements named Year always contain unsigned integers. Thus, element Year is of type unsignedInt.

This association of an element with a type can be very useful. Suppose we have a simple FLWOR expression, with an order-by clause.

(: simple sort :)
for $i in doc()/Record
order by $i/Year descending, $i/Month descending
return $i

In the absence of an XML Schema, these records will be sorted using the Year and Month elements as xs:string. Strings are sorted by UTF-8 codepoint, so the results might not be what we’d like.

(: sorting Month as string :)
<Record>
  <Year>2006</Year>
  <Month>12</Month>
</Record>
<Record>
  <Year>2006</Year>
  <Month>5</Month>
</Record>
<Record>
  <Year>2006</Year>
  <Month>6</Month>
</Record>

The records aren’t in the order we wanted: apparently, December comes before May. Our XQuery evaluator doesn’t know what the datatype of Month is, so it assumes a string. The UTF-8 character “1” comes before “5” and “6”.
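As a minimal sketch of the difference, the following expression compares the same two lexical values first as strings and then as xs:unsignedInt values. It uses only standard XQuery, so it does not depend on any schema or document being loaded. It should return true followed by false: as strings, "12" sorts before "5", while as numbers it does not.

```xquery
(: compare the same lexical values as strings, then as numbers :)
(
  (: string comparison: "1" sorts before "5", so this is true :)
  "12" lt "5",
  (: numeric comparison: 12 is not less than 5, so this is false :)
  xs:unsignedInt("12") lt xs:unsignedInt("5")
)
```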

This is exactly how XQuery is supposed to behave, in the absence of datatype information. We could fix this by constructing a numeric type explicitly.

let $n := <Month>12</Month>
return xdmp:describe((data($n), xs:unsignedInt($n)))

(: results :)
(xs:untypedAtomic("12"), xs:unsignedInt("12"))

We can use xdmp:describe() and data() to determine the typed value of a node. This is a valuable technique: later on, we will use it to test our XML Schema.

(: sort with explicit datatypes :)
for $i in doc()/Record
order by xs:unsignedInt($i/Year) descending,
  xs:unsignedInt($i/Month) descending
return $i

The results are now sorted correctly:

(: sorting Month as xs:unsignedInt :)
<Record>
  <Year>2006</Year>
  <Month>5</Month>
</Record>
<Record>
  <Year>2006</Year>
  <Month>6</Month>
</Record>
<Record>
  <Year>2006</Year>
  <Month>12</Month>
</Record>

It would be nicer if MarkLogic Server automatically knew the correct datatype for the Month element – and the correct datatype for all of our elements. To give MarkLogic Server that information, we can use an XML Schema.

XML Schema definitions are, themselves, XML documents. We can tell MarkLogic Server about our schema by loading it into a special schema database. Every MarkLogic Server database (or contentbase) is associated with its own schema database. By default, this is a predefined database, coincidentally named Schemas. There is rarely any reason to change that setting.

Before we get started, there’s one more thing we should know about XML Schema and MarkLogic Server. They work best when our XML documents use a namespace. Schemas can have a target namespace, so setting a document namespace allows us to have multiple schemas, and automatically apply the correct schema to the correct documents. For example, Month in one document might always be an unsignedInt, while Month in another document might be a string. Element namespaces help us organize these different datatypes.

XML Content and Schema

With that in mind, let’s introduce our XML content, and a Schema for that content. We will start with the Schema. This is important: MarkLogic Server stores any Schema information when it indexes your documents. Thus, the Schema should already be in your Schemas database before you insert the first document. If you change the Schema later, you may need to re-insert your documents.

<!-- tutorial.xsd -->
<xs:schema targetNamespace="https://marklogic.com/tutorial"
 attributeFormDefault="unqualified"
 elementFormDefault="unqualified"
 xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="name" type="xs:token"
   abstract="false" nillable="false"/>
  <xs:element name="qualification" type="xs:token"
   abstract="false" nillable="false"/>
  <xs:element name="born" type="xs:date"
   abstract="false" nillable="false"/>
  <xs:element name="dead" type="xs:date"
   abstract="false" nillable="false"/>
  <xs:element name="isbn" type="xs:unsignedLong"
   abstract="false" nillable="false"/>
  <xs:attribute name="id" type="xs:ID"/>
  <xs:attribute name="available" type="xs:boolean"/>
  <xs:attribute name="lang" type="xs:language"/>
  <xs:element name="title" abstract="false" nillable="false">
    <xs:complexType mixed="false">
      <xs:simpleContent>
        <xs:extension base="xs:token">
          <xs:attribute ref="lang" use="optional"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
  <xs:element name="library" abstract="false" nillable="false">
    <xs:complexType mixed="false">
      <xs:sequence minOccurs="1" maxOccurs="1">
        <xs:element ref="book" maxOccurs="unbounded" minOccurs="1"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="author" abstract="false" nillable="false">
    <xs:complexType mixed="false">
      <xs:sequence minOccurs="1" maxOccurs="1">
        <xs:element ref="name" minOccurs="1" maxOccurs="1"/>
        <xs:element ref="born" minOccurs="1" maxOccurs="1"/>
        <xs:element ref="dead" minOccurs="0" maxOccurs="1"/>
      </xs:sequence>
      <xs:attribute ref="id" use="optional"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="book" abstract="false" nillable="false">
    <xs:complexType mixed="false">
      <xs:sequence minOccurs="1" maxOccurs="1">
        <xs:element ref="isbn" minOccurs="1" maxOccurs="1"/>
        <xs:element ref="title" minOccurs="1" maxOccurs="1"/>
        <xs:element ref="author" minOccurs="0"
         maxOccurs="unbounded"/>
        <xs:element ref="character" minOccurs="0"
         maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute ref="id" use="optional"/>
      <xs:attribute ref="available" use="optional"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="character" abstract="false" nillable="false">
    <xs:complexType mixed="false">
      <xs:sequence minOccurs="1" maxOccurs="1">
        <xs:element ref="name" minOccurs="1" maxOccurs="1"/>
        <xs:element ref="born" minOccurs="1" maxOccurs="1"/>
        <xs:element ref="qualification" minOccurs="1"
         maxOccurs="1"/>
      </xs:sequence>
      <xs:attribute ref="id" use="optional"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

XML Schema is fairly easy to read. In this schema, we can see that each element or attribute is represented, along with its name, type, and the number of times it must or can occur. For example, the available attribute is boolean, and occurs optionally on the book element.

It’s important to note one thing about this schema, though. It has a targetNamespace attribute. This attribute means that this schema applies to elements in the target namespace. This schema will not apply to elements in any other namespace (including the empty namespace).

Copy the Schema XML, and save it to a file named tutorial.xsd. NOTE: If you are using Notepad or another Windows program, make sure you don’t end up with something like tutorial.xsd.txt.

Loading our tutorial.xsd is simple. We need to insert it into our Schemas database, and we must make sure that its document URI will be tutorial.xsd. If we put it into the wrong database, or use the wrong URI, then the schema won’t be associated with our documents.

We can load tutorial.xsd into the Schemas database in several ways. The simplest technique is to open a query buffer in Query Console at http://localhost:8000/qconsole/ (or CQ in MarkLogic Server 4.2 and earlier), set the content-source to the Schemas database, and call xdmp:document-load(). In the following sample, adjust the filesystem path to tutorial.xsd to match wherever you saved your copy.

(: cq technique :)
(: NOTE - make sure the content-source is Schemas! :)
if (xdmp:database-name(xdmp:database()) ne 'Schemas')
then error(
  QName('', 'NOT-SCHEMAS'), 'make sure the content-source is Schemas')
else xdmp:document-load(
  '/your/filesystem/path/to/tutorial.xsd',
  <options xmlns="xdmp:document-load">
    <uri>tutorial.xsd</uri>
  </options>
)

When you evaluate this query, the results should be empty. To see if the document is in the Schemas database, you could query doc('tutorial.xsd') (with the content-source still set to Schemas). You should see the contents of the XML Schema.
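One way to run that check as a single query is sketched below. It assumes the content-source is still Schemas and that the schema was stored under the URI tutorial.xsd. doc-available() is a standard XQuery function, and the wildcard step avoids depending on any namespace prefix.

```xquery
(: run with the content-source set to Schemas :)
(
  (: true if a document exists at this URI in the current database :)
  doc-available('tutorial.xsd'),
  (: the namespace the schema applies to; an empty string if nothing was found :)
  string(doc('tutorial.xsd')/*/@targetNamespace)
)
```

If the first item is false, the schema is probably in a different database or stored under a different URI.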

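With the Schema loaded, we can now load some content.

<!-- tutorial.xml -->
<library xmlns="https://marklogic.com/tutorial"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="https://marklogic.com/tutorial tutorial.xsd">
  <book available="true">
    <isbn>0836217462</isbn>
    <title>Being a Dog Is a Full-Time Job</title>
    <author>
      <name>Charles M Schulz</name>
      <born>1922-11-26</born>
      <dead>2000-02-12</dead>
    </author>
    <character>
      <name>Peppermint Patty</name>
      <born>1966-08-22</born>
      <qualification>bold, brash and tomboyish</qualification>
    </character>
    <character>
      <name>Snoopy</name>
      <born>1950-10-04</born>
      <qualification>extroverted beagle</qualification>
    </character>
    <character>
      <name>Schroeder</name>
      <born>1951-05-30</born>
      <qualification>brought classical music to the Peanuts strip</qualification>
    </character>
    <character>
      <name>Lucy</name>
      <born>1952-03-03</born>
      <qualification>bossy, crabby and selfish</qualification>
    </character>
  </book>
</library>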

Notice two things about this content. First, the xmlns attribute matches the targetNamespace of our schema. Second, the xsi:schemaLocation is a whitespace-delimited attribute. The first part repeats the target namespace of our schema. The second part is the document URI of the schema. Both parts must match the tutorial.xsd, which we loaded into the Schemas database.

Copy this XML into a new file, called tutorial.xml. Load it into your content database using xdmp:document-load(). This time, your content-source should be the built-in Documents database (if you have already created a database of your own, you can use it too).

(: load tutorial.xml :)
xdmp:document-load(
  '/your/filesystem/path/to/tutorial.xml',
  <options xmlns="xdmp:document-load">
    <uri>tutorial.xml</uri>
  </options>
)

Again, you can test to see if the document loaded correctly by querying doc('tutorial.xml'). But how can we tell if the XML Schema is working?

Testing the Schema

The easiest way to see if our Schema is working or not is to query some element or attribute from tutorial.xml, and test its type using xdmp:describe() and data(). Let’s try to use the born element, which should be xs:date. When we do this, we have to remember that every element in tutorial.xml is in the https://marklogic.com/tutorial namespace.

(: is it working? :)
declare namespace mlt = "https://marklogic.com/tutorial";

xdmp:describe(data(
  doc('tutorial.xml')/mlt:library/mlt:book[1]/mlt:author[1]/mlt:born
))

(: results :)
xs:date("1922-11-26")

This query should always return an atomic value of type xs:date. If your query returns the empty sequence, then you probably have a namespace problem. Maybe you mistyped the namespace URI, or maybe you forgot the prefix on one or more of the XPath steps. If your query returns an xs:untypedAtomic, then MarkLogic Server isn’t finding tutorial.xsd. Go back and make sure that you loaded it into the Schemas database, and that the document URI is tutorial.xsd.
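To tell these two failure modes apart, a small sketch like the following may help. It assumes tutorial.xsd is already loaded and that the query runs against the content database. It atomizes two constructed elements, one in the tutorial namespace and one in no namespace, so no XPath over tutorial.xml is involved.

```xquery
declare namespace mlt = "https://marklogic.com/tutorial";

xdmp:describe((
  (: in the schema's target namespace, so the declared type should apply :)
  data(<mlt:born>1922-11-26</mlt:born>),
  (: in no namespace, so the schema should not apply :)
  data(<born>1922-11-26</born>)
))
```

If the schema is being found, we would expect the first item to be described as an xs:date and the second as an xs:untypedAtomic. If both come back untyped, the schema lookup is the likely problem rather than the XPath.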

Before we go, let’s look at one more trick. Now that we have a Schema in place, we can use it to validate the data within an element. For example, look at what happens if we try to put an invalid xs:date value into a born element.

(: invalid lexical value :)
declare namespace mlt = "https://marklogic.com/tutorial";

data(<mlt:born>garbage</mlt:born>)

(: error :)
XDMP-LEXVAL: xs:date("garbage") -- Invalid lexical value "garbage"

If your application involves content updates, you might take advantage of this technique to validate updates.
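A minimal sketch of such a check follows, assuming MarkLogic's try/catch syntax and its error element in the http://marklogic.com/xdmp/error namespace. The $candidate variable is hypothetical and stands in for whatever value your application receives.

```xquery
declare namespace mlt = "https://marklogic.com/tutorial";
declare namespace error = "http://marklogic.com/xdmp/error";

(: $candidate is a hypothetical stand-in for a value from an update request :)
let $candidate := <mlt:born>garbage</mlt:born>
return
  try {
    (: atomizing applies the schema type and raises an error on a bad value :)
    concat('accepted: ', xdmp:describe(data($candidate)))
  } catch ($e) {
    (: $e is the error element; report its code instead of failing the query :)
    concat('rejected: ', string($e/error:code))
  }
```

With the schema loaded, this should report a rejection for the value above and an acceptance for a well-formed date such as 1922-11-26. An application could make the same check before calling an update function.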

Another useful technique with XML Schema is to query it, just as we might query any other document. For example, we can list all the elements of type xs:date. This might be useful for planning element range indexes.

(: NOTE - make sure the content-source is Schemas! :)
if (xdmp:database-name(xdmp:database()) ne 'Schemas')
then error(
  QName('', 'NOT-SCHEMAS'), 'make sure the content-source is Schemas')
else doc('tutorial.xsd')
  /descendant::xs:element[ @type eq xs:QName('xs:date') ]
  /@name

(: results :)
born
dead
