Storing document change as metadata

by Paul Hoehne

Motivation

Imagine you have two documents in MarkLogic and you need to know whether, and when, the two documents differ. In some cases a change in a document may be very important, such as adding a new item to a purchase order. In other cases it may be insignificant, such as a change to text formatting. What's more interesting, however, is how we can store information about the difference between the documents using triples. We can also capture some interesting information about the differences.

Let's say we have an application that allows users to update their purchase orders and saves each purchase order document. It allows us to take any two versions of a purchase order and see what changes were made. Jane Doe starts a purchase order on May 1, then updates it on May 5, and then May 7, and submits a final on May 10. I might want to see the document as it was on May 1, 5, 7, or 10, or see what changes were made between the May 7 and the May 1 versions.

Approach

The basic approach involves taking the two versions of a document and performing a recursive descent based on text comparison. The exact algorithm isn't as important as how we store the differences. In some cases the structure of the document is well known and we can leverage that structure to create a more 'intelligent' algorithm than just finding every conceivable change. In the original application from which this code was taken, the document structure was known and the algorithm descended through both documents in parallel, looking for changes.
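The parallel descent described above can be sketched in a few lines. This is not the original XQuery; it is a minimal Python illustration that assumes both versions share the same layout, pairs children by position, and compares only element text. The function name diff_elements and the sample documents are made up for the example.

```python
import xml.etree.ElementTree as ET


def diff_elements(old, new, path):
    """Walk two elements in parallel and return a list of (kind, path) records."""
    changes = []
    # Compare the text held directly by the two elements.
    if (old.text or "").strip() != (new.text or "").strip():
        changes.append(("changed", path))
    old_children, new_children = list(old), list(new)
    shared = min(len(old_children), len(new_children))
    seen = {}  # per-tag counters, so paths use XPath-style positions
    for position in range(max(len(old_children), len(new_children))):
        in_new = position < len(new_children)
        child = new_children[position] if in_new else old_children[position]
        seen[child.tag] = seen.get(child.tag, 0) + 1
        child_path = f"{path}/{child.tag}[{seen[child.tag]}]"
        if position < shared:
            # Present in both versions: keep descending.
            changes.extend(
                diff_elements(old_children[position], new_children[position], child_path)
            )
        elif in_new:
            changes.append(("add", child_path))
        else:
            changes.append(("delete", child_path))
    return changes


OLD = ET.fromstring("<po><item>pen</item><item>ink</item></po>")
NEW = ET.fromstring("<po><item>pen</item><item>paper</item><item>clips</item></po>")
CHANGES = diff_elements(OLD, NEW, "/po")
```

Here CHANGES holds one "changed" record for the second item and one "add" record for the third. The sketch still returns a list for clarity; the point of the rest of the article is that each record could instead be saved as it is found.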

document1.xml
document2.xml

The problem occurs when we try to save off the fact we've found a change. It becomes somewhat cumbersome to try to build up a map of maps, passing it from function to function. This is especially true where the differences could be found in a deeply nested part of the recursive descent. In a map-of-maps strategy, each map contains the information about the differences found so far. When the differencing is done, the map is saved as a document. It would be better if we could save the data along the way instead of passing a lot of data from call to call. We will have to pass something, but maybe just a document id.

One way to save the data is to add it to a "meta-data" section of the document. We often use the envelope pattern, which usually includes a meta-data section that contains data about the document contained in the envelope. However, imagine if we had several versions of a document and wanted to see differences between arbitrary versions of a document. That might lead to some very large documents. In addition, what document should contain the data? Adding this information directly back to the document doesn't seem like a 'clean' way to approach the problem.

Saving Differences

Utility.xqy

The first thing we're going to do is start a set of triples to describe a difference using the start-diff function. We use a B-Node to anchor the triples because there might be no "natural" name for this document in terms of an IRI. We could, if we work hard enough, come up with something, but a B-Node is fine for now. We have to keep track of the B-Node (which is essentially a GUID), but that's a fairly limited piece of information. If you are trying to come up with an IRI that represents the subject for a set of triples, and the only reasonable IRI is essentially a GUID, you might want to consider using a B-Node.

We use a graph to keep track of our triples. A graph allows us to better handle the collection of triples. For example, let's say the related purchase order (12345) is deleted. At that point we can delete the graph with the URI http://mycompany.com/po-changes/graph/12345 with the function sem:graph-delete. At query time a graph URI can be specified to restrict a query to a particular graph, but otherwise queries look at triples in all graphs.
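To make the idea concrete, here is a small Python sketch of what starting a difference and dropping a graph amount to. It is only a model: a dictionary stands in for MarkLogic's managed triples, start_diff and graph_delete are stand-ins for the article's start-diff function and sem:graph-delete, and the predicate local names (po-id, old-document, new-document, submission-diff) are assumptions. Only the graph URI pattern and the two IRI prefixes come from the article.

```python
import uuid

GRAPH_BASE = "http://mycompany.com/po-changes/graph/"  # graph URI pattern from the article
DIFF = "http://mycompany.com/po-changes/diff/"         # prefix from the article
TYPES = "http://mycompany.com/po-changes/types/"       # prefix from the article

graphs = {}  # graph URI -> list of (subject, predicate, object) tuples


def start_diff(po_id, old_uri, new_uri):
    """Create the anchoring blank node and its descriptive triples."""
    root = "_:" + uuid.uuid4().hex  # blank node label, essentially a GUID
    triples = graphs.setdefault(GRAPH_BASE + po_id, [])
    triples.extend([
        (root, DIFF + "po-id", po_id),                   # hypothetical local name
        (root, DIFF + "old-document", old_uri),          # hypothetical local name
        (root, DIFF + "new-document", new_uri),          # hypothetical local name
        (root, TYPES + "type", "submission-diff"),       # hypothetical type value
    ])
    return root


def graph_delete(po_id):
    """Drop every triple recorded for one purchase order in a single step."""
    graphs.pop(GRAPH_BASE + po_id, None)


diff_root = start_diff("12345", "/po/12345/v1.xml", "/po/12345/v2.xml")
```

The only value the differencing code has to carry around afterwards is diff_root. When purchase order 12345 goes away, graph_delete("12345") removes everything recorded about it.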

Utility.xqy

Logically this set of triples describes a "thing." This "thing" has a name or ID, which in this case is the B-Node IRI. The thing can have attributes. These attributes can be values like "123" or they can be the IRIs of other things. The "thing" is called the subject and the attributes are predicates. The values of those attributes are called the objects. The pattern subject, predicate, object is called a triple. If the named graph is included, it is called a quad and has the pattern subject, predicate, object, and graph. For more details, take a look at the training offered through MLU.
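As a data structure, the triple and quad patterns are very small. The following Python sketch shows them as plain named tuples; the class names, the in_graph helper, and the po-id predicate are illustrative assumptions, not part of any MarkLogic API.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # the "thing", here a blank node label
    predicate: str  # the attribute
    obj: str        # the attribute's value, or the IRI of another thing


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str      # the named graph that holds the triple


def in_graph(triple, graph):
    """Turn a triple into a quad by attaching a graph URI."""
    return Quad(*triple, graph)


fact = Triple("_:b1", "http://mycompany.com/po-changes/diff/po-id", "12345")
quad = in_graph(fact, "http://mycompany.com/po-changes/graph/12345")
```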

The sets of triples describe a piece of information about the difference between two documents. The triples contain the purchase order id, which is constant across all versions, the URIs of the documents being compared, and a type designation to help with queries. Although they are stored together in the same function call, and logically they relate to the same thing, they are actually distinct pieces of information.

How did I decide what would be an attribute and where did I define these URLs? The answer to the first question is driven by the kinds of queries I'd like to ask. Because the data is contained in the purchase order document, I'm more interested in metadata queries about changes, such as which two purchase order documents are being differenced. Different applications may have different predicates, such as the date-time the difference was recorded. It is possible to store the actual document data in triples, but that generates a lot of triples. Given that MarkLogic can very efficiently query XML or JSON documents, there is no advantage to storing all the data as triples.

The answer about where the URLs originated is simply that I made them up. Because of SPARQL syntax, I can save myself some heartache by taking some care, so I anchored related predicates with either http://mycompany.com/po-changes/diff or http://mycompany.com/po-changes/types. There's nothing magical about these IRIs and I'm not using any specific ontology, although that should not stop you from using one if you so choose. In some cases choosing an official ontology will simplify the problem and may allow you to take advantage of MarkLogic's automatic inferencing.

Utility.xqy

When we come across a specific difference, we can call one of the functions create-changed, create-add, or create-delete. That creates a new set of triples related to the B-Node created above. One of the triples we add is the path in the target document. The function xdmp:path will return the path string to a given element in a document. We can use the function xdmp:unpath when we want to refer to that element in the document, given its path as a string. Given a difference, we can go to the exact document and element in that document with the difference.
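The round trip from element to stored path and back can be illustrated outside MarkLogic. In this Python sketch, create_change plays the role of create-changed, create-add, and create-delete, and resolve is a stand-in for xdmp:unpath built on ElementTree's limited XPath support. The predicate local names (part-of, diff-type, document, path) and the blank node labels are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

DIFF = "http://mycompany.com/po-changes/diff/"  # prefix from the article

triples = []  # in-memory stand-in for the triple store


def create_change(diff_root, kind, doc_uri, path):
    """Record one difference as triples hanging off its own blank node."""
    item = f"_:item{len(triples)}"  # simple unique label for this sketch
    triples.extend([
        (item, DIFF + "part-of", diff_root),   # hypothetical local name
        (item, DIFF + "diff-type", kind),      # "changed", "add", or "delete"
        (item, DIFF + "document", doc_uri),    # hypothetical local name
        (item, DIFF + "path", path),           # hypothetical local name
    ])
    return item


def resolve(root, path):
    """Map a stored path such as /po/item[2] back to an element."""
    prefix = "/" + root.tag
    rest = path[len(prefix):]
    if not path.startswith(prefix) or (rest and not rest.startswith("/")):
        raise ValueError(f"{path} does not start at <{root.tag}>")
    return root if rest == "" else root.find("." + rest)


doc = ET.fromstring("<po><item>pen</item><item>paper</item></po>")
item = create_change("_:diff1", "changed", "/po/12345/v2.xml", "/po/item[2]")
element = resolve(doc, "/po/item[2]")
```

Each call to create_change writes its triples immediately, so nothing has to be accumulated and handed back up through the recursive descent.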

Querying Differences

Utility.xqy

The first question is: given two versions of a purchase order, what are the differences between them? The function get-diff-root returns the B-Node IRI that anchors all the change triples, given a purchase order and a pair of purchase order documents. Given a B-Node IRI, listing the changes then becomes fairly trivial with a modest amount of SPARQL code.
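The lookup performed by get-diff-root amounts to finding the one subject that carries all three identifying values. The Python sketch below shows that logic over an in-memory list of triples; it is an assumption about the shape of the lookup rather than the original XQuery, and the predicate local names are hypothetical.

```python
DIFF = "http://mycompany.com/po-changes/diff/"  # prefix from the article


def get_diff_root(triples, po_id, old_uri, new_uri):
    """Return the subject that has all three identifying triples, or None."""
    wanted = {
        (DIFF + "po-id", po_id),             # hypothetical local name
        (DIFF + "old-document", old_uri),    # hypothetical local name
        (DIFF + "new-document", new_uri),    # hypothetical local name
    }
    found = {}
    for subject, predicate, obj in triples:
        if (predicate, obj) in wanted:
            found.setdefault(subject, set()).add((predicate, obj))
    for subject, matched in found.items():
        if matched == wanted:
            return subject
    return None


SAMPLE = [
    ("_:a", DIFF + "po-id", "12345"),
    ("_:a", DIFF + "old-document", "/po/12345/v1.xml"),
    ("_:a", DIFF + "new-document", "/po/12345/v2.xml"),
    ("_:b", DIFF + "po-id", "12345"),
    ("_:b", DIFF + "old-document", "/po/12345/v2.xml"),
    ("_:b", DIFF + "new-document", "/po/12345/v3.xml"),
]
root = get_diff_root(SAMPLE, "12345", "/po/12345/v2.xml", "/po/12345/v3.xml")
```

In SPARQL the same question is a single basic graph pattern with three triple patterns sharing one subject variable.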

Utility.xqy

Note that SPARQL, if you've never really used it before, conceptually finds all the triples that satisfy the given constraints. This function finds all the XPaths recorded for additions in the new purchase order document. It matches each item B-Node that has a diff-type predicate of "add", is related to the given difference B-Node, and is a submission diff. In this case it will return the items' B-Nodes and the XPath to each changed element.
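For readers new to SPARQL, the matching it performs can be imitated by hand. This Python sketch filters an in-memory list of triples using the three constraints described above; it is a conceptual illustration only, and the predicate local names and the "submission-diff" type value are assumptions.

```python
DIFF = "http://mycompany.com/po-changes/diff/"    # prefix from the article
TYPES = "http://mycompany.com/po-changes/types/"  # prefix from the article


def added_paths(triples, diff_root):
    """Return (item blank node, path) pairs for every recorded addition."""
    by_subject = {}
    for subject, predicate, obj in triples:
        by_subject.setdefault(subject, {})[predicate] = obj
    results = []
    for subject, props in by_subject.items():
        if (
            props.get(DIFF + "part-of") == diff_root            # related to the given diff
            and props.get(DIFF + "diff-type") == "add"          # an addition
            and props.get(TYPES + "type") == "submission-diff"  # hypothetical type value
            and DIFF + "path" in props
        ):
            results.append((subject, props[DIFF + "path"]))
    return results


SAMPLE = [
    ("_:i1", DIFF + "part-of", "_:diff1"),
    ("_:i1", DIFF + "diff-type", "add"),
    ("_:i1", TYPES + "type", "submission-diff"),
    ("_:i1", DIFF + "path", "/po/item[3]"),
    ("_:i2", DIFF + "part-of", "_:diff1"),
    ("_:i2", DIFF + "diff-type", "changed"),
    ("_:i2", TYPES + "type", "submission-diff"),
    ("_:i2", DIFF + "path", "/po/item[2]"),
]
additions = added_paths(SAMPLE, "_:diff1")
```

A SPARQL engine does the equivalent join for you: each condition in the if statement corresponds to one triple pattern sharing the same subject variable.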

In the application for which this code was developed, the combination of path information with a document was used to render a web-based view of the document with the old and new versions side-by-side. The additions were highlighted in green, the changes in yellow, and the deletions in red. This allowed an analyst to quickly and visually identify where the documents had changed. I was able to relate the change information back to a section of the document using the XPath that was saved in the triples describing the change.

Conclusion

While triples are normally associated with complex semantic applications, they are an interesting way to save data in and of themselves. In some cases it's cumbersome to manipulate a document (for example, with operations that could not be in the same transaction); triples avoid those conflicts. Using managed triples with a graph reduces the amount of work necessary to clean up the triples, should that be necessary. A particular sweet spot, given that MarkLogic already stores the document data in XML or JSON, is to use triples to store metadata without changing what's in the original document.
