When to Denormalize

by Dave Cassel

When building an application with MarkLogic, we need to figure out how to model our data. We generally start by identifying the relevant entities, which leads to a common question: what should I denormalize from one entity type into another? Let's illustrate with an example. 

Suppose my database has books and authors. In the relational world, I might have a schema that looks like this:

I've left out some details, such as the publisher and address tables, but this is enough to explore with. 

To bring this data into MarkLogic, we see that books and authors are our primary entities (of the tables shown). What goes into each type of document? Some of that is easy: a book will have a title, copyright, and ISBN. An author will have the person's name and date of birth. 

For other tables, we need to decide how to represent their information: create new documents for them? denormalize them into the documents we already have? represent them as semantic triples?

Let's start with an easy one: genre. This is a simple lookup table, where a book holds a genre_id, which points to the string that names that genre. In fact, the value could have been inserted into the book table in the first place, but it was moved to its own table to ensure the string value occurred in just one place.

There are two simple questions we can ask when we consider whether to denormalize a piece of information into entity documents (for instance, our book documents), versus looking it up in some other place (such as having genre documents).

  1. Will this piece of information be useful when searching for the entity?
  2. How likely is this piece of information to change (or how frequently will that occur)?

In the case of genre, these are pretty easy to answer. Including the label "Science Fiction" in the book record will make it possible to search by those terms, as well as supporting a simple facet on genre. On the other hand, a book's genre is unlikely to change; likewise the list of genres themselves will not change often. Let's update our structure.

An author's address is tied to an individual author and certainly belongs with those entities. Of course, an author (especially a successful one) may have multiple addresses. That's simple enough in XML -- we can just add <address> elements as needed. (Likewise, if we're working with JSON, we can set that up as an array.)

We still need to connect the books and the authors. We can do so either by adding some elements that refer from one entity type to another, or we can use semantic triples. While triples allow for some interesting queries, I'll leave that for another post and look at a simple connection here: in each book document, we'll add one or more elements pointing to the author(s). We can do that with the URI of the author documents, which gives us a simple connection that we can use to look up additional information, if we wish.

Let's pause for a moment, though, and revisit the questions above. Are there parts of the author documents that 1) help us search for books and 2) be unlikely to change? A very common use case is looking for books by author's name, so let's support that. Let's bring the author's name into each book. For this case, I think there's more value in treating the author's name as a whole than in breaking it into pieces.

Since MarkLogic will index the terms in <author-name>, we'll find this book when doing text searches for "Arthur C. Clark", "Arthur C Clark", "Arthur Clark", or even just "Clark". We can also (very easily) set up a facet on author-name to get counts of how many books each author has written.

Remember the two questions above when thinking about whether to denormalize a piece of information from one entity to another. To see an example of when not to denormalize, see Kasey Alderete's post on Keeping Reputation Consistent in Samplestack.

Comments

  • Hello Dave, Thank you for this post. I like your way to store link between the Author Name and the Author Data with URI. What is the best way to get the Author Data in the search result?
    • Let's assume you want to take a piece of information that has not gotten denormalized (for instance, dob) and display it along with the book information. If you're using the REST API, you can implement a <a href="http://docs.marklogic.com/guide/rest-dev/search#id_94556">search transform</a>, in which you can retrieve the value you want to display with the search results. As noted above, if this is something you'll want frequently, and the value won't change (date of birth shouldn't), you should probably denormalize the value instead of pulling it in at search time. On the other hand, if it is a value that will change frequently, retrieving it during the search is a good strategy.
      • Thank you for your answer. I want to take a piece of data that has not gotten denormalized and I make cts:search() for some books and I need to display for instance author address. I can loop the search result and add needed information using fn:doc(uri). Is it possible to use something else?Something like "LEFT JOIN" in SQL queries.
        • For something like an address, yes, what you're doing is the way to go. There is an alternative if you're looking for a piece of scalar data that you have a range index on: loop on the search results, but instead of <code>fn:doc()</code>, use <code>cts:element-values($element, (), (), cts:document-query($uri))</code>. By doing this, you're getting the value from an index instead of loading the whole document.
          • Thank you David! I appreciate your help.