10 questions for navigating a jungle of data

by Evan Lenz

I've been having a debate with someone about the best way to represent data in an application. In particular, the question is whether:

  1. the data should be left in its original, real-world, raw form, or
  2. the data should be manipulated and stored in an altered structure to facilitate subsequent processing.

Although the question is pretty universal, the answer is not. It really does depend on the application. As I was reflecting on this, I thought it might be helpful (to me as much as to anyone else) to write down what some of the criteria might be for making this decision: namely, whether you should revel in data's messiness (by writing apps against its pristine, wild form) or try to improve the world (by transforming, manipulating, enhancing, re-structuring, or otherwise pre-processing data before storing it in the database).

Reveling in the Messiness

Let's first look at some reasons why you might leave data in its raw, messy form.

1. How quickly do you need running code?

For one thing, it's often the fastest way to get started. Load the data, write a query, get some results. Messy data yield messy queries, but that's okay. We know how to push through it, machete in hand. An example of this would be a pure search app: load a large number of random documents off the Web into MarkLogic and sling up an App Builder interface to start searching right away. There's no need in this case to preoccupy ourselves with how the documents/data are represented.
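To make this concrete, here's a minimal sketch of the "load it and search it" approach in MarkLogic XQuery. The search term and the result limit are arbitrary placeholders; the query runs against whatever documents happen to be in the database, with no pre-processing at all.

    xquery version "1.0-ml";

    (: Minimal "raw search" sketch: a word query over every document
       in the database, returning the first ten matches. :)
    cts:search(fn:collection(), cts:word-query("machete"))[1 to 10]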

2. Is the data good enough already?

Another possibility is that the original data is not messy. It's already in a close-to-ideal form. Pre-processing would add unnecessary complexity. This is every developer's dream: data arrives in well-structured, searchable, semantic containers, ready to be fully leveraged by your application code. An example of this would be a set of XML documents that all conform perfectly to their respective schemas. Even when it's not perfect, the structure may be good enough.

3. How Big is the data?

The volume of the data alone doesn't directly speak to whether or not it should be pre-processed. If you have terabytes or even petabytes of data that don't have a lot of variety and aren't changing very fast (in other words, just the "Volume" in the 3 V's of Big Data), you still need more information before making the decision. Of course, pre-processing will take time in this case, but that's what technologies like Hadoop are for (e.g. in cooperation with the MarkLogic Connector for Hadoop). Even if you decide to change the structure later on, you can just kick off another Hadoop job. Other constraints have to be factored in. Volume alone doesn't preclude pre-processing.

4. How fast is the data coming in?

Here we're concerned with the "Velocity" of the data. If you want to provide up-to-date search, for example, across the Twitter firehose, you'd be hard-pressed to do any sort of pre-processing. It would be impractical. So you live with the mess in such cases.

5. How much does the data vary?

If your data has little uniformity, or if it has many different pockets of uniformity, it may be undesirable to do pre-processing simply because it would be so much work to 1) figure out everything that needs to be cleaned up, and 2) write the code to clean it up. Thus, the "Variety" of the data, too, can be a prohibitive factor in determining whether or not to pre-process your data.

Improving the World

Now let's look at some reasons why you might choose to do pre-processing.

6. Does the content need to be enriched?

Sometimes there's nothing wrong with the structure per se, but the data has additional, latent information that could be made explicit. One example is the automatic detection of entities, like people, places, or dates. To be directly queryable, those need to be pre-processed and tagged ahead of time so they make it into the database index. Another example is finding latent links and encoding them ahead of time. Such pre-processing adds value to the data that could not otherwise be leveraged.
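One illustrative sketch, assuming a hand-made list of entity names and made-up document URIs (a real project would more likely call out to an entity-extraction service): cts:highlight can wrap each detected entity in new markup before the enriched copy is stored, so the entities land in the database index.

    xquery version "1.0-ml";

    (: Hypothetical enrichment step: wrap known person names in a <person>
       element so they become directly queryable once stored. :)
    let $raw    := fn:doc("/raw/article-1.xml")
    let $people := ("Ada Lovelace", "Alan Turing")  (: stand-in entity list :)
    let $enriched :=
      cts:highlight($raw, cts:word-query($people), <person>{$cts:text}</person>)
    return xdmp:document-insert("/enriched/article-1.xml", $enriched)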

7. Does the data need to be optimized?

Sometimes it's not an issue of whether the data is tagged but whether it's tagged in such a way that the database technology can properly leverage it. Sometimes you need to re-structure the data—perhaps even normalizing it—to get better performance. Sometimes updates to the database technology itself obviate the need to perform pre-processing. MarkLogic's upcoming XPath range indexes are a case in point, eliminating a fairly large class of pre-processing needs.
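As a rough sketch of that kind of re-structuring (the element names and URIs here are made up), you might hoist a value that's buried in formatted text into its own element, so that an element range index can then be configured against it.

    xquery version "1.0-ml";

    (: Illustrative optimization step: pull a four-digit year out of a
       free-text <publication-info> element and surface it as <pub-year>,
       a shape that an element range index can use directly. :)
    let $doc  := fn:doc("/raw/book-42.xml")
    let $year := fn:replace(fn:string($doc//publication-info), "^\D*(\d{4}).*$", "$1")
    let $optimized :=
      element { fn:node-name($doc/*) } {
        $doc/*/@*,
        $doc/*/node(),
        element pub-year { $year }
      }
    return xdmp:document-insert("/optimized/book-42.xml", $optimized)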

8. Who controls the data?

When you have no control over the data, you have to work with what you get. But when you have control over the generation of the data (such as documents in a technical documentation project), you have the luxury of knowing what all the possible combinations are. In that case, you either don't need to do pre-processing (because it's already in an ideal form) or, if not, you can re-structure the data for the convenience of your application, with the assurance that you're covering every possible case.

9. Where is the canonical source of the data?

If it's in the database, then you need to make sure the pre-processing is "lossless." Since you can't anticipate all the use cases in advance, destroying data is never a good idea. If your content is stored canonically outside the database (e.g., in a version control system in the case of authored content), then it's okay if your transformed data doesn't include everything. If you need to add something later, you can just re-generate the optimized data from its canonical source.

10. How much simpler will pre-processing make your code?

When data is hairy, code is even hairier. It's possible to write well-structured code against badly structured data, but it's much easier to write good code against well-structured data. If you decide to use a pipeline approach to transform the data into more manageable forms, you get the benefit of isolating the messiness to the early stages. "Downstream" code will be much easier to write. Note: you can sometimes apply a pipeline approach on the fly, too, performing the pre-processing at run time (e.g., when a user requests a page). Whether to store the data in its pre-processed form goes back to the optimization question. Is the pre-processing so expensive that it will make your user wait significantly longer? Do the results of the pre-processing need to be indexed? If the answer to either is yes, do it ahead of time.
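Here's a skeletal sketch of such a pipeline; the stage bodies are placeholders (only the shape matters), and the storage URIs are made up. Each stage is a function from document to document, the messiness is confined to the early stages, and the same pipeline could just as well run on the fly instead of ahead of time.

    xquery version "1.0-ml";

    (: Placeholder stages: each takes a document node and returns a cleaner one. :)
    declare function local:normalize($doc as node()) as node() {
      $doc  (: e.g., fix encodings, strip presentational markup :)
    };
    declare function local:enrich($doc as node()) as node() {
      $doc  (: e.g., tag entities, as sketched earlier :)
    };
    declare function local:pipeline($doc as node()) as node() {
      local:enrich(local:normalize($doc))
    };

    (: Run ahead of time and store the results under new URIs.
       (cts:uris() requires the URI lexicon to be enabled.) :)
    for $uri in cts:uris()
    return xdmp:document-insert(
      fn:concat("/processed", $uri),
      local:pipeline(fn:doc($uri)))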

Embracing Evolution

Most often, the question of pre-processing is not an all-or-nothing proposition. You may decide to work from the data "as is" at first but enrich it later. Or you may find additional optimizations later on that suit the needs of new applications. Finally, you may realize that keeping multiple copies of the data, each optimized with a different purpose in mind, is the way to go. The important thing to recognize is that it's a matter of engineering. There's no one best approach. The above questions are just a few of the ones you'll want to ask when deciding on this issue yourself.

What criteria would you add? (Share any thoughts below.)
