The world isn’t flat: and neither is your data

Universum_0After you’ve worked with data for a while, there is a temptation to think about it purely in terms of the conventional way in which it is stored: rows and columns. That is, after all, how it appears on screen whenever you run a simple query; it’s very often how data is displayed in a report, particularly if you’re in the habit of exporting to Excel to continue your analysis. If you stop to think about it, though, rows and columns are very far removed from the information they actually encapsulate. It’s important to remember that they’re only a representation, not to be taken too literally. The tendency to understand data only in terms of rows and columns can encourage what I like to call ‘flat earth thinking’: neither accurate nor likely to get you very far.

Much like the astronomical theory itself, flat earth thinking seems reasonable at first glance. Based on simple observation, the world around us seems pretty horizontal; the concept of our world as a spherical object defies the basic empiricism we use to navigate it. But while it is arguably a useful rule of thumb for getting from A to B locally, it starts to fall down when we want to plot a course over a much greater distance. Not only that, but flat earth thinking is factually incorrect: the world just isn’t flat. In much the same way, although a useful abstraction, data isn’t just rows and columns. Treating it as if it were can help us to make basic calculations, draft reports and derive certain metrics, but in many ways it blinds us to data’s true potential.

Take the example of a simple legal contract: you decide to buy a sofa from a department store and do so on credit. You agree to pay the department store a sum of £1,000 over the course of 10 months. If we were to put this in terms of rows and columns, we could construct a simple table as follows:

Customer Name Product Name Cost Instalments Credit Remaining
Jane Smith Deluxe 3-Seater (Maroon) 1000 10 1000

Storing the information in this way seems reasonable enough; it captures the pertinent details of the contract. Now let’s fast forward two months. Will our representation of the contract change? If we stick to our model, it will look like this:

Customer Name Product Name Cost Instalments Credit Remaining
Jane Smith Deluxe 3-Seater (Maroon) 1000 8 800

But there’s a problem here: how do I know what the original number of instalments were at the time the contract was taken out? Well, I could probably reverse-engineer it from the information I have: given the cost price is £1,000 and the credit remaining is £800 with a further 8 instalments to go, I know my instalment amount is £100 – so Jane has presumably made two instalments already. Except, the world doesn’t always work in such a plain (plane?) fashion. It’s possible that Jane actually made an initial down payment of £200 and only took out £800 in credit, meaning the 8 instalments remaining actually represent the original number of instalments in the contract. So, maybe what’s missing here is a contract date. After all, if we know when the contract was taken out, we can use the current date to work out how much time has passed and, therefore, how many instalments have elapsed. Not so fast! Many stores offer an amnesty period, meaning the customer pays nothing for the first year. In this case, just knowing the contract date won’t really help. What’s more, a contract’s terms can be revised from the point it was taken out. Then there are other factors, such as an interest rate that the store may apply to the credit, thereby increasing the total credit remaining such that it actually represents a larger figure than the original cost price. Now, we could just keep bolting on more and more columns to our original data model in an effort to capture all of this information. But we’ll still run into problems with things like interest rates (particularly if they apply after a certain date, or increase in the event of delinquency) and variable values like instalments and credit remaining, both of which will change over time. The real world is actually a lot less flat – and a lot more complicated – than our simplistic rows-and-columns abstraction.

So how do databases that store and track contract data work in practice? Much like our spherical world, they use context. Just as it isn’t 3pm everywhere in the world at once, so it’s the case that data is treated differently depending on the context applied to it. A contract, therefore, is not represented by a single record – even though it relates to a single document in real life – but rather a set of records. Those records are all different, and no one record is identifiably ‘the’ contract. The instalment plan will be stored in one place – and may relate to a multitude of other contracts, not just our one example. The instalments paid will be captured in a transactional table, with dates for each payment made. The terms of the contract itself will be tracked in a CDC (change data capture) format, allowing the real world contract to be reconstructed for any given point in time. This means that no matter how many instalments have been made, how much interest charged or how many terms have been revised, I can still recreate what the contract looked like when it was first taken out – and at any given time since. The data ceases to be flat: a real world thing becomes a set of records spread across multiple tables, and its context means that it can be understood not only as it exists now but as it existed in the past. This contextual nature even permits something close to time travel: contract terms can be revised that take effect in the future, and that future-state can be captured in the database even though in real life that future-state has not yet come to pass. This allows a business to make more accurate forecasting, because they know how much money they can expect to bring in, not just how much they have already made.

A lot of this may seem like common sense, and indeed, nothing about it is especially revolutionary. But too often, data mappers and data modellers get trapped inside the world of rows and columns, believing that simply knowing what attributes a table has will enable them to understand and use the data in front of them. This obscures the true nature of the information the data contains, and promotes a flat earth mentality that gets in the way of handling the data correctly. This is why data mapping, as an exercise, is not simply about taking attribute A in system X and working out where to slot it into system Y. True data mapping involves a deep understanding of the context in system X in order to work out how best to model that data in system Y. Mapping and modelling therefore go hand-in-hand, and are not isolated disciplines. Ultimately, data architecture is about the perfect blend of both disciplines – along with a strong system knowledge base and a library of best practices.

So, the next time you find yourself staring at a screen of rows and columns, remember that what you’re really looking at is a flat earth. The information contained within those rows and columns has a lot more context than the mere attributes of any given table. Like the real world, it is a lot more complex; but also, properly understood, capable of reaping much greater rewards. Just ask Columbus.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: