A new colleague joined my project this week. He has never worked with data architects before and was curious about what, exactly, we do. He knew about the end result: typically an analytical report of some kind, measuring key performance indicators or providing answers to common business questions. “Don’t you just need to write a few SQL queries to get that information?” he asked. In a funny sort of way, he’s right. But there’s a little more to it than that.
Data is messy – even when it’s structured and organised in line with well-founded principles. This is because it is rarely, if ever, created with reporting or analysis in mind. Customers often want to be given access to what they consider to be their data. Unfortunately, without a strong command of how that data is stored, simply being given access to the system won’t help. Faced with a mountain of different tables, many of which will not link together directly, a typical business user won’t know what they’re looking at. In frustration, they may combine several unrelated tables in a query and end up with a totally meaningless result. “Data on demand” can, in these circumstances, lead to unforeseen and potentially damaging consequences.
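To make that failure mode concrete, here is a minimal sketch – the customer and payment tables, their columns, and the scenario are all invented for illustration – of the kind of query an unguided user might write:

```sql
-- Hypothetical tables: a user wants "payments per customer" and joins
-- two tables that have no direct relationship between them.
SELECT c.customer_name,
       p.amount
FROM   customer c,
       payment  p;   -- no join condition: every customer is paired with
                     -- every payment, so any totals are meaningless
```

Sum the amount column of that result and revenue is overstated by a factor of the number of customers – a plausible-looking report built on nothing.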
Data comes from all kinds of places, from supermarket tills to financial ledgers, aviation booking systems to RFID tracking applications. It is captured in a database modelled for optimal system performance. Typically this means a high degree of normalisation: each distinct piece of data is held in its own table and then linked, via foreign key relationships, to every other relevant piece of data. This kind of model is not intended for human intelligibility. It is also designed to reduce data duplication as far as possible. That is, if a value exists in one record in one table, that same value should not be repeated in any other record in any other table. Instead, whenever that value is required, a key should be used to link to it. This allows the value to be updated, as needed, once and only once. All other references to it are made via its key – which does not change – and so automatically inherit the update. For example, a magazine subscription system can store your address in one table and your monthly subscription records in another. In the subscription records table, they need only hold a key to your address rather than the address itself. If you move home, they can update your address once and all your subscription records will automatically be associated with your new address.
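In SQL, that might look something like the sketch below. The table and column names are invented for illustration rather than taken from any real system:

```sql
-- Illustrative normalised schema for the magazine example.
CREATE TABLE address (
    address_id  INTEGER PRIMARY KEY,
    street      VARCHAR(100),
    city        VARCHAR(50),
    postcode    VARCHAR(10)
);

CREATE TABLE subscription (
    subscription_id INTEGER PRIMARY KEY,
    publication     VARCHAR(50),
    month_paid      DATE,
    amount          DECIMAL(10, 2),
    address_id      INTEGER REFERENCES address (address_id)  -- a key, not the address itself
);

-- A house move is a single update; every subscription row reaches the
-- new address through the unchanged key:
UPDATE address
SET    street = '1 New Street'
WHERE  address_id = 42;
```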
So far, so useful. But when it comes to interrogating that data, it can be something of a labyrinth. Imagine you want to know how much money subscribers have paid for a given publication this year. The likelihood is that the data will be spread across four or five distinct tables, and the relationships between those tables may not be straightforward. With sufficient knowledge of the data model, it will eventually be possible to write a query that returns the answer needed. But that’s just one possible question; you may then want to know the regional breakdown of those subscribers. Or, of those who subscribe to one magazine, how many also subscribe to another? Each of these questions will need its own specific query. All the while, the underlying data is in flux. If a customer moves home, for example, their address record will be overwritten. This can wreck a regional analysis, because past subscription records become associated with the customer’s new address rather than their old one.
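Against a fuller operational model than the two-table sketch above – assume hypothetical publication, subscription, payment, customer and address tables – the first question alone might look like this:

```sql
-- How much have subscribers paid for a given publication this year?
SELECT SUM(p.amount) AS total_paid
FROM   payment      p
JOIN   subscription s  ON s.subscription_id = p.subscription_id
JOIN   publication  pb ON pb.publication_id = s.publication_id
WHERE  pb.title = 'Example Monthly'
AND    EXTRACT(YEAR FROM p.paid_date) = EXTRACT(YEAR FROM CURRENT_DATE);

-- The regional breakdown needs two further joins, and it groups by the
-- customer's CURRENT address, silently misplacing older payments:
SELECT a.region,
       SUM(p.amount) AS total_paid
FROM   payment      p
JOIN   subscription s ON s.subscription_id = p.subscription_id
JOIN   customer     c ON c.customer_id     = s.customer_id
JOIN   address      a ON a.address_id      = c.address_id
GROUP  BY a.region;
```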
Enter, stage left, the data architect. Their fundamental remit is to construct a data model that transforms raw data into real information. By first learning the existing data model of an operational system, they can in turn create a new model that is designed around human, rather than machine, utility. This new model – often in the form of a data mart or data warehouse – takes the existing data and reshapes it into a format that makes sense to a business user. What is more, this new model can preserve history rather than overwriting it. This allows business users to track changes across time and avoid having their analysis corrupted by updates in the raw data. The new model also reduces the number of joins needed to bring related data items together. Instead of keeping every piece of data in its own table, the data architect denormalises the source model, collapsing many of the joins. Not only does this make the data more intelligible to the business user, but it also increases the speed with which queries return their results. Fewer joins generally lead to higher performance.
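One common shape for such a model – a sketch, not the only possible design – is a star schema: a central table of facts surrounded by descriptive dimension tables, with history preserved by versioning dimension rows (what warehouse practitioners call a type 2 slowly changing dimension). The names below are, again, invented for illustration:

```sql
-- Sketch of a denormalised reporting model for the magazine example.
CREATE TABLE dim_subscriber (
    subscriber_key INTEGER PRIMARY KEY,  -- surrogate key, one per version
    customer_id    INTEGER,              -- business key from the source system
    name           VARCHAR(100),
    region         VARCHAR(50),
    valid_from     DATE,
    valid_to       DATE                  -- NULL while this version is current
);

CREATE TABLE fact_payment (
    subscriber_key INTEGER REFERENCES dim_subscriber (subscriber_key),
    publication    VARCHAR(50),          -- held inline: one join fewer
    paid_date      DATE,
    amount         DECIMAL(10, 2)
);

-- When a customer moves home, the old dimension row is closed off and a
-- new one inserted. Past payments stay linked to the old region; only
-- new payments pick up the new one.
```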
If business users ask for access to this new model, they will discover that it much more closely resembles the way they understand their data. As a result, they can interrogate it easily and quickly get answers to their questions. Instead of writing a fiendish, bespoke query for each question, they can answer many related questions with a single query. What is more, they do not need to invest time in learning the source data model, because the data architect has done this for them. The complexity of the source has been absorbed into a new model that serves their needs.
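Against the sketch above, for instance, both earlier questions collapse into one short, readable query (the names remain illustrative):

```sql
-- Total paid this year, per publication and per region, in one query.
-- Versioned subscriber rows keep past payments in their old regions.
SELECT d.region,
       f.publication,
       SUM(f.amount) AS total_paid
FROM   fact_payment   f
JOIN   dim_subscriber d ON d.subscriber_key = f.subscriber_key
WHERE  EXTRACT(YEAR FROM f.paid_date) = EXTRACT(YEAR FROM CURRENT_DATE)
GROUP  BY d.region, f.publication;
```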
So my colleague is right: reports are essentially SQL queries. But data architects ensure those queries are targeted at the right kind of data model.