It is fair to say that the foundation stone upon which nearly every BI solution is built is the data warehouse. So what is it, and what makes it different from a conventional database?
The first thing to stress is that a data warehouse is still a type of database. What makes it different is the way in which it is designed and the purpose that it serves. Broadly speaking, there are two kinds of database in the enterprise world: one is known by the slightly cryptic acronym OLTP, and the other is the data warehouse (sometimes abbreviated as DWH). I will begin by sketching out the differences in function between these two types before explaining in more detail what makes a database a data warehouse.
OLTP, or online transaction processing to give it its full name, refers to databases that underpin critical business functions. As the name suggests, they handle transactions in real time. They are found in all kinds of industries, and the chances are that you’ve interacted with plenty of them in your daily life. Whenever you pay by credit card, withdraw cash, book a flight or buy your groceries, you will have touched an OLTP database. They deal with multiple, small volume records that must be processed quickly and without contention. Thousands of people across the world will buy something from eBay at the same time, and the OLTP system powering eBay must be capable of dealing with all those transactions simultaneously. The emphasis in an OLTP database is therefore on insert-update-delete with a modest requirement for select queries. Referential integrity is very important, and the database will typically be designed in third normal form with a high degree of normalisation. This means little to no duplication of reference values and a large number of individual tables for very specific purposes. When information changes it will often be overwritten because the OLTP system exists to serve a function rather than to maintain a history. The challenge for those who work with OLTP databases is in ensuring near perfect uptime and not allowing two or more users to insert, update or delete the same record at the same time. OLTP handles real time data, the most extreme examples of which are to be found in trading platforms that support banking and financial trading. They will often be spread across multiple database nodes, with plenty of failovers built in to allow for the inevitable system failure of one or more of those nodes at any given time.
A data warehouse, by contrast, is something which a consumer or end user is less likely to experience first hand. Its data, rather than coming in directly from real world transactional scenarios, will more likely be sourced from an OLTP system. Indeed, in many ways a data warehouse complements the OLTP database and relies upon it to provide it with a consistent stream of information. The incoming data will often be provided in batch, with millions of rows being loaded at once rather than one by one. Furthermore, this incoming data will not simply be copied in terms of structure from its source. Rather, the data will be mapped to a new model within the data warehouse. This model will not follow third normal form, and will tend to incorporate plenty of denormalisation. The most common design employed will be that of the star schema. The process by which data is extracted from another system, transformed into the appropriate structure and loaded into the new model is known as ETL.
What is the purpose of duplicating and manipulating data in this way? The data warehouse serves a very different function from that of the OLTP system. It is designed in a way that makes querying that data easy, intelligible and fast – without impacting on the performance of the real-time transactions taking place in the source OLTP database(s). As alluded to above, an OLTP database comprises many tables, and its schema is not easy to comprehend at first glance. There may be more than one OLTP system within a business that serves different functions. The star schema of a data warehouse, meanwhile, is one that makes sense to someone seeking to build reports or run ad-hoc queries across a wide range of different business entities. It is focused upon fact tables, whose records are enriched by joining to one or more dimension tables containing reference data. It is because a data warehouse isn’t handling real time transactions that it can afford to sacrifice the cost of heavy inserts and updates for a pay-off in the form of rapid query response times. Furthermore, because it isn’t typically a real time system, the refresh of data from OLTP sources can occur overnight during a period of scheduled downtime.
Once constructed, it is possible to interact with the data warehouse via reporting tools like IBM Cognos and SAP BusinessObjects. Running queries and reports against the warehouse will only impact other users seeking to do the same – the data itself will remain consistent throughout the reporting period because no inserts, updates or deletes will be taking place at the time. A data warehouse that has been optimally tuned will be able to return query results in a matter of seconds, meaning minimal resource contentions between users. Even if a user is unable to get the results of their query in as fast a time as they are accustomed to, in a business environment this is more likely to cause a minor inconvenience than a major problem. By contrast, if an OLTP system is unable to process simultaneous inserts quickly enough, it could mean the difference between a profitable sale and a heavy loss.
A data warehouse is also designed to maintain a change history for data, particularly reference data. Where an OLTP system will just provide the most up-to-date version of a record, a data warehouse will be able to bring back prior versions of a record along with a history of when and how it was changed. This means the dimensions to which a fact table is joined will be time-sensitive, providing the version of the reference data appropriate to the date on which that particular fact record was loaded. It also allows for the retrieval of information that might otherwise have been lost in the source OLTP system.
Data warehouses are crucial to business intelligence strategy because they provide a single source of truth uncomplicated by the demands of data in flux. They can be sourced from multiple systems and thereby act as the most reliable, consolidated source of information within an organisation. They can be used to answer complex business questions and thus inform the decision-making process. A serious BI solution should always have a data warehouse at its core.
[…] The first pillar collects relevant business data together into one database (called a “data warehouse“), structures it intelligibly and then ensures that the database is “fed” from […]
[…] human, rather than machine, utility. This new model – often in the form of a datamart or data warehouse – takes the existing data and reshapes it into a format that makes sense to a business user. […]