It’s been over ten years since I coded my first ever ETL routine. Since then I’ve collected a number of important lessons and best practices, presented here as my very own ‘Ten Commandments’ of ETL.
1. Know your data
Whether you’re defining an ETL strategy, designing a set of data flows or writing the code, the single most important rule is to know the data you are working with. Whatever your preconceptions might be, the truth is that no two data sets or source systems are the same. It is crucial to have a thorough, deep-seated understanding of everything from the high-level purpose of the data set down to the individual meanings of each of the attributes. Quiz your SMEs, carry out detailed data profiling and draw up your own ERDs and data glossaries to ensure you are completely on top of the data. It may seem like a tall order, particularly when working in a pressured environment where delivery of data is prioritised above everything else, but I have witnessed first-hand the danger of trying to do any kind of ETL at any level without a clear idea of what the data means. The guiding principle should be to know the data as well as the person who will ultimately consume it. Only then can you successfully demonstrate that your ETL routines have worked correctly and that the end result meets the original requirements. Trying to do ETL “in the dark” is the fastest path to disaster, so take the time to get to grips with your data up front.
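Basic data profiling needn’t be elaborate. Here is a minimal sketch in Python: the column names and sample rows are purely illustrative, not from any real source system.

```python
def profile(rows):
    """Summarise each attribute of a list of dict records:
    non-null count, distinct value count and a small sample."""
    columns = {key for row in rows for key in row}
    summary = {}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        summary[col] = {
            "non_null": len(non_null),      # how complete is the attribute?
            "distinct": len(set(non_null)),  # is it a code, a key, free text?
            "sample": non_null[:3],          # eyeball a few real values
        }
    return summary

# Hypothetical extract used to exercise the profiler
rows = [
    {"customer_id": 1, "country": "UK"},
    {"customer_id": 2, "country": "UK"},
    {"customer_id": 3, "country": None},
]
summary = profile(rows)
```

Even a crude summary like this quickly reveals surprises, such as a supposedly mandatory attribute that is mostly null, before they surface downstream.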
2. Plan your route
Just as you wouldn’t normally set out on a journey to a place you haven’t been before without first consulting a map or an online route planner, so you should not attempt to define extractions, transformations and loads without first defining your data mapping. The dual benefit of a complete end-to-end data mapping is that it helps you (or your team) to write the ETL and it ensures a detailed level of documentation that can later be referred to if there are any quibbles or complications. A proper data map can be used in either direction, either to define the steps to get a source attribute or measure into a target schema or to trace that attribute or measure back to its original source.
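One way to get that dual benefit is to hold the mapping as data rather than burying it in code, so it can be queried in either direction. A sketch, with hypothetical attribute names (`src_orders.amt`, `dw_sales.net_amount` and so on):

```python
# The data map as a queryable structure; in practice this might live in a
# spreadsheet or a metadata table rather than in code.
MAPPING = [
    {"source": "src_orders.amt",  "target": "dw_sales.net_amount",
     "rule": "divide by 100 (pence to pounds)"},
    {"source": "src_orders.cust", "target": "dw_sales.customer_key",
     "rule": "surrogate key lookup"},
]

def trace_forward(source_attr):
    """Where does a source attribute end up in the target schema?"""
    return [m["target"] for m in MAPPING if m["source"] == source_attr]

def trace_back(target_attr):
    """Which source attribute does a target attribute come from?"""
    return [m["source"] for m in MAPPING if m["target"] == target_attr]
```

The same structure that drives the build then doubles as the lineage documentation referred to above.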
3. Prepare to fail
It is not at all pessimistic to define error handling and failure recovery in your ETL framework; it is simply realistic. Even the best-coded ETL in the world cannot hope to avoid a situation in which data corruption occurs at source. Placing error handlers in your ETL code is standard practice, but best practice pre-empts errors that are totally outside of your control. That means not only defining procedures that flag or spit out failed records in line with pre-defined business rules, but also marking every row of every ETL batch or stream with an identifier that can be used, if needed, to roll back the entire ETL run in the event of system failure or source data corruption. Of course, this pre-emptive roll-back logic also has the added advantage of making the development and unit-test cycle easier to manage, as entire runs can be reversed at any step – or artificially stopped and restarted when needed. Another means of guarding against failure is to define a set of data checks that can be run at the end of a load and reported out, either via e-mail or an online reporting system. This gives honest quality assurance to your end consumers and can be immensely powerful for quickly spotting spikes and other unanticipated data errors.
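The batch-identifier idea can be sketched in a few lines using an in-memory SQLite database from the Python standard library; the table and column names are illustrative. Every loaded row is stamped with a `batch_id`, so an entire run can be reversed with a single delete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (batch_id INTEGER, amount REAL)")

def load_batch(conn, batch_id, amounts):
    """Load a run, stamping every row with its batch identifier."""
    conn.executemany(
        "INSERT INTO dw_sales (batch_id, amount) VALUES (?, ?)",
        [(batch_id, a) for a in amounts],
    )

def roll_back_batch(conn, batch_id):
    """Reverse an entire ETL run by its batch identifier."""
    conn.execute("DELETE FROM dw_sales WHERE batch_id = ?", (batch_id,))

load_batch(conn, 1, [10.0, 20.0])
load_batch(conn, 2, [99999.0])    # pretend this run loaded corrupt data
roll_back_batch(conn, 2)          # the good run from batch 1 is untouched
total, = conn.execute("SELECT SUM(amount) FROM dw_sales").fetchone()
```

The post-load data checks mentioned above can be built on the same identifier: sum and count each batch, compare against the extract, and report the result.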
4. Extract at speed, load at leisure
Whether you’re conducting ETL in batch or real-time message queues, the best possible solutions are those which get the data out as fast as possible. This ensures the least overhead on source systems – which, in general, are transactional in nature and whose performance should not be compromised – and allows for multiple runs and re-runs from the same dataset in the event of ETL runtime failure. It is far easier to run an archived source file from a prior run back through the ETL process than it is to re-extract the data from source, not least because source systems often overwrite certain data elements, making it impossible to repeat an extract from a prior point in time. Once data has been extracted, the ETL routines themselves become less time critical, and in any case the dedicated hardware of the ETL system can be used for the transformations and loads rather than the hardware of the source system, for which the ETL is a secondary consideration.
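The extract-then-archive pattern might look like the following sketch: the source is touched exactly once, and the archived snapshot can then be replayed through the transformations any number of times. The in-memory "archive" and the fake source rows are illustrative stand-ins for a real file and a real system.

```python
import csv
import io

def extract(source_rows, archive):
    """One fast pass over the source; everything downstream reads the archive."""
    writer = csv.writer(archive)
    for row in source_rows:
        writer.writerow(row)

def transform(archive):
    """Re-runnable at leisure against the archived snapshot."""
    archive.seek(0)
    return [int(qty) * 2 for _, qty in csv.reader(archive)]

archive = io.StringIO()                      # stands in for an archive file
extract([("widget", 3), ("gadget", 4)], archive)
first_run = transform(archive)
second_run = transform(archive)              # replay without re-touching source
```

A failed transformation run costs nothing here: rerun it against the archive, with no second hit on the source system and no risk that the source has moved on.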
5. Small steps, not giant leaps
When designing ETL code, you should try to make it as modular as possible. Each individual transformation should take place in its own procedure or data flow, and as far as you can you should aim to stage – and then archive – the data so that it is possible to obtain a complete data audit from source to target. There are also coding advantages to this modular approach: it makes the code easier for other developers to read, and complex transformations can be split out into multiple simpler steps. You should not feel afraid to make use of temporary tables; many ETL developers view this as somehow ‘cheating’, but in my experience there are no awards to be had in writing gigantic queries or procedural blocks that accomplish many things simultaneously. Far better to split it out into smaller components and avoid flushing out the temporary data stages until the next ETL run kicks off; that way you can easily retrace your steps both in the code and in the intervening results.
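As a toy illustration of that staging discipline, each transformation below is its own small step and its output is retained, so any intermediate result can be inspected afterwards. The cleansing steps themselves are invented for the example.

```python
stages = {}

def stage(name, rows):
    """Record a named intermediate result; in a real system this would be a
    staging or temporary table, flushed only when the next run kicks off."""
    stages[name] = rows
    return rows

# Each step does one small thing, rather than one giant query doing them all.
raw = stage("extracted", ["  Alice ", "BOB", ""])
trimmed = stage("trimmed", [r.strip() for r in raw])
non_empty = stage("filtered", [r for r in trimmed if r])
final = stage("cased", [r.title() for r in non_empty])
```

If the final output looks wrong, you can walk back through `stages` step by step and see exactly where the data went astray, which is the audit trail the paragraph above argues for.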
6. Recycle, reuse and re-purpose
It is an old adage in computer science that you should not attempt to reinvent the wheel, and ETL is no different. What is more, shared packages and functions should be the preferred approach for any transformation. Not only does this shorten the overall coding time for a given project, it also provides a level of standardisation: changes to a single shared function are automatically inherited by every ETL routine that uses it. Of particular note for ETL developers coding by hand (rather than using an ETL tool) is the need for clear design patterns that can be used repeatedly. It is not enough to simply copy and paste code; you should aim to have a clear design pattern and only then select an existing piece of code that matches that pattern. Failing to do this can lead to copy-paste errors and to the incorporation of incompatible business rules that are not fit for the purpose at hand.
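The shared-function point in miniature: one cleansing rule, used by every routine that needs it, so a change in one place is inherited everywhere. The postcode rule and the routine names are illustrative.

```python
def standardise_postcode(raw):
    """One shared transformation; amend it here and every routine inherits
    the change automatically."""
    return raw.replace(" ", "").upper()

def load_customers(rows):
    # Both loaders call the shared rule rather than pasting their own copy,
    # so the two targets can never drift apart.
    return [{**r, "postcode": standardise_postcode(r["postcode"])} for r in rows]

def load_deliveries(rows):
    return [{**r, "postcode": standardise_postcode(r["postcode"])} for r in rows]

customers = load_customers([{"postcode": "sw1a 1aa"}])
```

Had each loader carried its own pasted copy of the rule, a later tweak to one of them would quietly desynchronise the two targets, which is exactly the copy-paste hazard described above.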
7. Use parameters
One of the worst crimes in any kind of programming is the use of hard-coded values. If you are ever tempted to hard-code a value, turn it into a parameter. You should not feel discouraged from using parameters just because they suggest a need for user input: binding your parameters to a lookup table is a practice I would highly recommend, because it both enables automated ETL runs and can still be customised simply by updating a row in a table.
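A minimal sketch of the lookup-table approach, with a Python dict standing in for the parameter table and entirely invented parameter names and values:

```python
PARAMETERS = {                 # in practice, a table in the ETL database,
    "batch_size": "500",       # customised by updating a row, not the code
    "source_dir": "/data/inbound",
    "cutoff_days": "30",
}

def get_param(name, cast=str):
    """Fetch a parameter from the lookup, casting it to the required type."""
    return cast(PARAMETERS[name])

batch_size = get_param("batch_size", int)   # no hard-coded 500 in the code
cutoff_days = get_param("cutoff_days", int)
```

The automated run reads its configuration at start-up; changing behaviour for the next run is a one-row update, with no redeployment and no human at the keyboard during the run itself.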
8. Conforming trumps transforming
When taking data from a source system and manipulating it for consumption, there is often a requirement to update reference data values so that they make sense to the end user. However, simply carrying out a translation in the code is a bad way to go. It is opaque to those outside of the ETL backroom and it also runs the risk of losing vital information along the way. Instead, you should implement reference data conformity, the translations for which should be captured in a transparent table available to end users as well as the ETL code. In that table you can define multiple names and descriptions for the same value, whilst still retaining the original reference data value from source. This guarantees transparent, traceable (and numerous) translations for singular reference data values unconstrained by the limits of coded transformations.
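A conformity table of that kind can be sketched as follows; the status codes and descriptions are invented for the example. Note that the original source value is always retained alongside however many conformed names the business needs.

```python
CONFORMED_STATUS = {
    # source value -> conformed names; the raw value from source is kept as-is
    "01": {"source_value": "01", "short_name": "Open",
           "long_name": "Order open / awaiting fulfilment"},
    "02": {"source_value": "02", "short_name": "Closed",
           "long_name": "Order closed / fully delivered"},
}

def conform(source_value):
    """Look up the conformed names for a source value; unmapped values are
    flagged rather than silently dropped, so no information is lost."""
    return CONFORMED_STATUS.get(
        source_value,
        {"source_value": source_value, "short_name": "Unknown",
         "long_name": "Unmapped source value"},
    )
```

Because the table lives in the database rather than inside the code, end users can see exactly how "01" became "Open", and an analyst can always trace a conformed description back to the raw value from source.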
9. Don’t take shortcuts
This is one of those rules that sounds easier to follow than it is in practice. All of us, whether we’re architects or analysts, designers or developers, believe in doing the best possible job and adopting the best practices available; but under the stressful conditions of a pressured project, it is all too easy to take shortcuts. We justify them to ourselves in the name of getting the job done, but in reality speed and care are not as mutually exclusive as you might think. Taking the time to carry out the ETL properly can ultimately save you endless hours of debugging awkward edge cases and unanticipated data anomalies. Decent logging, auditing and data profiling will pinpoint errors during the test phase and allow you to focus on fixing them rather than merely diagnosing them. Clean, elegant, modular code is easier to read and correct where needed. And clear, detailed documentation in the form of data mappings, ERDs and glossaries can instil trust in the target data, as well as making it simpler to determine when a fault lies at source rather than along the ETL path.
10. Be realistic
Successful ETL is as much about your overall approach as it is about the steps you take along the way. If you know up front that your business requirements will involve sourcing data from fifteen source systems, aggregating it to multiple levels and serving it to multiple departments with very different business priorities, you should take a step back and accept that this will not be accomplished overnight. ETL is neither easy nor magical: it is mentally taxing, often complex and heavily time-consuming work. Adopting a ‘big bang’ approach very rarely succeeds. Instead, take a realistic view and commit to delivering phased releases, each with a healthy level of utility. You will be pleasantly surprised to discover that frequent small, accurate deliverables are much better received by your end users than huge, years-long behemoths riddled with errors. Indeed, it can take just one bad measure viewed by one particularly influential user to send an entire data warehouse effort to its grave. Gradually acquiring a reputation for accuracy and successful delivery will do wonders for you and your ETL team. Fight your corner and justify a phased approach whenever possible.