UPDATE (19-01-2016): Have a look at Azure Data Lake series for more posts on Azure Data Lake.
Extract, Transform & Load (ETL) is so 2000's. But then again, it is more popular than ever!
On one hand, ETL comes from an era where you didn't care too much about real-time insights, or couldn't afford them because of technical limitations. Today, we prefer integrating data in near real-time, with technologies such as APIs, web hooks, messaging & many others.
On the other hand, with the advent of big data and other modern challenges (e.g. global markets), we often can't integrate all the data in real time. Also, let's face it, real time often isn't worth the trouble, and ETL doesn't have to run only once a day either, which enables shades of grey.
Since ETL is here to stay, it's a good idea to be familiar with its proven patterns. One pattern I want to introduce here today is the checkpoint pattern.
Mathematically, you could model an ETL as y = f(x), where x is your input (the result of your extract), f is the transform, and y is the result of applying the transform to the input (the load part of ETL).
In practice though, the transformation happens in multiple steps and many things could go wrong. Since we are only human, it makes sense to give us a chance to:
- Trace the transformation steps
- Re-run parts of a transformation
- Roll back parts of a transformation
Checkpoints help us achieve that. A checkpoint is simply a point in your transformation where you persist the current state of the data.
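The idea can be sketched in a few lines of Python. This is a minimal illustration, not a real framework: each step persists its output to a file before the next step runs, so any intermediate state can be inspected or re-used later (the function names and the JSON-on-disk format are my own choices for the sketch).

```python
import json
from pathlib import Path

def run_step(name, step_fn, data, checkpoint_dir="checkpoints"):
    """Run one transformation step and persist its output as a checkpoint."""
    result = step_fn(data)
    path = Path(checkpoint_dir) / f"{name}.json"
    path.parent.mkdir(exist_ok=True)  # create the checkpoint folder on first use
    path.write_text(json.dumps(result))
    return result

# Two chained steps, each leaving a checkpoint behind.
raw = ["  Hello ", "World  "]
trimmed = run_step("trimmed", lambda rows: [r.strip() for r in rows], raw)
upper = run_step("upper", lambda rows: [r.upper() for r in rows], trimmed)
```

If the second step turns out to be wrong, you can fix it and re-run it from `checkpoints/trimmed.json` without touching the first step.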
Here I'm going to give an example. Take it with a pinch of salt: like every example, it oversimplifies parts of the problem and over-engineers others.
In my example, we're going to take tweets (how could we do a data example without those?) from certain users and perform some analysis on them before feeding the result to a Machine Learning engine.
The input comes in CSV files:
accountname, tweet
@john, "Leaving for work now"
@julie, "12km in one hour this morning!"
@zul, "The trees, the leaves & the wind, no more"
The first thing we're going to do is load it as-is into a database table:
| Load ID | Row | Account | Tweet |
| --- | --- | --- | --- |
| 123 | 1 | @john | Leaving for work now |
| 123 | 2 | @julie | 12km in one hour this morning! |
| 123 | 3 | @zul | The trees, the leaves & the wind, no more |
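A sketch of that first load step, under my own assumptions (the function name, the column names and the load-ID value 123 are illustrative): parse the raw CSV and tag every row with the load-ID and a row number.

```python
import csv
import io

def load_csv(csv_text, load_id):
    """Parse the raw CSV and tag each row with a load-ID and a row number."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader)  # skip the header line: accountname, tweet
    return [
        {"load_id": load_id, "row": i, "account": account, "tweet": tweet}
        for i, (account, tweet) in enumerate(reader, start=1)
    ]

raw = (
    "accountname, tweet\n"
    '@john, "Leaving for work now"\n'
    '@julie, "12km in one hour this morning!"\n'
    '@zul, "The trees, the leaves & the wind, no more"\n'
)
rows = load_csv(raw, load_id=123)
```

Note that the quoting matters: the third tweet contains commas, and the CSV parser keeps it as one field.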
With a load table containing the following entry:
That's our first checkpoint. We haven't transformed anything yet besides the format. So we do a few things here:
- We put the CSV in a table
- It’s a tiny transformation by itself and some parsing errors could occur and be traced
- It gives us access to the raw data in table format
- We have traceability of what took place with the load table
- We aggregate all the entries of the same file with a load-ID
- We track when the load happened & how much time it took
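The load table itself can be as simple as one record per file. Here is a hypothetical shape for such an entry (field names and the sample filename are my own; the original post shows its load table as an image):

```python
from datetime import datetime, timezone

def record_load(load_id, filename, row_count, started_at, finished_at):
    """Build one load-table entry: what was loaded, when, and how long it took."""
    return {
        "load_id": load_id,
        "filename": filename,
        "row_count": row_count,
        "started_at": started_at.isoformat(),
        "duration_seconds": (finished_at - started_at).total_seconds(),
    }

start = datetime(2016, 1, 19, 8, 0, 0, tzinfo=timezone.utc)
end = datetime(2016, 1, 19, 8, 0, 12, tzinfo=timezone.utc)
entry = record_load(123, "tweets.csv", 3, start, end)
```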
Next we're going to structure the data. First, the account table:
Then the tweet table:
Then the tweetContent table:
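A sketch of that second transformation, assuming a simple layout for the three tables (the original post shows them as images, so the column names here are illustrative): one row per distinct account, one row per tweet linking back to its account and load, and one row per word in the tweetContent table.

```python
def structure(rows):
    """Split checkpointed raw rows into account, tweet and tweetContent tables.

    Table and column names are illustrative, not the post's exact schema.
    """
    accounts, tweets, tweet_content = {}, [], []
    for row in rows:
        # account table: one entry per distinct account name
        account_id = accounts.setdefault(row["account"], len(accounts) + 1)
        tweet_id = len(tweets) + 1
        # tweet table: links back to the account and the original load
        tweets.append({"tweet_id": tweet_id, "account_id": account_id,
                       "load_id": row["load_id"]})
        # tweetContent table: one entry per word (the word itemization)
        for position, word in enumerate(row["tweet"].split(), start=1):
            tweet_content.append({"tweet_id": tweet_id, "position": position,
                                  "word": word})
    return accounts, tweets, tweet_content

rows = [
    {"load_id": 123, "row": 1, "account": "@john",
     "tweet": "Leaving for work now"},
    {"load_id": 123, "row": 2, "account": "@julie",
     "tweet": "12km in one hour this morning!"},
]
accounts, tweets, content = structure(rows)
```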
So, again, at this second checkpoint, we persist everything and trace it back to the original load. We could therefore re-run the second transformation, for instance if we chose to remove the punctuation during word itemization.
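That punctuation re-run is exactly the kind of change checkpoints make cheap: fix the itemization step and re-run it from the first checkpoint, without re-loading the file. A toy version of the two variants (the function name is mine):

```python
import string

def itemize(tweet, strip_punctuation=False):
    """Split a tweet into words; optionally strip punctuation (the re-run variant)."""
    words = tweet.split()
    if strip_punctuation:
        # strip leading/trailing punctuation characters from each word
        words = [w.strip(string.punctuation) for w in words]
    return words

first_run = itemize("12km in one hour this morning!")
second_run = itemize("12km in one hour this morning!", strip_punctuation=True)
```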
We could have a third checkpoint where we count the words but I suppose you get the point by now.
So what are the characteristics of the checkpoint pattern?
- For each step of a transformation you persist your state
- Each step traces back to the previous step
- You can therefore rollback or re-run a step
- You can therefore inspect what happened and potentially understand why the final results contain unexpected values
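The rollback characteristic can be sketched too. Assuming checkpoints are kept in step order (the function and step names below are hypothetical), rolling back a failed step means discarding its checkpoint and everything downstream, while earlier checkpoints survive so the pipeline can restart from there:

```python
def rollback(checkpoints, step_order, failed_step):
    """Discard the failed step's checkpoint and everything downstream of it,
    keeping earlier checkpoints so the pipeline can be re-run from there."""
    cutoff = step_order.index(failed_step)
    return {name: data for name, data in checkpoints.items()
            if step_order.index(name) < cutoff}

checkpoints = {
    "raw_load": ["@john,Leaving for work now"],
    "structured": [{"tweet_id": 1, "account_id": 1}],
    "word_counts": [{"word": "work", "count": 1}],
}
order = ["raw_load", "structured", "word_counts"]
remaining = rollback(checkpoints, order, "structured")
```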
This ties in nicely with the concept of Data Lake. A Data Lake contains the raw data (in my example the CSV files) but also the refined data.
In ETL you would often call those checkpoints staging tables and would likely dispose of them early. In a data lake you might keep them around longer.
Also, here we saw a single transformation path, but there could be a whole tree of them. You could have one transformation distilling the content of tweets and another loading geographic data and cross-referencing it with the tweets.
This is why you want to load all your data first and drop data only at later stages. That way you can always start a new transformation from the original data. That's the value of a data lake: you always keep the maximum information, even if it is hard to use in its original format.
This was one pattern for ETL. There are many others, but I have observed that ETL experts (by no means am I one) always use this one for operational reasons.
The example I gave was fairly trivial, and you might not want to checkpoint so many micro-steps, but I tried to illustrate the pattern and the value it brings. It is up to you to find where the sweet spot for your solution is.
In Azure you would use Azure Data Factory to drive the transformations and you might want to store both your files and structured data in Azure Data Lake Storage (ADLS). By keeping your data around you could always change your transformation algorithm and recreate the target data at scale.