Checkpoints in ETL

UPDATE (19-01-2016):  Have a look at the Azure Data Lake series for more posts on Azure Data Lake.

Extract, Transform & Load (ETL) is so 2000s.  But then again, it is more popular than ever!

On one hand, ETL comes from an era where you didn’t care too much about real-time insights, or couldn’t afford them because of technical limitations.  Today, we prefer integrating data in near real-time, with technologies such as APIs, webhooks, messaging & many others.

On the other hand, with the advent of big data and other modern challenges (e.g. global markets), we often can’t integrate all the data in real time.  Also, let’s face it, real time often isn’t worth the trouble, and ETL doesn’t have to run only once a day either, which allows for shades of grey.

Since ETL are here to stay, it’s a good idea to be familiar with their proven patterns.  One pattern I want to introduce here today is the checkpoint pattern.

Why Checkpoints?

Mathematically, you could model an ETL as y = f(x), where x is your input, the result of your extract, f is the transform and y is the result of the transform on the input (the load part of ETL).

In practice though, the transformation happens in multiple steps and many things can go wrong.  Since we are only human, it makes sense to give ourselves a chance to recover from a failure without restarting the entire transformation, and to inspect intermediate results when something looks off.

Checkpoints help us achieve that.  Checkpoints are basically points in your transformation where you persist the current state of the data within the transformation.
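In the y = f(x) notation above, checkpoints amount to splitting the transform into a composition of smaller steps and persisting each intermediate result.  A sketch, assuming n steps:

```latex
f = f_n \circ \dots \circ f_1, \qquad
c_0 = x, \quad c_i = f_i(c_{i-1}) \ \text{for}\ i = 1, \dots, n, \quad y = c_n
```

Each c_i is a checkpoint:  if step f_{i+1} fails or needs to change, you restart from c_i instead of from x.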


Here I’m going to give an example.  Take it with a pinch of salt.  Like every example, it oversimplifies parts of the problem and overengineers others.

In my example, we’re going to take tweets (how can we do a data example without those?) from certain users and perform some analysis on them before feeding the result to a Machine Learning engine.

The input comes in CSV files:

accountname, tweet
@john, “Leaving for work now”
@julie, “12km in one hour this morning!”
@zul, “The trees, the leaves & the wind, no more”

The first thing we’re going to do is load it as is into a database table:

LoadID Line Account Tweet
123 1 @john Leaving for work now
123 2 @julie 12km in one hour this morning!
123 3 @zul The trees, the leaves & the wind, no more

With a load table containing the following entry:

LoadID FileName LoadStart LoadCompletion
123 wasb://tweets@raw/2016/1/1/asduy123478d.csv XYZ XYZ-2

That’s our first checkpoint.  We didn’t transform anything yet besides the format.  We do a few things here:  we assign a load ID to the file, record its name and the load start/completion times in the load table, and persist every line untouched so that each row can be traced back to its source file and line.
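Concretely, this load-as-is step could look like the following sketch.  The in-memory lists stand in for the two tables above; the function name and structures are illustrative, not a real schema:

```python
import csv
import io

# In-memory stand-ins for the two tables at the first checkpoint.
raw_rows = []   # (LoadID, Line, Account, Tweet)
loads = []      # (LoadID, FileName, LoadStart, LoadCompletion)

def load_as_is(load_id, file_name, csv_text):
    """Persist the file content untouched, with lineage back to the file."""
    loads.append([load_id, file_name, "XYZ", None])
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader)  # skip the header line
    for line_no, (account, tweet) in enumerate(reader, start=1):
        raw_rows.append((load_id, line_no, account, tweet))
    # Stamp completion only once every line is persisted.
    loads[-1][3] = "XYZ-2"

load_as_is(
    123,
    "wasb://tweets@raw/2016/1/1/asduy123478d.csv",
    'accountname, tweet\n'
    '@john, "Leaving for work now"\n'
    '@julie, "12km in one hour this morning!"\n'
    '@zul, "The trees, the leaves & the wind, no more"\n',
)
```

Note that the load-table entry is only marked complete at the very end, so a half-finished load can be detected and re-run.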

Next we’re going to structure the data.  First is the account table:

AccountID Account CreatedFromLoadID
54 john 123
55 julie 123
56 zul 123

Then the tweet table:

TweetID CreatedFromLoadID CreatedFromLine
54 123 1
55 123 2
56 123 3

Then the tweetContent table:

TweetContentID TweetID Word
346 54 Leaving
347 54 for
348 54 work
349 54 now
350 55 12km
351 55 in
352 55 one
353 55 hour
354 55 this
355 55 morning!
356 56 The
357 56 trees
358 56 ,
359 56 the
360 56 leaves
361 56 &
362 56 the
363 56 wind
364 56 ,
365 56 no
366 56 more
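A sketch of this structuring step, building the three tables from the raw rows while keeping the lineage columns.  The starting IDs and in-memory structures mirror the tables above but are purely illustrative:

```python
# Raw rows from the first checkpoint: (LoadID, Line, Account, Tweet).
raw_rows = [
    (123, 1, "@john", "Leaving for work now"),
    (123, 2, "@julie", "12km in one hour this morning!"),
]

accounts = {}        # account name -> (AccountID, CreatedFromLoadID)
tweets = []          # (TweetID, CreatedFromLoadID, CreatedFromLine)
tweet_content = []   # (TweetContentID, TweetID, Word)

next_account_id, next_tweet_id, next_content_id = 54, 54, 346

for load_id, line_no, account, text in raw_rows:
    name = account.lstrip("@")
    if name not in accounts:
        accounts[name] = (next_account_id, load_id)
        next_account_id += 1
    tweets.append((next_tweet_id, load_id, line_no))
    # Naive word itemization; punctuation handling could be refined
    # later and this step re-run, thanks to the checkpoint.
    for word in text.split():
        tweet_content.append((next_content_id, next_tweet_id, word))
        next_content_id += 1
    next_tweet_id += 1
```

Because every row carries CreatedFromLoadID (and, for tweets, CreatedFromLine), this whole step can be re-derived from the raw rows at any time.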

So, again, at this second checkpoint, we persist everything and we trace it back to the original load.  We could therefore re-run the second transformation, for instance, if we choose to remove the punctuation from the word itemization.

We could have a third checkpoint where we count the words but I suppose you get the point by now.
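That third checkpoint might be as simple as an aggregation over the tweetContent table.  A minimal sketch, assuming case-insensitive counting (the sample rows below are taken from the tables above):

```python
from collections import Counter

# A few rows from the tweetContent table: (TweetContentID, TweetID, Word).
tweet_content = [
    (346, 54, "Leaving"), (347, 54, "for"), (348, 54, "work"), (349, 54, "now"),
    (356, 56, "The"), (359, 56, "the"), (362, 56, "the"),
]

# Third checkpoint: word counts, which would again be persisted
# with a reference back to the load that produced them.
word_counts = Counter(word.lower() for _, _, word in tweet_content)
```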


So what are the characteristics of the checkpoint pattern?

This ties in nicely with the concept of Data Lake.  A Data Lake contains the raw data (in my example the CSV files) but also the refined data.

In ETL you would often call those checkpoints staging tables and would likely dispose of them early.  In a data lake you might keep them around longer.

Also, here we saw one transformation path, but there could be a whole tree of them.  You could have one transformation distilling the content of tweets and another transformation loading geographic data and cross-referencing it with the tweets.

This is why you want to load all your data first and drop data only at later stages.  This way you can always start a new transformation from the original data.  That’s the value of a data lake:  you always keep the maximum information, even if it is hard to use in its original format.


This was one pattern for ETL.  There are many others, but I have observed that experts in ETL (by no means am I one) always use this one, for operational reasons.

The example I gave was fairly trivial and you might not want to checkpoint on so many micro-steps, but I tried to illustrate the pattern and the value it brings.  It is for you to find where the sweet spot is for your solution.

In Azure you would use Azure Data Factory to drive the transformations and you might want to store both your files and structured data in Azure Data Lake Storage (ADLS).  By keeping your data around you could always change your transformation algorithm and recreate the target data at scale.
