My ultimate ingestion process for historical data at scale with Kusto
In my last article, we discussed different architecture aspects of large historical data ingestion.
In this article, I want to be more prescriptive and share an approach that works well for me.
Is that the ultimate process? Of course not, that is clickbait. As we discussed at length in the previous article, different scenarios will call for different choices. But it is a process that has worked great for me in many contexts, and it should make the architecture discussion from the previous article more concrete.
It also contains a lot of personal choices. For instance, I find using external tables more intuitive than ingesting blobs directly. Logic Apps, one of my favorite Azure services, also makes an appearance.
The scenario I was working with:
- 18 months worth of Wikipedia data, about 400 GB
- Data is stored in Parquet files in an hourly partitioned folder structure (i.e. yyyy/mm/dd/hh)
- Each blob is about 20 MB (this isn’t optimal, as the recommendation is between 100 MB and 1 GB)
- Data has the same shape as the real-time ingestion and hence requires the same transformation
- The only transformation required is to rename the pesky datetime column to timestamp (datetime is a reserved keyword in Kusto and needs to be escaped, which is annoying)
Here is the approach:
- Create an external table pointing to parquet files
- Create an ingestion table with the same schema as the target table
- Author a Kusto stored function selecting a slice of time from the external table, starting just after the latest time in the ingestion table but never going past the point where the target table’s (real-time) ingestion started
- Author a Logic App that iteratively ingests data using the stored function until there is no more data
- Run the Logic App through the entire historical data
- Validate ingestion
- Move the data from ingestion table to target table (along with the real time ingested data)
For the remainder of this article, we’ll discuss and justify each step.
Create External Table (1)
We are going to base the ingestion on an external table pointing to blobs in the Azure Data Lake as opposed to ingesting from those blobs directly.
As discussed in the previous article, this has the following advantages:
- We can query time windows specifically, so we do not “over ingest” (i.e. ingest data we already ingested in real time)
- We can transform the data as we ingest it
We also find it is more intuitive to deal with an external table than a stack of blobs.
In our case, the real-time ingestion is based on the same blobs. We are using Event Grid to trigger ingestion every time a new blob is added to Azure Data Lake. Since both real-time and historical data are based on the same folders, we need to cut off the historical ingestion where the real-time ingestion started.
As for the transformation, we only had minimal needs. As mentioned, we simply renamed a column.
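As a sketch, the external table definition could look like the following. The table name, storage URL, and non-datetime column names are illustrative assumptions (not from the scenario); only the hourly folder structure and the escaped datetime column come from the article:

```kusto
//  Hypothetical external table over the hourly-partitioned Parquet blobs.
//  The source column is named "datetime", so it must be escaped with brackets.
.create external table WikipediaExternal (
    ['datetime']: datetime,
    PageTitle: string,     //  assumed column
    Views: long            //  assumed column
)
kind = storage
partition by (Hour: datetime = bin(['datetime'], 1h))
pathformat = (datetime_pattern("yyyy/MM/dd/HH", Hour))
dataformat = parquet
(
    h@'https://<storage-account>.blob.core.windows.net/wikipedia;impersonate'
)
```

The hourly partition declaration lets Kusto prune blobs by folder when we filter on time, which is what makes querying a narrow time window cheap.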
Create Ingestion Table (2)
This is optional although quite convenient.
Instead of ingesting the historical data in the same table where we ingest data in real time, we separate it during the ingestion.
This buys us the flexibility of being much easier to delete the historical data and start over again.
If we ingested in the same table, we could run into historical extents being merged with “real time” extents, which would make deleting the historical data much more complicated. Also, if we re-ingested data already ingested in real time by mistake, it would be hard to clean up.
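Creating the ingestion table can be as simple as cloning the target table’s schema; Kusto supports this directly with the based-on clause. The table names here are illustrative assumptions:

```kusto
//  Hypothetical staging table with the same schema as the (assumed) target table.
.create table WikipediaIngestion based-on Wikipedia with (docstring = "Staging table for historical ingestion")
```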
Author Kusto Stored Function (3)
As with many database technologies, we find it easier to manage data logic when it is close to the data. Hence, we use a Kusto stored function here.
The function’s logic returns the “next slice” of data. It looks at how far the ingestion table has progressed and fetches a slice after that.
The advantage of this approach is that if an ingestion fails, the query will automatically start over since the ingestion table won’t have changed.
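A minimal sketch of such a function, assuming the hypothetical table names above, a one-hour slice size, and an arbitrary fallback start date for the very first slice:

```kusto
.create-or-alter function GetNextHistoricalSlice() {
    //  Where real-time ingestion started in the target table: never go past this.
    let cutoff = toscalar(Wikipedia | summarize min(['timestamp']));
    //  Latest point already ingested historically
    //  (fallback:  assumed start of the 18 months of history).
    let lastIngested = coalesce(
        toscalar(WikipediaIngestion | summarize max(['timestamp'])),
        datetime(2021-01-01));
    //  Slice size of 1h is an assumption;  tune to taste.
    let sliceEnd = min_of(lastIngested + 1h, cutoff);
    external_table('WikipediaExternal')
    | where ['datetime'] > lastIngested and ['datetime'] <= sliceEnd
    //  The only transformation:  rename the pesky datetime column.
    | project-rename ['timestamp'] = ['datetime']
}
```

Because the slice boundaries are derived from the ingestion table itself, the function is idempotent from the Logic App’s point of view: a failed ingestion simply means the same slice is returned again on the next call.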
Author Logic App (4)
Logic App gives us the reliability / long running capability here.
Nothing fancy: it is basically a while loop with an ingestion command in the body.
In most cases we’ll need to adjust the defaults of the Logic App until action, which are 60 iterations over 60 minutes. Most large-scale ingestions take longer than that.
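The loop body issues a single ingestion command against the cluster; with the hypothetical names above, it could be as simple as:

```kusto
//  Append the next slice returned by the stored function to the staging table.
//  This is the command the Logic App's until-loop runs on each iteration.
.append WikipediaIngestion <| GetNextHistoricalSlice()
```

The loop’s exit condition can inspect the command’s result (e.g. whether any rows or extents were written) to detect when the function returns no more data.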
Run Logic App (5)
We run the Logic App to completion.
Validate Ingestion (6)
We validate a few things:
- Is the data as expected? A few samples can be useful here
- Is the data complete?
- Does the data set in the ingestion table dovetail with the one in the target table?
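Sketches of such validation queries, again using the hypothetical names introduced above:

```kusto
//  Spot-check a few rows.
WikipediaIngestion | take 10

//  Completeness:  row count per hour, looking for gaps or anomalies.
WikipediaIngestion
| summarize RowCount = count() by Hour = bin(['timestamp'], 1h)
| order by Hour asc

//  Dovetail check:  the ingestion table should end right where the target begins.
print HistoricalEnd = toscalar(WikipediaIngestion | summarize max(['timestamp'])),
      RealTimeStart = toscalar(Wikipedia | summarize min(['timestamp']))
```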
Move data (7)
Finally, we can move the data from the ingestion table to the target table using the .move extents command.
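With the hypothetical table names above, the command would be:

```kusto
//  Move all extents from the staging table into the target table.
//  This is a metadata-only operation:  no data is re-ingested.
.move extents all from table WikipediaIngestion to table Wikipedia
```

Because extents are moved rather than re-ingested, this final step is fast regardless of data volume.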
This ingestion method isn’t the only one possible, but it works rather well.
For simpler scenarios we could skip a few steps. For instance, we could use the LightIngest tool if we didn’t need to cut ingestion at an arbitrary time.
I hope this gave a good idea of the different moving parts and what value they bring to the equation.