Azure Databricks Overview workshop

I haven’t written about Databricks for quite a while (since April 2018, actually), but I recently had the pleasure of leading an Azure Databricks workshop with a local customer.

For that I prepared quite a few demos, now available on GitHub.

The workshop covered several angles, so I thought it would be interesting to share the notebooks publicly.

Here is what the notebooks cover:

- We mount two Azure Storage (Data Lake Store) containers to two mount points in DBFS. The paths to the Data Lake stores are kept in environment variables in Azure Databricks.
- We quickly show what Spark is about: it’s Python (in the case of PySpark, anyway) and it deals (mostly) with DataFrames.
- We perform a typical batch transformation: we read a partition of files (a month’s worth), aggregate the data, and write it to another folder.
- This notebook is all about Delta Lake, in both PySpark and Spark SQL: Delta table creation, update, time travel, and history.
- Finally, some Structured Streaming.

To me, Azure Databricks’ strengths are data transformation at scale and machine-learning training at scale (for parallelizable ML algorithms). These notebooks cover the data-transformation aspect.

I hope this is useful as a tour of a few Spark techniques.
