Azure Databricks Overview workshop
I haven’t written about Databricks for quite a while (since April 2018, actually), but I recently had the pleasure of leading an Azure Databricks workshop with a local customer.
For that workshop I prepared several demos, now available on GitHub.
They cover a number of angles, so I thought it would be interesting to share the notebooks publicly.
Here is a list of the Notebooks:
| Notebook | Description |
|---|---|
| 01-config-mount.py | We mount two Azure Storage (Data Lake Store) containers to two mount points in DBFS. The paths to the Data Lake stores are stored in environment variables in Azure Databricks. |
| 02-spark.py | Here we quickly show what Spark is about: it’s Python (in the case of PySpark, anyway) and it deals (mostly) with DataFrames. |
| 03-transform.py | We perform a typical batch transformation: we read a partition of files (a month’s worth), aggregate the data, and write the result to another folder. |
| 04-delta.py | This notebook is all about Delta Lake, in both PySpark and Spark SQL: Delta table creation, updates, time travel, and history description. |
| 05-streaming.py | Finally, some Structured Streaming. |
To me, Azure Databricks’ strengths are data transformation at scale and machine learning training at scale (for parallelizable ML algorithms). These notebooks cover the data transformation aspect.
I hope they are useful as an illustration of a few common Spark techniques.