Azure Databricks - Getting Started

Apache Spark is rising in popularity as a Big Data platform.  Its rise has happened on a remarkably short timeline for such an impactful technology.

Think about it:

In 2013, the creators of Spark founded Databricks.  Databricks has developed, among other things, a cluster management system for Spark as well as a web-based notebook interface.

Microsoft & Databricks collaborated to create Azure Databricks:

Designed in collaboration with Microsoft and the creators of Apache Spark, Azure Databricks combines the best of Databricks and Azure to help customers accelerate innovation with one-click set up, streamlined workflows and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

One of the main reasons for Spark's popularity is its speed:  it can be many times faster than Hadoop MapReduce.  Databricks brings that speed to the fingertips of the data scientist through web notebooks, enabling interactive data science.  Azure Databricks now combines that speed with the power and flexibility of the cloud.  We can start a cluster in minutes and scale it up or down on demand.

What is it?

Azure Databricks is a managed Spark cluster service.  Cluster size can either be fixed or autoscaled.  Interaction with the cluster happens through web notebooks or REST APIs.
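To give a flavour of fixed versus autoscaled sizing, here is a sketch of two cluster definitions in the shape accepted by the Databricks Clusters REST API (version 2.0).  All names, node types and sizes below are illustrative, not prescriptive:

```python
# Two cluster definitions for the Databricks "clusters/create" endpoint
# (API 2.0).  Cluster names and node types are made up for illustration.

fixed_size_cluster = {
    "cluster_name": "demo-fixed",
    "spark_version": "<runtime-version>",  # a Databricks runtime version string
    "node_type_id": "Standard_DS3_v2",     # an Azure VM size
    "num_workers": 4,                      # fixed size: exactly 4 workers
}

autoscaling_cluster = {
    "cluster_name": "demo-autoscale",
    "spark_version": "<runtime-version>",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {                         # autoscaled: Databricks adds and
        "min_workers": 2,                  # removes workers between these
        "max_workers": 8,                  # bounds based on load
    },
}
```

Note that a cluster specifies either `num_workers` or `autoscale`, not both.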

The service integrates with several Azure data services, such as Blob Storage, SQL Data Warehouse, Power BI and Data Lake Store.  It also integrates with the Hadoop / HDInsight ecosystem, e.g. Kafka, Hive and HDFS.
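As one illustration of the Blob Storage integration, a notebook can address a blob directly through a `wasbs://` URL once the storage account key is configured.  This is a sketch we add here; the account, container, file and secret names are all made up:

```python
def wasbs_path(account: str, container: str, blob: str) -> str:
    """Build the wasbs:// URL Spark uses to address Azure Blob Storage."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{blob}"

# In a Databricks notebook, `spark` (the SparkSession) and `dbutils` are
# predefined, so a cell reading a CSV blob would look something like:
#
#   spark.conf.set(
#       "fs.azure.account.key.myaccount.blob.core.windows.net",
#       dbutils.secrets.get(scope="demo", key="storage-key"))
#   df = spark.read.csv(wasbs_path("myaccount", "data", "sales.csv"),
#                       header=True)
```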

Getting started

An obvious starting point is the Azure online documentation.  At the time of this writing (mid December 2017), that documentation has only a few pages, but it links to separate online documentation on Azure Databricks published by Databricks.

We would recommend skimming through those to get familiar with the service and then provisioning a first cluster.  It is quite easy to iterate from there:  experiment in the web environment (notebooks) and read some more documentation.

Spark itself

For those who aren’t familiar with Apache Spark, there is plenty of documentation online.  Apache Spark itself has comprehensive documentation.

One of the main differences between Hadoop & Spark is that Hadoop uses storage as shared state (HDFS) while Spark keeps shared state in memory, in the form of Resilient Distributed Datasets (RDDs).  RDDs can be manipulated in Scala, Java or Python.

RDD programming is what Spark Core is about.  Higher-level APIs, i.e. Spark SQL, GraphX, Spark Streaming & MLlib, are all built on top of RDDs.
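To make RDD programming concrete, here is a minimal word count, the usual "hello world" of Spark Core.  This is a sketch we add for illustration; in a Databricks notebook the SparkContext `sc` is predefined, so only the function body matters there:

```python
def word_counts(sc, lines):
    """Count word occurrences in a list of text lines using RDD operations."""
    return (sc.parallelize(lines)               # distribute the data as an RDD
              .flatMap(lambda l: l.split())     # one RDD element per word
              .map(lambda w: (w.lower(), 1))    # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b)  # sum the counts per word
              .collect())                       # bring results back to the driver

# Running outside Databricks (requires `pip install pyspark`):
#
#   from pyspark import SparkContext
#   sc = SparkContext("local[2]", "word-count")
#   print(dict(word_counts(sc, ["to be or not to be"])))
#   sc.stop()
```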

There are tutorials online;  for instance, Tutorials Point has a comprehensive one.  We found a good introduction was the book Taming Big Data with Apache Spark and Python by Frank Kane.  It also exists as a video course, which can be consumed within 6 to 8 hours.  The author actually recommends other books for going deeper.

Service interaction

At the time of this writing, mid December 2017, the service feels like a shell around Databricks.  An instance of the service can have multiple clusters attached to it, and those clusters are created in the Databricks portal.

In time, the Azure Portal and the corresponding REST API, PowerShell cmdlets and CLI commands will likely expose more functionality, but for now we must interact directly with the Databricks REST API.
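For instance, listing the clusters of a workspace goes straight to the Databricks REST API (version 2.0) with a personal access token generated from the Databricks portal.  Here is a minimal sketch using only the Python standard library; the region, token and endpoint values are illustrative:

```python
import json
import urllib.request

def databricks_request(region, token, endpoint):
    """Build an authenticated request for the Databricks REST API (2.0).

    `region` is the Azure region of the workspace (e.g. "eastus2") and
    `token` a personal access token from the Databricks portal.
    """
    url = f"https://{region}.azuredatabricks.net/api/2.0/{endpoint}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})

# Example (not executed here):  list the clusters of a workspace.
#
#   req = databricks_request("eastus2", my_token, "clusters/list")
#   with urllib.request.urlopen(req) as resp:
#       print(json.dumps(json.load(resp), indent=2))
```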


It is an exciting time to do Data Science on Azure!

Two years ago we wrote a series of articles about Azure Data Lake Analytics.  We might port a couple of those examples to Azure Databricks in the following months.

One response

  1. Gerhard Brueckl 2018-11-30 at 01:01

    As there are no official PowerShell cmdlets for the Databricks REST API (and probably will never be according to my information), I created a module and uploaded it to the PowerShell gallery

    The cmdlets work with Databricks on Azure and also Databricks on AWS

    I also blogged about it here:
