Apache Spark is rising in popularity as a Big Data platform. Its timeline is remarkably short for such an impactful technology.
Think about it:
- 2009, started as a research project at UC Berkeley
- 2010, open sourced
- 2013, donated to the Apache Software Foundation
- 2014, became a Top-Level Apache Project
In 2013, the creators of Spark founded Databricks. Databricks has developed, among other things, a cluster management system for Spark as well as a notebook web interface.
Microsoft & Databricks collaborated to create Azure Databricks:
Designed in collaboration with Microsoft and the creators of Apache Spark, Azure Databricks combines the best of Databricks and Azure to help customers accelerate innovation with one-click set up, streamlined workflows and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
One of the main reasons for Spark's popularity is its speed: it can be many times faster than Hadoop MapReduce. Databricks brings that speed to the fingertips of data scientists with web notebooks, enabling interactive data science. Azure Databricks now couples that speed with the power and flexibility of the cloud: we can start a cluster in minutes and scale it up or down on demand.
What is it?
Azure Databricks is a managed Spark Cluster service. Cluster size can either be fixed or auto scaled. Interaction with the cluster can be done through web notebooks or REST APIs.
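As a sketch of what REST interaction looks like, the snippet below builds (but does not send) a request against the Databricks Clusters API using only Python's standard library. The cluster-listing endpoint is part of the documented REST API, but the workspace URL and token here are placeholders, not real credentials:

```python
import json
import urllib.request

def build_list_clusters_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request to the Databricks Clusters API.

    Databricks authenticates REST calls with a bearer token
    (a personal access token generated in the workspace).
    """
    return urllib.request.Request(
        url=f"{workspace_url}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
    )

# Hypothetical workspace URL and personal access token:
request = build_list_clusters_request(
    "https://example.azuredatabricks.net", "dapi-XXXX")
print(request.full_url)

# With real credentials, sending the request would look like:
#   with urllib.request.urlopen(request) as response:
#       clusters = json.load(response).get("clusters", [])
```

The same pattern applies to the other endpoints (creating, starting, and terminating clusters, submitting jobs): one HTTPS call with a bearer token.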
The service integrates with several Azure data services, such as Blob Storage, SQL Data Warehouse, Power BI, and Data Lake Store. It also integrates with the Hadoop ecosystem / HDInsight, e.g. Kafka, Hive, and HDFS.
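To illustrate the Blob Storage integration, Spark's Hadoop connector addresses blobs through `wasbs://` URIs. The helper below builds such a URI; the storage account, container, and file names are hypothetical, and the notebook-side read is shown as comments since it requires a live cluster:

```python
def wasbs_path(account: str, container: str, path: str) -> str:
    """Build a wasbs:// URI, the scheme Spark's Hadoop connector
    uses to address Azure Blob Storage."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path}"

# Hypothetical storage account and container:
uri = wasbs_path("mystorageacct", "raw-data", "sales/2017/december.csv")
print(uri)
# wasbs://raw-data@mystorageacct.blob.core.windows.net/sales/2017/december.csv

# Inside a Databricks notebook, we would then read it with something like:
#   spark.conf.set(
#       "fs.azure.account.key.mystorageacct.blob.core.windows.net", "<key>")
#   df = spark.read.csv(uri, header=True)
```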
An obvious starting point is the Azure online documentation. At the time of this writing (mid-December 2017), that documentation has only a few pages, but it links to the Azure Databricks documentation published by Databricks itself.
We would recommend skimming through those to get familiar with the service and provisioning a first cluster. It is then quite easy to iterate from there: experiment in the web environment (notebooks) and read some more documentation.
For those who aren't familiar with Apache Spark, there is plenty of documentation online; Apache Spark's official documentation is comprehensive.
One of the main differences between Hadoop and Spark is that Hadoop uses storage (HDFS) as its shared state, while Spark keeps shared state in memory in the form of Resilient Distributed Datasets (RDDs). RDDs can be manipulated in Scala, Java, or Python.
RDD programming is what Spark Core is about. The higher-level APIs, i.e. Spark SQL, GraphX, Spark Streaming, and MLlib, are all built on top of RDDs.
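To give a flavour of the RDD programming model, here is the classic word count. The commented lines follow the standard PySpark RDD API; the plain-Python lines beneath each mirror what the cluster would compute, so the sketch runs without Spark:

```python
from functools import reduce
from itertools import groupby

lines = ["spark keeps state in memory", "hadoop keeps state in hdfs"]

# On a cluster, the PySpark version would be:
#   rdd = sc.parallelize(lines)
#   counts = (rdd.flatMap(lambda line: line.split())
#                .map(lambda word: (word, 1))
#                .reduceByKey(lambda a, b: a + b))
words = [word for line in lines for word in line.split()]       # flatMap
pairs = [(word, 1) for word in words]                           # map
counts = {                                                      # reduceByKey
    key: reduce(lambda a, b: a + b, (n for _, n in group))
    for key, group in groupby(sorted(pairs), key=lambda p: p[0])
}
print(counts["keeps"])  # 2
```

In Spark, the transformations (`flatMap`, `map`, `reduceByKey`) are lazy; nothing executes until an action such as `collect()` pulls results back to the driver.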
There are tutorials online; for instance, Tutorials Point has a comprehensive one. We found the book Taming Big Data with Apache Spark and Python by Frank Kane to be a good introduction. It also exists as a video course that can be consumed in six to eight hours. That author recommends other books to go deeper:
- Learning Spark, Lightning-Fast Big Data Analysis by Matei Zaharia, Holden Karau, Andy Konwinski, Patrick Wendell
- Advanced Analytics with Spark, Patterns for Learning from Data at Scale by Sandy Ryza, Uri Laserson, Josh Wills, Sean Owen
- Data Algorithms, Recipes for Scaling Up with Hadoop and Spark by Mahmoud Parsian
At the time of this writing, mid-December 2017, the service feels like a thin shell around Databricks. An instance of the service can have multiple clusters attached to it, and those clusters are created in the Databricks portal.
In time, the Azure Portal and the corresponding REST API, PowerShell cmdlets, and CLI commands will likely expose more functionality, but for now we must interact directly with the Databricks REST API.
It is an exciting time to do Data Science on Azure!
Two years ago we wrote a series of articles about Azure Data Lake Analytics. We might port a couple of those examples to Azure Databricks in the coming months.