Azure Databricks – Parsing escaping CSV files in Spark

In previous weeks, we’ve looked at Azure Databricks, Azure’s managed Spark cluster service. We then looked at Resilient Distributed Datasets (RDDs) & Spark SQL / Data Frames. We also looked at an example of a more tedious transformation prior to querying, using the H-1B Visa Petitions 2011-2016 data set (from Kaggle). Here, we’re going to look …

Azure Databricks – Transforming Data Frames in Spark

In previous weeks, we’ve looked at Azure Databricks, Azure’s managed Spark cluster service. We then looked at Resilient Distributed Datasets (RDDs) & Spark SQL / Data Frames. We wanted to look further at Data Frames with a bigger data set, and more precisely at some transformation techniques. We often say that most of the leg work …

Azure Databricks – Spark SQL – Data Frames

We looked at Azure Databricks a few weeks ago. Azure Databricks is a managed Apache Spark Cluster service. More recently, we looked at how to analyze a data set using a Resilient Distributed Dataset (RDD). We used the Social characteristics of the Marvel Universe public dataset, replicating some experiments we did 2 years ago with Azure …

Azure Databricks – RDD – Resilient Distributed Dataset

We looked at Azure Databricks a few weeks ago. Azure Databricks is a managed Apache Spark Cluster service. In this article, we are going to look at & use a fundamental building block of Apache Spark: the Resilient Distributed Dataset, or RDD. We are going to use the Python SDK. It is important to note that …