Spark SQL Catalyst

Quick one this week just to mention a great talk I watched!

A Deep Dive into Spark SQL’s Catalyst Optimizer with Cheng Lian & Maryann Xue from Databricks.

The talk is part of the Carnegie Mellon University Quarantine 2020 Database Talks. These are organized by Andy Pavlo, the same Andy who gives the Advanced Database Systems lectures.

The talk dives into how Spark SQL takes a query and passes it through its Catalyst Optimizer. It is interesting because Spark isn’t a database per se, so it needs to be quite creative to use traditional cost-based optimization techniques. Indeed, those techniques rely on statistics about the data, but since Spark doesn’t own the data, the data could change without its knowledge (this is partially addressed by Delta Lake). It’s also interesting to see how data movement across nodes is addressed in Apache Spark.
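
To make the stale-statistics point concrete, here is a toy sketch in plain Python. It is not Catalyst's code and not from the talk: the `TableStats` class, the `choose_join_strategy` function and the table sizes are all made up for illustration. The only real Spark detail is the 10 MB figure, which is the documented default of `spark.sql.autoBroadcastJoinThreshold`. The idea is that a planner picks between shipping a small table to every node (broadcast) and repartitioning both sides across the cluster (shuffle) based on the size it believes a table has.

```python
from dataclasses import dataclass

# Documented default of spark.sql.autoBroadcastJoinThreshold (10 MB).
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass
class TableStats:
    """Hypothetical catalog entry: what the optimizer believes about a table."""
    name: str
    size_in_bytes: int


def choose_join_strategy(left: TableStats, right: TableStats,
                         threshold: int = BROADCAST_THRESHOLD_BYTES) -> str:
    """Toy planner rule (an assumption, not Catalyst's real logic):
    broadcast the smaller side if it fits under the threshold,
    otherwise fall back to a shuffle join."""
    smaller = min(left, right, key=lambda t: t.size_in_bytes)
    if smaller.size_in_bytes <= threshold:
        return f"broadcast({smaller.name})"
    return "shuffle"


# Statistics as recorded when the tables were last analyzed (made-up sizes).
orders = TableStats("orders", 50 * 1024**3)       # 50 GB
countries = TableStats("countries", 200 * 1024)   # 200 KB
plan_from_stats = choose_join_strategy(orders, countries)

# The underlying files were rewritten by another job; the recorded
# statistics no longer match what is actually on disk.
countries_actual = TableStats("countries", 5 * 1024**3)  # 5 GB
plan_from_reality = choose_join_strategy(orders, countries_actual)

print(plan_from_stats)
print(plan_from_reality)
```

With the recorded statistics the toy planner broadcasts `countries`, while the real size would call for a shuffle. That gap between what the optimizer believes and what is on disk is the problem an engine that doesn’t own its data has to work around.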

