How does Azure Data Warehouse scale?Solution ·
I’ve been diving in the fantastical world of Azure Data Warehouse (ADW) in the last couple of days.
I’ve been reading through all the documentation on Azure.com. If you are serious about mastering that service I advise you do the same: it is a worthy read.
In this article, I wanted to summarize a few concepts that are somehow interconnected: MPP, distribution & partition. Those concepts all define how your data is spread out and processed in parallel.
Let’s get started!
Massively Parallel Processing (MPP)
Let’s start with the general architecture of ADW.
Conceptually, you have one Control Node the clients interact with and it, in turns, interacts with a multitude of Compute Nodes.
The data is stored in Premium Blob storage and is therefore decoupled from the compute nodes. This is why you can scale out, scale in or even pause your ADW quickly without losing data.
The control node takes a query in input, do some analysis on it before delegating the actual compute to the control nodes. The control nodes perform their sub queries and return results to the control node. The control takes the results back, assemble it and return it to the client.
You can tune the number of compute nodes indirectly by requesting more Data Warehouse Unit (DWU) on your instance of ADW. DWUs were modelled about the DTUs from Azure SQL Databases.
Cool? Now let’s dive into how the data and compute are actually split out between the nodes.
As in Babylon, they were 60 databases
Apparently Babylonians had quite a kick at the number 60 and some of its multiples, such as 360. This is why we owe them the subdivision of the hours in 60 minutes and those in 60 seconds. Also, the 360 degrees of arc to complete a circle might have come from them too (or is it because of the 365 days in a year? we might never know).
Nevertheless, ADW splits the data between 60 databases. All the time, regardless of what you do. It’s a constant. It’s like $latex \Pi$.
I do not know the details around that decision but I guess it optimizes some criteria.
Those databases live on the compute nodes. It is quite easy, now that you know there are 60 of those, to deduce the number of compute nodes from the dedicated Data Warehouse Unit (DWU)using my fantastic formula: $latex #nodes \times #db per node = 60$. We can assume that $latex DWU = #nodes \times 100$, i.e. the lowest number of DWU corresponds to 1 compute node.
|DWU||# Compute Nodes||# DB per node|
That’s my theory anyway… I do not have insider information in the product. It would explain why we have those jumps as you go higher in the DWUs: to spread evenly the databases among the compute nodes.
Here’s an example of an ADW instance with 1500 DWU (i.e. 15 compute nodes with 4 DBs each)
So the data you load in ADW is stored in 60 databases behind the scene.
Which data gets stored in which database?
As long as you are doing simple select on one table and that your data is distributed evenly, you shouldn’t care, right? The query will flow to the compute nodes, they will perform the query on each database and the result will be merged together by the control node.
But once you start joining data from multiple tables, ADW will have to swing data around from one database to another in order to join the data. This is called Data Movement. It is impossible to avoid in general but you should strive to minimize it to obtain better performance.
Data location is controlled by the distribution attribute of your tables. By default, tables are distributed in a round robin fashion: data goes first to database 1 then 2, then 3…
You can somewhat control where your data will go by using the hash distribution method. With that method, you specify, when creating your table, that you want the hash algorithm to be used and which column to use. What this guarantees is that data rows with the same hash column value will end up in the same table. It doesn’t guarantee that any two hash column value will end up in the same database: the exact hash algorithm isn’t published.
So, let’s look at a simple example of a round-robin distributed table:
CREATE TABLE [dbo].MyTable ( CustomerID INT NOT NULL, CustomerName VARCHAR(32) NOT NULL, RegionID INT NOT NULL ) WITH ( CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN )
Since round robin also is the default distribution, I could have simply omit to specify it:
CREATE TABLE [dbo].MyTable ( CustomerID INT NOT NULL, CustomerName VARCHAR(32) NOT NULL, RegionID INT NOT NULL ) WITH ( CLUSTERED COLUMNSTORE INDEX )
And now with a hash algorithm:
CREATE TABLE [dbo].MyTable ( CustomerID INT NOT NULL, CustomerName VARCHAR(32) NOT NULL, RegionID INT NOT NULL ) WITH ( CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(RegionID) )
Here I specified I want the hash to be taken from the RegionID column. So all customers within the same region will be stored in the same database.
So what have I achieved by making sure that customers from the same regions are stored in the same DB? If I would want to obtain the sum of the number of customers per region, I can now do it without data movement because I am guaranteed that rows for a given region will all be in the same database.
Furthermore, if I want to join data from another table on region ID, that join can happen “locally” if the other table also has a hash distribution on the region ID. Same thing if I want to group by region, e.g. summing something by region.
That is the whole point of controlling the distribution: minimizing data movement. It is recommended to use it with columns
- That aren’t updated (hash column can’t be updated)
- Distribute data evenly, avoiding data skew
- Minimize data movement
It is obviously a somewhat advanced feature: you need to think about the type of queries you’re gona have and also make sure the data will be spread evenly. For instance, here, if “region” represents a country and you primarily do business in North America, you just put most of your data in at most two databases (USA + Canada) over 60: not a good move.
It’s also worth noting that hash distribution slows down data loading. So if you are only loading a table to perform more transformation on it, just use default round robin.
Then you have partitions. This gets people confused: isn’t partition a piece of the distribution? One of the databases?
A partition is an option you have to help manage your data because you can very efficiently delete a partition in a few seconds despite the partition containing millions of rows. That is because you won’t log a transaction for each row but one for the entire partition.
Also, for extremely large tables, having partitions could speed up queries using the partition key in their where clause. This is because it would give ADW a hint to ignore all other partitions. Partitions are stored separately, as if they were separate tables.
As a metaphor, you could consider a partitioned table as a UNION of normal tables ; so using the partition key in the where clause is equivalent to hitting one of the normal tables instead of the UNION, i.e. all tables. In some scenario, that could provide some good speed up.
You need to have something big to make it worthwhile in terms of query speedup though. ADW stores its data rows in row groups of up to a million rows. So if your partitions are small, you just increase the number of row groups which will slow down your queries… Again, imagine having lots of tables in a UNION. A query against that would be quite slow.
Here is how I would partition my earlier table:
CREATE TABLE [dbo].MyTable ( CustomerID INT NOT NULL, CustomerName VARCHAR(32) NOT NULL, RegionID INT NOT NULL ) WITH ( CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(RegionID), PARTITION ( CustomerName RANGE RIGHT FOR VALUES ('E', 'L', 'Q', 'U') ) )
I built on the previous example which had hash distribution. But it could have been a round robin distribution. Those two options (i.e. hash distribution & partitioning) are orthogonal.
It is important to understand that the 60 databases will have the same partitions. You already have 60 partitions naturally with the 60 databases. This is why you have to think about it wisely not to slow down your queries.
To visualize that, imagine my example with 5 partitions (4 boundaries means 5 partitions in total):
We end up with $latex 60 \times 5 = 300$ partitions. Is that a good thing? It depends on the problem, i.e. the way I plan to manage my partitions and the queries being done against it.
Here I tried to explain the different ways your data gets distributed around Azure Data Warehouse (ADW).
I didn’t get into the index & row groups, which is another level of granularity under partitions.
Hopefully that gives you a clear picture of how which compute node access which part of your data, the data itself being in Premium blob storage and not collocated with compute, how you can control its distribution and how you could partition it further.