# Tag Archives: Data

Data Architecture & Data Science:  Azure SQL Database, SQL on Azure VMs, Azure Data Warehouse, Data Factory, Redis Cache, Azure Search

# Extended Outage @ Instapaper – Resiliency example

I use Instapaper extensively to store the continuous flow of internet articles I want to read.  I created a bunch of tools integrating with it (e.g. monitoring Atom feeds and sending new articles to Instapaper).

Last week my tools didn’t work for a while so I finally logged in directly to the site.  The site was down, citing an extended outage, with a reference to a blog article to explain the outage.

It got back on its feet after around 48 hours.  This isn’t an article to call out Instapaper’s engineering:  that type of outage happens everywhere all the time.  But let’s learn from it.

The article cites the cause of the outage as being “we hit a system limit for our hosted database that’s preventing new articles from being saved”.  The article also mentions they had to spend time on the phone with their cloud provider (it doesn’t mention which one) before diagnosing the problem.

## Know the limits

They hit a limit.

What are the limits?

A clear piece of advice when working in Azure:  know where your sandbox ends.  We should always consult http://aka.ms/azurelimits to inform our architecture.

The nice thing about the cloud is that most limits are clearly defined and embedded in SLAs.

This comes as a surprise to a lot of people, especially when they see some of the restrictions.  “How come I can only put 2 disks on a Standard D1 v2 VM?”  This is because of the experience we have on premise, where there are few hard-wired limitations.  Instead, we typically have degradations:  sure, you can put a terabyte of storage on your old laptop, but you are likely going to saturate its IOs before you exhaust the storage.  In Azure, limits are more clearly defined because Azure is a public cloud, i.e. a multi-tenant environment.  The only way for Azure to guarantee other customers that they can have capacity X is to make sure we do not use that capacity.  So there are sandboxes everywhere.

On the flip side, you have fewer surprises.  The reason you do not have so many sandboxes on prem is that everything is shared.  That is nice as long as we are the only one cranking up resource usage.  But when other systems start grinding that SAN, it isn’t so fast anymore.

So the first lesson we get from the Instapaper outage is to know the limits.  Did the Instapaper team know the limits of their database?  Did they hit some software limit, e.g. did they use up all 32 bits of an identifier column?

## Monitor for those limits

Now that we know the limits, we know when our system is going to break.

So we need to monitor it so that we won’t have a bad surprise like the Instapaper team must have had (I’m sure some dinners were cancelled during those 48 hours).

In Azure, we can use Azure Monitor / Log Analytics to monitor different metrics on our resources.  We can set up alerts to be notified when some threshold has been reached.

We can then react.  If we set the threshold low enough, that will give us a window of time to react and sketch a plan.
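
As an illustration, with the AzureRM PowerShell module, a metric alert could be sketched along these lines (a sketch only ; the resource id, metric, threshold and e-mail address are made up for the example, and exact parameter names vary a bit between module versions):

```powershell
# Sketch: alert when DTU consumption stays above 80% for 15 minutes (illustrative values)
$resourceId = "/subscriptions/<subscription-id>/resourceGroups/DBs/providers/Microsoft.Sql/servers/pooldemoserver/databases/Pool-A-Db0"

Add-AzureRmMetricAlertRule -Name "dtu-near-limit" `
                           -Location "East US" `
                           -ResourceGroup "DBs" `
                           -TargetResourceId $resourceId `
                           -MetricName "dtu_consumption_percent" `
                           -Operator GreaterThan `
                           -Threshold 80 `
                           -WindowSize "00:15:00" `
                           -TimeAggregationOperator Average `
                           -Action (New-AzureRmAlertRuleEmail -CustomEmail "ops@example.com")
```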

In an article on SQL Database sizes in an elastic pool, we saw that we can set a limit on a database’s size to make sure it doesn’t consume the entire pool’s allowed size.  This is a safeguard mechanism in case our monitoring strategy fails.

## Strategy to get beyond the limits

We know the limits of our architecture.  We monitor our system to get a heads-up before those limits are reached.

When they are reached, what do we do?

Typically, if we're aware of the limits, we will have given them some thought, even if it isn't fully documented.  This is something that actually comes up in handover conversations:  “this system has this limit, but you know, if that ever becomes a problem, consider that alternative”.  Hopefully, the alternative doesn't consist of rewriting the entire system.

## Scale out / partitioning strategies

If we are lucky, the original architect(s) of the system have baked in a way to overcome its limits.

A typical way is to partition the system.  We can partition the system per user / group of users or other data segments.  This way, we can scale out the system to handle different partitions.

We can start with only one set of resources (e.g. Virtual Machines) handling all partitions and, the day that set of resources hits its limits, split the partitions into two groups and have another set of resources handle the second group.  And so on.
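
As a toy illustration (a hypothetical helper, not tied to any particular product), a stable hash can map each tenant to one of a fixed number of logical partitions ; splitting later then only means re-assigning partitions to resource sets, not re-shuffling individual users:

```powershell
# Toy sketch: map a tenant id to one of N logical partitions with a stable hash
function Get-PartitionIndex
{
    param([string] $TenantId, [int] $PartitionCount)

    $md5   = [System.Security.Cryptography.MD5]::Create()
    $bytes = [System.Text.Encoding]::UTF8.GetBytes($TenantId)
    $hash  = [System.BitConverter]::ToUInt32($md5.ComputeHash($bytes), 0)

    return $hash % $PartitionCount
}

# 64 logical partitions, initially all served by a single set of resources
Get-PartitionIndex -TenantId "contoso" -PartitionCount 64
```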

One day, that partition scheme might also hit its limits.  For instance, maybe we end up with one set of resources per partition and most of those resource sets have hit their limits.  In that case, we're back to figuring out how to go beyond the limits of our system.  Typically, that will consist of repartitioning it in a way that lets us scale out further.

Azure is a partitioned / scaled out system.  That strategy has allowed it to grow to its current size.  Along the way, repartitioning was needed.  For instance, the transition of ASM to ARM was partially that.  In ASM, there were only 2 regions handling the Azure Management APIs while in ARM, each region handles API requests.

## Conclusion

Instapaper was victim of its own success.  That’s a good problem to have but a problem nevertheless.

Make sure you at least know the limits of your system and monitor for them.  This way, if success curses you, you'll be able to react during working hours instead of cancelling your Valentine's dinner and spending your days justifying yourself to your CEO.

# Azure SQL Elastic Pool – Moving databases across pools using PowerShell

I’ve written a bit about Azure SQL Elastic Pool lately:  an overview, about ARM templates and about database size.

One of the many great features of Azure SQL Elastic Pool is that like Azure SQL Database (standalone), we can change the eDTU capacity of the pool “on the fly”, without downtime.

Unlike its standalone cousin though, we can’t change the edition of the pool.  The edition is either Basic, Standard or Premium.  It is set at creation and is immutable after that.

If we want to change the edition of a pool, the obvious way is to create another pool, move the databases there, delete the original, recreate it with a different edition and move the databases back.

You might want to move databases around for other reasons, typically to optimize the density and performance of pools.  You would then use a very similar script.

## Look at the pool

Let’s start with the pools we established with the sample ARM template of a previous article.

From there we can look at the pool Pool-A using the following PowerShell command:


$old = Get-AzureRmSqlElasticPool -ResourceGroupName DBs -ElasticPoolName Pool-A -ServerName pooldemoserver

$old



We can see the pool's current edition is Standard while its Database Transaction Unit (DTU) count is 200.

## Create a temporary pool

We’ll create a temporary pool, aptly named temporary, attached to the same server:


$temp = New-AzureRmSqlElasticPool -ResourceGroupName DBs -ElasticPoolName Temporary -ServerName pooldemoserver -Edition $old.Edition -Dtu $old.Dtu

$temp



It’s important to create a pool that the databases can be moved into.  The maximum size of a database depends on the edition and the number of DTUs of the elastic pool.  The easiest way is to create a pool with the same edition / DTU, and this is what we do here by referencing the $old variable.

## Move databases across

First, let’s grab the databases in the original pool:


$dbs = Get-AzureRmSqlDatabase -ResourceGroupName DBs -ServerName pooldemoserver | where {$_.ElasticPoolName -eq $old.ElasticPoolName}

$dbs | select DatabaseName


ElasticPoolName is a property of a database.  We’ll simply change it by setting it on each database:


$dbs | foreach {Set-AzureRmSqlDatabase -ResourceGroupName DBs -ServerName pooldemoserver -DatabaseName $_.DatabaseName -ElasticPoolName $temp.ElasticPoolName}



That command takes longer to run as the databases have to move from one compute to another.

## Delete / Recreate pool

We can now delete the original pool, recreate it with the desired edition and move the databases back.  It’s important to note that we wouldn’t have been able to delete a pool that still had databases in it.
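
For instance, if the goal was to move Pool-A to the Premium edition, the delete / recreate / move-back sequence could be sketched as follows, reusing the $old, $temp and $dbs variables from above (the Premium edition and the 250 DTUs are only examples):

```powershell
# Delete the original pool (now empty) and recreate it, under the same name, with the new edition
Remove-AzureRmSqlElasticPool -ResourceGroupName DBs -ElasticPoolName $old.ElasticPoolName -ServerName pooldemoserver
$new = New-AzureRmSqlElasticPool -ResourceGroupName DBs -ElasticPoolName $old.ElasticPoolName -ServerName pooldemoserver -Edition Premium -Dtu 250

# Move the databases back into the recreated pool
$dbs | foreach {Set-AzureRmSqlDatabase -ResourceGroupName DBs -ServerName pooldemoserver -DatabaseName $_.DatabaseName -ElasticPoolName $new.ElasticPoolName}
```

Once the databases are back in their recreated pool, all that is left is to remove the temporary pool, as shown below.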



Remove-AzureRmSqlElasticPool -ResourceGroupName DBs -ElasticPoolName $temp.ElasticPoolName -ServerName pooldemoserver


This deletes the temporary pool.  Again, moving the databases back takes a little longer to execute since they must be moved from one compute to another.

## Summary

We showed how to move databases from one pool to another.  The pretext was a change in elastic pool edition, but we might want to move databases around for other reasons.

In practice you might not want to move your databases twice (to avoid the duration of the operation) and might be happy to keep a pool with a different name.

In the demo we did, the move took less than a minute because we had two empty databases.  With many databases totalling a lot of storage, it would take much more time.

# Azure SQL Elastic Pool – Database Size

I mentioned in a past article, regarding database sizes within an elastic pool:  “No policies limit an individual database to take more storage although a database maximum size can be set on a per-database basis.”

I’m going to focus on that in this article.

An Azure SQL Database resource has a MaxSizeInBytes property.  We can set it either in an ARM template (see this ARM template and the property maxSizeBytes) or in PowerShell.

An interesting aspect of that property is that:

• It takes only specific values
• Not all values are permitted, depending on the elastic pool edition (i.e. Basic, Standard or Premium)

## Valid values

One way to find the valid values is to navigate to the ARM schema.  That documented schema is likely slightly out of date since, as of December 2016, its largest value is 500 GB, which isn’t the largest possible database size (1 TB for a P15).

The online documentation of Set-AzureRmSqlDatabase isn’t faring much better:  the documentation for the MaxSizeBytes parameter refers to a MaxSizeGB parameter to know about the acceptable values.  Problem is, the MaxSizeGB parameter doesn’t exist.

But let’s start with the documented schema as it probably only lacks the most recent DB sizes.

Using that schema’s list of possible values and comparing it with the standalone database sizes for given editions, we can conclude (after testing with ARM templates of course) that a Basic pool can have databases of up to 2 GB, for Standard it is 250 GB, and Premium can of course take all values.

It is important to notice that the pool can have larger storage.  For instance, even the smallest Basic pool, with 50 eDTUs, can have a maximum storage of 5 GB.  But each DB within that pool can only grow up to 2 GB.

That gives us the following landscape:

| Maximum Size (in bytes) | Maximum Size (in GB) | Available for (edition) |
| --- | --- | --- |
| 104857600 | 0.1 | Premium, Standard, Basic |
| 524288000 | 0.5 | Premium, Standard, Basic |
| 1073741824 | 1 | Premium, Standard, Basic |
| 2147483648 | 2 | Premium, Standard, Basic |
| 5368709120 | 5 | Premium, Standard |
| 10737418240 | 10 | Premium, Standard |
| 21474836480 | 20 | Premium, Standard |
| 32212254720 | 30 | Premium, Standard |
| 42949672960 | 40 | Premium, Standard |
| 53687091200 | 50 | Premium, Standard |
| 107374182400 | 100 | Premium, Standard |
| 161061273600 | 150 | Premium, Standard |
| 214748364800 | 200 | Premium, Standard |
| 268435456000 | 250 | Premium, Standard |
| 322122547200 | 300 | Premium |
| 429496729600 | 400 | Premium |
| 536870912000 | 500 | Premium |

## Storage Policies

We can now use this maximum database size as a storage policy, i.e. a way to make sure a single database doesn’t take all the storage available in a pool.

Now, this isn’t as trivially useful as the eDTU min / max we’ve seen in a pool.

In the eDTU case, that was controlling how much compute was given to a database at all times.  In the case of a database maximum size, once the database reaches that size, it becomes read only.  That will likely break our applications running on top of it unless we planned for it.

A better approach would be to monitor the different databases and react to size changes, for instance by moving the database to another pool.  The maximum size could be a safeguard though.

For instance, let’s imagine we want each database in a pool to stay below 50 GB.  We’ll monitor for that and raise alerts when that threshold is reached (see Azure Monitor for monitoring and alerts).  Now we might still put a maximum size of 100 GB on the databases.  This would act as a safeguard:  if we do not do anything about a database outgrowing its 50 GB target, it won’t be able to grow indefinitely, which could otherwise top the pool maximum size and make the entire pool read only, affecting ALL the databases in the pool.

In that sense the maximum size still acts as a resource governor, preventing the noisy neighbour effect.

## PowerShell example

We can’t change a database maximum size in the portal (as of December 2016).  Using an ARM template, it is easy to change the parameter.  Here, let’s simply show how we would change it for an existing database.

Building on the example we gave in a previous article, we can easily grab the Pool-A-Db0 database in resource group DBs and server pooldemoserver:


Get-AzureRmSqlDatabase -ServerName pooldemoserver -ResourceGroupName DBs -DatabaseName Pool-A-Db0


We can see the size is the one that was specified in the ARM template (the ARM parameter DB Max Size default value), i.e. 10 GB.  We can bump it to 50 GB, i.e. 53687091200 bytes:


Set-AzureRmSqlDatabase -ServerName pooldemoserver -ResourceGroupName DBs -DatabaseName Pool-A-Db0 -MaxSizeBytes 53687091200


We can confirm the change in the portal by looking at the properties.

## Default Behaviour

If the MaxSizeBytes property is omitted, either in an ARM template or in the New-AzureRmSqlDatabase PowerShell cmdlet, the default behaviour is for the database to have the maximum capacity (e.g. for Standard, 250 GB).

After creation, we can’t set the property value to null to obtain the same effect.  Omitting the parameter simply keeps the previously set value.

## Summary

We’ve looked at the maximum size property of a database.  It can be used to control the growth of a database inside a pool and prevent one database’s growth from affecting the others.

# Azure SQL Elastic Pool – ARM Templates

In my last article, I covered Azure SQL Elastic Pool.  In this one I cover how to provision it using ARM templates.

As of today (December 2016), the documentation about Azure SQL Elastic Pool provisioning via ARM templates is…  nonexistent.  Searching for it, I was able to gather hints from a few colleagues’ GitHub repos, but there are no examples in the ARM quickstart templates, nor is the elastic pool resource schema documented.  Also, the Automation Script feature in the portal doesn’t reverse engineer an ARM template for the elastic pool.

So I hope this article fills that gap and is easy to search for & consume.

## ARM Template

Here we’re going to provision a server with two pools, Pool-A & Pool-B (yeah, sounds a bit like Thing 1 & Thing 2), each having a few (configurable number of) databases in them.


{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"Server Name": {
"defaultValue": "pooldemoserver",
"type": "string",
"description": "Name of the SQL:  needs to be unique among all servers in Azure"
}
},
"type": "string",
}
},
"type": "securestring",
}
},
"Pool A Edition": {
"defaultValue": "Standard",
"type": "string",
"allowedValues": [
"Basic",
"Standard",
],
"description": "Pool A Edition"
}
},
"Pool B Edition": {
"defaultValue": "Standard",
"type": "string",
"allowedValues": [
"Basic",
"Standard",
],
"description": "Pool B Edition"
}
},
"DB Max Size": {
"defaultValue": "10737418240",
"type": "string",
"allowedValues": [
"104857600",
"524288000",
"1073741824",
"2147483648",
"5368709120",
"10737418240",
"21474836480",
"32212254720",
"42949672960",
"53687091200",
"107374182400",
"161061273600",
"214748364800",
"268435456000",
"322122547200",
"429496729600",
"536870912000"
],
"description": "DB Max Size, in bytes"
}
}
},
"variables": {
"Pool A": "Pool-A",
"Pool B": "Pool-B",
"DB A Prefix": "Pool-A-Db",
"DB B Prefix": "Pool-B-Db",
"Count A": 2,
"Count B": 4
},
"resources": [
{
"name": "[parameters('Server Name')]",
"type": "Microsoft.Sql/servers",
"apiVersion": "2014-04-01-preview",
"location": "[resourceGroup().location]",
"dependsOn": [],
"properties": {
"version": "12.0"
},
"resources": [
{
"type": "firewallRules",
"kind": "v12.0",
"name": "AllowAllAzureIps",
"apiVersion": "2014-04-01-preview",
"location": "[resourceGroup().location]",
"dependsOn": [
"[resourceId('Microsoft.Sql/servers', parameters('Server Name'))]"
],
"properties": {
}
},
{
"type": "elasticpools",
"name": "[variables('Pool A')]",
"apiVersion": "2014-04-01-preview",
"location": "[resourceGroup().location]",
"dependsOn": [
"[resourceId('Microsoft.Sql/servers', parameters('Server Name'))]"
],
"properties": {
"edition": "[parameters('Pool A Edition')]",
"dtu": "200",
"databaseDtuMin": "10",
"databaseDtuMax": "50"
}
},
{
"type": "elasticpools",
"name": "[variables('Pool B')]",
"apiVersion": "2014-04-01-preview",
"location": "[resourceGroup().location]",
"dependsOn": [
"[resourceId('Microsoft.Sql/servers', parameters('Server Name'))]"
],
"properties": {
"edition": "[parameters('Pool B Edition')]",
"dtu": "400",
"databaseDtuMin": "0",
"databaseDtuMax": null
}
}
]
},
{
"type": "Microsoft.Sql/servers/databases",
"copy": {
"name": "DBs-A",
"count": "[variables('Count A')]"
},
"name": "[concat(parameters('Server Name'), '/', variables('DB A Prefix'), copyIndex())]",
"location": "[resourceGroup().location]",
"dependsOn": [
"[resourceId('Microsoft.Sql/servers', parameters('Server Name'))]",
"[resourceId('Microsoft.Sql/servers/elasticpools', parameters('Server Name'), variables('Pool A'))]"
],
"tags": {
"displayName": "Pool-A DBs"
},
"apiVersion": "2014-04-01-preview",
"properties": {
"collation": "SQL_Latin1_General_CP1_CI_AS",
"maxSizeBytes": "[parameters('DB Max Size')]",
"requestedServiceObjectiveName": "ElasticPool",
"elasticPoolName": "[variables('Pool A')]"
}
},
{
"type": "Microsoft.Sql/servers/databases",
"copy": {
"name": "DBs-B",
"count": "[variables('Count B')]"
},
"name": "[concat(parameters('Server Name'), '/', variables('DB B Prefix'), copyIndex())]",
"location": "[resourceGroup().location]",
"dependsOn": [
"[resourceId('Microsoft.Sql/servers', parameters('Server Name'))]",
"[resourceId('Microsoft.Sql/servers/elasticpools', parameters('Server Name'), variables('Pool B'))]"
],
"tags": {
"displayName": "Pool-B DBs"
},
"apiVersion": "2014-04-01-preview",
"properties": {
"edition": "[parameters('Pool B Edition')]",
"collation": "SQL_Latin1_General_CP1_CI_AS",
"maxSizeBytes": "[parameters('DB Max Size')]",
"requestedServiceObjectiveName": "ElasticPool",
"elasticPoolName": "[variables('Pool B')]"
}
}
]
}


We can deploy the template as is.  We’ll need to enter at least an Admin password (for the Azure SQL server).

The “Server Name” parameter must be unique throughout Azure (not just within your subscription).  So if it happens to be taken when you try to deploy the template (in which case you would receive an error message along the lines of Server ‘pooldemoserver’ is busy with another operation), try a new, more original name.

Each parameter is documented in the metadata description.
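
For reference, the deployment could be kicked off from PowerShell along these lines (a sketch ; it assumes the template was saved locally as pool-demo.json, and the file name and parameter values are illustrative):

```powershell
# Sketch: deploy the template into an existing resource group named DBs (values are examples)
$password = Read-Host -AsSecureString -Prompt "SQL admin password"

$parameters = @{
    "Server Name"    = "pooldemoserver"
    "Admin Login"    = "pooldemoadmin"
    "Admin Password" = $password
}

New-AzureRmResourceGroupDeployment -ResourceGroupName DBs -TemplateFile .\pool-demo.json -TemplateParameterObject $parameters
```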

## Results

Let’s look at the result.  Let’s first go in the resource group where we deployed the template.

In the resource list we should see the following:

We first have our server, with default name pooldemoserver, our two pools, Pool-A & Pool-B, and 6 databases.

Let’s select Pool-A.

We can see the pool is of the Standard edition and has 200 eDTUs, with a minimum of 10 eDTUs and a maximum of 50 per database, which is faithful to its ARM definition (shown again below).

        {
"type": "elasticpools",
"name": "[variables('Pool A')]",
"apiVersion": "2014-04-01-preview",
"location": "[resourceGroup().location]",
"dependsOn": [
"[resourceId('Microsoft.Sql/servers', parameters('Server Name'))]"
],
"properties": {
"edition": "[parameters('Pool A Edition')]",
"dtu": "200",
"databaseDtuMin": "10",
"databaseDtuMax": "50"
}
}


Similarly, Pool-B has a minimum of 0 and a maximum of 100.  The maximum was set to null in the template and hence defaults to the maximum allowed for a Standard pool of 400 DTUs.

Let’s select the databases in Pool-B.  Alternatively, we can select the Configure pool tool bar option.

The following pane shows us the eDTUs consumed in the last 14 days.  It also allows us to change the assigned eDTUs to the pool.

It is in this pane that we can add / remove databases from the pool.

In order to remove databases from the pool, they must first be selected in the lower-right corner of the pane.  We then have to choose a standalone pricing tier for each DB and hit Save.  As of today (December 2016), there is no way to move databases directly from one pool to another in the portal:  they must first be converted to standalone databases.  It is possible to move databases from one pool to another using PowerShell though, as I'll demonstrate in a future article.

If we go back to the resource group and select any of the database, we have a link to its parent pool.

## Summary

Despite the current lack (as of December 2016) of documentation around it, it is quite possible to create databases within an elastic pool using ARM templates as we’ve demonstrated here.

# Azure SQL Elastic Pool Overview

What is Azure SQL Elastic Pool and what does it bring to Azure SQL Database, the SQL Platform-as-a-Service (PaaS) on Azure?

Let’s look at how Azure SQL works without elastic pools first.

Azure SQL Database comes with an Azure SQL Server.  This shouldn’t be confused with SQL Server installed on a VM:  it is a logical server holding everything that doesn’t belong to a database.  This model makes it compatible with SQL server “on premise”.

The important point here is that the compute sits with the database and not the server.  The edition (i.e. Basic, Standard & Premium) & Pricing Tier / DTUs are set at the database level, not the server level.  Actually, the server doesn’t even have a cost associated to it.

In some ways, this is the opposite of what SQL Server on premise got us used to.  On premise, we have a server sitting on an OS and the databases constructed on top of it, borrowing compute from the server.  In Azure the compute sits at the database level, while the server is this pseudo-centralized thing with no compute associated to it.

In that sense, Azure SQL DB has a much better isolation model out of the box although you can now do the same thing with SQL Server on premise using the Resource Governor.

## Elastic Pool Conceptual model

Along came Elastic Pool.  Interestingly, Elastic Pools brought back the notion of a centralized compute shared across databases.  Unlike SQL Server on premise though, that compute doesn't sit with the server itself but with a new resource called an elastic pool.

This allows us to provision a certain amount of compute, i.e. DTUs, to a pool and share it across many databases.

A typical scenario where that is beneficial is having a lot of small databases, which tends to be cost prohibitive with the traditional model.

That makes it an excellent solution for ISV / SaaS providers where different tenants have different spikes.

We could have “hybrid” scenarios where a server has “traditional” databases with their own pricing tier alongside databases attached to a pool.

## DTU policy

The pool can define a policy regarding the minimum and maximum DTUs per database (see the sketch below the list).  This allows each database

• to have a minimum amount of compute dedicated to it, avoiding compute starvation
• to have a maximum amount of compute, avoiding the noisy neighbour effect where one database starves all the others
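
Those two values are simply properties of the pool.  For instance, with the AzureRM cmdlets, a pool with such a policy could be sketched like this (server name, pool name and figures are only illustrative):

```powershell
# Sketch: a 200 eDTU Standard pool where each database gets at least 10 and at most 50 eDTUs
New-AzureRmSqlElasticPool -ResourceGroupName DBs `
                          -ServerName pooldemoserver `
                          -ElasticPoolName Pool-A `
                          -Edition Standard `
                          -Dtu 200 `
                          -DatabaseDtuMin 10 `
                          -DatabaseDtuMax 50
```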

## Storage

On the other hand, a pool has a maximum storage size shared across the pool.

No policies limit an individual database to take more storage although a database maximum size can be set on a per-database basis.

It is important to note that once the maximum pool size has been reached by the sum of the databases’ size, all databases become read only.

## Limits

I often find it useful to look at the different limits and quotas of Azure services to understand the structure of a service and inform design decisions.

Looking at Azure SQL databases limits, we find those interesting facts:

• The maximum number of databases per pool varies depending on the pool size, e.g. a Standard pool with 100 DTUs can have 200 databases
• A server can have up to 5000 databases associated to it
• A server can have up to 45000 DTUs associated to it, either via elastic pools, databases directly or even Azure Data Warehouses
• There is no documented limit on the number of pools per server
• The server, its pools & databases must be in the same Azure region under the same subscription

Let’s look at a few design questions now.

### Why use more than one pool?

Why not use one pool with a huge number of DTUs?

• Ultimately a pool cannot be of infinite size (4000 DTUs / 750 GB for Premium, 3000 DTUs / 2.9 TB for Standard) so we’ll use multiple pools to scale
• Policies, i.e. min / max DTUs, are set up at the pool level ; if we have a bunch of tiny DBs with few transactions & a group of mid-sized DBs with more traffic on them, we might want to have multiple pools with different policies to handle those

### Should we have one server per pool or multiple pools per server?

An Azure SQL Server does very little:

• Holds an Admin account for the entire server
• Holds the pools & databases
• Exists in a region

Obviously, multiple regions means multiple servers.

Why would we choose multiple servers over one server multiple pools?  Security:  if we want to segregate access to different databases at the administration level, we wouldn’t want to share one admin account for all.

A lot can be argued around that point, e.g. we could have one admin account for all DBs but different admins per DB.  In compliance scenarios, I could see this playing out, e.g. dev vs prod, banking vs federal customers, etc.

### Why use a Premium elastic pool?

Standard pools have bigger storage and comparable parallelism specs, so why go Premium and pay a…  Premium?

The main spec where Premium shines is for min / max DTUs per DB:  Premium allows us to have bigger databases within a pool while Standard is geared to have smaller DBs.

More concretely, Standard pools allow up to 100 DTUs per database while in Premium it goes up to 4000.

As a comparison, 100 DTUs is equivalent to a standalone S3 database.

## Summary

We looked at the Azure SQL Database Elastic Pool feature.

Elastic Pool really is an economic feature:  it's a way to increase the number of databases run on the same compute and hence reduce the cost.

In scenarios where we have lots of small databases, it can drastically reduce costs.

In a future post, I’ll cover how to provision an Elastic pool using ARM template.

# Primer on Azure Monitor

Azure Monitor is the latest evolution of a set of technologies allowing Azure resources monitoring.

I’ve written about going the extra mile to be able to analyze logs in the past.

The thing is that once our stuff is in production with tons of users hitting it, it might very well start behaving in unpredictable ways.  If we do not have a monitoring strategy, we’re going to be blind to problems and only see unrelated symptoms.

Azure Monitor is a great set of tools.  It doesn't try to be the be-all and end-all solution.  On the contrary, although it offers analytics out of the box, it lets us export the logs wherever we want to go further.

I found the documentation of Azure Monitor (as of November 2016) a tad confusing, so I thought I would give a summary overview here.  Hopefully it will get you started.

## Three types of sources

First thing we come across in Azure Monitor’s literature is the three types of sources:  Activity Logs, Diagnostic Logs & Metrics.

There is a bit of confusion between Diagnostic Logs & Metrics, some references hinting towards the fact that metrics are generated by Azure while diagnostics are generated by the resource itself.  That is confusing & beside the point.  Let’s review those sources here.

Activity logs capture all operations performed on Azure resources.  They used to be called Audit Logs & Operational Logs.  Those come directly from the Azure APIs:  any operation done on an Azure API (except HTTP GET operations) traces an activity log.  Activity logs are in JSON and contain the following information:  action, caller, status & time stamp.  We'll want to keep track of those to understand changes done in our Azure environments.

Metrics are emitted by most Azure resources.  They are akin to performance counters, something that has a value (e.g. % CPU, IOPS, # messages in a queue, etc.) over time ; hence Azure Monitor, in the portal, allows us to plot them against time.  Metrics typically come in JSON and tend to be emitted at regular intervals (e.g. every minute) ; see this article for available metrics.  We'll want to check those to make sure our resources operate within expected bounds.

Diagnostic logs are logs emitted by a resource that provide detailed data about the operation of that particular resource.  That one is specific to the resource in terms of content:  each resource will have different logs.  Format will also vary (e.g. JSON, CSV, etc.), see this article for different schemas.  They also tend to be much more voluminous for an active resource.

That’s it.  That’s all there is to it.  Avoid the confusion and re-read the last three paragraphs.  It’s a time saver.  Promised.

We’ll discuss the export mechanisms & alerts below, but for now, here’s a summary of the capabilities (as of November 2016) of each source:

| Source | Export destinations | Supports alerts |
| --- | --- | --- |
| Activity Logs | Storage Account & Event Hub | Yes |
| Metrics | Storage Account, Event Hub & Log Analytics | Yes |
| Diagnostic Logs | Storage Account, Event Hub & Log Analytics | No |

### Activity Log example

We can see the activity log of our favorite subscription by opening the Monitor blade, which should be on the left-hand side, at https://portal.azure.com.

If you do not find it there, hit the More Services and search for Monitor.

Selecting the Activity Logs, we should have a search form and some results.

ListKeys is a popular one.  Despite being conceptually a read operation, the List Key action, on a storage account, is done through a POST in the Azure REST API, specifically to trigger an audit trail.

We can select one of those ListKeys and, in the tray below, select the JSON format:


{
"relatedEvents": [],
"authorization": {
"action": "Microsoft.Storage/storageAccounts/listKeys/action",
"condition": null,
"role": null,
},
"caller": null,
"category": {
},
"claims": {},
"correlationId": "6c619af4-453e-4b24-8a4c-508af47f2b26",
"description": "",
"eventChannels": 2,
"eventDataId": "09d35196-1cae-4eca-903d-6e9b1fc71a78",
"eventName": {
"localizedValue": "End request",
"value": "EndRequest"
},
"eventTimestamp": "2016-11-26T21:07:41.5355248Z",
"httpRequest": {
"clientRequestId": "ba51469e-9339-4329-b957-de5d3071d719",
"method": "POST",
"uri": null
},



I truncated the JSON here.  Basically, it is an activity event with all the details.

### Metrics example

Metrics can be accessed from the “global” Monitor blade or from any Azure resource’s monitor blade.

Here I look at the CPU usage of an Azure Data Warehouse resource (which hasn’t run for months, hence flat lining).

### Diagnostic Logs example

For diagnostics, let’s create a storage account and activate diagnostics on it.  For this, under the Monitoring section, let’s select Diagnostics, make sure the status is On and then select Blob logs.

We’ll notice that all metrics were already selected.  We also notice that the retention is controlled there, in this case 7 days.

Let’s create a blob container, copy a file into it and try to access it via its URL.  Then let’s wait a few minutes for the diagnostics to be published.

We should see a special $logs container in the storage account.  This container will contain log files, stored by date & time.  For instance, for the first file, just taking the first couple of lines:


1.0;2016-11-26T20:48:00.5433672Z;GetContainerACL;Success;200;3;3;authenticated;monitorvpl;monitorvpl;blob;"https://monitorvpl.blob.core.windows.net:443/$logs?restype=container&comp=acl";"/monitorvpl/$logs";295a75a6-0001-0021-7b26-48c117000000;0;184.161.153.48:51484;2015-12-11;537;0;217;62;0;;;""0x8D4163D73154695"";Saturday, 26-Nov-16 20:47:34 GMT;;"Microsoft Azure Storage Explorer, 0.8.5, win32, Azure-Storage/1.2.0 (NODE-VERSION v4.1.1; Windows_NT 10.0.14393)";;"9e78fc90-b419-11e6-a392-8b41713d952c"
1.0;2016-11-26T20:48:01.0383516Z;GetContainerACL;Success;200;3;3;authenticated;monitorvpl;monitorvpl;blob;"https://monitorvpl.blob.core.windows.net:443/$logs?restype=container&comp=acl";"/monitorvpl/$logs";06be52d9-0001-0093-7426-483a6d000000;0;184.161.153.48:51488;2015-12-11;537;0;217;62;0;;;""0x8D4163D73154695"";Saturday, 26-Nov-16 20:47:34 GMT;;"Microsoft Azure Storage Explorer, 0.8.5, win32, Azure-Storage/1.2.0 (NODE-VERSION v4.1.1; Windows_NT 10.0.14393)";;"9e9c6311-b419-11e6-a392-8b41713d952c"
1.0;2016-11-26T20:48:33.4973667Z;PutBlob;Success;201;6;6;authenticated;monitorvpl;monitorvpl;blob;"https://monitorvpl.blob.core.windows.net:443/sample/A.txt";"/monitorvpl/sample/A.txt";965cb819-0001-0000-2a26-48ac26000000;0;184.161.153.48:51622;2015-12-11;655;7;258;0;7;"Tj4nPz2/Vt7I1KEM2G8o4A==";"Tj4nPz2/Vt7I1KEM2G8o4A==";""0x8D4163D961A76BE"";Saturday, 26-Nov-16 20:48:33 GMT;;"Microsoft Azure Storage Explorer, 0.8.5, win32, Azure-Storage/1.2.0 (NODE-VERSION v4.1.1; Windows_NT 10.0.14393)";;"b2006050-b419-11e6-a392-8b41713d952c"



Storage Account diagnostics obviously log in semicolon delimited values (variant of CSV), which isn’t trivial to read the way I pasted it here.  But basically we can see the logs contain details:  each operation done around the blobs are logged with lots of details.

## Querying

As seen in the examples, Azure Monitor allows us to query the logs.  This can be done in the portal but also using Azure Monitor REST API, cross platform Command-Line Interface (CLI) commands, PowerShell cmdlets or the .NET SDK.
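
For instance, with the AzureRM.Insights PowerShell module, pulling the last day of activity log entries could look like this (a sketch ; exact parameter names vary a bit between module versions):

```powershell
# Sketch: list the activity log entries of the last 24 hours for the current subscription
Get-AzureRmLog -StartTime (Get-Date).AddDays(-1) -DetailedOutput
```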

## Export

We can export the sources to a Storage Account and specify a retention period in days.  We can also export them to Azure Event Hubs & Azure Log Analytics.  As specified in the table above, Activity logs can’t be sent to Log Analytics.  Also, Activity logs can be analyzed using Power BI.
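
As an illustration, routing a resource's diagnostics to a storage account can be scripted along these lines with the AzureRM.Insights module (a sketch ; the resource and storage account ids are placeholders):

```powershell
# Sketch: archive a resource's diagnostics into a storage account, keeping 90 days of data
$resourceId = "/subscriptions/<subscription-id>/resourceGroups/DBs/providers/Microsoft.Sql/servers/pooldemoserver/databases/Pool-A-Db0"
$storageId  = "/subscriptions/<subscription-id>/resourceGroups/Monitoring/providers/Microsoft.Storage/storageAccounts/monitorvpl"

Set-AzureRmDiagnosticSetting -ResourceId $resourceId `
                             -StorageAccountId $storageId `
                             -Enabled $true `
                             -RetentionEnabled $true `
                             -RetentionInDays 90
```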

There are a few reasons why we would export the logs:

• Archiving scenario:  Azure Monitor keeps content for 30 days.  If we need more retention, we need to archive it ourselves.  We can do that by exporting the content to a storage account ; this also enables big data scenario where we keep the logs for future data mining.
• Analytics:  Log Analytics offers more capacity for analyzing content.  It also offers 30 days of retention by default but can be extended to one year.  Basically, this would upgrade us to Log Analytics.
• Alternatively, we could export the logs to a storage account where they could be ingested by another SIEM (e.g. HP Arcsight).  See this article for details about SIEM integration.
• Near real time analysis:  Azure Event Hubs allow us to send the content to many different places, but also we could analyze it on the fly using Azure Stream Analytics.

Both Activity Logs & Metrics can trigger alerts.  Currently (as of November 2016), only Metrics alert can be set in the portal ; Activity Logs alerts must be set by PowerShell, CLI or REST API.

Alerts are a powerful way to automatically react to our Azure resource behaviors ; when certain conditions are met (e.g. for a metric, when a value exceeds a threshold for a given period of time), the alert can send an email to a specified list of email addresses but also, it can invoke a Web Hook.

Again, the ability to invoke a web hook opens up the platform.  We could, for instance, expose an Azure Automation runbook as a Web Hook ; it therefore means an alert could trigger whatever a runbook is able to do.
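
As a sketch (the URI below is a placeholder), the web hook is just another rule action:

```powershell
# Sketch: a web hook action, e.g. pointing at an Azure Automation runbook's web hook URL
$webhook = New-AzureRmAlertRuleWebhook -ServiceUri "https://example.com/my-runbook-webhook"

# The $webhook action is then attached to a metric alert rule (alongside or instead of an
# e-mail action) through the -Action parameter of Add-AzureRmMetricAlertRule.
```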

## Security

There are two RBAC roles around monitoring:  Reader & Contributor.

There are also some security considerations around monitoring:

• Use a dedicated storage account (or multiple dedicated storage accounts) for monitoring data.  Basically, avoid mixing monitoring and “other” data, so that people do not gain access to monitoring data inadvertently and, vice versa, so that people needing access to monitoring data do not gain access to “other” data (e.g. sensitive business data).
• For the same reasons, use a dedicated namespace with Event Hubs
• Limit access to monitoring data by using RBAC, e.g. by putting them in a separate resource group
• Never grant ListKeys permission across a subscription as users could then gain access to reading monitoring data
• If you need to give access to monitoring data, consider using a SAS token (for either Storage Account or Event Hubs)

## Summary

Azure Monitor brings together a suite of tools to monitor our Azure resources.  It is an open platform in the sense it integrates easily with solutions that can complement it.

# How does Azure Data Warehouse scale?

I’ve been diving in the fantastical world of Azure Data Warehouse (ADW) in the last couple of days.

I’ve been reading through all the documentation on Azure.com.  If you are serious about mastering that service I advise you do the same:  it is a worthy read.

In this article, I wanted to summarize a few concepts that are somehow interconnected:  MPP, distribution & partition.  Those concepts all define how your data is spread out and processed in parallel.

Let’s get started!

## Massively Parallel Processing (MPP)

Conceptually, you have one Control Node the clients interact with and it, in turn, interacts with a multitude of Compute Nodes.

The data is stored in Premium Blob storage and is therefore decoupled from the compute nodes.  This is why you can scale out, scale in or even pause your ADW quickly without losing data.

The control node takes a query as input and does some analysis on it before delegating the actual compute to the compute nodes.  The compute nodes perform their sub-queries and return results to the control node.  The control node takes the results back, assembles them and returns them to the client.

You can tune the number of compute nodes indirectly by requesting more Data Warehouse Units (DWUs) on your instance of ADW.  DWUs were modelled on the DTUs from Azure SQL Databases.

Cool?  Now let’s dive into how the data and compute are actually split out between the nodes.

## As in Babylon, they were 60 databases

Apparently Babylonians had quite a kick at the number 60 and some of its multiples, such as 360.  This is why we owe them the subdivision of the hours in 60 minutes and those in 60 seconds.  Also, the 360 degrees of arc to complete a circle might have come from them too (or is it because of the 365 days in a year?  we might never know).

Nevertheless, ADW splits the data between 60 databases.  All the time, regardless of what you do.  It’s a constant.  It’s like $\Pi$.

I do not know the details around that decision but I guess it optimizes some criteria.

Those databases live on the compute nodes.  It is quite easy, now that you know there are 60 of those, to deduce the number of compute nodes from the provisioned Data Warehouse Units (DWUs) using my fantastic formula:  $\#nodes \times \#db\ per\ node = 60$.  We can assume that $DWU = \#nodes \times 100$, i.e. the lowest number of DWUs corresponds to 1 compute node.

| DWU | # Compute Nodes | # DB per node |
| --- | --- | --- |
| 100 | 1 | 60 |
| 200 | 2 | 30 |
| 300 | 3 | 20 |
| 400 | 4 | 15 |
| 500 | 5 | 12 |
| 600 | 6 | 10 |
| 1000 | 10 | 6 |
| 1200 | 12 | 5 |
| 1500 | 15 | 4 |
| 2000 | 20 | 3 |
| 3000 | 30 | 2 |
| 6000 | 60 | 1 |

That’s my theory anyway…  I do not have insider information on the product.  It would explain why we have those jumps as you go higher in the DWUs:  to spread the databases evenly among the compute nodes.

Here’s an example of an ADW instance with 1500 DWU (i.e. 15 compute nodes with 4 DBs each)

## Distribution

So the data you load in ADW is stored in 60 databases behind the scene.

Which data gets stored in which database?

As long as you are doing simple select on one table and that your data is distributed evenly, you shouldn’t care, right?  The query will flow to the compute nodes, they will perform the query on each database and the result will be merged together by the control node.

But once you start joining data from multiple tables, ADW will have to swing data around from one database to another in order to join the data.  This is called Data Movement.  It is impossible to avoid in general but you should strive to minimize it to obtain better performance.

Data location is controlled by the distribution attribute of your tables.  By default, tables are distributed in a round robin fashion:  data goes first to database 1 then 2, then 3…

You can somewhat control where your data goes by using the hash distribution method.  With that method you specify, when creating your table, that you want the hash algorithm to be used and which column to hash on.  What this guarantees is that data rows with the same hash column value will end up in the same database.  It doesn’t guarantee which database a given hash column value will end up in:  the exact hash algorithm isn’t published.

So, let’s look at a simple example of a round-robin distributed table:


CREATE TABLE [dbo].MyTable
(
CustomerID      INT            NOT NULL,
CustomerName    VARCHAR(32)    NOT NULL,
RegionID        INT            NOT NULL
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
)



Since round robin is also the default distribution, I could simply have omitted it:

CREATE TABLE [dbo].MyTable
(
CustomerID      INT            NOT NULL,
CustomerName    VARCHAR(32)    NOT NULL,
RegionID        INT            NOT NULL
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
)



And now with a hash algorithm:

CREATE TABLE [dbo].MyTable
(
CustomerID      INT            NOT NULL,
CustomerName    VARCHAR(32)    NOT NULL,
RegionID        INT            NOT NULL
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH(RegionID)
)



Here I specified I want the hash to be taken from the RegionID column.  So all customers within the same region will be stored in the same database.

So what have I achieved by making sure that customers from the same regions are stored in the same DB?  If I would want to obtain the sum of the number of customers per region, I can now do it without data movement because I am guaranteed that rows for a given region will all be in the same database.

Furthermore, if I want to join data from another table on region ID, that join can happen “locally” if the other table also has a hash distribution on the region ID.  Same thing if I want to group by region, e.g. summing something by region.

That is the whole point of controlling the distribution:  minimizing data movement.  It is recommended to use it with columns

1. That aren’t updated (the hash column can’t be updated)
2. That distribute data evenly, avoiding data skew
3. That minimize data movement

It is obviously a somewhat advanced feature:  you need to think about the type of queries you’re going to have and also make sure the data will be spread evenly.  For instance, here, if “region” represents a country and you primarily do business in North America, you just put most of your data in at most two databases (USA + Canada) out of 60:  not a good move.

It’s also worth noting that hash distribution slows down data loading.  So if you are only loading a table to perform more transformation on it, just use default round robin.

## Partition

Then you have partitions.  This gets people confused:  isn’t partition a piece of the distribution?  One of the databases?

No.

A partition is an option you have to help manage your data because you can very efficiently delete a partition in a few seconds despite the partition containing millions of rows.  That is because you won’t log a transaction for each row but one for the entire partition.

Also, for extremely large tables, having partitions could speed up queries using the partition key in their where clause.  This is because it would give ADW a hint to ignore all other partitions.  Partitions are stored separately, as if they were separate tables.

As a metaphor, you could consider a partitioned table as a UNION of normal tables ; so using the partition key in the where clause is equivalent to hitting one of the normal tables instead of the UNION, i.e. all tables.  In some scenario, that could provide some good speed up.

You need to have something big to make it worthwhile in terms of query speedup though.  ADW stores its data rows in row groups of up to a million rows.  So if your partitions are small, you just increase the number of row groups which will slow down your queries…  Again, imagine having lots of tables in a UNION.  A query against that would be quite slow.

Here is how I would partition my earlier table:

CREATE TABLE [dbo].MyTable
(
CustomerID      INT            NOT NULL,
CustomerName    VARCHAR(32)    NOT NULL,
RegionID        INT            NOT NULL
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH(RegionID),
PARTITION (
CustomerName RANGE RIGHT FOR VALUES
('E', 'L', 'Q', 'U')
)
)



I built on the previous example which had hash distribution.  But it could have been a round robin distribution.  Those two options (i.e. hash distribution & partitioning) are orthogonal.

It is important to understand that the 60 databases will have the same partitions.  You already have 60 partitions naturally with the 60 databases.  This is why you have to think about it wisely not to slow down your queries.

To visualize that, imagine my example with 5 partitions (4 boundaries means 5 partitions in total):

We end up with $60 \times 5 = 300$ partitions.  Is that a good thing?  It depends on the problem, i.e. the way I plan to manage my partitions and the queries being done against it.

## Summary

Here I tried to explain the different ways your data gets distributed around Azure Data Warehouse (ADW).

I didn’t get into the index & row groups, which is another level of granularity under partitions.

Hopefully that gives you a clearer picture of how each compute node accesses its part of your data (the data itself being in Premium blob storage and not collocated with compute), how you can control the distribution and how you could partition the data further.