# How does Azure Data Warehouse scale?

I’ve been diving in the fantastical world of Azure Data Warehouse (ADW) in the last couple of days.

I’ve been reading through all the documentation on Azure.com.  If you are serious about mastering that service I advise you do the same:  it is a worthy read.

In this article, I wanted to summarize a few concepts that are somehow interconnected:  MPP, distribution & partition.  Those concepts all define how your data is spread out and processed in parallel.

Let’s get started!

## Massively Parallel Processing (MPP)

Conceptually, you have one Control Node the clients interact with and it, in turns, interacts with a multitude of Compute Nodes.

The data is stored in Premium Blob storage and is therefore decoupled from the compute nodes.  This is why you can scale out, scale in or even pause your ADW quickly without losing data.

The control node takes a query in input, do some analysis on it before delegating the actual compute to the control nodes.  The control nodes perform their sub queries and return results to the control node.  The control takes the results back, assemble it and return it to the client.

You can tune the number of compute nodes indirectly by requesting more Data Warehouse Unit (DWU) on your instance of ADW.  DWUs were modelled about the DTUs from Azure SQL Databases.

Cool?  Now let’s dive into how the data and compute are actually split out between the nodes.

## As in Babylon, they were 60 databases

Apparently Babylonians had quite a kick at the number 60 and some of its multiples, such as 360.  This is why we owe them the subdivision of the hours in 60 minutes and those in 60 seconds.  Also, the 360 degrees of arc to complete a circle might have come from them too (or is it because of the 365 days in a year?  we might never know).

Nevertheless, ADW splits the data between 60 databases.  All the time, regardless of what you do.  It’s a constant.  It’s like $\Pi$.

I do not know the details around that decision but I guess it optimizes some criteria.

Those databases live on the compute nodes.  It is quite easy, now that you know there are 60 of those, to deduce the number of compute nodes from the dedicated Data Warehouse Unit (DWU)using my fantastic formula:  $\#nodes \times \#db per node = 60$.  We can assume that $DWU = \#nodes \times 100$, i.e. the lowest number of DWU corresponds to 1 compute node.

DWU # Compute Nodes # DB per node
100 1 60
200 2 30
300 3 20
400 4 15
500 5 12
600 6 10
1000 10 6
1200 12 5
1500 15 4
2000 20 3
3000 30 2
6000 60 1

That’s my theory anyway…  I do not have insider information in the product.  It would explain why we have those jumps as you go higher in the DWUs:  to spread evenly the databases among the compute nodes.

Here’s an example of an ADW instance with 1500 DWU (i.e. 15 compute nodes with 4 DBs each)

## Distribution

So the data you load in ADW is stored in 60 databases behind the scene.

Which data gets stored in which database?

As long as you are doing simple select on one table and that your data is distributed evenly, you shouldn’t care, right?  The query will flow to the compute nodes, they will perform the query on each database and the result will be merged together by the control node.

But once you start joining data from multiple tables, ADW will have to swing data around from one database to another in order to join the data.  This is called Data Movement.  It is impossible to avoid in general but you should strive to minimize it to obtain better performance.

Data location is controlled by the distribution attribute of your tables.  By default, tables are distributed in a round robin fashion:  data goes first to database 1 then 2, then 3…

You can somewhat control where your data will go by using the hash distribution method.  With that method, you specify, when creating your table, that you want the hash algorithm to be used and which column to use.  What this guarantees is that data rows with the same hash column value will end up in the same table.  It doesn’t guarantee that any two hash column value will end up in the same database:  the exact hash algorithm isn’t published.

So, let’s look at a simple example of a round-robin distributed table:


CREATE TABLE [dbo].MyTable
(
CustomerID      INT            NOT NULL,
CustomerName    VARCHAR(32)    NOT NULL,
RegionID        INT            NOT NULL
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
)



Since round robin also is the default distribution, I could have simply omit to specify it:

CREATE TABLE [dbo].MyTable
(
CustomerID      INT            NOT NULL,
CustomerName    VARCHAR(32)    NOT NULL,
RegionID        INT            NOT NULL
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
)



And now with a hash algorithm:

CREATE TABLE [dbo].MyTable
(
CustomerID      INT            NOT NULL,
CustomerName    VARCHAR(32)    NOT NULL,
RegionID        INT            NOT NULL
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH(RegionID)
)



Here I specified I want the hash to be taken from the RegionID column.  So all customers within the same region will be stored in the same database.

So what have I achieved by making sure that customers from the same regions are stored in the same DB?  If I would want to obtain the sum of the number of customers per region, I can now do it without data movement because I am guaranteed that rows for a given region will all be in the same database.

Furthermore, if I want to join data from another table on region ID, that join can happen “locally” if the other table also has a hash distribution on the region ID.  Same thing if I want to group by region, e.g. summing something by region.

That is the whole point of controlling the distribution:  minimizing data movement.  It is recommended to use it with columns

1. That aren’t updated  (hash column can’t be updated)
2. Distribute data evenly, avoiding data skew
3. Minimize data movement

It is obviously a somewhat advanced feature:  you need to think about the type of queries you’re gona have and also make sure the data will be spread evenly.  For instance, here, if “region” represents a country and you primarily do business in North America, you just put most of your data in at most two databases (USA + Canada) over 60:  not a good move.

It’s also worth noting that hash distribution slows down data loading.  So if you are only loading a table to perform more transformation on it, just use default round robin.

## Partition

Then you have partitions.  This gets people confused:  isn’t partition a piece of the distribution?  One of the databases?

No.

A partition is an option you have to help manage your data because you can very efficiently delete a partition in a few seconds despite the partition containing millions of rows.  That is because you won’t log a transaction for each row but one for the entire partition.

Also, for extremely large tables, having partitions could speed up queries using the partition key in their where clause.  This is because it would give ADW a hint to ignore all other partitions.  Partitions are stored separately, as if they were separate tables.

As a metaphor, you could consider a partitioned table as a UNION of normal tables ; so using the partition key in the where clause is equivalent to hitting one of the normal tables instead of the UNION, i.e. all tables.  In some scenario, that could provide some good speed up.

You need to have something big to make it worthwhile in terms of query speedup though.  ADW stores its data rows in row groups of up to a million rows.  So if your partitions are small, you just increase the number of row groups which will slow down your queries…  Again, imagine having lots of tables in a UNION.  A query against that would be quite slow.

Here is how I would partition my earlier table:

CREATE TABLE [dbo].MyTable
(
CustomerID      INT            NOT NULL,
CustomerName    VARCHAR(32)    NOT NULL,
RegionID        INT            NOT NULL
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH(RegionID),
PARTITION (
CustomerName RANGE RIGHT FOR VALUES
('E', 'L', 'Q', 'U')
)
)



I built on the previous example which had hash distribution.  But it could have been a round robin distribution.  Those two options (i.e. hash distribution & partitioning) are orthogonal.

It is important to understand that the 60 databases will have the same partitions.  You already have 60 partitions naturally with the 60 databases.  This is why you have to think about it wisely not to slow down your queries.

To visualize that, imagine my example with 5 partitions (4 boundaries means 5 partitions in total):

We end up with $60 \times 5 = 300$ partitions.  Is that a good thing?  It depends on the problem, i.e. the way I plan to manage my partitions and the queries being done against it.

## Summary

Here I tried to explain the different ways your data gets distributed around Azure Data Warehouse (ADW).

I didn’t get into the index & row groups, which is another level of granularity under partitions.

Hopefully that gives you a clear picture of how which compute node access which part of your data, the data itself being in Premium blob storage and not collocated with compute, how you can control its distribution and how you could partition it further.

# Disaster Recovery with Azure Virtual Machines

I have had a few conversations lately about Disaster Recovery in Azure where Virtual Machines are involved.  I thought I would write this article to summarize the options I recommend.

We should really call the topic Business Continuity or Resiliency to Azure Region Service Disruption but traditionally, people call this Disaster Recovery and you find this wording in many deployment requirements, so I guess I have to continue the tradition here.

You can find many of the ideas & patterns exposed here in the Resiliency Technical Guidance document and more specifically the Recovery from a region-wide service disruption document.

## Context

First, when you discuss Disaster Recovery plan & more generally resilience to failures, you have to setup the context.

Azure is an Hyper Scale Public Cloud provider.  It isn’t your boutique web provider or your corner shop hosting facility.  I say that because some customers I talk to are used to be involved or at least informed when an hard drive blows, a network line is cut or something of the like.  Azure is a software defined platform and is extremely resilient to hardware failures.  If an hard drive fails on one of your service, you would need to perform some very close monitoring to be aware of the failure and recovery.  The Azure Fabric monitors services health and when a service becomes unhealthy, it gets shutdown and redeploy elsewhere.

So the typical disaster do occur, and probably more than in traditional facility since Azure (like every public cloud provider) uses consumer-grade hardware that can fail at any time (i.e. no redundancy built in).  The resiliency comes from the Azure Fabric, i.e. the software defined layer.  And so the typical disaster do occur but you won’t be affected by it.

What does happen in Azure is Service disruption.  Sometimes those are due to some hardware failures, but most of the time, they are a software problem:  an upgrade in Azure software (Microsoft’s software) gone wrong.  They happen occasionally, are typically short lived but if business continuity is a hard requirement, those are the ones you should protect your solution against.

In this article I’ll cover resilience for virtual machines.  I’ll assume that you are using Azure Traffic Manager or other DNS service to fail over the traffic from a primary to a secondary region.

So here the point is to fail over a solution from a Primary Azure Region to a replica in a Secondary Azure Region with as little business disruption as possible.

I’ll also use the following two concepts:

• Recovery Time Objective (RTO):  maximum amount of time allocated for restoring application functionality
• Recovery Point Objective (RPO):  acceptable time window of lost data due to the recovery process

Those are important and will vary the complexity and cost of your solution.

## False Friends

Out of the gate, lets clear out a few services that you can sometimes use for fail over but do not work for most complex cases.

### Geo Replicated Storage (GRS)

The typical starting point is to have your VM hard drives site in a Read-Only Access Geo Redundant Storage (RA GRS).  That should fix everything, right?  The hard drives of the VMs are replicated to a secondary region automatically, so we’re good, right?

Well…  there are many caveats to that statement.

First, you need to have Read-Only Access GRS and not only GRS storage.  Pure GRS is replicated but isn’t readable unless the primary region is declared “lost” (to the best of my knowledge that never happened).  So this is no good for temporary service disruption.

Second, RA GRS is Read-Only, so you would need to copy the virtual hard drive blobs to another storage account in the secondary region in order to attach them to a VM and fail over.  That will hit your RTO.

Third, replication with GRS is done asynchronously & doesn’t have any SLA on the time it takes to replicate blobs.  It typically takes under an hour and often much less than that, but you have no guarantees.  If you have a stiff and short RPO, that’s going to be a problem.

Fourth, GRS replicates each blob, hence each hard drive, independently.  So if you have a VM with more than one hard drive per VM, chances are you are going to get a corrupted image at the secondary site since you’ll have VM hard drives from different points in time.  Think of a disk for log files and another for DB files coming from two different point in time.

Fifth, GRS replicates between paired regions:  you can’t choose towards which region data to replicate.  Paired regions are documented here and are within a unique geopolitical region, i.e. a region where laws about data sovereignty are about the same.  That usually is ok for disaster recovery but if it isn’t for your use case, this is another reason why GRS isn’t for you.

For all those reasons, RA GRS rarely is a solution all by itself for failing over.  It can still be used as a component of a more complex strategy and for very simple workloads, e.g. one-HD VMs with loose RPO and long enough RTO, it could be used.

GRS use case is to decrease the likelihood of pure data loss.  It does it very well.  But it is designed to quickly and flexibly replicate data anywhere.

### Azure Site Recovery (ASR)

Azure Site Recovery, as of this date (early July 2016), doesn’t support Azure-to-Azure scenario.  ASR main scenario is to replicate an on premise workload, on either Hyper-V or VM ware towards Azure.

Very powerful solution but as of yet, it won’t fail over an Azure solution.

### Azure Backup

Azure Backup seems to be your next best friend after the last two cold showers, right?

Azure Backup, again a great solution, is a backup solution within a region.  It’s the answer to:  “how can I recover (roll back) from a corruption of my solution due to some faulty manual intervention”, or what I call the “oups recovery”.  Again, great solution, works very well for that.

You have the possibility to backup to a Geo Redundant vault.  That vault, like a GRS storage account, isn’t accessible until the primary region has completely failed.  Again, to the best of my knowledge, that never happened.

## Some hope

There is hope, it’s coming your way.

We are going to separate the solutions I recommend in two broad categories:  stateless & stateful VMs.

The best example of a stateless VM is your Web Server:  it runs for hours but beside writing some logs locally (maybe), its state doesn’t change.  You could shut it down, take a copy of the drives from 3 days ago and start it, it would work and nobody would notice.

A stateful VM is the opposite.  A good example?  A database.  Its state changes constantly with transactions.

## Stateless VMs

Stateless VMs are easier to deal with since you do not have to replicate constantly but just when you do changes to your VM.

### Stateless 1 – Virtual Machine Scale Sets

If your VMs are stateless, then the obvious starting point is not to manage VMs at all but manage a recipe on how to build the VMs.  This is what Virtual Machine Scale Sets are about.

You describe a scale set in an ARM Template, adding some Desired State Configuration (DSC), PowerShell scripts, Puppet or Chef scripts and bam!  you can have as many replica of those VMs as you want, on demand.

Scale Sets, as their name suggest, are meant to auto scale, but you could also use it to quickly spin up a disaster recovery site.  Simply scale them from zero to 2 (or more) instances and you have your stateless VMs ready.

The drawbacks of that solution:

• Many custom solutions aren’t automated enough to be used within a Scale Set without major effort
• Provisioning VMs under scale set can take a few minutes:  on top of booting the VMs, Azure first need to copy an Azure Marketplace (AMP) image as the starting point and then execute various configuration scripts to customize the VM.

In the case the second point is a show stopper for you (e.g. you do have a very short RTO), there are a few ways you could still use scale sets.  What you probably need is a hot standby disaster site.  In that case, let 2 VMs run in the scale set at all time and scale it to more if you really fail over.

### Stateless 2 – Copy VHDs

For solutions that haven’t reached the automation maturity required to leverage scale sets or where operational requirements do not allow it, you can continue to manage VM as in the good old days.

The obvious solution attached to this practice is to copy the VM Hard Drive blob files to the secondary regions.

This solution usually sits well with the operational models handling custom images (as opposed to automated ones):  maintenance are performed on VMs in orderly manner (e.g. patches) and once this is done, the VHDs can be copied over to the secondary storage account.

A detail remains:  how do you copy the hard drives?  I would recommend two options:

1. Shutdown (get a clean copy)
• Shutdown the VM
• Copy VHD blobs
• Reboot the VM
• Go to the next VM
2. Snapshots

Blob snapshot is a powerful feature allowing you to create a read-only version of a blob in-place that is frozen in time.  It’s quick because it doesn’t copy the entire blob, it simply keeps it there and start saving deltas from then.

There are two challenges with snapshots.  First you need to snapshot all the hard drives of one VM at about the same time ; not impossible, but it requires some thoughts.  Second, this will give you a Crash Consistent image of your system:  it’s basically as if you pulled the cable on the VM and shut it down violently.

Depending on what workload is running on the VM, that might not be a good idea.  If that’s the case or you aren’t sure if it is, use the shut-down / copy / reboot approach:  it’s more hassle, but it guarantees you’ll have a clean copy.

## Stateless 3 – Capture the image

If the other methods failed, you might want to create an image out of your VMs.  This means a generalized VM that once you boot, you’ll assign a new identity (name).

For that you need to capture the image of an existing VM.  You could then copy that over to the secondary location, ready to be used to create new VMs.

You could actually use that in conjunction with the Scale Set approach:  instead of starting from an Azure Marketplace (AMP) image, you could start with your own and do very little configuration since your image contains all the configuration.

## Stateful VMs

As stated before, stateful VMs are different beasts.  Since their state evolves continuously, you need to backup the state continuously too.

### Stateful 1 – Regular backups

Here you can use any of the methods for stateless VMs in order to backup the data.

Typically you’ll want to segregate the disks containing the transactional data from the other disks (e.g. OS disks) in order to backup only what changes.

You can then do copies at regular intervals.  The size of the intervals will define your RPO.

The major weakness of this method is that since you will copy whole data disks (as opposed to deltas), this will take time & therefore limit the RPO you can achieve.  It will also increase the cost of the solution with both the storage operation but with inter-region bandwidth too.

### Stateful 2 – Application-based Backup (e.g. DB Backup)

This solution will sound familiar if you’ve operated databases in the past.

Segregate a disk for backups.  That disk will only contain backups.  Have your application (e.g. database) backup at regular intervals on that disk.  Hopefully, you can do incremental backups to boost efficiency & scalability so you do not backup the entire state of your system every single time.  So this should be more efficient than the previous option.

Then you have a few options:

1. You can copy the entire hard drive to another storage account in the secondary Azure region ; of course, doing so reduces your scalability (RPO) since that is likely a lot of data.
2. You can have that VHD sit in a GA-GRS account.  This will replicate only the bits that change to the secondary location.  Again, there are no SLA on when this will occur, but if you have a loose RPO, this could be ok.
3. You could AzCopy the backup / incremental backup files themselves to the secondary site.

The advantage of this approach is that you do not need any VM to be up in the secondary Azure region, hence reducing your costs.

The main disadvantage is that you go with backups and therefore limits the RPO you can achieve.

### Stateful 3 – Application-based Replication (e.g. DB replication)

Instead of using the application to backup itself, use the replication mechanism (assuming there is one) to replicate the state to the secondary Azure region.

Typically that would be asynchronous replication not to impact the performance of your system.

This approach lets you drive the RPO down with most systems.  It also drives the cost up since you need to have VMs in the secondary region running at all time.

There are a few ways you could reduce that costs:

1. If your replication technology can tolerate the replica to be offline once in a while, you could live with only one VM in the secondary site.  This would half the compute cost but also drop the SLA on the availability of the VMs in the secondary site.  Make sure you can live with that risk.
2. You could have your replica run on smaller (cheaper) VMs.  Only when you failover would you shutdown one of the VM, upgrade it to the proper size, then reboot it and do the same with the other one (or ones).  If your system can run with slow stateful VMs for a while, this could be acceptable.

## Conclusion

In this article we covered the scenario of surviving an Azure Service disruption to keep your application running.

We did analyse many alternatives and there are still many sub-alternatives you can consider.  The main thing is that typically, RTO/RPO are inversely proportional to costs.  So you can walk from one alternative to the other to change the RTO and / or RPO and / or costs until you find the right compromise.

Remember to first consider the risk of a service outage.  Depending on which environment you are coming from, Azure’s outage risk might not be a problem for your organization since they seldomly happen.

# DocumentDB protocol support for MongoDB

Microsoft announced, in the wake of many DocumentDB announcement, that DocumentDB would support MongoDB protocol.

What does that mean?

It means you can now swap a DocumentDB for a MongoDB and the client (e.g. your web application) will work the same.

This is huge.

It is huge because Azure, and the cloud in general, have few Databases as a Service.

Azure has SQL Database, SQL Data Warehouse, Redis Cache, Search & DocumentDB.  You could argue that Azure Storage (Blob, Table, queues & files) is also one.  HBase under HDInsight could be another.  Data Lake Store & Data Lake Analytics too.

Still, compare that to any list of the main players in NoSQL and less than 10 services isn’t much.  For all the other options, you need to build it on VMs.  Since those are database workloads, optimizing their performance can be tricky.

MongoDB is a leader in the document-oriented NoSQL databases space.

With the recent announcement, this means all MongoDB clients can potentially / eventually run on Azure with much less effort.

And this is why this is a huge news.

## A different account

For the time being, DocumentDB supports MongoDB through a different type of DocumentDB account.

You need to create your DocumentDB account as a DocumentDB – Protocol Support for MongoDB.

You’ll notice the portal interface is different for such accounts.

You can then access those accounts using familiar MongoDB tool such as MongoChef.

But you can still use DocumentDB tools to access your account too.

## Summary

In a way you could say that Azure now has MongoDB as a Service.

A big caveat is that the protocol surface supported isn’t %100.  CRUDs are supported and the rest is prioritized and worked on.

Yet, the data story in Azure keep growing.

UPDATETo get started, check out:

# Recreating VMs in Azure

In this article I’m going to explain how to destroy VMs, keep their disks on the backburner and re-create them later.

Why would you do that?

After all, you can shut down VMs and not be charged for it.  You can later restart them and incur the compute cost only once it’s started.  So why destroy / recreate VMs?

In general, because you get an allocation failure.  I did touch upon that in a past article.

Azure Data Centers are packaged in stamps (group of physical servers).  A dirty secret of Azure is that your VM stays in a stamp even if you shut it down.  So if you need to move your VM, you need to destroy it and recreate it.

Two typical use cases:

• You want to change the size of your VM to a size unsupported by its current stamp (e.g. D-series stamp often do not support G-series VMs)
• You want to restart your VM after a bit of time and the stamp is full so you can bring your VM in

Frankly, this is all sad and I wish it will be addresses in a future service update.  In the mean time, we do need a way to easily destroy / recreate VMs.

Before I dive into it, there is of course no magic and you have to make sure the Azure region you deploy in support the service you’re trying to deploy.  Many services aren’t available everywhere (e.g. G-series VMs, Cortana Analytics services, etc.).  Consulting the list of services per region to validate your choice.

The technique I show here is based on a my colleague’s article.  Alexandre published that article before the Export Template feature was made available.  This article will therefore use ARM templates generated by Export Template in order to resuscitate VMs.

An alternative approach would be to use PowerShell scripts to recreate the VMs.

Also I based this article on the deployment of a SQL Server Always on Cluster (5 VMs).

## Exporting current state

I assume you have a running environment within an Azure Resource Group.

The first step is to export your Azure Resource Group as an ARM Template.  Refer to this article for details.

Save the JSON template somewhere.

## Deleting VMs

You can now delete your VMs.  You might want to hook that to a Run Book as I did in Shutting down VMs on schedule in Azure (e.g. to shut down the VMs frequently) or might just do it once (e.g. to resize VM once).

One way to do it, with PowerShell, is to use the following command:


Get-AzureRmVM | Where-Object {$_.ResourceGroupName -eq "NAME OF YOUR RESOURCE GROUP"} | Select-Object Name, ResourceGroupName | ForEach-Object {Remove-AzureRmVM -ResourceGroupName$_.ResourceGroupName -Name \$_.Name -Force}



Here I basically list the VMs within one resource group and delete them.

This will not delete the disks associated to them.  It will simply shut down the compute and de-allocate the compute resource.  Everything else, e.g. networks, subnets, availability sets, etc. stay unchanged.

If you’re afraid to delete VMs you didn’t intend to delete you can call the Remove-AzureRmVM explicitly on each VM.

Export Template does the heavy lifting.  But it can’t be used as is.

We have to make several changes to the template before being able to use it to recreate the VMs.

1. Remove SecureString password parameters.  We will remove all reference to those so you shouldn’t need it.  The reason is the admin password of your VM is stored on disks and will restored with those disks.  This isn’t essential, but it will avoid you being prompted for password when you’ll run the template.
2. Change all the createOption attributes to Attach.  This tells Azure to simply take the disks in storage as opposed to creating a disk from a generalized image.
3. Just next to the createOption for the os-disk, add a osType attribute of value either Windows or Linux.
4. Remove (or comment out) a few properties:
• imageReference:  that is under storageProfile
• osProfile:  that is after storageProfile
• diskSizeGB:  that is under each dataDisks

Here’s how the osDisk property should look after modifications

"osDisk": {
"name": "osdisk",
"osType": "Windows",
"createOption": "Attach",
"vhd": {



After this little massage of the ARM template you should be able to run the template and recreate your VMs as is.

In some cases you might want to modify the ARM template some more.  For instance, by changing the size of VMs.  You can do this now.

### Caveat around availability set

There are some funny behaviours around availability sets and VM size I found while writing this article.

One thing is that you can’t change the availability set of a VM (as of this writing).  So you need to get it right the first time.

Another is that a load balancer needs to have VMs under the same availability set.  You can’t have two or VMs without availability sets.

The best one is that you can’t have an availability set with VMs of different size families.  In the case I used, i.e. the SQL Always on cluster, the SQL availability set has two SQL nodes and one witness in it.  The original template let you only configure those as D-series.  You can’t change them to G-series later on and this is one of the reason you’ll want to use the technique laid out here.  But…  even then, you can’t have your witness as a D-series and the SQL nodes as G-series.  So you need to have at least a GS-1 as the witness (which is a bit ridiculous considering what a witness does in a SQL cluster).

That last one cost me a few hours so I hope reading this you can avoid wasting as much on your side!

## Running the template

You can then run the template.

My favorite tool for that is Visual Studio but you can do it directly in the portal (see this article for guidance).

## Conclusion

Destroying / recreating your VMs will ensure you to have a more robust restart experience.

It also allows you to get around other problems such as re-configuring multiple VMs at once, e.g. changing the size of all VMs in an availability set.

Thanks to Export Template feature, it isn’t as much work as it used to be a few months ago.

# Azure Export Template – Your new best friend

Azure Resource Manager (ARM) basically is Azure Infrastructure version 2.0.  It has been released for about a year now, although not all Azure Services have caught up yet.

With ARM comes ARM templates.  An ARM template is a description of a group of resources (and their dependencies) in JSON.  It’s a powerful mechanism to deploy resources in Azure and replicate environments (e.g. ensuring your pre-prod & prod are semi-identical).

Up until a few months ago, the only way to create an ARM template was to either build it from scratch or modify an existing one.  You can see examples of ARM templates in the Azure Quickstart Templates.

Enter Export Template.

If you have authored ARM templates, you know this can be a laborious process.  The JSON dialect is pretty verbose with limited documentation and each iteration you trial involves deploying Azure resources which isn’t as fast as testing HTML (to put it lightly).

A more natural workflow is to create some resources in an Azure Resource Group, either via the portal or PowerShell scripts, and then have Azure authoring a template corresponding to those resources for you.

This is what Export Template does for you.

Open your favorite Resource Group and check at the bottom of the settings:

When you click that option, you’ll get a JSON template.  That template, running on an empty Resource Group would recreate the same resources.

One nice touch of that tool is that it infers parameters for you.  That is, knowing the resources you have, it figures out what attribute would make sense in parameters.  For example, if you have a storage account, since the name needs to be globally unique (across all subscriptions in Azure, yours & others), it would make sense you do not hardcode the name, so it would put it as a parameter with the current value as a default.

## Conclusion

As usual, there is no magic ; so for services not yet supported in ARM, this tool won’t give you a template for those.  It will warn you about it though.

For all supported scenarios, this is a huge time saver.  Even if you just use it as a starting point and modify it, it’s much faster than starting from scratch.

Remember that the aim of an ARM template is to describe an entire Resource Group.  Therefore the Export Template is a Resource Group tool (i.e. it’s in the menu of your Resource Group) and it wraps all the resources in your group.

# Training a model to predict failures

Today a quick entry to talk about a twist on Machine Learning for the predictive maintenance problem.

The Microsoft Cortana Intelligence team wrote an interesting blog the other day:  Evaluating Failure Prediction Models for Predictive Maintenance.

When you listen to all the buzz around Machine Learning, it sometimes feels as if we’ve solved all the ML problems and you just need to point your engine in the general direction of your data for it to spew some insights.  Not yet.

That article highlights a challenge with predictive maintenance.  Predictive maintenance is about predicting when a failure (of a device, hardware, process, etc.) will occur.  But…  you typically have very few failures:  your data set is completely imbalanced between success samples and failure samples.

If you go heads on and train a model with your data set, a likely outcome is that your model will predict success %100 of the time.  This is because it doesn’t get penalized that much by ignoring the few failures and gets rewarded by not missing any success.

Remember that Machine Learning is about optimizing a cost function over a sample set (training set).  I like to draw the comparison with humans on a social system with metrics (e.g. bonus in a job, laws in society, etc.):  humans will find any loophole in a metric in order to maximize it and rip the reward.  So does Machine Learning.  You therefore have to be on the lookout for those loopholes.

With predictive maintenance, failures often cost a lot of money, sometimes more than a false positive (i.e. a non-failure identified as a failure).  For this reason, you don’t want to miss failures:  you want to compensate for the imbalance in your data set.

The article, which I suggest you read in full when you’ll need to apply it, suggests 3 ways to compensate for the lack of failures in your data set:

1. You can resample the failures to increase their occurrence ; for instance by using the Synthetic Minority Oversampling Technique (SMOTE) readily available in Azure ML.
2. Tune the hyper-parameters of your model (e.g. by using a parameter sweep, also readily available in Azure ML) to optimize for recall.
3. Change the metric to penalize false negative more than false positive.

This is just one of the many twists you have to think of when creating a predictive model.

My basic advice:  make sure you not only ask the right question but with the right carrot & stick   If a failure, to you, costs more than a success identified as a failure (i.e. a false positive), then factor this in to your model.

# How does criticism hit your brain?

Everybody loves a critic, right?  How do you give feedback to somebody and be effective, i.e. without your message getting deflected on their “shield”?

An area I found especially hard to deliver constructive feedback is presentation / public speaking skills.  Criticizing the way somebody speaks or organizes his / her thoughts often hits very close to home.

My experience is that getting feedback on presentations I do is though too.  Believe me:  it’s not because of lack of material as I’m very far from perfection!  Some people can’t articulate what you do wrong, most do not dare in fear of offending you.

Last week I witnessed a colleague giving feedback on somebody’s presentation.  I found him especially effective…  but I didn’t quite understand why.  He was positive and uplifting while still suggesting ways to improve.  Beyond that I couldn’t see how I could replicate.  The following day I read an article from Fast Company discussing criticism that explained exactly what my colleague was doing right.  So I thought I would share it here:  Why Criticism Is So Tough To Swallow (And How To Make It Go Down Easier) by Caroline Webb.

## Form & Content

There are two parts about delivering criticism:

• The form, which I’m going to discuss here
• The content:  what do you see in somebody’s presentation that could be improved?

The second part requires experience in order to go beyond trivialities (e.g. we couldn’t hear you).  Unless you have a job where you are presenting & seeing colleagues present all the time, you won’t get experience quickly.  If that’s your case, I would suggest joining public speaking clubs (e.g. Toastmasters).

## The Sandwich

The author mentions the good old praise sandwich.  If you aren’t familiar with the meal, let me get you two introduced.

Basically, it’s a ham sandwich where you replace the bread by praise and the ham by improvement suggestions (criticism).

I’ve been handed my fair share of that sandwich and I did cook a few myself.

Nothing could go wrong, right?  You manage the feelings of the other person by throwing praises for intro & conclusion, keeping the negative in the middle.

It is actually a good way to structure feedback.  The problem she points out is that many people keep the praise generic and the criticism specific.

The example she gave hit me because I’ve done a very similar one recently:  “your speech was great, I would do X differently, would improve Y this way and would consider doing Z, otherwise, fantastic speech!”

Why is this so ineffective?

## Criticisms = Threat

Your ancestors were hunted by big animals.  They survived.

Why?  Because they were very alert about threats around them.  So alert that those threats were registering at an unconscious level, even when they were not looking for it.

They passed their genes to you, so your brain is constantly on the lookout for threats.  There are no wolves around anymore so it settles for other threats…  like criticism.

This is why we are very sensitive to criticism.  Criticism, socially, is dangerous.

By making criticism specific, a feedback raises the level of stress of the person you are delivering the feedback to.  They become defensive.  They stop listening.

## Praises need to be concrete to be efficient

We love praises.  Praises are social rewards.  That’s good.

What we love even better is specific praises.

Our brain responds better to concrete ideas than abstract ones, despite all the linear algebra we did in college.

So when we say something like “you’re great but your vocabulary is repetitive”, the threat takes over the generic praise.

## A better way to give feedback

So the better way the author suggests is:

• Give specific praises:  give examples of what the person did well and expand on why you liked it.  “I really liked the way you started with a joke, you paused to let people laugh, you smiled, so you connected with the crowd, then you kicked in the topic”.
• Build on the praises to suggest improvements:  “What could make it even more effective is if you would transition between sections by taking a pause of few seconds to let the last section sink and allow people top breath and be ready for more”.

Basically, you amplify your praise and you downplay your criticism, but in form, not in content

Actually, you don’t criticize, you suggest improvement on what’s already there.  It isn’t political correctness like calling a massive layoff “right sizing”.  The goal is different:  you don’t underline what’s missing, you suggest what could be done to make the whole better.

## Summary

I really encourage you to read the original article.  It is a good read.

Now, something I would add is to go easy on criticism.  The truth of the matter is that people change very slowly.  So if you hit them with 8 criticisms, beside having them crying in fetal position in their shower, you won’t accomplish much.  It’s a long term game.  You give a few pointers at a time for them to improve a little at the time.

I took a badminton class when I was at the university and one day the teacher came in with a video camera.  He said something very wise, in the line of:

“A camera is a powerful tool to analyze the style of a player and it is used more and more in sports.  You have to be careful though.  A camera sees everything and you can destroy an athlete with it.”

When you look at somebody’s performance, be it a presentation or any kind of work that they do, you are the camera.  Be lenient.

But do give feedback.  Because you likely see something the other person doesn’t see.  Delivered appropriately, feedback is gift!