Azure Runbook – A complete (simple) example

I have been meaning to write about Azure Runbooks (also known as Azure Automation) for quite a while.

I had the chance to be involved in the operations of a solution I helped architect.  Once you get beyond a trivial Azure solution, just as on-premises, you'll want some automation.  For instance, you'll want to:

  • Clean up data at the end of the day
  • Probe a few services for health check
  • Execute some data batch
  • etc.

Azure is already mature enough that you could do this with other technologies, for instance a mix of Scheduler and Web Jobs.  But those approaches are a little complicated for PowerShell automation and not ideal for long-running workflows.

Azure Automation is more appropriate for those scenarios.

The example

I’ll give a simple example here.  We’ll build an automation scanning a blob container and deleting all the blobs matching a certain name pattern.  That job will run every hour.

I'll construct that using a PowerShell workflow.  I won't go into the graphical tool yet, nor will I create a custom PowerShell module.  As you'll see, the script only takes a few lines.  Such a simple workflow doesn't mandate modules or graphical workflows in my opinion.

Creating a Resource Group

I won’t go into ARM templates but we’ll build this example into a Resource Group so at the very least, you’ll be able to destroy all artefacts in one go at the end (by destroying the Resource Group).

So let's go to the Preview Portal and create a new Resource Group.  On the home page, select Resource groups.


Then select Add.


This should pop up the following blade.


As Resource Group Name, type SampleAutomations.

Select the Subscription you want to use.

Pick the location that is most convenient for you.

Then click on the Create button at the bottom of the blade.

Creating Automation Account

Let’s create an Automation Account.


Give it a unique name (I used myfirstautomation), ensure it is in the resource group we created and in a suitable region (not all regions are supported yet) and click the Create button.

Exploring Automation Account

Let’s open the newly created account.


Runbooks are PowerShell workflows.  In a nutshell, those are a mix of PowerShell scripts and Workflow Foundation (WF) workflows.  They allow long-running workflows, pauses, restarts, etc.  You already have a runbook: the tutorial runbook.  You can take a look at it.
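For illustration, here is a minimal sketch (not the runbook we'll build) of what such a workflow looks like; the Checkpoint-Workflow activity is what enables the pause-and-resume behaviour:

workflow Get-SampleStatus
{
    "Doing some long running work..."

    # Persist the workflow state; if the job gets suspended or the worker restarts,
    # execution resumes from here instead of starting over
    Checkpoint-Workflow

    "Resuming after the checkpoint..."
}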

Assets come in different forms:

  • Schedules
  • Modules
  • Certificates
  • Connections
  • Variables
  • Credentials

We are going to use a schedule to run our runbook.  We are also going to use variables to store configuration about our runbook.

Creating Storage Account

Before we create our runbook, we need a storage account.

We’re going to create a storage account within the Resource Group we’ve created.  Click the plus button at the top left of the portal.


Select Data + Storage then select Storage Account.


Then at the bottom of the Storage Account pane, select “Resource Manager” and click Create.

Name the account something unique (I used mysample2015).

In Resource Group, make sure to select the resource group you just created.  Make sure the location suits you and click Create.


Creating Storage Container

Using your favorite Azure Storage tool (I used CloudXplorer), create a container named my-watched-container.

For the runbook to access the container, we'll use a Shared Access Signature (SAS) token.  Whenever you can, use the access mechanism granting as little access as possible.  This way, if your assets get compromised, the attacker can do less damage than if you had stuck the keys to the castle in there.  This is the least-privilege principle and you should always apply it.

So, for that newly created container, create a SAS token allowing for listing and deleting.  This is what our runbook will do:  list the blobs, delete the ones matching a certain pattern.
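If you prefer scripting over a storage tool, here is a sketch using the Azure Storage cmdlets; the account key is a placeholder you would grab from the portal, and the one-year expiry is an arbitrary choice:

# Key-based context, only needed once to create the container and issue the SAS token
$key = "<storage account key>"
$adminContext = New-AzureStorageContext -StorageAccountName "mysample2015" -StorageAccountKey $key

# Create the container, then a SAS token limited to list & delete
New-AzureStorageContainer -Name "my-watched-container" -Context $adminContext
New-AzureStorageContainerSASToken -Name "my-watched-container" -Permission "ld" -ExpiryTime (Get-Date).AddYears(1) -Context $adminContext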

Creating Variables

Let's create the variables for our runbook.

Go back to the automation account, select Assets, then Variables, then Add a variable.

Give it accountName as the Name, leave the default String type and, for the value, input the name of the storage account you created.  Then click Create.


Do the same for the following:

  • containerName:  my-watched-container
  • pattern:  draft
  • sas:  the value of the SAS token you created for your container (it should start with the question mark of the query string)

For the last one, select the encrypted option.


This will make the variable inaccessible to operators in the future.  It’s an added level of security.
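If you'd rather script the variable creation, the classic Azure Automation cmdlets can do the same thing; this is a sketch assuming your automation account is visible to that module, and the SAS value is a placeholder:

New-AzureAutomationVariable -AutomationAccountName "myfirstautomation" -Name "accountName" -Value "mysample2015" -Encrypted $false
New-AzureAutomationVariable -AutomationAccountName "myfirstautomation" -Name "containerName" -Value "my-watched-container" -Encrypted $false
New-AzureAutomationVariable -AutomationAccountName "myfirstautomation" -Name "pattern" -Value "draft" -Encrypted $false

# The SAS token is sensitive, so we encrypt that one
New-AzureAutomationVariable -AutomationAccountName "myfirstautomation" -Name "sas" -Value "<your SAS token>" -Encrypted $true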

You should have the following variables defined.


Creating Runbook

Let’s create the runbook.  Let’s close the Variables and Assets blade.

Let's select the Runbooks box and click the Add a runbook button.  Select Quick Create.

For Name, input CleanBlobs. For Runbook type, choose PowerShell Workflow. Hit the Create button.

This is the code of our Workflow. Let’s paste in the following:

workflow CleanBlobs
{
    # Here we load all the variables we defined earlier
    $account = Get-AutomationVariable -Name 'accountName'
    $container = Get-AutomationVariable -Name 'containerName'
    $sas = Get-AutomationVariable -Name 'sas'
    $pattern = Get-AutomationVariable -Name 'pattern'

    # Construct a context for the storage account based on the SAS token
    $context = New-AzureStorageContext -StorageAccountName $account -SasToken $sas

    # List all the blobs in the container
    $blobs = Get-AzureStorageBlob -Container $container -Context $context

    # Keep only the blobs whose name contains the pattern (case-insensitive)
    $filteredBlobs = $blobs | Where-Object {$_.Name.ToUpper().Contains($pattern.ToUpper())}

    # Delete the matching blobs
    $filteredBlobs | ForEach-Object {Remove-AzureStorageBlob -Blob $_.Name -Container $container -Context $context}
}

You can see how we are using the variables by calling the cmdlet Get-AutomationVariable. You could actually discover that by opening the Assets tree view on the left of the edit pane.

We can then test our runbook by hitting the Test button at the top.  First, you might want to insert a few empty files in your blob container, some of them containing the word "draft" in their name.  Once the workflow has run, it should have deleted the draft files.
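To seed the container with a few test blobs, something along these lines works (a sketch; the local temp files are placeholders and the key-based context comes from the earlier SAS sketch):

# Create a couple of local dummy files, one with "draft" in its name
"dummy" | Out-File "$env:TEMP\report-draft-1.txt"
"dummy" | Out-File "$env:TEMP\report-final-1.txt"

# Upload them to the watched container
Set-AzureStorageBlobContent -File "$env:TEMP\report-draft-1.txt" -Container "my-watched-container" -Context $adminContext
Set-AzureStorageBlobContent -File "$env:TEMP\report-final-1.txt" -Container "my-watched-container" -Context $adminContext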

Scheduling Runbook

Let’s schedule the runbook.  First let’s publish it.  Close the test pane and click the Publish button.


Then click the Schedule button and Link a schedule to your runbook.


We haven't created any schedule yet, so let's create one in place.  Give it any name, set the recurrence to hourly and hit the Create button.

By default the start time will be 30 minutes from now.  At the time I wrote this blog, there was a little bug in the interface preventing me from setting it 5 minutes from now (because of time zone calculations).  That might be fixed by the time you try it.

Click OK and your runbook is scheduled.
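The scheduling can also be scripted with the classic Azure Automation cmdlets; a sketch (the schedule name and start time are arbitrary, and depending on your module version the runbook parameter may be -Name rather than -RunbookName):

# Create an hourly schedule starting in one hour
New-AzureAutomationSchedule -AutomationAccountName "myfirstautomation" -Name "HourlyCleanup" -StartTime (Get-Date).AddHours(1) -HourInterval 1

# Link the schedule to the published runbook
Register-AzureAutomationScheduledRunbook -AutomationAccountName "myfirstautomation" -RunbookName "CleanBlobs" -ScheduleName "HourlyCleanup"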


Azure Automation is a powerful tool to automate tasks within Azure.

In this article I only scratched the surface.  I will try to go further in future postings.

Move Azure Resources between Resource Groups using Powershell

Phew…  I've been using Azure for quite a while through the old (well, current) portal.  Now I look at my resources in the new (preview) portal and…  what a mess of a resource group mosaic!

Unfortunately, at the time of this writing, you can’t move resources from a Resource Group to another via the portal…


If you’ve been there, hang on, I have the remedy and it involves Powershell!

I’ll assume you’ve installed the latest Azure PowerShell cmdlets.  Fire up Powershell ISE or your favorite interface.

First things first, don't forget to tell PowerShell to switch to the Resource Manager mode:

Switch-AzureMode AzureResourceManager

Then, you’ll need to know the resource ID of your resources.  You can use the cmdlet Get-AzureResource.

In my case, I want to move everything related to an app named “Readings” into a new resource group I’ve created (in the portal).  So I can grab all those resources:

Get-AzureResource | Where-Object {$_.Name.Contains("eading")}

Then I can move my resources:

Get-AzureResource | Where-Object {$_.Name.Contains("eading")} `
| foreach {Move-AzureResource -DestinationResourceGroupName "Reading" -ResourceId $_.ResourceId -Force}
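To confirm the move worked, you can list what now lives in the target group:

Get-AzureResource -ResourceGroupName "Reading"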

Unfortunately, not every resource can be moved like this.  I had issues with both Insights objects and Job Collections (Scheduler).  The latter are managed only in the old (current) portal, so I would think they aren't under Azure Resource Manager yet.  For Insights, a similar story probably applies.

Azure Basics: Premium Storage

I thought I would do a lap around Azure Premium Storage to clear some fog.

Premium Storage is Solid State Drive (SSD)-backed storage.  That means more expensive but, above all, faster storage.

You might have heard the numbers?

  • Up to 64 TB of SSD storage attached to a VM
  • Up to 80K IOPS per VM
  • Up to 2,000 MB per second of throughput

Now, like everything related to high performance, there are a few variables to consider to get the most performance out of the service.  That's what I'm going to talk about here.

Premium storage vs D Series

First thing to clear up:  Premium Storage isn't directly related to the D Series and vice versa.

The D Series VMs are big, beefy VMs optimized for heavy CPU load.  They have huge amounts of memory and, to complement that, local SSD storage for scratch space.

That means the local SSD is temporary storage.  When the VM gets shut down (because you requested it, because of planned maintenance or because of a physical failure), the content of the local SSD is gone.

So don’t put anything you cannot lose there.

That being said, D Series are very handy VMs.  Like all VM models, they take minutes to set up.

I used them in a customer engagement to spin up a database used to merge 2 databases into one.  The team had been trying to run the migration scripts (one-off scripts, not optimized for performance, running against tables with millions of records) for weeks on laptops and normal VMs without success.  With a D Series VM, we put the database on the scratch disk and the scripts ran at an acceptable speed.  We didn't mind losing the DB content on reboot since it was a copy of the data anyway.

Couple of limitations

To use Premium Storage, you need to create a Premium Storage account.  You cannot just create a Premium Storage container within your existing account.

This comes with a few limitations:

  • Currently (as of October 2015), not every region supports Premium Storage.  See the details here.
  • You need to use the Preview Portal to manage the account.
  • Only page blobs are supported in a Premium Storage account.  The service is basically geared to serve VM VHDs.
  • Premium Storage accounts do not support geo-replication.  They are locally replicated (3 times) though.  If you want your data geo-replicated, you need to copy it to a normal storage account, which can then be configured for geo-replication.
  • For VM disks (the main scenario for Premium Storage), you need to use DS or GS series VMs.
  • No storage analytics & no custom domain name.
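As a side note, creating such an account from PowerShell is just a matter of picking the Premium_LRS type; a sketch with the classic cmdlet (the account name and region are placeholders):

# Premium Storage is locally redundant only, hence the Premium_LRS type
New-AzureStorageAccount -StorageAccountName "mypremiumstore" -Location "West US" -Type "Premium_LRS"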


There is more to it than that and Tamar Myers wrote a thorough guide on Premium Storage here.

The main confusion is typically between local SSD storage and Premium Storage.  Premium Storage is external to the VMs and replicated 3 times.  Local SSD is used either for caching or as a local temp drive.

Azure basics: Availability sets

What are availability sets in Azure?

In a nutshell, they are a way to declaratively define policies about how your services (VMs, apps, etc.) are deployed in order to ensure high availability.

To get more specific, you need to understand two more concepts:  Fault Domain & Update Domain.

Two physical machines in a Fault Domain share a common power source and network switch.  This means that when there is a physical fault, an outage, all machines in the same fault domain are affected.  Conversely, two machines in two different fault domains shouldn't fail at the same time.

An Update Domain, on the other hand, defines a group of machines that are updated at the same time.  Azure applies automatic maintenance patches requiring reboots.  Machines in the same update domain will be rebooted at the same time.

So, if you understand those two concepts, you will realize that if you want your solution to be highly available, you'll want to avoid both of the following:

  • Have all instances of the same service on the same fault domain (a physical outage would affect both instances)
  • Have all instances of the same service on the same update domain (a planned maintenance would take them both down at the same time)

This is where availability sets come in.  Azure guarantees that an availability set has:

  • 5 Update Domains
  • 2 Fault Domains

Those are defaults and can be modified, to an extent.

So by defining the instances of your services to belong to an availability set, you ensure they will be spread on different fault and update domains.

Conversely, you'll want the different tiers of your solution to belong to different availability sets.  Otherwise, it could happen (depending on how the update & fault domains are distributed among the instances, which you do not control) that all the instances of the same tier get rebooted or fail at the same time.


Those are the two big rules with availability sets, which you could coalesce into one:

Define an availability set per application tier, have more than one instance per tier and load balance within a tier.
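As a sketch with the classic Service Management cmdlets, putting two web-tier VMs in the same availability set looks roughly like this (image, service and credential values are placeholders):

$img = "<a Windows Server image name>"

# Both instances of the web tier share the availability set "webTierAvSet"
$web1 = New-AzureVMConfig -Name "web1" -InstanceSize Small -ImageName $img -AvailabilitySetName "webTierAvSet" |
        Add-AzureProvisioningConfig -Windows -AdminUsername "azureuser" -Password "<password>"
$web2 = New-AzureVMConfig -Name "web2" -InstanceSize Small -ImageName $img -AvailabilitySetName "webTierAvSet" |
        Add-AzureProvisioningConfig -Windows -AdminUsername "azureuser" -Password "<password>"

New-AzureVM -ServiceName "mysampleservice" -Location "East US" -VMs $web1, $web2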


Where are the statistics in Machine Learning?

I often try to explain what Machine Learning is to people outside the field.  I’m not always good at it but I am getting better.

One of the sources of confusion I often encounter when I start to elaborate on the details is the presence of statistics in Machine Learning.  For people outside the field, statistics are the stuff of surveys or of a mandatory science class (everyone I know outside of Mathematics who had a mandatory Statistics class in their first university year ended up postponing it until the end of their undergraduate studies!).

It feels a bit surprising that statistics would get involved while a computer is learning.

So let’s talk about it here.


Back in the day, there were two big tracks of Artificial Intelligence.  One track was about codifying our knowledge and then running the code on a machine.  The other track was about letting the machine learn from the data; that is Machine Learning in a nutshell.

The first approach has its merit and had a long string of successes such as Expert Systems based on logic and symbolic paradigms.

The best example of those I always give is in vaccine medicine.  The first time I travelled to tropical countries, in 2000, I went to a vaccine clinic and met with a doctor.  The guy asked me where I was going, consulted a big reference manual, probably used quite a bit of his knowledge and prescribed 3 vaccines his nurse administered to me.  In 2003, when I came back to Canada, I had a similar experience.  In 2009…  nope.  This time I met with a nurse.  She sat behind a computer, asked me similar questions and punched the answers into the software.  The software then prescribed me 7 vaccines to take (yes, it was a more dangerous place to go and I got sick anyway, but that's another story).

That was an expert system.  Somebody sat with experts and asked them how they proceeded when making vaccine prescriptions.  They took a lot of notes and codified this into a complex software program.


Expert systems are great at systematizing repetitive tasks requiring a lot of knowledge that can ultimately be described.

If I asked you to describe how you differentiate a male from a female face while looking at a photograph, it would be quite hard to codify.  Actually, if you could answer at all, you would probably give a bad recipe, since you do that unconsciously and you aren't aware of 20% of what your brain does to make the decision.

For those types of problems, Machine Learning tends to do better.

A great example of this is how much better Machine Learning is at solving translation problems.  A turning point was the use of the Canadian Hansards (transcripts of Parliamentary Debates in both French & English) to train machines to translate between the two languages.

Some people, e.g. Chomsky, oppose Machine Learning as it actually hides the codification of knowledge.  Incidentally, there is an interesting field developing that uses machines to build explanations out of complicated mathematical proofs.


Nevertheless, why is there statistics in Machine Learning?

When I wrote posts about Machine Learning to explain the basics, I gave an example of linear regression on a 2D data set.

I glossed over the fact that the data, although vaguely linear, wasn’t a perfect line, far from it.  Nevertheless, the model we used was a line.

This is where a statistical interpretation can come into play.

Basically, we can interpret the data in many ways.  We could say that the data is linear but there were errors in measurements.  We could say that there are many more variables involved that aren't part of the dataset, but we'll consider those as random since we do not know them.

Both explanations link to statistics and probability.  In the first one, we suppose the errors are randomly distributed and explain the deviation from a straight line.  In the second, we assume missing data (variables) that, if present, would make the dataset deterministic but that, in their absence, we model as a random distribution.

Actually, there is a long-held debate in physics around the interpretation of Quantum Mechanics that opposes those two explanations.


In our case, the important point is that when we build a machine learning model out of a data set, we assume our model will predict the expectation (or average if you will) of an underlying distribution.
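In symbols, the "noisy line" interpretation is the standard statistical formulation (a generic formulation, not specific to those earlier posts):

y = w x + b + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)

so the best the model can do is predict the conditional expectation:

\mathbb{E}[\, y \mid x \,] = w x + b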


Machine Learning & Statistics are two sides of the same coin.  Unfortunately, they were born from quite different scientific disciplines and have different cultures and vocabularies, which sustains the confusion to this day.  A good article explaining those differences can be found here.

Azure Key Vault & SQL Server Connector Update

Azure Key Vault is alive and well!

The Azure service that lets you store keys and secrets in a secured container was released at the end of the summer and it continues to improve.

The SQL Server Connector is a component that can be installed on SQL Server (not SQL Azure at this point).  It allows you to use the encryption features of SQL, i.e. Transparent Data Encryption (TDE), Column Level Encryption (CLE) & Encrypted Backup, while managing the encryption keys with Azure Key Vault.

This component now gets an upgrade!

On top of fixing bugs and getting rid of dependencies on the preview REST API (which will be discontinued at the end of September 2015), it improves its error handling around Azure Key Vault:

  • Better logs for you to see what happened
  • Better handling of transient error on the Azure Key Vault

Indeed, all cloud services have transient errors and every client should be robust enough to live with them.


You can download the SQL Connector here.

HDInsight Hadoop Hive – CSV files analysis

Ok, in a past blog post we set up Azure HDInsight for some Hive fun.

So let’s!

Today I'll analyse data contained in multiple CSV files.  Those files will be created by hand (in Excel), but in a real-world scenario they could be data dumps on a file server or files exported from a real system.

Creating some flat files

First, let’s create some data to be consumed by Hive.

Open Excel and author the following:

In the first row, type the headers:  rand1, Widget, Price & InStock.  In the second row, type the following formulas:

  • rand1 (column A):  =RAND()
  • Widget (column B):  =IF(A2<0.3, "Hammer", IF(A2>0.7, "Skrewdriver", "Plier"))
  • Price (column C):  =A2*20
  • InStock (column D):  =A2*1000

So basically, in the first column we generate a random number.  In the Widget column we generate a label (from the collection {Hammer, Skrewdriver, Plier}) based on that random number.  In the Price column we generate a price based on the random column.  Finally in the InStock column, we generate a number of items, still based on the random column.

So, yes, we are generating random dummy data.  Can’t wait to see the insight we’re going to get out of that!

Now, let’s auto-fill 100 rows using the first data row (i.e. the second row I made you type).  You should end up with something like that (but with totally different values, thanks to randomness):


(I’ve added the first row styling)

Now let’s save this file three times as CSV, creating files HiveSample-1.csv, HiveSample-2.csv and HiveSample-3.csv.  Between each save, enter a value in a blank cell (at the right of the data).  This will recalculate all the random entries.  You can then hit CTRL-Z and save.

Let's push those files into your Hadoop blob container, in the folder "my-data/sampled".  This is important as we'll refer to that folder in a typed command soon.
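One way to push them there is with the Azure Storage cmdlets; a sketch (the account, key and local paths are placeholders):

# Blob storage has no real folders: "my-data/sampled/" is simply part of the blob name
$ctx = New-AzureStorageContext -StorageAccountName "<your hadoop storage account>" -StorageAccountKey "<key>"

1..3 | ForEach-Object {
    Set-AzureStorageBlobContent -File "C:\temp\HiveSample-$_.csv" `
                                -Blob "my-data/sampled/HiveSample-$_.csv" `
                                -Container "<your hadoop container>" `
                                -Context $ctx
}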

Hive Editor

We'll use Hive through the HDInsight Hive Editor, a web console.

Let’s open the HDInsight cluster we’ve created in the portal.


In the Quick Links we find the Cluster Dashboard.  When we click it, we're prompted for credentials.  This is where we give the cluster admin credentials we entered during the setup.


In the top menu of the dashboard, you’ll find the Hive Editor.  Click that and you should land on a page looking like this:


Creating External Table

We're going to create an external table.  An external table in Hive is a table where only the table definition is stored in Hive; the data is stored in its original format outside of Hive itself (in the same blob storage container though).

In the query editor, we’re going to type

CREATE EXTERNAL TABLE hardware
(
rand1 double,
Widget string,
Price double,
InStock int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 'wasb:///my-data/sampled/'
TBLPROPERTIES ("skip.header.line.count"="1")

and hit the Submit button and…  wait.  You wait a lot when working with Hive unfortunately.

Hive transforms this HiveQL query into a Hadoop Map-Reduce job and schedules it.  Eventually your job will complete and you can view its details.  In this case the job doesn't output anything, but you can see the elapsed time in the logs.

What we did here is tell Hive to create an external table with a given schema (schema on read, more on that in a minute), parsing the data using the comma as the field delimiter.  We tell Hive to pick up all the files within the folder "my-data/sampled" and to skip the first row of each file (the header).

Sanity queries

Let's run a few queries for sanity's sake.  Let's start with a COUNT:

SELECT COUNT(*) FROM hardware

which should return the total number of rows across all the files you put in the folder.


SELECT * FROM hardware LIMIT 10

which should give you the first 10 rows (yes, SELECT TOP is a T-SQL-only instruction; it isn't part of ANSI SQL, nor was it picked up by anyone else, including HiveQL apparently).

You can submit both queries back-to-back.  You can give them names to find them again more easily.

Schema on read

A concept you'll hear a lot about in Hive (and in Big Data analysis in general) is schema on read, as opposed to the schema on write we are used to in a typical SQL database.

Here the schema is used to guide the parser in interpreting the data; it isn't a schema used to shape the data as it is written to the database.

It’s a subtle difference but an important one.  Schema-on-read means you wait for your analysis to impose constraints on the data.  This is quite powerful as you do not need to think in advance about the analysis you are going to do.  Well…  mostly!  You still need the information to be there in some form at least.

Slightly more complex queries

Ok, let's try some analysis of the data.  Let's group all the widgets together and look at their average price and quantity:

SELECT
Widget,
AVG(Price) AS Price,
AVG(InStock) AS InStock
FROM hardware
GROUP BY Widget

Your mileage will vary given the randomness, but you should get something along the lines of:


Optimization with Tez

The last query took 85 seconds to run in my cluster.  Remember, I did configure my cluster to have only one data node of the cheapest VM available.

Still…  to compute averages over a dataset of 300 rows…  A tad slow.

By and large, Hadoop Map-Reduce isn't fast; it is scalable.  It wasn't built to be fast on small data sets.  It has a file-based job scheduling architecture, which incurs a lot of overhead.

Nevertheless, some optimizations are readily available.

For instance, if you type

set hive.execution.engine=tez;

at the top of the last query, it runs in 36 seconds on my cluster:  less than half the time taken without the optimization.

That is Tez.  Tez is built on top of YARN.  To make a long story short, you can think of them as version 2 of the Hadoop Map-Reduce engine.

This works per query, so you always need to prefix your queries with that set instruction.


We've created some dummy data in Excel, saved it in CSV format and uploaded it to blob storage.

We created a Hive external table reading those files.

We’ve queried that table to perform some analytics.

We’ve optimized the queries using Tez.

HDInsight / Hadoop Hive really shines when you want to perform ad hoc analytics:  you want to explore the data.  If you already know that you want your quarterly average, there are better suited technologies for that (e.g. SQL Server Analysis Services).

For a more conceptual tutorial of Hive look here.