Hyperspheres & the curse of dimensionality

I previously talked about the curse of dimensionality (more than 2 years ago) in relation to Machine Learning.

Here I wanted to discuss it in more depth and dive into the mathematics of it.

High dimensions might sound like Physics’ string theory where our universe is made of more than 4 dimensions.  This isn’t what we are talking about here.

The curse of dimensionality is related to what happens when a model deals with a data space with dimensions in the hundreds or thousands.

As the title of this article suggests, we’re going to take the angle of the properties of Hyperspheres (spheres in N dimensions) to explore high dimension properties.

This article is inspired by Foundations of Data Science by John Hopcroft & Ravindran Kannan (chapter 2).

Why should I care about High Dimension?

When introducing Machine Learning concepts, we typically use few dimensions in order to help visualization.  For instance, when I introduced linear regression or polynomial regression in past articles, I used datasets in two dimensions and plotted them on a chart.

In the real world, typical data sets have many more dimensions.

A typical case of high dimension is image recognition (or character recognition as a sub category) where even a low-resolution picture will have hundreds of pixels.  The corresponding model would take a gray-scale input vector of dimension 100+.

With fraud detection, transactions do not contain only the value of the transaction, but also the time of day, day of week, geo-location, type of commerce, type of products, etc.  This might or might not be a high dimension problem, depending on the available data.

In an e-commerce web site, a Product recommendation algorithm could be as simple as an N x N matrix of 0 to 1 values where N is the number of products.

With IoT, multiple sensors feed a prediction model.

In bioinformatics, DNA sequencing generates a huge amount of data which is often arranged in a high-dimensional model.

Basically, high dimensions crop up everywhere.

What happens as dimension increases?

For starters, a space with more dimensions simply is…  bigger.  In order to sample a space with 2 dimensions with a resolution of 10 points per axis, we need 10^2 = 100 points.  The same sampling in a space of dimension 3 would require 10^3 = 1000 points.  Dimension 20 would require 10^20 = 100 000 000 000 000 000 000 points.

Right off the bat we can tell that sampling the space of dimension 2 & 3 is realistic while for a space of dimension 20, it’s unlikely.  Hence we are likely going to suffer from under-sampling.
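To get a feel for how quickly the count grows, here is a minimal Python sketch.  The function name grid_points is only an illustrative choice, and it assumes a regular grid with the same number of points along every axis.

```python
def grid_points(points_per_axis: int, dimension: int) -> int:
    """Number of points in a regular grid covering a space of the given dimension."""
    return points_per_axis ** dimension


if __name__ == "__main__":
    # Same three cases as in the text: dimensions 2, 3 and 20 at 10 points per axis.
    for dimension in (2, 3, 20):
        print(dimension, grid_points(10, dimension))
```

Each extra dimension multiplies the number of points by the resolution, which is why the requirement gets out of hand so quickly.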

Yoshua Bengio has a nice discussion about Curse of Dimensionality here.

Hypersphere in a cube

Beyond sampling problems, metrics & measures change behaviour at high dimensions.  Intuitively it makes sense since a measure takes a vector (or vectors) and squeezes it (them) into a numerical value; the higher the dimension, the more data we squeeze into one number & hence the more information we should lose.

We use metrics & measures heavily in Machine Learning.  For instance, a lot of cost (or loss) functions are based on the squared Euclidean distance:

dist(x,y)^2 = \displaystyle\sum_{i=1}^N (x_i-y_i)^2

Now if x and / or y are random variables (e.g. samples) with independent coordinates, the law of large numbers applies when N becomes large.  This implies the sum, once divided by N, will tend to its expected value, with a standard deviation that gets narrower relative to that value as N increases.  In turn, this means there is less and less information in the distance as the number of dimensions increases.
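We can observe this concentration with a small simulation.  The sketch below is only an illustration under an assumption that is not in the original argument: every coordinate is drawn independently and uniformly from [0, 1].  The function name relative_spread is hypothetical.  It measures the standard deviation of the distance between random pairs, divided by the mean distance.

```python
import math
import random
import statistics


def relative_spread(dimension: int, pairs: int = 200, seed: int = 0) -> float:
    """Standard deviation of the distance between random pairs, divided by its mean.

    Assumption: each coordinate is independent and uniform on [0, 1].
    """
    rng = random.Random(seed)
    distances = []
    for _ in range(pairs):
        x = [rng.random() for _ in range(dimension)]
        y = [rng.random() for _ in range(dimension)]
        distances.append(math.dist(x, y))  # Euclidean distance (Python 3.8+)
    return statistics.pstdev(distances) / statistics.mean(distances)


if __name__ == "__main__":
    for dimension in (2, 10, 100, 1000):
        print(dimension, round(relative_spread(dimension), 3))
```

The ratio shrinks as the dimension grows:  all pairs of points end up at nearly the same distance from each other, which is the loss of information described above.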

This brings us to the hypersphere.  A hypersphere’s equation is

\displaystyle\sum_{i=1}^N x_i^2 = R^2

where x is a point of dimension N and R is the radius of the hypersphere.

A hypersphere of dimension 1 is a line segment, a hypersphere of dimension 2 is a circle, dimension 3 is a sphere, dimension 4 is an…  expanding universe?  and so on.

A theorem I’ll demonstrate in a future article is that the volume of a hypersphere of radius 1 tends to zero as the dimension increases.

This is fairly unintuitive, so let me give real numerical values:

Dimension | Hyper Volume
--- | ---
1 | 2
2 | 3.141592654
3 | 4.188790205
4 | 4.934802201
5 | 5.263789014
6 | 5.16771278
7 | 4.72476597
8 | 4.058712126
9 | 3.298508903
10 | 2.55016404
11 | 1.884103879
12 | 1.335262769
13 | 0.910628755
14 | 0.599264529
15 | 0.381443281
16 | 0.23533063
17 | 0.140981107
18 | 0.082145887
19 | 0.046621601
20 | 0.025806891
21 | 0.01394915
22 | 0.007370431
23 | 0.003810656
24 | 0.001929574
25 | 0.000957722
26 | 0.000466303
27 | 0.000222872
28 | 0.000104638

If we plot those values:

[Figure: hyper volume of the unit hypersphere plotted against dimension]

We see the hyper volume increases in the first couple of dimensions.  A circle of radius 1 has an area of pi (3.1416) while a sphere of radius 1 has a volume of 4.19.  It peaks at dimension 5 and then shrinks.
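The values in the table can be reproduced with the standard closed-form formula for the volume enclosed by a hypersphere of radius R in dimension N, namely pi^(N/2) * R^N / Gamma(N/2 + 1).  The article does not derive this formula (the theorem is left for a future article), so take the following as a sketch; the function name hypersphere_volume is an illustrative choice.

```python
import math


def hypersphere_volume(dimension: int, radius: float = 1.0) -> float:
    """Volume enclosed by a hypersphere: pi^(N/2) * R^N / Gamma(N/2 + 1)."""
    return (
        math.pi ** (dimension / 2)
        * radius ** dimension
        / math.gamma(dimension / 2 + 1)
    )


if __name__ == "__main__":
    for dimension in range(1, 29):
        print(dimension, round(hypersphere_volume(dimension), 9))
```

The Gamma function in the denominator grows faster than the power of pi in the numerator, which is why the volume eventually shrinks towards zero.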

It is unintuitive because in 2 and 3 dimensions (the only dimensions in which we can visualize a hypersphere), the hypersphere pretty much fills its embedding cube.  A way to “visualize” what’s happening in higher dimensions is to consider a “diagonal” of the hypersphere.

For a circle, the diagonal (i.e. 45°) intersects with the unit circle at

(\frac {1} {\sqrt {2}}, \frac {1} {\sqrt {2}}) since (\frac {1} {\sqrt {2}})^2 + (\frac {1} {\sqrt {2}})^2 = 1^2

In general, at dimension N, the diagonal intersects at

x_i = \frac {1} {\sqrt {N}}

So, although the hypersphere of radius 1 touches the cube of side 2 centered at the origin on each of its walls, the surface of the hypersphere gets further and further away from the cube’s corners as the dimension increases:  along the diagonal, the hypersphere sits at distance 1 from the origin while the corner (1, …, 1) sits at distance \sqrt{N}.

Consequences of the hypersphere volume

A straightforward consequence of the hypersphere volume concerns sampling.  Randomly sampling a square of side 2 centered at the origin will land points within the unit circle with probability \frac{\pi}{4} \approx 79\%.  The same process in dimension 8 (sampling the cube of side 2) would hit the inside of the hypersphere with a probability of only 1.6%.
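These probabilities can be checked with a quick Monte Carlo experiment.  The following is a minimal sketch:  the function name, the sample count and the seed are arbitrary choices made for the illustration.

```python
import random


def fraction_inside_hypersphere(dimension: int, samples: int = 20000, seed: int = 0) -> float:
    """Estimate the share of the cube [-1, 1]^N that lies inside the unit hypersphere."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        point = [rng.uniform(-1.0, 1.0) for _ in range(dimension)]
        # A point is inside the hypersphere when the sum of its squared coordinates is at most 1.
        if sum(coordinate * coordinate for coordinate in point) <= 1.0:
            hits += 1
    return hits / samples


if __name__ == "__main__":
    for dimension in (2, 3, 8):
        print(dimension, fraction_inside_hypersphere(dimension))
```

The estimate is simply the hypersphere volume divided by the cube volume, 2^N, so it drops quickly as the dimension increases.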

A corollary to the hypersphere volume is that at higher dimension, the bulk of the volume of the hypersphere is concentrated in a thin annulus below its surface.  An obvious consequence of that is that optimizing a metric (i.e. a distance) in high dimension is difficult.
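To see why the volume concentrates near the surface, recall that volume scales as R^N.  A hypersphere of radius (1 - e) therefore holds a fraction (1 - e)^N of the unit hypersphere’s volume, and the outer shell of thickness e holds the rest.  A minimal sketch, with a hypothetical function name:

```python
def outer_shell_fraction(dimension: int, thickness: float) -> float:
    """Fraction of a unit hypersphere's volume lying within `thickness` of its surface."""
    return 1.0 - (1.0 - thickness) ** dimension


if __name__ == "__main__":
    # Share of the volume in the outer 5% of the radius, for a few dimensions.
    for dimension in (2, 3, 10, 100):
        print(dimension, round(outer_shell_fraction(dimension, 0.05), 4))
```

In dimension 2 the outer 5% of the radius holds a little under 10% of the area; in dimension 100 it holds more than 99% of the volume.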

What should we do about it?

First step is to be aware of it.

A symptom of high dimensionality is under-sampling:  the space covered is so large that the number of sample points required to learn the underlying model is likely to exceed the actual sample set’s size.

The simplest solution is to avoid high dimensionality with some pre-processing.  For instance, if we have a priori knowledge of the domain, we might be able to combine dimensions together.  For example, in an IoT field with 10 000 sensors, for many reasons, including the curse of dimensionality, it wouldn’t be a good idea to consider each sensor’s input as an independent input.  It would be worth trying to aggregate sensor inputs by analyzing the data.

Summary

Some Machine Learning algorithms will be more sensitive to higher dimensionality than others but the curse of dimensionality affects most algorithms.

It is a problem to be aware of and we should be ready to mitigate it with some good feature engineering.

URL Routing with Azure Application Gateway

Update (13-06-2017):  The POC of this article is available on GitHub here.

I have a scenario perfect for a Layer-7 Load Balancer / Reverse Proxy:

  • Multiple web server clusters to be routed under one URL hierarchy (one domain name)
  • Redirect HTTP traffic to the same URL on HTTPS
  • Have reverse proxy performing SSL termination (or SSL offloading), i.e. accepting HTTPS but routing to underlying servers using HTTP

On paper, Azure Application Gateway can do all of those.  Let’s find out in practice.

Azure Application Gateway Concepts

From the documentation:

Application Gateway is a layer-7 load balancer.  It provides failover, performance-routing HTTP requests between different servers, whether they are on the cloud or on-premises. Application Gateway provides many Application Delivery Controller (ADC) features including HTTP load balancing, cookie-based session affinity, Secure Sockets Layer (SSL) offload, custom health probes, support for multi-site, and many others.

Before we get into the meat of it, there are a bunch of concepts Application Gateway uses and we need to understand:

  • Back-end server pool: The list of IP addresses of the back-end servers. The IP addresses listed should either belong to the virtual network subnet or should be a public IP/VIP.
  • Back-end server pool settings: Every pool has settings like port, protocol, and cookie-based affinity. These settings are tied to a pool and are applied to all servers within the pool.
  • Front-end port: This port is the public port that is opened on the application gateway. Traffic hits this port, and then gets redirected to one of the back-end servers.
  • Listener: The listener has a front-end port, a protocol (Http or Https, these values are case-sensitive), and the SSL certificate name (if configuring SSL offload).
  • Rule: The rule binds the listener, the back-end server pool and defines which back-end server pool the traffic should be directed to when it hits a particular listener.

On top of those, we should probably add probes that are associated to a back-end pool to determine its health.

Proof of Concept

As a proof of concept, we’re going to implement the following:

image

We use Windows Virtual Machine Scale Sets (VMSS) for back-end servers.

In a production setup, we would go for exposing the port 443 on the web, but for a POC, this should be sufficient.

As of this writing, there is no feature to allow automatic redirection from port 80 to port 443.  Usually, for a public web site, we want to redirect users to HTTPS.  This could be achieved by having one of the VM scale sets implement the redirection and routing HTTP traffic to it.

ARM Template

We’ve published the ARM template on GitHub.

First, let’s look at the visualization.

image

The template is split into 4 files:

  • azuredeploy.json, the master ARM template.  It simply references the others and passes parameters around.
  • network.json, responsible for the virtual network and Network Security Groups
  • app-gateway.json, responsible for the Azure Application Gateway and its public IP
  • vmss.json, responsible for a VM scale set, a public IP and a public load balancer; this template is invoked 3 times with 3 different sets of parameters to create the 3 VM scale sets

We’ve configured the VMSS to have public IPs.  It is quite typical to want to connect directly to a back-end server while testing.  We also optionally open the VMSS to RDP traffic; this is controlled by the ARM template’s parameter RDP Rule (Allow, Deny).

Template parameters

Here are the ARM template parameters.

Parameter | Description
--- | ---
Public DNS Prefix | The DNS prefix for each VMSS public IP.  The prefix is then suffixed by ‘a’, ‘b’ & ‘c’.
RDP Rule | Switch allowing or denying RDP network traffic to reach the VMSS from public IPs.
Cookie Based Affinity | Switch enabling / disabling cookie based affinity on the Application Gateway.
VNET Name | Name of the Virtual Network (defaults to VNet).
VNET IP Prefix | Prefix of the IP range for the VNET (defaults to 10.0.0).
VM Admin Name | Local user account for administrator on all the VMs in all VMSS (defaults to vmssadmin).
VM Admin Password | Password for the VM Admin (same for all VMs of all VMSS).
Instance Count | Number of VMs in each VMSS.
VM Size | SKU of the VMs for the VMSS (defaults to Standard DS2-v2).

Routing

An important characteristic of URL-based routing is that requests are routed to back-end servers without alteration.

This is important.  It means that /a/ on the Application Gateway is mapped to /a/ on the Web Server.  It isn’t mapped to /, which might seem more intuitive as that would look like the root of the ‘a’ web servers.  This is because URL-based routing can be more general than just defining a suffix.
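The behaviour can be pictured with a toy model.  This is not how Application Gateway is implemented, only a sketch of the rule “pick a pool from the path, forward the path untouched”.  The pool names, the default pool and the route function are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical path map: URL path prefix -> back-end pool name.
PATH_MAP: Dict[str, str] = {"/a/": "pool-a", "/b/": "pool-b"}
DEFAULT_POOL = "pool-c"  # hypothetical pool used when no prefix matches


def route(path: str) -> Tuple[str, str]:
    """Return (pool, path sent to the back end) for an incoming request path."""
    for prefix, pool in PATH_MAP.items():
        if path.startswith(prefix):
            # The path is forwarded unchanged: the prefix is NOT stripped.
            return pool, path
    return DEFAULT_POOL, path


if __name__ == "__main__":
    print(route("/a/index.html"))
    print(route("/other/page"))
```

In practice this means the web application on the ‘a’ servers has to serve its content under /a/ itself.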

Summary

This proof of concept gives a fully functional example of Azure Application Gateway using URL-based routing.

This is a great showcase for Application Gateway as it can then reverse proxy all traffic while keeping user affinity using cookies.

Automating Role Assignment in Subscriptions & Resource Groups

Azure supports a Role Based Access Control (RBAC) system.  This system links identity (users & groups) to roles.

RBAC is enforced at the REST API access level, which is the fundamental access in Azure:  it can’t be bypassed.

In this article, we’ll look at how we can automate the role assignment procedure.

This is useful if you routinely create resource groups for different people, e.g. each time a department requests an Azure environment, or if you routinely create new subscriptions.

We’re going to do this in PowerShell.  So let’s prep a PowerShell environment with Azure SDK & execute the Add-AzureRmAccount (login) command.

Exploring roles

A role is an aggregation of actions.

Let’s look at the available roles.


Get-AzureRmRoleDefinition | select Name | sort -Property Name

This gives us the rather long list of roles:

  • API Management Service Contributor
  • API Management Service Operator Role
  • API Management Service Reader Role
  • Application Insights Component Contributor
  • Application Insights Snapshot Debugger
  • Automation Job Operator
  • Automation Operator
  • Automation Runbook Operator
  • Azure Service Deploy Release Management Contributor
  • Backup Contributor
  • Backup Operator
  • Backup Reader
  • Billing Reader
  • BizTalk Contributor
  • CDN Endpoint Contributor
  • CDN Endpoint Reader
  • CDN Profile Contributor
  • CDN Profile Reader
  • Classic Network Contributor
  • Classic Storage Account Contributor
  • Classic Storage Account Key Operator Service Role
  • Classic Virtual Machine Contributor
  • ClearDB MySQL DB Contributor
  • Contributor
  • Data Factory Contributor
  • Data Lake Analytics Developer
  • DevTest Labs User
  • DNS Zone Contributor
  • DocumentDB Account Contributor
  • GenevaWarmPathResourceContributor
  • Intelligent Systems Account Contributor
  • Key Vault Contributor
  • Log Analytics Contributor
  • Log Analytics Reader
  • Logic App Contributor
  • Logic App Operator
  • Monitoring Contributor Service Role
  • Monitoring Reader Service Role
  • Network Contributor
  • New Relic APM Account Contributor
  • Office DevOps
  • Owner
  • Reader
  • Redis Cache Contributor
  • Scheduler Job Collections Contributor
  • Search Service Contributor
  • Security Admin
  • Security Manager
  • Security Reader
  • SQL DB Contributor
  • SQL Security Manager
  • SQL Server Contributor
  • Storage Account Contributor
  • Storage Account Key Operator Service Role
  • Traffic Manager Contributor
  • User Access Administrator
  • Virtual Machine Contributor
  • Web Plan Contributor
  • Website Contributor

Some roles are specific, e.g. Virtual Machine Contributor, while others are much broader, e.g. Contributor.

Let’s look at a specific role:


Get-AzureRmRoleDefinition "Virtual Machine Contributor"

This gives us a role definition object:


Name             : Virtual Machine Contributor
Id               : 9980e02c-c2be-4d73-94e8-173b1dc7cf3c
IsCustom         : False
Description      : Lets you manage virtual machines, but not access to them, and not the virtual network or storage account
they’re connected to.
Actions          : {Microsoft.Authorization/*/read, Microsoft.Compute/availabilitySets/*, Microsoft.Compute/locations/*,
Microsoft.Compute/virtualMachines/*...}
NotActions       : {}
AssignableScopes : {/}

Of particular interest are the actions allowed by that role:


(Get-AzureRmRoleDefinition "Virtual Machine Contributor").Actions

This returns the 34 actions (as of the time of this writing) the role enables:

  • Microsoft.Authorization/*/read
  • Microsoft.Compute/availabilitySets/*
  • Microsoft.Compute/locations/*
  • Microsoft.Compute/virtualMachines/*
  • Microsoft.Compute/virtualMachineScaleSets/*
  • Microsoft.Insights/alertRules/*

We see that wildcards are used to allow multiple actions.  Therefore there are actually many more than 34 actions allowed by this role.
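To illustrate how one wildcard pattern covers many operations, here is a small sketch using Python’s fnmatch module.  It is a simplified stand-in, not Azure’s actual evaluation logic; the function name is hypothetical and the three operation names at the bottom are only examples.

```python
from fnmatch import fnmatchcase

# Two patterns copied from the role definition above.
ROLE_ACTIONS = [
    "Microsoft.Authorization/*/read",
    "Microsoft.Compute/virtualMachines/*",
]


def is_covered(operation: str) -> bool:
    """True if the operation matches at least one wildcard pattern of the role."""
    return any(fnmatchcase(operation, pattern) for pattern in ROLE_ACTIONS)


if __name__ == "__main__":
    for operation in (
        "Microsoft.Compute/virtualMachines/read",
        "Microsoft.Compute/virtualMachines/start/action",
        "Microsoft.Storage/storageAccounts/read",
    ):
        print(operation, is_covered(operation))
```

A single pattern such as Microsoft.Compute/virtualMachines/* stands for every operation under that resource type.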

Let’s look at a more generic role:


Get-AzureRmRoleDefinition "Contributor"

This role definition object is:


Name             : Contributor
Id               : b24988ac-6180-42a0-ab88-20f7382dd24c
IsCustom         : False
Description      : Lets you manage everything except access to resources.
Actions          : {*}
NotActions       : {Microsoft.Authorization/*/Delete, Microsoft.Authorization/*/Write,
Microsoft.Authorization/elevateAccess/Action}
AssignableScopes : {/}

We notice that all actions (*) are allowed but that some actions are explicitly disallowed via the NotActions property.


(Get-AzureRmRoleDefinition "Contributor").NotActions

  • Microsoft.Authorization/*/Delete
  • Microsoft.Authorization/*/Write
  • Microsoft.Authorization/elevateAccess/Action
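The way Actions and NotActions combine can be sketched as “allowed by Actions, minus what NotActions excludes”.  Again this is a simplified model rather than Azure’s real evaluator.  The helper names are hypothetical, and comparing in lower case is an assumption made here so that casing differences between a pattern and an operation do not matter.

```python
from fnmatch import fnmatchcase
from typing import List

# Copied from the Contributor role definition above.
ACTIONS = ["*"]
NOT_ACTIONS = [
    "Microsoft.Authorization/*/Delete",
    "Microsoft.Authorization/*/Write",
    "Microsoft.Authorization/elevateAccess/Action",
]


def matches_any(operation: str, patterns: List[str]) -> bool:
    """True if the operation matches one of the patterns, ignoring case (assumption)."""
    return any(fnmatchcase(operation.lower(), pattern.lower()) for pattern in patterns)


def is_allowed(operation: str) -> bool:
    """Allowed when an Actions pattern matches and no NotActions pattern does."""
    return matches_any(operation, ACTIONS) and not matches_any(operation, NOT_ACTIONS)


if __name__ == "__main__":
    print(is_allowed("Microsoft.Compute/virtualMachines/start/action"))
    print(is_allowed("Microsoft.Authorization/roleAssignments/write"))
```

This is why a Contributor can manage resources but cannot grant access to them:  writing role assignments falls under the excluded Microsoft.Authorization/*/Write pattern.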

We could create custom roles aggregating arbitrary groups of actions together but we won’t cover that here.

Users & Groups

Now that we know about roles, let’s look at users & groups.

Users & groups will come from the Azure AD managing our Azure subscription.

We can grab users with Get-AzureRmADUser.  Called without parameters, it will list all the users in the tenant.  If you are part of a large organization, this is likely a long list.  We can grab a specific user with the following command:


Get-AzureRmADUser -UserPrincipalName john.smith@contoso.com

We need to specify the domain of the user since we could have users coming from different domains inside the same tenant.

Let’s grab the object ID of the user:


$userID = (Get-AzureRmADUser -UserPrincipalName john.smith@contoso.com).Id

Similarly, we could grab the object ID of a group:


$groupID = (Get-AzureRmADGroup -SearchString "Azure Team").Id

Scope

The next thing to determine is the scope where we want to apply a role.

The scope can be either a subscription, a resource group or a resource.

To use our subscription as the scope, let’s run:


$scope = "/subscriptions/" + (Get-AzureRmSubscription)[0].SubscriptionId

To use a resource group as the scope, let’s run:


$scope = (Get-AzureRmResourceGroup -Name MyGroup).ResourceId

Finally, to use a specific resource as the scope, let’s run:


$scope = (Get-AzureRmResource -ResourceGroupName MyGroup -ResourceName MyResource).ResourceId
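Whatever the level, a scope is just a resource ID string, and the three levels nest inside each other.  The sketch below builds them by hand to show the shape of those strings.  The function names are hypothetical, and the subscription ID, group name and resource used in the example are placeholders.

```python
def subscription_scope(subscription_id: str) -> str:
    """Scope covering a whole subscription."""
    return f"/subscriptions/{subscription_id}"


def resource_group_scope(subscription_id: str, group_name: str) -> str:
    """Scope covering one resource group inside a subscription."""
    return f"{subscription_scope(subscription_id)}/resourceGroups/{group_name}"


def resource_scope(subscription_id: str, group_name: str, provider: str,
                   resource_type: str, resource_name: str) -> str:
    """Scope covering a single resource inside a resource group."""
    return (f"{resource_group_scope(subscription_id, group_name)}"
            f"/providers/{provider}/{resource_type}/{resource_name}")


if __name__ == "__main__":
    placeholder_id = "00000000-0000-0000-0000-000000000000"  # placeholder subscription ID
    print(subscription_scope(placeholder_id))
    print(resource_group_scope(placeholder_id, "MyGroup"))
    print(resource_scope(placeholder_id, "MyGroup", "Microsoft.Compute",
                         "virtualMachines", "MyResource"))
```

A role assigned at a given scope applies to everything nested under it, so a role assigned on a resource group applies to the resources it contains.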

Assigning a role

OK, let’s put it all together:


New-AzureRmRoleAssignment -ObjectId $userID -Scope $scope -RoleDefinitionName "Contributor"

We can double check in the portal that the assignment occurred.

Summary

We simply automated the role assignment using PowerShell.

As with everything that can be done in PowerShell, it can also be done using the Azure Command Line Interface (CLI).  Commands are quite similar.

Also, like every automation, it can be bundled in an Azure Automation Runbook.  So if we have routine operations consisting of provisioning subscriptions or resource groups to groups of users, we could package them in a Runbook to ensure consistency.

Managing Azure AD Application members in Portal

One of Azure AD’s powerful concepts is the application.  It gives context to an authentication as we explained in this article.

An application can also be used as an authorization barrier since we can manage an application’s members.  This is optional as by default, everyone in a tenant has access to its applications.  But if we opt in to control the members, only members have access to the application, hence only members can authenticate via the application.

In this article, we’ll look at how to manage members of an application in the Portal.  We’ll discuss how to automate this in a future article.

Application Creation

First, let’s create an application.

In the Azure Active Directory (Azure AD or AAD) blade, let’s select App Registrations, then Add.

image

Let’s type the following specifications:

image

Opt in to Manage members

If we now go into the application and select Managed Application in Local Directory:

image

We can select the properties tab and there we can require user assignment.

image

Assigning users

We can then assign users & groups (assigning groups requires the Azure AD Premium SKU).

image

Summary

Azure AD Application Membership, also called User Assignment, is a simple opt-in feature that allows us to control which users can use a given application.

It can be used as a simple (application-wide) authorization mechanism.

Sizing & Pricing Virtual Machines in Azure

I’m regularly asked similar questions by customers around sizing & pricing of Virtual Machines (VMs), storage, etc.  So I thought I would build a reusable asset in the form of this article.

This is especially important if you are trying to size / price VMs “in advance”, for instance if you are quoting some work in a “fixed bid” context, i.e. you need to provide the Azure cost before you have written a single line of code of your application.

If that isn’t your case, you can simply try different VM sizes.  The article would still be useful to see what variables you should be looking at if you do not obtain the right performance.

There are a few things to look for.  We tend to focus on the CPU & RAM but that’s only part of the equation.  The storage & performance target will often drive the choice of VM.

A VM has the following characteristics:  # cores, RAM, Local Disk, # data disks, IOPs, # NICs & Network bandwidth.  We need to consider all of those before choosing a VM.

For starters, we need to understand that Virtual Machines cannot be “hand crafted”, i.e. we cannot choose CPU speed, RAM & IOPS separately.  They come in predefined packages with predefined specs:  SKUs, e.g. D2.

Because of that, we might often have to oversize a characteristic (e.g. # cores) in order to get the right amount of another characteristic (e.g. RAM).

SKUs come in families called Series.  At the time of this writing Azure has the following VM series:

  • A
  • Av2 (A version 2)
  • D & DS
  • Dv2 & DSv2 (D version 2 & DS version 2)
  • F & FS
  • G & GS
  • H & HS
  • L & LS
  • NC
  • NV

Each series will optimize different ratios.  For instance, the F Series will have a higher cores / RAM ratio than the D series.  So if we are looking at a lot of cores and not much RAM, the F series is likely a better choice than D series and will not force us to oversize the RAM as much in order to have the right # of cores.

For pricing, the obvious starting point is the pricing page for VM:  https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/.

Cores

Azure compute allocates virtual cores from the physical host to the VMs.

Azure cores are dedicated cores.  As of the time of this writing, there is no shared core (except for the A0 VM) and there is no hyper-threading.

Operating System

There are two components in the price of a VM:

  1. Compute (the raw underlying VM, i.e. the CPU + RAM + local disk)
  2. Licensed software running on it (e.g. Windows, SQL, RHEL, etc.)

The compute price corresponds to the CentOS Linux pricing since CentOS is open source and has no license fee.

Azure has different flavours of licensed software (as of the writing of this article, i.e. March 2017):

  • Windows
    • BizTalk
    • Oracle Java
    • SharePoint
    • SQL Server
  • Linux
    • Open Source (no License)
    • Licensed:  Red Hat Enterprise Linux (RHEL), R Server, SUSE

Windows by itself comes with the same license fee regardless of Windows version (e.g. Windows 2012 & Windows 2016 have the same license fee).

Windows software (e.g. BizTalk) will come with software license (e.g. BizTalk) + OS license.  This is reflected in the pricing columns.  For instance, for BizTalk Enterprise (https://azure.microsoft.com/en-us/pricing/details/virtual-machines/biztalk-enterprise/), here in Canadian dollars in Canada East region for the F Series:

image

In the OS column is the price of the compute + the Windows license while in the “Software” column is the price of the BizTalk Enterprise license.  The total is what we pay per hour for the VM.

It is possible to “Bring Your Own License” (BYOL) of any software (including Windows or Linux) in Azure and therefore pay only for the bare compute (which, again, corresponds to CentOS Linux pricing).

UPDATE:  Through Azure Hybrid Use Benefit, we can even “reuse” an on premise Windows license for a new (unrelated) VM in Azure.

We can also run whatever licensed software we want on top of a VM.  We can install SAP, get an SAP license and be 100% legal.  The licensed software I enumerated comes with the option of being integrated in the “per minute” cost.

So one of the first decisions to make in pricing is:  do we want to go with integrated pricing or external license-based pricing?  It is quite easy to decide:  simply look at the price of external licenses (e.g. volume licensing) we can get from the vendor and compare.

Typically if we run the VM sporadically, i.e. a few hours per day, it is cheaper to go with the integrated pricing.  Also, I see a lot of customers start with integrated pricing for POCs, run it for a while and optimize pricing later.
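The comparison boils down to a break-even calculation.  The sketch below uses made-up rates, not real Azure or vendor prices, and assumes the external license is paid as a flat monthly fee, which may not match an actual licensing agreement.  The function names are hypothetical.

```python
def integrated_cost(hours: float, compute_rate: float, license_rate: float) -> float:
    """Monthly cost when the license is billed per hour together with the compute."""
    return hours * (compute_rate + license_rate)


def byol_cost(hours: float, compute_rate: float, license_monthly_fee: float) -> float:
    """Monthly cost with bare compute plus an external license (assumed flat monthly fee)."""
    return hours * compute_rate + license_monthly_fee


def cheaper_option(hours: float, compute_rate: float, license_rate: float,
                   license_monthly_fee: float) -> str:
    """Name of the cheaper pricing option for the given monthly usage."""
    integrated = integrated_cost(hours, compute_rate, license_rate)
    external = byol_cost(hours, compute_rate, license_monthly_fee)
    return "integrated" if integrated <= external else "byol"


if __name__ == "__main__":
    # Made-up numbers: 0.10 per hour of compute, 0.50 per hour of license, 200 per month for BYOL.
    for hours in (60, 730):
        print(hours, cheaper_option(hours, 0.10, 0.50, 200.0))
```

With these made-up numbers, a VM running 60 hours per month is cheaper with integrated pricing, while a VM running all month is cheaper with its own license.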

Temporary Disk

Ok, here, let’s debunk what probably takes 2 hours from me every single week:  the “disk size” column in the pricing sheets.

image

This is local storage.  By local, we mean it’s local to the host itself, it isn’t an attached disk.  For that reason it has lower latency than attached disks.  It also has another very important characteristic:  it is ephemeral.  It isn’t persistent.  Its content does not survive a reboot of the VM.  The disk is empty after reboot.

We are insisting on this point because everybody gets confused by that column, and for a good reason:  the column title is misleading.  It doesn’t lie:  it is a disk and it does have the specified size.  But it is a temporary disk.

Can we install the OS on that disk?  No.  Note, we didn’t say “we shouldn’t”, but “we can’t”.

What we typically put on that disk is:

  • Page file
  • Temporary files (e.g. tempdb for SQL Server running on VM)
  • Caching files

Some VM series have quite a large temporary disk.  Take for instance the L series:

image

That VM series was specifically designed to work with Big Data workloads where data is replicated within a cluster (e.g. Hadoop, Cassandra, etc.).  Disk latency is key but not durability since the data is replicated around.

Unless you run such a workload, don’t rely on the temporary disk too much.

The major consequence here is:  add attached disks to your pricing.  See https://azure.microsoft.com/en-us/pricing/details/managed-disks/.

Storage Space

The pricing page is nice but to have a deeper conversation we’ll need to look at more VM specs.  We start our journey at https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-windows-sizes.  From there, depending on the “type” of VM we are interested in, we’re going to dive into one of the links, e.g. https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-windows-sizes-general.

The documentation repeats the specs we see on the pricing page, i.e. # of cores, RAM & local disk size, but also gives other specs:  max number of data disks, throughput, max number of NICs and network bandwidth.  Here we’ll focus on the maximum number of data disks.

A VM comes with an OS disk, a local disk and a set of optional data disks.  Depending on the VM SKU, the maximum number of data disks does vary.

At the time of this writing, the maximum size of a disk on a VM is 1TB.  We can have bigger volumes on the VM by striping multiple disks together in the VM’s OS.  But the biggest disk is 1TB.

For instance, a D1v2 (see https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-windows-sizes-general#dv2-series) can have 2 data disks on top of the OS disk.  That means a maximum of 3 TB if we max out each of the 3 disks, including the space for the OS.

So what if the D1v2 really is enough for our need in terms of # of cores and RAM but we need 4 TB of storage space?  Well, we’ll need to bump up to another VM SKU, a D2v2 for instance, which supports 4 data disks.
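The reasoning can be written down as a small calculation.  The sketch below only uses the figures quoted in this section (1 TB per disk, 2 data disks for a D1v2, 4 for a D2v2) and, like the example above, counts the OS disk as part of the usable space.  The function names are hypothetical.

```python
import math
from typing import List

# Maximum number of data disks per SKU, as quoted in this section.
MAX_DATA_DISKS = {"D1v2": 2, "D2v2": 4}


def data_disks_needed(required_tb: float, max_disk_tb: float = 1.0,
                      count_os_disk: bool = True) -> int:
    """Data disks needed to reach `required_tb`, each disk capped at `max_disk_tb`."""
    remaining_tb = required_tb - (max_disk_tb if count_os_disk else 0.0)
    return max(0, math.ceil(remaining_tb / max_disk_tb))


def skus_that_fit(required_tb: float) -> List[str]:
    """SKUs (among those listed above) that can attach enough data disks."""
    needed = data_disks_needed(required_tb)
    return [sku for sku, limit in MAX_DATA_DISKS.items() if limit >= needed]


if __name__ == "__main__":
    print(data_disks_needed(4))
    print(skus_that_fit(4))
```

For 4 TB we need 3 data disks on top of the OS disk, which rules out the D1v2 and leaves the D2v2.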

Attached Disks

Besides the temporary disk, all VM disks are attached disks.

Attached means they aren’t local to the VM’s host.  They are attached to the VM and backed by Azure storage.

Azure storage means three synchronous replicas, i.e. high resilience and high persistence.

Azure storage is its own complex topic with many variables, e.g. LRS / GRS / RA-GRS, Premium / Standard, Cool / Hot, etc.

Here we’ll discuss two dimensions:  Premium vs Standard & Managed vs Unmanaged disks.

We’ve explained what managed disks are in contrast with unmanaged disk in this article.  Going forward I recommend only managed disks.

Standard disks are backed by spinning physical disks while Premium disks are backed by Solid State Drive (SSD) disks.  In general:

  • Premium disk has higher IOPs than Standard disk
  • Premium disk has more consistent IOPs than Standard disk (Standard disk IOPs will vary)
  • Premium disk has higher availability (see Single VM SLA)
  • Premium disk is more expensive than Standard disk

So really, only the price will stop us from only using Premium disk.

In general:  IO intensive workloads (e.g. databases) should always be on Premium.  A single VM needs to be on Premium in order to have an SLA (again, see Single VM SLA).

For the pricing of disks, see https://azure.microsoft.com/en-us/pricing/details/managed-disks/.  Disks come in predefined sizes.

IOPs

We have our VM, the OS on it, we have the storage space but are the disks going to perform?

This is where Input / Output operations per second (IOPs) come into the picture.

An IO intensive workload (e.g. database) will consume IOPs from the VM disks.

Each disk comes with a number of IOPs.  In the pricing page (https://azure.microsoft.com/en-us/pricing/details/managed-disks/), the Premium disks, i.e. P10, P20 & P30, have documented IOPs of 500, 2300 & 5000 respectively.  Standard disks (at the time of this writing, March 2017), do not have IOPs documented but it is easy to find out by creating disks in the portal; for instance an S4 disk with 32 GB will have 500 IOPs & 60 MB/s throughput.

In order to get the total number of IOPs we need, we’ll simply select a set of disks that has the right total of IOPs.  For instance, for 20000 IOPs, we might choose 4 x P30, which we might expose to the OS as a single volume (by striping the disks) or not.  Again, we might need to oversize here.  For instance, we might need 20000 IOPs for a database of only 1TB but 4 x P30 will give us 4 TB of space.

Is that all?  Well, no.  Now that we have the IOPs we need, we have to make sure the VM can use those IOPs.  Let’s take the DSv2 series as an example (see https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-windows-sizes-general#dsv2-series).  A DS2v2 can have 4 data disks and can therefore accommodate our 4 x P30 disks, but it can only pull 8000 IOPs.  In order to get the full 20000 IOPs, we would need to oversize to a DS4v2.
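The two constraints, disks and VM, can be combined in a few lines.  The sketch below uses only the figures quoted above (Premium disk IOPs and the 8000 IOPs limit of the DS2v2).  The function names are hypothetical, and a real sizing exercise should take the limits from the current documentation.

```python
import math
from typing import List

# IOPs per Premium disk, as quoted above from the pricing page.
DISK_IOPS = {"P10": 500, "P20": 2300, "P30": 5000}

# IOPs limit per VM SKU; only the DS2v2 figure is quoted in this section.
VM_IOPS_LIMIT = {"DS2v2": 8000}


def disks_for_target(target_iops: int, disk_type: str = "P30") -> int:
    """Number of identical disks needed to provision at least `target_iops`."""
    return math.ceil(target_iops / DISK_IOPS[disk_type])


def effective_iops(disks: List[str], vm_sku: str) -> int:
    """IOPs actually available: the disks' total, capped by the VM's own limit."""
    provisioned = sum(DISK_IOPS[disk] for disk in disks)
    return min(provisioned, VM_IOPS_LIMIT[vm_sku])


if __name__ == "__main__":
    count = disks_for_target(20000)
    print(count)
    print(effective_iops(["P30"] * count, "DS2v2"))
```

Whichever of the two is smaller wins:  with 4 x P30 on a DS2v2, the VM limit is the bottleneck, hence the need for a bigger SKU.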

image

One last thing about IOPs:  what is it with those two columns cached / uncached disks?

When we attach a disk, we can choose from different caching options:  none, read-only & read-write.  Caching uses a part of the host resources to cache the disks’ content, which obviously accelerates operations.

Network bandwidth

A VM SKU also controls the network bandwidth of the VM.

There is no precisely documented bandwidth, nor any SLA.  Instead, categories are used:  Low, Moderate, High and Very High.  The network bandwidth capacity increases along those categories.

Again, we might need to oversize a VM in order to access higher network throughput if required.

Network Interface Controller (NIC)

Finally, each VM SKU sports a different maximum number of Network Interface Controllers (NICs).

Typically a VM is fine with one NIC.  Network appliances (e.g. virtual firewalls) will often require 2 NICs.

Summary

There are a few variables to consider when sizing a VM.  The number of cores & the amount of RAM are a good starting point, but you might need to oversize the VMs to satisfy other characteristics such as storage space, disk performance or network performance.

Cloud vs Hosting / Outsourcing

There is this recurring discussion with customers:  cloud is the new hosting.

While looking at the cloud from certain angles does bring out similarities, I will argue it is a bad analogy that is more likely to mislead you than help you.

I would argue further:  considering the cloud as a new form of outsourcing is confusing the technological innovation of the cloud with its business model innovation (made possible by the technological innovation).

Of course both hosting and cloud allow you to consume compute, storage & networking while not managing the details, but the basic economic premise is very different.

The outsourcing model typically starts off with a standard agreement and gets customized during negotiation.  Customization usually ends up with dedicated infrastructure and customer-specific SLAs.  With this come long-term contracts (usually multi-year), since the outsourcing partner has to invest in specifics for its customer.  The playbook from there is for the outsourcer to limit its costs to protect its margins.  Innovation typically stagnates and, as a consequence, customers often look for alternatives once the contract is up.

The cloud model is a hyper-scale model.  It starts with standards and ends with standards.  Standard hardware, standard processes & standard SLAs are at the heart of the cloud model.  It is a self-service model, not a negotiated one.  True multi-tenancy increases scalability and puts pressure on security requirements.  Security is baked into standard processes and the architecture.  There are usually no long-term agreements but either pay-per-use or short-term (e.g. one year) commitments:  it is consumption based, like utilities (e.g. water & electricity).  The playbook for the cloud provider is to deliver innovation constantly to facilitate consumption.

This usually seems a little strange to folks used to dealing with outsourcers:

  • Can I have access to your firewall?
  • Can I have a physical trace of the network packets?
  • Can you decrease the SLA of component A (and reduce the price) while increasing the SLA of component B?

But mostly, there often is a bit of disbelief:  how can you innovate (increase your cost), reduce your price (decrease your revenue) & not go out of business?

It sounds like a paradox because it is.  It is called the Jevons paradox and it isn’t new.  William Stanley Jevons wrote about the paradox that would bear his name in 1865 in a book called The Coal Question:

It is wholly a confusion of ideas to suppose that the economical use of fuel is equivalent to a diminished consumption. The very contrary is the truth.

What Jevons had put his finger on is that making a resource cheaper doesn’t imply that the total economic value of that resource goes down, but the contrary.  It is quite easy to see with our modern eyes:  if I could tomorrow make gasoline 90% cheaper for the foreseeable future, it is likely that we would start using gasoline more and more in areas where it is prohibitive today, not keep the consumption flat.

So much for the “technological innovation will get us out of the energy crisis”.

The same goes for the cloud and this is the business model disruption it brings to outsourcers hanging on to their aging assets.

This also explains why the cloud isn’t good for every workload on the face of the Earth.  Some workloads do benefit from customizations that an outsourcer / hosting environment can bring and that aren’t available in the cloud (and in some cases might never be).

 

Now just to bring some nuance to the standardization motto I brought up above:  public cloud does evolve with customer feedback.  In Azure, for instance, there are more and more VM SKUs, network configurations & an overall variety of services.  But this isn’t the result of one customer asking for a customization but of general feedback from multiple customers.

Creating an image with 2 Managed Disks for VM Scale Set

UPDATE (23-06-2017):  Fabio Hara, a colleague of mine from Brazil, has published the ARM template on his GitHub.  This makes it much easier to try the content of this article.  Thank you Fabio!

We talked about Managed Disks, now let’s use them.

Let’s create an image from an OS + Data disk & create a Scale Set with that image.

Deploy ARM Template

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "VM Admin User Name": {
      "defaultValue": "myadmin",
      "type": "string"
    },
    "VM Admin Password": {
      "defaultValue": null,
      "type": "securestring"
    },
    "VM Size": {
      "defaultValue": "Standard_DS4",
      "type": "string",
      "allowedValues": [
        "Standard_DS1",
        "Standard_DS2",
        "Standard_DS3",
        "Standard_DS4",
        "Standard_DS5"
      ],
      "metadata": {
        "description": "SKU of the VM."
      }
    },
    "Public Domain Label": {
      "type": "string"
    }
  },
  "variables": {
    "Vhds Container Name": "vhds",
    "frontIpRange": "10.0.1.0/24",
    "Public IP Name": "MyPublicIP",
    "Public LB Name": "PublicLB",
    "Front Address Pool Name": "frontPool",
    "Front NIC": "frontNic",
    "Front VM": "Demo-VM",
    "Front Availability Set Name": "frontAvailSet",
    "Private LB Name": "PrivateLB",
    "VNET Name": "Demo-VNet"
  },
  "resources": [
    {
      "type": "Microsoft.Network/publicIPAddresses",
      "name": "[variables('Public IP Name')]",
      "apiVersion": "2015-06-15",
      "location": "[resourceGroup().location]",
      "tags": {
        "displayName": "Public IP"
      },
      "properties": {
        "publicIPAllocationMethod": "Dynamic",
        "idleTimeoutInMinutes": 4,
        "dnsSettings": {
          "domainNameLabel": "[parameters('Public Domain Label')]"
        }
      }
    },
    {
      "type": "Microsoft.Network/virtualNetworks",
      "name": "[variables('VNet Name')]",
      "apiVersion": "2016-03-30",
      "location": "[resourceGroup().location]",
      "properties": {
        "addressSpace": {
          "addressPrefixes": [
            "10.0.0.0/16"
          ]
        },
        "subnets": [
          {
            "name": "front",
            "properties": {
              "addressPrefix": "[variables('frontIpRange')]",
              "networkSecurityGroup": {
                "id": "[resourceId('Microsoft.Network/networkSecurityGroups', 'frontNsg')]"
              }
            }
          }
        ]
      },
      "resources": [],
      "dependsOn": [
        "[resourceId('Microsoft.Network/networkSecurityGroups', 'frontNsg')]"
      ]
    },
    {
      "type": "Microsoft.Network/loadBalancers",
      "name": "[variables('Public LB Name')]",
      "apiVersion": "2015-06-15",
      "location": "[resourceGroup().location]",
      "tags": {
        "displayName": "Public Load Balancer"
      },
      "properties": {
        "frontendIPConfigurations": [
          {
            "name": "LoadBalancerFrontEnd",
            "comments": "Front end of LB:  the IP address",
            "properties": {
              "publicIPAddress": {
                "id": "[resourceId('Microsoft.Network/publicIPAddresses/', variables('Public IP Name'))]"
              }
            }
          }
        ],
        "backendAddressPools": [
          {
            "name": "[variables('Front Address Pool Name')]"
          }
        ],
        "loadBalancingRules": [
          {
            "name": "Http",
            "properties": {
              "frontendIPConfiguration": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/frontendIPConfigurations/LoadBalancerFrontEnd')]"
              },
              "frontendPort": 80,
              "backendPort": 80,
              "enableFloatingIP": false,
              "idleTimeoutInMinutes": 4,
              "protocol": "Tcp",
              "loadDistribution": "Default",
              "backendAddressPool": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/backendAddressPools/', variables('Front Address Pool Name'))]"
              },
              "probe": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/probes/TCP-Probe')]"
              }
            }
          }
        ],
        "probes": [
          {
            "name": "TCP-Probe",
            "properties": {
              "protocol": "Tcp",
              "port": 80,
              "intervalInSeconds": 5,
              "numberOfProbes": 2
            }
          }
        ],
        "inboundNatRules": [
          {
            "name": "SSH-2-Primary",
            "properties": {
              "frontendIPConfiguration": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/frontendIPConfigurations/LoadBalancerFrontEnd')]"
              },
              "frontendPort": 22,
              "backendPort": 22,
              "protocol": "Tcp"
            }
          }
        ],
        "outboundNatRules": [],
        "inboundNatPools": []
      },
      "dependsOn": [
        "[resourceId('Microsoft.Network/publicIPAddresses', variables('Public IP Name'))]"
      ]
    },
    {
      "apiVersion": "2015-06-15",
      "name": "frontNsg",
      "type": "Microsoft.Network/networkSecurityGroups",
      "location": "[resourceGroup().location]",
      "tags": {},
      "properties": {
        "securityRules": [
          {
            "name": "Allow-SSH-From-Everywhere",
            "properties": {
              "protocol": "Tcp",
              "sourcePortRange": "*",
              "destinationPortRange": "22",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 100,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-Health-Monitoring",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "AzureLoadBalancer",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 200,
              "direction": "Inbound"
            }
          },
          {
            "name": "Disallow-everything-else-Inbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 300,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-to-VNet",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "VirtualNetwork",
              "access": "Allow",
              "priority": 100,
              "direction": "Outbound"
            }
          },
          {
            "name": "Allow-to-8443",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "8443",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "Internet",
              "access": "Allow",
              "priority": 200,
              "direction": "Outbound"
            }
          },
          {
            "name": "Disallow-everything-else-Outbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 300,
              "direction": "Outbound"
            }
          }
        ],
        "subnets": []
      }
    },
    {
      "type": "Microsoft.Network/networkInterfaces",
      "name": "[variables('Front NIC')]",
      "tags": {
        "displayName": "Front NICs"
      },
      "apiVersion": "2016-03-30",
      "location": "[resourceGroup().location]",
      "properties": {
        "ipConfigurations": [
          {
            "name": "ipconfig",
            "properties": {
              "privateIPAllocationMethod": "Dynamic",
              "subnet": {
                "id": "[concat(resourceId('Microsoft.Network/virtualNetworks', variables('VNet Name')), '/subnets/front')]"
              },
              "loadBalancerBackendAddressPools": [
                {
                  "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/backendAddressPools/', variables('Front Address Pool Name'))]"
                }
              ],
              "loadBalancerInboundNatRules": [
                {
                  "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/inboundNatRules/SSH-2-Primary')]"
                }
              ]
            }
          }
        ],
        "dnsSettings": {
          "dnsServers": []
        },
        "enableIPForwarding": false
      },
      "resources": [],
      "dependsOn": [
        "[resourceId('Microsoft.Network/virtualNetworks', variables('VNet Name'))]",
        "[resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name'))]"
      ]
    },
    {
      "type": "Microsoft.Compute/disks",
      "name": "[concat(variables('Front VM'), '-data')]",
      "apiVersion": "2016-04-30-preview",
      "location": "[resourceGroup().location]",
      "properties": {
        "creationData": {
          "createOption": "Empty"
        },
        "accountType": "Premium_LRS",
        "diskSizeGB": 32
      }
    },
    {
      "type": "Microsoft.Compute/virtualMachines",
      "name": "[variables('Front VM')]",
      "tags": {
        "displayName": "Front VMs"
      },
      "apiVersion": "2016-04-30-preview",
      "location": "[resourceGroup().location]",
      "properties": {
        "availabilitySet": {
          "id": "[resourceId('Microsoft.Compute/availabilitySets', variables('Front Availability Set Name'))]"
        },
        "hardwareProfile": {
          "vmSize": "[parameters('VM Size')]"
        },
        "storageProfile": {
          "imageReference": {
            "publisher": "OpenLogic",
            "offer": "CentOS",
            "sku": "7.3",
            "version": "latest"
          },
          "osDisk": {
            "name": "[variables('Front VM')]",
            "createOption": "FromImage",
            "caching": "ReadWrite"
          },
          "dataDisks": [
            {
              "lun": 2,
              "name": "[concat(variables('Front VM'), '-data')]",
              "createOption": "attach",
              "managedDisk": {
                "id": "[resourceId('Microsoft.Compute/disks', concat(variables('Front VM'), '-data'))]"
              },
              "caching": "Readonly"
            }
          ]
        },
        "osProfile": {
          "computerName": "[variables('Front VM')]",
          "adminUsername": "[parameters('VM Admin User Name')]",
          "adminPassword": "[parameters('VM Admin Password')]"
        },
        "networkProfile": {
          "networkInterfaces": [
            {
              "id": "[resourceId('Microsoft.Network/networkInterfaces', variables('Front NIC'))]"
            }
          ]
        }
      },
      "resources": [],
      "dependsOn": [
        "[resourceId('Microsoft.Network/networkInterfaces', variables('Front NIC'))]",
        "[resourceId('Microsoft.Compute/availabilitySets', variables('Front Availability Set Name'))]",
        "[resourceId('Microsoft.Compute/disks', concat(variables('Front VM'), '-data'))]"
      ]
    },
    {
      "name": "[variables('Front Availability Set Name')]",
      "type": "Microsoft.Compute/availabilitySets",
      "location": "[resourceGroup().location]",
      "apiVersion": "2016-04-30-preview",
      "tags": {
        "displayName": "FrontAvailabilitySet"
      },
      "properties": {
        "platformUpdateDomainCount": 5,
        "platformFaultDomainCount": 3,
        "managed": true
      },
      "dependsOn": []
    }
  ]
}
  1. We use the resource group named md-demo-image
  2. This deploys a single Linux VM into a managed availability set using a premium managed disk
  3. The VM has both OS & a data disk
  4. The deployment takes a few minutes

Customize VM

  1. Log in to the VM
      1. We suggest using the PuTTY tool with SSH (the SSH port is opened on the NSG)
      2. Look at MyPublicIP to find the DNS name of the public IP in order to SSH to it
  2. Mount the data disk
      1. The data disk is LUN-2 (it should be /dev/sdc)
      2. We will mount it to /data
      3. Write the mount point permanently in /etc/fstab
  • In the bash shell, type
    cd /data
    sudo touch mydata
    ls
  • We just created a file on the data disk
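The mounting steps above are only described, so here is a minimal sketch of what they could look like.  It assumes the data disk really is /dev/sdc (worth confirming with lsblk first), that the whole device is formatted as ext4 without a partition table, and that the nofail option is wanted; the function names fstab_entry and mount_data_disk are ours.  Nothing runs until mount_data_disk is called explicitly.

```shell
# Build the /etc/fstab line for a filesystem identified by UUID.
# nofail lets the VM boot even if the data disk is missing.
# $1 = filesystem UUID, $2 = mount point, $3 = filesystem type
fstab_entry() {
  printf 'UUID=%s %s %s defaults,nofail 0 2\n' "$1" "$2" "$3"
}

# Format, mount and persist the data disk.
# Destructive: meant to be run once, on the VM, on an empty disk.
# $1 = device (assumed /dev/sdc), $2 = mount point (e.g. /data)
mount_data_disk() {
  device=$1
  mount_point=$2
  # -F: accept a whole-disk device that is not a partition
  sudo mkfs.ext4 -F "$device" &&
  sudo mkdir -p "$mount_point" &&
  sudo mount "$device" "$mount_point" &&
  uuid=$(sudo blkid -s UUID -o value "$device") &&
  fstab_entry "$uuid" "$mount_point" ext4 | sudo tee -a /etc/fstab
}

# Usage on the VM:
#   mount_data_disk /dev/sdc /data
```

Referencing the disk by UUID rather than by device name in /etc/fstab is a design choice:  device names such as /dev/sdc can change between boots, while the UUID stays with the filesystem.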

Login into ISE

  1. Open up PowerShell ISE
  2. Type Add-AzureRmAccount
  3. Enter your credentials; those credentials should be the same ones you are using to log into the Azure Portal
  4. If you have more than one subscription
      1. Type Get-AzureRmSubscription
      2. This should list all the subscriptions you have access to (even partial access)
      3. Select the SubscriptionId (a GUID) of the subscription you want to use
      4. Type Select-AzureRmSubscription -SubscriptionId <SubscriptionId>
        <SubscriptionId> is the value you just selected
      5. This will select the specified subscription as the “current one”, i.e. future queries will be done against that subscription

Create Image

You can read about details of this procedure at https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-linux-capture-image & https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-windows-capture-image-resource.

  1. In bash shell, type
    sudo waagent -deprovision+user -force
  2. This de-provisions the VM itself
  3. In PowerShell, type
    $rgName = "md-demo-image"
    $imageName = "Demo-VM-Image"
    $vm = Get-AzureRmVM -ResourceGroupName $rgName
    Stop-AzureRmVM -ResourceGroupName $rgName -Name $vm.Name -Force
  4. This will stop the VM
  5. In PowerShell, type
    Set-AzureRmVm -ResourceGroupName $rgName -Name $vm.Name -Generalized
  6. This generalizes the VM
  7. In PowerShell, type
    $imageConfig = New-AzureRmImageConfig -Location $vm.Location -SourceVirtualMachineId $vm.Id
    New-AzureRmImage -ImageName $imageName -ResourceGroupName $rgName -Image $imageConfig
  8. This creates an image resource containing both the OS & data disks
  9. We can see the image in the portal and validate it has two disks in it

Clean up VM

In order to install a Scale Set in the same availability set, we need to remove the VM.

  1. In PowerShell, type
    Remove-AzureRmVM -ResourceGroupName $rgName -Name $vm.Name -Force
    Remove-AzureRmNetworkInterface -ResourceGroupName $rgName -Name frontNic -Force
    Remove-AzureRmAvailabilitySet -ResourceGroupName $rgName -Name frontAvailSet -Force
  2. Optionally, we can remove the disks
    Remove-AzureRmDisk -ResourceGroupName $rgName -DiskName Demo-VM -Force
    Remove-AzureRmDisk -ResourceGroupName $rgName -DiskName Demo-VM-data -Force
  3. Finally, we remove the load balancer
    Remove-AzureRmLoadBalancer -ResourceGroupName $rgName -Name PublicLB -Force

Deploy Scale Set

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "VM Admin User Name": {
      "defaultValue": "myadmin",
      "type": "string"
    },
    "VM Admin Password": {
      "defaultValue": null,
      "type": "securestring"
    },
    "Instance Count": {
      "defaultValue": 3,
      "type": "int"
    },
    "VM Size": {
      "defaultValue": "Standard_DS4",
      "type": "string",
      "allowedValues": [
        "Standard_DS1",
        "Standard_DS2",
        "Standard_DS3",
        "Standard_DS4",
        "Standard_DS5"
      ],
      "metadata": {
        "description": "SKU of the VM."
      }
    },
    "Public Domain Label": {
      "type": "string"
    }
  },
  "variables": {
    "frontIpRange": "10.0.1.0/24",
    "Public IP Name": "MyPublicIP",
    "Public LB Name": "PublicLB",
    "Front Address Pool Name": "frontPool",
    "Front Nat Pool Name": "frontNatPool",
    "VNET Name": "Demo-VNet",
    "NIC Prefix": "Nic",
    "Scale Set Name": "Demo-ScaleSet",
    "Image Name": "Demo-VM-Image",
    "VM Prefix": "Demo-VM",
    "IP Config Name": "ipConfig"
  },
  "resources": [
    {
      "type": "Microsoft.Network/publicIPAddresses",
      "name": "[variables('Public IP Name')]",
      "apiVersion": "2015-06-15",
      "location": "[resourceGroup().location]",
      "tags": {
        "displayName": "Public IP"
      },
      "properties": {
        "publicIPAllocationMethod": "Dynamic",
        "idleTimeoutInMinutes": 4,
        "dnsSettings": {
          "domainNameLabel": "[parameters('Public Domain Label')]"
        }
      }
    },
    {
      "type": "Microsoft.Network/virtualNetworks",
      "name": "[variables('VNet Name')]",
      "apiVersion": "2016-03-30",
      "location": "[resourceGroup().location]",
      "properties": {
        "addressSpace": {
          "addressPrefixes": [
            "10.0.0.0/16"
          ]
        },
        "subnets": [
          {
            "name": "front",
            "properties": {
              "addressPrefix": "[variables('frontIpRange')]",
              "networkSecurityGroup": {
                "id": "[resourceId('Microsoft.Network/networkSecurityGroups', 'frontNsg')]"
              }
            }
          }
        ]
      },
      "resources": [],
      "dependsOn": [
        "[resourceId('Microsoft.Network/networkSecurityGroups', 'frontNsg')]"
      ]
    },
    {
      "type": "Microsoft.Network/loadBalancers",
      "name": "[variables('Public LB Name')]",
      "apiVersion": "2015-06-15",
      "location": "[resourceGroup().location]",
      "tags": {
        "displayName": "Public Load Balancer"
      },
      "properties": {
        "frontendIPConfigurations": [
          {
            "name": "LoadBalancerFrontEnd",
            "comments": "Front end of LB:  the IP address",
            "properties": {
              "publicIPAddress": {
                "id": "[resourceId('Microsoft.Network/publicIPAddresses/', variables('Public IP Name'))]"
              }
            }
          }
        ],
        "backendAddressPools": [
          {
            "name": "[variables('Front Address Pool Name')]"
          }
        ],
        "loadBalancingRules": [],
        "probes": [
          {
            "name": "TCP-Probe",
            "properties": {
              "protocol": "Tcp",
              "port": 80,
              "intervalInSeconds": 5,
              "numberOfProbes": 2
            }
          }
        ],
        "inboundNatPools": [
          {
            "name": "[variables('Front Nat Pool Name')]",
            "properties": {
              "frontendIPConfiguration": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/frontendIPConfigurations/loadBalancerFrontEnd')]"
              },
              "protocol": "tcp",
              "frontendPortRangeStart": 5000,
              "frontendPortRangeEnd": 5200,
              "backendPort": 22
            }
          }
        ]
      },
      "dependsOn": [
        "[resourceId('Microsoft.Network/publicIPAddresses', variables('Public IP Name'))]"
      ]
    },
    {
      "apiVersion": "2015-06-15",
      "name": "frontNsg",
      "type": "Microsoft.Network/networkSecurityGroups",
      "location": "[resourceGroup().location]",
      "tags": {},
      "properties": {
        "securityRules": [
          {
            "name": "Allow-SSH-From-Everywhere",
            "properties": {
              "protocol": "Tcp",
              "sourcePortRange": "*",
              "destinationPortRange": "22",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 100,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-Health-Monitoring",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "AzureLoadBalancer",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 200,
              "direction": "Inbound"
            }
          },
          {
            "name": "Disallow-everything-else-Inbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 300,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-to-VNet",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "VirtualNetwork",
              "access": "Allow",
              "priority": 100,
              "direction": "Outbound"
            }
          },
          {
            "name": "Allow-to-8443",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "8443",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "Internet",
              "access": "Allow",
              "priority": 200,
              "direction": "Outbound"
            }
          },
          {
            "name": "Disallow-everything-else-Outbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 300,
              "direction": "Outbound"
            }
          }
        ],
        "subnets": []
      }
    },
    {
      "type": "Microsoft.Compute/virtualMachineScaleSets",
      "name": "[variables('Scale Set Name')]",
      "location": "[resourceGroup().location]",
      "apiVersion": "2016-04-30-preview",
      "dependsOn": [
        "[concat('Microsoft.Network/loadBalancers/', variables('Public LB Name'))]",
        "[concat('Microsoft.Network/virtualNetworks/', variables('VNET Name'))]"
      ],
      "sku": {
        "name": "[parameters('VM Size')]",
        "tier": "Standard",
        "capacity": "[parameters('Instance Count')]"
      },
      "properties": {
        "overprovision": "true",
        "upgradePolicy": {
          "mode": "Manual"
        },
        "virtualMachineProfile": {
          "storageProfile": {
            "osDisk": {
              "createOption": "FromImage",
              "managedDisk": {
                "storageAccountType": "Premium_LRS"
              }
            },
            "imageReference": {
              "id": "[resourceId('Microsoft.Compute/images', variables('Image Name'))]"
            },
            "dataDisks": [
              {
                "createOption": "FromImage",
                "lun": "2",
                "managedDisk": {
                  "storageAccountType": "Premium_LRS"
                }
              }
            ]
          },
          "osProfile": {
            "computerNamePrefix": "[variables('VM Prefix')]",
            "adminUsername": "[parameters('VM Admin User Name')]",
            "adminPassword": "[parameters('VM Admin Password')]"
          },
          "networkProfile": {
            "networkInterfaceConfigurations": [
              {
                "name": "[variables('NIC Prefix')]",
                "properties": {
                  "primary": "true",
                  "ipConfigurations": [
                    {
                      "name": "[variables('IP Config Name')]",
                      "properties": {
                        "subnet": {
                          "id": "[concat('/subscriptions/', subscription().subscriptionId,'/resourceGroups/', resourceGroup().name, '/providers/Microsoft.Network/virtualNetworks/', variables('VNET Name'), '/subnets/front')]"
                        },
                        "loadBalancerBackendAddressPools": [
                          {
                            "id": "[concat('/subscriptions/', subscription().subscriptionId,'/resourceGroups/', resourceGroup().name, '/providers/Microsoft.Network/loadBalancers/', variables('Public LB Name'), '/backendAddressPools/', variables('Front Address Pool Name'))]"
                          }
                        ],
                        "loadBalancerInboundNatPools": [
                          {
                            "id": "[concat('/subscriptions/', subscription().subscriptionId,'/resourceGroups/', resourceGroup().name, '/providers/Microsoft.Network/loadBalancers/', variables('Public LB Name'), '/inboundNatPools/', variables('Front Nat Pool Name'))]"
                          }
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      }
    }
  ]
}

You can choose the number of instances; by default there are 3.

Validate Instance

  1. Connect to the first instance available using SSH on port 5000 of the public IP
  2. SSH is NATed from front-end ports 5000 and up (5000 to 5200 in the template) to back-end port 22
  3. In the bash shell, type
    ls /data
  4. You should see “mydata”, hence the image carried both the OS & data disks
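For the connection step, a small sketch of the SSH command line may help.  The DNS name below is hypothetical (it follows the usual <Public Domain Label>.<region>.cloudapp.azure.com pattern), myadmin is the template’s default admin user name, and the helper name ssh_command is ours.  Which front-end port lands on which instance is not something the template fixes, so if port 5000 does not answer, the next ports in the range are worth trying.

```shell
# Build the SSH command for a scale set instance reached through the
# load balancer NAT pool.
# $1 = admin user, $2 = public DNS name, $3 = front-end port (5000 to 5200 in the template)
ssh_command() {
  printf 'ssh -p %s %s@%s\n' "$3" "$1" "$2"
}

# Hypothetical DNS name; replace it with the one shown on MyPublicIP.
ssh_command myadmin mydemo.eastus.cloudapp.azure.com 5000
```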

Clean up

We won’t be using the resource group we have created, so we can delete it.

In ISE, type Remove-AzureRmResourceGroup -Name md-demo-image -Force