Tag Archives: Virtual Machines

Virtual Machines in Azure (IaaS) ; Azure Backup, Site Recovery & Batch

Azure Managed Disk–Overview


Microsoft released Azure Managed Disks 2 weeks ago.  Let's look at them!

What did we have until then?  The virtual hard disk (.vhd file) was stored as a page blob in an Azure Storage account.

That worked quite well, and Azure Managed Disks are little more than that:  a thin abstraction.  But at the same time, Azure now knows it's a disk and can hence optimize for it.

Issues with unmanaged disks

That’s right, our good old page blob VHD is now an unmanaged disk.  This sounds like 2001, when Microsoft released .NET & managed code and you learned that all the code you had been writing until then was unmanaged.  Unruly!

Let’s look at the different issues.

First that comes to mind is Input / Output Operations per Second (IOPS).  A storage account tops out at 20,000 IOPS.  An unmanaged standard disk can have 500 IOPS.  That means that after 40 disks in a storage account, if we only have disks in there, we’ll start to get throttled.  This doesn’t sound too bad if we plan to run 2-3 VMs, but for larger deployments we need to be careful.  Of course, we could choose to put each VHD in a different storage account, but a subscription is limited to 100 storage accounts, and it adds to management (managing the domain names & access keys of 100 accounts, for instance).

Another one is access rights.  If we put more than one disk in a storage account, we can’t give different people access to different disks:  anybody who is a contributor on the storage account has access to all the disks in the account.

A painful one is around custom images.  Say we customize a Windows or Linux image and have our generalized VHD ready to fire up VMs.  That VHD needs to be in the same storage account as the VHDs of the created VMs.  That means we can really only create about 40 VMs from it.  That’s where the limitation on VM scale sets with custom images comes from.

A side effect of being in a storage account is that the VHD is publicly accessible.  You still need a SAS token or an access key, but that’s the thing:  for industries with strict regulations / compliance / audits, the idea that somebody who walked out with your access key, even after being fired and having their logins disabled, could still download and even change your VHD is a deal breaker.

Finally, one that few people are aware of:  reliability.  Storage accounts are highly available and keep 3 synchronous copies.  They have an SLA of 99.9%.  The problem is when we match them with VMs.  We can set up high availability of a VM set by defining an availability set:  this gives some guarantees on how our VMs are affected during planned / unplanned downtime.  Now, 2 VMs can be set to be in two different fault domains, i.e. they are deployed on different hosts and don’t share any critical hardware (e.g. power supply, network switch, etc.), but…  their VHDs might be on the same storage stamp (or cluster).  So if a storage stamp goes down for some reason, two VMs with different fault / update domains could go down at the same time.  If those are our only two VMs in the availability set, the whole set goes down.

Managed Disks

Managed disks are simply page blobs stored in a Microsoft managed storage account.  On the surface, not much of a change, right?

Well…  let’s address each issue we’ve identified:

  • IOPS:  disks are assigned to different storage accounts in such a way that we never get throttled because of storage account limits.
  • Access Rights:  managed disks are first class citizens in Azure.  That means they appear as Azure resources and can have RBAC permissions assigned to them.
  • Custom Images:  besides managed disks, we now have snapshots and images as first class citizens.  An image no longer belongs to a storage account, which removes the constraint we had before.
  • Public Access:  disks aren’t publicly accessible.  The only way to access them is via a SAS token.  This also means we do not need to invent a globally unique domain name.
  • Reliability:  when we associate a disk with a VM in an availability set, Azure makes sure that VMs in different fault domains aren’t on the same storage stamp.
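To make the “first class citizen” point concrete, here is a minimal PowerShell sketch, assuming the AzureRM cmdlets that shipped alongside Managed Disks ; the location, resource group, disk name and user are hypothetical:

# Describe an empty 128 GB Premium managed disk ; no storage account involved
$diskConfig = New-AzureRmDiskConfig `
    -Location "eastus" `
    -AccountType PremiumLRS `
    -CreateOption Empty `
    -DiskSizeGB 128

# Create the disk as a stand-alone ARM resource
$disk = New-AzureRmDisk `
    -ResourceGroupName "demo-rg" `
    -DiskName "data-disk-01" `
    -Disk $diskConfig

# Since the disk is an ARM resource, RBAC can be scoped to this one disk
New-AzureRmRoleAssignment `
    -SignInName "alice@contoso.com" `
    -RoleDefinitionName "Contributor" `
    -Scope $disk.Id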

Other differences

Besides the obvious advantages, here is a list of differences from unmanaged disks:

  • Managed disks can be on both Premium & Standard storage, but only LRS
  • Standard managed disks are priced according to the closest pre-defined fixed size, not the “currently used # of GBs”
  • Standard managed disks still charge for transactions

Also, quite importantly, Managed Disks do not support Storage Service Encryption at the time of this writing (February 2017).  It is supposed to come very soon though, and Managed Disks do support encrypted disks (Azure Disk Encryption).

Summary

Managed Disks bring a couple of goodies with them.  The most significant one is reliability, but the other features will clearly make our lives easier.

In future articles, I’ll do a couple of hands-on walkthroughs with Azure Managed Disks.

Joining an ARM Linux VM to AAD Domain Services

Active Directory is one of the most popular domain controllers / LDAP servers around.

In Azure we have Azure Active Directory (AAD).  Despite the name, AAD isn’t just a multi-tenant AD.  It is built for the cloud.

Sometimes though, it is useful to have a traditional domain controller…  in the cloud.  Typically this is with legacy workloads built to work with Active Directory.  But also, a very common scenario is to join an Azure VM to a domain so that users authenticate on it with the same accounts they authenticate to use the Azure Portal.

The underlying directory could even be synced with a corporate network, in which case users could log into the VMs using their corporate accounts.  I won’t cover this here, but you can read about it in a previous article (the AD Connect part).

The straightforward option is to build an Active Directory cluster on Azure VMs.  This will work, but requires building and maintaining those VMs (typically 2 of them).


An easier option is AAD Domain Services (AADDS).  AADDS exposes an AAD tenant as a managed domain service.  It does so by provisioning some variant of a managed Active Directory cluster, i.e. we do not see or care about the underlying VMs.

The cluster is synchronized one-way (from AAD to AADDS).  For this reason, AADDS is read-only through its LDAP interface, e.g. we can’t reset a password using LDAP.

The Azure documentation walks us through such an integration with classic (ASM) VMs.  Since ARM has been around for more than a year, I recommend always going with ARM VMs.  This article aims at showing how to do just that.

I’ll heavily leverage the existing documentation and detail only what differs from the official walkthrough.

Also, keep in mind this article is written in January 2017.  Azure AD will transition to ARM in the future and will likely render this article obsolete.

Dual Networks

The main challenge we face is that AAD is an ASM service and AAD Domain Services are exposed within an ASM Virtual Network (VNET), which is incompatible with our ARM VM.

Thankfully, we now have Virtual Network peering allowing us to peer an ASM and an ARM VNET together so they can act as if they were one.


Peering

As with all VNET peering, the two VNETs must have mutually exclusive IP address spaces.

I created two VNETs in the portal (https://portal.azure.com).  I recommend creating them explicitly in the new portal:  this way, even the classic one will be part of the desired resource group.

The classic one has 10.0.0.0/24 address range while the ARM one has 10.1.0.0/24.

The peering can be done from the portal too.  In the Virtual Network pane (say the ARM one), select Peerings:


We should see an empty list here, so let’s click Add.

We need to give the peering a name.  Let’s type PeeringToDomainServices.

We then select Classic in the Peer details since we want to peer with a classic VNET.

Finally, we’ll click on Choose a Virtual Network.


From there we should see the classic VNET we created.
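The peering can be scripted as well.  Here is a hedged PowerShell sketch (subscription ID, resource group & VNET names are placeholders) ; note how the classic VNET is referenced through its Microsoft.ClassicNetwork resource ID:

# Grab the ARM VNET we want to peer from
$armVnet = Get-AzureRmVirtualNetwork `
    -ResourceGroupName "aadds-demo" `
    -Name "arm-vnet"

# Peer it with the classic (ASM) VNET, referenced by resource ID
Add-AzureRmVirtualNetworkPeering `
    -Name "PeeringToDomainServices" `
    -VirtualNetwork $armVnet `
    -RemoteVirtualNetworkId ("/subscriptions/<subscription id>/resourceGroups/aadds-demo" + `
        "/providers/Microsoft.ClassicNetwork/virtualNetworks/classic-vnet")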

Configuring AADDS

The online documentation is quite good for this.

Just make sure you select the classic VNET we created.

You can give a domain name that is different from the AAD domain name (i.e. *.onmicrosoft.com).

Enabling AADDS takes up to 30 minutes.  Don’t hold your breath!

Joining a VM

We can create a Linux VM, put it in the ARM VNET we created, and join it to the AADDS domain now.

Again, the online documentation does a great job of walking us through the process.  The documentation is written for Red Hat.

When I tried it, I used a CentOS VM and I ended up using slightly different commands, namely the realm command (from the realmd package), ignoring the SAMBA part of the article.

Conclusion

It is fairly straightforward to enable Domain Services in AAD and well documented.

A challenge we currently have is to join, or simply communicate from, an ARM VM to AADDS.  For this we need two networks, a classic (ASM) one and an ARM one, and we need to peer them together.

Troubleshooting NSGs using Diagnostic Logs

I’ve written about how to use Network Security Groups (NSG) before.

Chances are, once you get a complicated enough set of rules in a NSG, you’ll find yourself with NSGs that do not do what you think they should do.

Troubleshooting NSGs isn’t trivial.

I’ll try to give some guidance here but to this date (January 2017), there is no tool where you can just say “please follow packet X and tell me against which wall it bumps”.  It’s more indirect than that.

 

Connectivity

First thing, make sure you can connect to your VNET.

If you are connecting to a VM via a public IP, make sure you have access to that IP (i.e. you’re not sitting behind an on premise firewall blocking the outgoing port you are trying to use) and that the IP is connected to the VM, either directly or via a Load Balancer.

If you are connecting to a VM via a private IP through a VPN Gateway of some sort, make sure you can connect and that your packets are routed to the gateway and from there they get routed to the proper subnet.

An easy way to make sure of that is to remove all NSGs and replace them with a single “let everything in” rule.  Of course, that also opens your workloads to hackers, so I recommend doing it with a test VM that you destroy afterwards.

Diagnose

Then I would recommend going through the official Azure guidelines to troubleshoot NSGs.  They walk you through the different diagnosis tools.

Diagnostic Logs

If you’ve reached this section and haven’t achieved greatness yet, well…  you need something else.

What we’re going to do here is use NSG Diagnostic Logs to understand a bit more what is going on.

By no means is this magic and especially in an environment already in use where a lot of traffic is occurring, it might be difficult to make sense of what the logs are going to give us.

Nevertheless, the logs give us a picture of what is really happening.  They are aggregated though, so we won’t see our PC’s IP address, for instance.  The aggregation is probably what limits the logs’ effectiveness the most.

Sample configuration

I provide here a sample configuration I’m going to use to walk through the troubleshooting process.

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "VM Admin User Name": {
      "defaultValue": "myadmin",
      "type": "string"
    },
    "VM Admin Password": {
      "defaultValue": null,
      "type": "securestring"
    },
    "Disk Storage Account Name": {
      "defaultValue": "<your prefix>vmpremium",
      "type": "string"
    },
    "Log Storage Account Name": {
      "defaultValue": "<your prefix>logstandard",
      "type": "string"
    },
    "VM Size": {
      "defaultValue": "Standard_DS2",
      "type": "string",
      "allowedValues": [
        "Standard_DS1",
        "Standard_DS2",
        "Standard_DS3"
      ],
      "metadata": {
        "description": "SKU of the VM."
      }
    },
    "Public Domain Label": {
      "type": "string"
    }
  },
  "variables": {
    "Vhds Container Name": "vhds",
    "VNet Name": "MyVNet",
    "Ip Range": "10.0.1.0/24",
    "Public IP Name": "MyPublicIP",
    "Public LB Name": "PublicLB",
    "Address Pool Name": "addressPool",
    "Subnet NSG Name": "subnetNSG",
    "VM NSG Name": "vmNSG",
    "RDP NAT Rule Name": "RDP",
    "NIC Name": "MyNic",
    "VM Name": "MyVM"
  },
  "resources": [
    {
      "type": "Microsoft.Network/publicIPAddresses",
      "name": "[variables('Public IP Name')]",
      "apiVersion": "2015-06-15",
      "location": "[resourceGroup().location]",
      "tags": {
        "displayName": "Public IP"
      },
      "properties": {
        "publicIPAllocationMethod": "Dynamic",
        "idleTimeoutInMinutes": 4,
        "dnsSettings": {
          "domainNameLabel": "[parameters('Public Domain Label')]"
        }
      }
    },
    {
      "type": "Microsoft.Network/loadBalancers",
      "name": "[variables('Public LB Name')]",
      "apiVersion": "2015-06-15",
      "location": "[resourceGroup().location]",
      "tags": {
        "displayName": "Public Load Balancer"
      },
      "properties": {
        "frontendIPConfigurations": [
          {
            "name": "LoadBalancerFrontEnd",
            "comments": "Front end of LB:  the IP address",
            "properties": {
              "publicIPAddress": {
                "id": "[resourceId('Microsoft.Network/publicIPAddresses/', variables('Public IP Name'))]"
              }
            }
          }
        ],
        "backendAddressPools": [
          {
            "name": "[variables('Address Pool Name')]"
          }
        ],
        "loadBalancingRules": [
          {
            "name": "Http",
            "properties": {
              "frontendIPConfiguration": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/frontendIPConfigurations/LoadBalancerFrontEnd')]"
              },
              "frontendPort": 80,
              "backendPort": 80,
              "enableFloatingIP": false,
              "idleTimeoutInMinutes": 4,
              "protocol": "Tcp",
              "loadDistribution": "Default",
              "backendAddressPool": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/backendAddressPools/', variables('Address Pool Name'))]"
              },
              "probe": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/probes/TCP-Probe')]"
              }
            }
          }
        ],
        "probes": [
          {
            "name": "TCP-Probe",
            "properties": {
              "protocol": "Tcp",
              "port": 80,
              "intervalInSeconds": 5,
              "numberOfProbes": 2
            }
          }
        ],
        "inboundNatRules": [
          {
            "name": "[variables('RDP NAT Rule Name')]",
            "properties": {
              "frontendIPConfiguration": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/frontendIPConfigurations/LoadBalancerFrontEnd')]"
              },
              "frontendPort": 3389,
              "backendPort": 3389,
              "protocol": "Tcp"
            }
          }
        ],
        "outboundNatRules": [],
        "inboundNatPools": []
      },
      "dependsOn": [
        "[resourceId('Microsoft.Network/publicIPAddresses', variables('Public IP Name'))]"
      ]
    },
    {
      "type": "Microsoft.Network/virtualNetworks",
      "name": "[variables('VNet Name')]",
      "apiVersion": "2016-03-30",
      "location": "[resourceGroup().location]",
      "properties": {
        "addressSpace": {
          "addressPrefixes": [
            "10.0.0.0/16"
          ]
        },
        "subnets": [
          {
            "name": "default",
            "properties": {
              "addressPrefix": "[variables('Ip Range')]",
              "networkSecurityGroup": {
                "id": "[resourceId('Microsoft.Network/networkSecurityGroups', variables('Subnet NSG Name'))]"
              }
            }
          }
        ]
      },
      "resources": [],
      "dependsOn": [
        "[resourceId('Microsoft.Network/networkSecurityGroups', variables('Subnet NSG Name'))]"
      ]
    },
    {
      "apiVersion": "2015-06-15",
      "name": "[variables('Subnet NSG Name')]",
      "type": "Microsoft.Network/networkSecurityGroups",
      "location": "[resourceGroup().location]",
      "tags": {},
      "properties": {
        "securityRules": [
          {
            "name": "Allow-HTTP-From-Internet",
            "properties": {
              "protocol": "Tcp",
              "sourcePortRange": "*",
              "destinationPortRange": "80",
              "sourceAddressPrefix": "Internet",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 100,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-RDP-From-Everywhere",
            "properties": {
              "protocol": "Tcp",
              "sourcePortRange": "*",
              "destinationPortRange": "3389",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 150,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-Health-Monitoring",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "AzureLoadBalancer",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 200,
              "direction": "Inbound"
            }
          },
          {
            "name": "Disallow-everything-else-Inbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 300,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-to-VNet",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "VirtualNetwork",
              "access": "Allow",
              "priority": 100,
              "direction": "Outbound"
            }
          },
          {
            "name": "Disallow-everything-else-Outbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 200,
              "direction": "Outbound"
            }
          }
        ],
        "subnets": []
      }
    },
    {
      "apiVersion": "2015-06-15",
      "name": "[variables('VM NSG Name')]",
      "type": "Microsoft.Network/networkSecurityGroups",
      "location": "[resourceGroup().location]",
      "tags": {},
      "properties": {
        "securityRules": [
          {
            "name": "Allow-HTTP-From-Internet",
            "properties": {
              "protocol": "Tcp",
              "sourcePortRange": "*",
              "destinationPortRange": "80",
              "sourceAddressPrefix": "Internet",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 100,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-Health-Monitoring",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "AzureLoadBalancer",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 200,
              "direction": "Inbound"
            }
          },
          {
            "name": "Disallow-everything-else-Inbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 300,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-to-VNet",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "VirtualNetwork",
              "access": "Allow",
              "priority": 100,
              "direction": "Outbound"
            }
          },
          {
            "name": "Disallow-everything-else-Outbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 200,
              "direction": "Outbound"
            }
          }
        ],
        "subnets": []
      }
    },
    {
      "type": "Microsoft.Network/networkInterfaces",
      "name": "[variables('NIC Name')]",
      "apiVersion": "2016-03-30",
      "location": "[resourceGroup().location]",
      "properties": {
        "ipConfigurations": [
          {
            "name": "ipconfig",
            "properties": {
              "privateIPAllocationMethod": "Dynamic",
              "subnet": {
                "id": "[concat(resourceId('Microsoft.Network/virtualNetworks', variables('VNet Name')), '/subnets/default')]"
              },
              "loadBalancerBackendAddressPools": [
                {
                  "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/backendAddressPools/', variables('Address Pool Name'))]"
                }
              ],
              "loadBalancerInboundNatRules": [
                {
                  "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/inboundNatRules/', variables('RDP NAT Rule Name'))]"
                }
              ]
            }
          }
        ],
        "dnsSettings": {
          "dnsServers": []
        },
        "enableIPForwarding": false,
        "networkSecurityGroup": {
          "id": "[resourceId('Microsoft.Network/networkSecurityGroups', variables('VM NSG Name'))]"
        }
      },
      "resources": [],
      "dependsOn": [
        "[resourceId('Microsoft.Network/virtualNetworks', variables('VNet Name'))]",
        "[resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name'))]"
      ]
    },
    {
      "type": "Microsoft.Compute/virtualMachines",
      "name": "[variables('VM Name')]",
      "apiVersion": "2015-06-15",
      "location": "[resourceGroup().location]",
      "properties": {
        "hardwareProfile": {
          "vmSize": "[parameters('VM Size')]"
        },
        "storageProfile": {
          "imageReference": {
            "publisher": "MicrosoftWindowsServer",
            "offer": "WindowsServer",
            "sku": "2012-R2-Datacenter",
            "version": "latest"
          },
          "osDisk": {
            "name": "[variables('VM Name')]",
            "createOption": "FromImage",
            "vhd": {
              "uri": "[concat('https', '://', parameters('Disk Storage Account Name'), '.blob.core.windows.net', concat('/', variables('Vhds Container Name'),'/', variables('VM Name'), '-os-disk.vhd'))]"
            },
            "caching": "ReadWrite"
          },
          "dataDisks": []
        },
        "osProfile": {
          "computerName": "[variables('VM Name')]",
          "adminUsername": "[parameters('VM Admin User Name')]",
          "windowsConfiguration": {
            "provisionVMAgent": true,
            "enableAutomaticUpdates": true
          },
          "secrets": [],
          "adminPassword": "[parameters('VM Admin Password')]"
        },
        "networkProfile": {
          "networkInterfaces": [
            {
              "id": "[resourceId('Microsoft.Network/networkInterfaces', concat(variables('NIC Name')))]"
            }
          ]
        }
      },
      "resources": [],
      "dependsOn": [
        "[resourceId('Microsoft.Storage/storageAccounts', parameters('Disk Storage Account Name'))]",
        "[resourceId('Microsoft.Network/networkInterfaces', variables('NIC Name'))]"
      ]
    },
    {
      "type": "Microsoft.Storage/storageAccounts",
      "name": "[parameters('Disk Storage Account Name')]",
      "sku": {
        "name": "Premium_LRS",
        "tier": "Premium"
      },
      "kind": "Storage",
      "apiVersion": "2016-01-01",
      "location": "[resourceGroup().location]",
      "properties": {},
      "resources": [],
      "dependsOn": []
    },
    {
      "type": "Microsoft.Storage/storageAccounts",
      "name": "[parameters('Log Storage Account Name')]",
      "sku": {
        "name": "Standard_LRS",
        "tier": "standard"
      },
      "kind": "Storage",
      "apiVersion": "2016-01-01",
      "location": "[resourceGroup().location]",
      "properties": {},
      "resources": [],
      "dependsOn": []
    }
  ]
}

The sample has one VM sitting in a subnet protected by an NSG.  The VM’s NIC is also protected by an NSG, to make our life complicated (as we do too often).  The VM is exposed on a load balanced public IP, and RDP is enabled via a NAT rule on the Load Balancer.

The VM is running on a Premium Storage account but the sample also creates a standard storage account to store the logs.

The Problem

The problem we are going to track down using the Diagnostic Logs is this:  the subnet’s NSG lets RDP in via the “Allow-RDP-From-Everywhere” rule, while the NIC’s NSG doesn’t, so that traffic gets blocked, like everything else, by the “Disallow-everything-else-Inbound” rule.

In practice, you’ll likely have something more complicated going on, maybe some IP filtering, etc.  But the principles remain.

Enabling Diagnostic Logs

I couldn’t enable the Diagnostic Logs via the ARM template, as it isn’t possible to do so yet.  We can do it via the Portal or PowerShell.

I’ll illustrate the Portal here ; since this is for troubleshooting, chances are you won’t automate it.

I’ve covered Azure Monitor in a previous article.  We’ve seen that different providers expose different schemas.

NSGs expose two categories of Diagnostic Logs:  Event and Rule Counter.  We’re going to use Rule Counter only.

Rule Counter will give us a count of how many times a given rule was triggered for a given target (MAC address / IP).  Again, if we have lots of traffic flying around, that won’t be super useful.  This is why I recommend isolating the network (or recreating an isolated one) in order to troubleshoot.

We’ll start with the subnet NSG.


Scrolling all the way down in the left menu of the NSG’s pane, we select Diagnostics logs.


The pane shows that no diagnostics are enabled yet.  Let’s click on Turn on diagnostics.


We then turn it on.


For simplicity here, we’re going to use the Archive to a storage account.


We will configure the storage account to send the logs to.


For that, we select the standard account created by the template, or whichever storage account you fancy.  Log Diagnostics will go and create a blob container for each category in the selected account.  The names are predefined (you can’t choose them).

We select the NetworkSecurityGroupRuleCounter category.


And finally we hit the save button on the pane.

We’ll do the same thing with the VM NSG.

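For reference, the same configuration can be scripted.  Here is a sketch using the names from the sample template and the Set-AzureRmDiagnosticSetting cmdlet (AzureRM.Insights module) ; adjust the names to your own deployment:

# Target the subnet NSG and the standard storage account from the sample
$nsg = Get-AzureRmNetworkSecurityGroup -ResourceGroupName "NSG" -Name "subnetNSG"
$storage = Get-AzureRmStorageAccount -ResourceGroupName "NSG" -Name "<your prefix>logstandard"

# Ship the rule counter logs to the standard storage account
Set-AzureRmDiagnosticSetting `
    -ResourceId $nsg.Id `
    -StorageAccountId $storage.Id `
    -Categories "NetworkSecurityGroupRuleCounter" `
    -Enabled $true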

Creating logs

Now we are going to try to get through to our VM.  I’ll describe how to do that with the sample I gave, but if you are troubleshooting something, just try the faulty connection.

We’re going to try to RDP to the public IP.  First we need the public IP domain name.  So in the resource group:


At the top of the pane we’ll find the DNS name that we can copy.


We can then paste it in an RDP window.


Trying to connect should fail and it should leave traces in the logs for us to analyse.

Analysis

We’ll have to wait 5-10 minutes for the logs to get in the storage account as this is done asynchronously.

Actually, a way to make sure to get clean logs is to delete the blob container and then try the RDP connection.  The blob container should reappear after 5-10 minutes.

To get the logs in the storage account we need some tool.  I use Microsoft Azure Storage Explorer.


The blob container is called insights-logs-networksecuritygrouprulecounter.

The logs are hidden inside a complicated hierarchy allowing us to send all our diagnostic logs from all our NSGs over time there.

Basically, if we navigate down resourceId%3D / SUBSCRIPTIONS / <Your subscription ID> / RESOURCEGROUPS / NSG / PROVIDERS / MICROSOFT.NETWORK / NETWORKSECURITYGROUPS /, we’ll see two folders:  SUBNETNSG & VMNSG.  Those are our two NSGs.

If we dig under those two folders, we should find one file (or more if you’ve waited for a while).

Let’s copy those files, with appropriate naming, somewhere to analyse them.

Preferably, use a viewer / editor that understands JSON (I use Visual Studio).  If you use notepad…  you’re going to have fun.
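If you prefer scripting the retrieval, here is a minimal sketch using the classic storage cmdlets ; the account name, key & destination folder are placeholders (and the destination folder must exist):

# Build a context on the log storage account
$ctx = New-AzureStorageContext `
    -StorageAccountName "<your prefix>logstandard" `
    -StorageAccountKey "<storage key>"

# Download every rule counter log blob locally
Get-AzureStorageBlob `
    -Container "insights-logs-networksecuritygrouprulecounter" `
    -Context $ctx |
    ForEach-Object {
        Get-AzureStorageBlobContent `
            -Blob $_.Name `
            -Container "insights-logs-networksecuritygrouprulecounter" `
            -Destination "C:\temp\nsg-logs\" `
            -Context $ctx
    }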

If we look at the subnet NSG logs first and search for “RDP”, we’ll find this entry:

    {
      "time": "2017-01-09T11:46:44.9090000Z",
      "systemId": "...",
      "category": "NetworkSecurityGroupRuleCounter",
      "resourceId": ".../RESOURCEGROUPS/NSG/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/SUBNETNSG",
      "operationName": "NetworkSecurityGroupCounters",
      "properties": {
        "vnetResourceGuid": "{50C7B76A-4B8F-481A-8029-73569E5C7D87}",
        "subnetPrefix": "10.0.1.0/24",
        "macAddress": "00-0D-3A-00-B6-B5",
        "primaryIPv4Address": "10.0.1.4",
        "ruleName": "UserRule_Allow-RDP-From-Everywhere",
        "direction": "In",
        "type": "allow",
        "matchedConnections": 0
      }
    },

The most interesting part is the matchedConnections property, which is zero because no connection was established.

If we look in the VM logs, we’ll find this:

    {
      "time": "2017-01-09T11:46:44.9110000Z",
      "systemId": "...",
      "category": "NetworkSecurityGroupRuleCounter",
      "resourceId": ".../RESOURCEGROUPS/NSG/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/VMNSG",
      "operationName": "NetworkSecurityGroupCounters",
      "properties": {
        "vnetResourceGuid": "{50C7B76A-4B8F-481A-8029-73569E5C7D87}",
        "subnetPrefix": "10.0.1.0/24",
        "macAddress": "00-0D-3A-00-B6-B5",
        "primaryIPv4Address": "10.0.1.4",
        "ruleName": "UserRule_Disallow-everything-else-Inbound",
        "direction": "In",
        "type": "block",
        "matchedConnections": 2
      }
    },

Where matchedConnections is 2 (because I tried twice).

So the logs tell us where the traffic went.

From here we could wonder why it hit that rule, look for a higher priority rule that allows RDP in, find none, and conclude that’s our problem.

Trial & Error

If the logs are not helping you, the last resort is to modify the NSG until you understand what is going on.

A way to do this is to create an “allow everything in from anywhere” rule and give it the maximum priority (i.e. the lowest priority number).

If traffic still doesn’t go in, you have another problem than NSG, so go back to previous steps.

If traffic goes in, good.  Move that allow-everything rule down until you find which rule is blocking you.  You may have a lot of rules, in which case I would recommend a dichotomic (binary) search:  put your allow-everything rule in the middle of your “real rules” ; if traffic passes, move the rule to the middle of the bottom half, otherwise to the middle of the top half, and so on.  This way, you’ll only need log(N) steps, where N is your number of rules.
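As a sketch of that technique against the sample’s subnet NSG (the rule name & priority are mine ; priorities range from 100 to 4096, a lower number means a higher priority, and each must be unique):

# Insert a temporary allow-everything rule near the top
# (110 sits above the sample's Deny rule at priority 300)
Get-AzureRmNetworkSecurityGroup -ResourceGroupName "NSG" -Name "subnetNSG" |
    Add-AzureRmNetworkSecurityRuleConfig `
        -Name "Temp-Allow-All-Inbound" `
        -Direction Inbound `
        -Access Allow `
        -Protocol "*" `
        -SourceAddressPrefix "*" -SourcePortRange "*" `
        -DestinationAddressPrefix "*" -DestinationPortRange "*" `
        -Priority 110 |
    Set-AzureRmNetworkSecurityGroup

# Once done probing, remove the temporary rule
Get-AzureRmNetworkSecurityGroup -ResourceGroupName "NSG" -Name "subnetNSG" |
    Remove-AzureRmNetworkSecurityRuleConfig -Name "Temp-Allow-All-Inbound" |
    Set-AzureRmNetworkSecurityGroup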

Summary

Troubleshooting NSGs can be difficult but here I highlighted a basic methodology to find your way around.

Diagnostic Logs help give us insight into what is really going on, although they can be tricky to work with.

In general, as with every debugging experience, just apply the principle of Sherlock Holmes:

Eliminate the impossible.  Whatever remains, however improbable, must be the truth.

In terms of debugging, that means removing all the noise, all the fat and then some meat, until what remains is so simple that the truth will hit you.

Moving existing workloads to Azure


Applications born in the cloud can take full advantage of the cloud and the agility it brings.

But there are a lot of existing solutions out there that weren’t born in the cloud.

In this article I want to sketch a very high level approach on how to proceed about taking an existing on premise solution and move it to Azure.

Let’s first talk about pure Lift & Shift.  Lift & Shift refers to the approach of taking on premise workloads and deploying them as-is in Azure.

Despite its popularity, it receives a fair bit of bad press, because performing a lift & shift doesn’t give you most of the advantages of the cloud, mainly the agility.

I agree with that assessment, since a lift & shift basically brings you to the cloud with a pre-cloud paradigm.  That being said, I wouldn’t discard the approach wholesale.

For many organizations, it is one of many paths to get to the cloud.  Do you move to the cloud and then modernize, or modernize in order to move to the cloud?  It’s really up to you, and each organization has different constraints.

It often makes sense, especially for dev & test workloads.  Dev + Test workloads usually:

  • Do not run 24 / 7
  • Do not have High Availability requirements
  • Do not have sensitive data ; unless you bring back your production data, without trimming the sensitive parts, for your devs to fiddle with, in which case sensitive data probably isn’t a concern to you anyway

The first point means potentially huge savings.  Azure tends to be cheaper than on premise solutions, but if you only run workloads part time, it definitely is cheaper.

The last two points make Dev + Test workloads easier to move, i.e. there is less friction along the way.

Where I would be cautious is to make sure you do not need to do a lot of costly transformations just to do a lift & shift ; if that’s the case, I would consider modernizing first, otherwise there won’t be any budget left in the bucket for the modernization later.

Address blockers

Will it run on Azure?  Most x86 workloads that run on a VM will run in Azure, but not all.  Typically this boils down to unsupported network protocols and shared disks.  Azure supports most IP protocols, except Generic Routing Encapsulation (GRE), IP in IP & multicast ; User Datagram Protocol (UDP) is supported, but not with multicast.  Shared disks are not supported in Azure:  every disk belongs to one and only one VM.  A shared drive can be mounted via Azure File Storage, but applications requiring a disk accessible by multiple VMs aren’t supported.  This often is the case with quorum disk-based HA solutions, e.g. Oracle RAC.

If you hit one of those walls, the question you have to ask yourself is:  is there any mitigation?  The answer will vary greatly depending on your solution and the blockers you face.

Does it provide suitable High Availability (HA) feature support?  A lot of on premise solutions rely on hardware for high availability, while cloud-based solutions rely on software, typically by having a cluster of identical workloads fronted by a load balancer.  In Azure, this is less of a blocker than it used to be, thanks to the new single VM SLA, which isn’t a full fledged HA solution but at least provides an SLA.

Will it be supported in Azure?  You can run it, but will you get support if you have problems?  This goes for both Microsoft support and other vendors’ support.  Some vendors won’t support you in the cloud, although the list of such vendors is shrinking every day.  A good example is Windows Server 2003 in Azure:  it isn’t supported out-of-the-box, although it will work.  You do need a Custom Support Agreement (CSA) with Microsoft, since Windows Server 2003 is no longer a supported product.

If not, does it matter, and / or will the ISV work with you?  If you aren’t supported, it isn’t always the end of the road.  It might not matter for dev-test workloads.  Also, most ISVs are typically willing to work with you to make it possible.

Does it have a license that allows running in Azure?  Don’t forget the licenses!  Some vendors have funky licensing schemes for solutions running in the cloud.  One question I get all the time is about Oracle, so here is the answer:  yes, Oracle can be licensed under Azure, and no, you don’t have to pay for all the cores of the physical server you’re running on ; read about it here.

Address limitations


The time window to transition should drive your strategy.  This might sound obvious, but often people do not know where to start, so start with your destination:  when do you want to be done?

Authentication mechanism (Internal vs External).  Do you need to bring your Domain Controllers over or can you use Azure AD?  Do you have another authentication mechanism that isn’t easy to migrate?

VM Requirements:  cores, RAM, disks, IOPS, bandwidth.  Basically, do a sizing assessment and make sure you won’t face limitations in Azure (see the sketch after this list of considerations for a quick way to inventory VM sizes).  The number of VM SKUs has grown significantly in the last year, but I still see on premise workloads with “odd configurations” which are hard to migrate to Azure economically.  For instance, a VM with 64 GB of RAM and only one core will need to be migrated to a VM with many cores, and the price might not be compelling.  Disks are limited to 1 TB in Azure (as of this writing, December 2016), but you can stripe many disks to create a bigger volume.  That being said, different VM SKUs support different numbers of disks.

Latency requirements (e.g. web-data tier).  Basically, no:  if you put your front end in East US and the back-end in South India, latency won’t be great.  More generally, if you have a low latency requirement, make sure you can attain it with the right solution in Azure.

Solution SLA.  Azure offers great SLAs, but if you have a very aggressive SLA, in the 4-5 nines, you’ll need to push the envelope in Azure, which will affect the cost.

Recovery Time Objective (RTO) & Recovery Point Objective (RPO).  Again, this will influence your solution which will influence the cost.

Backup strategy / DR.  Similar to the previous point:  make sure you architect your solution accordingly.

Compliance standards.  Different services have different compliances.  See this for details.

Basically, for most of those points, the idea is to consider the point and architect the solution to address it.  This will alter the cost.  For instance, if you put 2 instances instead of 1, you’re going to pay for twice the compute.
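For the sizing assessment mentioned above, a quick way to inventory what a region offers is a sketch like this (the region is an example):

# List the VM sizes available in a region, with cores, memory & data disk limits
Get-AzureRmVMSize -Location "eastus" |
    Sort-Object NumberOfCores |
    Format-Table Name, NumberOfCores, MemoryInMB, MaxDataDiskCount -AutoSize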

Make it great


We have our solution.  We worked through the blockers & limitations, now let’s take it to the next level.

Storage:  check out Microsoft Azure Storage Performance and Scalability Checklist.

Scalability:  consult the best practices on scalability.

Availability:  make sure you’ve been through the availability checklist & the high availability checklist.

Express Route:  define your connectivity strategy & consider the Express Route prerequisites.

Guidance:  in general, consult the Patterns & Practices guidance.

Get it going

As with every initiative involving change, the temptation is to do a heavy analysis before migrating a single app.  People want to get the networking right, the backup strategy, the DR, etc.  This is how they do it on premise when creating a data center, so this is how they want to do it in Azure.

For many reasons, this approach isn’t optimal in Azure:

  • The constraints aren’t the same in Azure
  • People often have little knowledge of Azure or the cloud in general and therefore spin their wheels for quite a while looking for issues, while being blind to the issues that will actually cause them problems (the usual unknown-unknowns problem)
  • The main advantage of the cloud is agility:  a long up-front analysis is hardly the straightest line between you and agility

This is why I always give the same advice:  start now, start small, start on something low-risk.  If you migrate 30 solutions and then realize you bust a limit of your Virtual Network and have to rebuild it over a week-end, that’s expensive.  But if you migrate one solution, experiment, and realize that the way you laid out the network won’t scale to 30, you tear it down and rebuild it:  that is much cheaper.

I’m not advocating migrating all your environments freestyle, in a cowboy manner ; quite the opposite:  experiment with something real and low-risk and build from there.  You will learn from the experiment and move forward, instead of experimenting in a vacuum.  As you migrate more and more workloads, you’ll gain experience and expertise.  You’ll probably start with dev-test, and in time you’ll feel confident enough to move production workloads.

Look at your application portfolio and try to pick a few solutions with few dependencies, so you can move them without carrying the entire portfolio with them.

The diagram I’ve put here might look a bit simplistic.  To get there, you’ll probably have to do a few transformations.  For instance, you might want to consider replicating your domain controllers to replicas in Azure, to break that dependency.  There might be a system everything depends on in a light way ; could your sample solutions access it through a VPN connection?

Summary

I tried to summarize the general guidelines we give to customers considering a migration.

This is no X-step plan, but a bunch of considerations to remove risk from the endeavor.

Cloud brings agility and agility should be your end goal.  The recipe for agility is simple:  small bites, quick turn around, feedback, repeat.  This should be your guideline.

Primer on Azure Monitor


Azure Monitor is the latest evolution of a set of technologies allowing the monitoring of Azure resources.

I’ve written about going the extra mile to be able to analyze logs in the past.

The thing is that once our stuff is in production with tons of users hitting it, it might very well start behaving in unpredictable ways.  If we do not have a monitoring strategy, we’re going to be blind to problems and only see unrelated symptoms.

Azure Monitor is a great set of tools.  It doesn’t try to be the be-all and end-all solution.  On the contrary, although it offers analytics out of the box, it lets us export the logs wherever we want to go further.

I found the documentation of Azure Monitor (as of November 2016) a tad confusing, so I thought I would give a summary overview here.  Hopefully it will get you started.

Three types of sources

First thing we come across in Azure Monitor’s literature is the three types of sources:  Activity Logs, Diagnostic Logs & Metrics.

There is a bit of confusion between Diagnostic Logs & Metrics, with some references hinting that metrics are generated by Azure while diagnostics are generated by the resource itself.  That is confusing & beside the point.  Let’s review those sources here.

Activity logs capture all operations performed on Azure resources.  They used to be called Audit Logs & Operational Logs.  Those come directly from the Azure APIs:  any operation done on an Azure API (except HTTP GET operations) traces an activity log.  Activity logs are in JSON and contain the following information:  action, caller, status & time stamp.  We’ll want to keep track of those to understand changes done in our Azure environments.

Metrics are emitted by most Azure resources.  They are akin to performance counters:  something that has a value (e.g. % CPU, IOPS, # of messages in a queue, etc.) over time ; hence Azure Monitor, in the portal, allows us to plot them against time.  Metrics typically come in JSON and tend to be emitted at regular intervals (e.g. every minute) ; see this article for available metrics.  We’ll want to check those to make sure our resources operate within expected bounds.

Diagnostic logs are emitted by a resource and provide detailed data about the operation of that particular resource.  Their content is specific to the resource:  each resource type has different logs.  The format also varies (e.g. JSON, CSV, etc.) ; see this article for the different schemas.  They also tend to be much more voluminous for an active resource.

That’s it.  That’s all there is to it.  Avoid the confusion and re-read the last three paragraphs.  It’s a time saver.  Promised.

We’ll discuss the export mechanisms & alerts below, but for now, here’s a summary of the capabilities (as of November 2016) of each source:

Source            Export to                                    Supports Alerts
Activity Logs     Storage Account & Event Hub                  Yes
Metrics           Storage Account, Event Hub & Log Analytics   Yes
Diagnostic Logs   Storage Account, Event Hub & Log Analytics   No

Activity Log example

We can see the activity log of our favorite subscription by opening the Monitor blade, which should be on the left hand side of the portal (https://portal.azure.com).


If you do not find it there, hit More Services and search for Monitor.

Selecting the Activity Logs, we should have a search form and some results.


ListKeys is a popular one.  Despite being conceptually a read operation, the List Keys action on a storage account is done through a POST in the Azure REST API, specifically so that it leaves an audit trail.

We can select one of those ListKeys and, in the tray below, select the JSON format:


{
  "relatedEvents": [],
  "authorization": {
    "action": "Microsoft.Storage/storageAccounts/listKeys/action",
    "condition": null,
    "role": null,
    "scope": "/subscriptions/<MY SUB GUID>/resourceGroups/securitydata/providers/Microsoft.Storage/storageAccounts/a92430canadaeast"
  },
  "caller": null,
  "category": {
    "localizedValue": "Administrative",
    "value": "Administrative"
  },
  "claims": {},
  "correlationId": "6c619af4-453e-4b24-8a4c-508af47f2b26",
  "description": "",
  "eventChannels": 2,
  "eventDataId": "09d35196-1cae-4eca-903d-6e9b1fc71a78",
  "eventName": {
    "localizedValue": "End request",
    "value": "EndRequest"
  },
  "eventTimestamp": "2016-11-26T21:07:41.5355248Z",
  "httpRequest": {
    "clientIpAddress": "104.208.33.166",
    "clientRequestId": "ba51469e-9339-4329-b957-de5d3071d719",
    "method": "POST",
    "uri": null
  },

I truncated the JSON here.  Basically, it is an activity event with all the details.

Metrics example

Metrics can be accessed from the “global” Monitor blade or from any Azure resource’s monitor blade.

Here I look at the CPU usage of an Azure Data Warehouse resource (which hasn’t run for months, hence the flat line).


Diagnostic Logs example

For diagnostics, let’s create a storage account and activate diagnostics on it.  For this, under the Monitoring section, let’s select Diagnostics, make sure the status is On and then select Blob logs.


We’ll notice that all metrics were already selected.  We also notice that the retention is controlled there, in this case 7 days.
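The same setting can be scripted with the classic storage cmdlets ; a sketch, assuming a context built from the account name & key (the key is a placeholder):

# Build a context on the storage account
$ctx = New-AzureStorageContext `
    -StorageAccountName "monitorvpl" `
    -StorageAccountKey "<storage key>"

# Turn on blob service logging for all operations, with 7 days of retention
Set-AzureStorageServiceLoggingProperty `
    -ServiceType Blob `
    -LoggingOperations All `
    -RetentionDays 7 `
    -Context $ctx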

Let’s create a blob container, copy a file into it and try to access it via its URL.  Then let’s wait a few minutes for the diagnostics to be published.

We should see a special $logs container appear in the storage account.  This container contains log files, stored by date & time.  For instance, here are the first couple of lines of the first file:


1.0;2016-11-26T20:48:00.5433672Z;GetContainerACL;Success;200;3;3;authenticated;monitorvpl;monitorvpl;blob;"https://monitorvpl.blob.core.windows.net:443/$logs?restype=container&comp=acl";"/monitorvpl/$logs";295a75a6-0001-0021-7b26-48c117000000;0;184.161.153.48:51484;2015-12-11;537;0;217;62;0;;;""0x8D4163D73154695"";Saturday, 26-Nov-16 20:47:34 GMT;;"Microsoft Azure Storage Explorer, 0.8.5, win32, Azure-Storage/1.2.0 (NODE-VERSION v4.1.1; Windows_NT 10.0.14393)";;"9e78fc90-b419-11e6-a392-8b41713d952c"
1.0;2016-11-26T20:48:01.0383516Z;GetContainerACL;Success;200;3;3;authenticated;monitorvpl;monitorvpl;blob;"https://monitorvpl.blob.core.windows.net:443/$logs?restype=container&comp=acl";"/monitorvpl/$logs";06be52d9-0001-0093-7426-483a6d000000;0;184.161.153.48:51488;2015-12-11;537;0;217;62;0;;;""0x8D4163D73154695"";Saturday, 26-Nov-16 20:47:34 GMT;;"Microsoft Azure Storage Explorer, 0.8.5, win32, Azure-Storage/1.2.0 (NODE-VERSION v4.1.1; Windows_NT 10.0.14393)";;"9e9c6311-b419-11e6-a392-8b41713d952c"
1.0;2016-11-26T20:48:33.4973667Z;PutBlob;Success;201;6;6;authenticated;monitorvpl;monitorvpl;blob;"https://monitorvpl.blob.core.windows.net:443/sample/A.txt";"/monitorvpl/sample/A.txt";965cb819-0001-0000-2a26-48ac26000000;0;184.161.153.48:51622;2015-12-11;655;7;258;0;7;"Tj4nPz2/Vt7I1KEM2G8o4A==";"Tj4nPz2/Vt7I1KEM2G8o4A==";""0x8D4163D961A76BE"";Saturday, 26-Nov-16 20:48:33 GMT;;"Microsoft Azure Storage Explorer, 0.8.5, win32, Azure-Storage/1.2.0 (NODE-VERSION v4.1.1; Windows_NT 10.0.14393)";;"b2006050-b419-11e6-a392-8b41713d952c"

Storage Account diagnostics log in semicolon-delimited values (a variant of CSV), which isn’t trivial to read the way I pasted it here.  But basically we can see the logs contain details:  each operation done on the blobs is logged, with lots of details.

Querying

As seen in the examples, Azure Monitor allows us to query the logs.  This can be done in the portal but also using Azure Monitor REST API, cross platform Command-Line Interface (CLI) commands, PowerShell cmdlets or the .NET SDK.
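For instance, here is a hedged PowerShell sketch pulling the last day of activity log entries for a resource group (the group name comes from the ListKeys example above):

# Fetch the last 24 hours of activity log entries for a resource group
Get-AzureRmLog `
    -ResourceGroup "securitydata" `
    -StartTime (Get-Date).AddDays(-1) |
    Select-Object EventTimestamp, Caller, OperationName, Status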

Export

We can export the sources to a Storage Account and specify a retention period in days.  We can also export them to Azure Event Hubs & Azure Log Analytics.  As specified in the table above, Activity logs can’t be sent to Log Analytics.  Also, Activity logs can be analyzed using Power BI.

There are a few reasons why we would export the logs:

  • Archiving scenario:  Azure Monitor keeps content for 30 days.  If we need longer retention, we need to archive it ourselves.  We can do that by exporting the content to a storage account ; this also enables big data scenarios where we keep the logs for future data mining.
  • Analytics:  Log Analytics offers more capacity for analyzing content.  It also offers 30 days of retention by default but can be extended to one year.  Basically, this would upgrade us to Log Analytics.
  • Alternatively, we could export the logs to a storage account where they could be ingested by another SIEM (e.g. HP Arcsight).  See this article for details about SIEM integration.
  • Near real time analysis:  Azure Event Hubs allow us to send the content to many different places, but also we could analyze it on the fly using Azure Stream Analytics.

Alerts (Notifications)

Both Activity Logs & Metrics can trigger alerts.  Currently (as of November 2016), only Metric alerts can be set in the portal ; Activity Log alerts must be set via PowerShell, CLI or the REST API.

Alerts are a powerful way to automatically react to our Azure resources’ behavior ; when certain conditions are met (e.g. for a metric, when a value exceeds a threshold for a given period of time), the alert can send an email to a specified list of email addresses, but it can also invoke a Web Hook.

Again, the ability to invoke a web hook opens up the platform.  We could, for instance, expose an Azure Automation runbook as a Web Hook ; an alert could then trigger whatever a runbook is able to do.
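To give an idea, here is a sketch of a metric alert with both an email & a web hook action ; the resource ID, threshold & addresses are made up, and parameter names varied a bit across AzureRM.Insights versions:

# Actions to fire when the alert triggers
$actions = @(
    (New-AzureRmAlertRuleEmail -CustomEmail "ops@contoso.com"),
    (New-AzureRmAlertRuleWebhook -ServiceUri "https://contoso.com/alert-hook")
)

# Alert when average CPU exceeds 80% over a 5 minute window
Add-AzureRmMetricAlertRule `
    -Name "HighCpu" `
    -Location "eastus" `
    -ResourceGroup "demo-rg" `
    -TargetResourceId "/subscriptions/<sub id>/resourceGroups/demo-rg/providers/Microsoft.Compute/virtualMachines/MyVM" `
    -MetricName "Percentage CPU" `
    -Operator GreaterThan `
    -Threshold 80 `
    -WindowSize (New-TimeSpan -Minutes 5) `
    -TimeAggregationOperator Average `
    -Action $actions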

Security

There are two RBAC roles around monitoring:  Reader & Contributor.

There are also some security considerations around monitoring:

  • Use a dedicated storage account (or multiple dedicated storage accounts) for monitoring data.  Basically, avoid mixing monitoring and “other” data, so that people do not gain access to monitoring data inadvertently and, vice versa, so that people needing access to monitoring data do not gain access to “other” data (e.g. sensitive business data).
  • For the same reasons, use a dedicated namespace with Event Hubs
  • Limit access to monitoring data by using RBAC, e.g. by putting them in a separate resource group
  • Never grant ListKeys permission across a subscription as users could then gain access to reading monitoring data
  • If you need to give access to monitoring data, consider using a SAS token (for either Storage Account or Event Hubs)

Summary

Azure Monitor brings together a suite of tools to monitor our Azure resources.  It is an open platform in the sense it integrates easily with solutions that can complement it.

Single VM SLA

By now you’ve probably heard the news:  Azure became the first public cloud to offer an SLA on a single VM.

This was announced on Monday, November 21st.

In this article, I’ll quickly explore what that means.

Multi-VMs SLA

Before that announcement, in order to have an SLA on connectivity to compute, we needed to have 2 or more VMs in an Availability Set.

This was, and still is, the High Availability solution.  It gives an SLA of 99.95% availability, measured monthly.

There is no constraint on the storage used (Standard or Premium), and the SLA covers planned maintenance and any failures.  So basically, we put 2+ VMs in an availability set and we’re good all the time.

Single-VM SLA

The new SLA has a few constraints.

  • The SLA isn’t the same.  The single-VM SLA is only 99.9% (as opposed to 99.95%).
  • VMs must use Premium storage for both OS & data disks.  Presumably, Premium Storage has better reliability.  This is interesting, since in terms of SLA there is no distinction between Premium & Standard storage accounts.
  • The single VM SLA doesn’t include planned maintenance.  This is important.  It means we are covered with 99.9% availability as long as there is no planned maintenance.  More on that below.
  • The SLA is calculated on a monthly basis, as if the VM was up the entire month…  this means that if our VM is up the entire month, it has an SLA of 99.9%.  If, on the contrary, we turn it off 12 hours / day, we won’t have a 99.9% SLA on the 12 hours / day we are using it.
  • Since this was announced on November 21st, we can expect it to take effect 30 days later ; to be on the safe side, I tell customers January 1st, 2017.

So, it is quite important to state that this isn’t a simple extension of the existing SLA to a single VM.  But it is very useful nonetheless.

Planned maintenance

I just wanted to expand a bit on planned maintenance.

What is planned maintenance?  Once in a while, Azure needs to do maintenance that requires a shutdown of hosts.  Either the host itself gets updated (software / hardware) or it gets decommissioned altogether.  In those cases, the underlying VMs are shut down, the host is rebooted (or decommissioned, in which case the VMs get relocated) and then the VMs are booted back up.

This is a downtime for a VM.

With a Highly Available configuration, i.e. 2+ VMs in an Availability Set, the downtime of one VM doesn’t affect the availability of the availability set, since there is a guarantee that there will always be one VM available.

Without a Highly Available configuration, there is no such guarantee.  For that reason, I suppose, this downtime isn’t covered by the SLA.  Remember, 99.9% on a monthly basis means about 43 minutes of allowed downtime per month (a 30-day month has 30 × 24 × 60 = 43,200 minutes ; 0.1% of that is 43.2 minutes).  A planned maintenance could easily take a few minutes of downtime, taking into account the shutdown of all the VMs on the host, the host restart and the VM boot.  That isn’t negligible compared to the 43 minutes of margin the SLA gives.

This would leave very little margin of manoeuvre for potential hardware / software failures during the month.

Now, that isn’t the end of the world.  For quite a few months now, we have had the “redeploy now” feature in Azure.  This feature redeploys the VM to a new host.  If a planned maintenance is in progress in the data center, the new host should already be an updated one, in which case our VM won’t need a reboot anymore.

Planned maintenance follows a workflow where a notification is sent a week in advance to the subscription owner (see https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-windows-planned-maintenance & https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-linux-planned-maintenance).  We can then trigger a redeploy at our earliest convenience (maintenance window).

Alternatively, we can trigger a redeploy every week, during a maintenance window, and ignore the notification emails.
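The redeploy itself is a single PowerShell call (resource group & VM names are placeholders here), so the weekly approach is easy to automate, e.g. from an Azure Automation runbook:

# Redeploy the VM to a freshly updated host
Set-AzureRmVM `
    -ResourceGroupName "demo-rg" `
    -Name "MyVM" `
    -Redeploy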

High Availability

The previous section should have convinced you that the single-VM SLA isn’t a replacement for a Highly Available (HA) configuration.

On top of the Azure planned maintenance being outside the SLA, our own solution maintenance will impact the SLA of the solution.

In an HA configuration, we can take an instance down, update it, put it back, then upgrade the next one.

With a single VM we cannot do that, so a solution maintenance will incur downtime and should therefore be done inside a maintenance window (of the solution).

For those reasons, I still recommend that customers use an HA configuration if HA is a requirement.

Enabled scenarios

What the single VM SLA brings isn’t a cheaper HA configuration.  Instead, it enables non-HA configurations with an SLA in Azure.

Until now, there were two modes.  Either we took the HA route or we lived without an SLA.

Often No SLA is ok.  For dev & test scenarios for instance, SLA is rarely required.

Often HA is required.  For most production scenarios I deal with, in the enterprise space & consumer facing space anyway, HA is a requirement.

Sometimes, though, HA doesn’t make business sense and no SLA isn’t acceptable either.  HA might not make business sense when:

  • The solution doesn’t support HA ; this is sadly the case for a lot of legacy applications
  • The solution supports HA with a premium license, which itself doesn’t make business sense
  • Having HA increases the number of VMs to a point where the management of the solution would be cost prohibitive

For those scenarios, the single-VM SLA might hit the sweet spot.

Virtual Machine with 2 NICs

In Azure Resource Manager (ARM), Network Interface Cards (NICs) are a first class resource.  You can define them without a Virtual Machine.

UPDATE:  As a reader kindly pointed out, NIC means Network Interface Controller, not Network Interface Card as I initially wrote.  Don’t be fooled by the Azure logo 😉

Let’s take a step back and look at how the different Azure Lego blocks snap together to get a VM exposed on the web.  ARM decoupled a lot of infrastructure components, so each of them is much simpler (compared to the old ASM Cloud Service), but there are many of them.

Related Resources

Here’s a diagram that can help:


Let’s look at the different components:

  • Availability Set:  contains a set of (or only one) VMs ; see Azure basics: Availability sets for details
  • Storage Account:  VM hard drives are page blobs located in one or many storage accounts
  • NIC:  A VM has one or many NICs
  • Virtual Network:  a NIC is part of a subnet, where it gets its private IP address
  • Load Balancer:  a load balancer exposes the ports of a NIC (or a pool of NICs) through a public IP address

The important point for us here:  the NIC is the one that is part of a subnet, not the VM.  That means a VM can have multiple NICs in different subnets.

Also, something not shown on the diagram above, a Network Security Group (NSG) can be associated with each NIC of a VM.

One VM, many NICs

Not all VMs can have multiple NICs.  For instance, in the standard A series, the following SKUs can have only one NIC:  A0, A1, A2 & A5.

You can take a look at https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-windows-sizes/ to see how many NICs a given SKU support.

Why would you want to have multiple NICs?

Typically, this is a requirement for Network Appliances and for VMs passing traffic from one subnet to another.

Having multiple NICs enables more control, such as better traffic isolation.

Another requirement I’ve seen, typically with customer with high security requirements, is to isolate management traffic and transactional traffic.

For instance, let’s say you have a SQL VM with its port 1433 open to another VM (a web server).  That VM needs to open its RDP port for maintenance (i.e. for sys admins to log in and do maintenance).  But if both ports are opened on the same NIC, then a sys admin with RDP access could also reach port 1433.  For some customers, that’s unacceptable.

So the way around that is to have 2 NICs.  One NIC will be used for port 1433 (SQL) and the other for RDP (maintenance).  Then you can put each NIC in a different subnet.  The SQL NIC will be in a subnet with an NSG allowing the web server to access it, while the RDP NIC will be in a subnet accessible only from the VPN Gateway, by maintenance people.

Example

You will find here an ARM template (embedded in a Word document due to a limitation of the blog platform I’m using) deploying 2 VMs, each having 2 NICs:  a web NIC & a maintenance NIC.  The web NICs are in the web subnet and are publicly load balanced through a public IP, while the maintenance NICs are in a maintenance subnet and accessible only via private IPs.  The maintenance subnet lets RDP in, via its NSG.

The template will take a little while to deploy, thanks to the fact it contains a VM.  You can see most of the resources deployed quite fast though.

If you’ve done VMs with ARM before, it is pretty much the same thing, except with two NIC references in the VM.  The only thing to watch for is that you have to specify which NIC is primary.  You do this with the primary property:


"networkProfile": {
  "networkInterfaces": [
    {
      "id": "[resourceId('Microsoft.Network/networkInterfaces', concat(variables('Web NIC Prefix'), '-', copyIndex()))]",
      "properties": {
        "primary": true
      }
    },
    {
      "id": "[resourceId('Microsoft.Network/networkInterfaces', concat(variables('Maintenance NIC Prefix'), '-', copyIndex()))]",
      "properties": {
        "primary": false
      }
    }
  ]
}
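For comparison, the same primary / secondary wiring in PowerShell would look roughly like this sketch, where $vmConfig, $webNic & $maintenanceNic are assumed to come from New-AzureRmVMConfig & Get-AzureRmNetworkInterface:

# The first NIC is marked primary ; the second one isn't
$vmConfig = Add-AzureRmVMNetworkInterface -VM $vmConfig -Id $webNic.Id -Primary
$vmConfig = Add-AzureRmVMNetworkInterface -VM $vmConfig -Id $maintenanceNic.Id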

If you want to push the example and test it with a VPN gateway, consult https://azure.microsoft.com/en-us/documentation/articles/vpn-gateway-howto-point-to-site-rm-ps/ to do a point-to-site connection with your PC.

Conclusion

Somewhat of a special case for VMs, a VM with 2 NICs allows you to understand a lot of design choices in ARM:  for instance, why NICs are stand-alone resources, why they are the ones that are part of a subnet, and why NSGs are associated with them (not the VM).

To learn more, see https://azure.microsoft.com/en-us/documentation/articles/virtual-networks-multiple-nics/