Tag Archives: Networking

Azure Networking, VPN, Express Route, DNS Services, Traffic Manager & Application Gateway.

Azure Virtual Machines Anatomy

Virtual Machines can be pretty complex little beasts.  They can have multiple disks, multiple NICs in different subnets, can be exposed on the public internet either directly or through a load balancer, etc.

In this article, we’ll look at the anatomy of a Virtual Machine (VM):  what components it relates to.

We look at the Azure Resource Manager (ARM) version of Virtual Machines, as opposed to the Classic version.  In ARM, Virtual Machines have a very granular model.  Most components that relate to a VM are often assimilated to the VM itself when we conceptualize it (e.g. the NIC).

Internal Resource Model

Here is a component diagram.  It shows the different components, their relationships and the cardinality of those relationships.


Virtual Machine

Of course, the Virtual Machine is at the center of this diagram.  We look at the other resources in relationship to a Virtual Machine.

Availability Set

A Virtual Machine can optionally be part of an availability set.

An Availability Set is a reliability construct.  We discuss it here.

Disk

A Virtual Machine has at least one disk:  the Operating System (OS) disk.  It can optionally have more disks, also called data disks, as many as the Virtual Machine SKU allows.
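
For instance, here is how a new data disk could be attached to an existing VM with the Azure CLI.  This is a minimal sketch:  the resource group demo-rg, the VM name demo-vm, the disk name and the size are hypothetical, and exact flag names can vary across CLI versions.

# Create a new 128 GB managed data disk and attach it to the VM
az vm disk attach --resource-group demo-rg --vm-name demo-vm --name demo-vm-data-0 --new --size-gb 128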

Network Interface Controller (NIC)

The NIC is the networking bridge for the Virtual Machine.

A Virtual Machine has at least one NIC (and typical VMs have only one) but can have more.  Network Virtual Appliances (NVAs) are a typical case where multiple NICs are warranted.

We often say that a Virtual Machine is in a subnet / virtual network and we typically represent it that way in a diagram:  a VM box within a subnet box.  Strictly speaking though, it is the NIC that is part of a subnet.  This way a Virtual Machine with multiple NICs can be part of multiple subnets, although all those subnets must belong to the same Virtual Network.

A NIC can be load balanced (in either a private or public load balancer) or can also be exposed directly on a Public IP.

Subnet / Virtual Network

Azure Virtual Networks are the networking isolation construct in Azure.

A Virtual Network can have multiple subnets.

A NIC is part of a subnet and therefore has a private IP address from that subnet.  The private IP address can be either static (fixed) or dynamic.
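
As an illustration, the following Azure CLI sketch pins a NIC’s private IP address, effectively switching its allocation from dynamic to static (demo-rg, demo-nic, ipconfig1 and the address are hypothetical names for this example):

# Pin the current private IP so it no longer changes on deallocation
az network nic ip-config update --resource-group demo-rg --nic-name demo-nic --name ipconfig1 --private-ip-address 10.0.1.10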

Public Azure Load Balancer

On the diagram we distinguish between Public & Private Load Balancers, but they are the same Azure resource, simply used differently.

A Public Load Balancer is associated with a Public IP.  It is also associated to multiple NICs to which it forwards traffic.

Public IP

A public IP is exposed on the public internet.  The actual IP address can be either static or dynamic.

A public IP routes traffic to NICs either through a public load balancer or directly to a NIC (when the NIC exposes a public IP directly).

Private Azure Load Balancer

A private load balancer forwards traffic to multiple NICs like a public load balancer.

A private load balancer isn’t associated to a public IP though.  It has a private IP address instead and is therefore part of a subnet.

Cast in stone

We looked at VM components.  That gives us a static view of what a VM is.

Another interesting aspect is the dynamic nature of a VM.  What can change and what cannot?

For better or worse, we can’t change everything about a VM once it’s created.  So let’s mention the aspects we can’t change after a VM is created.

The primary NIC of a VM is permanent.  We can add, remove or change secondary NICs but the primary must stay there.

Similarly, the primary disk, or OS disk, can’t be changed after creation while secondary disks, or data disks, can be changed.

The availability set of a VM is set at creation time and can’t be changed afterwards.

Summary

We did a quick lap around the different resources associated to a Virtual Machine.

It is useful to keep that mental picture when we contemplate different scenarios.


Virtual Network Service Endpoint – Hello World

In our last post we discussed the new feature Virtual Network Service Endpoint.

In this post we’re going to show how to use that feature.

We’re going to use it on a storage account.

We won’t go through the micro steps of setting up each service but we’ll focus on the Service Endpoint configuration.

Resource Group

As usual for demo / feature trial, let’s create a Resource Group for this so we can wipe it out at the end.

Storage Account

Let’s create a storage account in the resource group we’ve just created.

Let’s create a blob container named test.  Let’s configure the blob container to have a public access level of Blob (i.e. anonymous read access for blobs only).

Let’s create a text file with the proverbial Hello World sentence so we can recognize it.  Let’s name that file A.txt and copy it into the blob container.
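
For reference, the resource group, storage account, container and blob above could be scripted roughly like this with the Azure CLI.  The resource group name and location are hypothetical, the storage account name vplsto matches the one used below, and authentication details are omitted for brevity.

# Resource group, storage account, public blob container and the Hello World blob
az group create --name vplsto-demo --location eastus
az storage account create --name vplsto --resource-group vplsto-demo --location eastus --sku Standard_LRS
az storage container create --name test --account-name vplsto --public-access blob
echo "Hello World" > A.txt
az storage blob upload --account-name vplsto --container-name test --name A.txt --file A.txt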

We should be able to access the file via its public URL.  For instance, given a storage account named vplsto we can find the URL by browsing the blobs.


Then selecting the container we can select the blob.


And there we should have access to the blob URL.

We should be able to open it in a browser.


Virtual Machine

Let’s create a Virtual Machine within the same resource group.

Here we’re going to use a Linux distribution in order to use the curl command line later on, but obviously something quite similar could be done with a Windows Server.

Once the deployment is done, let’s select the Virtual Network.


Let’s select the Subnet tab and then the subnet where we deployed the VM (in our case the subnet is named VMs).


At the bottom of the page, let’s select the Services drop down under Service Endpoints section.  Let’s pick Microsoft.Storage.


Let’s hit save.
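
For those who prefer scripting, a rough Azure CLI equivalent of this subnet configuration would look like this (the VNET name demo-vnet is hypothetical; the subnet VMs is the one from this walkthrough):

# Enable the Microsoft.Storage service endpoint on the VMs subnet
az network vnet subnet update --resource-group vplsto-demo --vnet-name demo-vnet --name VMs --service-endpoints Microsoft.Storage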

Separation of concerns

This is the Virtual Network configuration part we had to do.  Next we’ll need to tell the storage account to accept connections only from our subnet.

By design the configuration is split between two areas:  the Virtual Network and the PaaS Service (Storage in our case).

The aim of this design is to allow two individuals with two different permission sets to configure the services.  The network admin configures the Virtual Network while the DBA configures the database, the storage admin configures the storage account, etc.

Configuring Storage Account

In the Storage Account’s main screen, let’s select Firewalls and virtual networks.


From there, let’s select the Selected Networks radio button.

Then let’s click on Add existing virtual network and select the VNET & subnet where the VM was deployed.

Let’s leave the Exceptions section unchanged.


Let’s hit save.
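
Again, a rough Azure CLI equivalent of this storage-side configuration (same hypothetical names as before):  deny access by default, then add a network rule for our subnet.

# Deny public access by default, then allow only the VMs subnet
az storage account update --name vplsto --resource-group vplsto-demo --default-action Deny
az storage account network-rule add --account-name vplsto --resource-group vplsto-demo --vnet-name demo-vnet --subnet VMs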

If we refresh our web page pointing to the blob we should have an Authorization error page.


This is because our desktop computer isn’t on the VNET we configured.

Let’s SSH to the VM and try the following command line:

curl https://vplsto.blob.core.windows.net/test/A.txt

(replacing the URL by the blob URL we captured previously).

This should return us our Hello World.  This is because the VM is within the subnet we configured within the storage account.

Summary

We’ve done a simple implementation of Azure Virtual Network Service Endpoints.

It is worth noting that filtering is done at the subnet level.  It is therefore important to design our Virtual Networks with the right level of granularity for the subnets.

VNET Service Endpoints for Azure SQL & Storage

It’s finally here, it has arrived:  Azure Virtual Network Service Endpoints.

This was a long requested “Enterprise feature”.

Let’s look at what this is and how to use it.

Please note that at the time of this writing (end-of-September 2017) this feature is available only in a few regions in Public Preview:

  • Azure Storage: WestCentralUS, WestUS2, EastUS, WestUS, AustraliaEast, and AustraliaSouthEast
  • Azure SQL Database: WestCentralUS, WestUS2, and EastUS


The problem

The first (historically) Azure Services, e.g. Azure Storage & Azure SQL, were built with a public cloud philosophy:

  • They are accessible through public IPs
  • They are multi-tenant, e.g. public IPs are shared between many domain names
  • They live on shared infrastructures (e.g. VMs)
  • etc.

Many more recent services share many of those characteristics, for instance Data Factory, Event Hub, Cosmos DB, etc.

Those are all Platform as a Service (PaaS) services.

Then came the IaaS wave, offering more control and being less opinionated about how we should expose & manage cloud assets.  With it we could replicate, in large part, an on premise environment.  First came Virtual Machines, then Virtual Networks (akin to on premise VLANs), then Network Security Groups (akin to on premise firewall rules), then Network Virtual Appliances (literally a software version of an on premise firewall), etc.

Enterprises love IaaS as it allows them to migrate assets to the cloud more quickly, since they can more easily adapt their governance model.

But Enterprises, like all Cloud users, realize that the best TCO is in PaaS services.

This is where the two models collided.

After we spent all this effort stonewalling our VMs within a Virtual Network, implementing access rules, e.g. inbound PORT 80 connections can only come from on premise users through the VPN Gateway, we were going to access the Azure SQL Database through a public endpoint?

That didn’t go down easily.

Azure SQL DB specifically has an integrated firewall.  We can block all access, allow only specific IP ranges (again, good for connections over the internet) or allow “Azure connections”.  The last one looks more secure, as no one from a hotel room (or a bed in New Jersey) could access the database.  But anyone within Azure, anyone, could still access it.

The kind of default architecture was something like this:


This added a lot of friction to the adoption of PaaS services by Enterprise customers.

The solution until now

The previous diagram is a somewhat naïve deployment and we could do better.  A lot of production deployments are like this though.

We could do better by controlling access via incoming IP addresses.  Outbound connections from a VM come through a Public IP.  We could then filter on that IP within the PaaS service’s integrated firewall.

In order to do that, we needed a static public IP though.  Dynamic IPs preserve their domain name but they aren’t guaranteed to preserve their underlying IP value.
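
To make that pattern concrete, here is a sketch of it with the Azure CLI:  reserve a static public IP and whitelist it in the Azure SQL server firewall.  All the names and the IP value below are hypothetical.

# Reserve a static public IP for the VM, then allow only that IP on the SQL server
az network public-ip create --resource-group demo-rg --name demo-vm-ip --allocation-method Static
az sql server firewall-rule create --resource-group demo-rg --server demo-sqlserver --name AllowVmPublicIp --start-ip-address 40.76.20.10 --end-ip-address 40.76.20.10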


This solution had several disadvantages:

  • It requires a different paradigm (public IP filtering vs VNET / NSGs) to secure access
  • It requires static IPs
  • If the VMs were not meant to be exposed on the internet, it adds on configuration and triggers some security questions during reviews
  • A lot of deployments included “force tunneling” to the on premise firewall for internet access; since Azure SQL DB technically is on the internet, traffic was routed on premise, increasing latency substantially

And this is where we were at until this week when VNET Service Endpoints were announced at Microsoft Ignite.

The solution

The ideal solution would be to instantiate the PaaS service within a VNET.  For a lot of PaaS services, given their multi-tenant nature, it is impossible to do.

That is the approach taken by a lot of single-tenant PaaS services though, e.g. HDInsight, Application Gateway, Redis Cache, etc.

For multi-tenant PaaS where the communication is always outbound to the Azure service (i.e. the service doesn’t initiate a connection to our VMs), the solution going forward is VNET Service Endpoints.

At the time of this writing, only Azure Storage, Azure SQL DB & Azure SQL Data Warehouse support that mechanism.  Other PaaS services are planned to support it in the future.

VNET Service Endpoints do the next best thing to instantiating the PaaS service in our VNET.  They allow us to filter connections according to the VNET / subnet of the source.

This is made possible by a fundamental change in the Azure Network Stack.  VNETs now have identities that can be carried with a connection.

So we are back to where we wanted to be:


The solution isn’t perfect.  For instance, it doesn’t allow filtering connections coming from on premise computers via a VPN Gateway:  the connection needs to be initiated from the VNET itself for VNET Service Endpoints to work.

Also, access to PaaS resources is still done via the PaaS public IP, so the VNET must allow connections to the internet.  This is mitigated by new service tags allowing us to target specific PaaS services; for instance, we could allow traffic going only to Azure SQL DBs (although not only to our own Azure SQL DB instance).
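
As an illustration of those tags, an outbound NSG rule can be scoped to the Sql service tag with a recent Azure CLI (resource names below are hypothetical):

# Allow outbound traffic only to Azure SQL (any Azure SQL DB, not just ours)
az network nsg rule create --resource-group demo-rg --nsg-name demo-nsg --name Allow-AzureSql-Outbound --priority 100 --direction Outbound --access Allow --protocol Tcp --destination-address-prefixes Sql --destination-port-ranges 1433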

The connection does bypass “force tunneling” though and therefore the traffic stays on the Microsoft Network thus improving latency.

Summary

VNET Service Endpoints allow us to secure access to PaaS services such as Azure SQL DB, Azure SQL Data Warehouse & Azure Storage (and soon more).

It offers something close to bringing the PaaS services to our VNET.

Azure Application Gateway Anatomy

Back in May, we talked about Azure Application Gateway.

In this article, we’re going to look at its anatomy, i.e. its internal component as exposed in the Azure Resource Manager (ARM) model.

A lot of Azure resources have an internal structure.  For instance, a Virtual Network has a collection of subnets.

Azure Application Gateway has a very rich internal model.  We will look at this model in order to understand how to configure it.

What is Azure Application Gateway

From the official documentation:

Application Gateway is a layer-7 load balancer.  It provides failover, performance-routing HTTP requests between different servers, whether they are on the cloud or on-premises. Application Gateway provides many Application Delivery Controller (ADC) features including HTTP load balancing, cookie-based session affinity, Secure Sockets Layer (SSL) offload, custom health probes, support for multi-site, and many others.

I like to say that it is at once a Reverse Proxy, a Web Application Firewall (WAF) and a Layer-7 Load Balancer.

Internal Resource Model

Let’s start with a summary diagram.  Each box represents a sub resource (except Application Gateway, which represents the main resource) and each of the bullet points within a box represents a property of that sub resource.


We can now look at each sub resource.

Application Gateway

Key properties of the gateway itself are

  • The SKU, i.e. the tier (Small, Medium, Large) & the number of instances (1 to 10)
  • A list of SSL certificates (used by HTTP Listeners)

SSL Certificates are optional if the Gateway exposes only HTTP endpoints but are required for HTTPS endpoints.

The SKU can be anything, although in order to have an SLA, it must be of tier medium or large and have at least 2 instances.

Gateway IP Configuration

The Gateway IP Configuration has a 1:1 relationship with the Application Gateway (trying to register a second configuration results in an error) and can therefore be conceptually considered as properties of the Gateway directly.

It simply defines which subnet the Application Gateway lives in.

Frontend IP Configuration

This sub resource defines how the gateway is exposed.

There can be either one or two of those configurations:  public, private or both.

The same configuration can be used by more than one HTTP listener, using different ports.

Frontend Port

Frontend ports describe which ports on the Application Gateway are exposed.  It simply is a port number.

HTTP Listener

This is a key component.  It combines a frontend IP configuration and a port; it also includes a protocol (HTTP or HTTPS) and optionally an SSL certificate.

An HTTP listener is what the Application Gateway is listening to, e.g.:

  • Public IP X.Y.Z.W on port 443, HTTPS with a given SSL
  • Private IP 10.0.0.1 on port 80, HTTP

Backend Address Pool

The real compute handling requests.

Typically it’s going to be stand-alone VMs or VMs from a VM Scale Set (VMSS) but technically only the addresses are registered.  It could therefore be some public web site out there.

Backend HTTP Setting

This component describes how to connect to a backend compute:  port #, protocol (HTTP / HTTPS) & cookie-based affinity.

The frontend & backend can be different:  we can have HTTPS on 443 on the frontend while routing it to HTTP on port 12345 in the backend.  This is actually a typical SSL termination scenario.

Probe

A probe, actually a custom probe, probes a backend for health.  It is described by a protocol, a URL, an interval, a timeout, etc.  Basically, we can customize how a backend is determined to be healthy or not.

A custom probe is optional.  By default, a default probe is configured probing the backend using the port and protocol specified in the backend http setting.

Rule

A rule binds together an HTTP Listener, a backend address pool & a backend HTTP setting.  It basically binds an endpoint in the frontend with an endpoint in the backend.

There are two types of rules:  basic rules & path rules.  The former simply binds the aforementioned components together while the latter adds the concept of mapping a URL pattern to a given backend.

Summary

We covered the anatomy of Application Gateway and articulated how the different components relate to each other.

In future articles we will build on this anatomy in order to address specific scenarios.

URL Routing with Azure Application Gateway

Update (13-06-2017):  The POC of this article is available on GitHub here.

I have a scenario perfect for a Layer-7 Load Balancer / Reverse Proxy:

  • Multiple web server clusters to be routed under one URL hierarchy (one domain name)
  • Redirect HTTP traffic to the same URL on HTTPS
  • Have reverse proxy performing SSL termination (or SSL offloading), i.e. accepting HTTPS but routing to underlying servers using HTTP

On paper, Azure Application Gateway can do all of those.  Let’s find out in practice.

Azure Application Gateway Concepts

From the documentation:

Application Gateway is a layer-7 load balancer.  It provides failover, performance-routing HTTP requests between different servers, whether they are on the cloud or on-premises. Application Gateway provides many Application Delivery Controller (ADC) features including HTTP load balancing, cookie-based session affinity, Secure Sockets Layer (SSL) offload, custom health probes, support for multi-site, and many others.

Before we get into the meat of it, there are a bunch of concepts Application Gateway uses and we need to understand:

  • Back-end server pool: The list of IP addresses of the back-end servers. The IP addresses listed should either belong to the virtual network subnet or should be a public IP/VIP.
  • Back-end server pool settings: Every pool has settings like port, protocol, and cookie-based affinity. These settings are tied to a pool and are applied to all servers within the pool.
  • Front-end port: This port is the public port that is opened on the application gateway. Traffic hits this port, and then gets redirected to one of the back-end servers.
  • Listener: The listener has a front-end port, a protocol (Http or Https, these values are case-sensitive), and the SSL certificate name (if configuring SSL offload).
  • Rule: The rule binds the listener, the back-end server pool and defines which back-end server pool the traffic should be directed to when it hits a particular listener.

On top of those, we should probably add probes that are associated to a back-end pool to determine its health.

Proof of Concept

As a proof of concept, we’re going to implement the following:


We use Windows Virtual Machine Scale Sets (VMSS) for back-end servers.

In a production setup, we would go for exposing port 443 on the web, but for a POC, this should be sufficient.

As of this writing, there is no feature to allow automatic redirection from port 80 to port 443.  Usually, for a public web site, we want to redirect users to HTTPS.  This could be achieved by having one of the VM scale sets implement the redirection and routing HTTP traffic to it.

ARM Template

We’ve published the ARM template on GitHub.

First, let’s look at the visualization.


The template is split into 4 files:

  • azuredeploy.json, the master ARM template.  It simply references the others and passes parameters around.
  • network.json, responsible for the virtual network and Network Security Groups
  • app-gateway.json, responsible for the Azure Application Gateway and its public IP
  • vmss.json, responsible for a VM scale set, a public IP and a public load balancer; this template is invoked 3 times with 3 different sets of parameters to create the 3 VM scale sets

We’ve configured the VMSS to have public IPs.  It is quite typical to want to connect directly to a back-end server while testing.  We also optionally open the VMSS to RDP traffic; this is controlled by the ARM template’s parameter RDP Rule (Allow, Deny).

Template parameters

Here are the ARM template parameters:

  • Public DNS Prefix:  The DNS prefix for each VMSS public IP.  They are then suffixed by ‘a’, ‘b’ & ‘c’.
  • RDP Rule:  Switch allowing or denying RDP network traffic to reach the VMSS from public IPs.
  • Cookie Based Affinity:  Switch enabling / disabling cookie-based affinity on the Application Gateway.
  • VNET Name:  Name of the Virtual Network (defaults to VNet).
  • VNET IP Prefix:  Prefix of the IP range for the VNET (defaults to 10.0.0).
  • VM Admin Name:  Local administrator account on all the VMs in all VMSS (defaults to vmssadmin).
  • VM Admin Password:  Password for the VM Admin (same for all VMs of all VMSS).
  • Instance Count:  Number of VMs in each VMSS.
  • VM Size:  SKU of the VMs in the VMSS (defaults to Standard DS2-v2).

Routing

An important characteristic of URL-based routing is that requests are routed to back-end servers without alteration.

This is important.  It means that /a/ on the Application Gateway is mapped to /a/ on the Web Server.  It isn’t mapped to /, which might seem more intuitive since that would look like the root of the ‘a’ web servers.  This is because URL-based routing can be more general than just defining a suffix.
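
As a sketch of what this looks like with the Azure CLI (names are hypothetical and assume the pools and HTTP settings already exist), a URL path map routes /a/* to its own back-end pool while everything else falls through to a default pool:

# Map /a/* to pool-a; anything else goes to the default pool
az network application-gateway url-path-map create --resource-group demo-rg --gateway-name demo-appgw --name demo-path-map --rule-name path-a --paths "/a/*" --address-pool pool-a --http-settings settings-a --default-address-pool pool-default --default-http-settings settings-default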

Summary

This proof of concept gives a fully functional example of Azure Application Gateway using URL-based routing.

This is a great showcase for Application Gateway as it can then reverse proxy all traffic while keeping user affinity using cookies.

Joining an ARM Linux VM to AAD Domain Services

Active Directory is one of the most popular domain controller / LDAP server around.

In Azure we have Azure Active Directory (AAD).  Despite the name, AAD isn’t just a multi-tenant AD.  It is built for the cloud.

Sometimes though, it is useful to have a traditional domain controller…  in the cloud.  Typically this is with legacy workloads built to work with Active Directory.  But also, a very common scenario is to join an Azure VM to a domain so that users authenticate on it with the same accounts they authenticate to use the Azure Portal.

The underlying directory could even be synced with a corporate network, in which case users could log into the VMs using their corporate account.  I won’t cover this here but you can read about the AD Connect part in a previous article.

The straightforward option is to build an Active Directory cluster on Azure VMs.  This will work but requires the maintenance of those 2 VMs.


An easier option is AAD Domain Services (AADDS).  AADDS exposes an AAD tenant as a managed domain service.  It does it by provisioning some variant of Active Directory managed cluster, i.e. we do not see or care about the underlying VMs.

The cluster is synchronized one-way (from AAD to AADDS).  For this reason, AADDS is read-only through its LDAP interface, e.g. we can’t reset a password using LDAP.

The Azure documentation walks us through such an integration with classic (ASM) VMs.  Since ARM has been around for more than a year, I recommend always going with ARM VMs.  This article aims at showing how to do just that.

I’ll heavily leverage the existing documentation and detail only what differs from the official documentation.

Also, keep in mind this article is written in January 2017.  Azure AD will transition to ARM in the future and will likely render this article obsolete.

Dual Networks

The main challenge we face is that AAD is an ASM service and AAD Domain Services are exposed within an ASM Virtual Network (VNET), which is incompatible with our ARM VM.

Thankfully, we now have Virtual Network peering allowing us to peer an ASM and an ARM VNET together so they can act as if they were one.


Peering

As with all VNET peering, the two VNETs must have mutually exclusive IP address spaces.

I created two VNETs in the portal (https://portal.azure.com).  I recommend creating them in the new portal explicitly; this way, even the classic one will be part of the desired resource group.

The classic one has 10.0.0.0/24 address range while the ARM one has 10.1.0.0/24.

The peering can be done from the portal too.  In the Virtual Network pane (say the ARM one), select Peerings:


We should see an empty list here, so let’s click Add.

We need to give the peering a name.  Let’s type PeeringToDomainServices.

We then select Classic in the Peer details since we want to peer with a classic VNET.

Finally, we’ll click on Choose a Virtual Network.


From there we should see the classic VNET we created.

Configuring AADDS

The online documentation is quite good for this.

Just make sure you select the classic VNET we created.

You can give a domain name that is different from the AAD domain name (i.e. *.onmicrosoft.com).

Enabling AADDS takes up to 30 minutes.  Don’t hold your breath!

Joining a VM

We can create a Linux VM, put it in the ARM VNET we created, and join it to the AADDS domain now.

Again, the online documentation does a great job of walking us through the process.  The documentation is written for Red Hat.

When I tried it, I used a CentOS VM and ended up using slightly different commands, namely the realm command from the realmd package (ignoring the SAMBA part of the article).
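
For reference, on CentOS the join boiled down to something along these lines (a minimal sketch:  CONTOSO.COM and the admin account are hypothetical, and the account must belong to the AAD DC Administrators group):

# Install the packages realmd needs, then discover and join the managed domain
sudo yum install -y realmd sssd krb5-workstation
sudo realm discover CONTOSO.COM
sudo realm join --user=admin@CONTOSO.COM CONTOSO.COM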

Conclusion

It is fairly straightforward to enable Domain Services in AAD and well documented.

A challenge we currently have is to join or simply communicate from an ARM VM to AADDS.  For this we need two networks, a classic (ASM) one and an ARM one, and we need to peer them together.

Troubleshooting NSGs using Diagnostic Logs

I’ve written about how to use Network Security Groups (NSG) before.

Chances are, once you get a complicated enough set of rules in a NSG, you’ll find yourself with NSGs that do not do what you think they should do.

Troubleshooting NSGs isn’t trivial.

I’ll try to give some guidance here but to this date (January 2017), there is no tool where you can just say “please follow packet X and tell me against which wall it bumps”.  It’s more indirect than that.

 

Connectivity

First thing, make sure you can connect to your VNET.

If you are connecting to a VM via a public IP, make sure you have access to that IP (i.e. you’re not sitting behind an on premise firewall blocking the outgoing port you are trying to use), that the IP is connected to the VM either directly or via a Load Balancer.

If you are connecting to a VM via a private IP through a VPN Gateway of some sort, make sure you can connect and that your packets are routed to the gateway and from there they get routed to the proper subnet.

An easy way to make sure of that is to remove all NSGs and replace them with a “let everything in” rule.  Of course, that also opens your workloads to hackers, so I recommend doing that with a test VM that you destroy afterwards.

Diagnose

Then I would recommend going through the official Azure guidelines to troubleshoot NSGs.  They walk you through the different diagnosis tools.

Diagnostic Logs

If you reached this section and haven’t achieved greatness yet, well…  You need something else.

What we’re going to do here is use NSG Diagnostic Logs to understand a bit more what is going on.

By no means is this magic and especially in an environment already in use where a lot of traffic is occurring, it might be difficult to make sense of what the logs are going to give us.

Nevertheless, the logs give us a picture of what is really happening.  They are aggregated though, so we won’t see your PC’s IP address for instance.  The aggregation is probably what limits the logs’ effectiveness the most.

Sample configuration

I provide here a sample configuration I’m going to use to walk through the troubleshooting process.

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "VM Admin User Name": {
      "defaultValue": "myadmin",
      "type": "string"
    },
    "VM Admin Password": {
      "defaultValue": null,
      "type": "securestring"
    },
    "Disk Storage Account Name": {
      "defaultValue": "<your prefix>vmpremium",
      "type": "string"
    },
    "Log Storage Account Name": {
      "defaultValue": "<your prefix>logstandard",
      "type": "string"
    },
    "VM Size": {
      "defaultValue": "Standard_DS2",
      "type": "string",
      "allowedValues": [
        "Standard_DS1",
        "Standard_DS2",
        "Standard_DS3"
      ],
      "metadata": {
        "description": "SKU of the VM."
      }
    },
    "Public Domain Label": {
      "type": "string"
    }
  },
  "variables": {
    "Vhds Container Name": "vhds",
    "VNet Name": "MyVNet",
    "Ip Range": "10.0.1.0/24",
    "Public IP Name": "MyPublicIP",
    "Public LB Name": "PublicLB",
    "Address Pool Name": "addressPool",
    "Subnet NSG Name": "subnetNSG",
    "VM NSG Name": "vmNSG",
    "RDP NAT Rule Name": "RDP",
    "NIC Name": "MyNic",
    "VM Name": "MyVM"
  },
  "resources": [
    {
      "type": "Microsoft.Network/publicIPAddresses",
      "name": "[variables('Public IP Name')]",
      "apiVersion": "2015-06-15",
      "location": "[resourceGroup().location]",
      "tags": {
        "displayName": "Public IP"
      },
      "properties": {
        "publicIPAllocationMethod": "Dynamic",
        "idleTimeoutInMinutes": 4,
        "dnsSettings": {
          "domainNameLabel": "[parameters('Public Domain Label')]"
        }
      }
    },
    {
      "type": "Microsoft.Network/loadBalancers",
      "name": "[variables('Public LB Name')]",
      "apiVersion": "2015-06-15",
      "location": "[resourceGroup().location]",
      "tags": {
        "displayName": "Public Load Balancer"
      },
      "properties": {
        "frontendIPConfigurations": [
          {
            "name": "LoadBalancerFrontEnd",
            "comments": "Front end of LB:  the IP address",
            "properties": {
              "publicIPAddress": {
                "id": "[resourceId('Microsoft.Network/publicIPAddresses/', variables('Public IP Name'))]"
              }
            }
          }
        ],
        "backendAddressPools": [
          {
            "name": "[variables('Address Pool Name')]"
          }
        ],
        "loadBalancingRules": [
          {
            "name": "Http",
            "properties": {
              "frontendIPConfiguration": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/frontendIPConfigurations/LoadBalancerFrontEnd')]"
              },
              "frontendPort": 80,
              "backendPort": 80,
              "enableFloatingIP": false,
              "idleTimeoutInMinutes": 4,
              "protocol": "Tcp",
              "loadDistribution": "Default",
              "backendAddressPool": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/backendAddressPools/', variables('Address Pool Name'))]"
              },
              "probe": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/probes/TCP-Probe')]"
              }
            }
          }
        ],
        "probes": [
          {
            "name": "TCP-Probe",
            "properties": {
              "protocol": "Tcp",
              "port": 80,
              "intervalInSeconds": 5,
              "numberOfProbes": 2
            }
          }
        ],
        "inboundNatRules": [
          {
            "name": "[variables('RDP NAT Rule Name')]",
            "properties": {
              "frontendIPConfiguration": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/frontendIPConfigurations/LoadBalancerFrontEnd')]"
              },
              "frontendPort": 3389,
              "backendPort": 3389,
              "protocol": "Tcp"
            }
          }
        ],
        "outboundNatRules": [],
        "inboundNatPools": []
      },
      "dependsOn": [
        "[resourceId('Microsoft.Network/publicIPAddresses', variables('Public IP Name'))]"
      ]
    },
    {
      "type": "Microsoft.Network/virtualNetworks",
      "name": "[variables('VNet Name')]",
      "apiVersion": "2016-03-30",
      "location": "[resourceGroup().location]",
      "properties": {
        "addressSpace": {
          "addressPrefixes": [
            "10.0.0.0/16"
          ]
        },
        "subnets": [
          {
            "name": "default",
            "properties": {
              "addressPrefix": "[variables('Ip Range')]",
              "networkSecurityGroup": {
                "id": "[resourceId('Microsoft.Network/networkSecurityGroups', variables('Subnet NSG Name'))]"
              }
            }
          }
        ]
      },
      "resources": [],
      "dependsOn": [
        "[resourceId('Microsoft.Network/networkSecurityGroups', variables('Subnet NSG Name'))]"
      ]
    },
    {
      "apiVersion": "2015-06-15",
      "name": "[variables('Subnet NSG Name')]",
      "type": "Microsoft.Network/networkSecurityGroups",
      "location": "[resourceGroup().location]",
      "tags": {},
      "properties": {
        "securityRules": [
          {
            "name": "Allow-HTTP-From-Internet",
            "properties": {
              "protocol": "Tcp",
              "sourcePortRange": "*",
              "destinationPortRange": "80",
              "sourceAddressPrefix": "Internet",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 100,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-RDP-From-Everywhere",
            "properties": {
              "protocol": "Tcp",
              "sourcePortRange": "*",
              "destinationPortRange": "3389",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 150,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-Health-Monitoring",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "AzureLoadBalancer",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 200,
              "direction": "Inbound"
            }
          },
          {
            "name": "Disallow-everything-else-Inbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 300,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-to-VNet",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "VirtualNetwork",
              "access": "Allow",
              "priority": 100,
              "direction": "Outbound"
            }
          },
          {
            "name": "Disallow-everything-else-Outbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 200,
              "direction": "Outbound"
            }
          }
        ],
        "subnets": []
      }
    },
    {
      "apiVersion": "2015-06-15",
      "name": "[variables('VM NSG Name')]",
      "type": "Microsoft.Network/networkSecurityGroups",
      "location": "[resourceGroup().location]",
      "tags": {},
      "properties": {
        "securityRules": [
          {
            "name": "Allow-HTTP-From-Internet",
            "properties": {
              "protocol": "Tcp",
              "sourcePortRange": "*",
              "destinationPortRange": "80",
              "sourceAddressPrefix": "Internet",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 100,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-Health-Monitoring",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "AzureLoadBalancer",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 200,
              "direction": "Inbound"
            }
          },
          {
            "name": "Disallow-everything-else-Inbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 300,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-to-VNet",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "VirtualNetwork",
              "access": "Allow",
              "priority": 100,
              "direction": "Outbound"
            }
          },
          {
            "name": "Disallow-everything-else-Outbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 200,
              "direction": "Outbound"
            }
          }
        ],
        "subnets": []
      }
    },
    {
      "type": "Microsoft.Network/networkInterfaces",
      "name": "[variables('NIC Name')]",
      "apiVersion": "2016-03-30",
      "location": "[resourceGroup().location]",
      "properties": {
        "ipConfigurations": [
          {
            "name": "ipconfig",
            "properties": {
              "privateIPAllocationMethod": "Dynamic",
              "subnet": {
                "id": "[concat(resourceId('Microsoft.Network/virtualNetworks', variables('VNet Name')), '/subnets/default')]"
              },
              "loadBalancerBackendAddressPools": [
                {
                  "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/backendAddressPools/', variables('Address Pool Name'))]"
                }
              ],
              "loadBalancerInboundNatRules": [
                {
                  "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/inboundNatRules/', variables('RDP NAT Rule Name'))]"
                }
              ]
            }
          }
        ],
        "dnsSettings": {
          "dnsServers": []
        },
        "enableIPForwarding": false,
        "networkSecurityGroup": {
          "id": "[resourceId('Microsoft.Network/networkSecurityGroups', variables('VM NSG Name'))]"
        }
      },
      "resources": [],
      "dependsOn": [
        "[resourceId('Microsoft.Network/virtualNetworks', variables('VNet Name'))]",
        "[resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name'))]"
      ]
    },
    {
      "type": "Microsoft.Compute/virtualMachines",
      "name": "[variables('VM Name')]",
      "apiVersion": "2015-06-15",
      "location": "[resourceGroup().location]",
      "properties": {
        "hardwareProfile": {
          "vmSize": "[parameters('VM Size')]"
        },
        "storageProfile": {
          "imageReference": {
            "publisher": "MicrosoftWindowsServer",
            "offer": "WindowsServer",
            "sku": "2012-R2-Datacenter",
            "version": "latest"
          },
          "osDisk": {
            "name": "[variables('VM Name')]",
            "createOption": "FromImage",
            "vhd": {
              "uri": "[concat('https', '://', parameters('Disk Storage Account Name'), '.blob.core.windows.net', concat('/', variables('Vhds Container Name'),'/', variables('VM Name'), '-os-disk.vhd'))]"
            },
            "caching": "ReadWrite"
          },
          "dataDisks": []
        },
        "osProfile": {
          "computerName": "[variables('VM Name')]",
          "adminUsername": "[parameters('VM Admin User Name')]",
          "windowsConfiguration": {
            "provisionVMAgent": true,
            "enableAutomaticUpdates": true
          },
          "secrets": [],
          "adminPassword": "[parameters('VM Admin Password')]"
        },
        "networkProfile": {
          "networkInterfaces": [
            {
              "id": "[resourceId('Microsoft.Network/networkInterfaces', concat(variables('NIC Name')))]"
            }
          ]
        }
      },
      "resources": [],
      "dependsOn": [
        "[resourceId('Microsoft.Storage/storageAccounts', parameters('Disk Storage Account Name'))]",
        "[resourceId('Microsoft.Network/networkInterfaces', variables('NIC Name'))]"
      ]
    },
    {
      "type": "Microsoft.Storage/storageAccounts",
      "name": "[parameters('Disk Storage Account Name')]",
      "sku": {
        "name": "Premium_LRS",
        "tier": "Premium"
      },
      "kind": "Storage",
      "apiVersion": "2016-01-01",
      "location": "[resourceGroup().location]",
      "properties": {},
      "resources": [],
      "dependsOn": []
    },
    {
      "type": "Microsoft.Storage/storageAccounts",
      "name": "[parameters('Log Storage Account Name')]",
      "sku": {
        "name": "Standard_LRS",
        "tier": "standard"
      },
      "kind": "Storage",
      "apiVersion": "2016-01-01",
      "location": "[resourceGroup().location]",
      "properties": {},
      "resources": [],
      "dependsOn": []
    }
  ]
}

The sample has one VM sitting in a subnet protected by an NSG.  The VM’s NIC is also protected by an NSG, to make our life complicated (as we do too often).  The VM is exposed on a load balanced Public IP and RDP is enabled via a NAT rule on the Load Balancer.

The VM is running on a Premium Storage account but the sample also creates a standard storage account to store the logs.

The Problem

The problem we are going to find using Diagnostic Logs is that the subnet’s NSG lets RDP in via the “Allow-RDP-From-Everywhere” rule while the NIC’s NSG doesn’t, so that traffic gets blocked, like everything else, by the “Disallow-everything-else-Inbound” rule.

In practice, you’ll likely have something more complicated going on, maybe some IP filtering, etc.  But the principles remain.

Enabling Diagnostic Logs

I couldn’t enable the Diagnostic Logs via the ARM template as it isn’t possible to do so yet.  We can do that via the Portal or PowerShell.

I’ll illustrate the Portal here, since it’s for troubleshooting, chances are you won’t automate it.

I’ve covered Azure Monitor in a previous article.  We’ve seen that different providers expose different schemas.

NSGs expose two categories of Diagnostic Logs:  Event and Rule Counter.  We’re going to use Rule Counter only.

Rule Counter gives us a count of how many times a given rule was triggered for a given target (MAC address / IP).  Again, if we have lots of traffic flying around, that won’t be super useful.  This is why I recommend isolating the network (or recreating an isolated one) in order to troubleshoot.

We’ll start with the subnet NSG.


Scrolling all the way down in the NSG pane’s left menu, we select Diagnostics logs.


The pane should look as follows since no diagnostics are enabled.  Let’s click on Turn on diagnostics.


We then turn it on.


For simplicity here, we’re going to use the Archive to a storage account.


We will configure the storage account to send the logs to.


For that, we select the standard account created by the template, or whichever storage account you fancy.  Log Diagnostics will create a blob container for each category in the selected account.  The names are predefined (you can’t choose them).

We select the NetworkSecurityGroupRuleCounter category.


And finally we hit the save button on the pane.

We’ll do the same thing with the VM NSG.
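
For reference, the same diagnostic setting can also be scripted with a recent Azure CLI instead of the portal (the subscription ID and names below are placeholders matching this sample):

# Send the NSG rule counter logs to the standard storage account
az monitor diagnostic-settings create --name nsg-rule-counter --resource "/subscriptions/<subscription id>/resourceGroups/NSG/providers/Microsoft.Network/networkSecurityGroups/subnetNSG" --storage-account <your prefix>logstandard --logs '[{"category": "NetworkSecurityGroupRuleCounter", "enabled": true}]'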


Creating logs

Now we are going to try to get to our VM.  We are going to describe how to do that with the sample I gave, but if you are troubleshooting something, just try the faulty connection.

We’re going to try to RDP to the public IP.  First we need the public IP domain name.  So in the resource group:


At the top of the pane we’ll find the DNS name that we can copy.


We can then paste it in an RDP window.


Trying to connect should fail and it should leave traces in the logs for us to analyse.

Analysis

We’ll have to wait 5-10 minutes for the logs to get in the storage account as this is done asynchronously.

Actually, a way to make sure to get clean logs is to delete the blob container and then try the RDP connection.  The blob container should reappear after 5-10 minutes.

To get the logs in the storage account we need some tool.  I use Microsoft Azure Storage Explorer.


The blob container is called insights-logs-networksecuritygrouprulecounter.

The logs are hidden inside a complicated hierarchy allowing us to send all our diagnostic logs from all our NSGs over time there.

Basically, under resourceId%3D / SUBSCRIPTIONS / <Your subscription ID> / RESOURCEGROUPS / NSG / PROVIDERS / MICROSOFT.NETWORK / NETWORKSECURITYGROUPS / we’ll see two folders:  SUBNETNSG & VMNSG.  Those are our two NSGs.

If we dig under those two folders, we should find one file (or more if you’ve waited for a while).

Let’s copy those files, with appropriate naming, somewhere we can analyse them.

Preferably, use a viewer / editor that understands JSON (I use Visual Studio).  If you use notepad…  you’re going to have fun.
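
If you prefer the command line, a small jq filter can extract only the rules that actually matched connections.  This sketch assumes the usual diagnostic log layout where the entries sit under a top-level records array and the blob is named PT1H.json.

# Show only the rules that matched connections, with their timestamp and verdict
jq '.records[] | select(.properties.matchedConnections > 0) | {time, rule: .properties.ruleName, type: .properties.type, matched: .properties.matchedConnections}' PT1H.json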

If we look at the subnet NSG logs first and search for “RDP”, we’ll find this entry:

    {
      "time": "2017-01-09T11:46:44.9090000Z",
      "systemId": "...",
      "category": "NetworkSecurityGroupRuleCounter",
      "resourceId": ".../RESOURCEGROUPS/NSG/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/SUBNETNSG",
      "operationName": "NetworkSecurityGroupCounters",
      "properties": {
        "vnetResourceGuid": "{50C7B76A-4B8F-481A-8029-73569E5C7D87}",
        "subnetPrefix": "10.0.1.0/24",
        "macAddress": "00-0D-3A-00-B6-B5",
        "primaryIPv4Address": "10.0.1.4",
        "ruleName": "UserRule_Allow-RDP-From-Everywhere",
        "direction": "In",
        "type": "allow",
        "matchedConnections": 0
      }
    },

The most interesting part is the matchedConnections property, which is zero because we didn’t achieve connections.

If we look in the VM logs, we’ll find this:

    {
      "time": "2017-01-09T11:46:44.9110000Z",
      "systemId": "...",
      "category": "NetworkSecurityGroupRuleCounter",
      "resourceId": ".../RESOURCEGROUPS/NSG/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/VMNSG",
      "operationName": "NetworkSecurityGroupCounters",
      "properties": {
        "vnetResourceGuid": "{50C7B76A-4B8F-481A-8029-73569E5C7D87}",
        "subnetPrefix": "10.0.1.0/24",
        "macAddress": "00-0D-3A-00-B6-B5",
        "primaryIPv4Address": "10.0.1.4",
        "ruleName": "UserRule_Disallow-everything-else-Inbound",
        "direction": "In",
        "type": "block",
        "matchedConnections": 2
      }
    },

Where matchedConnections is 2 (because I tried twice).

So the logs tell us where the traffic went.

From here we could wonder why it hit that rule, look for a higher-priority rule that allows RDP in, find none and conclude that’s our problem.
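
The fix itself is then simple:  add an allow rule for RDP on the VM’s NSG with a higher priority than the deny-everything rule.  A rough sketch with a recent Azure CLI, using the resource group and NSG names from the sample:

# Let RDP through the VM-level NSG (priority 150 beats the deny rule at 300)
az network nsg rule create --resource-group NSG --nsg-name vmNSG --name Allow-RDP-From-Everywhere --priority 150 --direction Inbound --access Allow --protocol Tcp --destination-port-ranges 3389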

Trial & Error

If the logs are not helping you, the last resort is to modify the NSG until you understand what is going on.

A way to do this is to create an “allow everything in from anywhere” rule and give it maximum priority.

If traffic still doesn’t go in, you have a problem other than the NSG, so go back to the previous steps.

If traffic goes in, good.  Move that allow-everything rule down until you find which rule is blocking you.  You may have a lot of rules, in which case I would recommend a dichotomic search algorithm:  put your allow-everything rule in the middle of your “real rules”, if traffic passes, move the rule to the middle of the bottom half, otherwise, the middle of the top half, and so on.  This way, you’ll only need log(N) steps where N is your number of rules.

Summary

Troubleshooting NSGs can be difficult but here I highlighted a basic methodology to find your way around.

Diagnostic Logs help give us insight into what is really going on, although they can be tricky to work with.

In general, as with every debugging experience just apply the principle of Sherlock Holmes:

Eliminate the impossible.  Whatever remains, however improbable, must be the truth.

In terms of debugging, that means removing all the noise, all the fat and then some of the meat, until what remains is so simple that the truth will hit you.