Azure Managed Disk – Overview


Microsoft released Azure Managed Disks two weeks ago.  Let's look at them!

What did we have until then?  The virtual hard disk (.vhd file) was stored as a page blob in an Azure Storage account.

That worked quite well, and managed disks are little more than that:  a thin abstraction.  But since Azure now knows it's a disk, it can optimize for it.

Issues with unmanaged disks

That's right, our good old page blob VHD is now an unmanaged disk.  This sounds like 2001, when Microsoft released .NET & managed code and you learned that all the code you had written until then was unmanaged.  Unruly!

Let’s look at the different issues.

The first that comes to mind is Input / Output Operations per Second (IOPS).  A storage account tops out at 20000 IOPS.  An unmanaged standard disk can have 500 IOPS.  That means that past 40 disks in a storage account (if it only contains disks), we'll start to get throttled.  This doesn't sound too bad if we plan to run 2-3 VMs, but for larger deployments we need to be careful.  Of course, we could choose to put each VHD in a different storage account, but a subscription is limited to 100 storage accounts, and that adds management overhead (e.g. managing the domain names & access keys of 100 accounts).

Another one is access rights.  If we put more than one disk in a storage account, we can't give different people access to different disks:  anybody who is contributor on the storage account has access to all disks in the account.

A painful one is around custom images.  Say we customize a Windows or Linux image and have our generalized VHD ready to fire up VMs.  That VHD needs to be in the same storage account as the VHDs of the created VMs.  Given the IOPS math above, that means we can only really create 40 VMs from it.  That's where the limitation on VM scale sets with custom images comes from.

A side effect of being in a storage account is that the VHD is publicly accessible.  You still need a SAS token or an access key, but that's the thing:  for industries with strict regulations / compliance / audits, the idea that "if somebody walked out with your access key, even after being fired and having their logins revoked, they could still download and even change your VHD" is a deal breaker.

Finally, one that few people are aware of:  reliability.  Storage accounts are highly available, keeping 3 synchronous copies, and have an SLA of 99.9%.  The problem is when we match them with VMs.  We can set up high availability of a group of VMs by defining an availability set:  this gives some guarantees on how our VMs are affected during planned / unplanned downtime.  Two VMs can be placed in two different fault domains, i.e. deployed on different hosts that don't share any critical hardware (e.g. power supply, network switch, etc.), but…  their VHDs might still be on the same storage stamp (or cluster).  So if a storage stamp goes down for some reason, two VMs with different fault / update domains could go down at the same time.  If those are our only two VMs in the availability set, the set goes down.

Managed Disks

Managed disks are simply page blobs stored in a Microsoft managed storage account.  On the surface, not much of a change, right?

Well…  let's address each issue we've identified:

  • IOPS:  disks are assigned to different storage accounts in a way that ensures we never get throttled because of storage account limits.
  • Access Rights:  managed disks are first-class citizens in Azure.  That means they appear as Azure resources and can have RBAC permissions assigned to them (see the sketch after this list).
  • Custom Image:  besides managed disks, we now have snapshots and images as first-class citizens.  An image no longer belongs to a storage account, which removes the constraint we had before.
  • Public Access:  disks aren't publicly accessible.  The only way to access them is via a SAS token.  This also means we do not need to invent a globally unique domain name.
  • Reliability:  when we associate a disk with a VM in an availability set, Azure makes sure that VMs in different fault domains aren't on the same storage stamp.
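
Since a managed disk is a full-fledged ARM resource, we can scope a role assignment to a single disk.  Here is a minimal sketch, assuming the AzureRM module ; the subscription id, resource group & disk names are hypothetical:

#  Hypothetical scope:  one specific managed disk
$scope = "/subscriptions/<sub-id>/resourceGroups/demo-rg" +
    "/providers/Microsoft.Compute/disks/data-disk-1"

#  Grant a single user read access to that one disk only
New-AzureRmRoleAssignment -SignInName "alice@contoso.com" `
    -RoleDefinitionName "Reader" `
    -Scope $scope

This is exactly the kind of per-disk access control that is impossible when several VHDs share one storage account.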

Other differences

Besides the obvious advantages, here is a list of differences from unmanaged disks:

  • Managed disks can be on both Premium & Standard storage, but only LRS
  • Standard managed disks are priced according to the closest pre-defined fixed size, not the "currently used # of GBs"
  • Standard managed disks still bill for transactions

Also, quite importantly, managed disks do not support Storage Service Encryption at the time of this writing (February 2017).  It is supposed to come very soon though, and managed disks do already support encrypted disks (Azure Disk Encryption).

Summary

Managed Disks bring a couple of goodies with them.  The most significant one is reliability, but other features will clearly make our lives easier.

In future articles, I’ll do a couple of hands on with Azure Managed Disks.

Extended Outage @ Instapaper – Resiliency example

I use Instapaper extensively to store the continuous flow of internet articles I want to read.  I've created a bunch of tools integrating with it (e.g. monitoring Atom feeds and sending new articles to Instapaper).

Last week my tools didn’t work for a while so I finally logged in directly to the site.  The site was down, citing an extended outage, with a reference to a blog article to explain the outage.

It got back on its feet after around 48 hours.  This isn't an article to call out Instapaper's engineering:  that type of outage happens everywhere, all the time.  But let's learn from it.

The article cites the cause of the outage as being "we hit a system limit for our hosted database that's preventing new articles from being saved".  It also says they had to spend time on the phone with their cloud provider (it doesn't mention which one) before diagnosing the problem.

Know the limits

They hit a limit.

What are the limits?

A clear piece of advice when working in Azure:  know where your sandbox ends.  We should always consult http://aka.ms/azurelimits to inform our architecture.

The nice thing about the cloud is that most limits are clearly defined and embedded in SLAs.

This comes as a surprise to a lot of people, especially when they see some of the restrictions.  "How come I can only put 2 disks on a Standard D1 v2 VM?"  This is because of the experience we have on premises, where there are few hard-wired limitations.  Instead, we typically have degradations:  sure, you can put a terabyte of storage on your old laptop, but you are likely going to saturate its I/O before you exhaust the storage.  In Azure, limits are more clearly defined because Azure is a public cloud, i.e. a multi-tenant environment.  The only way for Azure to guarantee other customers they can have capacity X is to make sure we do not use that capacity.  So there are sandboxes everywhere.

On the flip side, you have fewer surprises.  The reason you do not have so many sandboxes on prem is that everything is shared.  That is nice as long as we are the only one cranking up resource usage.  But when other systems start grinding that SAN, it isn't so fast anymore.

So the first lesson we get from the Instapaper outage is to know the limits.  Did the Instapaper team know the limits of their database?  Did they hit some software limit, e.g. exhausting all 32 bits of an identifier column?

Monitor for those limits

Now that we know the limits, we know when our system is going to break.

So we need to monitor it, so that we won't have a bad surprise like the Instapaper team must have had (I'm sure some dinners were cancelled during those 48 hours).

In Azure, we can use Azure Monitor / Log Analytics to monitor different metrics on our resources.  We can set up alerts to be notified when some threshold has been reached.

We can then react.  If we set the threshold low enough, that will give us a window of time to react and plan.
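
As an illustration, here is how such an alert rule could be created with the AzureRM.Insights module ; the resource id & names are hypothetical and the exact parameter names may vary between module versions:

#  Raise an alert when average CPU exceeds 75% over a 5 minute window
$vmId = "/subscriptions/<sub-id>/resourceGroups/demo-rg" +
    "/providers/Microsoft.Compute/virtualMachines/MyVM"

Add-AzureRmMetricAlertRule -Name "cpu-high" `
    -Location "East US" `
    -ResourceGroup "demo-rg" `
    -TargetResourceId $vmId `
    -MetricName "Percentage CPU" `
    -Operator GreaterThan `
    -Threshold 75 `
    -WindowSize "00:05:00" `
    -TimeAggregationOperator Average

Setting the threshold at 75% rather than 100% is what buys us the reaction window.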

In an article on SQL Database sizes in an elastic pool, we saw that we can cap the size of an individual database to make sure it doesn't consume the entire pool's allowance.  This is a safeguard mechanism in case our monitoring strategy fails.
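
That safeguard is a one-liner with the AzureRM.Sql module ; server & database names here are hypothetical:

#  Cap an individual database at 10 GB so it can't eat the whole pool
Set-AzureRmSqlDatabase -ResourceGroupName "demo-rg" `
    -ServerName "demo-server" `
    -DatabaseName "tenant42" `
    -MaxSizeBytes 10GB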

Strategy to get beyond the limits

We know the limits of our architecture.  We monitor our system to get a heads-up before those limits are reached.

When they are reached, what do we do?

Typically, if we're aware of the limits, we will have given them some thought, even if it isn't fully documented.  This is the kind of thing that comes up in handover conversations:  "this system has this limit, but you know, if that ever becomes a problem, consider that alternative".  Hopefully, the alternative doesn't consist of rewriting the entire system.

Scale out / partitioning strategies

If we are lucky, the original architect(s) of the system have baked in a way to overcome its limits.

A typical way is to partition the system.  We can partition the system per user / group of users or other data segments.  This way, we can scale out the system to handle different partitions.

We can start with only one set of resources (e.g. Virtual Machines) handling all partitions, and the day that set of resources hits its limits, we can split the partitions into two groups and have another set of resources handle the second group.  And so on.
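
As a toy illustration, here is a deterministic mapping from a partition key to a resource set:  a stable hash modulo the number of sets.  Real systems often prefer consistent hashing, precisely to limit how many partitions move when the set count changes:

function Get-ResourceSetIndex([string]$partitionKey, [int]$setCount) {
    #  Stable hash (MD5) so the mapping survives process restarts
    $md5 = [System.Security.Cryptography.MD5]::Create()
    $bytes = $md5.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($partitionKey))
    $hash = [System.BitConverter]::ToUInt32($bytes, 0)

    return $hash % $setCount
}

#  With one resource set, everything lands on index 0 ; bumping the
#  count to 2 splits the partitions across two sets
Get-ResourceSetIndex -partitionKey "user-12345" -setCount 2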

One day, that partition scheme might itself hit its limits.  For instance, maybe we have one resource set per partition and most of those sets have hit their limits.  In that case, we're back to figuring out how to go beyond the limits of our system.  Typically, that will consist of repartitioning it in a way that lets us scale out further.

Azure itself is a partitioned / scaled-out system.  That strategy has allowed it to grow to its current size.  Along the way, repartitioning was needed.  For instance, the transition from ASM to ARM was partially that:  in ASM, only 2 regions handled the Azure management APIs, while in ARM, each region handles API requests.

Conclusion

Instapaper was victim of its own success.  That’s a good problem to have but a problem nevertheless.

Make sure you at least know the limits of your system and monitor for them.  This way, if success curses you, you'll be able to react during working hours instead of cancelling your Valentine's dinner and spending your days justifying yourself to your CEO.

Automating Azure AD


In the previous article, we explored how to interact (read / write) to an Azure AD tenant using Microsoft Graph API.

In the article before that, we looked at how to authenticate a user without using Azure AD web flow.

Those were motivated by a specific scenario:  replacing a LDAP server by Azure AD while migrating a SaaS application to Azure.

Now, a SaaS application will typically have multiple tenants or instances.  Configuring Azure AD by hand, like any other Azure service, can be tedious and error-prone.  Furthermore, once we've onboarded, say, 20 tenants, making a configuration change will be even more tedious and error-prone.

This is why we’ll look at automation in this article.

We'll look at how to automate the creation of the Azure AD applications we used in the last two articles.  From there, it's pretty easy to generalize (aka an exercise for the reader!).

Azure AD Tenant creation

From the get-go, bad news:  we can't create the tenant through automation.

No API is exposed for that ; we need to go through the portal.

Sorry.

Which PowerShell commands to use?

Automating Azure AD is a little confusing.  Too many options is like not enough.

The first approach should probably be to use the Azure PowerShell package like the rest of Azure services.  For instance, to create an application, we would use New-AzureRmADApplication.

The problem with that package for our scenario is that the Azure AD tenant isn't attached to a subscription.  This is typical for a SaaS model:  we have an Azure AD tenant to manage internal users on all subscriptions and then different tenants to manage external users.  Unfortunately, at the time of this writing, the Azure PowerShell package is built around the Add-AzureRmAccount command to authenticate the user ; that command binds a subscription (directly or via Select-AzureRmSubscription).  But in our case we do not have a subscription:  our Azure AD tenant isn't managing one.

The second approach would then be to use the MSOnline module.  That's centered on Azure AD, but it is slowly being deprecated in favour of…

The third approach:  the Azure Active Directory V2 PowerShell module.  This is what we're going to use.

I want to give a big shout out to Chris Dituri for blazing the trail here.  His article Azure Active Directory: Creating Applications and SPNs with Powershell was instrumental in writing this article.  As we'll see, there are a bunch of intricacies about the application permissions that aren't documented and that Chris unraveled.

The first thing we’ll need to do is to install the PowerShell package.  Easy:

Install-Module AzureADPreview

If you read this from the future, this might have changed, so check out the documentation page for install instructions.

Connect-AzureAD

We need to connect to our tenant:

connect-azuread -TenantId bc7d0032…

You can see the documentation on the Connect-AzureAD command here.

Where do we get our tenant ID?

[screenshot]

Now we can go and create applications.

Service-Client

Here we’ll replicate the applications we built by hand in the Authenticating to Azure AD non-interactively article.

Remember, those are two applications:  a service and a client.  The client app has permission to access the service app & to let users sign in to it.  As we'll see, granting those permissions is a little tricky.

Let’s start by the final PowerShell code:

#  Grab the Azure AD Service principal
$aad = (Get-AzureADServicePrincipal | `
    where {$_.ServicePrincipalNames.Contains("https://graph.windows.net")})[0]
#  Grab the User.Read permission
$userRead = $aad.Oauth2Permissions | ? {$_.Value -eq "User.Read"}

#  Resource Access User.Read + Sign in
$readUserAccess = [Microsoft.Open.AzureAD.Model.RequiredResourceAccess]@{
  ResourceAppId=$aad.AppId ;
  ResourceAccess=[Microsoft.Open.AzureAD.Model.ResourceAccess]@{
    Id = $userRead.Id ;
    Type = "Scope"}}

#  Create Service App
$svc = New-AzureADApplication -DisplayName "MyLegacyService" `
    -IdentifierUris "uri://mylegacyservice.com"
# Associate a Service Principal to the service Application 
$spSvc = New-AzureADServicePrincipal -AppId $svc.AppId
#  Grab the user-impersonation permission
$svcUserImpersonation = $spSvc.Oauth2Permissions | `
    ?{$_.Value -eq "user_impersonation"}
 
#  Resource Access 'Access' a service
$accessAccess = [Microsoft.Open.AzureAD.Model.RequiredResourceAccess]@{
  ResourceAppId=$svc.AppId ;
  ResourceAccess=[Microsoft.Open.AzureAD.Model.ResourceAccess]@{
    Id = $svcUserImpersonation.Id ;
    Type = "Scope"}}
#  Create Required Access 
$client = New-AzureADApplication -DisplayName "MyLegacyClient" `
  -PublicClient $true `
  -RequiredResourceAccess $readUserAccess, $accessAccess

As promised, there is an ample amount of ceremony.  Let's go through it.

  • Line 1:  we grab the service principal that has https://graph.windows.net as a name ; you can list all the service principals living in your tenant with Get-AzureADServicePrincipal ; I have 15 in a clean tenant.  We'll need the Graph one since we need to give access to it.
  • Line 5:  we grab the specific User.Read permission from that service principal's Oauth2Permissions collection.  Basically, service principals expose the permissions that other apps can request.  We're going to need the ID.  Lots of GUIDs in Azure AD.
  • Line 8:  we then construct a user-read RequiredResourceAccess object with that permission.
  • Line 15:  we create our legacy service app.
  • Line 16:  we associate a service principal to that app.
  • Line 20:  we grab the user_impersonation permission of that service principal.  Same mechanism we used for the Graph API, just a different permission.
  • Line 24:  we build another RequiredResourceAccess object around that user_impersonation permission.
  • Line 30:  we create our legacy client app ; we attach both the User.Read & user_impersonation permissions to it.
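
To sanity-check the result, we can query the applications back ; Get-AzureADApplication supports an OData filter:

#  Both applications should come back with their AppIds populated
Get-AzureADApplication -Filter "DisplayName eq 'MyLegacyService'"
Get-AzureADApplication -Filter "DisplayName eq 'MyLegacyClient'"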

Grant Permissions

If we try to run the authentication code from that article, we'll first need to change the "clientID" value to $client.AppId (and make sure serviceUri has the value "uri://mylegacyservice.com").

Now if we run it, we'll get an error along the lines of:

The user or administrator has not consented to use the application with ID ‘…’. Send an interactive authorization request for this user and resource.

What is that?

There is one manual step we have to take:  granting the permissions to the application.  In Azure AD, this must be performed by an admin and there is no API exposed for it.

We could sort of automate it via an authentication workflow (which is what the error message suggests), but I won't do that here.

Basically, an administrator (of the Azure AD tenant) needs to approve the use of the app.

We can also do it, still manually, via the portal as we did in the article.  But first, let’s throw the following command:

Get-AzureADOAuth2PermissionGrant

On an empty tenant, there should be nothing returned.  Unfortunately, there is no Add/New-AzureADOAuth2PermissionGrant at the time of this writing (this might have changed if you are from the future, so make sure you check the available commands).

So the manual step is, in the portal, to go to the MyLegacyClient app, select Required Permissions, then click the Grant Permissions button.

[screenshot]

Once we’ve done this we can run the same PowerShell command, i.e.

Get-AzureADOAuth2PermissionGrant

and have two entries now.

[screenshot]

We see the two permissions we attached to MyLegacyClient.

We should now be able to run the authentication code.

Graph API App

Here we’ll replicate the application we created by hand in the Using Microsoft Graph API to interact with Azure AD article.

This is going to be quite similar, except we're going to attach a client secret to the application so that the application can authenticate itself.

#  Grab the Azure AD Service principal
$aad = (Get-AzureADServicePrincipal | `
    where {$_.ServicePrincipalNames.Contains("https://graph.windows.net")})[0]
#  Grab the User.Read permission
$userRead = $aad.Oauth2Permissions | ? {$_.Value -eq "User.Read"}
#  Grab the Directory.ReadWrite.All permission
$directoryWrite = $aad.Oauth2Permissions | `
  ? {$_.Value -eq "Directory.ReadWrite.All"}


#  Resource Access User.Read + Sign in & Directory.ReadWrite.All
$readWriteAccess = [Microsoft.Open.AzureAD.Model.RequiredResourceAccess]@{
  ResourceAppId=$aad.AppId ;
  ResourceAccess=[Microsoft.Open.AzureAD.Model.ResourceAccess]@{
    Id = $userRead.Id ;
    Type = "Scope"}, [Microsoft.Open.AzureAD.Model.ResourceAccess]@{
    Id = $directoryWrite.Id ;
    Type = "Role"}}

#  Create querying App
$queryApp = New-AzureADApplication -DisplayName "QueryingApp" `
    -IdentifierUris "uri://myqueryingapp.com" `
    -RequiredResourceAccess $readWriteAccess

#  Associate a Service Principal so it can login
$spQuery = New-AzureADServicePrincipal -AppId $queryApp.AppId

#  Create a key credential for the app valid from now
#  (-1 day, to accommodate client / service time difference)
#  till three months from now
$startDate = (Get-Date).AddDays(-1)
$endDate = $startDate.AddMonths(3)

$pwd = New-AzureADApplicationPasswordCredential -ObjectId $queryApp.ObjectId `
  -StartDate $startDate -EndDate $endDate `
  -CustomKeyIdentifier "MyCredentials"

You need to “grant permissions” for the new application before trying to authenticate against it.

Two big remarks on tiny bugs ; they might be fixed by the time you read this, and they aren't critical as they both have easy workarounds:

  1. The last command in the script, i.e. the password section, will fail with a "stream property was found in a JSON Light request payload. Stream properties are only supported in responses" error if you execute the entire script in one go.  If you execute it separately, it doesn't.  Beats me.
  2. This one took me 3 hours to realize, so benefit from my experience:  DO NOT TRY TO AUTHENTICATE THE APP BEFORE GRANTING PERMISSIONS.  There seems to be some caching on the authentication service, so if you try to authenticate while you don't have the permissions, you'll keep receiving the same claims even after you did grant the permissions.  Annoying, but easy to avoid once you know about it.

An interesting aspect of automation is that we get much more fine-grained control over the duration of the password than in the portal (1 year, 2 years, infinite).  That allows us to implement a more aggressive rotation of secrets.
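
Here is what a rotation could look like ; a sketch assuming the same AzureADPreview module, where $oldKeyId stands for the KeyId of the credential being retired:

#  Add a fresh 3-month credential first...
$newPwd = New-AzureADApplicationPasswordCredential -ObjectId $queryApp.ObjectId `
  -StartDate (Get-Date).AddDays(-1) -EndDate (Get-Date).AddMonths(3)

#  ...then, once clients have switched over, list & drop the old one
Get-AzureADApplicationPasswordCredential -ObjectId $queryApp.ObjectId
Remove-AzureADApplicationPasswordCredential -ObjectId $queryApp.ObjectId `
  -KeyId $oldKeyId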

Summary

Automation with Azure AD, as with other services, helps reduce provisioning effort and human error.

There are two big manual steps that can’t be automated in Azure AD:

  • Azure AD tenant creation
  • Granting permissions on an application

That might change in the future, but for now, it limits how much automation you can do without human interaction.

Using Microsoft Graph API to interact with Azure AD

In my last article, I showed how to authenticate on Azure AD using a user name / password without using the native web flow.

The underlying scenario was to migrate an application using an LDAP server by leveraging an Azure AD tenant.

The logical continuation of that scenario is to use the Microsoft Graph API to interact with the tenant the same way we would use LDAP queries to interact with the LDAP server.

Microsoft Graph API is a generalization of the Azure AD Graph API and should be used instead.  It consists of simple REST queries which are all documented.

In this scenario, I’ll consider three simple interactions:

  • Testing if a user exists
  • Returning the groups a user belongs to
  • Creating a new user

But first we need to setup the Azure AD tenant.

Azure AD setup

We’re going to rely on the last article to do the heavy lifting.

We are going to create a new application (here we’re going to use the name “QueryingApp”) of type Web App / API (although native should probably work).

The important part is to grab its Application ID & also to give it enough permissions to create users.

[screenshot]

In this scenario, the application is going to authenticate itself (as opposed to a user) so we need to define a secret.

[screenshot]

We’ll need to add a key, save & copy the key value.

Authenticating with ADAL

This sample is in C# / .NET but since the Active Directory Authentication Library (ADAL) is available on multiple platform (e.g. Java), this should be easy to port.

We need to install the NuGet package Microsoft.IdentityModel.Clients.ActiveDirectory in our project.

        private static async Task<string> AppAuthenticationAsync()
        {
            //  Constants
            var tenant = "LdapVplDemo.onmicrosoft.com";
            var resource = "https://graph.microsoft.com/";
            var clientID = "9a9f5e70-5501-4e9c-bd00-d4114ebeb419";
            var secret = "Ou+KN1DYv8337hG8o8+qRZ1EPqBMWwER/zvgqvmEe74=";

            //  Ceremony
            var authority = $"https://login.microsoftonline.com/{tenant}";
            var authContext = new AuthenticationContext(authority);
            var credentials = new ClientCredential(clientID, secret);
            var authResult = await authContext.AcquireTokenAsync(resource, credentials);

            return authResult.AccessToken;
        }

Here the clientID is the application ID of the application we created, and secret is the secret key we created for it.

Here we return the access token since we're going to use it.

Authenticating with HTTP POST

If we do not want to integrate with ADAL, here's the bare-bones HTTP POST version:

        private static async Task<string> HttpAppAuthenticationAsync()
        {
            //  Constants
            var tenant = "LdapVplDemo.onmicrosoft.com";
            var clientID = "9a9f5e70-5501-4e9c-bd00-d4114ebeb419";
            var resource = "https://graph.microsoft.com/";
            var secret = "Ou+KN1DYv8337hG8o8+qRZ1EPqBMWwER/zvgqvmEe74=";

            using (var webClient = new WebClient())
            {
                var requestParameters = new NameValueCollection();

                requestParameters.Add("resource", resource);
                requestParameters.Add("client_id", clientID);
                requestParameters.Add("grant_type", "client_credentials");
                requestParameters.Add("client_secret", secret);

                var url = $"https://login.microsoftonline.com/{tenant}/oauth2/token";
                var responsebytes = await webClient.UploadValuesTaskAsync(url, "POST", requestParameters);
                var responsebody = Encoding.UTF8.GetString(responsebytes);
                var obj = JsonConvert.DeserializeObject<JObject>(responsebody);
                var token = obj["access_token"].Value<string>();

                return token;
            }
        }

Here I use the popular Newtonsoft Json NuGet package to handle JSON.

By no means is the code here a masterpiece of robustness and style.  It is meant to be straightforward and easy to understand.  By all means, improve it 500% before calling it production-ready!

Testing if a user exists

Here we’re going to use the User Get of Microsoft Graph API.

There is actually a NuGet package for the Microsoft Graph API and, in general, SDKs for 9 platforms (at the time of this writing).

        private static async Task<bool> DoesUserExistsAsync(HttpClient client, string user)
        {
            try
            {
                var payload = await client.GetStringAsync($"https://graph.microsoft.com/v1.0/users/{user}");

                return true;
            }
            catch (HttpRequestException)
            {
                return false;
            }
        }

Again, the code is minimalist here.  The HTTP GET actually returns user information that could be used.

Returning the groups a user belongs to

Here we're going to use the memberOf method of the Microsoft Graph API.

        private static async Task<string[]> GetUserGroupsAsync(HttpClient client, string user)
        {
            var payload = await client.GetStringAsync(
                $"https://graph.microsoft.com/v1.0/users/{user}/memberOf");
            var obj = JsonConvert.DeserializeObject<JObject>(payload);
            var groupDescription = from g in obj["value"]
                                   select g["displayName"].Value<string>();

            return groupDescription.ToArray();
        }

Here, we deserialize the returned payload to extract the group display names.  The information returned is richer and could be used.

Creating a new user

Finally we’re going to use the Create User method of Microsoft Graph API.

This is slightly more complicated as it is an HTTP POST with a JSON payload as input.

        private static async Task CreateUserAsync(HttpClient client, string user, string domain)
        {
            using (var stream = new MemoryStream())
            using (var writer = new StreamWriter(stream))
            {
                var payload = new
                {
                    accountEnabled = true,
                    displayName = user,
                    mailNickname = user,
                    userPrincipalName = $"{user}@{domain}",
                    passwordProfile = new
                    {
                        forceChangePasswordNextSignIn = true,
                        password = "tempPa$$w0rd"
                    }
                };
                var payloadText = JsonConvert.SerializeObject(payload);

                writer.Write(payloadText);
                writer.Flush();
                stream.Flush();
                stream.Position = 0;

                using (var content = new StreamContent(stream))
                {
                    content.Headers.Add("Content-Type", "application/json");

                    var response = await client.PostAsync("https://graph.microsoft.com/v1.0/users/", content);

                    if (!response.IsSuccessStatusCode)
                    {
                        throw new InvalidOperationException(response.ReasonPhrase);
                    }
                }
            }
        }

Calling Code

The calling code looks like this:


        private static async Task Test()
        {
            //var token = await AppAuthenticationAsync();
            var token = await HttpAppAuthenticationAsync();

            using (var client = new HttpClient())
            {
                client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token);

                var user = "test@LdapVplDemo.onmicrosoft.com";
                var userExist = await DoesUserExistsAsync(client, user);

                Console.WriteLine($"Does user exists?  {userExist}");

                if (userExist)
                {
                    var groups = await GetUserGroupsAsync(client, user);

                    foreach (var g in groups)
                    {
                        Console.WriteLine($"Group:  {g}");
                    }

                    await CreateUserAsync(client, "newuser", "LdapVplDemo.onmicrosoft.com");
                }
            }
        }

Summary

We can see that using the Microsoft Graph API, most if not all LDAP queries can easily be converted.

Microsoft Graph API is aligned with OData v4, which makes it a great REST API where filtering queries are standardized.
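
For instance, a standard OData $filter narrows down a users query.  A hypothetical example, shown here with PowerShell's Invoke-RestMethod and a $token acquired as above:

#  Find users whose displayName starts with "test"
$url = "https://graph.microsoft.com/v1.0/users?`$filter=startswith(displayName,'test')"
Invoke-RestMethod -Uri $url -Headers @{ Authorization = "Bearer $token" }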

Authenticating to Azure AD non-interactively

I want to use Azure AD as a user directory, but I do not want to use its native web authentication mechanism, which requires users to go via an Active Directory page to log in (a page which can be branded and customized to look like my own).

I just want to give a user name & password to an authentication API.

The reason is I want to migrate an existing application currently using an LDAP server and want to change the code as little as possible.

You might have other reasons why you want to do that.

Let's just say that the online literature HEAVILY leans towards using the web workflow.  Even the documentation on native apps recommends that your native app pop up the Azure AD web page to authenticate the user.  There are good reasons for that:  this way, your app never touches user credentials, which makes it more secure and more trustworthy.  So in general, I totally agree with those recommendations.

But in my case, my legacy app touches user credentials all over the place.  Also, this approach is a stepping stone before considering a move to the web workflow.

After a weekend spent scanning the entire web (as a side note, I've learned there are little Scandinavian men living inside the Earth, which is flat by the way), I finally found Danny Strockis' article about authenticating to Azure AD non-interactively using a username & password.

That's basically it.  Except the article is a little quick on setup, so I'm going to elaborate here.

I'm going to give a sample in C# using ADAL, but since at the end of the day the authentication is one HTTP POST, I'll also give a more bare-bones sample using an HTTP POST, in case you don't want to or can't integrate with ADAL.  The samples are quite trivial, so you should be able to convert them to the language / platform of your choice.

Conceptually

We basically want our users to interact with our application only:  they punch in their credentials and the application checks with Azure AD whether the credentials are good.

[screenshot]

The application is represented as a Web app here, but it could also be a native application.

The way to do this in the world of Azure AD is to use two Azure AD apps:

  1. A client app, representing the agent authenticating the user
  2. A service app, representing the actual application the user is authenticating against

I know it looks weird and makes your brain hurt ; that is, in great part, the reason I'm writing this article, because it isn't straightforward.

[screenshot]

In authentication parlance, the client app is the client (so far so good) while the service app is the resource, i.e. the thing the user is trying to access:  the client accesses the resource on the user's behalf.

Setting up Azure AD

Yes, there will be some steps to setup Azure AD.

First we need a tenant.  We could use the tenant of our subscription, but typically for these types of scenarios we'll want a separate tenant for our end users.  This article shows how to create an Azure AD tenant.

We then need a user to test.  We can create a user like this:

[screenshot]

where, of course, ldapvpldemo is my Azure AD tenant.

We'll also need to give that user a "permanent" password, so let's show the password on creation (or reset it afterwards), then go to an InPrivate browsing window and navigate to https://login.microsoftonline.com/.  We can then log in as the user (e.g. test@ldapvpldemo.onmicrosoft.com) with said password.  We'll be prompted to change it (e.g. to My$uperComplexPassw0rd).

Let’s create the client app.  In App Registrations, let’s add an app with:

[screenshot]

It would probably work as a Web app / API but a Native App seems more fitting.

We’ll need the Application ID of that application which we can find by looking at the properties of the application.

[screenshot]

We then need to create the service app:

[screenshot]

We’ll need the App ID URI of the service:

[screenshot]

That URI can be changed ; either way, we need the final value.

We will need to give the client app permission to access the service app.  For that, we go back to the client app, open the Required Permissions menu and add a permission.  From there, in the search box, we can just start typing the name of the service app (e.g. MyLegacyService) and it should appear, where we can select it.  We then check the Access MyLegacyService box.

Finally, we need to grant the permissions to users.

[screenshot]

With all that, we’re ready to authenticate.

ADAL

This sample is in C# / .NET but since ADAL is available on multiple platform (e.g. Java), this should be easy to port.

We'll create a Console Application project.  We need the full .NET Framework, as the .NET Core version of ADAL doesn't have the UserPasswordCredential class we're going to be using.

We need to install the NuGet package Microsoft.IdentityModel.Clients.ActiveDirectory in the project.

        private static async Task AdalAuthenticationAsync()
        {
            //  Constants
            var tenant = "LdapVplDemo.onmicrosoft.com";
            var serviceUri = "https://LdapVplDemo.onmicrosoft.com/d0f883f6-1c32-4a14-a436-0a995a19c39b";
            var clientID = "b9faf13d-9258-4142-9a5a-bb9f2f335c2d";
            var userName = $"test@{tenant}";
            var password = "My$uperComplexPassw0rd1";

            //  Ceremony
            var authority = "https://login.microsoftonline.com/" + tenant;
            var authContext = new AuthenticationContext(authority);
            var credentials = new UserPasswordCredential(userName, password);
            var authResult = await authContext.AcquireTokenAsync(serviceUri, clientID, credentials);
        }

We have to make sure we've copied the constants into the constants section.

This should work and authResult should contain a valid access token that we could use as a bearer token in different scenarios.

If we pass a wrong password or wrong user name, we should obtain an error as expected.

HTTP POST

We can use Fiddler or another HTTP sniffing tool to see what ADAL did for us.  It is easy enough to replicate.

        private static async Task HttpAuthenticationAsync()
        {
            //  Constants
            var tenant = "LdapVplDemo.onmicrosoft.com";
            var serviceUri = "https://LdapVplDemo.onmicrosoft.com/d0f883f6-1c32-4a14-a436-0a995a19c39b";
            var clientID = "b9faf13d-9258-4142-9a5a-bb9f2f335c2d";
            var userName = $"test@{tenant}";
            var password = "My$uperComplexPassw0rd";

            using (var webClient = new WebClient())
            {
                var requestParameters = new NameValueCollection();

                requestParameters.Add("resource", serviceUri);
                requestParameters.Add("client_id", clientID);
                requestParameters.Add("grant_type", "password");
                requestParameters.Add("username", userName);
                requestParameters.Add("password", password);
                requestParameters.Add("scope", "openid");

                var url = $"https://login.microsoftonline.com/{tenant}/oauth2/token";
                var responsebytes = await webClient.UploadValuesTaskAsync(url, "POST", requestParameters);
                var responsebody = Encoding.UTF8.GetString(responsebytes);
            }
        }

Basically, we have an HTTP POST where all the previous arguments are passed in the POST body.

If we pass a wrong password or wrong user name, we should obtain an error as expected.  Interestingly, the HTTP code is 400 (i.e. bad request) instead of some unauthorized variant.

Summary

Authenticating directly against an Azure AD tenant isn't the most recommended method, as it means your application handles credentials, whereas the preferred method delegates credential handling to an Azure AD hosted page so that your application only sees an access token.

But for a legacy migration for instance, it makes sense.  Azure AD definitely is more secure than an LDAP server sitting on a VM.

We’ve seen two ways to perform the authentication.  Under the hood they end up being the same.  One is using the ADAL library while the other uses bare bone HTTP POST.

Keep in mind that ADAL performs token caching.  If you plan to use it in production, you'll want to configure the cache properly to avoid strange behaviours.

Joining an ARM Linux VM to AAD Domain Services

Active Directory is one of the most popular domain controller / LDAP server around.

In Azure we have Azure Active Directory (AAD).  Despite the name, AAD isn't just a multi-tenant AD:  it is built for the cloud.

Sometimes though, it is useful to have a traditional domain controller…  in the cloud.  Typically this is for legacy workloads built to work with Active Directory.  But also, a very common scenario is to join an Azure VM to a domain so that users authenticate on it with the same accounts they use to access the Azure Portal.

The underlying directory could even be synced with a corporate network, in which case users could log into the VMs using their corporate accounts.  I won't cover this here, but you can read about the AD Connect part in a previous article.

The straightforward option is to build an Active Directory cluster on Azure VMs.  This will work but requires the maintenance of those (typically two) VMs.


An easier option is AAD Domain Services (AADDS).  AADDS exposes an AAD tenant as a managed domain service.  It does so by provisioning some variant of a managed Active Directory cluster, i.e. we do not see or care about the underlying VMs.

The cluster is synchronized one-way (from AAD to AADDS).  For this reason, AADDS is read-only through its LDAP interface, e.g. we can't reset a password using LDAP.

The Azure documentation walks us through such an integration with classic (ASM) VMs.  Since ARM has been around for more than a year, I recommend always going with ARM VMs.  This article aims at showing how to do that.

I'll heavily leverage the existing documentation and detail only what differs from the official walkthrough.

Also, keep in mind this article is written in January 2017.  Azure AD will transition to ARM in the future and will likely render this article obsolete.

Dual Networks

The main challenge we face is that AAD is an ASM service and AAD Domain Services are exposed within an ASM Virtual Network (VNET), which is incompatible with our ARM VM.

Thankfully, we now have Virtual Network peering, allowing us to peer an ASM and an ARM VNET together so they act as one.

[screenshot]

Peering

As with all VNET peering, the two VNETs must have mutually exclusive (non-overlapping) IP address spaces.

I created two VNETs in the portal (https://portal.azure.com).  I recommend creating both explicitly in the new portal ; this way, even the classic one will be part of the desired resource group.

The classic one has 10.0.0.0/24 address range while the ARM one has 10.1.0.0/24.

The peering can be done from the portal too.  In the Virtual Network pane (say the ARM one), select Peerings:

[screenshot]

We should see an empty list here, so let’s click Add.

We need to give the peering a name.  Let’s type PeeringToDomainServices.

We then select Classic in the Peer details since we want to peer with a classic VNET.

Finally, we’ll click on Choose a Virtual Network.

[screenshot]

From there we should see the classic VNET we created.
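
The same peering can be scripted.  Here is a sketch assuming the AzureRM.Network module ; the names and the classic VNET's resource id are hypothetical:

#  The ARM-side VNET
$armVnet = Get-AzureRmVirtualNetwork -Name "MyArmVNet" -ResourceGroupName "demo-rg"

#  The remote end is the classic (ASM) VNET, referenced by its resource id
$classicVnetId = "/subscriptions/<sub-id>/resourceGroups/demo-rg" +
    "/providers/Microsoft.ClassicNetwork/virtualNetworks/MyClassicVNet"

Add-AzureRmVirtualNetworkPeering -Name "PeeringToDomainServices" `
    -VirtualNetwork $armVnet `
    -RemoteVirtualNetworkId $classicVnetId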

Configuring AADDS

The online documentation is quite good for this.

Just make sure you select the classic VNET we created.

You can give it a domain name that is different from the AAD domain name (i.e. *.onmicrosoft.com).

Enabling AADDS takes up to 30 minutes.  Don’t hold your breath!

Joining a VM

We can now create a Linux VM, put it in the ARM VNET we created, and join it to the AADDS domain.

Again, the online documentation does a great job of walking us through the process.  The documentation is written for Red Hat.

When I tried it, I used a CentOS VM and ended up using different commands, namely the realmd tooling (ignoring the SAMBA part of the article).

Conclusion

It is fairly straightforward to enable Domain Services in AAD, and the process is well documented.

A challenge we currently have is to join, or simply communicate from, an ARM VM to AADDS.  For this we need two networks, a classic (ASM) one and an ARM one, and we need to peer them together.

Troubleshooting NSGs using Diagnostic Logs

I've written about how to use Network Security Groups (NSG) before.

Chances are, once you get a complicated enough set of rules in an NSG, you'll find yourself with NSGs that do not do what you think they should do.

Troubleshooting NSGs isn’t trivial.

I’ll try to give some guidance here but to this date (January 2017), there is no tool where you can just say “please follow packet X and tell me against which wall it bumps”.  It’s more indirect than that.

 

Connectivity

First thing, make sure you can connect to your VNET.

If you are connecting to a VM via a public IP, make sure you have access to that IP (i.e. you're not sitting behind an on-premises firewall blocking the outgoing port you are trying to use) and that the IP is connected to the VM, either directly or via a Load Balancer.

If you are connecting to a VM via a private IP through a VPN Gateway of some sort, make sure you can connect and that your packets are routed to the gateway and from there they get routed to the proper subnet.

An easy way to be sure of that is to remove all NSGs and replace them with a "let everything in" rule.  Of course, that also opens your workloads to hackers, so I recommend doing it with a test VM that you destroy afterwards.
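
Rather than removing the NSGs outright, we can prepend an allow-all rule.  A sketch with the AzureRM.Network module (hypothetical NSG & resource group names ; again, only on a throwaway VM):

#  Prepend a temporary allow-all inbound rule to the NSG ; the priority
#  must be unique and lower-numbered than the deny rules to take precedence
Get-AzureRmNetworkSecurityGroup -Name "subnetNSG" -ResourceGroupName "demo-rg" |
  Add-AzureRmNetworkSecurityRuleConfig -Name "Temp-Allow-All-In" `
    -Access Allow -Direction Inbound -Priority 250 `
    -Protocol "*" -SourceAddressPrefix "*" -SourcePortRange "*" `
    -DestinationAddressPrefix "*" -DestinationPortRange "*" |
  Set-AzureRmNetworkSecurityGroup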

Diagnose

Then I would recommend going through the official Azure guidelines to troubleshoot NSGs.  They walk you through the different diagnosis tools.

Diagnostic Logs

If you've reached this section and haven't achieved greatness yet, well…  you need something else.

What we’re going to do here is use NSG Diagnostic Logs to understand a bit more what is going on.

By no means is this magic ; especially in an environment already in use, where a lot of traffic is flying around, it might be difficult to make sense of what the logs give us.

Nevertheless, the logs give us a picture of what is really happening.  They are aggregated though, so we won't see our PC's IP address, for instance.  The aggregation is probably what limits the logs' effectiveness the most.

Sample configuration

I provide here a sample configuration I’m going to use to walk through the troubleshooting process.

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "VM Admin User Name": {
      "defaultValue": "myadmin",
      "type": "string"
    },
    "VM Admin Password": {
      "defaultValue": null,
      "type": "securestring"
    },
    "Disk Storage Account Name": {
      "defaultValue": "<your prefix>vmpremium",
      "type": "string"
    },
    "Log Storage Account Name": {
      "defaultValue": "<your prefix>logstandard",
      "type": "string"
    },
    "VM Size": {
      "defaultValue": "Standard_DS2",
      "type": "string",
      "allowedValues": [
        "Standard_DS1",
        "Standard_DS2",
        "Standard_DS3"
      ],
      "metadata": {
        "description": "SKU of the VM."
      }
    },
    "Public Domain Label": {
      "type": "string"
    }
  },
  "variables": {
    "Vhds Container Name": "vhds",
    "VNet Name": "MyVNet",
    "Ip Range": "10.0.1.0/24",
    "Public IP Name": "MyPublicIP",
    "Public LB Name": "PublicLB",
    "Address Pool Name": "addressPool",
    "Subnet NSG Name": "subnetNSG",
    "VM NSG Name": "vmNSG",
    "RDP NAT Rule Name": "RDP",
    "NIC Name": "MyNic",
    "VM Name": "MyVM"
  },
  "resources": [
    {
      "type": "Microsoft.Network/publicIPAddresses",
      "name": "[variables('Public IP Name')]",
      "apiVersion": "2015-06-15",
      "location": "[resourceGroup().location]",
      "tags": {
        "displayName": "Public IP"
      },
      "properties": {
        "publicIPAllocationMethod": "Dynamic",
        "idleTimeoutInMinutes": 4,
        "dnsSettings": {
          "domainNameLabel": "[parameters('Public Domain Label')]"
        }
      }
    },
    {
      "type": "Microsoft.Network/loadBalancers",
      "name": "[variables('Public LB Name')]",
      "apiVersion": "2015-06-15",
      "location": "[resourceGroup().location]",
      "tags": {
        "displayName": "Public Load Balancer"
      },
      "properties": {
        "frontendIPConfigurations": [
          {
            "name": "LoadBalancerFrontEnd",
            "comments": "Front end of LB:  the IP address",
            "properties": {
              "publicIPAddress": {
                "id": "[resourceId('Microsoft.Network/publicIPAddresses/', variables('Public IP Name'))]"
              }
            }
          }
        ],
        "backendAddressPools": [
          {
            "name": "[variables('Address Pool Name')]"
          }
        ],
        "loadBalancingRules": [
          {
            "name": "Http",
            "properties": {
              "frontendIPConfiguration": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/frontendIPConfigurations/LoadBalancerFrontEnd')]"
              },
              "frontendPort": 80,
              "backendPort": 80,
              "enableFloatingIP": false,
              "idleTimeoutInMinutes": 4,
              "protocol": "Tcp",
              "loadDistribution": "Default",
              "backendAddressPool": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/backendAddressPools/', variables('Address Pool Name'))]"
              },
              "probe": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/probes/TCP-Probe')]"
              }
            }
          }
        ],
        "probes": [
          {
            "name": "TCP-Probe",
            "properties": {
              "protocol": "Tcp",
              "port": 80,
              "intervalInSeconds": 5,
              "numberOfProbes": 2
            }
          }
        ],
        "inboundNatRules": [
          {
            "name": "[variables('RDP NAT Rule Name')]",
            "properties": {
              "frontendIPConfiguration": {
                "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/frontendIPConfigurations/LoadBalancerFrontEnd')]"
              },
              "frontendPort": 3389,
              "backendPort": 3389,
              "protocol": "Tcp"
            }
          }
        ],
        "outboundNatRules": [],
        "inboundNatPools": []
      },
      "dependsOn": [
        "[resourceId('Microsoft.Network/publicIPAddresses', variables('Public IP Name'))]"
      ]
    },
    {
      "type": "Microsoft.Network/virtualNetworks",
      "name": "[variables('VNet Name')]",
      "apiVersion": "2016-03-30",
      "location": "[resourceGroup().location]",
      "properties": {
        "addressSpace": {
          "addressPrefixes": [
            "10.0.0.0/16"
          ]
        },
        "subnets": [
          {
            "name": "default",
            "properties": {
              "addressPrefix": "[variables('Ip Range')]",
              "networkSecurityGroup": {
                "id": "[resourceId('Microsoft.Network/networkSecurityGroups', variables('Subnet NSG Name'))]"
              }
            }
          }
        ]
      },
      "resources": [],
      "dependsOn": [
        "[resourceId('Microsoft.Network/networkSecurityGroups', variables('Subnet NSG Name'))]"
      ]
    },
    {
      "apiVersion": "2015-06-15",
      "name": "[variables('Subnet NSG Name')]",
      "type": "Microsoft.Network/networkSecurityGroups",
      "location": "[resourceGroup().location]",
      "tags": {},
      "properties": {
        "securityRules": [
          {
            "name": "Allow-HTTP-From-Internet",
            "properties": {
              "protocol": "Tcp",
              "sourcePortRange": "*",
              "destinationPortRange": "80",
              "sourceAddressPrefix": "Internet",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 100,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-RDP-From-Everywhere",
            "properties": {
              "protocol": "Tcp",
              "sourcePortRange": "*",
              "destinationPortRange": "3389",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 150,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-Health-Monitoring",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "AzureLoadBalancer",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 200,
              "direction": "Inbound"
            }
          },
          {
            "name": "Disallow-everything-else-Inbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 300,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-to-VNet",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "VirtualNetwork",
              "access": "Allow",
              "priority": 100,
              "direction": "Outbound"
            }
          },
          {
            "name": "Disallow-everything-else-Outbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 200,
              "direction": "Outbound"
            }
          }
        ],
        "subnets": []
      }
    },
    {
      "apiVersion": "2015-06-15",
      "name": "[variables('VM NSG Name')]",
      "type": "Microsoft.Network/networkSecurityGroups",
      "location": "[resourceGroup().location]",
      "tags": {},
      "properties": {
        "securityRules": [
          {
            "name": "Allow-HTTP-From-Internet",
            "properties": {
              "protocol": "Tcp",
              "sourcePortRange": "*",
              "destinationPortRange": "80",
              "sourceAddressPrefix": "Internet",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 100,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-Health-Monitoring",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "AzureLoadBalancer",
              "destinationAddressPrefix": "*",
              "access": "Allow",
              "priority": 200,
              "direction": "Inbound"
            }
          },
          {
            "name": "Disallow-everything-else-Inbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 300,
              "direction": "Inbound"
            }
          },
          {
            "name": "Allow-to-VNet",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "VirtualNetwork",
              "access": "Allow",
              "priority": 100,
              "direction": "Outbound"
            }
          },
          {
            "name": "Disallow-everything-else-Outbound",
            "properties": {
              "protocol": "*",
              "sourcePortRange": "*",
              "destinationPortRange": "*",
              "sourceAddressPrefix": "*",
              "destinationAddressPrefix": "*",
              "access": "Deny",
              "priority": 200,
              "direction": "Outbound"
            }
          }
        ],
        "subnets": []
      }
    },
    {
      "type": "Microsoft.Network/networkInterfaces",
      "name": "[variables('NIC Name')]",
      "apiVersion": "2016-03-30",
      "location": "[resourceGroup().location]",
      "properties": {
        "ipConfigurations": [
          {
            "name": "ipconfig",
            "properties": {
              "privateIPAllocationMethod": "Dynamic",
              "subnet": {
                "id": "[concat(resourceId('Microsoft.Network/virtualNetworks', variables('VNet Name')), '/subnets/default')]"
              },
              "loadBalancerBackendAddressPools": [
                {
                  "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/backendAddressPools/', variables('Address Pool Name'))]"
                }
              ],
              "loadBalancerInboundNatRules": [
                {
                  "id": "[concat(resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name')), '/inboundNatRules/', variables('RDP NAT Rule Name'))]"
                }
              ]
            }
          }
        ],
        "dnsSettings": {
          "dnsServers": []
        },
        "enableIPForwarding": false,
        "networkSecurityGroup": {
          "id": "[resourceId('Microsoft.Network/networkSecurityGroups', variables('VM NSG Name'))]"
        }
      },
      "resources": [],
      "dependsOn": [
        "[resourceId('Microsoft.Network/virtualNetworks', variables('VNet Name'))]",
        "[resourceId('Microsoft.Network/loadBalancers', variables('Public LB Name'))]"
      ]
    },
    {
      "type": "Microsoft.Compute/virtualMachines",
      "name": "[variables('VM Name')]",
      "apiVersion": "2015-06-15",
      "location": "[resourceGroup().location]",
      "properties": {
        "hardwareProfile": {
          "vmSize": "[parameters('VM Size')]"
        },
        "storageProfile": {
          "imageReference": {
            "publisher": "MicrosoftWindowsServer",
            "offer": "WindowsServer",
            "sku": "2012-R2-Datacenter",
            "version": "latest"
          },
          "osDisk": {
            "name": "[variables('VM Name')]",
            "createOption": "FromImage",
            "vhd": {
              "uri": "[concat('https', '://', parameters('Disk Storage Account Name'), '.blob.core.windows.net', concat('/', variables('Vhds Container Name'),'/', variables('VM Name'), '-os-disk.vhd'))]"
            },
            "caching": "ReadWrite"
          },
          "dataDisks": []
        },
        "osProfile": {
          "computerName": "[variables('VM Name')]",
          "adminUsername": "[parameters('VM Admin User Name')]",
          "windowsConfiguration": {
            "provisionVMAgent": true,
            "enableAutomaticUpdates": true
          },
          "secrets": [],
          "adminPassword": "[parameters('VM Admin Password')]"
        },
        "networkProfile": {
          "networkInterfaces": [
            {
              "id": "[resourceId('Microsoft.Network/networkInterfaces', concat(variables('NIC Name')))]"
            }
          ]
        }
      },
      "resources": [],
      "dependsOn": [
        "[resourceId('Microsoft.Storage/storageAccounts', parameters('Disk Storage Account Name'))]",
        "[resourceId('Microsoft.Network/networkInterfaces', variables('NIC Name'))]"
      ]
    },
    {
      "type": "Microsoft.Storage/storageAccounts",
      "name": "[parameters('Disk Storage Account Name')]",
      "sku": {
        "name": "Premium_LRS",
        "tier": "Premium"
      },
      "kind": "Storage",
      "apiVersion": "2016-01-01",
      "location": "[resourceGroup().location]",
      "properties": {},
      "resources": [],
      "dependsOn": []
    },
    {
      "type": "Microsoft.Storage/storageAccounts",
      "name": "[parameters('Log Storage Account Name')]",
      "sku": {
        "name": "Standard_LRS",
        "tier": "standard"
      },
      "kind": "Storage",
      "apiVersion": "2016-01-01",
      "location": "[resourceGroup().location]",
      "properties": {},
      "resources": [],
      "dependsOn": []
    }
  ]
}

The sample has one VM sitting in a subnet protected by an NSG.  The VM’s NIC is also protected by an NSG, to make our life complicated (as we too often do).  The VM is exposed through a load-balanced public IP, and RDP is enabled via NAT rules on the Load Balancer.

The VM’s disk lives in a Premium Storage account, but the sample also creates a standard storage account to store the logs.

The Problem

The problem we are going to track down using Diagnostic Logs is that the subnet’s NSG lets RDP in via the “Allow-RDP-From-Everywhere” rule while the NIC’s NSG doesn’t:  there, RDP traffic gets blocked, like everything else, by the “Disallow-everything-else-Inbound” rule.

In practice, you’ll likely have something more complicated going on, maybe some IP filtering, etc.  But the principles remain the same.

Enabling Diagnostic Logs

I couldn’t enable the Diagnostic Logs via the ARM template, as that isn’t possible yet.  We can do it via the Portal or PowerShell.

I’ll illustrate the Portal here:  since this is for troubleshooting, chances are you won’t automate it.
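That said, if you do want to script it, a sketch along these lines should work with the AzureRM cmdlets (the resource group and NSG names are the sample’s; the log storage account name is your template parameter):

    # Sketch:  enable the Rule Counter logs on the subnet NSG (AzureRM module).
    # "NSG" / "SubnetNSG" come from the sample;  the log account name is yours.
    $nsg = Get-AzureRmNetworkSecurityGroup -Name "SubnetNSG" -ResourceGroupName "NSG"
    $logAccount = Get-AzureRmStorageAccount -ResourceGroupName "NSG" -Name "<Log Storage Account Name>"

    Set-AzureRmDiagnosticSetting -ResourceId $nsg.Id `
                                 -StorageAccountId $logAccount.Id `
                                 -Enabled $true `
                                 -Categories NetworkSecurityGroupRuleCounter

The same call with the VM NSG’s resource ID covers the second NSG.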

I’ve covered Azure Monitor in a previous article.  We’ve seen that different providers expose different schemas.

NSGs expose two categories of Diagnostic Logs:  Event and Rule Counter.  We’re going to use Rule Counter only.

Rule Counter will give us a count of how many times a given rule was triggered for a given target (MAC address / IP).  Again, if we have lots of traffic flying around, that won’t be super useful.  This is why I recommend isolating the network (or recreating an isolated one) in order to troubleshoot.

We’ll start with the subnet NSG.

image

Scrolling all the way down in the NSG pane’s left-hand menu, we select Diagnostics logs.

image

The pane should look as follows, since no diagnostics are enabled yet.  Let’s click on Turn on diagnostics.

image

We then turn it on.

image

For simplicity here, we’re going to use the Archive to a storage account option.

image

Next, we configure the storage account the logs will be sent to.

image

For that, we select the standard account created by the template, or whichever storage account you fancy.  Diagnostic Logs will create a blob container in the selected account for each category.  The names are predefined (you can’t choose them).

We select the NetworkSecurityGroupRuleCounter category.

image

And finally we hit the save button on the pane.

We’ll do the same thing with the VM NSG.

image

Creating logs

Now we are going to try to get through to our VM.  I’ll describe how to do that with the sample I gave, but if you are troubleshooting something, just try the faulty connection.

We’re going to try to RDP to the public IP.  First we need the public IP domain name.  So in the resource group:

image

At the top of the pane we’ll find the DNS name that we can copy.

image

We can then paste it in an RDP window.

image

Trying to connect should fail, and it should leave traces in the logs for us to analyse.
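As an aside, that lookup is scriptable too.  A sketch, where the public IP resource name “PublicIP” is hypothetical (use the one from your template):

    # Sketch:  fetch the public IP's FQDN and fire up the RDP client with it.
    $pip = Get-AzureRmPublicIpAddress -Name "PublicIP" -ResourceGroupName "NSG"
    $fqdn = $pip.DnsSettings.Fqdn
    mstsc "/v:$fqdn"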

Analysis

We’ll have to wait 5-10 minutes for the logs to land in the storage account, as this is done asynchronously.

Actually, a way to make sure we get clean logs is to delete the blob container and then try the RDP connection again.  The blob container should reappear after 5-10 minutes.
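That cleanup is scriptable as well.  A sketch, assuming the log account name from the template (the shape of the key-retrieval result can vary between AzureRM versions):

    # Sketch:  wipe the rule-counter container so the next logs start clean.
    $key = (Get-AzureRmStorageAccountKey -ResourceGroupName "NSG" -Name "<Log Storage Account Name>")[0].Value
    $ctx = New-AzureStorageContext -StorageAccountName "<Log Storage Account Name>" -StorageAccountKey $key
    Remove-AzureStorageContainer -Name "insights-logs-networksecuritygrouprulecounter" -Context $ctx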

To read the logs from the storage account, we need some tool.  I use Microsoft Azure Storage Explorer.

image

The blob container is called insights-logs-networksecuritygrouprulecounter.

The logs are buried inside a deep hierarchy, which allows diagnostic logs from all our NSGs to accumulate there over time.

Basically, under resourceId%3D / SUBSCRIPTIONS / <Your subscription ID> / RESOURCEGROUPS / NSG / PROVIDERS / MICROSOFT.NETWORK / NETWORKSECURITYGROUPS /, we’ll see two folders:  SUBNETNSG & VMNSG.  Those are our two NSGs.

If we dig under those two folders, we should find one file (or more if you’ve waited for a while).

Let’s copy those files somewhere, with appropriate naming, to analyse them.
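If you’d rather skip Storage Explorer for this step, here’s a sketch that pulls everything down (the local folder is arbitrary; again, the key-retrieval shape can vary between AzureRM versions):

    # Sketch:  download every rule-counter log blob to a local folder.
    $container = "insights-logs-networksecuritygrouprulecounter"
    $key = (Get-AzureRmStorageAccountKey -ResourceGroupName "NSG" -Name "<Log Storage Account Name>")[0].Value
    $ctx = New-AzureStorageContext -StorageAccountName "<Log Storage Account Name>" -StorageAccountKey $key

    New-Item -ItemType Directory -Force "C:\Temp\NsgLogs" | Out-Null
    Get-AzureStorageBlob -Container $container -Context $ctx | ForEach-Object {
        # The blob name encodes NSG, date & hour;  flatten it into a file name.
        $destination = Join-Path "C:\Temp\NsgLogs" ($_.Name -replace '/', '_')
        Get-AzureStorageBlobContent -Blob $_.Name -Container $container `
                                    -Destination $destination -Context $ctx | Out-Null
    }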

Preferably, use a viewer / editor that understands JSON (I use Visual Studio).  If you use notepad…  you’re going to have fun.

If we look at the subnet NSG logs first and search for “RDP”, we’ll find this entry:

    {
      "time": "2017-01-09T11:46:44.9090000Z",
      "systemId": "...",
      "category": "NetworkSecurityGroupRuleCounter",
      "resourceId": ".../RESOURCEGROUPS/NSG/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/SUBNETNSG",
      "operationName": "NetworkSecurityGroupCounters",
      "properties": {
        "vnetResourceGuid": "{50C7B76A-4B8F-481A-8029-73569E5C7D87}",
        "subnetPrefix": "10.0.1.0/24",
        "macAddress": "00-0D-3A-00-B6-B5",
        "primaryIPv4Address": "10.0.1.4",
        "ruleName": "UserRule_Allow-RDP-From-Everywhere",
        "direction": "In",
        "type": "allow",
        "matchedConnections": 0
      }
    },

The most interesting part is the matchedConnections property, which is zero here because no connection was actually established:  the rule would have allowed RDP in, but the traffic got blocked further down, at the NIC.

If we look in the VM logs, we’ll find this:

    {
      "time": "2017-01-09T11:46:44.9110000Z",
      "systemId": "...",
      "category": "NetworkSecurityGroupRuleCounter",
      "resourceId": ".../RESOURCEGROUPS/NSG/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/VMNSG",
      "operationName": "NetworkSecurityGroupCounters",
      "properties": {
        "vnetResourceGuid": "{50C7B76A-4B8F-481A-8029-73569E5C7D87}",
        "subnetPrefix": "10.0.1.0/24",
        "macAddress": "00-0D-3A-00-B6-B5",
        "primaryIPv4Address": "10.0.1.4",
        "ruleName": "UserRule_Disallow-everything-else-Inbound",
        "direction": "In",
        "type": "block",
        "matchedConnections": 2
      }
    },

Where matchedConnections is 2 (because I tried twice).

So the logs tell us where the traffic went.

From here, we could wonder why the traffic hit that rule, look for a higher-priority rule that allows RDP in, find none, and conclude that’s our problem.
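With only two NSGs and a few minutes of logs, reading the files by hand is fine;  with more, a little script helps.  A sketch, assuming each downloaded blob holds a top-level records array like the entries above, that prints every rule that actually matched connections:

    # Sketch:  list the rules with a non-zero matchedConnections, per NSG.
    Get-ChildItem "C:\Temp\NsgLogs" -Filter *.json | ForEach-Object {
        (Get-Content $_.FullName -Raw | ConvertFrom-Json).records
    } | Where-Object { $_.properties.matchedConnections -gt 0 } |
        Select-Object time,
                      @{ n = "nsg";     e = { ($_.resourceId -split '/')[-1] } },
                      @{ n = "rule";    e = { $_.properties.ruleName } },
                      @{ n = "type";    e = { $_.properties.type } },
                      @{ n = "matches"; e = { $_.properties.matchedConnections } } |
        Format-Table -AutoSize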

Trial & Error

If the logs are not helping you, the last resort is to modify the NSG until you understand what is going on.

A way to do this is to create an “allow everything in from anywhere” rule and give it maximum priority (in NSG terms, that means the lowest priority number; 100 is the minimum allowed), as sketched below.
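A sketch of that with the AzureRM cmdlets, where the rule name “Probe-Allow-All” is hypothetical (pick an inbound priority number that isn’t already taken in your NSG):

    # Sketch:  add a temporary "allow everything in from anywhere" probe rule.
    $nsg = Get-AzureRmNetworkSecurityGroup -Name "VMNSG" -ResourceGroupName "NSG"

    Add-AzureRmNetworkSecurityRuleConfig -NetworkSecurityGroup $nsg `
                                         -Name "Probe-Allow-All" `
                                         -Access Allow `
                                         -Protocol "*" `
                                         -Direction Inbound `
                                         -Priority 100 `
                                         -SourceAddressPrefix "*" `
                                         -SourcePortRange "*" `
                                         -DestinationAddressPrefix "*" `
                                         -DestinationPortRange "*" | Out-Null

    # Nothing is applied until we push the modified NSG back.
    Set-AzureRmNetworkSecurityGroup -NetworkSecurityGroup $nsg | Out-Null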

If traffic still doesn’t get in, your problem lies somewhere other than the NSG, so go back to the previous steps.

If traffic goes in, good.  Move that allow-everything rule down until you find which rule is blocking you.  You may have a lot of rules, in which case I recommend a dichotomic (binary) search, sketched below:  put your allow-everything rule in the middle of your “real rules”; if traffic passes, move it to the middle of the bottom half, otherwise to the middle of the top half, and so on.  This way, you only need log(N) steps, where N is your number of rules.
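Here is a rough, interactive sketch of that search.  It assumes the probe rule from the previous sketch exists and that there is a free priority number right after each real rule (both are assumptions, not guarantees):

    # Sketch:  dichotomic search for the blocking rule, using the probe rule.
    $nsg = Get-AzureRmNetworkSecurityGroup -Name "VMNSG" -ResourceGroupName "NSG"
    $probe = $nsg.SecurityRules | Where-Object { $_.Name -eq "Probe-Allow-All" }

    # Real inbound rules, in evaluation order (lowest priority number first).
    $rules = @($nsg.SecurityRules |
        Where-Object { $_.Direction -eq "Inbound" -and $_.Name -ne "Probe-Allow-All" } |
        Sort-Object Priority)

    # Invariant:  the blocking rule's index is within [low, high].
    $low = 0
    $high = $rules.Count - 1
    while ($low -lt $high) {
        $mid = [int][Math]::Floor(($low + $high) / 2)

        # Place the probe right after rule $mid (assumes that number is free).
        $probe.Priority = $rules[$mid].Priority + 1
        Set-AzureRmNetworkSecurityGroup -NetworkSecurityGroup $nsg | Out-Null

        $answer = Read-Host "Probe sits below '$($rules[$mid].Name)'.  Does traffic pass? (y/n)"
        if ($answer -eq "y") {
            $low = $mid + 1    # rules 0..mid didn't block;  look further down
        } else {
            $high = $mid       # the blocker is rule $mid or one above it
        }
    }
    Write-Host "Blocking rule:  $($rules[$low].Name) (priority $($rules[$low].Priority))"

Remember to remove the probe rule once you’re done.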

Summary

Troubleshooting NSGs can be difficult, but here I highlighted a basic methodology to find your way around.

Diagnostic Logs give us insight into what is really going on, although they can be tricky to work with.

In general, as with every debugging experience, just apply the principle of Sherlock Holmes:

Eliminate the impossible.  Whatever remains, however improbable, must be the truth.

In terms of debugging, that means remove all the noise, all the fat and then some meat, until what remains is so simple that the truth will hit you.