Network Access Control on an HDInsight Cluster


In this post, I’m going to show how you can use an Azure Virtual Network with a Network Security Group to control access (at the network level) to an HDInsight cluster.

For a primer on both those technologies, please refer to my post “Using Network Security Groups (NSG) to secure network access to an environment”.

The main caveat I would add is that in that post I was using a Resource Manager virtual network (also known as v2), while HDInsight, at the time of this writing (late January 2016), supports only v1, or classic, Virtual Network.  Everything else is the same, except I won’t be able to turn it into an ARM template at the end.

UPDATE (05-02-2016):  Classic (v1) Virtual Network is required for a Windows cluster, which is what I use in this article.  For a Linux cluster, a Resource Manager (v2) Virtual Network is required.  This article remains valid; you only have to create your Virtual Network with the “Resource Manager” option.  See this documentation (that has been updated since I wrote this post) for more details.

The problem

Azure HDInsight is basically a Hadoop cluster as a service.  You can stand up a cluster in minutes and get on with your Big Data jobs.  Better, you can externalize the data so that you can destroy the cluster, stand it up days later and continue with the same data (see my post about how to set up HDInsight to externalize data).

Now, HDInsight is public in nature:  all of its endpoints are open to the internet.  Each endpoint uses authentication, but for some customers that is not enough.

That is understandable since most companies use Hadoop to analyse business data, sometimes sensitive data.  Tighter control over access is therefore often mandated.

Virtual Network

The obvious way to get more control over network access is to start by attaching a Virtual Network to your HDInsight cluster.

Let’s first create a Virtual Network.  One of the critical steps is to choose the “Classic” model as opposed to the “Resource Manager” model.

image

Give your v-net a name.  You can leave the address space as the default, i.e. 10.0.0.0/16.  Put it in the same resource group as your HDInsight cluster (for convenience) and in the same region (that’s mandatory).  Then create it.
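To get a feel for what the default address space covers, here is a minimal sketch using Python’s standard ipaddress module.  The 10.0.0.0/16 range comes from the text above; the 10.0.0.0/24 subnet range is only an assumption for illustration, so check the actual range of the default subnet in your own v-net.

```python
import ipaddress

# The default address space mentioned above.
vnet = ipaddress.ip_network("10.0.0.0/16")

# Hypothetical subnet range, used only for illustration.
subnet = ipaddress.ip_network("10.0.0.0/24")

# Total number of addresses in the address space.
print(vnet.num_addresses)

# True when the subnet fits entirely inside the v-net address space.
print(subnet.subnet_of(vnet))

# First and last address of the v-net range.
print(vnet[0], vnet[-1])
```

A /16 leaves plenty of room for the cluster nodes and for any other subnet you might add later.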

You now have a Virtual Network you can put an HDInsight cluster in.

Attaching Virtual Network to cluster

Now this is required to be done at the cluster’s creation time:  you can’t add a virtual Network afterwards (or change its subnet).

So when you create your cluster in the Azure portal, go to the Optional Configuration at the bottom.

image

Then select the Virtual Network box

image

Choose the Virtual Network you just created.  For the subnet, choose the default subnet.

Once you create your cluster, it will behave identically to a cluster without a Virtual Network.  The only way to know there is a Virtual Network attached to it is to look at your cluster’s settings

image

and then look at its properties

image

At the bottom of the blade you should see the Virtual Network GUID.

image

A GUID, salt of the Earth!

VPN Gateway

Now that you have a Virtual Network, you can connect it to your on-premises network and block internet access altogether.

This is probably your most secure option because in many ways it is as if the cluster were running on-premises, with the benefits of the cloud.

I won’t cover VPN Gateway in this post.

Network Security Group

Now that the cluster is in a Virtual Network, we can control access via a Network Security Group.

First, we’ll create a Network Security Group (NSG).

image

which should lead you to

image

Again, make sure you select “Classic” deployment model.

Give it a name and make sure it is in the same resource group and region as your cluster.

Next we’ll attach the newly created NSG to the default subnet of the Virtual Network.  NSGs are independent entities and can be attached to multiple subnets (on potentially multiple Virtual Networks).

Let’s open the virtual network and then open its settings.  From there, open its subnets and select the default one (or whichever you put your cluster in).

image

From there, select Network security group and then select the NSG you just created.

This binds the subnet to the NSG.

Configuring NSG rules

At this point your cluster shouldn’t be accessible.  This is because, by default, NSGs block most inbound traffic.  To see that, open your NSG and then, in the settings, open the Inbound security rules.

image

You should have no rules in there.  Click Default Rules at the top to display the default rules.  Basically, the NSG allows connections from within the virtual network and connections from the Azure Load Balancer (this is required so that VMs can be monitored internally), and denies every other route.
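The way those default rules combine can be modelled in a few lines of Python.  This is only an illustrative sketch, not an Azure API:  the Rule class and the evaluate function are hypothetical, and the rule names and priority numbers are assumptions modelled on what the Default Rules view displays, so check them against your own NSG.  The idea is that rules are evaluated by ascending priority number and the first match decides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int  # lower number = evaluated first
    source: str    # "VirtualNetwork", "AzureLoadBalancer" or "*" for anything
    action: str    # "Allow" or "Deny"


# Assumed default inbound rules (names and priorities are illustrative).
DEFAULT_INBOUND = [
    Rule("AllowVnetInBound", 65000, "VirtualNetwork", "Allow"),
    Rule("AllowAzureLoadBalancerInBound", 65001, "AzureLoadBalancer", "Allow"),
    Rule("DenyAllInBound", 65500, "*", "Deny"),
]


def evaluate(rules, source):
    """Return the first rule, by ascending priority, that matches the source."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.source in ("*", source):
            return rule
    raise LookupError("no rule matched")


for origin in ("VirtualNetwork", "AzureLoadBalancer", "Internet"):
    print(origin, "->", evaluate(DEFAULT_INBOUND, origin).action)
```

In this model, traffic from the internet falls through to the catch-all deny rule, which is why the cluster is unreachable until a rule with a lower priority number is added.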

This means, among other things, that the RDP route (or SSH for a Linux cluster) is denied from the internet.

Similarly, if you look at the outbound rules:

image

Here, routes toward the virtual network & the internet are allowed but nothing else.

So this is very secure as nothing can get in!  Actually, that would be perfect for the scenario where you connect the Virtual Network to your on-premises network (via an Azure VPN Gateway), since connections from your network would then get in.

Now, let’s add a rule to allow traffic coming from your laptop.

First, let’s determine what the IP of your laptop is, by using, for instance, https://www.whatismyip.com/.

Then, let’s go back to the inbound rules of your NSG and let’s add a rule:

  • Name:  Allow laptop
  • Priority:  500 (anything between 100 and 4096 really)
  • Source:  CIDR block
  • Source IP address range:  <the IP of your laptop>/32 (the /32 makes the IP you specify the only enabled IP)
  • Protocol:  TCP
  • Source Port Range:  *
  • Destination:  Any
  • Destination Port Range:  *
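To see why the /32 suffix restricts the rule to a single address, here is a small sketch using Python’s standard ipaddress module.  The address 203.0.113.25 is a placeholder from a range reserved for documentation, and the allowed_by helper is hypothetical; it only mimics the CIDR match, not the NSG itself.

```python
import ipaddress

# Placeholder address from a documentation-reserved range; use your own IP.
laptop_ip = ipaddress.ip_address("203.0.113.25")


def allowed_by(rule_source, candidate):
    """Return True when the candidate IP falls inside the rule's CIDR block."""
    return ipaddress.ip_address(candidate) in ipaddress.ip_network(rule_source)


rule_source = f"{laptop_ip}/32"

# A /32 block contains exactly one address.
print(ipaddress.ip_network(rule_source).num_addresses)

# The laptop matches, its neighbour does not.
print(allowed_by(rule_source, "203.0.113.25"))
print(allowed_by(rule_source, "203.0.113.26"))
```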

Once the rule has been saved and the NSG updated (it usually takes less than a minute), you should be able to access your cluster (e.g. RDP / SSH, HTTPS to dashboard, etc.).

In practice you would specify a larger IP range corresponding to the outbound IPs of your organization (or department).
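If your network team gives you the outbound addresses as a first and last IP rather than a CIDR block, the conversion can be sketched with Python’s standard ipaddress module.  The ranges below are placeholders from documentation-reserved blocks and the cidr_blocks helper is hypothetical.

```python
import ipaddress


def cidr_blocks(first, last):
    """Return the CIDR blocks that exactly cover the range first..last."""
    networks = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(net) for net in networks]


# Placeholder ranges; replace them with your organization's outbound IPs.
print(cidr_blocks("198.51.100.16", "198.51.100.31"))
print(cidr_blocks("203.0.113.0", "203.0.113.255"))
```

A range that is not aligned on a block boundary comes back as several CIDR blocks, in which case you would need one inbound rule per block.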

Now this will open all the ports of your cluster to the specified IP range.  You might be tempted to enable access port by port, but there is some port mapping (for instance for RDP) happening before traffic hits the Virtual Network, which prevents that approach from being effective.

Conclusion

We’ve seen how to lock down an HDInsight cluster:

  • Create a Virtual Network
  • Associate the Virtual Network to your HDInsight cluster at creation time
  • Create a Network Security Group (NSG)
  • Associate the NSG to your cluster subnet
  • Add an access rule to the inbound rules of the NSG

If you tear down your cluster, you can keep the virtual network & associated NSG around.  This way, next time you stand up your cluster, you can simply associate the virtual network and get all the network rules back.

4 thoughts on “Network Access Control on an HDInsight Cluster”

  1. Alex

    With the NSG is it possible to connect directly to the zookeeper servers with phoenix + jdbc or a sql client like squirrel or is the tunnel still required ?

    1. Vincent-Philippe Lauzon Post author

      Hum… you could open the communication port to the public internet traffic (or even limit it by IP range) I suppose. This way you could connect from your tool on your desktop to it.

      The other option, as I believe you mention, is to lock it down for public traffic and use a VPN Gateway to privately connect to the VNet.

