More reliable Replica Sets in AKS - Part 1

Availability is a core architectural attribute that is often sought after.

We’ve taken a look at Azure Managed Kubernetes Cluster (AKS) here.  In this article (part 1), we’re going to experiment and prove that replica sets aren’t “highly available” by default.  In part 2, we’re going to look at how to architect replica sets in Azure AKS in order to make them highly available.

High Availability in Azure

It is quite simple to implement High Availability (HA) in Azure.  Azure Availability Sets and Load Balancers are the main tools.  They ensure a set of VMs remains available (i.e. can process requests).  They guard against two types of events:  unplanned hardware failures and planned maintenance.

A prerequisite for this article is to understand Availability Sets, as well as Update & Fault Domains.  We discussed Availability Sets and those concepts here.

High Availability of Replica Sets

At the time of this writing (end of April 2018), AKS has an SLA based on its underlying virtual machines.  This means the cluster, as a whole, has an SLA.  That SLA won’t necessarily translate to a given replica set (group of containers).

Let’s look at a hypothetical AKS cluster.  The cluster has 4 nodes.  It belongs to an Availability Set with 2 Fault Domains and 3 Update Domains:


We see the availability set spreads the VMs across different domains.  It does so in such a way that if any Fault Domain went down, there would still be VMs left to take the load.  The same goes for Update Domains.  This is how Availability Sets enable High Availability at the VM level.

Now let’s look at replica sets deployed on that cluster:


Replica sets place pods on different cluster nodes (VMs).  We see the ‘triangle’ replica set is highly available:  no single domain could take it all down by going down.  Similarly, the ‘circle’ replica set is highly available.  The ‘star’ replica set isn’t.  If Update Domain 2 went down, it would take down both pods of the ‘star’ replica set.

Our experiments showed that replica sets seem to spread pods across many fault domains.  They aren’t always spread across many update domains though.  This means that replica sets of 2 or more pods should be resilient to hardware failures.  But they won’t necessarily be resilient to planned maintenance.

We are going to show how to reproduce those results.

Why is this a big deal?

The Kubernetes runtime constantly monitors the cluster to make sure the target state is maintained.

When nodes go down, Kubernetes settles back to the desired state.  We could then be tempted to dismiss the placement pattern issue.

So what is the big deal?

It is true that Kubernetes will try to bring back pods and will hence help to achieve high availability.  Nevertheless, we see two problems:

  1. Although trivial containers deploy quickly, non-trivial pods can take a few seconds.  Some pods can even take a minute or two before their readiness probe clears them to receive requests.  During that time, the service would be down.
  2. A cluster being under CPU and/or memory pressure might not be able to reschedule pods when some nodes go down.  An entire service might then stay unavailable until nodes come back online.  In the worst case (reboot), this could take a few minutes.

We need to nuance this issue.

VMs routinely outperform the SLA guarantee.  Most planned maintenance is now memory-preserving.  This means only a brief pause is experienced in the VM’s execution.  Some maintenance operations still require a reboot of the guest VMs though.

Adding to that, a replica set doesn’t land on a single update domain all the time.  A lot of replica sets spread their pods across many domains.  When it does happen, it might not be a problem if the cluster has available resources and containers are quick to start.  But if the planets don’t align, this could create an availability problem.

AKS availability set

When we create an AKS cluster, it creates VMs for us.  They are created as part of a single availability set.  We can see those “behind the scenes” resources by looking at the paired resource group.  Here we created an AKS service named aks2.  It sits inside a resource group also named aks2 in the Canada Central Azure region.  The paired resource group is MC_aks2_aks2_canadacentral.
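Instead of inferring the name, we can query the paired resource group directly.  A minimal sketch, assuming an authenticated Azure CLI and the names used in this article:

```shell
# Query the paired ("node") resource group of the AKS cluster
az aks show --resource-group aks2 --name aks2 \
  --query nodeResourceGroup --output tsv
```

This returns the resource group name (e.g. MC_aks2_aks2_canadacentral), which we can then open in the portal.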


We can open that resource group and look at the resources inside.  If we sort by type, the availability set should come up on top.


The overview of the availability set lists the VM members.  It also displays their fault and update domains.


Here we have a 20-node cluster (we didn’t display all the VMs in the above image).  We can see the VMs are spread across fault domains 0 & 1 and update domains 0, 1 & 2.

Reproducing our experiment

In Kubernetes, a replica set is a group of pods.  A pod is a group of containers, although it often is a single container.  A replica set is often defined in a deployment, as we’ll do in our example.

A replica set defines a target state with a number of replicas.  The Kubernetes runtime works to achieve that target state.  If a pod goes down (e.g. its host VM shuts down), Kubernetes deploys a new pod on another node.

Kubernetes sees Azure VMs as simple nodes.  It is agnostic of which update / fault domain each VM is in.  At least, as of this writing (end of April 2018).

It is quite easy to show.  Let’s run an experiment with our 20-node cluster.

We created the cluster with:

az aks create --resource-group aks2 --name aks2 --kubernetes-version 1.9.6 --node-count 20 --generate-ssh-keys

Our code is available on GitHub.

We start by creating a few deployments.  Each deploys one replica set.  Each replica set uses the same Docker image:  vplauzon/get-started:part2-no-redis.  We introduced that container in a past article.  It is a hello-world Flask app and consumes very few resources.  It won’t create memory or CPU pressure on the cluster.
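For reference, here is a sketch of what one of those deployment files (e.g. dep-a.yaml) might look like.  The apiVersion, names and replica count are assumptions for illustration; the actual files are in the GitHub repository:

```yaml
# Hypothetical sketch of dep-a.yaml (names & replica count assumed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-a
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app-a
  template:
    metadata:
      labels:
        app: app-a
    spec:
      containers:
      - name: web
        image: vplauzon/get-started:part2-no-redis
        ports:
        - containerPort: 80
```

The deployment owns the replica set, which is why deleting the replica set later in this article simply triggers its recreation.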

We are going to run the following code in the shell (also available as a shell script here):

kubectl create -f dep-a.yaml

kubectl create -f dep-b.yaml

kubectl create -f dep-c.yaml

kubectl create -f dep-d.yaml

kubectl create -f dep-e.yaml

We can then look at the pods placement by executing kubectl get pods -o wide.

Here we translate the results in terms of node update and fault domains:

Deployment   Fault Domains   Update Domains
A            {1, 0, 0}       {2, 1, 1}
B            {1, 0}          {2, 0}
C            {1, 0}          {1, 0}
D            {1, 1, 0}       {1, 1, 2}
E            {1, 0}          {2, 2}

As we see, the scheduler does a pretty good job.  The only deployment with availability exposure is ‘E’.  Both of its pods land in update domain ‘2’.

Pod placement is somewhat randomized.  We chose that sequence of deployments because it often leads to the result we wanted to show.
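The translation from pods to domains can also be scripted.  Below is a minimal sketch; the node names, node-to-domain mapping and pod placements are all invented for illustration (real values would come from the availability set overview and from kubectl get pods -o wide):

```shell
# Hypothetical node-to-update-domain mapping, as read off the
# availability set overview (values invented for illustration)
node_domains="aks-node-0 2
aks-node-1 1
aks-node-2 2"

# Hypothetical pod placements (deployment, node), as read off
# `kubectl get pods -o wide`
pod_nodes="app-e aks-node-0
app-e aks-node-2
app-a aks-node-1
app-a aks-node-2"

# Join both lists and flag deployments whose pods all share one update domain
result=$({ echo "$node_domains"; echo "$pod_nodes"; } | awk '
  $1 ~ /^aks-/ { ud[$1] = $2; next }   # node line: remember its update domain
  { doms[$1 " " ud[$2]] = 1 }          # pod line: record (deployment, domain)
  END {
    for (k in doms) { split(k, a, " "); cnt[a[1]]++ }
    for (d in cnt) if (cnt[d] == 1) print d " sits on a single update domain"
  }')
echo "$result"
```

With the sample data above, ‘app-e’ is flagged:  both of its pods sit in update domain 2, even though they are on different nodes.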

An easy way to “reshuffle the deck” here is to delete the replica set ‘e’.  Kubernetes will recreate the replica set and place the pods again.  We can also use this to see how frequent a non-HA configuration is.  Let’s run the following code:

kubectl delete rs -l app=app-e

watch kubectl get pods -o wide
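The trials can also be automated with a small loop.  A rough sketch, assuming kubectl is connected to the cluster (the sleep duration is a guess to let the scheduler finish placing pods):

```shell
for i in $(seq 1 10)
do
  # Delete the replica set; the deployment immediately recreates it
  kubectl delete rs -l app=app-e
  # Give the scheduler time to place the new pods (duration is a guess)
  sleep 60
  # Record where the new pods landed
  kubectl get pods -l app=app-e -o wide
done
```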

If we do that 10 times on our 20-node cluster, we get the following results:

Trial #   Fault Domains   Update Domains
1         {1, 0}          {0, 2}
2         {1, 0}          {2, 2}
3         {0, 1}          {0, 0}
4         {1, 0}          {0, 2}
5         {1, 0}          {2, 2}
6         {0, 1}          {0, 0}
7         {1, 0}          {0, 2}
8         {1, 0}          {2, 2}
9         {0, 1}          {0, 0}
10        {1, 0}          {0, 2}

There is a repeating pattern every three trials, where the same nodes were selected.  2 out of the 3 placements landed on a single update domain.


We tried multiple times to create the same pattern for fault domains.  We deleted many replica sets.  Every time, the replica sets got placed across multiple fault domains.

That is beyond random chance, so we believe AKS does it on purpose.

On the other hand, it is quite easy to have a replica set land on a single update domain.

As we’ll see in the next article, Kubernetes does have access to the fault domain index but not to the update domain.  It is documented behaviour that Kubernetes spreads pods across different “failure domains”.
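We can already peek at that fault domain information.  At the time of this writing, the fault domain surfaces as a node label; a quick sketch, assuming kubectl is connected to the cluster:

```shell
# List nodes with the fault domain ("zone") label as an extra column
kubectl get nodes --label-columns failure-domain.beta.kubernetes.io/zone
```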


We have discussed the details of how Azure implements High Availability for VMs.  We also discussed how replica sets place pods on nodes to be highly available themselves.

We observed that some replica sets aren’t resilient to Azure planned maintenance.  This is because they place their pods on a single update domain.

In the next article, we’ll discuss how we can use Kubernetes to address that.
