AKS Auto Scaler with ARM Template


Azure Kubernetes Service (AKS) Auto Scaler is finally out in public preview!

The online documentation does a great job of getting us started. In this article, I wanted to take it a little further with two things: first, by showing how to use ARM templates to deploy an AKS cluster with the Auto Scaler enabled; second, by kicking in the autoscaler with a simple deployment.

Auto scaling is useful to size an AKS cluster on demand. Scenarios range from seasonal changes to daily changes to simply running variable workloads on a cluster.

It goes without saying: the code is on GitHub.

Scaling

There is always a little confusion about auto scaling in the cloud, so let’s clear it up.

There are two ways of scaling: by pods and by nodes. The former is referred to as the horizontal pod autoscaler and is native to the Kubernetes platform. The latter is specific to cloud providers.

The horizontal pod autoscaler is a Kubernetes controller (like a replica set or a deployment). Instead of having a fixed replica count, the count varies with a metric (e.g. CPU utilization). When the pods get too “hot”, the autoscaler increases the number of pods. When the pods get too “cold”, it decreases the number of pods.
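
For illustration, here is a minimal horizontal pod autoscaler manifest; the deployment name and thresholds are hypothetical, simply showing the shape of the resource:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: demo-hpa
spec:
  # Scale the (hypothetical) demo-deploy deployment
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-deploy
  # Keep between 2 and 10 replicas, targeting 60% CPU utilization
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 60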

Increasing the number of pods in a fixed cluster has limitations of course, since the compute is fixed. This is where something like the AKS Auto Scaler comes in: it increases or decreases the number of nodes depending on demand. Instead of monitoring a metric, it simply looks at the queue of pending pods that can’t be scheduled on the cluster.

Preview registration

At the time of this writing (March 2019), the Auto Scaler is in Public Preview. Before we can deploy it on AKS, we need to enable a feature flag on our subscription.

The online documentation explains the steps.
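
For reference, here is a sketch of those steps using the Azure CLI. The feature flag name used here (VMSSPreview) is the one in use at the time of writing; the documentation remains the authoritative source:

# Register the preview feature flag on the subscription
az feature register --namespace Microsoft.ContainerService --name VMSSPreview

# Check the registration state; wait until it shows "Registered"
az feature show --namespace Microsoft.ContainerService --name VMSSPreview --query properties.state

# Propagate the registration to the resource provider
az provider register --namespace Microsoft.ContainerService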

ARM template

The following button allows us to deploy our cluster:

Deploy button

This deployment is based on our article on the Kubenet networking plugin. As such, it deploys the cluster within a custom Virtual Network using Kubenet (as opposed to Azure CNI). As we’ll see, we could easily alter any deployment to accommodate the auto scaler; this was simply our base configuration.

The template deploys a cluster with Kubernetes (orchestrator) version 1.12.6. That version might be retired in the near future, which would break the template; a simple change of that version would fix it.
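
In the template, that version lives in the managed cluster’s properties. The snippet below is abbreviated, but updating that single value is all it takes:

"properties": {
    "kubernetesVersion": "1.12.6",
    ...
}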

The template requires the following parameters:

| Parameter Name | Description |
|---|---|
| DNS Prefix | Domain name for the cluster; this needs to be unique within a region |
| Service Principal App ID | Application ID of the cluster's service principal |
| Service Principal Object ID | Object ID of the same principal; this is used for a role assignment giving that principal the privilege to alter the virtual network |
| Service Principal Secret | Secret of the service principal |

The ARM template is on GitHub. The key section is the following:

{
    "type": "Microsoft.ContainerService/managedClusters",
    "name": "cluster",
    "apiVersion": "2018-08-01-preview",
    "location": "[resourceGroup().location]",
    "dependsOn": [
        "[resourceId('Microsoft.Network/virtualNetworks', variables('VNET Name'))]"
    ],
    "properties": {
        ...
        "agentPoolProfiles": [
            {
                "name": "agentpool",
                "count": "[variables('instance count')]",
                "vmSize": "[variables('VM Size')]",
                "vnetSubnetID": "[resourceId('Microsoft.Network/virtualNetworks/subnets', variables('VNET Name'), 'aks')]",
                "maxPods": 30,
                "osType": "Linux",
                "storageProfile": "ManagedDisks",
                "enableAutoScaling": true,
                "minCount": 1,
                "maxCount": 5,
                "type": "VirtualMachineScaleSets"
            }
        ],
        ...
    }
}

Key elements are:

| Element | Description |
|---|---|
| "apiVersion": "2018-08-01-preview" | This refers to a preview API; without it, the following elements won't be understood by the ARM provider. |
| "count": "[variables('instance count')]" | This isn't a new configuration. It usually refers to the static size of the cluster; here it is the starting size. |
| "minCount": 1 | The minimum size of the cluster. The autoscaler won't decrease the number of nodes further once this size is reached. |
| "maxCount": 5 | Similar, but for the maximum size. |
| "type": "VirtualMachineScaleSets" | This is required for auto scaling. |

The last element, i.e. "type": "VirtualMachineScaleSets", is a key change in AKS architecture enabling auto scaling. So far, AKS has been implemented with individual Virtual Machines. This new feature allows AKS to be implemented with an Azure VM Scale Set.

A scale set allows VMs to be managed as a “set”: the number of VMs is simply a parameter of the set. Understandably, it is easier to implement auto scaling in AKS on top of a VM Scale Set than on individual VMs.
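
As an aside, this is what makes resizing a one-liner. A hypothetical example with the Azure CLI (resource group and scale set names are placeholders; we wouldn’t do this on an AKS-managed scale set, since the autoscaler owns the count):

# Resize a scale set to 5 instances
az vmss scale --resource-group my-resource-group --name my-scale-set --new-capacity 5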

Looking at managed resource group

If we look at the corresponding managed resource group (named MC_<resource group>_<cluster name>_<region>), we can see a VM Scale Set present:

Resources in managed resource group

If we “open” the VM Scale Set, we can see it currently has 3 instances.

VM Scale Set overview

We could also go to the Scaling pane and look at the scaling history. This is normally where we would change the number of instances of a scale set, but in our case we let AKS handle it.
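
The same information is available from the command line, for instance (the managed resource group name is a placeholder; the scale set name matches the node names we’ll see below):

# List the instances of the scale set in the managed resource group
az vmss list-instances --resource-group MC_my-group_my-cluster_eastus --name aks-agentpool-38816970-vmss --output table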

Kicking the auto scaler

Now, let’s test the auto scaler.

We are going to do this with a Kubernetes deployment of 20 replicas.

In order to force Kubernetes to run out of resources, we configured our pods to request more memory than they need. This is done at lines 20-26:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-deploy
  labels:
    app: demo-app
spec:
  replicas: 20
  selector:
    matchLabels:
      app: demo-app
  template:
      metadata:
        labels:
          app: demo-app
      spec:
        containers:
        - name: myapi
          image: vplauzon/get-started:part2-no-redis
          resources:
            requests:
              memory: "1.5G"
              cpu: "250m"
            limits:
              memory: "2G"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
          ports:
          - containerPort: 80

We can deploy it using the command line:

kubectl apply -f https://raw.githubusercontent.com/vplauzon/aks/master/aks-auto-scaler/deployment.yaml

We can then check on the status of the deployment:

$ kubectl get pods -o wide

NAME                           READY   STATUS              RESTARTS   AGE   IP            NODE                                NOMINATED NODE
demo-deploy-64567bf9df-49k9b   1/1     Running             0          22s   172.16.0.42   aks-agentpool-38816970-vmss000001   <none>
demo-deploy-64567bf9df-5pjf6   1/1     Running             0          22s   172.16.0.96   aks-agentpool-38816970-vmss000002   <none>
demo-deploy-64567bf9df-5r868   1/1     Running             0          22s   172.16.0.69   aks-agentpool-38816970-vmss000002   <none>
demo-deploy-64567bf9df-8cnnv   0/1     ContainerCreating   0          22s   <none>        aks-agentpool-38816970-vmss000000   <none>
demo-deploy-64567bf9df-8fdkf   0/1     Pending             0          22s   <none>        <none>                              <none>
demo-deploy-64567bf9df-c5f5d   0/1     Pending             0          22s   <none>        <none>                              <none>
demo-deploy-64567bf9df-cf8xx   1/1     Running             0          22s   172.16.0.68   aks-agentpool-38816970-vmss000002   <none>
demo-deploy-64567bf9df-czh8n   0/1     Pending             0          22s   <none>        <none>                              <none>
demo-deploy-64567bf9df-dq5nh   0/1     Pending             0          22s   <none>        <none>                              <none>
demo-deploy-64567bf9df-dzn5p   0/1     Pending             0          22s   <none>        <none>                              <none>
demo-deploy-64567bf9df-g2hx2   0/1     Pending             0          22s   <none>        <none>                              <none>
demo-deploy-64567bf9df-g52j7   0/1     ContainerCreating   0          22s   <none>        aks-agentpool-38816970-vmss000000   <none>
demo-deploy-64567bf9df-gjbhj   1/1     Running             0          22s   172.16.0.75   aks-agentpool-38816970-vmss000002   <none>
demo-deploy-64567bf9df-k4fkv   0/1     Pending             0          22s   <none>        <none>                              <none>
demo-deploy-64567bf9df-kxzr8   0/1     Pending             0          22s   <none>        <none>                              <none>
demo-deploy-64567bf9df-ljmqv   0/1     Pending             0          22s   <none>        <none>                              <none>
demo-deploy-64567bf9df-lm894   0/1     Pending             0          22s   <none>        <none>                              <none>
demo-deploy-64567bf9df-m5q6t   1/1     Running             0          22s   172.16.0.39   aks-agentpool-38816970-vmss000001   <none>
demo-deploy-64567bf9df-p2qhx   1/1     Running             0          22s   172.16.0.37   aks-agentpool-38816970-vmss000001   <none>
demo-deploy-64567bf9df-qmcr6   0/1     ContainerCreating   0          22s   <none>        aks-agentpool-38816970-vmss000000   <none>
demo-deploy-64567bf9df-sbnvk   0/1     ContainerCreating   0          22s   <none>        aks-agentpool-38816970-vmss000000   <none>

We see that a few pods got scheduled on different nodes while many others are in Pending status, since there is no node that can accommodate them.
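
If we want to see why a pod is pending, and whether the autoscaler noticed, we can describe one of the pending pods; its events typically show the scheduler’s failure reason (insufficient memory) and, shortly after, a scale-up trigger from the cluster autoscaler:

# Inspect the events of one of the pending pods
kubectl describe pod demo-deploy-64567bf9df-8fdkf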

We can then refresh our VM Scale Set view in the Portal and see it is now scaling from 3 to 5 instances:

VM Scale Set scaling out

Once the scaling operation is completed, we can look again at the pods’ status:

$ kubectl get pods -o wide
NAME                           READY   STATUS    RESTARTS   AGE     IP             NODE                                NOMINATED NODE
demo-deploy-64567bf9df-49k9b   1/1     Running   0          9m15s   172.16.0.42    aks-agentpool-38816970-vmss000001   <none>
demo-deploy-64567bf9df-5pjf6   1/1     Running   0          9m15s   172.16.0.96    aks-agentpool-38816970-vmss000002   <none>
demo-deploy-64567bf9df-5r868   1/1     Running   0          9m15s   172.16.0.69    aks-agentpool-38816970-vmss000002   <none>
demo-deploy-64567bf9df-8cnnv   1/1     Running   0          9m15s   172.16.0.34    aks-agentpool-38816970-vmss000000   <none>
demo-deploy-64567bf9df-8fdkf   1/1     Running   0          9m15s   172.16.0.149   aks-agentpool-38816970-vmss000004   <none>
demo-deploy-64567bf9df-cf8xx   1/1     Running   0          9m15s   172.16.0.68    aks-agentpool-38816970-vmss000002   <none>
demo-deploy-64567bf9df-czh8n   1/1     Running   0          9m15s   172.16.0.104   aks-agentpool-38816970-vmss000003   <none>
demo-deploy-64567bf9df-dzn5p   1/1     Running   0          9m15s   172.16.0.123   aks-agentpool-38816970-vmss000003   <none>
demo-deploy-64567bf9df-g52j7   1/1     Running   0          9m15s   172.16.0.17    aks-agentpool-38816970-vmss000000   <none>
demo-deploy-64567bf9df-gjbhj   1/1     Running   0          9m15s   172.16.0.75    aks-agentpool-38816970-vmss000002   <none>
demo-deploy-64567bf9df-k4fkv   1/1     Running   0          9m15s   172.16.0.133   aks-agentpool-38816970-vmss000004   <none>
demo-deploy-64567bf9df-kxzr8   1/1     Running   0          9m15s   172.16.0.115   aks-agentpool-38816970-vmss000003   <none>
demo-deploy-64567bf9df-lm894   1/1     Running   0          9m15s   172.16.0.135   aks-agentpool-38816970-vmss000004   <none>
demo-deploy-64567bf9df-m5q6t   1/1     Running   0          9m15s   172.16.0.39    aks-agentpool-38816970-vmss000001   <none>
demo-deploy-64567bf9df-mpxgb   0/1     Pending   0          54s     <none>         <none>                              <none>
demo-deploy-64567bf9df-p2qhx   1/1     Running   0          9m15s   172.16.0.37    aks-agentpool-38816970-vmss000001   <none>
demo-deploy-64567bf9df-qmcr6   1/1     Running   0          9m15s   172.16.0.9     aks-agentpool-38816970-vmss000000   <none>
demo-deploy-64567bf9df-sbnvk   1/1     Running   0          9m15s   172.16.0.5     aks-agentpool-38816970-vmss000000   <none>
demo-deploy-64567bf9df-wj2pz   1/1     Running   0          9m15s   172.16.0.139   aks-agentpool-38816970-vmss000004   <none>
demo-deploy-64567bf9df-x7tqj   1/1     Running   0          9m15s   172.16.0.108   aks-agentpool-38816970-vmss000003   <none>

We can see that most pods managed to get scheduled on a node.

There is one pod still in a pending state. Despite this, the Auto Scaler won’t scale the cluster to 6 nodes, since we set 5 as the maximum number of nodes.
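
A quick way to keep an eye on that remaining pod (or any pending pod) is a field selector:

# List only the pods that are still pending
kubectl get pods --field-selector status.phase=Pending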

Summary

We looked at how to deploy an Auto Scaler enabled AKS cluster using an ARM template.

This will likely change once the Public Preview period is over.

We also looked at the VM Scale Set implementation and how an auto scaling event is triggered.

