Dynamic disks in AKS


Some workloads running on Azure Kubernetes Service (AKS) require persisting state on disk.

In general, I recommend using external PaaS services, e.g. Azure Blob Storage, Azure SQL DB, Azure Cosmos DB, etc. Those services take care of the stateful nature of the workload: they manage high availability (HA), backups, geo-replication, etc.

Persisting state on disks in AKS comes with none of those PaaS benefits. It is like running IaaS: we need to manage backups, high availability and geo-replication ourselves. Those aren't trivial to manage.

In the few cases where I need to persist on a disk, I use Kubernetes Volumes. This is a K8s abstraction where we mount a volume in a pod, but the volume can be defined in different ways. On Azure, the options are Azure Disk and Azure Files, each of which can be provisioned either statically or dynamically.

Typically, I see teams using Azure Disk Dynamic for performant storage owned by a single pod, and Azure File Static for less performant storage shared across many pods. Don't run a database on Azure File, even Premium.

In this article, we'll cover disks. The online documentation is a little misleading, as the example given there doesn't scale to a multi-pod deployment. We'll see how to do that.

As usual, the code is on GitHub.

Deployment with Persistent Volume Claim

The online documentation of Azure Disk Dynamic does a great job of explaining the different concepts at play here: the storage class, the persistent volume claim (PVC) and the persistent volume (PV).

So, let’s go with an example:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-disk
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium
  resources:
    requests:
      storage: 25Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: disk-deploy
  labels:
    app: busy-box-with-disk
spec:
  replicas: 1
  selector:
    matchLabels:
      app: disk-pod
  template:
    metadata:
      labels:
        app: disk-pod
    spec:
      containers:
      - name:  busy
        image: vplauzon/get-started:part2-no-redis
        ports:
        - containerPort: 80
        volumeMounts:
        - name: volume
          mountPath: /buffer
      volumes:
      - name: volume
        persistentVolumeClaim:
          claimName: azure-managed-disk

This YAML file has two resources: a PersistentVolumeClaim and a Deployment. The former defines a claim for a 25 GiB Premium Managed Disk. The latter deploys a pod, with one replica, mounting a volume using that claim.
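
The claim references the managed-premium storage class. We can list the storage classes available in the cluster; on AKS, the built-in ones typically include default (standard Azure Disk) and managed-premium (premium Azure Disk):

kubectl get storageclass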

We can deploy that file using:

kubectl apply -f https://raw.githubusercontent.com/vplauzon/aks/master/dynamic-disks/dynamic-with-deploy.yaml
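
We can then check that the claim gets provisioned and bound:

kubectl get pvc azure-managed-disk

Once the underlying disk is created, the claim's status should move to Bound.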

If we look in the managed resource group, we can see this creates a new disk:

One more disk

Our cluster has three nodes, each with its own OS disk; the new data disk is one more on top of those. Let's open it:

Disk summary

We can see the disk is attached to an agent pool VM. Basically, the disk got attached to the VM running the pod.
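
We can cross-check that by looking at which node the pod landed on:

kubectl get pods -o wide

The NODE column should match the agent VM the disk is attached to in the portal.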

We can also look at the tags:

Tags

We see they correspond to the Persistent Volume Claim we defined. As for the persistent volume (PV) created from that claim, we can query for it:

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS        CLAIM                                                STORAGECLASS      REASON   AGE
pvc-d53e41ee-1810-11e9-bea0-0a58ac1f072d   25Gi       RWO            Delete           Bound         default/azure-managed-disk                           managed-premium            18m
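
To see which Azure disk backs that persistent volume, we can describe it (with the Azure Disk provisioner, the source section includes the disk name and URI):

kubectl describe pv pvc-d53e41ee-1810-11e9-bea0-0a58ac1f072d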

Now, let’s look at the pods:

$ kubectl get pods
NAME                         READY   STATUS    RESTARTS   AGE
disk-deploy-6dfbb5cc-79bqx   1/1     Running   0          1m

We see that, as usual, the deployment created a pod with a unique name.

Scaling to multi-pods

We successfully attached a disk to a pod via a deployment.

Let’s scale that deployment to 4 pods:

kubectl patch deploy disk-deploy -p '{"spec":{"replicas":4}}'
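
Equivalently, kubectl scale does the same thing without a JSON patch:

kubectl scale deployment disk-deploy --replicas=4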

We quickly see there is a problem here: of the three new pods, only one starts:

$ kubectl get pods -o wide
NAME                         READY   STATUS              RESTARTS   AGE   IP            NODE                       NOMINATED NODE
disk-deploy-6dfbb5cc-79bqx   1/1     Running             0          37m   10.244.2.11   aks-agentpool-22115152-2   <none>
disk-deploy-6dfbb5cc-pl9wm   0/1     ContainerCreating   0          23m   <none>        aks-agentpool-22115152-0   <none>
disk-deploy-6dfbb5cc-q89jb   1/1     Running             0          23m   10.244.2.12   aks-agentpool-22115152-2   <none>
disk-deploy-6dfbb5cc-rkmf5   0/1     ContainerCreating   0          23m   <none>        aks-agentpool-22115152-0   <none>

Important observation: only the new pod scheduled on the same node as the original pod could start.

If we look in the portal, we do not see new disks getting created. Let’s look at a failing pod:

$ kubectl describe pod disk-deploy-6dfbb5cc-pl9wm
...
Events:
  Type     Reason              Age                From                               Message
  ----     ------              ----               ----                               -------
  Normal   Scheduled           19m                default-scheduler                  Successfully assigned default/disk-deploy-6dfbb5cc-pl9wm to aks-agentpool-22115152-0
  Warning  FailedAttachVolume  19m                attachdetach-controller            Multi-Attach error for volume "pvc-d53e41ee-1810-11e9-bea0-0a58ac1f072d" Volume is already used by pod(s) disk-deploy-6dfbb5cc-79bqx
  Warning  FailedMount         89s (x8 over 17m)  kubelet, aks-agentpool-22115152-0  Unable to mount volumes for pod "disk-deploy-6dfbb5cc-pl9wm_default(41c47e32-1816-11e9-bea0-0a58ac1f072d)": timeout expired waiting for volumes to attach or mount for pod "default"/"disk-deploy-6dfbb5cc-pl9wm". list of unmounted volumes=[volume]. list of unattached volumes=[volume default-token-gljst]

We see the volume failed to attach.

The reason is that Kubernetes mounts the same Persistent Volume (PV) in each pod: the deployment's pod template references a single claim, so every replica shares it.

An Azure Disk cannot be attached to more than one VM at a time (hence the ReadWriteOnce access mode on the claim). Therefore, as long as pods reside on the same node as the original pod where the disk was attached, it works; pods scheduled on other nodes fail to mount the volume.

This Kubernetes mechanism clearly fails for scenarios where we want independent volumes.

Enter stateful sets

What we used so far were Kubernetes Deployments, which leverage Kubernetes Replica Sets.

To achieve our scenario, we need Kubernetes Stateful Sets. Stateful sets guarantee ordering and uniqueness of pods. A pod's identity survives the pod itself, i.e. a given pod, upon failure, is recreated with the same identity (sticky identity).

A stateful set also manages volumes differently: they are assigned in a deterministic fashion, with one claim per pod.

Let's create something similar to the last example but using a stateful set:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: disk-stateful-set
  labels:
    app: busy-box-with-stateful-set-disk
spec:
  replicas: 5
  selector:
    matchLabels:
      app: stateful-disk-pod
  serviceName:  stateful-disk-set
  template:
    metadata:
      labels:
        app: stateful-disk-pod
    spec:
      containers:
      - name:  busy
        image: vplauzon/get-started:part2-no-redis
        ports:
        - containerPort: 80
        volumeMounts:
        - name: azure-managed-disk-stateful
          mountPath: /buffer
  volumeClaimTemplates:
  - metadata:
      name: azure-managed-disk-stateful
    spec:
      storageClassName: managed-premium
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 25Gi

Let’s deploy that stateful set:

$ kubectl apply -f https://raw.githubusercontent.com/vplauzon/aks/master/dynamic-disks/dynamic-with-stateful.yaml

Now if we look at the pods getting created:

$ kubectl get pods
NAME                         READY   STATUS              RESTARTS   AGE
disk-deploy-6dfbb5cc-79bqx   1/1     Running             0          2h
disk-deploy-6dfbb5cc-pl9wm   0/1     ContainerCreating   0          2h
disk-deploy-6dfbb5cc-q89jb   1/1     Running             0          2h
disk-deploy-6dfbb5cc-rkmf5   0/1     ContainerCreating   0          2h
disk-stateful-set-0          0/1     ContainerCreating   0          1m

We notice two things:

  1. The pods’ names are deterministic: disk-stateful-set-0 is the name of the stateful set plus an index (0, 1, …)
  2. The pods are brought online in sequence, not in parallel as with a deployment

With time, we can see that Azure disks get created and bound to the pods:

$ kubectl get pods
NAME                         READY   STATUS              RESTARTS   AGE
disk-deploy-6dfbb5cc-79bqx   1/1     Running             0          3h
disk-deploy-6dfbb5cc-pl9wm   0/1     ContainerCreating   0          3h
disk-deploy-6dfbb5cc-q89jb   1/1     Running             0          3h
disk-deploy-6dfbb5cc-rkmf5   0/1     ContainerCreating   0          3h
disk-stateful-set-0          1/1     Running             0          3m
disk-stateful-set-1          1/1     Running             0          2m
disk-stateful-set-2          1/1     Running             0          2m
disk-stateful-set-3          1/1     Running             0          1m
disk-stateful-set-4          1/1     Running             0          51s
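
Each pod gets its own claim, named after the volume claim template followed by the pod name (e.g. azure-managed-disk-stateful-disk-stateful-set-0). We can list them with:

kubectl get pvc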

Now, if we delete a pod to simulate a failure, the pod will be rescheduled with the same name and the same disk.
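
For instance, we could delete one of the pods and watch it come back with the same name, hence re-binding to the same claim and disk:

kubectl delete pod disk-stateful-set-2
kubectl get pods --watch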

Limitations of Stateful sets

Kubernetes online documentation lists some limitations of stateful sets.

On top of those, we would like to stress that a stateful set isn't a magic bullet for stateful workloads.

It basically implements the equivalent of having multiple instances of a workload, each with its own disk. Pod failures do not change that picture.

It doesn't implement any stateful smarts, though. For instance, it doesn't implement the master / slave configuration typical of databases.

Stateful workloads typically get those more advanced features by implementing a Kubernetes Operator. An operator defines custom resources with their custom controllers which can then, in turn, implement specific logic.

Summary

We’ve seen how we can use Azure Disk as persistent volumes in AKS.

For non-trivial scenarios, i.e. multi-pods, we turned to stateful sets.
