Monitoring metrics in AKS


AKS has a nice integration with Azure Monitor. Out of the box, it comes with a few dashboards covering common metrics.

What if you need to go beyond those metrics?

This is what we’re going to do in this article. I’m going to show how to get the CPU usage per container. Along the way, you should learn enough to dig up whatever other information you need.

As usual, code is in GitHub.

Solution Deployment

Here we are going to reuse elements of the Requests vs Limits in Kubernetes article. We tweaked it a little, but it is otherwise very similar.

We are going to deploy an AKS cluster along with a Log Analytics workspace and the Container Insights solution.

We’ll need the Azure CLI tool connected to an Azure subscription.

First, let’s download an ARM template and a script invoking it:

curl https://raw.githubusercontent.com/vplauzon/aks/master/monitor-metrics/deploy.json > deploy.json
curl https://raw.githubusercontent.com/vplauzon/aks/master/monitor-metrics/create-cluster.sh > create-cluster.sh

We are going to run that script with seven parameters:

Parameter                            Description
Name of the resource group           If the group doesn’t exist, the script will create it
Azure region                         Any Azure region where AKS is supported
Name of workspace                    This needs to be unique
Name of cluster                      This is also used as the DNS prefix for the cluster, hence must be unique as well
Service Principal Application ID     Application ID of a Service Principal
Service Principal Object ID          Object ID of the same Service Principal
Service Principal Password           Password of the same Service Principal

The last three parameters are related to the Service Principal that will be used by AKS. See this article on how to create a Service Principal and retrieve this information.
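
For reference, here is one way to create such a Service Principal and gather the three values with the Azure CLI. This is only a sketch: my-aks-principal is a placeholder name, and depending on the CLI version, the object ID property may be exposed as id rather than objectId:

# Create the Service Principal ; the output includes appId (the Application ID) and password
az ad sp create-for-rbac --name my-aks-principal

# Retrieve the Object ID of the same principal (use 'id' instead of 'objectId' on recent CLI versions)
az ad sp show --id <my-principal-app-id> --query objectId --output tsv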

Let’s run the command locally, e.g.:

./create-cluster.sh aks-group eastus myuniquelaworkspace myuniqueaks \
    <my-principal-app-id> \
    <my-principal-object-id> \
    <my-principal-password>

This takes a few minutes to execute.
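
Optionally, we can verify the cluster is ready before moving on. Something along these lines (using the same resource group & cluster names we passed to the script) should return Succeeded once provisioning completes:

az aks show --resource-group aks-group --name myuniqueaks \
    --query provisioningState \
    --output tsv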

Kubernetes deployment

Let’s deploy a set of pods in the cluster using the following yaml file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-ram-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app:  cpu-ram-api
  template:
    metadata:
      labels:
        app: cpu-ram-api
    spec:
      containers:
      - name: cpu-ram-request-api
        image: vplauzon/cpu-ram-request-api:4
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "64M"
            cpu: "250m"
          limits:
            memory: "128M"
            cpu: "2"
---
apiVersion: v1
kind: Service
metadata:
  name: cpu-ram-request-api-svc
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: cpu-ram-api

We have a deployment and a public service load balancing the pods of the deployment.

The pod has one container. The container’s image is vplauzon/cpu-ram-request-api. The source code of that container is also on GitHub. It’s an API implemented in C#. It basically keeps the CPU busy and allocates memory on demand. It was built specifically to create workload spikes for testing monitoring.

The deployment script already connected the kubectl CLI to our cluster (i.e. it executed the az aks get-credentials command for us). So, we can simply deploy the YAML file with the following command:

kubectl apply -f https://raw.githubusercontent.com/vplauzon/aks/master/monitor-metrics/service.yaml
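
Should we ever need to reconnect kubectl manually (e.g. from another machine), the command the script ran for us looks like the following, with our own resource group & cluster names:

az aks get-credentials --resource-group aks-group --name myuniqueaks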

If we look at the pods:

$ kubectl get pods

NAME                           READY   STATUS    RESTARTS   AGE
cpu-ram-api-5976cfdfb7-8p5k2   1/1     Running   0          6m3s
cpu-ram-api-5976cfdfb7-crsbh   1/1     Running   0          6m4s
cpu-ram-api-5976cfdfb7-m26gn   1/1     Running   0          6m6s
cpu-ram-api-5976cfdfb7-pgf9v   0/1     Pending   0          6m3s
cpu-ram-api-5976cfdfb7-qj55t   1/1     Running   0          6m4s
cpu-ram-api-5976cfdfb7-zrlcl   1/1     Running   0          6m3s

We see that one of the pods is pending because our single-node cluster is full.
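
We can confirm why with kubectl describe: the events of the pending pod should show a FailedScheduling event mentioning insufficient CPU (the pod name below comes from our own kubectl get pods output):

kubectl describe pod cpu-ram-api-5976cfdfb7-pgf9v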

Now let’s look at the service:

$ kubectl get svc

NAME                      TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
cpu-ram-request-api-svc   LoadBalancer   10.0.40.102   40.70.77.7    80:31744/TCP   7m8s
kubernetes                ClusterIP      10.0.0.1      <none>        443/TCP        123m

We need to copy the external IP of the cpu-ram-request-api-svc service. That’s the Azure public IP associated with the load balancer of that service. Let’s store it in a shell variable:

ip=40.70.77.7  # Here, let's replace that specific IP with the one from our cluster

Now let’s call the API we just deployed a few times:

$ curl "http://$ip/"

{"duration":1,"numberOfCore":1,"ram":10,"realDuration":"00:00:00.9995901"}

$ curl "http://$ip?duration=45"

{"duration":45,"numberOfCore":1,"ram":10,"realDuration":"00:00:44.9990431"}

$ curl "http://$ip?duration=20&core=2"

{"duration":20,"numberOfCore":2,"ram":10,"realDuration":"00:00:20.0014578"}

Those will create a few CPU spikes we’ll be able to pick up in the logs.
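
To create longer, overlapping spikes across several pods at once, we can also fire calls in parallel. A small bash loop like this one does the trick (the numbers are arbitrary):

# Launch 6 parallel 30-second CPU spikes, then wait for them all to complete
for i in $(seq 1 6)
do
    curl "http://$ip?duration=30&core=2" &
done
wait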

KubePodInventory

Let’s open the Log Analytics workspace in the Azure Portal. Under General, let’s select the Logs pane.

On the left-hand side, we can see two categories of “tables”:

  • ContainerInsights
  • LogManagement

Although the key table for performance metrics lives under LogManagement, let’s start by looking at ContainerInsights:

Container tables

Those are all AKS related.

Since we want to find out the CPU usage for pods, let’s look at KubePodInventory. We can type the following query:

KubePodInventory 
| limit 5

We can then click Run (or type Shift-Enter). The screen should look as follows once we expanded the KubePodInventory table on the left:

KubePodInventory exploration

This is a good first step in exploring the logs: it gives a feel for the data available.

The query language used is Kusto. There is a quick tutorial here and a cheat sheet comparing it to SQL here. The syntax might look odd at first, but people typically get the hang of it within the first hour.

We can see there are a lot of IDs in the data. For instance, the TenantId is actually the Log Analytics Workspace ID.

We can see the Computer column corresponds to AKS nodes.
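
For instance, the following query should return the same node names as kubectl get nodes:

KubePodInventory
| distinct Computer
| sort by Computer asc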

To get a better feel for a column, we can fetch its distinct (unique) values. For instance:

KubePodInventory 
| distinct Namespace, Name
| sort by Namespace, Name asc

gives a list of names corresponding to what we would get from a kubectl get pods --all-namespaces.

The left pane is useful for seeing the names and types of columns.
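
The same information is also available in query form, using the getschema operator, which returns each column along with its data type:

KubePodInventory
| getschema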

Perf

Let’s look at the Perf table, under the LogManagement category. Its schema looks like that of a metric table, with TimeGenerated and CounterValue columns.

We can throw a few queries at it. For instance:

Perf 
| distinct ObjectName

shows us only two object names:

  • K8SContainer
  • K8SNode

This tells us that AKS has put some data in that table.

Perf 
| distinct InstanceName

This last query gives us very long IDs, some of which end with the name of a container.

This is where a bit of insider knowledge is required. Perf is a generic table used for VMs, AKS and many other Azure resources’ metrics. It therefore doesn’t have dedicated pod, container or namespace columns.

Metrics are typically tracked at the container level. For containers, InstanceName corresponds to:

cluster-id/pod id/container name

Thankfully, we have all that information in KubePodInventory. This will come in handy to filter the metrics down to only the pods / containers we are interested in.
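
We can peek at a few container-level rows to see that cluster-id / pod id / container name pattern concretely:

Perf
| where ObjectName == "K8SContainer"
| take 5
| project TimeGenerated, InstanceName, CounterName, CounterValue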

Another useful column is

Perf 
| distinct CounterName
| sort by CounterName asc 

where the values are (we’ll put one of those counters to work right after this list):

  • cpuAllocatableNanoCores
  • cpuCapacityNanoCores
  • cpuLimitNanoCores
  • cpuRequestNanoCores
  • cpuUsageNanoCores
  • memoryAllocatableBytes
  • memoryCapacityBytes
  • memoryLimitBytes
  • memoryRequestBytes
  • memoryRssBytes
  • memoryWorkingSetBytes
  • restartTimeEpoch
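
As promised, we can already put one of those counters to work at the node level, without any join. Here is a minimal sketch (the 5-minute bin size is arbitrary):

Perf
| where ObjectName == "K8SNode"
| where CounterName == "cpuUsageNanoCores"
| summarize avg(CounterValue) by bin(TimeGenerated, 5m), InstanceName
| render timechart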

Joining KubePodInventory & Perf

Let’s try to find those values in KubePodInventory:

KubePodInventory 
| distinct ClusterId, PodUid, ContainerName

We can see the result is close to what we need:

KubePodInventory trinity

The first two columns look fine. The ContainerName is prefixed with an ID we do not need. We can get rid of that prefix easily:

KubePodInventory 
| extend JustContainerName=tostring(split(ContainerName, '/')[1])
| distinct ClusterId, PodUid, JustContainerName

This gives us what we need.

Fixed Container Name

We now have everything to join the two tables:

let clusterName = "<our cluster name>";
let serviceName = "cpu-ram-request-api-svc";
let counterName = "cpuUsageNanoCores";
let startTime=ago(60m);
KubePodInventory
| where ClusterName == clusterName
| where ServiceName == serviceName
| where TimeGenerated >= startTime
| extend JustContainerName=tostring(split(ContainerName, '/')[1])
| extend InstanceName=strcat(ClusterId, '/', PodUid, '/', JustContainerName) 
| distinct Name, InstanceName
| join (
    Perf
    | where TimeGenerated >= startTime
    | where CounterName == counterName
    ) on InstanceName
| project CounterValue, TimeGenerated, Name
| render timechart 

We declared a few variables at the beginning using the let keyword. First, we want to filter for our cluster: in general, one Log Analytics workspace can be used as the target of multiple AKS clusters. Then we want to filter for a service name. We also want to look only at the CPU usage metric. Finally, we are looking at a 60-minute window.
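
Tracking a different metric is mostly a matter of swapping that counter name. For instance, to chart the memory working set instead of the CPU usage, we would only change the third variable:

let counterName = "memoryWorkingSetBytes";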

We might need to perform a few more curl "http://$ip?duration=20&core=2" calls if more than 60 minutes have elapsed since we last did. It also takes a few minutes for logs to be ingested and become available in the workspace.

Chart

We get a chart of exactly what we needed: the CPU usage of each pod.

Time is given in UTC. In order to convert it, we can add / subtract hours. For instance, in Montreal, currently (early May):

| extend LocalTimeGenerated = TimeGenerated - 4h
| project CounterValue, LocalTimeGenerated, Name
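
Similarly, since cpuUsageNanoCores is expressed in nanocores, we can make the values easier to compare with the pod’s CPU request (250m) by converting them to millicores in the projection, i.e. dividing by one million:

| extend CpuMilliCores = CounterValue / 1000000
| project CpuMilliCores, TimeGenerated, Name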

As usual with Log Analytics, we can save this query. We can also “pin” it to a shared dashboard.

Summary

We just took a quick dive into the data collected by the Azure Monitor solution for AKS.

Using the content of this article, you should easily be able to track different metrics.

This is useful for dashboards, but also for forensic analysis, i.e. after-the-fact troubleshooting.
