Monitoring metrics in AKS
AKS has a nice integration with Azure Monitor. Out of the box, there are a few dashboards for common metrics.
What if you need to go beyond those metrics?
That is what we're going to do in this article. I'm going to show how to get the CPU usage per container. Along the way, you should learn enough to be able to dig up whatever other information you need.
As usual, the code is on GitHub.
Solution Deployment
Here we are going to reuse elements of the Requests vs Limits in Kubernetes article. We tweaked it a little, but it is otherwise very similar.
We are going to deploy an AKS cluster along with a Log Analytics workspace and the Container Insights solution.
We’ll need the Azure CLI tool connected to an Azure subscription.
First, let’s download an ARM template and a script invoking it:
curl https://raw.githubusercontent.com/vplauzon/aks/master/monitor-metrics/deploy.json > deploy.json
curl https://raw.githubusercontent.com/vplauzon/aks/master/monitor-metrics/create-cluster.sh > create-cluster.sh
We are going to run that script with seven parameters:
Parameter | Description |
---|---|
Name of the resource group | If the group doesn't exist, the script will create it |
Azure region | Any Azure region where AKS is supported |
Name of workspace | This needs to be unique |
Name of cluster | This is also used as the DNS prefix for the cluster, hence must be unique as well |
Service Principal Application ID | Application ID of a Service Principal |
Service Principal Object ID | Object ID of the same Service Principal |
Service Principal Password | Password of the same Service Principal |
The last three parameters are related to the Service Principal that will be used by AKS. See this article on how to create a service principal and retrieve that information.
Let’s run the command locally, e.g.:
./create-cluster.sh aks-group eastus myuniquelaworkspace myuniqueaks \
<my-principal-app-id> \
<my-principal-object-id> \
<my-principal-password>
This takes a few minutes to execute.
Kubernetes deployment
Let’s deploy a set of pods in the cluster using the following yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-ram-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: cpu-ram-api
  template:
    metadata:
      labels:
        app: cpu-ram-api
    spec:
      containers:
      - name: cpu-ram-request-api
        image: vplauzon/cpu-ram-request-api:4
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "64M"
            cpu: "250m"
          limits:
            memory: "128M"
            cpu: "2"
---
apiVersion: v1
kind: Service
metadata:
  name: cpu-ram-request-api-svc
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: cpu-ram-api
We have a deployment and a public service load balancing the pods of the deployment.
The pod has one container. The container's image is vplauzon/cpu-ram-request-api. The source code of that container is also on GitHub. It's an API implemented in C#. It basically keeps the CPU busy and allocates memory. It was built on purpose to create workload spikes for testing monitoring.
The deployment script already connected the kubectl CLI to our cluster (i.e. it executed the az aks get-credentials command for us). So, we can simply deploy the yaml file with the following command:
kubectl apply -f https://raw.githubusercontent.com/vplauzon/aks/master/monitor-metrics/service.yaml
If we look at the pods:
$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
cpu-ram-api-5976cfdfb7-8p5k2   1/1     Running   0          6m3s
cpu-ram-api-5976cfdfb7-crsbh   1/1     Running   0          6m4s
cpu-ram-api-5976cfdfb7-m26gn   1/1     Running   0          6m6s
cpu-ram-api-5976cfdfb7-pgf9v   0/1     Pending   0          6m3s
cpu-ram-api-5976cfdfb7-qj55t   1/1     Running   0          6m4s
cpu-ram-api-5976cfdfb7-zrlcl   1/1     Running   0          6m3s
We see that one of the pods is pending because our single-node cluster is full: the node no longer has enough allocatable resources to satisfy the pod's requests.
Now let’s look at the service:
$ kubectl get svc
NAME                      TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
cpu-ram-request-api-svc   LoadBalancer   10.0.40.102   40.70.77.7    80:31744/TCP   7m8s
kubernetes                ClusterIP      10.0.0.1      <none>        443/TCP        123m
We need to copy the external IP of the cpu-ram-request-api-svc service. That's the Azure public IP associated with the load balancer of that service. Let's store it in a shell variable:
ip=40.70.77.7 # Here, let's replace that specific IP with the one from our cluster
Now let's call the API we just deployed a few times:
$ curl "http://$ip/"
{"duration":1,"numberOfCore":1,"ram":10,"realDuration":"00:00:00.9995901"}
$ curl "http://$ip?duration=45"
{"duration":45,"numberOfCore":1,"ram":10,"realDuration":"00:00:44.9990431"}
$ curl "http://$ip?duration=20&core=2"
{"duration":20,"numberOfCore":2,"ram":10,"realDuration":"00:00:20.0014578"}
Those will create a few CPU spikes we’ll be able to pick up in the logs.
KubePodInventory
Let’s open the Log Analytics workspace in the Azure Portal. Under General, let’s select the Logs pane.
On the left-hand side, we can see two categories of “tables”:
- ContainerInsights
- LogManagement
Although the key table for performance metrics sits in the LogManagement category, let's start by looking at the ContainerInsights category:
Those tables are all AKS-related.
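To get a feel for how much data lands in those tables, a hedged first query could count the records per table over the last hour. Note that KubeNodeInventory and ContainerInventory are assumed here; adjust to whichever Container Insights tables appear in your workspace:
// KubeNodeInventory and ContainerInventory are assumed to be present in the workspace
union withsource=SourceTable KubePodInventory, KubeNodeInventory, ContainerInventory
| where TimeGenerated >= ago(1h)
| summarize RecordCount = count() by SourceTable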
Since we want to find out the CPU usage for pods, let’s look at KubePodInventory. We can type the following query:
KubePodInventory
| limit 5
We can then click Run (or type Shift-Enter). The screen should look as follows once we expand the KubePodInventory table on the left:
This is a good first step to explore the logs and get a feel for the available data.
The query language used is Kusto. There is a quick tutorial here and a cheat sheet comparing it to SQL here. The syntax might look funny at first, but people typically get the hang of it within the first hour.
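As a tiny taste of the syntax, here is a minimal sketch counting the pod inventory records of the default namespace over the last hour; the pipe (|) chains operators much like a shell pipeline where SQL would nest clauses:
KubePodInventory
| where TimeGenerated >= ago(1h)   // time filter, similar to a SQL WHERE clause
| where Namespace == "default"     // filters compose simply by piping
| count                            // equivalent to SELECT COUNT(*)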
We can see there are a lot of IDs in the data. For instance, the TenantId is actually the Log Analytics Workspace ID.
We can see the Computer column corresponds to AKS nodes.
To get a better feel of a column, we can fetch its distinct (unique) values. For instance:
KubePodInventory
| distinct Namespace, Name
| sort by Namespace, Name asc
gives a list of names corresponding to what we would get from kubectl get pods --all-namespaces.
The left pane is useful to see the name and type of columns.
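If we prefer staying in the query window, the getschema operator returns the same information (column names and types) as a regular result set:
// Lists every column of the table along with its data type
KubePodInventory
| getschema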
Perf
Let’s look at the Perf table under LogManagement category. Its schema looks like a metric table with its TimeGenerated and CounterValue columns.
We can throw a few queries at it. For instance:
Perf
| distinct ObjectName
shows us only two object names:
- K8SContainer
- K8SNode
This tells us that AKS has put some data in that table.
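To see how the records split between those two objects, we can count them; a minimal sketch:
Perf
| where TimeGenerated >= ago(1h)
| summarize RecordCount = count() by ObjectName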
Perf
| distinct InstanceName
This last query gives us very long IDs, some of which end with the name of a container.
This is where a bit of insider knowledge is needed. Perf is a generic table used for VM, AKS and many other Azure resource metrics. It therefore doesn't have pod, container or namespace columns.
Metrics are typically tracked at the container level. For containers, InstanceName corresponds to:
cluster-id/pod id/container name
Thankfully, we have all that information in KubePodInventory. This will come in handy to filter the metrics down to only the pods and containers we are interested in.
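Even before building that join, we can do a rough filter directly on Perf by matching the tail end of InstanceName against the container name from our manifest; a quick sketch, assuming that naming convention holds:
Perf
| where ObjectName == "K8SContainer"                   // container-level rows only
| where InstanceName endswith "/cpu-ram-request-api"   // rough match on the container name suffix
| take 10                                              // sample a few rows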
Another useful column is CounterName:
Perf
| distinct CounterName
| sort by CounterName asc
The values are:
- cpuAllocatableNanoCores
- cpuCapacityNanoCores
- cpuLimitNanoCores
- cpuRequestNanoCores
- cpuUsageNanoCores
- memoryAllocatableBytes
- memoryCapacityBytes
- memoryLimitBytes
- memoryRequestBytes
- memoryRssBytes
- memoryWorkingSetBytes
- restartTimeEpoch
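The CPU counters are expressed in nanocores: one core is 1,000,000,000 nanocores, so dividing by 1,000,000 gives millicores, the same unit used in the pod's CPU request (250m). As a hedged example, this charts the average container CPU usage in 5-minute bins:
Perf
| where TimeGenerated >= ago(1h)
| where ObjectName == "K8SContainer"
| where CounterName == "cpuUsageNanoCores"
| summarize AvgMilliCores = avg(CounterValue) / 1000000 by bin(TimeGenerated, 5m), InstanceName
| render timechart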
Joining KubePodInventory & Perf
Let’s try to find those values in KubePodInventory:
KubePodInventory
| distinct ClusterId, PodUid, ContainerName
We can see the result is close to what we need:
The first two columns look fine. The ContainerName is prefixed with an ID we do not need. We can get rid of the prefix easily:
KubePodInventory
| extend JustContainerName=tostring(split(ContainerName, '/')[1])
| distinct ClusterId, PodUid, JustContainerName
This gives us what we need.
We now have everything to join the two tables:
let clusterName = "<our cluster name>";
let serviceName = "cpu-ram-request-api-svc";
let counterName = "cpuUsageNanoCores";
let startTime=ago(60m);
KubePodInventory
| where ClusterName == clusterName
| where ServiceName == serviceName
| where TimeGenerated >= startTime
| extend JustContainerName=tostring(split(ContainerName, '/')[1])
| extend InstanceName=strcat(ClusterId, '/', PodUid, '/', JustContainerName)
| distinct Name, InstanceName
| join (
Perf
| where TimeGenerated >= startTime
| where CounterName == counterName
) on InstanceName
| project CounterValue, TimeGenerated, Name
| render timechart
We declared a few variables at the beginning using the let keyword. First, we want to filter for our cluster: in general, one Log Analytics workspace could be used as the target for multiple AKS clusters. Then we want to filter for a service name. We also want to look only at the CPU usage metric. Finally, we are looking at a 60-minute window.
We might need to perform a few more calls (e.g. curl "http://$ip?duration=20&core=2") if more than 60 minutes have elapsed since we last did it. It takes a few minutes for logs to be ingested and become available in the workspace.
We get a chart of exactly what we needed: the CPU usage of each pod.
Times are given in UTC (GMT). In order to convert them, we can add or subtract hours. For instance, in Montreal, currently (early May):
| extend LocalTimeGenerated = TimeGenerated - 4h
| project CounterValue, LocalTimeGenerated, Name
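The same join can be pointed at any of the other counters listed earlier. As a hedged variant, here is a sketch charting the memory working set per pod instead, averaged over one-minute bins and divided by 1,000,000 to express it in the same decimal megabytes as the manifest's 64M / 128M settings:
let clusterName = "<our cluster name>";
let serviceName = "cpu-ram-request-api-svc";
let counterName = "memoryWorkingSetBytes";
let startTime = ago(60m);
KubePodInventory
| where ClusterName == clusterName
| where ServiceName == serviceName
| where TimeGenerated >= startTime
| extend JustContainerName=tostring(split(ContainerName, '/')[1])
| extend InstanceName=strcat(ClusterId, '/', PodUid, '/', JustContainerName)
| distinct Name, InstanceName
| join (
Perf
| where TimeGenerated >= startTime
| where CounterName == counterName
) on InstanceName
| summarize AvgWorkingSetMB = avg(CounterValue) / 1000000 by bin(TimeGenerated, 1m), Name
| render timechart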
As usual with Log Analytics, we can save these queries. We can also pin them to a shared dashboard.
Summary
We just took a quick dive into the data collected by the Azure Monitor solution for AKS.
Using the content of this article, you should easily be able to track different metrics.
This is useful for dashboards, but also for forensic analysis, i.e. after-the-fact troubleshooting.