Requests vs Limits in Kubernetes

Kubernetes doesn’t know what resources (i.e. CPU & memory) your container needs. That is why you must give it some hints.

If you run well under capacity and/or run fairly similar pods, you do not need to do that. But as you approach the maximum capacity of your cluster, or if some pods consume far more resources than others, you'll start to get in trouble.

In this article we will look at how to inform Kubernetes about pods’ resources and how we can optimize for different scenarios.

A scenario that typically comes up is a cluster with a bunch of pods where a lot of them are dormant, i.e. they don't consume CPU or memory. Do we have to carve out space for them that they won't use most of the time? The answer is no. As usual, it's safer to provision capacity for a workload than to rely on the optimistic heuristic that not all workloads will require resources at the same time. So, we can configure Kubernetes optimistically or pessimistically.

We’ll review Kubernetes specs first. Then, as we did recently with outbound traffic, we’ll simply experiment to find answers.

As usual, the code is on GitHub.

Managing compute resources

Kubernetes online documentation explains how to configure resources.

Here are some highlights.

There are two resource types: CPU & memory. The former is quantified in number of cores while the latter is quantified in bytes of RAM.

Resources are configured at the container level, not pod level. But since a pod is a deployment unit, the total resources required by the containers of a pod is what we focus on.

There are two ways to specify resources: requests & limits. This is where a lot of confusion arises. Let’s try to clarify it here. Each one (i.e. requests and limits) can be specified with CPU and / or memory. Here are key characteristics:

Requests: The requests specification is used at pod placement time: Kubernetes looks for a node that has enough CPU and memory, according to the requests configuration, to host the pod.
Limits: Limits are enforced at runtime. If a container exceeds its limits, Kubernetes tries to stop it. For CPU, it simply throttles usage, so a container typically can't exceed its limit; it won't be killed, it just won't get more CPU. If a container exceeds its memory limit, it can be terminated.

It’s tempting to see those two as a minimum and a maximum, but that isn’t quite right. Requests is only used for placement and creates a theoretical map of the cluster: Kubernetes makes sure that the sum of requested resources on a node is less than or equal to the node’s capacity. It isn’t a minimum; our container could actually use less. It’s a hint at what the container needs.

The limits are closer to the concept of a maximum.
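To make the distinction concrete, here is a toy Python sketch (not actual Kubernetes scheduler code; the capacity figures are made up) of the fit check: a pod lands on a node only if the sum of requests stays within the node's allocatable resources. Limits play no role in placement.

```python
def fits(allocatable, placed_requests, new_request):
    """Toy version of the scheduler's fit check: requests only, limits ignored."""
    for resource, capacity in allocatable.items():
        used = sum(pod.get(resource, 0) for pod in placed_requests)
        if used + new_request.get(resource, 0) > capacity:
            return False
    return True

# Made-up node: a 2-core VM of which ~1.4 cores are allocatable once
# Kubernetes reserves its own share (the exact figure varies by node).
node = {"cpu": 1.4, "memory_mb": 7000}

pod = {"cpu": 0.25, "memory_mb": 64}  # a pod requesting 1/4 core and 64 MB
print(fits(node, [pod] * 4, pod))     # True: a 5th pod fits (1.25 <= 1.4)
print(fits(node, [pod] * 5, pod))     # False: a 6th would need 1.5 cores
```

This is why requests form a theoretical map of the cluster: the check works on declared numbers, not on what containers actually consume.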

Pessimistic vs Optimistic

So how should we use those configurations?

If we are pessimistic, we’ll want to make sure our containers have enough resources at all times. So, we’ll set the requests to what the containers need to run at all times. We’ll set the limits just to prevent bad behaviours.

If we are optimistic, we’ll want to make sure our containers have enough resources to start, but we’ll rely on “over capacity” to handle peaks. That over capacity comes from the resources other containers happen not to be using at that time.

Requests — Pessimistic: what the container needs to run at all times. Optimistic: what the container needs to start and to run when idle.
Limits — Pessimistic: prevents bad behaviour (i.e. a noisy neighbour); what the container should never need. Optimistic: also prevents bad behaviour; moreover, it preserves over capacity by capping what each individual container can consume at runtime.

Of course, there are degrees of optimism. The lower we push the requests, the more we rely on the ethereal “over capacity”. The higher we push them, the more we provision capacity.

Setting up the experiments

Let’s experiment using Azure Kubernetes Service (AKS).

In order to run those experiments, we’ll need the Azure CLI tool connected to an Azure subscription.

First, let’s download an ARM template and a script invoking it:

curl > deploy.json
curl >

Let’s make the script executable:

chmod +x

We are going to run that script with seven parameters:

Name of the resource group: if the group doesn't exist, the script will create it.
Azure region: must be one of the regions supporting ACI in a VNET. At the time of this writing, i.e. end of March 2019, that means one of the following: EastUS2EUAP, CentralUSEUAP, WestUS, WestCentralUS, NorthEurope, WestEurope, EastUS or AustraliaEast.
Name of the workspace: this needs to be unique.
Name of the cluster: this is also used as the DNS prefix for the cluster, hence must be unique.
Service Principal Application ID: application ID of a Service Principal.
Service Principal Object ID: object ID of the same Service Principal.
Service Principal Password: password of the same Service Principal.

We are using Log Analytics to monitor the CPU / memory usage of containers.

The last three parameters are related to the Service Principal that will be used by AKS.

Let’s run the command locally, e.g.:

./ aks-group eastus myuniquelaworkspace myuniqueaks \
    <my-principal-app-id> \
    <my-principal-object-id> \

This takes a few minutes to execute.

Exceeding CPU requests

Now that we have a cluster, let’s deploy something on it:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-ram-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: cpu-ram-api
  template:
    metadata:
      labels:
        app: cpu-ram-api
    spec:
      containers:
      - name: myapp
        image: vplauzon/cpu-ram-request-api:4
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "64M"
            cpu: "250m"
          limits:
            memory: "128M"
            cpu: "2"
---
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: cpu-ram-api

We have a deployment and a public service load balancing the pods of the deployment.

The pod has one container. The container's image is vplauzon/cpu-ram-request-api. The source code of this container is also on GitHub. It's an API implemented in C# that keeps the CPU busy and allocates memory. It was built on purpose for these tests.

We see under the resources subsection that we specify a request of 64 MB of RAM and 250 millicores, i.e. 1/4 of a core. Similarly, we specify a limit of 128 MB and 2 cores.
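A note on units: the quantity suffixes in the manifest are easy to misread. This hypothetical helper (not from any Kubernetes client library) shows how the notations used above translate to base units:

```python
# "m" on a CPU quantity means milli-cores; "M" on a memory quantity means
# decimal megabytes (10^6 bytes), while "Mi" would mean mebibytes (2^20 bytes).

def parse_cpu(quantity):
    """Return the number of cores as a float."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000.0
    return float(quantity)

def parse_memory_bytes(quantity):
    """Return bytes; supports only the 'M' and 'Mi' suffixes used here."""
    if quantity.endswith("Mi"):
        return int(quantity[:-2]) * 1024 ** 2
    if quantity.endswith("M"):
        return int(quantity[:-1]) * 1000 ** 2
    return int(quantity)

print(parse_cpu("250m"))          # 0.25
print(parse_cpu("2"))             # 2.0
print(parse_memory_bytes("64M"))  # 64000000
```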

Let’s deploy it:

kubectl apply -f

Let’s look at the pods:

$ kubectl get pods

NAME                           READY   STATUS    RESTARTS   AGE
cpu-ram-api-76cb6dbbff-926nk   1/1     Running   0          84s
cpu-ram-api-76cb6dbbff-gvp4t   1/1     Running   0          84s
cpu-ram-api-76cb6dbbff-sfjc4   1/1     Running   0          84s
cpu-ram-api-76cb6dbbff-wn7rr   1/1     Running   0          84s
cpu-ram-api-76cb6dbbff-wrpwv   0/1     Pending   0          84s
cpu-ram-api-76cb6dbbff-zh5q8   1/1     Running   0          84s

We see that one of the pods is pending. Our cluster has a single node. The VM SKU (B2ms) has 2 cores and 8 GB of RAM. Kubernetes itself uses a portion of those resources, so our pods do not have access to the VM's full capacity. With 5 pods active, we requested 1.25 cores; there wasn't enough room left for the 1.5 cores that 6 pods would request.

Now let’s look at the service:

$ kubectl get svc

NAME          TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)        AGE
kubernetes    ClusterIP      <none>           443/TCP        175m
web-service   LoadBalancer   80:31146/TCP   11m

We need to copy the external IP of the web-service service. That's the Azure public IP associated with the load balancer of that service. Let's store it in a shell variable:

ip=  # Here, let's replace that specific IP with the one from our cluster

Now let’s try a few things:

$ curl "http://$ip/"


By default, the API allocates 10 MB of RAM, uses one core to do some work, and runs for one second.
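The real API is implemented in C# (source in the vplauzon/cpu-ram-request-api repository); as a rough Python sketch of its behaviour, the default workload amounts to something like this:

```python
import time

def handle_request(duration=1.0, ram_mb=10):
    """Rough sketch of the API's default behaviour: allocate some RAM,
    then keep one core busy for the requested duration."""
    payload = bytearray(ram_mb * 1000 * 1000)  # hold ~ram_mb MB of memory
    deadline = time.monotonic() + duration
    counter = 0
    while time.monotonic() < deadline:  # busy-loop pins one core
        counter += 1
    return len(payload), counter

size, _ = handle_request(duration=0.1)  # short run for illustration
print(size)  # 10000000: ten million bytes held during the call
```

The duration, ram and core query-string parameters we use below map onto knobs like these.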

Let’s see if we can find that usage. Let’s open the AKS cluster in the portal and let’s look at the Metrics, under Monitoring:


Now we are going to maximize the view (little chevrons on both sides), select the containers tab, search for myapp and display CPU with Max:

Empty insights

We do not see a blip on the radar. So, let’s run that for a little longer. We can do that with specific query strings:

$ curl "http://$ip/?duration=90"


This will take 90 seconds to run.

We’ll need to wait a little for Log Analytics to catch up on the metrics. But then:

90 seconds

We switched to “The last 30 minutes” in Time range.

We see that most of one core (895 millicores) was used.

Here we just proved that a container can consume more than its requests spec, which was 0.25 core.

Let’s do the same thing with 2 cores:

$ curl "http://$ip/?duration=90&core=2"


We can see the result here:

90 seconds & 2 cores

The usage went close to 2 cores and is highlighted in red, since it is close to the container's limit.

Here the container went to the maximum of its limit.

We won’t do it here, but if we set the limit to 1 core, the same experiment would show that only one core gets used. Kubernetes enforces the limit.
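Under the hood, the CPU limit maps to the Linux CFS bandwidth controller: Kubernetes sets the container cgroup's quota to limit × period, the default period being 100 ms. A quick Python sketch of that arithmetic (the period is the kernel default, not anything we configured):

```python
DEFAULT_PERIOD_US = 100_000  # default CFS period: 100 ms, in microseconds

def cfs_quota_us(cpu_limit_cores, period_us=DEFAULT_PERIOD_US):
    """CPU time the container's cgroup may consume per period;
    once exhausted, its threads are throttled until the next period."""
    return int(cpu_limit_cores * period_us)

print(cfs_quota_us(2))     # 200000: our 2-core limit, 200 ms of CPU per 100 ms
print(cfs_quota_us(1))     # 100000: a 1-core limit caps at one core's worth
print(cfs_quota_us(0.25))  # 25000
```

This is why exceeding a CPU limit doesn't kill anything: the container just gets throttled each period.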

Let’s try to run many of those at the same time:

curl "http://$ip/?duration=90&core=2" &
curl "http://$ip/?duration=90&core=2" &
curl "http://$ip/?duration=90&core=2" &

Here we can see that each container takes less than one core:

Multiple 2 cores

Basically, they all pull on the blanket, but none of the containers can fully use two cores.

This is the flip side of under provisioning: if all pods peak at the same time, they won’t all get the “over capacity”.
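With identical requests, the three containers get equal CFS shares, so contended CPU splits roughly evenly. An idealized sketch with our numbers (real usage wobbles around these figures):

```python
def contended_share(capacity_cores, n_containers, demand_per_container):
    """Idealized fair split among containers with equal shares:
    each gets the lesser of what it demands and an even slice."""
    return min(demand_per_container, capacity_cores / n_containers)

print(contended_share(2.0, 3, 2.0))  # ~0.67: three containers on ~2 cores
print(contended_share(2.0, 1, 2.0))  # 2.0: alone, a container reaches its limit
```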

Exceeding Memory requests

Now let’s try to crank the memory used:

curl "http://$ip/?duration=5&ram=20"

Here we ask the API to allocate 20 MB during the request. That adds to the rest of the memory already used by the container:

20 Mb

We then get close to our 128 MB limit. Let's exceed it:

$ curl "http://$ip/?duration=5&ram=100"

curl: (52) Empty reply from server

We see an error occur. That's because the container was killed when its memory allocation pushed the total memory past the container's limit specification.
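Why did 20 MB pass but 100 MB kill the container? The container already holds some memory of its own. A toy model (the 50 MB baseline is a made-up figure for the container's resting footprint, not a measured value):

```python
LIMIT_MB = 128    # the memory limit from our manifest
BASELINE_MB = 50  # made-up resting footprint of the container process

def survives(extra_mb):
    """The cgroup OOM-kills the container once total usage crosses the limit."""
    return BASELINE_MB + extra_mb <= LIMIT_MB

print(survives(20))   # True: ~70 MB total stays under the 128 MB limit
print(survives(100))  # False: ~150 MB total gets the container killed
```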

Deleted Container

We can see from the logs that the container was restarted, since the memory usage went down. The pod didn't get replaced, only the container.


We’ve looked at how to specify the resources allocated to containers and walked through a few examples.

What sums it up best is this matrix:

Requests — Pessimistic: what the container needs to run at all times. Optimistic: what the container needs to start and to run when idle.
Limits — Pessimistic: prevents bad behaviour (i.e. a noisy neighbour); what the container should never need. Optimistic: also prevents bad behaviour; moreover, it preserves over capacity by capping what each individual container can consume at runtime.

As part of our deployment strategy with Kubernetes, we need to decide whether we want to be pessimistic or optimistic.

If we are pessimistic, we'll always have enough resources, but we'll provision more of them. And with cloud resources, more capacity means higher cost.

If we are optimistic, it will be cheaper, but we might run out of capacity sometimes.

Kubernetes gives us the flexibility to make that decision ourselves.

2 responses

  1. dobesv 2019-04-11 at 08:16

    It would be nice to hear more about what happens if you are optimistic about memory but your node runs out of memory, with examples.

  2. Vincent-Philippe Lauzon 2019-04-12 at 07:38

    Good question. I haven’t tried it. I guess that processes will just start to starve in memory and get refused memory allocation… i.e. bad!
