As a rule, an application needs a dedicated pool of resources to run correctly and stably. But what if several applications share the same capacity? How do you guarantee each of them the minimum resources it needs? How do you limit resource consumption? How do you distribute the load between nodes correctly? How do you make sure horizontal scaling works when the load on the application grows?

Let's start with the basic resource types in the system: CPU time and RAM. In k8s manifests, these resources are measured in the following units:

  •     CPU – in cores (fractions are written in millicores, e.g. 200m = 0.2 core)
  •     RAM – in bytes (usually with a suffix, e.g. 500Mi, 1Gi)
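
To make the units concrete, here is a minimal sketch that converts quantity strings into plain numbers. The helper names parse_cpu and parse_memory are hypothetical, and only a subset of the suffixes that k8s accepts is handled.

```python
from decimal import Decimal

# Binary (power-of-two) and decimal suffixes for memory quantities.
# Assumption: only this subset of suffixes is supported in the sketch.
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> Decimal:
    """Return the number of cores: "200m" is 0.2 core, "1" is 1 core."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def parse_memory(quantity: str) -> int:
    """Return the number of bytes for a quantity such as "1Gi" or "500M"."""
    # Try two-letter suffixes first so that "Mi" is not mistaken for "M".
    for suffix in sorted(_MEMORY_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = Decimal(quantity[: -len(suffix)])
            return int(number * _MEMORY_SUFFIXES[suffix])
    return int(Decimal(quantity))
```

Note the difference between 500Mi (500 * 2^20 bytes) and 500M (500 * 10^6 bytes): the two suffixes are not interchangeable.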

Moreover, two kinds of requirements can be set for each resource: requests and limits. Requests describe the minimum amount of free node resources needed to run the container (and the pod as a whole), while limits set a hard cap on the resources available to the container.

It is important to understand that the manifest does not have to define both explicitly. The behavior is as follows:

  •  If only limits are set explicitly for a resource, then requests for this resource automatically take the same value as limits (you can verify this by running kubectl describe on the object). In other words, the container is capped at the same amount of resources that it requires to start.
  •  If only requests are set explicitly for a resource, then no upper limit is placed on it, i.e. the container is limited only by the resources of the node itself.
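
The two rules above can be sketched as a small function. This is an illustration only: effective_resources is a hypothetical name, and namespace-level defaults (LimitRange) are deliberately ignored.

```python
def effective_resources(requests=None, limits=None):
    """Return (requests, limits) after applying the defaulting rule.

    Assumption: no LimitRange exists in the namespace, so the only
    defaulting is "a missing request takes the value of the limit".
    """
    requests = dict(requests or {})
    limits = dict(limits or {})
    for resource, limit in limits.items():
        # Rule 1: only the limit is set, so the request mirrors it.
        requests.setdefault(resource, limit)
    # Rule 2: a request without a limit stays uncapped, nothing is added.
    return requests, limits


# A hypothetical container with a memory request and a CPU limit only.
example_requests, example_limits = effective_resources(
    requests={"memory": "1Gi"}, limits={"cpu": "200m"}
)
```

In this example the CPU request becomes 200m, while memory keeps its 1Gi request and has no limit at all.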

Resource management can be configured not only at the level of a specific container, but also at the namespace level, using the following entities:

  •  LimitRange – describes the restriction policy at the container / pod level in a namespace. It is used to set default restrictions for containers / pods, to prevent the creation of obviously oversized (or undersized) containers / pods, and to bound the allowed ratio between limits and requests
  •  ResourceQuota – describes the restriction policy for all containers in a namespace taken together. It is typically used to divide resources between environments (useful when environments are not strictly separated at the node level)

The following are examples of manifests where resource limits are set:

At the level of a specific container:

containers:
- name: app-nginx
  image: nginx
  resources:
    requests:
      memory: 1Gi
    limits:
      cpu: 200m

That is, to start the nginx container, the node must have at least 1G of RAM and 0.2 CPU free, while at most the container can use 0.2 CPU and all the RAM available on the node.

At the level of the entire namespace:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: nxs-test
spec:
  hard:
    requests.cpu: 300m
    requests.memory: 1Gi
    limits.cpu: 700m
    limits.memory: 2Gi

The sum of all container requests in the default namespace cannot exceed 300m of CPU and 1G of RAM, and the sum of all limits cannot exceed 700m of CPU and 2G of RAM.
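
The check described above can be sketched roughly as follows. The function fits_quota and the usage figures are hypothetical; CPU is given in millicores and memory in MiB to keep the arithmetic in integers.

```python
# Quota values from the manifest above: CPU in millicores, memory in MiB.
HARD = {
    "requests.cpu": 300,
    "requests.memory": 1024,
    "limits.cpu": 700,
    "limits.memory": 2048,
}


def fits_quota(used, new_pod, hard=HARD):
    """Return True if adding new_pod keeps every total within the quota."""
    return all(
        used.get(key, 0) + new_pod.get(key, 0) <= cap
        for key, cap in hard.items()
    )


# Hypothetical totals already consumed by pods in the namespace.
used = {
    "requests.cpu": 200,
    "requests.memory": 512,
    "limits.cpu": 400,
    "limits.memory": 1024,
}
```

With these numbers, a pod that requests 100m of CPU still fits (200m + 100m = 300m), while a pod that requests 150m would push the total over the quota.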

Default restrictions for containers in the namespace:

apiVersion: v1
kind: LimitRange
metadata:
  name: nxs-limit-per-container
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 1Gi
    default:
      cpu: 1
      memory: 2Gi
    min:
      cpu: 50m
      memory: 500Mi
    max:
      cpu: 2
      memory: 4Gi

In the default namespace, every container gets a default request of 100m CPU and 1G of RAM, and a default limit of 1 CPU and 2G. In addition, the allowed request / limit values are bounded for CPU (50m <= x <= 2) and RAM (500M <= x <= 4G).
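
As a rough illustration of how such a policy affects a container, the sketch below fills in the defaults and then validates the result. The function admit is hypothetical and simplified: it checks requests against min, limits against max, and that a request does not exceed its limit.

```python
# Values mirror the LimitRange above: CPU in millicores, memory in MiB.
LIMIT_RANGE = {
    "defaultRequest": {"cpu": 100, "memory": 1024},
    "default": {"cpu": 1000, "memory": 2048},
    "min": {"cpu": 50, "memory": 500},
    "max": {"cpu": 2000, "memory": 4096},
}


def admit(container, limit_range=LIMIT_RANGE):
    """Fill in defaults, then validate against min / max (simplified)."""
    requests = dict(container.get("requests", {}))
    limits = dict(container.get("limits", {}))
    for resource in ("cpu", "memory"):
        explicit_limit = resource in limits
        limits.setdefault(resource, limit_range["default"][resource])
        if resource not in requests:
            # A missing request follows an explicit limit, else the default.
            requests[resource] = (
                limits[resource]
                if explicit_limit
                else limit_range["defaultRequest"][resource]
            )
        if requests[resource] < limit_range["min"][resource]:
            raise ValueError(f"{resource} request is below the minimum")
        if limits[resource] > limit_range["max"][resource]:
            raise ValueError(f"{resource} limit is above the maximum")
        if requests[resource] > limits[resource]:
            raise ValueError(f"{resource} request exceeds its limit")
    return requests, limits
```

A container with no resources section comes out with the default request and limit, while a container that asks for 10m of CPU is rejected because it is below the 50m minimum.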

Restrictions at the pod level in the namespace:

apiVersion: v1
kind: LimitRange
metadata:
  name: nxs-limit-pod
spec:
  limits:
  - type: Pod
    max:
      cpu: 4
      memory: 1Gi

That is, each pod in the default namespace is limited to 4 vCPU and 1G of RAM.

Now let's look at the advantages that setting these restrictions gives us.

The mechanism of load balancing between nodes

As you know, the k8s component responsible for distributing pods across nodes is the scheduler, which works according to a specific algorithm. When choosing the optimal node for a pod, this algorithm goes through two stages:

  1.     Filtering
  2.     Scoring (ranking)

First, the nodes on which the pod can run are selected based on a set of predicates (including a check that the node has enough resources to run the pod – PodFitsResources). Then each of these nodes is awarded points according to priorities (for example, the more free resources a node has, the more points it gets – LeastResourceAllocation / LeastRequestedPriority / BalancedResourceAllocation). The pod is started on the node with the most points (if several nodes share the top score, one of them is chosen at random).
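
The two stages can be sketched in a few lines. This is a toy model, not the scheduler's real code: the function schedule and the node data are hypothetical, and the score is a single "share of resources left free" value, whereas the real scheduler combines several weighted priorities.

```python
import random


def schedule(pod_request, nodes, rng=random):
    """Pick a node for the pod: filter first, then score (toy model)."""
    # Stage 1, filtering: keep nodes where the requests still fit.
    feasible = {
        name: node
        for name, node in nodes.items()
        if all(
            node["requested"][res] + pod_request[res] <= node["allocatable"][res]
            for res in pod_request
        )
    }
    if not feasible:
        return None  # the pod would stay Pending

    # Stage 2, scoring: more free resources after placement, more points.
    def score(node):
        free_shares = [
            (node["allocatable"][res] - node["requested"][res] - pod_request[res])
            / node["allocatable"][res]
            for res in pod_request
        ]
        return round(100 * sum(free_shares) / len(free_shares))

    scores = {name: score(node) for name, node in feasible.items()}
    best = max(scores.values())
    # Ties are broken randomly among the top-scoring nodes.
    return rng.choice(sorted(name for name, s in scores.items() if s == best))


# Hypothetical nodes: CPU in millicores, memory in MiB.
nodes = {
    "s8": {"allocatable": {"cpu": 4000, "memory": 8192},
           "requested": {"cpu": 3900, "memory": 1024}},
    "s9": {"allocatable": {"cpu": 4000, "memory": 8192},
           "requested": {"cpu": 1000, "memory": 4096}},
    "s10": {"allocatable": {"cpu": 4000, "memory": 8192},
            "requested": {"cpu": 500, "memory": 1024}},
}
```

Here s8 is filtered out for a pod that requests 200m of CPU, and s10 wins the scoring because it has the most unrequested capacity. Note that the model only ever looks at the "requested" totals, never at real consumption.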

At the same time, keep in mind that when the scheduler evaluates a node's available resources, it relies on the data stored in etcd, i.e. on the requests / limits declared by each pod running on that node, not on actual resource consumption. This information can be seen in the output of the kubectl describe node $NODE command, for example:

# kubectl describe nodes nxs-k8s-s1
Non-terminated Pods:         (9 in total)
  Namespace                  Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                         ------------  ----------  ---------------  -------------  ---
  ingress-nginx              nginx-ingress-controller-754b85bf44-qkt2t    0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
  kube-system                kube-flannel-26bl4                           150m (0%)     300m (1%)   64M (0%)         500M (1%)      233d
  kube-system                kube-proxy-exporter-cb629                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
  kube-system                kube-proxy-x9fsc                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
  kube-system                nginx-proxy-k8s-worker-s1                    25m (0%)      300m (1%)   32M (0%)         512M (1%)      233d
  nxs-monitoring             alertmanager-main-1                          100m (0%)     100m (0%)   425Mi (1%)       25Mi (0%)      233d
  nxs-logging                filebeat-lmsmp                               100m (0%)     0 (0%)      100Mi (0%)       200Mi (0%)     233d
  nxs-monitoring             node-exporter-v4gdq                          112m (0%)     122m (0%)   200Mi (0%)       220Mi (0%)     233d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                487m (3%)          822m (5%)
  memory             15856217600 (2%)  749976320 (3%)
  ephemeral-storage  0 (0%)             0 (0%)


Here we see all the pods running on a particular node, as well as the resources that each of them requests. The scheduler's decisions for a specific pod, such as cronjob-cron-events-1573793820-xt6q9, can be followed in the scheduler log when the logging level is set to 10 in the arguments of the start command (--v=10).

In that log, the scheduler first performs filtering and forms a list of 3 nodes on which the pod can run (nxs-k8s-s8, nxs-k8s-s9, nxs-k8s-s10). It then calculates points for each of these nodes by several parameters (including BalancedResourceAllocation and LeastResourceAllocation) to determine the most suitable node. Finally, the pod is scheduled on the node with the most points (here two nodes share the top score of 100037, so one of them is picked at random: nxs-k8s-s10).

Conclusion: if a node runs pods for which no restrictions are set, then for k8s (from the point of view of resource consumption) it is as if those pods did not exist on the node at all. So if you have a pod with a voracious process (for example, Wowza) and no restrictions are set for it, the pod may in fact have eaten all the resources of the node, while k8s still considers the node unloaded. During ranking, the node will get the same number of points (in the components that assess available resources) as a node with no running pods, which can ultimately lead to an uneven distribution of load between nodes.

Pod eviction

As you know, each pod is assigned one of 3 QoS classes:

  •     Guaranteed – assigned when every container in the pod has both request and limit set for memory and CPU, and these values match
  •     Burstable – the pod does not meet the Guaranteed criteria, but at least one container has a request or limit set (for example, request < limit)
  •     BestEffort – no container in the pod has any request or limit set
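
The classification above can be expressed as a short function. This is a sketch under stated assumptions: qos_class is a hypothetical helper, only cpu and memory are considered, and a missing request is treated as equal to the limit.

```python
def qos_class(containers):
    """Classify a pod by the requests / limits of its containers."""
    any_set = any(c.get("requests") or c.get("limits") for c in containers)
    if not any_set:
        return "BestEffort"
    guaranteed = all(
        # A limit must exist for the resource ...
        c.get("limits", {}).get(res) is not None
        # ... and the request (defaulting to the limit) must equal it.
        and c.get("requests", {}).get(res, c["limits"][res]) == c["limits"][res]
        for c in containers
        for res in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"
```

For instance, a pod whose containers only set CPU requests is Burstable, because no limits are present and memory is not covered at all.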

When a node runs short of resources (disk, memory), kubelet starts ranking and evicting pods according to an algorithm that takes into account the pod's priority and its QoS class. For example, for RAM, points are awarded based on the QoS class according to the following principle (the higher the score, the sooner the pod's processes are killed when memory runs out):

  •     Guaranteed: -998
  •     BestEffort: 1000
  •     Burstable: min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)

That is, at equal priority, kubelet evicts pods of the BestEffort QoS class from the node first.
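
To see what the Burstable formula produces, here is a small sketch. The function memory_score is a hypothetical name, the constants are taken from the list above, and integer division is an assumption of this sketch.

```python
def memory_score(qos, memory_request_bytes=0, machine_memory_capacity_bytes=1):
    """Return the score for a pod, using the constants from the list above."""
    if qos == "Guaranteed":
        return -998
    if qos == "BestEffort":
        return 1000
    # Burstable: the larger the share of node memory requested,
    # the lower (safer) the score, clamped to the range 2..999.
    raw = 1000 - (1000 * memory_request_bytes) // machine_memory_capacity_bytes
    return min(max(2, raw), 999)


GIB = 2**30
```

A Burstable pod that requests 1 GiB on a node with 4 GiB of memory gets 1000 - 250 = 750, so it is sacrificed before a Guaranteed pod but after a BestEffort one.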

Conclusion: if you want to reduce the likelihood that a particular pod is evicted from a node that runs short of resources, then besides its priority you also need to set requests / limits for it.

Application of horizontal pod auto-scaling mechanism (HPA)

When you need to automatically increase or decrease the number of pods depending on resource usage (system metrics such as CPU / RAM, or custom ones such as rps), a k8s entity called HPA (Horizontal Pod Autoscaler) can help. Its algorithm is as follows:

  •   The current value of the observed resource (currentMetricValue) is determined
  •   The desired value for the resource (desiredMetricValue) is determined; for system resources it is set relative to request
  •   The current number of replicas (currentReplicas) is determined
  •   The desired number of replicas (desiredReplicas) is calculated by the formula desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

However, scaling does not happen when the ratio (currentMetricValue / desiredMetricValue) is close to 1 (the allowed tolerance is configurable; by default it is 0.1).
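
The formula and the tolerance rule can be put together in a short sketch. The function desired_replicas and its min_replicas / max_replicas parameters are hypothetical names used for illustration.

```python
import math


def desired_replicas(current_replicas, current_metric, desired_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=None):
    """Apply desiredReplicas = ceil(currentReplicas * current / desired)."""
    ratio = current_metric / desired_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count alone.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Keep the result inside the configured bounds.
    desired = max(desired, min_replicas)
    if max_replicas is not None:
        desired = min(desired, max_replicas)
    return desired
```

For example, with 4 replicas, a current value of 31 and a target of 30, the ratio is about 1.03, which is inside the tolerance, so nothing changes. With a current value of 15 the ratio is 0.5 and the result is 2 replicas.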

Let's look at HPA using the app-test application (described as a Deployment), whose number of replicas has to change depending on CPU consumption:

  •     Application manifest
kind: Deployment
apiVersion: apps/v1beta2
metadata:
  name: app-test
spec:
  selector:
    matchLabels:
      app: app-test
  replicas: 2
  template:
    metadata:
      labels:
        app: app-test
    spec:
      containers:
      - name: nginx
        image: registry.nixys.ru/generic-images/nginx
        imagePullPolicy: Always
        resources:
          requests:
            cpu: 60m
        ports:
        - name: http
          containerPort: 80
      - name: nginx-exporter
        image: nginx/nginx-prometheus-exporter
        resources:
          requests:
            cpu: 30m
        ports:
        - name: nginx-exporter
          containerPort: 9113
        args:
        - -nginx.scrape-uri


That is, the pod with the application is initially started in two instances, each of which contains two containers, nginx and nginx-exporter, and a CPU request is set for each container.

  • HPA manifest
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: app-test-hpa
spec:
  maxReplicas: 10
  minReplicas: 2
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: app-test
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 30

We have created an HPA that watches the app-test Deployment and adjusts the number of application pods based on CPU usage (we expect a pod to consume 30% of the CPU it requests), keeping the number of replicas in the range 2-10.

Now let's look at how the HPA behaves when load is applied to one of the pods:

# kubectl top pod
 NAME                                                   CPU(cores)   MEMORY(bytes)
 app-test-78559f8f44-pgs58            101m         243Mi
 app-test-78559f8f44-cj4jz            4m           240Mi

In total we have the following:

  • Desired value (desiredMetricValue) – according to the HPA settings, it is 30%
  • Current value (currentMetricValue) – the controller-manager calculates the average resource consumption in %, i.e. roughly does the following:
  1. Gets the absolute values of the pod metrics from the metrics server, i.e. 101m and 4m
  2. Calculates the average absolute value, i.e. (101m + 4m) / 2 ≈ 53m
  3. Gets the absolute value of the requested resource (the requests of all containers in the pod are summed): 60m + 30m = 90m
  4. Calculates the average CPU consumption as a percentage of the pod's request, i.e. 53m / 90m * 100% ≈ 59%

Now we have everything needed to decide whether the number of replicas should change. To do this, we calculate the ratio:

ratio = 59% / 30% ≈ 1.97

The number of replicas should therefore be roughly doubled: ceil(2 * 1.97) = ceil(3.94) = 4.
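
The same arithmetic can be reproduced with the raw numbers from the kubectl top output. The function replicas_from_usage is a hypothetical helper, and because it does not round the intermediate values, they differ slightly from the rounded figures in the text while the final result is the same.

```python
import math


def replicas_from_usage(usage_millicores, request_millicores, target_percent):
    """Return (average utilization in %, desired number of replicas)."""
    average_usage = sum(usage_millicores) / len(usage_millicores)
    utilization = 100 * average_usage / request_millicores
    ratio = utilization / target_percent
    return utilization, math.ceil(len(usage_millicores) * ratio)


# Two pods using 101m and 4m; each pod requests 60m + 30m = 90m; target 30%.
utilization, replicas = replicas_from_usage([101, 4], 60 + 30, 30)
```

Here utilization comes out at about 58.3% and replicas at 4, matching the manual calculation.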

Conclusion: as you can see, for this mechanism to work, requests must be set for all containers in the observed pod.

The mechanism of horizontal auto-scaling of nodes (Cluster Autoscaler)

A tuned HPA alone is not enough to neutralize the negative impact of load bursts on the system. For example, based on the HPA settings, the controller manager decides that the number of replicas must be doubled, but the nodes have no free resources to run that many pods (i.e. no node can provide the resources that the pods request), so these pods go into the Pending state.

In this case, if the provider offers a suitable IaaS / PaaS (for example, GKE / GCE, AKS, EKS, etc.), a tool such as Cluster Autoscaler can help. It lets you set the minimum and maximum number of nodes in the cluster and automatically adjusts the current number of nodes (by calling the cloud provider API to order / remove nodes) when the cluster runs short of resources and pods cannot be scheduled (they are in the Pending state).

Conclusion: to be able to scale nodes automatically, you must set requests in the pod containers so that k8s can correctly assess how loaded the nodes are and, accordingly, report that the cluster has no resources left to start the next pod.
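
Why requests matter here can be shown with a toy model of the scale-up decision. This is not the real Cluster Autoscaler algorithm: nodes_to_add is a hypothetical function that looks at CPU requests only (in millicores) and places pods first-fit, largest first.

```python
def nodes_to_add(pending_cpu_requests, free_cpu_per_node, new_node_cpu,
                 current_nodes, max_nodes):
    """Count how many new nodes the Pending pods would need (toy model)."""
    free = list(free_cpu_per_node)
    added = 0
    for request in sorted(pending_cpu_requests, reverse=True):
        for i, capacity in enumerate(free):
            if request <= capacity:
                free[i] -= request  # fits on an existing (or just added) node
                break
        else:
            # No node has room: add one, unless that is impossible.
            if request > new_node_cpu or current_nodes + added >= max_nodes:
                continue
            free.append(new_node_cpu - request)
            added += 1
    return added
```

Two Pending pods that request 90m each, on a cluster whose nodes have only 50m and 20m unrequested, lead to one extra node of 200m. The same two pods with no requests (0m) appear to fit anywhere, so no node is ever added, no matter how much they actually consume.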


Note that setting resource restrictions for a container is not a prerequisite for launching the application successfully, but it is still worth doing for the following reasons:

  •     For more accurate scheduler operation when balancing load between k8s nodes
  •     To reduce the likelihood of pod eviction
  •     For horizontal autoscaling of application pods (HPA)
  •     For horizontal autoscaling of nodes (Cluster Autoscaler) with cloud providers