The Kubernetes Scheduler
By Andrew Chen and Dominik Tornow
Kubernetes is a Container Orchestration Engine designed to host containerized applications on a set of nodes, commonly referred to as a cluster. Using a systems modeling approach, this series aims to advance the understanding of Kubernetes and its underlying concepts.
The Kubernetes Scheduler is a core component of Kubernetes: After a user or a controller creates a Pod, the Kubernetes Scheduler, monitoring the Object Store for unassigned Pods, will assign the Pod to a Node. Then, the Kubelet, monitoring the Object Store for assigned Pods, will execute the Pod.
This blog post provides a concise, detailed model of the Kubernetes Scheduler. The model is supported by partial specifications in TLA+.
Scheduling
The task of the Kubernetes Scheduler is to choose a placement. A placement is a partial, non-injective assignment of a set of Pods to a set of Nodes.
Scheduling is an optimization problem: First, the Scheduler determines the set of feasible placements, which is the set of placements that meet a set of given constraints. Then, the Scheduler determines the set of viable placements, which is the set of feasible placements with the highest score.
The Kubernetes Scheduler is a multi-step scheduler: it places one Pod at a time, ensuring a local optimum for each Pod, rather than a single-step scheduler that computes a globally optimal placement for all Pods at once.
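To make the two phases concrete, the following Go sketch models the per-Pod decision as a filter-and-score pipeline. All names and types here are illustrative toys, not the Scheduler’s actual implementation:

```go
package main

import "fmt"

// Toy model: a Filter rejects infeasible Nodes, a Rater scores feasible ones.
type (
	Pod    struct{ Name string }
	Node   struct{ Name string }
	Filter func(p Pod, n Node) bool
	Rater  func(p Pod, n Node) int
)

// schedule returns the feasible Node with the highest score for Pod p,
// or nil if no Node is feasible. Each call optimizes a single Pod in
// isolation (a local optimum), not the placement of all Pods at once.
func schedule(p Pod, nodes []Node, filters []Filter, raters []Rater) *Node {
	var best *Node
	bestScore := 0
	for i := range nodes {
		feasible := true
		for _, f := range filters {
			if !f(p, nodes[i]) { // every filter must accept the pair
				feasible = false
				break
			}
		}
		if !feasible {
			continue
		}
		score := 0
		for _, r := range raters {
			score += r(p, nodes[i]) // ratings are summed
		}
		if best == nil || score > bestScore {
			best, bestScore = &nodes[i], score
		}
	}
	return best
}

func main() {
	nodes := []Node{{Name: "n1"}, {Name: "n2"}}
	filters := []Filter{func(Pod, Node) bool { return true }} // accept every Node
	raters := []Rater{func(_ Pod, n Node) int { // prefer "n2"
		if n.Name == "n2" {
			return 10
		}
		return 0
	}}
	fmt.Println(schedule(Pod{Name: "p"}, nodes, filters, raters).Name) // n2
}
```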
The Kubernetes Scheduler
Figure 5 depicts the Kubernetes Objects and attributes that are of interest to the Kubernetes Scheduler. Kubernetes represents
- a Pod as a Kubernetes Pod Object,
- a Node as a Kubernetes Node Object, and
- the assignment of a Pod to a Node as the Pod’s .Spec.NodeName.
A Pod Object is bound to a Node Object if the Pod’s .Spec.NodeName equals the Node’s .Name.
The task of the Kubernetes Scheduler can now more formally be described as: The Kubernetes Scheduler, for a Pod p, selects a Node n and updates(*) the Pod’s .Spec.NodeName so that BoundTo(p, n) is true.
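In code, the binding predicate is a one-liner. The toy types below are minimal projections of the real objects, for illustration only:

```go
package main

import "fmt"

// Minimal projections of the Pod and Node objects, for illustration only.
type Pod struct {
	Name string
	Spec struct{ NodeName string }
}
type Node struct{ Name string }

// BoundTo mirrors the predicate above: a Pod is bound to a Node
// if the Pod's .Spec.NodeName equals the Node's .Name.
func BoundTo(p Pod, n Node) bool {
	return p.Spec.NodeName == n.Name
}

func main() {
	n := Node{Name: "node-1"}
	p := Pod{Name: "pod-a"}
	p.Spec.NodeName = n.Name   // what the Scheduler's update(*) achieves
	fmt.Println(BoundTo(p, n)) // true
}
```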
The Control Loop
The Kubernetes Scheduler monitors the Kubernetes Object Store and selects the unbound Pod with the highest priority, for which it performs either a Scheduling Step or a Preemption Step.
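A minimal sketch of this loop, with illustrative names and with the two steps passed in as stubs:

```go
package main

import "fmt"

// Illustrative skeleton of the control loop; the names are assumptions,
// not the Scheduler's actual implementation.
type Pod struct {
	Name     string
	Priority int
	NodeName string // empty means unbound
}

// nextPod picks the unbound Pod with the highest priority.
func nextPod(pods []*Pod) *Pod {
	var best *Pod
	for _, p := range pods {
		if p.NodeName != "" {
			continue
		}
		if best == nil || p.Priority > best.Priority {
			best = p
		}
	}
	return best
}

// runOnce performs one iteration: try a Scheduling Step for the chosen
// Pod; if no Node is feasible, fall back to a Preemption Step.
func runOnce(pods []*Pod, trySchedule, tryPreempt func(*Pod) bool) {
	if p := nextPod(pods); p != nil {
		if !trySchedule(p) {
			tryPreempt(p)
		}
	}
}

func main() {
	pods := []*Pod{{Name: "a", Priority: 1}, {Name: "b", Priority: 5}}
	runOnce(pods,
		func(p *Pod) bool { p.NodeName = "node-1"; return true }, // stub Scheduling Step
		func(p *Pod) bool { return false },                       // stub Preemption Step
	)
	fmt.Println(pods[1].NodeName) // "node-1": the higher-priority Pod went first
}
```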
Scheduling Step
For a given Pod, the Scheduling Step is enabled if there exists at least one Node, such that the Node is feasible to host the Pod.
If the Scheduling Step is enabled, the Scheduler will bind the Pod to a feasible Node, such that the binding will achieve the highest possible viability.
If the Scheduling Step is not enabled, the Scheduler will attempt to perform a Preemption Step.
Preemption Step
For a given Pod, the Preemption Step is enabled if there exists at least one Node, such that the Node is feasible to host the Pod if a subset of Pods with lower priorities bound to this Node were to be deleted.
If the Preemption Step is enabled, the Scheduler will trigger the deletion of a subset of Pods with lower priorities bound to one Node, such that the Preemption Step will inflict the lowest possible casualties.
(Casualties are assessed in terms of Pod Disruption Budget (PDB) violations; the details are beyond the scope of this post.)
Note that the Scheduler does not guarantee that the Pod which triggered the Preemption Step will be bound to that Node in a subsequent Scheduling Step.
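The enabling condition of the Preemption Step can be sketched as follows. This toy model reduces feasibility to a single CPU resource and omits victim selection; the names and fields are assumptions:

```go
package main

import "fmt"

// Toy model of the Preemption Step's enabling condition; names, fields,
// and the single CPU resource are illustrative simplifications.
type Pod struct {
	Name     string
	Priority int
	CPU      int // requested CPU, in arbitrary units
}
type Node struct {
	Name     string
	Capacity int   // allocatable CPU
	Bound    []Pod // Pods currently bound to this Node
}

// preemptionEnabled reports whether Node n could host Pod p if all
// bound Pods with lower priority were deleted. The real Scheduler then
// selects a victim subset inflicting the lowest casualties (judged via
// PDB violations); victim selection is omitted here.
func preemptionEnabled(p Pod, n Node) bool {
	used := 0
	for _, q := range n.Bound {
		if q.Priority >= p.Priority {
			used += q.CPU // equal- or higher-priority Pods cannot be victims
		}
	}
	return used+p.CPU <= n.Capacity
}

func main() {
	n := Node{Name: "node-1", Capacity: 4, Bound: []Pod{{Name: "low", Priority: 1, CPU: 3}}}
	fmt.Println(preemptionEnabled(Pod{Name: "high", Priority: 10, CPU: 2}, n)) // true: deleting "low" frees enough CPU
}
```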
1. Feasibility
For each Pod, the Kubernetes Scheduler identifies the set of feasible Nodes, which is the set of Nodes that satisfy the constraints of the Pod.
Conceptually, the Kubernetes Scheduler defines a set of filter functions that, given a Pod and a Node, determine if the Node satisfies the constraints of the Pod. All filter functions must yield true for the Node to host the Pod.
The following subsections detail some of the available filter functions:
1.1 Schedulability and Lifecycle Phase
This filter function deems a Node feasible based on the Node’s schedulability and lifecycle phase. Node conditions are accounted for via taints and tolerations (see below).
1.2 Resource Requirements and Resource Availability
This filter function deems a Node feasible based on the Pod’s resource requirements and the Node’s resource availabilities.
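A sketch of this filter over a toy resource model (integer quantities per resource name; the field names are illustrative):

```go
package main

import "fmt"

// Toy resource model: integer quantities keyed by resource name
// (field names are illustrative).
type Resources map[string]int
type Pod struct{ Requests Resources }
type Node struct {
	Allocatable Resources
	Requested   Resources // sum of requests of Pods already bound here
}

// podFitsResources sketches the resource filter: every requested
// quantity must fit into the Node's remaining allocatable amount.
func podFitsResources(p Pod, n Node) bool {
	for name, req := range p.Requests {
		if n.Requested[name]+req > n.Allocatable[name] {
			return false
		}
	}
	return true
}

func main() {
	n := Node{
		Allocatable: Resources{"cpu": 4000, "memory": 8192},
		Requested:   Resources{"cpu": 3500},
	}
	fmt.Println(podFitsResources(Pod{Requests: Resources{"cpu": 1000}}, n)) // false: only 500 CPU units remain
}
```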
1.3 Node Selector
This filter function deems a Node feasible based on the Pod’s node selector values and the Node’s label values.
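Conceptually, this filter is a subset check: every key/value pair of the Pod’s .Spec.NodeSelector must appear among the Node’s labels. A minimal sketch:

```go
package main

import "fmt"

// matchNodeSelector sketches the node-selector filter: every key/value
// pair in the Pod's .Spec.NodeSelector must appear in the Node's labels.
func matchNodeSelector(nodeSelector, nodeLabels map[string]string) bool {
	for key, value := range nodeSelector {
		if nodeLabels[key] != value {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"disktype": "ssd"}
	labels := map[string]string{"disktype": "ssd", "zone": "us-east-1a"}
	fmt.Println(matchNodeSelector(selector, labels)) // true: labels are a superset
}
```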
1.4 Node Taints and Pod Tolerations
This filter function deems a Node feasible based on the Node’s taints’ key-value pairs and the Pod’s tolerations’ key-value pairs.
A Pod may be bound to a Node only if the Node’s Taints are matched by the Pod’s Tolerations; a Pod must not be bound to a Node whose Taints the Pod does not tolerate.
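A simplified sketch of this filter (the real model also carries taint effects such as NoSchedule and toleration operators such as Exists, which are omitted here):

```go
package main

import "fmt"

// Simplified taint/toleration model; the real objects also carry
// effects (e.g. NoSchedule) and operators (e.g. Exists).
type Taint struct{ Key, Value string }
type Toleration struct{ Key, Value string }

// toleratesTaints sketches the filter: every taint on the Node must be
// matched by some toleration on the Pod.
func toleratesTaints(taints []Taint, tolerations []Toleration) bool {
	for _, t := range taints {
		tolerated := false
		for _, tol := range tolerations {
			if tol.Key == t.Key && tol.Value == t.Value {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: "gpu", Value: "true"}}
	fmt.Println(toleratesTaints(taints, nil))                                       // false: taint not tolerated
	fmt.Println(toleratesTaints(taints, []Toleration{{Key: "gpu", Value: "true"}})) // true
}
```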
1.5 Required Affinity
This filter function deems a Node feasible based on the Pod’s required Node Affinity Terms, Pod Affinity Terms, and Pod Anti-Affinity Terms.
Node Affinity
A Pod must be assigned to a Node whose labels match the Pod’s Node Affinity Requirements; it must not be assigned to a Node whose labels do not.
Pod Affinity
A Pod must be assigned to a Node such that at least one Pod on a Node matching the TopologyKey matches the Pod’s Pod Affinity Requirements.
Pod Anti-Affinity
A Pod must be assigned to a Node such that no Pod on a Node matching the TopologyKey matches the Pod’s Pod Anti-Affinity Requirements.
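The anti-affinity check illustrates the role of the TopologyKey: the constraint is evaluated not against a single Node but against all Nodes in the same topology domain. A toy sketch, with illustrative types:

```go
package main

import "fmt"

// Toy model of the pod anti-affinity check. A topology domain is the
// set of Nodes sharing the same value for the TopologyKey (e.g. all
// Nodes in the same zone).
type Pod struct{ Labels map[string]string }
type Node struct {
	Labels map[string]string
	Pods   []Pod // Pods bound to this Node
}

// violatesAntiAffinity reports whether placing a Pod on candidate would
// violate an anti-affinity term: true if any Pod running in the
// candidate's topology domain carries the term's matchLabels.
func violatesAntiAffinity(candidate Node, all []Node, topologyKey string, matchLabels map[string]string) bool {
	domain := candidate.Labels[topologyKey]
	for _, n := range all {
		if n.Labels[topologyKey] != domain {
			continue // different topology domain
		}
		for _, p := range n.Pods {
			if matches(p.Labels, matchLabels) {
				return true
			}
		}
	}
	return false
}

func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	zoneA := map[string]string{"topology.kubernetes.io/zone": "a"}
	n1 := Node{Labels: zoneA, Pods: []Pod{{Labels: map[string]string{"app": "db"}}}}
	n2 := Node{Labels: zoneA}
	// Another "app: db" Pod may not land anywhere in zone "a".
	fmt.Println(violatesAntiAffinity(n2, []Node{n1, n2}, "topology.kubernetes.io/zone", map[string]string{"app": "db"})) // true
}
```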
2. Viability
For each Pod, the Kubernetes Scheduler identifies the set of feasible Nodes, as described above. From this set, the Scheduler then identifies the feasible Nodes with the highest viability.
Conceptually, the Kubernetes Scheduler defines a set of rating functions that, given a Pod and a Node, determine the viability of the Pod/Node pair. The ratings of all rating functions are summed to produce the Node’s score.
The following subsection details one of the available rating functions:
2.1 Preferred Affinity
This rating function scores a Node’s viability based on the Pod’s preferred Node Affinity Terms, Pod Affinity Terms, and Pod Anti-Affinity Terms.
The rating, as sketched below, is the
- Sum of the Term Weights for each matching Node Selector Term, plus the
- Sum of the Term Weights for each matching Pod Affinity Term, minus the
- Sum of the Term Weights for each matching Pod Anti-Affinity Term
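A toy sketch of this computation (the term and field names are illustrative, and matching anti-affinity terms are counted negatively):

```go
package main

import "fmt"

// Toy model of the preferred-affinity rating. Each term carries a
// weight; the rating sums the weights of matching terms, with matching
// anti-affinity terms counted negatively.
type WeightedTerm struct {
	Weight  int
	Matches bool // whether the term matches the Node under consideration
}

func rate(nodeAffinity, podAffinity, podAntiAffinity []WeightedTerm) int {
	score := 0
	for _, t := range nodeAffinity {
		if t.Matches {
			score += t.Weight
		}
	}
	for _, t := range podAffinity {
		if t.Matches {
			score += t.Weight
		}
	}
	for _, t := range podAntiAffinity {
		if t.Matches {
			score -= t.Weight // anti-affinity matches lower the score
		}
	}
	return score
}

func main() {
	fmt.Println(rate(
		[]WeightedTerm{{Weight: 10, Matches: true}},
		[]WeightedTerm{{Weight: 5, Matches: false}},
		[]WeightedTerm{{Weight: 3, Matches: true}},
	)) // 7 = 10 + 0 - 3
}
```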
Case Study
Figure 6 depicts a case study involving two different types of Nodes and two different types of Pods:
- 9 Nodes without GPU resources
- 6 Nodes with GPU resources
The objective of the case study is to ensure that:
- Pods that do not require GPUs are assigned to Nodes without GPUs
- Pods that do require GPUs are assigned to Nodes with GPUs
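One way to achieve this, sketched below with the label and taint keys as pure placeholders, is to both label and taint the GPU Nodes: a node selector steers GPU Pods onto GPU Nodes, while the taint keeps non-GPU Pods off them.

```go
package main

import "fmt"

// Sketch of the case-study setup, combining the node-selector and
// taint/toleration filters from above. The "gpu" key is a placeholder.
type Node struct {
	Labels map[string]string
	Taints map[string]string
}
type Pod struct {
	NodeSelector map[string]string
	Tolerations  map[string]string
}

// feasible combines the two filters needed for this case study.
func feasible(p Pod, n Node) bool {
	for k, v := range p.NodeSelector { // selector must match Node labels
		if n.Labels[k] != v {
			return false
		}
	}
	for k, v := range n.Taints { // every taint must be tolerated
		if p.Tolerations[k] != v {
			return false
		}
	}
	return true
}

func main() {
	gpuNode := Node{Labels: map[string]string{"gpu": "true"}, Taints: map[string]string{"gpu": "true"}}
	cpuNode := Node{}
	gpuPod := Pod{NodeSelector: map[string]string{"gpu": "true"}, Tolerations: map[string]string{"gpu": "true"}}
	plainPod := Pod{}

	fmt.Println(feasible(gpuPod, gpuNode))   // true:  GPU Pod lands on a GPU Node
	fmt.Println(feasible(gpuPod, cpuNode))   // false: selector keeps it off CPU Nodes
	fmt.Println(feasible(plainPod, gpuNode)) // false: taint keeps it off GPU Nodes
	fmt.Println(feasible(plainPod, cpuNode)) // true
}
```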
(*) Kubernetes Binding Objects
The blog post states that the Kubernetes Scheduler binds a Pod to a Node by setting the Pod’s .Spec.NodeName to the Node’s .Name. However, the Scheduler does not set the .Spec.NodeName directly, but indirectly.
The Kubernetes Scheduler is not permitted to update a Pod’s .Spec. Therefore, instead of updating the Pod, the Kubernetes Scheduler creates a Kubernetes Binding Object. On creation of a Binding Object, the Kubernetes API will update the Pod’s .Spec.NodeName.
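With client-go, a binding looks roughly as follows. This is a hedged sketch: client construction and error handling are elided, and the "default" namespace is a placeholder:

```go
package scheduler

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// bind creates a Binding Object naming the Pod and targeting the Node;
// on creation, the Kubernetes API sets the Pod's .Spec.NodeName.
func bind(ctx context.Context, client kubernetes.Interface, podName, nodeName string) error {
	binding := &v1.Binding{
		ObjectMeta: metav1.ObjectMeta{Name: podName, Namespace: "default"},
		Target:     v1.ObjectReference{Kind: "Node", Name: nodeName},
	}
	return client.CoreV1().Pods("default").Bind(ctx, binding, metav1.CreateOptions{})
}
```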
About this post
This blog post is a summary of a session on the Kubernetes Scheduler held in the KubeCon 2018 Contributor Summit’s “Unconference” track, hosted by Google and SAP.