Author: Derek Carr
Last Updated: 04/17/2017
Status: Pending Review
This proposal describes a mechanism to extend Kubernetes via a custom node isolator and scheduler to support containers that need to avoid cross-NUMA node memory access.
The solution is intended to enable the scheduler to support individual NUMA node topology aware scheduling decisions that are enforced by a node isolator extension in the kubelet.
Non-uniform memory architecture (NUMA) describes multi-socket machines that subdivide memory into nodes where each node is associated with a list of CPU cores. This architecture is the norm for modern machines.
An interconnect bus provides connections between nodes so each CPU can access all memory. The interconnect can be overwhelmed by concurrent cross-node traffic, and as a result, processes that need to access memory on a different node can experience increased latency.
Consequently, many applications see a performance benefit when the workload is affined to a particular NUMA node and its CPU core(s).
TODO:
- Insert details on how node topology is introspected.
- Document an extension API server (similar to service-catalog) that exposes additional per-node information used by a custom scheduler: NUMATopology, HugePages configuration, etc.
- Describe pod.Spec.nodeOpaqueBindings to hold node local assignment information.
In order to support NUMA affined workloads, the Node must make its NUMA topology available for introspection by other agents that schedule pods. This proposal recommends that the NodeStatus be augmented as follows:
```go
// NodeStatus is information about the current status of a node.
type NodeStatus struct {
	...
	// Topology represents the NUMA topology of a node to aid NUMA aware scheduling.
	// +optional
	Topology NUMATopology
}

// NUMATopology describes the NUMA topology of a node.
type NUMATopology struct {
	// NUMANodes represents the list of NUMA nodes in the topology.
	NUMANodes []NUMANode
}

// NUMANode describes a single NUMA node.
type NUMANode struct {
	// NUMANodeID identifies a NUMA node on a single host.
	NUMANodeID string
	// Capacity represents the total resources associated with the NUMA node.
	// cpu: <number of cores>, e.g. 4
	// memory: <amount of memory in normal page size>
	// hugepages: <amount of memory in huge page size>
	Capacity ResourceList
	// Allocatable represents the resources of a NUMA node that are available for scheduling.
	// +optional
	Allocatable ResourceList
	// CPUSet represents the physical numbers of the CPU cores
	// associated with this node.
	// Example: 0-3 or 0,2,4,6
	// The values are expressed in the List Format syntax specified
	// here: http://man7.org/linux/man-pages/man7/cpuset.7.html
	CPUSet string
}
```
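For illustration, a two-socket machine with four cores per socket might report a Topology like the sketch below. The types are the proposed ones above, quantities use k8s.io/apimachinery/pkg/api/resource, and all concrete values (IDs, sizes, hugepages split) are hypothetical:

```go
topology := NUMATopology{
	NUMANodes: []NUMANode{
		{
			// Socket 0: cores 0-3, with part of its memory pre-allocated as huge pages.
			NUMANodeID: "0",
			Capacity: ResourceList{
				"cpu":       resource.MustParse("4"),
				"memory":    resource.MustParse("30Gi"),
				"hugepages": resource.MustParse("2Gi"),
			},
			CPUSet: "0-3",
		},
		{
			// Socket 1: cores 4-7.
			NUMANodeID: "1",
			Capacity: ResourceList{
				"cpu":       resource.MustParse("4"),
				"memory":    resource.MustParse("30Gi"),
				"hugepages": resource.MustParse("2Gi"),
			},
			CPUSet: "4-7",
		},
	},
}
```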
By default, load balancing is done across all CPUs, except those marked isolated using the kernel boot time isolcpus= argument. When configuring a node to support CPU and NUMA affinity, many operators may wish to isolate host processes to particular cores. It is recommended that operators set a CPU value for --system-reserved in whole cores that aligns with the set of CPUs that are made available to the default kernel scheduling algorithm. If an operator is on a systemd managed platform, they may choose instead to set the CPUAffinity value for the root slice to the set of CPU cores that are reserved for the host processes.
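As a rough illustration of how an agent on the node could discover which cores the kernel has isolated (and therefore which cores remain for the default scheduler and for --system-reserved alignment), the isolcpus= setting is reflected in sysfs. The snippet below is only a sketch and assumes the standard /sys/devices/system/cpu/isolated and /sys/devices/system/cpu/online files:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// readCPUList reads a sysfs file containing a cpuset List Format string
// (e.g. "0-3" or "0,2,4-7") and returns it trimmed.
func readCPUList(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	// Cores removed from the default kernel scheduler via isolcpus=.
	isolated, err := readCPUList("/sys/devices/system/cpu/isolated")
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not read isolated cpus:", err)
		os.Exit(1)
	}
	// All online cores on the host.
	online, _ := readCPUList("/sys/devices/system/cpu/online")

	fmt.Printf("online cpus:   %s\n", online)
	fmt.Printf("isolated cpus: %s\n", isolated)
	// An operator would align --system-reserved (or the root slice CPUAffinity)
	// with the cores *not* listed in the isolated set.
}
```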
TODO
- How should the kubelet discover the reserved cpu-set value?
- In a NUMA system, the kubelet reservation for memory needs to be removed from a particular NUMA node's capacity so that the NUMA node allocatable is as expected.
The following Taint keys are defined to enable CPU pinning and NUMA awareness.

- Effect: NoScheduleNoAdmitNoExecute
- Potential values: dedicated

If dedicated, all pods that match this taint will require dedicated compute resources. Each pod bound to this node must request CPU in whole cores. The CPU limit must equal the request.

- Effect: NoScheduleNoAdmitNoExecute
- Potential values: strict

If strict, all pods that match this taint must request CPU (whole or fractional cores) that fits a single NUMA node's CPU allocatable.

- Effect: NoScheduleNoAdmitNoExecute
- Potential values: strict, preferred

If strict, all pods that match this taint must request memory that fits their assigned NUMA node's memory allocatable. If preferred, pods that match this taint are not required to have their memory request fit their assigned NUMA node's memory allocatable.
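The proposal text above does not spell out the taint key names, so the sketch below uses hypothetical keys that mirror the CPUAffinity and NUMAAffinity tolerations referenced later in this document, and it assumes the proposed NoScheduleNoAdmitNoExecute effect has been added as a TaintEffect constant. It only illustrates how a node taint and a matching pod toleration would pair up:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
)

// NoScheduleNoAdmitNoExecute is the taint effect proposed in this document;
// it does not exist in the current core API and is declared here only for
// illustration.
const NoScheduleNoAdmitNoExecute corev1.TaintEffect = "NoScheduleNoAdmitNoExecute"

func main() {
	// Hypothetical taints applied by an operator to a node reserved for
	// dedicated, NUMA-affined workloads.
	nodeTaints := []corev1.Taint{
		{Key: "CPUAffinity", Value: "dedicated", Effect: NoScheduleNoAdmitNoExecute},
		{Key: "NUMAAffinity", Value: "strict", Effect: NoScheduleNoAdmitNoExecute},
	}

	// A pod that wants CPU pinning and strict NUMA placement tolerates
	// exactly those taints; the kubelet then pends the pod until its
	// CPUAffinity / NUMANodeID fields are populated.
	podTolerations := []corev1.Toleration{
		{Key: "CPUAffinity", Operator: corev1.TolerationOpEqual, Value: "dedicated", Effect: NoScheduleNoAdmitNoExecute},
		{Key: "NUMAAffinity", Operator: corev1.TolerationOpEqual, Value: "strict", Effect: NoScheduleNoAdmitNoExecute},
	}

	_ = nodeTaints
	_ = podTolerations
}
```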
The following API changes are proposed to the PodSpec to allow CPU and NUMA affinity to be defined.
```go
// PodSpec is a description of a pod.
type PodSpec struct {
	...
	// NodeName is a request to schedule this pod onto a specific node. If it is non-empty,
	// the scheduler simply schedules this pod onto that node, assuming that it fits resource
	// requirements.
	// +optional
	NodeName string
	// NUMANodeID identifies a NUMA node that affines the pod. If it is non-empty, the value must
	// correspond to a particular NUMA node on the same node that the pod is scheduled against.
	// This value is only set if either the `CPUAffinity` or `NUMACPUAffinity` tolerations
	// are present on the pod.
	// +optional
	NUMANodeID string
	// CPUAffinity controls the CPU affinity of the executed pod.
	// If it is non-empty, the value must correspond to a particular set
	// of CPU cores in the matching NUMA node on the machine that the pod is scheduled against.
	// This value is only set if either the `CPUAffinity` or `NUMACPUAffinity` tolerations
	// are present on the pod.
	// The values are expressed in the List Format syntax specified here:
	// http://man7.org/linux/man-pages/man7/cpuset.7.html
	// +optional
	CPUAffinity string
}
```
The /pod/<pod-name>/bind operation will allow updating the NUMA and CPU affinity values. The same permissions required to schedule a pod to a node in the cluster will be required to bind a pod to a particular NUMA node and CPU set.
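The proposal does not fix the request body for this extended bind operation. Purely as a sketch, assuming the existing Binding object were extended with fields mirroring the PodSpec additions above, a NUMA-aware scheduler (or a node isolator acting as a second-phase scheduler) might issue something like:

```go
// NUMABinding is a hypothetical extension of the core Binding object, shown
// only to illustrate what information the bind operation would need to carry.
type NUMABinding struct {
	// Name of the pod being bound.
	PodName string
	// Node the pod is bound to (as with the existing Binding target).
	NodeName string
	// NUMANodeID and CPUAffinity mirror the proposed PodSpec fields and would be
	// copied into the pod when the bind is accepted.
	NUMANodeID  string
	CPUAffinity string
}

// Example: bind a pod to NUMA node "0" and pin it to cores 2 and 3.
var example = NUMABinding{
	PodName:     "dpdk-worker-0",
	NodeName:    "node-a",
	NUMANodeID:  "0",
	CPUAffinity: "2-3",
}
```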
Pods that require CPU and NUMA affinity prior to execution must set the appropriate Tolerations for the associated taints.
If a pod has multiple containers, all of the containers must fit on a single NUMA node, and the set of affined CPUs is shared among the containers.
Pod level cgroups are used to actually affine the containers to the specified CPU set.
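As a sketch of that enforcement step, assuming cgroup v1 with the cpuset controller mounted at /sys/fs/cgroup/cpuset and a kubepods-style pod cgroup path (both assumptions, not part of the proposal), the node isolator would write the pod's assigned CPU list into the pod-level cgroup so that every container in the pod inherits it:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// affinePodCgroup writes the assigned CPU list (cpuset List Format, e.g. "2-3")
// and NUMA memory node into the pod-level cpuset cgroup. Child container
// cgroups are constrained to a subset of their parent, so all containers in
// the pod share the assigned cores.
func affinePodCgroup(podCgroupPath, cpus, mems string) error {
	base := filepath.Join("/sys/fs/cgroup/cpuset", podCgroupPath)
	if err := os.WriteFile(filepath.Join(base, "cpuset.cpus"), []byte(cpus), 0644); err != nil {
		return fmt.Errorf("writing cpuset.cpus: %w", err)
	}
	if err := os.WriteFile(filepath.Join(base, "cpuset.mems"), []byte(mems), 0644); err != nil {
		return fmt.Errorf("writing cpuset.mems: %w", err)
	}
	return nil
}

func main() {
	// Hypothetical pod cgroup path; the real path depends on the cgroup driver
	// and QoS class layout used by the kubelet.
	if err := affinePodCgroup("kubepods/pod-example-uid", "2-3", "0"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```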
Operators must be able to limit the consumption of dedicated CPU cores via quota.
The kubelet will enforce the presence of the required pod tolerations assigned to the node. The kubelet will pend the execution of any pod that is assigned to the node, but has not populated the required fields for a particular toleration.

- If the toleration CPUAffinity is present on a Pod, the pod will not start any associated container until the Pod.Spec.CPUAffinity is populated.
- If the toleration NUMAAffinity is present on a Pod, the pod will not start any associated container until the Pod.Spec.NUMANodeID is populated.
The delayed execution of the pod enables both single-phase and dual-phase schedulers to place pods on a particular NUMA node and set of CPU cores.
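A minimal sketch of the check the kubelet could perform before starting a pod's containers, assuming corev1 is k8s.io/api/core/v1 with the proposed PodSpec fields added and using the toleration names from this document; the helper name and its placement in the kubelet are illustrative, not part of the proposal:

```go
// podReadyToStart reports whether a pod that has opted into CPU/NUMA affinity
// via tolerations has had its placement fields populated by the scheduler (or
// a second-phase binder). Until it returns true, the kubelet pends the pod
// and starts no containers.
func podReadyToStart(pod *corev1.Pod) bool {
	for _, t := range pod.Spec.Tolerations {
		switch t.Key {
		case "CPUAffinity":
			if pod.Spec.CPUAffinity == "" { // proposed field, see the PodSpec changes above
				return false
			}
		case "NUMAAffinity":
			if pod.Spec.NUMANodeID == "" { // proposed field, see the PodSpec changes above
				return false
			}
		}
	}
	return true
}
```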
- Pod level cgroup support roll-out
- Implement support for the NoScheduleNoAdmitNoExecute taint effect
- Expose NUMA topology in cAdvisor
- Expose NUMA topology in node status
- Pod level cgroup support for enabling cpu set
- Author NUMATopologyPredicate in scheduler to enable NUMA aware scheduling
- Restrict vertical autoscaling of CPU and NUMA affined workloads
Is this pod/container-specific CPU core affinity binding available?
With the "Reserved CPU List" and "static policy" features introduced in version 1.17, Guaranteed pods with integer CPU requests are assigned exclusive CPUs. But I believe this still does not set the affinity of the container to a specific CPU or set of CPUs.