NOTE: This seems to have fixed our cluster. BUT I still see some people reporting the same issue on cgroup2, for example here. So YMMV.
DISCLAIMER: This works in our env; it may not work in others. I'm still not sure what the real root cause(s) is. I'm not even 100% sure it fully fixes it in our env; it's been good for 2 weeks. But if it reappears (for example, under certain use cases, high load or something), I'll be doomed.
Switching to cgroup v2 seems to have fixed the issue of NVML suddenly going away in pods.
nvidia-smi no longer sees the GPUs in the container after a few random hours.
When the pod is first brought up, it sees the GPUs correctly:
$ k exec -it gengwg-test -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-2539bcac-ab41-8855-e99e-868518451d27)
After a few hours, it starts breaking:
$ k exec -it gengwg-test -- nvidia-smi -L
Failed to initialize NVML: Unknown Error
command terminated with exit code 255
This is actually pretty easy to reproduce.
Schedule a pod to some node. It should run nvidia-smi successfully at first:
$ k exec -it gengwg-test -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-2539bcac-ab41-8855-e99e-868518451d27)
Do a systemd reload on that node:
# systemctl daemon-reload
You will see the error immediately:
$ k exec -it gengwg-test -- nvidia-smi -L
Failed to initialize NVML: Unknown Error
command terminated with exit code 255
If you set the security context to privileged, it doesn't break any more. The drawback is that GPU accounting is no longer accurate, but that's better than killing the jobs. If you are under a paper submission deadline, you can use this to unblock users temporarily before implementing a long-term fix like the one below. Just add the following to your pod manifest:
securityContext:
  privileged: true
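In context, a minimal pod manifest with the workaround might look like this (the pod name and image are placeholders, and the GPU resource name assumes the standard NVIDIA device plugin):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gengwg-test          # placeholder name
spec:
  containers:
  - name: gpu-test
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # placeholder image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1    # request one GPU from the device plugin
    securityContext:
      privileged: true       # the temporary workaround
```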
Switching to cgroup v2 seems to have fixed it for us. Below are the steps.
Before proceeding, verify that all components are capable of/compatible with cgroup v2. ALL of them need to support cgroup v2; if any of them doesn't, I think this may not work for you. Below are a few I checked (this list may not be exhaustive). First, the kernel must expose the cgroup2 filesystem:
# grep cgroup /proc/filesystems
nodev cgroup
nodev cgroup2
Kernel: version 5.2 or later is recommended:
$ uname -r
5.4.0-105.119.1.ubuntu.x86_64
$ systemctl --version
systemd 239 (239-58.el8)
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=legacy
# rpm -qa | grep libnvidia-container
libnvidia-container1-1.10.0-1.x86_64
libnvidia-container-tools-1.10.0-1.x86_64
runc fully supports cgroup v2 (unified mode) since v1.0.0-rc93.
# runc --version
runc version 1.1.2
commit: v1.1.2-0-ga916309
spec: 1.0.2-dev
go: go1.17.11
libseccomp: 2.5.2
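The checks above can be scripted as a quick pre-flight; a rough sketch (ver_ge is a hypothetical helper, not part of any of these tools):

```shell
#!/bin/sh
# Sketch: pre-flight checks before switching a node to cgroup v2.

# ver_ge is a hypothetical helper: true if $1 >= $2 in version order.
ver_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Kernel 5.2+ is recommended for cgroup v2.
kernel=$(uname -r | cut -d- -f1)
if ver_ge "$kernel" 5.2; then
    echo "kernel $kernel: OK"
else
    echo "kernel $kernel: too old for cgroup v2"
fi

# The kernel must expose the cgroup2 filesystem.
if grep -q cgroup2 /proc/filesystems; then
    echo "cgroup2 filesystem: OK"
else
    echo "cgroup2 filesystem: missing"
fi
```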
On systemd-based distros, cgroup v2 can be enabled by adding systemd.unified_cgroup_hierarchy=1 to the kernel cmdline: add it to the GRUB_CMDLINE_LINUX line in /etc/default/grub. It should look something like this:
# vim /etc/default/grub
....
GRUB_CMDLINE_LINUX="xxxxxxxx apparmor=0 systemd.unified_cgroup_hierarchy=1"
....
I omitted the other parts of the line. Keep everything else unchanged; just append that item.
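The append can also be scripted; here is a sketch, demonstrated on a scratch copy so it is safe to run anywhere (on a real node, point GRUB_FILE at /etc/default/grub and back the file up first):

```shell
# Demonstrated on a scratch copy; on a real node set GRUB_FILE=/etc/default/grub
# and keep a backup before editing.
GRUB_FILE=$(mktemp)
echo 'GRUB_CMDLINE_LINUX="ro quiet apparmor=0"' > "$GRUB_FILE"

# Append the parameter only if it is not already present (idempotent).
grep -q 'systemd.unified_cgroup_hierarchy=1' "$GRUB_FILE" || \
    sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"$/\1 systemd.unified_cgroup_hierarchy=1"/' "$GRUB_FILE"

cat "$GRUB_FILE"
```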
NOTE: I also disabled AppArmor (apparmor=0) because it was breaking containerd. I'm not totally sure whether it's needed for cgroup v2; if systemd.unified_cgroup_hierarchy=1 alone is not enough, you may try adding apparmor=0 too.
Then regenerate the grub config:
# grub2-mkconfig -o /boot/grub2/grub.cfg
Generating grub configuration file ...
Adding boot menu entry for EFI firmware configuration
done
Drain the node first so the reboot doesn't affect users:
k drain xxxx # with other necessary options
Then reboot:
# reboot
After reboot, you should see this:
$ cat /proc/cmdline
.... apparmor=0 systemd.unified_cgroup_hierarchy=1 ....
Confirm it's cgroup v2:
# stat -fc %T /sys/fs/cgroup/
cgroup2fs
Testing the fix is easy, because it doesn't require waiting a few hours/days; the daemon-reload trigger reproduces the failure immediately.
Before:
$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)
Do the reload on that node itself:
# systemctl daemon-reload
After:
$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)
This reproduces what users see on their end.
I applied the above fix to 4 nodes, and scheduled 4 pods on them, one per node, each requesting 1 GPU. (Since these are long-running, I only requested 1 GPU each, leaving 7 GPUs per node for users to schedule on.)
They have been running for 4 days without issue. (Note the AGE field.)
$ k get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gengwg-test-gpu-1 1/1 Running 0 3d10h 2620:10d:xxxxx node0021 <none> <none>
gengwg-test-gpu-2 1/1 Running 0 3d10h 2620:10d:xxxxx node0023 <none> <none>
gengwg-test-gpu-3 1/1 Running 0 3d10h 2620:10d:xxxxx node0024 <none> <none>
gengwg-test-gpu-4 1/1 Running 0 3d10h 2620:10d:xxxxx node0056 <none> <none>
Check nvidia-smi:
$ for i in {1..4}; do k exec -it gengwg-test-gpu-$i -- nvidia-smi -L ; done
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-29b18a6d-4246-6edd-d102-92c3dbbec667)
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-8b009714-3ee2-ac82-6b2f-4ebf8d103a7c)
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-3836675c-e987-1f01-7ce7-12da20038909)
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-d9050ed3-2859-bcde-a56f-f1562a403db9)
That made me more confident that this is the fix (at least for our environment).
The cgroup driver for kubelet, docker, and containerd is systemd in all cases.
# cat /etc/systemd/system/kubelet.service | grep -i cgroup
--runtime-cgroups=/systemd/system.slice \
--kubelet-cgroups=/systemd/system.slice \
--cgroup-driver=systemd \
We are in the middle of migrating from docker to containerd, so we have both docker and containerd nodes. This seems to have fixed it for BOTH.
Docker nodes:
# docker info | grep -i cgroup
WARNING: No swap limit support
Cgroup Driver: systemd
Cgroup Version: 2
cgroupns
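If a docker node still reports Cgroup Driver: cgroupfs, the driver can be switched to systemd in /etc/docker/daemon.json (a standard dockerd option; restart docker afterwards):

```json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
```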
Containerd nodes:
$ sudo crictl info | grep -i cgroup
"SystemdCgroup": true
"SystemdCgroup": true
"systemdCgroup": false,
"disableCgroup": false,
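For reference, SystemdCgroup comes from the runc runtime options in /etc/containerd/config.toml; with the typical containerd 1.6 CRI layout, the relevant section looks like:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
```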
Here is our k8s version:
$ k version --short
Client Version: v1.21.3
Server Version: v1.22.9
containerd version:
# containerd --version
containerd containerd.io 1.6.6 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
nvidia-container-toolkit version:
# dnf info nvidia-container-toolkit | grep Version
Version : 1.11.0
It didn't fix my cluster. I set cgroup-driver=cgroupfs on docker and k8s to fix my cluster.