Fix for jobs that originally see the GPUs fine, then NVML suddenly goes away after a few hours

NOTE: This seems to have fixed our cluster. BUT I do see some people still reporting the same issue on cgroup v2, for example here. So YMMV.

DISCLAIMER: This seems to work in our env; it may not work in others. I'm still not sure what the real root cause(s) are. I'm not even 100% sure it fully fixes the issue in our env - it's been good for 2 weeks. But if it reappears (for example, under certain use cases such as high load), I'll be doomed.

TLDR

Switching to cgroup v2 seems to have fixed the issue of NVML suddenly going away inside pods.

Problem

nvidia-smi no longer sees the GPUs in the container after a few random hours.

When the pod is first brought up, it sees the GPUs correctly:

$ k exec -it gengwg-test -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-2539bcac-ab41-8855-e99e-868518451d27)

After a few hours, it starts breaking:

$ k exec -it gengwg-test -- nvidia-smi -L
Failed to initialize NVML: Unknown Error
command terminated with exit code 255

Reproduce

This is actually pretty easy to reproduce.

Schedule a pod to some node. It should run nvidia-smi successfully at first:

$ k exec -it gengwg-test -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-2539bcac-ab41-8855-e99e-868518451d27)

Do a systemd reload on that node:

# systemctl daemon-reload

You will see the error immediately:

$ k exec -it gengwg-test -- nvidia-smi -L
Failed to initialize NVML: Unknown Error
command terminated with exit code 255
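
To make the reproduction easy to re-run, here is a small sketch that wraps the steps above in a script. The pod name is an assumption (any running 1-GPU test pod works), and the daemon-reload step still has to be run on the node itself as root:

POD=gengwg-test   # assumed test pod name; replace with your own

echo "Before daemon-reload:"
kubectl exec "$POD" -- nvidia-smi -L

read -rp "Now run 'systemctl daemon-reload' as root on the pod's node, then press Enter... "

echo "After daemon-reload (expect: Failed to initialize NVML: Unknown Error):"
kubectl exec "$POD" -- nvidia-smi -L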

Temp Hack

If you set the security context to privileged, it doesn't break any more. The drawback is that the GPU accounting is not accurate, but that's better than killing the jobs. If you are under a paper submission deadline, you can use this to unblock users temporarily before implementing a long-term fix like the one below. Just add the following to your pod manifest:

    securityContext:
      privileged: true
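
For context, here is a complete (hypothetical) pod manifest with the workaround applied, submitted via a heredoc; the pod name, image, and GPU count are placeholders, not something from our cluster:

# Hypothetical example pod using the privileged workaround.
# Name, image, and GPU count are placeholders; adjust to your environment.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gengwg-test-privileged
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # placeholder image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
    securityContext:
      privileged: true
EOF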

Long Term Fix

Switching to cgroup v2 seems to fix it for us. Below are the steps.

Preflight

Before proceeding, you need to verify that all components are capable of and compatible with cgroup v2. ALL of them need to support cgroup2. If any of them is not capable of cgroup v2, I think this may not work for you. Below are a few I checked; the list may not be exhaustive.

node

# grep cgroup /proc/filesystems
nodev	cgroup
nodev	cgroup2

Recommended kernel version: 5.2 or later:

$ uname -r
5.4.0-105.119.1.ubuntu.x86_64

systemd

$ systemctl --version
systemd 239 (239-58.el8)
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=legacy

NVIDIA container runtime library

# rpm -qa | grep libnvidia-container
libnvidia-container1-1.10.0-1.x86_64
libnvidia-container-tools-1.10.0-1.x86_64

runc

runc fully supports cgroup v2 (unified mode) since v1.0.0-rc93.

# runc --version
runc version 1.1.2
commit: v1.1.2-0-ga916309
spec: 1.0.2-dev
go: go1.17.11
libseccomp: 2.5.2
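
To save some typing, here is a rough preflight script that just prints the same facts checked above on a node; it is a sketch (the rpm query assumes an RPM-based distro) and it does not decide compatibility for you:

#!/usr/bin/env bash
# Print the cgroup/kernel/systemd/runc/libnvidia-container facts checked above.
set -u

echo "== cgroup filesystems =="
grep cgroup /proc/filesystems

echo "== current hierarchy mounted on /sys/fs/cgroup =="
stat -fc %T /sys/fs/cgroup/

echo "== kernel (5.2 or later recommended) =="
uname -r

echo "== systemd =="
systemctl --version | head -n 1

echo "== runc (cgroup v2 supported since v1.0.0-rc93) =="
runc --version 2>/dev/null || echo "runc not found in PATH"

echo "== libnvidia-container packages (RPM-based distros) =="
rpm -qa 2>/dev/null | grep libnvidia-container || echo "none found / not an RPM system"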

Steps

Modify grub

On systemd-based distros, cgroup v2 can be enabled by adding systemd.unified_cgroup_hierarchy=1 to the kernel cmdline.

Add systemd.unified_cgroup_hierarchy=1 to the GRUB_CMDLINE_LINUX line in /etc/default/grub. It should look something like this:

# vim /etc/default/grub
....
GRUB_CMDLINE_LINUX="xxxxxxxx apparmor=0 systemd.unified_cgroup_hierarchy=1"
....

I omitted the other parts of the line. Keep the existing parameters unchanged; just append this one.

NOTE: I also disabled AppArmor (apparmor=0) because it was breaking containerd. I'm not totally sure whether it's needed for cgroup v2. If systemd.unified_cgroup_hierarchy=1 alone is not enough, you may try adding apparmor=0 too.
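
On RHEL-style systems, grubby can append the same arguments to every installed kernel instead of hand-editing /etc/default/grub. We edited the file by hand, so treat this as an alternative path and double-check the result:

# Alternative to hand-editing /etc/default/grub (RHEL-style systems).
grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1 apparmor=0"

# Check what the default boot entry will pass to the kernel:
grubby --info=DEFAULT | grep ^args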

Rebuild grub

# grub2-mkconfig -o /boot/grub2/grub.cfg
Generating grub configuration file ...
Adding boot menu entry for EFI firmware configuration
done

Drain the node

so the reboot doesn't affect users.

k drain xxxx # with other necessary options
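
For example (the node name is a placeholder; the exact flags depend on what runs on the node, but these two are commonly needed):

NODE=node0021   # placeholder node name
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data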

Reboot the machine

# reboot

Confirm

After reboot, you should see this:

$ cat /proc/cmdline
....  apparmor=0 systemd.unified_cgroup_hierarchy=1 ....

Confirm it's cgroup v2:

# stat -fc %T /sys/fs/cgroup/
cgroup2fs

Verify Fix

Test systemd reload

This is easy to test because it doesn't require waiting for a few hours or days.

Before:

$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)

Do the reload on that node itself:

# systemctl daemon-reload

After:

$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)

Long-running test

This is to replicate what users see on their end.

I applied the above fix to 4 nodes and scheduled 4 pods, one on each node, each requesting 1 GPU. (Since this is long running, I only requested 1 GPU per pod, leaving 7 GPUs on each node for users to schedule.)

They have been running for 4 days without issue (note the AGE field):

$ k get po -o wide
NAME                READY   STATUS    RESTARTS   AGE     IP                        NODE                    NOMINATED NODE   READINESS GATES
gengwg-test-gpu-1   1/1     Running   0          3d10h   2620:10d:xxxxx            node0021   <none>           <none>
gengwg-test-gpu-2   1/1     Running   0          3d10h   2620:10d:xxxxx            node0023   <none>           <none>
gengwg-test-gpu-3   1/1     Running   0          3d10h   2620:10d:xxxxx            node0024   <none>           <none>
gengwg-test-gpu-4   1/1     Running   0          3d10h   2620:10d:xxxxx            node0056   <none>           <none>

Check nvidia-smi:

$ for i in {1..4}; do k exec -it gengwg-test-gpu-$i -- nvidia-smi -L ; done
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-29b18a6d-4246-6edd-d102-92c3dbbec667)
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-8b009714-3ee2-ac82-6b2f-4ebf8d103a7c)
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-3836675c-e987-1f01-7ce7-12da20038909)
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-d9050ed3-2859-bcde-a56f-f1562a403db9)

That made me more confident this is the fix (at least for our environment).
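
If you prefer the long-running check to be logged automatically instead of spot-checked by hand, a simple polling loop like this works; the pod names are assumed to match the test pods above:

# Poll the test pods every 10 minutes and log whether NVML still works.
# Pod names gengwg-test-gpu-1..4 are assumed from the test above.
while true; do
  for i in 1 2 3 4; do
    if kubectl exec "gengwg-test-gpu-$i" -- nvidia-smi -L >/dev/null 2>&1; then
      echo "$(date -Is) gengwg-test-gpu-$i OK"
    else
      echo "$(date -Is) gengwg-test-gpu-$i NVML FAILURE"
    fi
  done
  sleep 600
done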

More debugging info for our env

The cgroup driver for kubelet, docker, and containerd is systemd on all our nodes.

# cat /etc/systemd/system/kubelet.service | grep -i cgroup
  --runtime-cgroups=/systemd/system.slice \
  --kubelet-cgroups=/systemd/system.slice \
  --cgroup-driver=systemd \

We are in the middle of migrating from docker to containerd, so we have both docker and containerd nodes. This seems to have fixed it for BOTH.

Docker nodes:

# docker info | grep -i cgroup
WARNING: No swap limit support
 Cgroup Driver: systemd
 Cgroup Version: 2
  cgroupns

Containerd nodes:

$ sudo crictl info | grep -i cgroup
            "SystemdCgroup": true
            "SystemdCgroup": true
    "systemdCgroup": false,
    "disableCgroup": false,

Here is our k8s version:

$ k version --short
Client Version: v1.21.3
Server Version: v1.22.9

containerd version:

# containerd --version
containerd containerd.io 1.6.6 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1

nvidia-container-toolkit version:

# dnf info nvidia-container-toolkit | grep Version
Version      : 1.11.0
Comments

@zlianzhuang: It can't fix my cluster. I set cgroup-driver=cgroupfs on docker and k8s to fix my cluster.

@Nghiauet: Thank you so much. It fixed my problems.
