NOTE: This seems fixed our cluster. BUT I do see some still reporting cgroup2 having same issue, for example here. So YMMV.
DISCLAIMER: This seems works in our env. may not work in others. I'm still not sure what is the real root cause(s) yet. Not even 100% sure it full fixes in our env - it's been good for 2 weeks. But if it reappears, (for example, under certain use cases. high load or something), I'll be doomed.
Switching to cgroup v2 seems fixed the nvml suddenly go away in pod issue.