fighterhit

NOTE: This seems to have fixed our cluster, but I do see others still reporting the same issue on cgroup v2, for example here. So YMMV.

DISCLAIMER: This seems to work in our env; it may not work in others. I'm still not sure what the real root cause(s) are. I'm not even 100% sure it fully fixes the issue in our env: it's been good for 2 weeks, but if it reappears (for example, under certain use cases such as high load), I'll be doomed.

TLDR

Switching to cgroup v2 seems to have fixed the issue of NVML suddenly going away inside pods.
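If you want to confirm which cgroup mode a node is actually running before and after the switch, here is a minimal sketch. It is not something from our setup; it only relies on the fact that a unified cgroup v2 mount exposes a `cgroup.controllers` file at its root. The function name `cgroup_version` and the v1 heuristic (looking for per-controller directories such as `memory`) are my own assumptions, and a hybrid setup would be reported as v1 here.

```python
import os


def cgroup_version(root="/sys/fs/cgroup"):
    """Guess the cgroup version mounted at `root`.

    Returns 2 if `root` looks like a unified cgroup v2 hierarchy,
    1 if it looks like a v1 layout, and None if neither is detected.
    """
    # cgroup v2 exposes cgroup.controllers at the root of the unified mount.
    if os.path.isfile(os.path.join(root, "cgroup.controllers")):
        return 2
    # Heuristic (assumption): v1 mounts one directory per controller.
    v1_controllers = ("memory", "cpu", "devices")
    if any(os.path.isdir(os.path.join(root, name)) for name in v1_controllers):
        return 1
    return None


if __name__ == "__main__":
    print("detected cgroup version:", cgroup_version())
```

Run it on the node itself (or in a pod with the host's `/sys/fs/cgroup` visible); inside a container the result reflects whatever cgroup filesystem is mounted there.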

Problem

Kubectl

Imperative == refers to CLI commands
Declarative == using YAML files

--export
--save-config
--record

	module github.com/pfnet-research/nvidia-create-symlinks

	go 1.19

	require (
	github.com/NVIDIA/nvidia-container-toolkit v1.12.0-rc.2.0.20230127101129-9fc2c5912242 // indirect
	github.com/cpuguy83/go-md2man/v2 v2.0.1 // indirect
	github.com/fsnotify/fsnotify v1.5.4 // indirect
	github.com/russross/blackfriday/v2 v2.1.0 // indirect
	github.com/sirupsen/logrus v1.9.0 // indirect

	# Inspired by https://github.com/vpenso/ganglia-sensors/blob/master/lib/python_modules/infiniband.py#/

	import logging
	import re
	import sys
	import json
	import time
	import subprocess

	package main

	import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net"
	"regexp"
	"strings"

	#!/bin/bash

	set -o errexit
	set -o xtrace

	main() {
	# Declare and assign separately so errexit is not masked by `local`.
	local namespaces
	namespaces=$(list_namespaces)

	for namespace in $namespaces; do
	local tasks
	tasks=$(list_tasks "$namespace")

	// C++ includes used for precompiling -*- C++ -*-

	// Copyright (C) 2003-2015 Free Software Foundation, Inc.
	//
	// This file is part of the GNU ISO C++ Library. This library is free
	// software; you can redistribute it and/or modify it under the
	// terms of the GNU General Public License as published by the
	// Free Software Foundation; either version 3, or (at your option)
	// any later version.

	#!/usr/bin/python
	bpf_text = """
	#include <linux/ptrace.h>
	#include <linux/sched.h> /* For TASK_COMM_LEN */

	#include <linux/icmp.h>
	#include <linux/netdevice.h>

	struct probe_icmp_data_t
	{

	''' Script for downloading all GLUE data.

	Note: for legal reasons, we are unable to host MRPC.
	You can either use the version hosted by the SentEval team, which is already tokenized,
	or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
	For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
	You should then rename and place specific files in a folder (see below for an example).

	mkdir MRPC
	cabextract MSRParaphraseCorpus.msi -d MRPC

	package main

	import (
	"errors"
	"log"
	"os"
	"time"

	"github.com/streadway/amqp"
	)