I have an NVIDIA graphics card that I use for TensorFlow work. I build TensorFlow from source and keep up with the current releases of CUDA, cuDNN and NCCL.
I have the following configuration:
- CentOS 8 (2004)
- NVIDIA graphics card
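You will need to download the following files:
- cuda_11.0.3_450.51.06_linux.run
- cudnn-11.0-linux-x64-v8.0.2.39.tgz
- nccl_2.7.8-1+cuda11.0_x86_64.txz
Downloading cuDNN and NCCL requires you to create an NVIDIA developer account.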
Check that your system can see your NVIDIA card:
lspci | grep -i nvidia
65:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
65:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
65:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
65:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
Check that kernel development tools and headers are installed:
dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
You will need to disable the nouveau drivers. The CUDA installer will create this file for you, but it needs the drivers disabled before the install, so you have to create the file yourself first:
vi /usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf
Which needs to contain the following two lines:
blacklist nouveau
options nouveau modeset=0
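If you would rather not open an editor, a here-document can write the same two lines. This is only a sketch: the function name write_nouveau_blacklist is my own, not part of any NVIDIA tooling, and it takes the target path as an argument so you can try it on a scratch file first.

```shell
#!/bin/sh
# Hypothetical helper: write the two nouveau blacklist lines to the file in $1.
write_nouveau_blacklist() {
    conf="$1"
    # Quoted 'EOF' keeps the shell from expanding anything inside the body.
    cat > "$conf" <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
}

# Usage (as root):
# write_nouveau_blacklist /usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf
```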
Run dracut:
dracut --force
Check that nouveau drivers are not loaded. The command should print nothing; if it still lists nouveau, reboot and check again:
lsmod | grep nouveau
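In a script it can be handier to get a message and an exit status rather than to look for empty output. A small sketch of that idea; check_nouveau is a name I made up, and it reads lsmod-style text from standard input so it can also be run on saved output.

```shell
#!/bin/sh
# Hypothetical helper: report whether a module named exactly "nouveau"
# appears in lsmod-style output on stdin. Returns 1 if it is loaded.
check_nouveau() {
    if grep -q '^nouveau '; then
        echo "nouveau is still loaded"
        return 1
    fi
    echo "nouveau is not loaded"
}

# Usage:
# lsmod | check_nouveau
```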
Stop X prior to install:
systemctl stop gdm.service
Check that X is no longer running:
systemctl status gdm.service
ps -aux | grep gnome
Install the CUDA toolkit:
sh cuda_11.0.3_450.51.06_linux.run
Check the CUDA ld configuration:
cat /etc/ld.so.conf.d/cuda-11-0.conf
/usr/local/cuda-11.0/targets/x86_64-linux/lib
You can run ldconfig to make sure the new library path is picked up:
ldconfig
And you can check the logs:
more /var/log/cuda-installer.log
more /var/log/nvidia-installer.log
At this point I like to reboot to make sure that the install is ok:
reboot
Once rebooted you can check the driver version:
cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 450.51.06 Sun Jul 19 20:02:54 UTC 2020
GCC version: gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)
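If you want just the version number, for example to compare it against the 450.51.06 in the installer file name, awk can pull it out of the NVRM line. This is a sketch that assumes the version is the field right after the word "Module", as in the output above; nvidia_driver_version is my own name for it.

```shell
#!/bin/sh
# Hypothetical helper: print the driver version found on the "NVRM version:"
# line of stdin, taken as the field that follows the word "Module".
nvidia_driver_version() {
    awk '/^NVRM version:/ {
        for (i = 1; i < NF; i++)
            if ($i == "Module") print $(i + 1)
    }'
}

# Usage:
# nvidia_driver_version < /proc/driver/nvidia/version
```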
And the status of driver modules:
lsmod | grep -i nvidia
nvidia_uvm 1142784 0
nvidia_drm 53248 3
nvidia_modeset 1183744 6 nvidia_drm
nvidia 19677184 258 nvidia_uvm,nvidia_modeset
drm_kms_helper 212992 1 nvidia_drm
drm 536576 6 drm_kms_helper,nvidia_drm
Unpack the cuDNN tar file:
tar zxf cudnn-11.0-linux-x64-v8.0.2.39.tgz
Copy the files in place:
cp -v cuda/include/cudnn* /usr/local/cuda/include/
cp -v cuda/lib64/libcudnn* /usr/local/cuda/lib64/
chmod a+r /usr/local/cuda/include/cudnn* /usr/local/cuda/lib64/libcudnn*
Fix the library soft links so that ld doesn't complain:
cd /usr/local/cuda/lib64/
rm -f libcudnn.so.8 libcudnn.so
ln -s libcudnn.so.8.0.2 libcudnn.so.8
ln -s libcudnn.so.8 libcudnn.so
rm -f libcudnn_ops_train.so.8 libcudnn_ops_train.so
ln -s libcudnn_ops_train.so.8.0.2 libcudnn_ops_train.so.8
ln -s libcudnn_ops_train.so.8 libcudnn_ops_train.so
rm -f libcudnn_ops_infer.so.8 libcudnn_ops_infer.so
ln -s libcudnn_ops_infer.so.8.0.2 libcudnn_ops_infer.so.8
ln -s libcudnn_ops_infer.so.8 libcudnn_ops_infer.so
rm -f libcudnn_cnn_train.so.8 libcudnn_cnn_train.so
ln -s libcudnn_cnn_train.so.8.0.2 libcudnn_cnn_train.so.8
ln -s libcudnn_cnn_train.so.8 libcudnn_cnn_train.so
rm -f libcudnn_cnn_infer.so.8 libcudnn_cnn_infer.so
ln -s libcudnn_cnn_infer.so.8.0.2 libcudnn_cnn_infer.so.8
ln -s libcudnn_cnn_infer.so.8 libcudnn_cnn_infer.so
rm -f libcudnn_adv_train.so.8 libcudnn_adv_train.so
ln -s libcudnn_adv_train.so.8.0.2 libcudnn_adv_train.so.8
ln -s libcudnn_adv_train.so.8 libcudnn_adv_train.so
rm -f libcudnn_adv_infer.so.8 libcudnn_adv_infer.so
ln -s libcudnn_adv_infer.so.8.0.2 libcudnn_adv_infer.so.8
ln -s libcudnn_adv_infer.so.8 libcudnn_adv_infer.so
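The seven rm/ln groups above all follow one pattern, so a loop can do the same work and is easier to adjust for the next cuDNN release. A sketch under that assumption; relink_cudnn is a hypothetical name, and the directory and version are arguments so it can be tried in a scratch directory before touching /usr/local/cuda.

```shell
#!/bin/sh
# Hypothetical helper: recreate the major-version and unversioned links
# for each cuDNN library.
#   $1 = library directory, $2 = full cuDNN version (8.0.2 in this post)
relink_cudnn() {
    libdir="$1"
    version="$2"
    major="${version%%.*}"   # 8.0.2 -> 8
    for name in libcudnn \
                libcudnn_ops_train libcudnn_ops_infer \
                libcudnn_cnn_train libcudnn_cnn_infer \
                libcudnn_adv_train libcudnn_adv_infer; do
        rm -f "$libdir/$name.so.$major" "$libdir/$name.so"
        # Relative targets, matching the ln commands above.
        ln -s "$name.so.$version" "$libdir/$name.so.$major"
        ln -s "$name.so.$major" "$libdir/$name.so"
    done
}

# Usage (as root):
# relink_cudnn /usr/local/cuda/lib64 8.0.2
```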
Unpack the NCCL txz file:
7za x nccl_2.7.8-1+cuda11.0_x86_64.txz
tar xf nccl_2.7.8-1+cuda11.0_x86_64.tar
Copy the files in place:
cp -v nccl_2.7.8-1+cuda11.0_x86_64/LICENSE.txt /usr/local/cuda/NCCL-SLA.txt
cp -v nccl_2.7.8-1+cuda11.0_x86_64/include/nccl.h /usr/local/cuda/include/
cp -v nccl_2.7.8-1+cuda11.0_x86_64/lib/libnccl* /usr/local/cuda/lib64/
(Building TensorFlow from source requires the LICENSE.txt file, or at least it did when I last checked.)
Fix the library soft links so that ld doesn't complain:
cd /usr/local/cuda/lib64/
rm -f libnccl.so.2 libnccl.so
ln -s libnccl.so.2.7.8 libnccl.so.2
ln -s libnccl.so.2 libnccl.so
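After relinking by hand it is easy to leave a link pointing at a file name that does not exist. The sketch below lists any such dangling links in a directory; dangling_links is a name I picked for illustration, and no output means every link resolves.

```shell
#!/bin/sh
# Hypothetical helper: print every symbolic link in directory $1 whose
# target does not exist.
dangling_links() {
    for f in "$1"/*; do
        # -L is true for a symlink; -e follows it and fails if the target is missing.
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            echo "$f"
        fi
    done
}

# Usage:
# dangling_links /usr/local/cuda/lib64
```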
Get some debug info:
nvidia-debugdump -l
Get some load/process info:
nvidia-smi
Get card load stats:
nvidia-smi dmon -s pucvmet -d 5
Be sure to check the man pages for these commands for more information.
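When I only want a few numbers per card, nvidia-smi can also emit CSV through its --query-gpu and --format options. The sketch below turns that CSV into one readable line per GPU; format_gpu_rows is a made-up helper, and the field list in the usage line is simply the set I chose for this example.

```shell
#!/bin/sh
# Hypothetical helper: reformat CSV rows of
#   index, name, driver_version, memory.used, utilization.gpu
# read from stdin into one summary line per GPU.
format_gpu_rows() {
    awk -F', ' '{
        printf "GPU %s: %s, driver %s, memory used %s, utilization %s\n", $1, $2, $3, $4, $5
    }'
}

# Usage:
# nvidia-smi --query-gpu=index,name,driver_version,memory.used,utilization.gpu \
#            --format=csv,noheader | format_gpu_rows
```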
Remove CUDA:
/usr/local/cuda-11.0/bin/cuda-uninstaller
Remove CUDA files:
rm -rf /usr/local/cuda-11.0 /usr/local/cuda
rm -f /etc/ld.so.conf.d/cuda-11-0.conf
Remove NVIDIA driver:
nvidia-uninstall
Remove NVIDIA files:
rm -f /usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf
dracut --force
Remove cuDNN files:
rm -rf /usr/local/cuda/include/cudnn* /usr/local/cuda/lib64/libcudnn*
Remove NCCL files:
rm -rf /usr/local/cuda/include/nccl.h /usr/local/cuda/lib64/libnccl*