Updated 11/28/2018

Here's my experience installing the NVIDIA CUDA Toolkit 9.0 on a fresh install of Ubuntu Desktop 16.04.4 LTS. Scroll down to the bottom if you only want to install the NVIDIA drivers and run TensorFlow in a Docker container.


1. Install NVIDIA Graphics Driver via apt-get

Do not use the CUDA run file to install your driver. Use apt-get instead. This way you do not need to worry about the Nouveau stuff you read about on StackOverflow.

As of 04/11/2018, the latest version of the NVIDIA driver for Ubuntu 16.04.4 LTS is 384. To install the driver, execute

sudo apt-get update
sudo apt-get install nvidia-384 nvidia-modprobe -y

Reboot the machine.

Afterwards, you can check the installation with the nvidia-smi command, which reports all CUDA-capable devices in the system.
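As a scripted version of that check, the sketch below prints each GPU's name and driver version, and prints a hint instead when nvidia-smi is missing or fails. The function name report_gpus is made up for illustration; the --query-gpu and --format options are regular nvidia-smi flags.

```shell
#!/bin/sh
# report_gpus is a hypothetical helper name, not part of any NVIDIA tool.
report_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # One CSV line per GPU: name, driver version
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader ||
            echo "nvidia-smi failed: is the kernel module loaded?"
    else
        echo "nvidia-smi not found: is the driver installed?"
    fi
}

report_gpus
```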

Common Errors and Solutions

ERROR: Unable to load the 'nvidia-drm' kernel module.

One probable cause is that the system boots via UEFI with the Secure Boot option turned on in the BIOS settings. Turn Secure Boot off and the problem should go away.
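Before rebooting into the BIOS, it may help to confirm both conditions from the running system. This is a minimal sketch, assuming a Linux system where /sys/firmware/efi exists only on UEFI boots and where the mokutil package may or may not be installed; boot_mode is a hypothetical helper name.

```shell
#!/bin/sh
# boot_mode: print "UEFI" if the kernel exposes EFI firmware data, else "legacy BIOS".
boot_mode() {
    if [ -d /sys/firmware/efi ]; then
        echo "UEFI"
    else
        echo "legacy BIOS"
    fi
}

echo "Boot mode: $(boot_mode)"

# mokutil reports the Secure Boot state on UEFI systems; skipped when it is not installed.
if [ "$(boot_mode)" = "UEFI" ] && command -v mokutil >/dev/null 2>&1; then
    mokutil --sb-state || true
fi
```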

Additional Notes

nvidia-smi -pm 1 enables persistence mode, which keeps the driver loaded and saves the time otherwise spent loading it. The effect is significant on machines with more than 4 GPUs.

nvidia-smi -e 0 disables ECC on Tesla products, which frees up about 1/15 more video memory. A reboot is required for the change to take effect. nvidia-smi -e 1 enables ECC again.

nvidia-smi -pl <some power value> raises or lowers the power (TDP) limit of the GPU. Raising it encourages a higher GPU Boost frequency, but is DANGEROUS and potentially HARMFUL to the GPU. Lowering it saves some power, which is useful for machines whose power supply is too weak and that shut down unexpectedly when all GPUs are pulled to their maximum load.

-i <GPUID> can be appended to any of the above commands to target an individual GPU.

These commands can be added to /etc/rc.local so that they run at system boot.
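Before passing a value to -pl, it seems prudent to check the range the driver allows. The sketch below queries the limits with nvidia-smi's power.min_limit and power.max_limit fields and compares a candidate value against a range. in_range is a hypothetical helper, and the 150/100/200 W figures are made-up numbers, not recommendations.

```shell
#!/bin/sh
# in_range VALUE MIN MAX: succeed when MIN <= VALUE <= MAX (all in watts).
# Hypothetical helper for illustration; not part of nvidia-smi.
in_range() {
    awk -v v="$1" -v lo="$2" -v hi="$3" \
        'BEGIN { exit !(v + 0 >= lo + 0 && v + 0 <= hi + 0) }'
}

# Show the limits the driver allows for each GPU (skipped when nvidia-smi is absent).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,power.min_limit,power.max_limit \
        --format=csv,noheader || true
fi

# Example with made-up numbers: is 150 W inside a 100-200 W window?
if in_range 150 100 200; then
    echo "150 W is within the allowed range"
fi
```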

2. Install CUDA 9.0

Installing CUDA from the runfile is much simpler and smoother than installing the NVIDIA driver. It just involves copying files to system directories and has nothing to do with the system kernel or online compilation. Removing CUDA is simply a matter of removing the installation directory. So I personally do not recommend adding NVIDIA's repositories and installing CUDA via apt-get or other package managers: it does not reduce the complexity of installation or uninstallation, but it does increase the risk of messing up the repository configuration.

The CUDA runfile installer can be downloaded from NVIDIA's website, or with wget in case you can't find it easily on NVIDIA's site:

cd
wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run

What you download is a package containing the following three components:

1. an NVIDIA driver installer, but usually a stale version;
2. the actual CUDA installer;
3. the CUDA samples installer.

I suggest extracting the above three components and executing 2 and 3 separately (remember we installed the driver ourselves already). To extract them, execute the runfile installer with --extract option:

cd
chmod +x cuda_9.0.176_384.81_linux-run
./cuda_9.0.176_384.81_linux-run --extract=$HOME

You should have unpacked three components: NVIDIA-Linux-x86_64-384.81.run (1. NVIDIA driver that we ignore), cuda-linux.9.0.176-22781540.run (2. CUDA 9.0 installer), and cuda-samples.9.0.176-22781540-linux.run (3. CUDA 9.0 Samples).

Execute the second one to install the CUDA Toolkit 9.0:

sudo ./cuda-linux.9.0.176-22781540.run

You now have to accept the license by scrolling down to the bottom (hit the "d" key on your keyboard) and enter "accept". Next accept the defaults.

To verify our CUDA installation, install the sample tests by

sudo ./cuda-samples.9.0.176-22781540-linux.run

Please make sure that

PATH includes /usr/local/cuda-9.0/bin
LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-9.0/lib64 to /etc/ld.so.conf and run ldconfig as root
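For a per-user setup, the two requirements above could be met with lines like these in ~/.bashrc or ~/.profile. This is only a sketch: the variable name CUDA_HOME is a common convention rather than something the installer sets, and the case guards are there so that sourcing the file twice does not add duplicate entries.

```shell
#!/bin/sh
# CUDA_HOME is an assumed variable name; the path is the CUDA 9.0 default used above.
CUDA_HOME=/usr/local/cuda-9.0

# Prepend each directory only if it is not already listed.
case ":$PATH:" in
    *":$CUDA_HOME/bin:"*) ;;
    *) PATH="$CUDA_HOME/bin:$PATH" ;;
esac
case ":${LD_LIBRARY_PATH:-}:" in
    *":$CUDA_HOME/lib64:"*) ;;
    *) LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
esac
export PATH LD_LIBRARY_PATH
```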

After the installation finishes, configure the runtime library.

sudo bash -c "echo /usr/local/cuda/lib64/ > /etc/ld.so.conf.d/cuda.conf"
sudo ldconfig

Ubuntu users are also advised to append the string /usr/local/cuda/bin to the system file /etc/environment so that nvcc is included in $PATH. This takes effect after a reboot. To do that, run

sudo vim /etc/environment

and then add :/usr/local/cuda/bin (including the ":") at the end of the PATH="/blah:/blah/blah" string (inside the quotes).
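If you would rather not edit the file by hand, the same change can be scripted. The sketch below assumes GNU sed (as shipped with Ubuntu) and a PATH="..." line in the double-quoted format described above; add_to_env_path is a hypothetical helper, and the demo runs on a scratch file rather than the real /etc/environment.

```shell
#!/bin/sh
# add_to_env_path FILE DIR: append DIR to the PATH="..." line of FILE
# unless it is already there. Hypothetical helper; test on a copy first.
add_to_env_path() {
    file=$1
    dir=$2
    if grep -q "^PATH=\".*$dir" "$file"; then
        return 0    # already listed, nothing to do
    fi
    sed -i "s|^PATH=\"\(.*\)\"\$|PATH=\"\1:$dir\"|" "$file"
}

# Demonstrate on a scratch file with a typical PATH line.
tmp=$(mktemp)
echo 'PATH="/usr/local/sbin:/usr/local/bin:/usr/bin:/bin"' > "$tmp"
add_to_env_path "$tmp" /usr/local/cuda/bin
cat "$tmp"
rm -f "$tmp"
```

To apply it for real, one would call the function on /etc/environment as root, after making a backup.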

After a reboot, let's test our installation by making and invoking our tests:

cd /usr/local/cuda-9.0/samples
sudo make

It's a long process with many irrelevant warnings about deprecated architectures (sm_20 and such ancient GPUs). After it completes, run deviceQuery and p2pBandwidthLatencyTest:

cd /usr/local/cuda/samples/bin/x86_64/linux/release
./deviceQuery

The result of running deviceQuery should look something like this:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6073 MBytes (6367739904 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1671 MHz (1.67 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
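In a provisioning script, the final "Result = PASS" line is a convenient thing to test for. A minimal sketch, assuming the samples were built in the location used above; device_query_ok is a hypothetical helper that reads deviceQuery output on stdin.

```shell
#!/bin/sh
# device_query_ok: succeed only if the input contains the line "Result = PASS".
device_query_ok() {
    grep -q '^Result = PASS'
}

# Run against the real binary when it exists; otherwise do nothing.
DQ=/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery
if [ -x "$DQ" ]; then
    if "$DQ" | device_query_ok; then
        echo "CUDA check passed"
    else
        echo "CUDA check failed"
    fi
fi
```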

Cleanup: if ./deviceQuery works, remember to rm the 4 files (1 downloaded and 3 extracted).

Install cuDNN 7.0

The recommended way to install cuDNN is to download the "cuDNN Library for Ubuntu 16" packages (you need to register for an NVIDIA account), selecting the version compatible with CUDA 9. You should then have the following files on your system:

libcudnn7_7.4.1.5-1+cuda9.0_amd64.deb
libcudnn7-dev_7.4.1.5-1+cuda9.0_amd64.deb
libcudnn7-doc_7.4.1.5-1+cuda9.0_amd64.deb

Install them:

sudo dpkg -i libcudnn7_7.4.1.5-1+cuda9.0_amd64.deb
sudo dpkg -i libcudnn7-dev_7.4.1.5-1+cuda9.0_amd64.deb
sudo dpkg -i libcudnn7-doc_7.4.1.5-1+cuda9.0_amd64.deb

Finally, execute sudo ldconfig to update the shared library cache.
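To confirm which cuDNN version ended up installed, one option is to read the version macros from the header shipped by the -dev package. This sketch assumes the header is reachable as /usr/include/cudnn.h and defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, as cuDNN 7 headers do; the path may differ on your system, and cudnn_version is a hypothetical helper.

```shell
#!/bin/sh
# cudnn_version HEADER: print MAJOR.MINOR.PATCHLEVEL parsed from a cuDNN header.
cudnn_version() {
    awk '/^#define CUDNN_MAJOR /      { major = $3 }
         /^#define CUDNN_MINOR /      { minor = $3 }
         /^#define CUDNN_PATCHLEVEL / { patch = $3 }
         END { if (major != "") print major "." minor "." patch }' "$1"
}

# Assumed header location; skipped when the file is not there.
HEADER=/usr/include/cudnn.h
if [ -r "$HEADER" ]; then
    echo "cuDNN $(cudnn_version "$HEADER")"
fi
```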

Install TensorFlow GPU library

Select the GPU tarball and save it.

Extract the .so files and move them to a system library path.

tar -zxvf libtensorflow-gpu-linux-x86_64-<tab>
sudo chown -R root:root lib
sudo mv lib/lib* /usr/local/lib
sudo ldconfig

Check whether a binary that links against libtensorflow can find all of its dynamic dependencies:

ldd matrix-inversion-benchmark-tf

Example output from ldd command:

$ ldd matrix-inversion-benchmark-tf
	linux-vdso.so.1 =>  (0x00007ffdf776e000)
	libtensorflow.so => /usr/local/lib/libtensorflow.so (0x00007efe88ebc000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007efe88c9f000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007efe888d5000)
	libtensorflow_framework.so => /usr/local/lib/libtensorflow_framework.so (0x00007efe8797a000)
	libcublas.so.9.0 => /usr/local/cuda-9.0/lib64/libcublas.so.9.0 (0x00007efe84544000)
	libcusolver.so.9.0 => /usr/local/cuda-9.0/lib64/libcusolver.so.9.0 (0x00007efe7f949000)
	libcudart.so.9.0 => /usr/local/cuda-9.0/lib64/libcudart.so.9.0 (0x00007efe7f6dc000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007efe7f4d8000)
	libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007efe7f2b6000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007efe7efad000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007efe7eda5000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007efe7ea23000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007efe7e80d000)
	/lib64/ld-linux-x86-64.so.2 (0x00007efe972d5000)
	libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007efe7d98f000)
	libcudnn.so.7 => /usr/lib/x86_64-linux-gnu/libcudnn.so.7 (0x00007efe6b3df000)
	libcufft.so.9.0 => /usr/local/cuda-9.0/lib64/libcufft.so.9.0 (0x00007efe6333e000)
	libcurand.so.9.0 => /usr/local/cuda-9.0/lib64/libcurand.so.9.0 (0x00007efe5f3da000)
	libnvidia-fatbinaryloader.so.384.130 => /usr/lib/nvidia-384/libnvidia-fatbinaryloader.so.384.130 (0x00007efe5f188000)
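When a dependency cannot be resolved, ldd prints "not found" for it instead of a path. The sketch below filters ldd output down to just those libraries; missing_libs is a hypothetical helper name, and the binary path is the one used in this guide.

```shell
#!/bin/sh
# missing_libs: read ldd output on stdin, print the names of unresolved libraries.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

BIN=./matrix-inversion-benchmark-tf
if [ -x "$BIN" ]; then
    missing=$(ldd "$BIN" | missing_libs)
    if [ -n "$missing" ]; then
        echo "Unresolved libraries:"
        echo "$missing"
    else
        echo "All dynamic dependencies resolved"
    fi
fi
```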

Finally, run the benchmark to see the awesome power of the GPU. A random 10k by 10k matrix is inverted in under 8 seconds! The same Go binary linked against the CPU TensorFlow library on a laptop takes anywhere from 5-10 minutes!

$ ./matrix-inversion-benchmark-tf 2> /dev/null   # this runs on machine with GPU
[100 100] 354.559305ms
[100 100] 402.992636ms
[200 200] 7.406717ms
[200 200] 8.793598ms
[500 500] 17.260441ms
[500 500] 27.268701ms
[1000 1000] 49.058466ms
[1000 1000] 88.828322ms
[2000 2000] 159.050976ms
[2000 2000] 333.588065ms
[5000 5000] 1.229218361s
[5000 5000] 2.00629059s
[10000 10000] 4.162459538s
[10000 10000] 7.302393948s

The same binary running without a GPU:

$ ./matrix-inversion-benchmark-tf 2> /dev/null
[100 100] 27.505568ms
[100 100] 6.989513ms
[200 200] 6.123381ms
[200 200] 9.456749ms
[500 500] 14.438066ms
[500 500] 46.444771ms
[1000 1000] 39.278103ms
[1000 1000] 282.240379ms
[2000 2000] 148.83378ms
[2000 2000] 2.016059554s
[5000 5000] 1.113634783s
[5000 5000] 28.653253206s
[10000 10000] 6.156776647s
[10000 10000] 3m57.371907035s
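To put the two runs side by side, the ratio of the 10000 x 10000 timings can be computed directly. This is plain arithmetic on the numbers printed above (3m57.371907035s is 237.371907035 seconds); speedup is a hypothetical helper, and the pairing of the first and second lines per size simply follows the order in the output.

```shell
#!/bin/sh
# speedup CPU_SECONDS GPU_SECONDS: print the CPU/GPU time ratio with one decimal place.
speedup() {
    awk -v cpu="$1" -v gpu="$2" 'BEGIN { printf "%.1fx\n", cpu / gpu }'
}

# 10000 x 10000, first timing line of each run
speedup 6.156776647 4.162459538
# 10000 x 10000, second timing line of each run
speedup 237.371907035 7.302393948
```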

Prepare the environment to run TF GPU apps in a container

1. Install NVIDIA Graphics Driver via apt-get on host machine

Do not use the CUDA run file to install your driver. Use apt-get instead. This way you do not need to worry about the Nouveau stuff you read about on StackOverflow.

As of 04/11/2018, the latest version of the NVIDIA driver for Ubuntu 16.04.4 LTS is 384. To install the driver, execute

sudo apt-get install nvidia-384 nvidia-modprobe

Reboot the machine.

Afterwards, you can check the installation with the nvidia-smi command, which reports all CUDA-capable devices in the system.

2. Install docker CE

Install docker CE

3. Install nvidia-docker

Install nvidia docker

# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

4. Reboot machine, then test installation

# Test nvidia-smi with the latest official CUDA image
sudo docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
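If that command fails with an unknown runtime error, it can help to check whether Docker registered the nvidia runtime at all. A minimal sketch, assuming docker info prints a "Runtimes:" line listing the registered runtimes; has_nvidia_runtime is a hypothetical helper that reads that output on stdin.

```shell
#!/bin/sh
# has_nvidia_runtime: succeed if the "Runtimes:" line of `docker info` lists nvidia.
has_nvidia_runtime() {
    grep -E '^ *Runtimes:' | grep -qw nvidia
}

# Prefix docker with sudo if your user is not in the docker group.
if command -v docker >/dev/null 2>&1; then
    if docker info 2>/dev/null | has_nvidia_runtime; then
        echo "nvidia runtime is registered"
    else
        echo "nvidia runtime is not registered (or the Docker daemon is unreachable)"
    fi
fi
```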

5. Pull TF container

sudo docker pull tensorflow/tensorflow:1.12.0-gpu

This image does not include libtensorflow_framework.so and libtensorflow.so, so these two dynamic dependencies need to be installed separately (select the GPU-supported tarball and extract the contents of its lib folder into /usr/local/lib, as in the "Install TensorFlow GPU library" section above).

6. Run with nvidia runtime

sudo docker run --runtime=nvidia -v /home/sdeoras:/home/sdeoras --entrypoint /bin/bash -it tensorflow/tensorflow:1.12.0-gpu

Running the matrix inversion binary from within the container is now possible:

# ldd ./matrix-inversion-benchmark-tf
linux-vdso.so.1 =>  (0x00007ffc0b1fe000)
libtensorflow.so => /usr/local/lib/libtensorflow.so (0x00007fb593ed1000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb593cb4000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb5938ea000)
libtensorflow_framework.so => /usr/local/lib/libtensorflow_framework.so (0x00007fb59298f000)
libcublas.so.9.0 => /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcublas.so.9.0 (0x00007fb58ed12000)
libcusolver.so.9.0 => /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcusolver.so.9.0 (0x00007fb58a117000)
libcudart.so.9.0 => /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudart.so.9.0 (0x00007fb589eaa000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb589ca6000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fb589a84000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb58977b000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb589573000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb5891f1000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb588fdb000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb5a22ea000)
libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007fb58815d000)
libcudnn.so.7 => /usr/lib/x86_64-linux-gnu/libcudnn.so.7 (0x00007fb576c56000)
libcufft.so.9.0 => /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcufft.so.9.0 (0x00007fb56ebb5000)
libcurand.so.9.0 => /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcurand.so.9.0 (0x00007fb56ac51000)
libnvidia-fatbinaryloader.so.384.130 => /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.384.130 (0x00007fb56a9ff000)

Finally execute the binary that links against TF libraries:

$ ./matrix-inversion-benchmark-tf 2> /dev/null
[100 100] 369.287241ms
[100 100] 478.682456ms
[200 200] 8.72297ms
[200 200] 9.57128ms
[500 500] 19.873331ms
[500 500] 31.00355ms
[1000 1000] 56.71972ms
[1000 1000] 99.010054ms
[2000 2000] 189.614255ms
[2000 2000] 358.203112ms
[5000 5000] 1.334024582s
[5000 5000] 2.202555466s
[10000 10000] 4.340356657s
[10000 10000] 8.713958014s

sdeoras/Install NVIDIA Driver and CUDA.md
