Getting a Tesla K80 working on an Ubuntu 20.04.3 VM.
Turns out the easiest route is via conda.
conda create -n gpu python=3.10
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
Check for a heartbeat.
python -c "import torch;print(torch.rand(5, 3));print(torch.cuda.is_available())"
Should result in:
tensor([[0.4617, 0.7625, 0.4423],
[0.8141, 0.4264, 0.2836],
[0.2107, 0.0038, 0.1685],
[0.6512, 0.5361, 0.1323],
[0.9526, 0.5774, 0.5037]])
False
So... PyTorch itself is OK, but CUDA is not available.
PyTorch can't use CUDA because we never installed the NVIDIA drivers.
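To tell a "CPU-only build" apart from a "missing driver", a slightly richer diagnostic than the one-liner above helps. This is a sketch; the function name `cuda_report` is mine, not part of PyTorch:

```python
# Hypothetical helper: summarise what PyTorch can see.
# torch.version.cuda is None on a CPU-only build; if it is set but
# cuda_available is False, the driver side is the problem (our case here).
def cuda_report():
    try:
        import torch
    except ImportError:
        return {"torch": None}  # torch not installed at all
    return {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count() if torch.cuda.is_available() else 0,
    }

if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

On the VM at this point you'd expect `built_with_cuda` to be set (e.g. "11.3") while `cuda_available` is False.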
Look for drivers:
# first, see if anything is already installed
dkms status
# use headless because we don't need a GUI on the VM
sudo apt search nvidia-driver | grep headless
sudo apt search nvidia-utils | grep server
Choose the highest version (510 in this case):
sudo apt install nvidia-headless-510-server nvidia-utils-510-server -y
sudo reboot # optional
Test we can see the GPU:
nvidia-smi
If the above looks OK, then do:
conda activate gpu
python -c "import torch;print(torch.cuda.is_available());print(torch.cuda.get_device_name(0))"
Which should give:
True
Tesla K80
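Beyond `is_available()`, it's worth doing a tiny computation on the device to confirm the GPU actually works end to end. A minimal sketch (the function name `smoke_test` is mine); it falls back to CPU so it runs anywhere:

```python
# Hypothetical smoke test: run a small matmul on whichever device exists.
def smoke_test(n=256):
    try:
        import torch
    except ImportError:
        return None  # torch not installed
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.rand(n, n, device=device)
    b = torch.rand(n, n, device=device)
    return device, (a @ b).shape

if __name__ == "__main__":
    print(smoke_test())
```

If this returns `('cuda', torch.Size([256, 256]))` without errors, the driver, runtime, and PyTorch build are all talking to each other.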
# useful module
conda install pynvml -c conda-forge
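pynvml exposes the same NVML counters that nvidia-smi reads. A sketch of querying GPU memory with it (the function name `gpu_memory_mib` is mine; it returns None on machines without an NVIDIA driver):

```python
# Hypothetical helper: read GPU 0 memory via NVML, in MiB.
def gpu_memory_mib():
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception:
        return None  # no pynvml, or no driver/GPU present
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return {
            "total": info.total // 2**20,
            "used": info.used // 2**20,
            "free": info.free // 2**20,
        }
    finally:
        pynvml.nvmlShutdown()
```

Handy inside a notebook, where shelling out to nvidia-smi is awkward.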
Then doing:
python -c "import torch;print(torch.cuda.list_gpu_processes());print(torch.cuda.memory_summary())"
Should give you something like:
GPU:0
no processes are running
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|
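All zeros is expected on a fresh device. To see the counters move, allocate something and compare `torch.cuda.memory_allocated()` before and after. A sketch (the function name `allocation_demo` is mine; it returns None without a GPU):

```python
# Hypothetical demo: watch the CUDA caching allocator register a tensor.
def allocation_demo():
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    before = torch.cuda.memory_allocated()
    x = torch.zeros(1024, 1024, device="cuda")  # ~4 MiB of float32
    after = torch.cuda.memory_allocated()
    return after - before  # bytes attributed to x (rounded up by the allocator)
```

The delta should be roughly 4 MiB, and the same amount shows up under "Allocated memory" in `memory_summary()`.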
Now install everything else:
conda install jupyter pandas numpy matplotlib
conda install humanize # optional
conda clean --all -y # clean it all up
Then start Jupyter:
jupyter notebook --no-browser
Set up an SSH tunnel:
ssh -f -N -L 8888:localhost:8888 USER@VM
# -f backgrounds ssh after authentication
# -N means run no remote command - tunnel only
Then open localhost:8888 in your browser.
On first connection, Jupyter will ask you to log in with a token or set a password
- copy/paste the token from the Jupyter logs on the VM
- it looks something like
73a7598019be7b8f0fb6XXc66fc1f93ce963e698541713e1
- then set a strong password
Useful checks on your local machine to confirm the tunnel is up:
sudo lsof -i -n -P | egrep '\<ssh\>'
ps aux | grep ssh
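Because the tunnel was started with -f, there's no terminal to close; you have to find the process to kill it. The same check can be scripted with the stdlib if you want it in a notebook (the function name `find_ssh_processes` is mine):

```python
# Hypothetical helper: list running processes whose command line mentions ssh,
# using only the stdlib (shells out to ps, as the commands above do).
import subprocess

def find_ssh_processes():
    out = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True
    ).stdout
    return [line.strip() for line in out.splitlines() if "ssh " in line]

if __name__ == "__main__":
    for proc in find_ssh_processes():
        print(proc)
```

Kill the matching PID (`kill <pid>`) to tear the tunnel down.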
Okay now go do the tutorial at: https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html
While running the training, you can check GPU activity on the VM via:
watch -n0.1 nvidia-smi
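If you'd rather sample utilization from inside Python (say, to log it alongside training metrics), pynvml exposes the same numbers. A sketch (the function name `gpu_utilization` is mine; returns None without a driver):

```python
# Hypothetical helper: sample GPU and memory-controller utilization (percent)
# for device 0 via NVML - the same counters nvidia-smi displays.
def gpu_utilization():
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception:
        return None  # no pynvml or no NVIDIA driver on this machine
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        return util.gpu, util.memory
    finally:
        pynvml.nvmlShutdown()
```

During the tutorial's training loop you should see the first number climb well above zero.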