I want PyTorch, JAX and TensorFlow.
Check: nvidia-smi
still gives 'NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 '
conda activate base
unset LD_LIBRARY_PATH
unset CUDA_HOME
export NEW_ENV='cuda118redux'
export CONDA_ALWAYS_YES=yes
export CONDA_CHANNELS="anaconda,conda-forge,nvidia"
rm -rf "${CONDA_PREFIX:?}/envs/${NEW_ENV:?}"
export PYTHONUSERBASE=$CONDA_PREFIX/envs/$NEW_ENV
conda create -y -n $NEW_ENV python=3.10 &&
conda activate $NEW_ENV &&
conda env config vars set PYTHONUSERBASE=$CONDA_PREFIX
conda env config vars set CONDA_OVERRIDE_CUDA=11.8
conda env config vars set LD_LIBRARY_PATH="$CONDA_PREFIX:/usr/local/cuda/compat:/.singularity.d/libs"
conda env config vars set CUDA_HOME=$CONDA_PREFIX
# reset: variables set with 'conda env config vars set' only take effect after reactivating
conda deactivate &&
conda activate $NEW_ENV &&
conda install -y ipykernel &&
export CONDA_CHANNEL_PRIORITY='strict'
export CONDA_CHANNELS="nvidia/label/cuda-11.8.0"
conda install -y nvidia/label/cuda-11.8.0::cuda
conda install -y nvidia/label/cuda-11.8.0::cuda-toolkit
conda install -y nvidia/label/cuda-11.8.0::cuda-nvrtc
conda install -y nvidia/label/cuda-11.8.0::libcufile
conda install -y nvidia/label/cuda-11.8.0::cuda-tools
conda install -y nvidia/label/cuda-11.8.0::cuda-cudart
conda install -y nvidia/label/cuda-11.8.0::cuda-cudart-dev
pip install -q tensorrt
conda install tensorflow-gpu
pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -q --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
conda install -y conda-forge::openmm
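A quick way to see which of the three frameworks actually reaches the GPU is to import each one in its own Python process. This is only a sketch: gpu_smoke_test is a name made up for these notes, and it assumes that python on the PATH is the interpreter of the new environment.

```shell
# Run one GPU check per framework, each in a separate Python process,
# so that a crash in one import cannot hide the result of the others.
# $1: Python executable to use (hypothetical parameter, defaults to "python").
gpu_smoke_test() {
    py="${1:-python}"
    status=0
    for check in \
        "torch:import torch; assert torch.cuda.is_available()" \
        "jax:import jax; assert jax.default_backend() == 'gpu'" \
        "tensorflow:import tensorflow as tf; assert tf.config.list_physical_devices('GPU')"
    do
        name="${check%%:*}"   # text before the first colon
        code="${check#*:}"    # text after the first colon
        if "$py" -c "$code" >/dev/null 2>&1; then
            echo "$name ok"
        else
            echo "$name FAILED"
            status=1
        fi
    done
    return $status
}
```

Calling gpu_smoke_test with no argument uses python. A FAILED line only says that the import or the GPU check did not succeed; rerunning that one-liner by hand shows the actual error message.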
TF works, but JAX and PyTorch do not:
PyTorch was compiled against (8, 7, 0) but found runtime version (8, 4, 0).
Found CUDA version 11070, but JAX was built against version 11080, which is newer.
Checking... conda list cudnn
# packages in environment at /data/xchem-fragalysis/mferla/waconda/envs/cuda118redux:
#
# Name              Version     Build     Channel
cudnn               8.9.2.26    cuda11_0  anaconda
nvidia-cudnn-cu11   2022.5.19   pypi_0    pypi
nvidia-cudnn-cu116  8.4.0.27    pypi_0    pypi
nvidia-cudnn-cu12   8.9.7.29    pypi_0    pypi
Fixing... pip uninstall nvidia-cudnn-cu116
Torch is fine now
conda list cuda | grep -F '11.7'
nvidia-cuda-cupti-cu117    11.7.50  pypi_0  pypi
nvidia-cuda-nvcc-cu117     11.7.64  pypi_0  pypi
nvidia-cuda-runtime-cu117  11.7.60  pypi_0  pypi
pip uninstall nvidia-cuda-runtime-cu117 nvidia-cuda-nvcc-cu117 nvidia-cuda-cupti-cu117
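The cu116 and cu117 wheels above came in through pip and sit next to the conda CUDA 11.8 packages. Rather than hunting them library by library, they can be listed in one go. A minimal sketch, assuming the input is in the name==version form printed by pip list --format=freeze; stray_nvidia_wheels and its argument are made up for these notes.

```shell
# Print pip-installed NVIDIA wheels whose -cuNN / -cuNNN suffix is not allowed.
# stdin: lines in "name==version" form, as printed by `pip list --format=freeze`.
# $1:    allowed suffixes as an extended-regex alternation, e.g. '11|118'.
stray_nvidia_wheels() {
    allowed="$1"
    grep -E '^nvidia-.*-cu[0-9]+==' \
        | grep -v -E -- "-cu(${allowed})==" \
        | cut -d= -f1
}
```

Usage would be: pip list --format=freeze | stray_nvidia_wheels '11|118'. Which suffixes to allow is a judgment call; the function only filters names and does not uninstall anything.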
Found cuFFT version 10702, but JAX was built against version 10900
conda list cuFFT
# packages in environment at /data/xchem-fragalysis/mferla/waconda/envs/cuda118redux:
#
# Name              Version    Build   Channel
libcufft            11.0.12.1  0       nvidia
libcufft-dev        11.0.12.1  0       nvidia
libcufft-static     11.0.12.1  0       nvidia
nvidia-cufft-cu11   2022.4.8   pypi_0  pypi
nvidia-cufft-cu117  10.7.2.50  pypi_0  pypi
Now I get: Unable to load cuSOLVER
conda list cusolver
# packages in environment at /data/xchem-fragalysis/mferla/waconda/envs/cuda118redux:
#
# Name                 Version     Build   Channel
libcusolver            11.5.4.101  0       nvidia
libcusolver-dev        11.5.4.101  0       nvidia
libcusolver-static     11.5.4.101  0       nvidia
nvidia-cusolver-cu11   2022.4.8    pypi_0  pypi
nvidia-cusolver-cu117  11.3.5.50   pypi_0  pypi
Fixing... conda uninstall libcublas
... but no!
The following packages will be SUPERSEDED by a higher-priority channel:
  cuda-libraries-dev  conda-forge::cuda-libraries-dev-12.3.~ --> nvidia::cuda-libraries-dev-11.6.1-0
  cuda-tools          nvidia/label/cuda-11.8.0::cuda-tools-~ --> nvidia::cuda-tools-11.6.1-0
  cuda-visual-tools   conda-forge::cuda-visual-tools-12.3.2~ --> nvidia::cuda-visual-tools-11.6.1-0
Cancel.
export CONDA_CHANNEL_PRIORITY='strict'
export CONDA_CHANNELS="nvidia/label/cuda-11.8.0"
conda uninstall libcublas
unset CONDA_CHANNEL_PRIORITY
unset CONDA_CHANNELS
Ehrm... Why did I deal with cublas?
conda install nvidia/label/cuda-11.8.0::libcublas
double tap:
conda install -y nvidia/label/cuda-11.8.0::libcusolver
conda install -y nvidia/label/cuda-11.8.0::libcusolver-dev
conda install -y nvidia/label/cuda-11.8.0::libcusolver-static
export CONDA_CHANNEL_PRIORITY='strict'
export CONDA_CHANNELS="nvidia/label/cuda-11.8.0,nvidia,conda-forge,anaconda"
conda install -y nvidia/label/cuda-11.8.0::cuda
conda install -y nvidia/label/cuda-11.8.0::cuda-toolkit
conda install -y nvidia/label/cuda-11.8.0::cuda-nvrtc
conda install -y nvidia/label/cuda-11.8.0::libcufile
conda install -y nvidia/label/cuda-11.8.0::cuda-tools
conda install -y nvidia/label/cuda-11.8.0::cuda-cudart
conda install -y nvidia/label/cuda-11.8.0::cuda-cudart-dev
conda install -y nvidia/label/cuda-11.8.0::cuda-cupti
JAX gives: XlaRuntimeError INTERNAL: libdevice not found at ./libdevice.10.bc
find $CONDA_PREFIX -name "libdevice.*"
/data/xchem-fragalysis/mferla/waconda/envs/cuda118redux/lib/python3.10/site-packages/triton/third_party/cuda/lib/libdevice.10.bc
/data/xchem-fragalysis/mferla/waconda/envs/cuda118redux/lib/python3.10/site-packages/jaxlib/cuda/nvvm/libdevice/libdevice.10.bc
/data/xchem-fragalysis/mferla/waconda/envs/cuda118redux/lib/libdevice.10.bc
/data/xchem-fragalysis/mferla/waconda/envs/cuda118redux/nvvm/libdevice/libdevice.10.bc
Whereas Jax says it is looking in:
/data/xchem-fragalysis/mferla/waconda/envs/cuda118redux/lib/python3.10/site-packages/nvidia/cuda_nvcc
/usr/local/cuda-11.8
/usr/local/cuda
/data/xchem-fragalysis/mferla/waconda/envs/cuda118redux/lib/python3.10/site-packages/nvidia/cuda_nvcc
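To compare the two lists, each directory that JAX searches can be tested directly. This sketch assumes that XLA expects the file at nvvm/libdevice/libdevice.10.bc beneath each CUDA root it lists; that layout matches two of the find hits above but is not confirmed by the error message itself. check_libdevice is a made-up helper name.

```shell
# For every candidate CUDA root given as an argument, report whether it
# contains nvvm/libdevice/libdevice.10.bc (the layout assumed above).
check_libdevice() {
    for root in "$@"; do
        if [ -f "$root/nvvm/libdevice/libdevice.10.bc" ]; then
            echo "found   $root"
        else
            echo "missing $root"
        fi
    done
}
```

For example: check_libdevice "$CONDA_PREFIX" /usr/local/cuda-11.8 /usr/local/cuda "$CONDA_PREFIX/lib/python3.10/site-packages/nvidia/cuda_nvcc". A root reported as found is one that could plausibly be passed as the CUDA data directory.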
None of these do anything:
#export CUDA_HOME=$CONDA_PREFIX/lib/python3.10/site-packages/jaxlib/cuda
export CUDA_HOME=$CONDA_PREFIX
export CUDA_DIR=$CUDA_HOME
export XLA_FLAGS="--xla_gpu_cuda_data_dir=$CUDA_HOME"
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/python3.10/site-packages/jaxlib/cuda:$CONDA_PREFIX:/usr/local/cuda/compat:/.singularity.d/libs
cp -r $CONDA_PREFIX/nvvm $CONDA_PREFIX/lib/python3.10/site-packages/nvidia/cuda_nvcc/nvvm
Nothing worked.
Switched TF and JAX. Now TF works, but with a warning:
successful NUMA node read from SysFS had negative value (-1)
And JAX says:
Found cuBLAS version 111001, but JAX was built against version 111103, which is newer.
rm -rf $CONDA_PREFIX/lib/python3.10/site-packages/nvidia/cuda_nvcc/nvvm
-> no change
Nothing. Reverting the env vars:
unset XLA_FLAGS
export LD_LIBRARY_PATH=$CONDA_PREFIX:/usr/local/cuda/compat:/.singularity.d/libs
conda list cublas
# packages in environment at /data/xchem-fragalysis/mferla/waconda/envs/cuda118redux:
#
# Name               Version     Build   Channel
libcublas            11.11.3.6   0       nvidia/label/cuda-11.8.0
libcublas-dev        11.11.3.6   0       nvidia/label/cuda-11.8.0
nvidia-cublas-cu11   2022.4.8    pypi_0  pypi
nvidia-cublas-cu117  11.10.1.25  pypi_0  pypi
nvidia-cublas-cu12   12.3.4.1    pypi_0  pypi
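The integers in the JAX messages appear to be packed version numbers. Assuming the usual NVIDIA header encodings (CUDA as major*1000 + minor*10, cuFFT as major*1000 + minor*100 + patch, cuBLAS as major*10000 + minor*100 + patch), they decode to versions that agree with the packages listed above: 111001 is 11.10.1, the nvidia-cublas-cu117 wheel, and 111103 is 11.11.3, the conda libcublas. decode_version is a made-up helper, a sketch of that arithmetic only.

```shell
# Decode a packed version integer from a JAX mismatch message.
# $1: library key (cuda, cufft or cublas; keys chosen for this sketch)
# $2: the integer from the message
decode_version() {
    lib="$1"
    n="$2"
    case "$lib" in
        cuda)   echo "$((n / 1000)).$((n % 1000 / 10))" ;;
        cufft)  echo "$((n / 1000)).$((n % 1000 / 100)).$((n % 100))" ;;
        cublas) echo "$((n / 10000)).$((n % 10000 / 100)).$((n % 100))" ;;
        *)      echo "unknown library: $lib" >&2; return 1 ;;
    esac
}
```

So decode_version cublas 111001 points at the 11.10.1 wheel being loaded instead of the 11.11.3 library JAX was built against, which says which package to remove.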
pip uninstall nvidia-cublas-cu117 nvidia-cublas-cu11 nvidia-cublas-cu12
That brought back the missing libdevice path error, even with TF first and then JAX.
Doing it the sledgehammer way: rm -rf $CONDA_PREFIX/lib/python3.10/site-packages/nvidia
It works.
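A caveat worth checking, stated as an assumption rather than something verified in this environment: rm -rf removes the wheel payloads under site-packages/nvidia but not the pip metadata, which lives in sibling nvidia_*.dist-info directories, so pip would presumably still report those packages as installed. The sketch below lists such leftovers; orphaned_nvidia_metadata is a made-up name.

```shell
# List nvidia_*.dist-info directories left behind once the shared
# "nvidia" payload directory has been deleted.
# $1: path to the environment's site-packages directory
orphaned_nvidia_metadata() {
    site_packages="$1"
    # Payload directory still present: nothing has been orphaned.
    if [ -d "$site_packages/nvidia" ]; then
        return 0
    fi
    for meta in "$site_packages"/nvidia_*.dist-info; do
        # An unmatched glob stays literal, so check that the entry exists.
        if [ -e "$meta" ]; then
            basename "$meta"
        fi
    done
}
```

For this environment the call would be: orphaned_nvidia_metadata "$CONDA_PREFIX/lib/python3.10/site-packages". Anything it prints is a candidate for a proper pip uninstall, so that a later pip install does not skip a package it believes is already there.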