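Introduction
I sometimes use torch for AI-related training. I ran into a problem where training would not start because of a cuDNN error, so I am noting the fix here.
Development environment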
nvidia-smi
Sun May 12 08:37:42 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0              54W / 400W |      4MiB / 40960MiB |     26%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
Error details
/usr/local/cuda/lib64/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
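Solution
First, delete the cuDNN libraries and headers from the local CUDA installation, presumably so that torch no longer picks up the mismatched system copies named in the error.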
cd /usr/local/cuda-12.1/lib64
sudo rm -f libcudnn*
cd /usr/local/cuda-12.1/include
sudo rm -f cudnn*
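Since `rm -f` deletes without asking, it can help to list what the two glob patterns match before running it. The following is a minimal sketch of such a dry run; the helper name `find_cudnn_files` is hypothetical, and the CUDA root is the path used in the commands above, which may differ on your machine.

```python
from pathlib import Path


def find_cudnn_files(cuda_root):
    """Return the cuDNN libraries and headers found under one CUDA root."""
    root = Path(cuda_root)
    # Same glob patterns as the rm commands: lib64/libcudnn* and include/cudnn*.
    libs = sorted(root.glob("lib64/libcudnn*"))
    headers = sorted(root.glob("include/cudnn*"))
    return libs + headers


if __name__ == "__main__":
    # Assumed install location, taken from the commands above; adjust as needed.
    for path in find_cudnn_files("/usr/local/cuda-12.1"):
        print(path)
```

This only reads the directory, so it needs no sudo and changes nothing on disk.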
Next, apply the CUDA version to the bash environment.
# cuda version change
export PATH=/usr/local/cuda-12.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH
source ~/.bashrc
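Because both variables are searched from left to right, the first CUDA entry is the one that takes effect. As a rough check, the sketch below prints the CUDA-related entries of `PATH` and `LD_LIBRARY_PATH` in search order; the helper name `cuda_entries` is hypothetical, and matching on the substring "cuda" is a simplifying assumption about how the install directories are named.

```python
import os


def cuda_entries(path_value):
    """Return the entries of a PATH-style string that mention 'cuda', in search order."""
    return [p for p in path_value.split(os.pathsep) if "cuda" in p]


if __name__ == "__main__":
    for var in ("PATH", "LD_LIBRARY_PATH"):
        entries = cuda_entries(os.environ.get(var, ""))
        # The first entry is the one the shell or the dynamic loader tries first.
        print(var, "->", entries[0] if entries else "no CUDA entry")
```

Run it from the same shell session in which the exports were applied, since a new process inherits that session's environment.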
Finally, check the current state.
import torch
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.version.cuda)
print(torch.backends.cudnn.version())
If the output looks like the following, everything is fine.
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
2.3.0+cu121
>>> print(torch.cuda.is_available())
True
>>> print(torch.version.cuda)
12.1
>>> print(torch.backends.cudnn.version())
8902
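The integer returned by `torch.backends.cudnn.version()` is a packed version number. The sketch below decodes it, assuming the cuDNN 8.x scheme of major * 1000 + minor * 100 + patch; other major versions may pack the number differently, so treat the helper (a hypothetical name, `decode_cudnn_version`) as valid for 8.x values only.

```python
def decode_cudnn_version(version):
    """Split a cuDNN 8.x integer version such as 8902 into (major, minor, patch)."""
    # Assumes the 8.x packing: major * 1000 + minor * 100 + patch.
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


print(decode_cudnn_version(8902))  # (8, 9, 2)
```

Under that assumption, 8902 reads as cuDNN 8.9.2, which matches the major version 8 in the `.so.8` library name from the error message.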