Handling the "NCCL communicator and retrieving ncclUniqueId" error when training an LLM on multiple GPUs

Development environment

nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A5000               Off |   00000000:1C:00.0 Off |                  Off |
| 54%   79C    P2            215W /  230W |   17200MiB /  24564MiB |     94%      Default |
|                                         |                        |                  N/A |

nvcc

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Details

When training on multiple GPUs, the following error occurred.

RuntimeError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout

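In this case, set the environment variable NCCL_IB_DISABLE=1 before launching training. This prevents NCCL from using its InfiniBand/RoCE transport, so it falls back to IP sockets.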

export NCCL_IB_DISABLE=1
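
To check that the setting is actually picked up, one option is to also turn on NCCL's debug logging and confirm that both variables are exported before starting the job. The following is a minimal sketch rather than the exact procedure used above. NCCL_DEBUG is a standard NCCL environment variable, but the launch command, the script name train.py, and the process count of 8 are hypothetical placeholders, so that line is left commented out.

```shell
# Disable NCCL's InfiniBand/RoCE transport for processes started from this shell.
export NCCL_IB_DISABLE=1

# Optional: make NCCL print initialization details, including the transport it selects.
export NCCL_DEBUG=INFO

# Confirm that the variables are exported and visible to child processes.
env | grep '^NCCL_'

# Hypothetical launch command (assumption): replace train.py and the process count with your own.
# torchrun --nproc_per_node=8 train.py
```

With NCCL_DEBUG=INFO set, the startup log of each rank should indicate which network transport NCCL selected, which makes it possible to see whether InfiniBand is still being used. On a multi-node job, the variables need to be set on every node.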