Training Freezes with Accelerate Library on Multi-GPU Setup

I’m encountering an issue when trying to run distributed training with the Accelerate library from Hugging Face. Training freezes after dataloader initialization when using multiple GPUs, but works fine on a single GPU. In main.py I have added a marker, #### STOPS OVER HERE ####, and the code stops at exactly that point.
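
For context, the relevant part of main.py looks roughly like this (a simplified sketch, not the exact code; train_dataset, model, optimizer, and num_epochs stand in for objects built earlier in the script):

    from accelerate import Accelerator
    from torch.utils.data import DataLoader

    accelerator = Accelerator()  # fp16 comes from the --mixed_precision launch flag

    # model loading, tokenization, and dataset mapping all complete successfully
    train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)

    model, optimizer, train_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader
    )

    #### STOPS OVER HERE ####
    # With 2 GPUs, no process ever enters this loop; on 1 GPU it runs fine.
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()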

Environment:

  • Python 3.10
  • PyTorch 2.1
  • Accelerate library from Hugging Face
  • Model: DeBERTa-v3-base
  • Using 2 GPUs

Command Used: accelerate launch --multi_gpu --num_processes=2 --mixed_precision=fp16 main.py

Current Behavior:

  • Training process initializes successfully (model loading, tokenization, and data mapping complete)
  • Process freezes after the dataloader stage
  • No error message is displayed; it simply stops proceeding (see the py-spy note after this list)
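
Since there is no traceback, the best idea I have for seeing where each rank is stuck is attaching py-spy (a third-party profiler, not part of my setup above) to the two hung processes and posting the stack dumps:

    py-spy dump --pid <pid-of-rank-0-process>
    py-spy dump --pid <pid-of-rank-1-process>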

Additional Information:

  • The code successfully runs on a single GPU
  • Mixed precision (fp16) is enabled
  • Data preprocessing appears successful (mapping shows 100% completion)
  • Using the NCCL backend for distributed training (a debug-logging rerun command is shown after this list)
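
If more detail would help, I can rerun with NCCL and torch.distributed debug logging enabled (standard environment variables, nothing specific to my setup):

    NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch --multi_gpu --num_processes=2 --mixed_precision=fp16 main.py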

Log:

root@af4cc4b13b7c:/workspace/embedding_layer# accelerate launch --multi_gpu --num_processes=2 --mixed_precision=fp16 main.py
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_machines` was set to a value of `1`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
01/06/2025 11:56:05 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: fp16

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
wandb: Using wandb-core as the SDK backend.  Please refer to  for more information.
wandb: Currently logged in as: jamesjohnson1097 (threado_ml). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.1
wandb: Run data is saved locally in /workspace/embedding_layer/wandb/run-20250106_115605-l9j5glhs
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run fresh-sound-5
wandb: ⭐️ View project at 
wandb: 
