Training Freezes with Accelerate Library on Multi-GPU Setup
I'm encountering an issue when running distributed training with the Hugging Face Accelerate library. Training freezes after dataloader initialization when using multiple GPUs, but runs fine on a single GPU. In main.py I have added a marker, #### STOPS OVER HERE ####, and the code stops at that point.
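Since main.py isn't included here, below is a minimal sketch of the setup being described (model loading, tokenization/mapping, dataloader construction, then accelerator.prepare()); the dataset, column names, and batch size are placeholders, not the actual code:

import logging
from accelerate import Accelerator
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding)

logger = logging.getLogger(__name__)

# Mixed precision and process count come from the `accelerate launch` flags.
accelerator = Accelerator()
logger.info(accelerator.state)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-v3-base")

# Placeholder dataset and columns; the real script's data pipeline isn't shown.
raw = load_dataset("glue", "sst2")
tokenized = raw.map(lambda batch: tokenizer(batch["sentence"], truncation=True),
                    batched=True, remove_columns=["sentence", "idx"])

collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_dataloader = DataLoader(tokenized["train"], batch_size=16,
                              shuffle=True, collate_fn=collator)

#### STOPS OVER HERE ####
# The multi-GPU hang is observed around this point.
model, train_dataloader = accelerator.prepare(model, train_dataloader)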
Environment:
- Python 3.10
- PyTorch 2.1
- Accelerate library from Hugging Face
- Model: DeBERTa-v3-base
- Using 2 GPUs
Command Used:
accelerate launch --multi_gpu --num_processes=2 --mixed_precision=fp16 main.py
Current Behavior:
- Training process initializes successfully (model loading, tokenization, and data mapping complete)
- Process freezes after the dataloader stage
- No error message is displayed; the run simply stops making progress
Additional Information:
- The code successfully runs on a single GPU
- Mixed precision (fp16) is enabled
- Data preprocessing appears successful (mapping shows 100% completion)
- Using NCCL backend for distributed training
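As a debugging step (not part of the original run), the same command can be relaunched with NCCL's logging enabled to see which rank and collective the processes are stuck in; NCCL_DEBUG is a standard NCCL environment variable:

NCCL_DEBUG=INFO accelerate launch --multi_gpu --num_processes=2 --mixed_precision=fp16 main.py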
Log:
root@af4cc4b13b7c:/workspace/embedding_layer# accelerate launch --multi_gpu --num_processes=2 --mixed_precision=fp16 main.py
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_machines` was set to a value of `1`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
01/06/2025 11:56:05 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: fp16
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: jamesjohnson1097 (threado_ml). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.1
wandb: Run data is saved locally in /workspace/embedding_layer/wandb/run-20250106_115605-l9j5glhs
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run fresh-sound-5
wandb: ⭐️ View project at
wandb: