Training Freezes with Accelerate Library on Multi-GPU Setup

I’m encountering an issue when trying to run distributed training with the Accelerate library from Hugging Face. Training freezes after dataloader initialization when using multiple GPUs, but works fine on a single GPU. In main.py I have added a marker, #### STOPS OVER HERE ####, and the code stops at exactly that point.
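
For context, the relevant part of main.py looks roughly like this (a simplified sketch, not the exact code; train_dataset, model, optimizer, and num_epochs stand in for objects built earlier in the script):

    from accelerate import Accelerator
    from torch.utils.data import DataLoader

    accelerator = Accelerator()  # fp16 comes from the --mixed_precision launch flag

    # model loading, tokenization, and dataset mapping all complete successfully
    train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)

    model, optimizer, train_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader
    )

    #### STOPS OVER HERE ####
    # With 2 GPUs, no process ever enters this loop; on 1 GPU it runs fine.
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()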

Environment:

  • Python 3.10
  • PyTorch 2.1
  • Accelerate library from Hugging Face
  • Model: DeBERTa-v3-base
  • Using 2 GPUs

Command Used: accelerate launch --multi_gpu --num_processes=2 --mixed_precision=fp16 main.py

Current Behavior:

  • Training process initializes successfully (model loading, tokenization, and data mapping complete)
  • Process freezes after the dataloader stage
  • No error message is displayed; it simply stops proceeding (see the py-spy note after this list)
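
Since there is no traceback, the best idea I have for seeing where each rank is stuck is attaching py-spy (a third-party profiler, not part of my setup above) to the two hung processes and posting the stack dumps:

    py-spy dump --pid <pid-of-rank-0-process>
    py-spy dump --pid <pid-of-rank-1-process>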

Additional Information:

  • The code successfully runs on a single GPU
  • Mixed precision (fp16) is enabled
  • Data preprocessing appears successful (mapping shows 100% completion)
  • Using the NCCL backend for distributed training (a debug-logging rerun command is shown after this list)
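
If more detail would help, I can rerun with NCCL and torch.distributed debug logging enabled (standard environment variables, nothing specific to my setup):

    NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch --multi_gpu --num_processes=2 --mixed_precision=fp16 main.py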

Log:

root@af4cc4b13b7c:/workspace/embedding_layer# accelerate launch --multi_gpu --num_processes=2 --mixed_precision=fp16 main.py
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_machines` was set to a value of `1`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
01/06/2025 11:56:05 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: fp16

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
wandb: Using wandb-core as the SDK backend.  Please refer to  for more information.
wandb: Currently logged in as: jamesjohnson1097 (threado_ml). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.1
wandb: Run data is saved locally in /workspace/embedding_layer/wandb/run-20250106_115605-l9j5glhs
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run fresh-sound-5
wandb: ⭐️ View project at 
wandb: 
