I'm running into a segmentation fault (core dumped) error while training a model using PyTorch on a CUDA-enabled GPU. I'm not sure what's going wrong, and would really appreciate any guidance.
My Environment:
GPU: 2× NVIDIA GeForce RTX 4060 Ti
Driver Version: 550.120
CUDA Version (Driver-side): 12.4
cuDNN Version: 8902
PyTorch Version: 2.2.0+cu121
Python: 3.10.12
CUDA available: True
Detected CUDA from PyTorch: 12.1
Host OS: Ubuntu 24.04
Docker Image: nvidia/cuda:12.4.1-runtime-ubuntu22.04
Kernel: Linux 6.8.0-55-generic x86_64 with glibc 2.35
Running inside Docker: Yes
The Problem
During training, the script suddenly crashes with a segmentation fault. The crash does not happen at a specific line every time: sometimes it happens in .backward(), sometimes while creating a tensor on the GPU using .to(device). It usually occurs after a few training batches, not at the very beginning.
Here’s a simplified version of the code:
def train_test_ht_sl(model, train_data, test_data, head_list, tail_list):
    import datetime
    model.scheduler.step()
    print('start training: ', datetime.datetime.now())
    model.train()
    total_loss = 0.0
    slices = train_data.generate_batch(model.batch_size)
    for i, j in zip(slices, np.arange(len(slices))):
        model.optimizer.zero_grad()
        targets, scores = forward(model, i, train_data)
        # targets = torch.from_numpy(np.array(targets)).long().to('cuda:1')
        targets = torch.tensor(targets).long().to(device)
        loss = model.loss_function(scores, targets - 1)
        loss.backward()  # <- crash sometimes happens here

def forward(model, i, data):
    alias_inputs, A, items, mask, targets = data.get_slice(i)
    alias_inputs = torch.tensor(alias_inputs, dtype=torch.long, device=device)
    items = torch.tensor(items, dtype=torch.long, device=device)
    mask = torch.tensor(mask, dtype=torch.long, device=device)
    A_np = np.stack(A)
    A = torch.tensor(A_np, dtype=torch.float, device=device)  # <- or here
    hidden = model(items, A)
    get = lambda i: hidden[i][alias_inputs[i]]
    seq_hidden = torch.stack([get(i) for i in torch.arange(len(alias_inputs)).long()])
    return targets, model.compute_scores(seq_hidden, mask)
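For reference, device in the snippets above is a single torch.device set near the top of my script, roughly like this (simplified sketch; the exact index is not important):

import numpy as np
import torch

# Assumption: this mirrors how the script picks its device; the commented-out
# line above shows I also experimented with 'cuda:1'.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')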
Error Excerpt
Training Progress: 20%|██ | 6/30 [16:14<1:04:58, 162.43s/it]
Fatal Python error: Segmentation fault
Current thread 0x00007... (most recent call first):
<no Python frame>
Thread 0x00007...:
File "/usr/lib/python3.10/threading.py", line 324 in wait
...
File "/usr/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266 in backward
My Question
I'm still new to CUDA programming and PyTorch internals, so I'm not sure:
- Why might this segmentation fault occur?
- Am I doing something wrong when moving data to the GPU?
- Is there a safer or more proper way to handle tensors before calling .backward()?
Any help or explanation would be really appreciated. Thank you in advance!
1 Answer
Here's a modified version of your train_test_ht_sl function with some defensive checks applied:
def train_test_ht_sl(model, train_data, test_data, head_list, tail_list):
    import datetime
    model.scheduler.step()
    print('start training: ', datetime.datetime.now())
    model.train()
    total_loss = 0.0
    slices = train_data.generate_batch(model.batch_size)
    for i, j in zip(slices, np.arange(len(slices))):
        model.optimizer.zero_grad()
        targets, scores = forward(model, i, train_data)
        # Ensure targets are on the correct device
        targets = torch.tensor(targets, dtype=torch.long, device=device)
        # Check for NaNs or Infs
        assert not torch.isnan(targets).any(), "Targets contain NaNs"
        assert not torch.isinf(targets).any(), "Targets contain Infs"
        loss = model.loss_function(scores, targets - 1)
        # Check for NaNs in loss
        assert not torch.isnan(loss).any(), "Loss contains NaNs"
        loss.backward()  # <- crash sometimes happens here

def forward(model, i, data):
    alias_inputs, A, items, mask, targets = data.get_slice(i)
    alias_inputs = torch.tensor(alias_inputs, dtype=torch.long, device=device)
    items = torch.tensor(items, dtype=torch.long, device=device)
    mask = torch.tensor(mask, dtype=torch.long, device=device)
    A_np = np.stack(A)
    A = torch.tensor(A_np, dtype=torch.float, device=device)
    hidden = model(items, A)
    get = lambda i: hidden[i][alias_inputs[i]]
    seq_hidden = torch.stack([get(i) for i in torch.arange(len(alias_inputs)).long()])
    return targets, model.compute_scores(seq_hidden, mask)
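Two further points worth checking, since device-side index errors commonly show up later at an unrelated line such as .backward(). First, if model.loss_function is a cross-entropy over a fixed number of classes (an assumption about your model), an out-of-range value in targets - 1 (for example a target of 0, which becomes -1) can trip a device-side assert instead of a clean Python exception. Second, building tensors from NumPy arrays avoids the slow element-by-element copy that torch.tensor performs on nested Python lists. A minimal sketch, assuming scores has shape (batch, n_classes) and targets are 1-based labels:

# Assumption: scores is the (batch, n_classes) output from forward().
n_classes = scores.size(1)
targets = torch.from_numpy(np.asarray(targets, dtype=np.int64)).to(device)

# Out-of-range class indices can abort inside CUDA kernels or surface as an
# error at a later, unrelated call, so validate them on the CPU side first.
assert int(targets.min()) >= 1 and int(targets.max()) <= n_classes, \
    "targets must lie in [1, n_classes] before the `targets - 1` shift"

loss = model.loss_function(scores, targets - 1)

If the range check passes and the crash persists, re-running with CUDA_LAUNCH_BLOCKING=1 set in the environment makes the reported crash location trustworthy, since kernel launches are otherwise asynchronous.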