Segmentation fault when calling backward() after moving data to GPU (PyTorch, CUDA 12.1)

I'm running into a segmentation fault (core dumped) error while training a model using PyTorch on a CUDA-enabled GPU. I'm not sure what's going wrong, and would really appreciate any guidance.

My Environment:

GPU: 2× NVIDIA GeForce RTX 4060 Ti
Driver Version: 550.120
CUDA Version (Driver-side): 12.4
cuDNN Version: 8902
PyTorch Version: 2.2.0+cu121
Python: 3.10.12
CUDA available: True
Detected CUDA from PyTorch: 12.1
Host OS: Ubuntu 24.04
Docker Image: nvidia/cuda:12.4.1-runtime-ubuntu22.04
Kernel: Linux 6.8.0-55-generic x86_64 with glibc 2.35
Running inside Docker: Yes
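
For reference, the numbers above match what a quick check like the following prints (a minimal sketch; only standard torch/platform calls are used):

import platform
import torch

print("PyTorch Version:", torch.__version__)              # e.g. 2.2.0+cu121
print("Detected CUDA from PyTorch:", torch.version.cuda)  # e.g. 12.1
print("cuDNN Version:", torch.backends.cudnn.version())   # e.g. 8902
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("Python:", platform.python_version())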

The Problem

During training, the script suddenly crashes with a segmentation fault. The crash does not happen at the same line every time: sometimes it is in .backward(), sometimes while creating a tensor on the GPU with .to(device). It usually occurs after a few training batches, not at the very beginning.

Here’s a simplified version of the code:

def train_test_ht_sl(model, train_data, test_data, head_list, tail_list):
    import datetime
    model.scheduler.step()
    print('start training: ', datetime.datetime.now())
    model.train()
    total_loss = 0.0
    slices = train_data.generate_batch(model.batch_size)

    for i, j in zip(slices, np.arange(len(slices))):
        model.optimizer.zero_grad()
        targets, scores = forward(model, i, train_data)

        # targets = torch.from_numpy(np.array(targets)).long().to('cuda:1')
        targets = torch.tensor(targets).long().to(device)
        loss = model.loss_function(scores, targets - 1)
        loss.backward()  # <- crash sometimes happens here

def forward(model, i, data):
    alias_inputs, A, items, mask, targets = data.get_slice(i)

    alias_inputs = torch.tensor(alias_inputs, dtype=torch.long, device=device)
    items = torch.tensor(items, dtype=torch.long, device=device)
    mask = torch.tensor(mask, dtype=torch.long, device=device)

    A_np = np.stack(A)
    A = torch.tensor(A_np, dtype=torch.float, device=device)  # <- or here

    hidden = model(items, A)

    get = lambda i: hidden[i][alias_inputs[i]]
    seq_hidden = torch.stack([get(i) for i in torch.arange(len(alias_inputs)).long()])

    return targets, model.compute_scores(seq_hidden, mask)

Error Excerpt

Training Progress:  20%|██        | 6/30 [16:14<1:04:58, 162.43s/it]
Fatal Python error: Segmentation fault

Current thread 0x00007... (most recent call first):
  <no Python frame>

Thread 0x00007...:
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  ...
  File "/usr/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266 in backward

My Question

I'm still new to CUDA programming and PyTorch internals, so I'm not sure: Why might this segmentation fault occur? Am I doing something wrong when moving data to the GPU? Is there a safer or more proper way to handle tensors before calling .backward()? Any help or explanation would be really appreciated. Thank you in advance!
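
Edit: in case more output would help, this is the debugging setup I could add at the very top of the entry script to get a more precise crash location (only a sketch; I have not verified it on my machine):

import faulthandler
import os

# CUDA_LAUNCH_BLOCKING must be set before torch initializes CUDA, so this has
# to run before the first `import torch`: kernels then launch synchronously
# and errors point at the real call site instead of a later, unrelated line.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

faulthandler.enable()  # dump the Python stack on a hard crash

import torch

# Re-runs autograd with extra checks and reports which forward op produced a
# failing gradient. Noticeably slower, so only for debugging runs.
torch.autograd.set_detect_anomaly(True)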


1 Answer


Here's a modified version of your train_test_ht_sl and forward functions with a couple of suggestions applied: build targets directly on the target device, and assert that the targets and the loss contain no NaNs or Infs before calling backward():

def train_test_ht_sl(model, train_data, test_data, head_list, tail_list):
    import datetime
    model.scheduler.step()
    print('start training: ', datetime.datetime.now())
    model.train()
    total_loss = 0.0
    slices = train_data.generate_batch(model.batch_size)

    for i, j in zip(slices, np.arange(len(slices))):
        model.optimizer.zero_grad()
        targets, scores = forward(model, i, train_data)

        # Ensure targets are on the correct device
        targets = torch.tensor(targets, dtype=torch.long, device=device)
        
        # Check for NaNs or Infs
        assert not torch.isnan(targets).any(), "Targets contain NaNs"
        assert not torch.isinf(targets).any(), "Targets contain Infs"

        loss = model.loss_function(scores, targets - 1)
        
        # Check for NaNs in loss
        assert not torch.isnan(loss).any(), "Loss contains NaNs"
        
        loss.backward()  # <- crash sometimes happens here

def forward(model, i, data):
    alias_inputs, A, items, mask, targets = data.get_slice(i)

    alias_inputs = torch.tensor(alias_inputs, dtype=torch.long, device=device)
    items = torch.tensor(items, dtype=torch.long, device=device)
    mask = torch.tensor(mask, dtype=torch.long, device=device)

    A_np = np.stack(A)
    A = torch.tensor(A_np, dtype=torch.float, device=device)

    hidden = model(items, A)

    get = lambda i: hidden[i][alias_inputs[i]]
    seq_hidden = torch.stack([get(i) for i in torch.arange(len(alias_inputs)).long()])

    return targets, model.compute_scores(seq_hidden, mask)
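
Crashes that jump between .to(device) and .backward() are frequently asynchronous CUDA errors surfacing late, and out-of-range indices (for example a target of 0 turning into class index -1 after targets - 1) are a classic trigger. Below is a sketch of two additional things worth trying: routing the host-to-device conversions through a contiguous NumPy array, and checking index ranges on the CPU before computing the loss. The helper name to_device_tensor is mine, and it assumes get_slice() returns NumPy-compatible arrays:

import numpy as np
import torch

def to_device_tensor(array_like, dtype, device):
    # Build a CPU tensor from a contiguous NumPy array first, then copy it to
    # the GPU. Keeping the two steps explicit makes it easier to see which of
    # them actually fails.
    arr = np.ascontiguousarray(np.asarray(array_like))
    return torch.from_numpy(arr).to(dtype=dtype, device=device)

# Example usage for the tensors built in forward() and for the targets:
# alias_inputs = to_device_tensor(alias_inputs, torch.long, device)
# items        = to_device_tensor(items, torch.long, device)
# mask         = to_device_tensor(mask, torch.long, device)
# A            = to_device_tensor(np.stack(A), torch.float, device)
# targets      = to_device_tensor(targets, torch.long, device)

# Cheap CPU-side range check before computing the loss; an out-of-range class
# index on the GPU often only crashes later, e.g. inside backward():
# assert targets.min() >= 1 and targets.max() - 1 < scores.size(1)

If the crash persists with clean conversions and in-range indices, running one epoch entirely on the CPU (device = torch.device("cpu")) usually turns the segmentation fault into an ordinary Python exception with a readable traceback, which narrows the problem down further.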
