python - YOLOv9e-seg training on 6 A100-80G and tried to optimize as much as I could but after the validation stage there is the


Updated: 2025-03-10


I am trying to train a YOLOv9e-seg model on 336 images of size 4096x4096, split into train and val sets in an 80:20 ratio. Previously I got an error even during training, but after tuning the parameters of the train method I was able to get past it. I am not sure, but in a few older versions of my code validation completed and the error appeared at some later step. In the current version the program fails in the validation step with "torch.OutOfMemoryError: CUDA out of memory". The training code is below:

import os
import torch
import atexit
import gc
from ultralytics import YOLO
from torch.nn import DataParallel

# Remap GPUs to a contiguous set using CUDA_VISIBLE_DEVICES.
# For example, if you want to use physical GPUs 0, 1, 3, 4, 5, 6:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3,4,5,6"

# Set environment variable to help reduce memory fragmentation.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

# Function to clear GPU memory.
def clear_gpu_memory():
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

# Ensure that GPU memory is cleared on exit.
atexit.register(clear_gpu_memory)

# Load the pretrained YOLOv9 segmentation model and compile it.
model = YOLO("yolov9e-seg.pt")
model.model = torch.compile(model.model)

try:
    # Train the model with your specified parameters.
    model.train(
        data='training_data/brain_data.yaml',
        epochs=2,
        imgsz=4096,
        batch=6,
        project='brain_segmentation',
        name='testrun',
        device=[0, 1, 3, 4, 5, 6],
        close_mosaic=1,
        save_period=1,
        amp=True,
        cache=False,
        overlap_mask=False,
        workers=4,
    )

    # If available, try deleting the optimizer to free memory.
    try:
        del model.optimizer
    except AttributeError:
        pass

    # Force garbage collection and clear cached GPU memory after training.
    gc.collect()
    torch.cuda.empty_cache()

    # Get the number of GPUs now visible (they are renumbered from 0 to N-1).
    available_gpus = torch.cuda.device_count()
    print(f"Available GPUs (contiguous numbering): {list(range(available_gpus))}")

    # Wrap the model in DataParallel for training.
    model.model = DataParallel(model.model, device_ids=list(range(available_gpus)))
    model.model.to('cuda')

    # --- Before validation, unwrap and fuse the model ---
    # The fused model is expected to be used on a single device, so we unwrap the DataParallel container.
    if isinstance(model.model, DataParallel):
        # Unwrap and call the underlying fuse() method.
        fused_module = model.model.module.fuse(verbose=False)
        model.model = fused_module
    else:
        model.model = model.model.fuse(verbose=False)
    
    print("Model fused.")

    # Validate using memory optimizations:
    # - torch.inference_mode() to disable gradient tracking.
    # - torch.amp.autocast with device_type='cuda' for mixed-precision inference.
    with torch.inference_mode():
        with torch.amp.autocast(device_type='cuda'):
            model.val(
                device=list(range(available_gpus)),
                batch=6,
                imgsz=4096
            )
    
    print("Validation complete.")

    # Export the fused model to ONNX (typically done on a single GPU).
    model.export(
        device=0,
        imgsz=4096,
        half=True,
        simplify=True,
        opset=12
    )

except KeyboardInterrupt:
    print("Training interrupted. Clearing GPU memory...")
    clear_gpu_memory()
    raise

except Exception as e:
    print(f"An error occurred: {e}. Clearing GPU memory...")
    clear_gpu_memory()
    raise
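To see how much memory each device actually holds right before the model.val(...) call in the script above, a small reporting helper might help. This is only a sketch: the function names format_gpu_report and collect_gpu_stats are hypothetical, and the helper only reads PyTorch's own counters, it does not free anything.

```python
def format_gpu_report(stats):
    """Format (index, allocated, reserved, free, total) byte counts as one GiB row per device."""
    gib = 1024 ** 3
    lines = []
    for index, allocated, reserved, free, total in stats:
        lines.append(
            f"cuda:{index} allocated={allocated / gib:.1f}G "
            f"reserved={reserved / gib:.1f}G "
            f"free={free / gib:.1f}G total={total / gib:.1f}G"
        )
    return "\n".join(lines)


def collect_gpu_stats():
    """Query every visible CUDA device; returns an empty list when CUDA is unavailable."""
    import torch  # imported lazily so the formatter above works without PyTorch

    if not torch.cuda.is_available():
        return []
    stats = []
    for index in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(index)  # driver-level view of the device
        stats.append((
            index,
            torch.cuda.memory_allocated(index),  # bytes held by live tensors
            torch.cuda.memory_reserved(index),   # bytes held by the caching allocator
            free,
            total,
        ))
    return stats


# Possible use, e.g. directly before and after model.val(...):
# print(format_gpu_report(collect_gpu_stats()))
```

Comparing the report from before training, after gc.collect() and torch.cuda.empty_cache(), and just before validation would show whether memory from the training phase is still held when validation starts.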

My config file is training_data/brain_data.yaml:

path: work_my/new_yolo_4096/training_data
train:
  - images/train  # Path to training images
  - labels/train  # Path to training annotations
val:
  - images/val  # Path to validation images
  - labels/val  # Path to validation annotations

nc: 25
names: ['Thalamus', 'Caudate nucleus', 'Putamen', 'Globus pallidus', 'Nucleus accumbens', 'Internal capsule', 'Substantia innominata', 'Fornix', 'Anterior commissure', 'Ganglionic eminence', 'Hypothalamus', 'Amygdala', 'Hippocampus', 'Choroid plexus', 'Lateral ventricle', 'Olfactory tubercle', 'Pretectum', 'Inferior colliculus', 'Superior colliculus', 'Tegmentum', 'Pons', 'Medulla', 'Cerebellum', 'Corpus callosum', 'Cerebral cortex']

Some points:

My training data is properly prepared; there is no problem with loading the data or with wrong paths in the config.
I want to train the model at the full 4096x4096 resolution, so please don't suggest reducing the image size.
The batch size must be a multiple of the number of devices, so the minimum is 6, which is what I use. I can't reduce it further, only increase it in multiples of 6 (which I wouldn't want to do because I am already out of memory).
All the GPUs were empty; no memory or compute was in use by any other program.
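The batch and resolution constraints above can be put into numbers with a short sketch. It assumes 3-channel input images (the question does not state the channel count), and both function names are hypothetical.

```python
def per_device_batch(total_batch: int, num_devices: int) -> int:
    """Images each device receives when the batch is split evenly."""
    if total_batch % num_devices != 0:
        raise ValueError("batch size must be a multiple of the device count")
    return total_batch // num_devices


def input_tensor_mib(side: int, channels: int = 3, bytes_per_value: int = 2) -> float:
    """Size in MiB of one square input image tensor (channels=3 is an assumption)."""
    return side * side * channels * bytes_per_value / 2**20


print(per_device_batch(6, 6))                      # 1 image per GPU
print(input_tensor_mib(4096))                      # 96.0 MiB in half precision
print(input_tensor_mib(4096, bytes_per_value=4))   # 192.0 MiB in single precision
```

Under these assumptions a single input image is far below 80 GB, which suggests that most of the memory goes to intermediate activations and mask tensors rather than to the inputs themselves.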

Training completes, as this part of the output shows:

Starting training for 2 epochs...

Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
1/2 81G 2.821 6.069 54.57 2.973 30 4096: 100%|██████████| 45/45 [00:56<00:00, 1.25s/it]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|██████████| 34/34 [00:27<00:00, 1.26it/s]
all 68 2181 0.00188 0.0077 0.00106 0.000624 0.000624 0.00377 0.000345 0.000176
Closing dataloader mosaic

Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
2/2 60.8G 2.783 4.883 49.26 2.989 37 4096: 100%|██████████| 45/45 [00:49<00:00, 1.09s/it]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|██████████| 34/34 [00:27<00:00, 1.24it/s]
all 68 2181 0.00188 0.0077 0.00106 0.000624 0.000624 0.00377 0.000345 0.000176

2 epochs completed in 0.048 hours.
Optimizer stripped from brain_segmentation/testrun21/weights/last.pt, 124.0MB
Optimizer stripped from brain_segmentation/testrun21/weights/best.pt, 124.0MB
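The step counts in the log above can be cross-checked against the dataset size with a little arithmetic. The sketch uses only numbers that appear in the question and the log.

```python
import math

TOTAL_IMAGES = 336   # stated in the question
VAL_IMAGES = 68      # "all 68 2181 ..." row of the log
TRAIN_STEPS = 45     # "45/45" training progress bar
VAL_STEPS = 34       # "34/34" validation progress bar
TRAIN_BATCH = 6      # batch=6 passed to model.train

train_images = TOTAL_IMAGES - VAL_IMAGES
expected_train_steps = math.ceil(train_images / TRAIN_BATCH)
images_per_val_step = VAL_IMAGES / VAL_STEPS

print(train_images, expected_train_steps, images_per_val_step)
# prints: 268 45 2.0
```

So the validation passes that ran during training appear to have processed 2 images per step, while the standalone model.val(...) call in the script requests batch=6. How that batch is distributed across devices is not shown in the output, so this is only a possible place to look.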

Then the error:

Results saved to brain_segmentation/testrun21
Ultralytics 8.3.74 
