
On an Nvidia 4070 Super with 12GB VRAM, loading the DeepSeek "deepseek-ai/deepseek-llm-7b-chat" model constantly throws an error, even when the GPU budget in max_memory is set as low as 4GB:

Error loading quantized model: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.

Here is the relevant portion of the code:

import torch
from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig
from accelerate import init_empty_weights, infer_auto_device_map

# Step 1: Quantization config
model_name: str = "deepseek-ai/deepseek-llm-7b-chat"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # you can try load_in_8bit=True if needed
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"               # use NormalFloat4 quantization
)

# Step 2: Load config only (avoids full model load)
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Step 3: Infer the device map **without fully loading the model**
device_map1 = infer_auto_device_map(
    model,
    max_memory={0: "4GB", "cpu": "32GB"}  # adjust based on available resources
)
device_map_update = device_map1
print(f"Custom device map1 in 4bit: {device_map1}")

torch.cuda.empty_cache()
model = AutoModelForCausalLM.from_pretrained(  # <<== loading error occurs here
    model_name,
    # config=config,
    quantization_config=quantization_config,
    trust_remote_code=True,
    device_map=device_map1,  # let Transformers manage GPU allocation
    **kwargs
)

Here is the output of nvidia-smi; the GPU VRAM occupied was still over 10GB:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 Off |                  N/A |
|  0%   37C    P8              4W /  220W |   10467MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     54363      C   ...aiworker9/code/py/myenv/bin/python3      10460MiB |
+-----------------------------------------------------------------------------------------+

The code looks OK to me, but it constantly throws a CUDA OOM error. device_map="cpu" works fine. What is missing here for CUDA?

1 Answer

You can't force the entire 7B model, even 4-bit quantized, into just 4GB of GPU memory with the default auto device-mapping. Some parts of the model will need to be offloaded to the CPU in 32-bit, so you need to (1) enable offloading with llm_int8_enable_fp32_cpu_offload=True and (2) provide or adjust a custom device map that says which layers go to the CPU and which to the GPU. Simply setting max_memory={0: "4GB", "cpu": "32GB"} does not guarantee a fit: if the library determines it cannot keep all quantized modules within 4GB, it dispatches the rest to the CPU, which is exactly what triggers the error above. The max_memory parameter is a budget for placement, not a hard limit on what gets loaded.
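For a rough sense of the numbers (a back-of-the-envelope sketch, not an exact measurement): the 4-bit weights of a 7B model alone are already over 3GB, before counting quantization constants, any modules kept in higher precision, the CUDA context, and activations, so a 4GB budget leaves essentially no headroom.

# Back-of-the-envelope VRAM estimate for a 7B model quantized to 4 bits.
# Real usage is higher: quantization constants, modules kept in fp16/fp32,
# the CUDA context, and activations all come on top of this.
n_params = 7e9
weight_bytes = n_params * 0.5            # 4 bits = 0.5 bytes per parameter
print(f"4-bit weights alone: ~{weight_bytes / 1024**3:.1f} GiB")  # ~3.3 GiB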

First, try increasing max_memory and see whether the auto device map can sort itself out; I would set the GPU budget at or just below your total available VRAM.
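For example (a sketch of the same infer_auto_device_map call from the question, with the GPU budget raised; exactly how much headroom to leave for the CUDA context and activations is a judgment call):

from accelerate import infer_auto_device_map

# Give the GPU most of the 12GB card, leaving ~1GB headroom for the
# CUDA context and activations; keep the CPU as the overflow target.
device_map1 = infer_auto_device_map(
    model,
    max_memory={0: "11GiB", "cpu": "32GiB"}
)
print(device_map1)  # check which modules, if any, still land on the CPU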

Failing that, you will need to inspect the auto-generated device map (just print(device_map1)) and then adjust which layers get offloaded to the CPU. Ideally you want to find a balance where as many layers as possible fit on the GPU without OOM errors. The map explicitly tells the loader which layers you want on the CPU. Here is an example of moving blocks to the CPU:

device_map1["lm_head"] = "cpu"
for key in device_map1.keys():
    # Adjust the prefix to match the module names printed in device_map1
    # (e.g. LLaMA-style models such as DeepSeek use "model.layers." rather
    # than "transformer.h.").
    if key.startswith("transformer.h."):
        # This offloads every matching block; offload only a subset
        # (e.g. the higher-numbered layers) if you want more on the GPU.
        device_map1[key] = "cpu"
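A quick sanity check before loading (assuming device_map1 is the dict produced above) is to count how many modules ended up on each device:

from collections import Counter

# Tally module placements after the manual edits, e.g. Counter({0: 25, 'cpu': 8}).
print(Counter(device_map1.values()))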

Then initialize your model with llm_int8_enable_fp32_cpu_offload=True set in the quantization config and the adjusted device_map:

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_enable_fp32_cpu_offload=True,  # keep offloaded modules in fp32 on the CPU
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map=device_map1,
    trust_remote_code=True,
    **kwargs
)
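Once it loads, a minimal smoke test (assuming the matching tokenizer; generation will be noticeably slower for whatever ended up on the CPU):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)

# Offloaded layers run on the CPU, so expect this to take a while.
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))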
