
On an Nvidia 4070 Super with 12GB VRAM, loading the DeepSeek "deepseek-ai/deepseek-llm-7b-chat" model constantly throws an error, even when the GPU budget in max_memory is set as low as 4GB:

Error loading quantized model: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.

Here is the relevant portion of the code:

import torch
from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig
from accelerate import init_empty_weights, infer_auto_device_map

# Step 1: Quantization config
model_name: str = "deepseek-ai/deepseek-llm-7b-chat"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # you can try load_in_8bit=True if needed
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"               # use NormalFloat4 quantization
)

# Step 2: Load config only (avoids full model load)
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Step 3: Infer the device map **without fully loading the model**
device_map1 = infer_auto_device_map(
    model,
    max_memory={0: "4GB", "cpu": "32GB"}  # adjust based on available resources
)
device_map_update = device_map1
print(f"Custom device map1 in 4bit: {device_map1}")

torch.cuda.empty_cache()
model = AutoModelForCausalLM.from_pretrained(  # <<== loading error occurs here
    model_name,
    # config=config,
    quantization_config=quantization_config,
    trust_remote_code=True,
    device_map=device_map1,  # let Transformers manage GPU allocation
    **kwargs
)

Here is the output of nvidia-smi; the GPU VRAM occupied was still over 10GB:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 Off |                  N/A |
|  0%   37C    P8              4W /  220W |   10467MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     54363      C   ...aiworker9/code/py/myenv/bin/python3      10460MiB |
+-----------------------------------------------------------------------------------------+

The code looks OK to me, but it constantly throws a CUDA OOM error. device_map="cpu" works fine. What is missing here for CUDA?

1 Answer

You can't force the entire 7B model, even 4-bit quantized, into just 4GB of GPU memory with the default auto device-mapping. Some parts of the model will need to be offloaded to the CPU in 32-bit, so you need to (1) enable offloading with llm_int8_enable_fp32_cpu_offload=True and (2) provide or adjust a custom device map that says which layers go to the CPU and which to the GPU. Simply setting max_memory={0: "4GB", "cpu": "32GB"} does not guarantee a fit: if the library determines it cannot keep all quantized modules within 4GB, it dispatches the rest to the CPU, which is exactly what triggers the error above. The max_memory parameter is a budget for placement, not a hard limit on what gets loaded.
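For a rough sense of the numbers (a back-of-the-envelope sketch, not an exact measurement): the 4-bit weights of a 7B model alone are already over 3GB, before counting quantization constants, any modules kept in higher precision, the CUDA context, and activations, so a 4GB budget leaves essentially no headroom.

# Back-of-the-envelope VRAM estimate for a 7B model quantized to 4 bits.
# Real usage is higher: quantization constants, modules kept in fp16/fp32,
# the CUDA context, and activations all come on top of this.
n_params = 7e9
weight_bytes = n_params * 0.5            # 4 bits = 0.5 bytes per parameter
print(f"4-bit weights alone: ~{weight_bytes / 1024**3:.1f} GiB")  # ~3.3 GiB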

First, try increasing max_memory and see whether the auto device map can sort itself out; I would set the GPU budget at or just below your total available VRAM.
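For example (a sketch of the same infer_auto_device_map call from the question, with the GPU budget raised; exactly how much headroom to leave for the CUDA context and activations is a judgment call):

from accelerate import infer_auto_device_map

# Give the GPU most of the 12GB card, leaving ~1GB headroom for the
# CUDA context and activations; keep the CPU as the overflow target.
device_map1 = infer_auto_device_map(
    model,
    max_memory={0: "11GiB", "cpu": "32GiB"}
)
print(device_map1)  # check which modules, if any, still land on the CPU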

Failing that, you will need to inspect the auto-generated device map (just print(device_map1)) and then adjust which layers get offloaded to the CPU. Ideally you want to find a balance where as many layers as possible fit on the GPU without OOM errors. The map explicitly tells the loader which layers you want on the CPU. Here is an example of moving blocks to the CPU:

device_map1["lm_head"] = "cpu"
for key in device_map1.keys():
    # Adjust the prefix to match the module names printed in device_map1
    # (e.g. LLaMA-style models such as DeepSeek use "model.layers." rather
    # than "transformer.h.").
    if key.startswith("transformer.h."):
        # This offloads every matching block; offload only a subset
        # (e.g. the higher-numbered layers) if you want more on the GPU.
        device_map1[key] = "cpu"
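A quick sanity check before loading (assuming device_map1 is the dict produced above) is to count how many modules ended up on each device:

from collections import Counter

# Tally module placements after the manual edits, e.g. Counter({0: 25, 'cpu': 8}).
print(Counter(device_map1.values()))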

Then initialize your model with llm_int8_enable_fp32_cpu_offload=True set in the quantization config and the adjusted device_map:

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_enable_fp32_cpu_offload=True,  # keep offloaded modules in fp32 on the CPU
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map=device_map1,
    trust_remote_code=True,
    **kwargs
)
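Once it loads, a minimal smoke test (assuming the matching tokenizer; generation will be noticeably slower for whatever ended up on the CPU):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)

# Offloaded layers run on the CPU, so expect this to take a while.
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))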
