On an Nvidia 4070 Super with 12GB VRAM, loading the DeepSeek "deepseek-ai/deepseek-llm-7b-chat" model constantly throws an error, even when the GPU VRAM budget is set as low as 4GB in max_memory:
Error loading quantized model: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
Here is the relevant portion of the code:
import torch
from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig
from accelerate import init_empty_weights, infer_auto_device_map

model_name: str = "deepseek-ai/deepseek-llm-7b-chat"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # You can try load_in_8bit=True if needed
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"                 # Use NormalFloat4 quantization
)

# Step 2: Load config only (avoids full model load)
config = AutoConfig.from_pretrained(model_name)  # , low_cpu_mem_usage=True
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Step 3: Infer the device map without fully loading the model
device_map1 = infer_auto_device_map(
    model,
    max_memory={0: "4GB", "cpu": "32GB"}      # Adjust based on available resources
)
device_map_update = device_map1
print(f"Custom device map1 in 4bit: {device_map1}")

torch.cuda.empty_cache()
model = AutoModelForCausalLM.from_pretrained(    # <<== loading error occurs here
    model_name,
    # config=config,
    quantization_config=quantization_config,
    trust_remote_code=True,
    device_map=device_map1,                   # Let Transformers manage GPU allocation
    **kwargs                                  # extra kwargs from the enclosing function
)
Here is the output of nvidia-smi; the occupied GPU VRAM was still over 10GB:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03 Driver Version: 550.144.03 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 ... Off | 00000000:01:00.0 Off | N/A |
| 0% 37C P8 4W / 220W | 10467MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 54363 C ...aiworker9/code/py/myenv/bin/python3 10460MiB |
+-----------------------------------------------------------------------------------------+
The code seems OK to me, but it constantly throws a CUDA OOM error. device_map="cpu" works fine. What is missing here for CUDA?
1 Answer
You can't force the entire 7B model (even 4-bit quantized) into just 4GB of GPU memory with the default auto device mapping. Some parts of the model will need to be offloaded to the CPU in 32-bit, so you need to (1) enable offloading with llm_int8_enable_fp32_cpu_offload=True and (2) provide or adjust a custom device map that tells PyTorch which layers go to the CPU (or GPU). Simply setting max_memory={0: "4GB", "cpu": "32GB"} does not guarantee a successful fit if the library determines it can't keep all quantized modules within 4GB; the max_memory parameter is a guideline, not a hard limit.
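For reference, a custom device_map is just a dictionary mapping module names to devices. A hypothetical sketch for a LLaMA-style model follows; the module names and the 30-layer split are illustrative, not taken from the question:

custom_device_map = {
    # Hypothetical split: embeddings and the first half of the blocks on GPU 0,
    # the remaining blocks plus the head on the CPU.
    "model.embed_tokens": 0,
    **{f"model.layers.{i}": 0 for i in range(0, 15)},
    **{f"model.layers.{i}": "cpu" for i in range(15, 30)},
    "model.norm": "cpu",
    "lm_head": "cpu",
}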
First, try increasing max_memory and see if the loader can sort itself out; I would set it at or just below your total available VRAM.
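For example, with the 12GB card you could re-run the map inference with a budget just under the card's capacity. A sketch reusing the meta model and infer_auto_device_map from the question; the exact figures are illustrative:

device_map1 = infer_auto_device_map(
    model,
    # Give the GPU almost all of its 12GB instead of an artificial 4GB cap,
    # leaving headroom for activations and the CUDA context.
    max_memory={0: "11GiB", "cpu": "32GiB"}
)
print(device_map1)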
Failing that, you will need to inspect the auto-generated device map (just print(device_map1)), then adjust which layers get offloaded to the CPU. Ideally you want to find a balance where as many layers as possible fit on the GPU without OOM errors. The device map explicitly tells the loader which layers you want on the CPU. Here is an example of moving blocks to the CPU:
device_map1["lm_head"] = "cpu"
for key in device_map1.keys():
    # DeepSeek-LLM is LLaMA-style, so its transformer blocks are typically named
    # "model.layers.<N>" (GPT-style models use "transformer.h.<N>" instead).
    if key.startswith("model.layers."):
        # Offload half or more of the layers; the cutoff index is illustrative.
        if int(key.split(".")[2]) >= 15:
            device_map1[key] = "cpu"
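Before reloading, a quick sanity check of the adjusted map can help. A minimal sketch that just tallies how many modules the map assigns to each device:

from collections import Counter

# Count modules per device (0 = GPU 0, "cpu" = offloaded to system RAM).
print(Counter(device_map1.values()))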
Then initialize your model with the CPU-offload flag enabled and your new device map. Note that llm_int8_enable_fp32_cpu_offload is a BitsAndBytesConfig option, so set it on the quantization config rather than passing it straight to from_pretrained:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    llm_int8_enable_fp32_cpu_offload=True    # keep offloaded modules in fp32 on the CPU
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map=device_map1,
    trust_remote_code=True,
    **kwargs
)
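After loading, you can verify where each module actually landed; accelerate records the final placement on the model:

# Inspect the placement accelerate actually used for each module.
print(model.hf_device_map)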