I was trying to run deepseek-r1-distill-llama70b-bf16.gguf (131 GB on disk) on two A6000 GPUs (48 GB of VRAM each) with llama.cpp. Since the 131 GB file cannot fit in the 96 GB of combined VRAM, it only runs with partial GPU offload, and GPU utilization is bursty, sitting at around 9-10%.
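For reference, the run looks roughly like this through the llama-cpp-python bindings (a minimal sketch of my setup; the layer count, split ratio, and context size are placeholders, not my exact settings):

```python
from llama_cpp import Llama

# Partial offload: the 131 GB BF16 GGUF cannot fit in 2x48 GB of VRAM,
# so only some layers go to the GPUs and the rest run on the CPU.
llm = Llama(
    model_path="deepseek-r1-distill-llama70b-bf16.gguf",
    n_gpu_layers=40,           # placeholder: however many layers actually fit
    tensor_split=[0.5, 0.5],   # spread the offloaded layers across both A6000s
    n_ctx=4096,
)

print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```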

I understand that llama.cpp is not optimized for multi-GPU setups. Other engines that are optimized for it, such as vLLM and TensorRT, only support quantization down to 4 bits. Now, if I want to run DeepSeek-R1, it has to be quantized extremely aggressively (say to IQ1_S) to fit onto these two GPUs. That is only possible with GGUF models, which vLLM etc. don't support, but with llama.cpp the GPUs won't be fully utilized. How can I run extremely quantized models (sub-4-bit, which only exist in GGUF format) on an NVIDIA multi-GPU setup?
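For concreteness, this is roughly the end state I have in mind: a sub-4-bit GGUF fully offloaded and split across both GPUs so that they are actually kept busy. The sketch below uses llama-cpp-python again; the IQ1_S filename is hypothetical, and it assumes a recent build that exposes split_mode and the LLAMA_SPLIT_MODE_ROW constant.

```python
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-IQ1_S.gguf",        # hypothetical sub-4-bit quant file
    n_gpu_layers=-1,                            # offload every layer to the GPUs
    tensor_split=[0.5, 0.5],                    # split evenly across the two A6000s
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_ROW,  # row split instead of per-layer split
    n_ctx=4096,
)
```

Row split is just one knob I've seen mentioned for keeping both GPUs busy; I don't know whether it, or a different engine entirely, is the right answer here.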
