I was trying to run deepseek-r1-distill-llama70b-bf16.gguf (131 GB on disk) on two A6000 GPUs (48 GB of VRAM each) with llama.cpp. Since the 131 GB file cannot fit in the 96 GB of combined VRAM, it only runs with partial GPU offload, and GPU utilization is bursty, sitting at around 9-10%.
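For reference, the run looks roughly like this through the llama-cpp-python bindings (a minimal sketch of my setup; the layer count, split ratio, and context size are placeholders, not my exact settings):

```python
from llama_cpp import Llama

# Partial offload: the 131 GB BF16 GGUF cannot fit in 2x48 GB of VRAM,
# so only some layers go to the GPUs and the rest run on the CPU.
llm = Llama(
    model_path="deepseek-r1-distill-llama70b-bf16.gguf",
    n_gpu_layers=40,           # placeholder: however many layers actually fit
    tensor_split=[0.5, 0.5],   # spread the offloaded layers across both A6000s
    n_ctx=4096,
)

print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```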

I understand that llama.cpp is not optimized for multi-GPU setups. Other engines that are optimized for it, such as vLLM and TensorRT, only support quantization down to 4 bits. Now, if I want to run DeepSeek-R1, it has to be quantized extremely aggressively (say to IQ1_S) to fit onto these two GPUs. That is only possible with GGUF models, which vLLM etc. don't support, but with llama.cpp the GPUs won't be fully utilized. How can I run extremely quantized models (sub-4-bit, which only exist in GGUF format) on an NVIDIA multi-GPU setup?
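For concreteness, this is roughly the end state I have in mind: a sub-4-bit GGUF fully offloaded and split across both GPUs so that they are actually kept busy. The sketch below uses llama-cpp-python again; the IQ1_S filename is hypothetical, and it assumes a recent build that exposes split_mode and the LLAMA_SPLIT_MODE_ROW constant.

```python
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-IQ1_S.gguf",        # hypothetical sub-4-bit quant file
    n_gpu_layers=-1,                            # offload every layer to the GPUs
    tensor_split=[0.5, 0.5],                    # split evenly across the two A6000s
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_ROW,  # row split instead of per-layer split
    n_ctx=4096,
)
```

Row split is just one knob I've seen mentioned for keeping both GPUs busy; I don't know whether it, or a different engine entirely, is the right answer here.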
