I was trying to run deepseek-r1-distill-llama70b-bf16.gguf (131 GB on disk) on two A6000 GPUs (48 GB VRAM each) with llama.cpp. It runs with partial GPU offload, but GPU utilization sits at 9-10% and is bursty.

I understand that llama.cpp is not optimized for multi-GPU setups. Other engines that are optimized, such as vLLM and TensorRT-LLM, generally support quantization only down to 4 bits. Now, if I want to run deepseek-r1 so that it fits into these two GPUs, it has to be extremely quantized (say, IQ1_S). That is only possible with GGUF models, which vLLM etc. don't support, but with llama.cpp the GPUs won't be fully utilized. How can I run extremely quantized models (sub-4-bit, which only exist in GGUF format) on an NVIDIA multi-GPU setup?
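In case a concrete invocation helps, here is a minimal sketch of the multi-GPU setup I have in mind, using llama-cpp-python (the Python bindings for llama.cpp). The IQ1_S file name is hypothetical, and the sketch assumes the quantized model fits entirely in the combined 96 GB of VRAM so that no layers are left on the CPU:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-r1-distill-llama-70b-IQ1_S.gguf",  # hypothetical file name
    n_gpu_layers=-1,          # offload every layer to the GPUs (no CPU fallback)
    tensor_split=[0.5, 0.5],  # split the weights roughly evenly across the two A6000s
    n_ctx=4096,               # keep the context modest so the KV cache also stays in VRAM
)

output = llm("Why is the sky blue?", max_tokens=128)
print(output["choices"][0]["text"])
```

As I understand it, llama.cpp's default layer split runs the GPUs one after another, which is part of why per-GPU utilization looks low even when everything is offloaded; the `split_mode` parameter can select row splitting instead, which keeps both GPUs busier at the cost of more inter-GPU traffic.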