    llm: Estimate projector memory correctly for Ollama engine · d7555774
    Jesse Gross authored
    The Llama engine always places the vision projector on the first GPU,
    if one exists. The Ollama engine, however, groups the projector with
    the output layer, which means it is only offloaded once all other
    layers are. The memory estimation code always assumed the former
    layout; this change makes it use the correct layout for the engine
    in use.
    
    This fixes two problems caused by the mismatch:
     - In multi-GPU setups, we can crash with OOM errors when we try to
       allocate memory on a full GPU while another still has space.
     - If the vision projector is large, it may prevent us from offloading
       anything when we could have fit some of the text layers.
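    The layout difference can be sketched as follows. This is an
    illustrative model, not the actual code in memory.go: the names
    (Engine, projectorGPU, and the parameters) are hypothetical, and
    the real estimator tracks sizes per layer rather than returning a
    single GPU index.

    ```go
    package main

    import "fmt"

    // Engine identifies which inference engine's placement rules apply.
    // These constants are illustrative, not Ollama's actual API.
    type Engine int

    const (
    	EngineLlama Engine = iota
    	EngineOllama
    )

    // projectorGPU returns the GPU index whose memory estimate should
    // include the vision projector, or -1 if the projector stays on CPU.
    //   - Llama engine: always the first GPU (index 0), if any exists.
    //   - Ollama engine: grouped with the output layer, so it is only
    //     offloaded when every other layer is; in that case it lands on
    //     the last GPU in the split.
    func projectorGPU(engine Engine, numGPUs, layersOffloaded, totalLayers int) int {
    	if numGPUs == 0 {
    		return -1
    	}
    	switch engine {
    	case EngineLlama:
    		return 0
    	case EngineOllama:
    		if layersOffloaded == totalLayers {
    			return numGPUs - 1 // lives with the output layer
    		}
    		return -1 // partial offload: projector stays on CPU
    	}
    	return -1
    }

    func main() {
    	// Llama engine: projector always lands on GPU 0.
    	fmt.Println(projectorGPU(EngineLlama, 2, 10, 32)) // 0
    	// Ollama engine, partial offload: projector stays on CPU.
    	fmt.Println(projectorGPU(EngineOllama, 2, 10, 32)) // -1
    	// Ollama engine, full offload: grouped with the output layer.
    	fmt.Println(projectorGPU(EngineOllama, 2, 32, 32)) // 1
    }
    ```

    Under the old assumption, the estimator charged the projector to GPU 0
    in both engines, which is what produced the OOM and under-offloading
    behaviors listed above.
    
    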
memory.go 12.4 KB