    ggml: Support heterogeneous KV cache layer sizes in memory estimation · f66216e3
    Jesse Gross authored
    Gemma3 uses sliding windows for its context on 5 of every 6 layers,
    significantly reducing memory usage but leading to uneven usage
    across layers, which makes allocation to the correct GPU difficult.
    We currently estimate very conservatively by assuming all layers
    are consistent at the max size.
    
    Llama3.2-vision is also inconsistent between self attention and cross
    attention layers - at the moment, we calculate the correct total size
    and then average it across layers. In some cases, this may lead
    to crashes if a large layer is placed on a GPU sized for the average.
    
    This allows memory estimation to calculate per-layer KV cache size
    and take this into account when placing layers onto GPUs. We already
    do this for weights that vary per-tensor, so this is a logical
    extension.
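    A minimal sketch of the idea, with hypothetical names and shapes
    (not Ollama's actual API): per-layer KV cache bytes depend on the
    effective context length of each layer, so a Gemma3-like pattern
    where 5 of every 6 layers use a sliding window yields a much
    smaller sum than assuming every layer is at the max size.

    ```go
    package main

    import "fmt"

    // kvCacheBytes is an illustrative per-layer KV cache size:
    // K and V each store length x headDim x kvHeads elements.
    // Sliding-window layers only cache up to the window length.
    func kvCacheBytes(ctx, window, headDim, kvHeads, bytesPerElem int, sliding bool) uint64 {
    	length := ctx
    	if sliding && window < ctx {
    		length = window
    	}
    	return 2 * uint64(length) * uint64(headDim) * uint64(kvHeads) * uint64(bytesPerElem)
    }

    func main() {
    	const (
    		numLayers = 12   // hypothetical layer count
    		ctx       = 8192 // full context length
    		window    = 1024 // sliding window length
    		headDim   = 256
    		kvHeads   = 4
    		bytesF16  = 2 // fp16 cache
    	)
    	var perLayerAware, allAtMax uint64
    	for i := 0; i < numLayers; i++ {
    		sliding := i%6 != 5 // 5 of every 6 layers use the window
    		perLayerAware += kvCacheBytes(ctx, window, headDim, kvHeads, bytesF16, sliding)
    		allAtMax += kvCacheBytes(ctx, window, headDim, kvHeads, bytesF16, false)
    	}
    	fmt.Printf("per-layer aware total: %d MiB\n", perLayerAware/(1<<20))
    	fmt.Printf("all-layers-at-max estimate: %d MiB\n", allAtMax/(1<<20))
    }
    ```

    With these assumed parameters the per-layer-aware sum is a fraction
    of the all-at-max estimate, which is exactly the headroom this
    change lets the GPU placement logic use.
    
    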
    
    Fixes #9730
    Fixes #9890