    ggml: Support heterogeneous KV cache layer sizes in memory estimation · f66216e3
    Jesse Gross authored
    Gemma3 uses sliding windows for its context on 5 of every 6 layers,
    significantly reducing memory usage but leading to uneven usage
    across layers, which makes allocation to the correct GPU difficult.
    We currently estimate very conservatively by assuming all layers
    are consistent at the max size.
    
    Llama3.2-vision is also inconsistent between self attention and cross
    attention layers - at the moment, we calculate the correct total size
    and then average it across layers. In some cases, this may lead
    to crashes if a large layer is placed on a GPU sized for the average.
    
    This allows memory estimation to calculate per-layer KV cache size
    and take this into account when placing layers onto GPUs. We already
    do this for weights that vary per-tensor, so this is a logical
    extension.
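    A minimal sketch of the idea, with hypothetical names and shapes
    (not Ollama's actual API): per-layer KV cache bytes depend on the
    effective context length of each layer, so a Gemma3-like pattern
    where 5 of every 6 layers use a sliding window yields a much
    smaller sum than assuming every layer is at the max size.

    ```go
    package main

    import "fmt"

    // kvCacheBytes is an illustrative per-layer KV cache size:
    // K and V each store length x headDim x kvHeads elements.
    // Sliding-window layers only cache up to the window length.
    func kvCacheBytes(ctx, window, headDim, kvHeads, bytesPerElem int, sliding bool) uint64 {
    	length := ctx
    	if sliding && window < ctx {
    		length = window
    	}
    	return 2 * uint64(length) * uint64(headDim) * uint64(kvHeads) * uint64(bytesPerElem)
    }

    func main() {
    	const (
    		numLayers = 12   // hypothetical layer count
    		ctx       = 8192 // full context length
    		window    = 1024 // sliding window length
    		headDim   = 256
    		kvHeads   = 4
    		bytesF16  = 2 // fp16 cache
    	)
    	var perLayerAware, allAtMax uint64
    	for i := 0; i < numLayers; i++ {
    		sliding := i%6 != 5 // 5 of every 6 layers use the window
    		perLayerAware += kvCacheBytes(ctx, window, headDim, kvHeads, bytesF16, sliding)
    		allAtMax += kvCacheBytes(ctx, window, headDim, kvHeads, bytesF16, false)
    	}
    	fmt.Printf("per-layer aware total: %d MiB\n", perLayerAware/(1<<20))
    	fmt.Printf("all-layers-at-max estimate: %d MiB\n", allAtMax/(1<<20))
    }
    ```

    With these assumed parameters the per-layer-aware sum is a fraction
    of the all-at-max estimate, which is exactly the headroom this
    change lets the GPU placement logic use.
    
    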
    
    Fixes #9730
    Fixes #9890