    llm: Estimate projector memory correctly for Ollama engine · d7555774
    Jesse Gross authored
    The Llama engine always places the vision projector on the first GPU,
    if one exists. The Ollama engine, however, groups the projector with
    the output layer, which means it is only offloaded once all other
    layers are. The memory estimation code always assumed the former
    layout; this change makes it use the correct layout for the engine
    in use.
    
    This fixes two problems caused by the mismatch:
     - In multi-GPU setups, we can crash with OOM errors when we try to
       allocate memory on a full GPU while another still has space.
     - If the vision projector is large, it may prevent us from offloading
       anything when we could have fit some of the text layers.
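    The layout difference can be sketched as follows. This is an
    illustrative model, not the actual code in memory.go: the names
    (Engine, projectorGPU, and the parameters) are hypothetical, and
    the real estimator tracks sizes per layer rather than returning a
    single GPU index.

    ```go
    package main

    import "fmt"

    // Engine identifies which inference engine's placement rules apply.
    // These constants are illustrative, not Ollama's actual API.
    type Engine int

    const (
    	EngineLlama Engine = iota
    	EngineOllama
    )

    // projectorGPU returns the GPU index whose memory estimate should
    // include the vision projector, or -1 if the projector stays on CPU.
    //   - Llama engine: always the first GPU (index 0), if any exists.
    //   - Ollama engine: grouped with the output layer, so it is only
    //     offloaded when every other layer is; in that case it lands on
    //     the last GPU in the split.
    func projectorGPU(engine Engine, numGPUs, layersOffloaded, totalLayers int) int {
    	if numGPUs == 0 {
    		return -1
    	}
    	switch engine {
    	case EngineLlama:
    		return 0
    	case EngineOllama:
    		if layersOffloaded == totalLayers {
    			return numGPUs - 1 // lives with the output layer
    		}
    		return -1 // partial offload: projector stays on CPU
    	}
    	return -1
    }

    func main() {
    	// Llama engine: projector always lands on GPU 0.
    	fmt.Println(projectorGPU(EngineLlama, 2, 10, 32)) // 0
    	// Ollama engine, partial offload: projector stays on CPU.
    	fmt.Println(projectorGPU(EngineOllama, 2, 10, 32)) // -1
    	// Ollama engine, full offload: grouped with the output layer.
    	fmt.Println(projectorGPU(EngineOllama, 2, 32, 32)) // 1
    }
    ```

    Under the old assumption, the estimator charged the projector to GPU 0
    in both engines, which is what produced the OOM and under-offloading
    behaviors listed above.
    
    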
memory.go 12.4 KB