    ggml: Enable op_offload to improve partial offload performance · afaf7ce8
    Jesse Gross authored
    When a model is partially offloaded to system RAM, we can either
    do the calculations on the CPU or we can temporarily transfer the
    data to the GPU to do the calculations there. Small batches tend
    to be better on the CPU, large batches on the GPU.
    
    The llamarunner used the GPU in most cases and the ollamarunner
    used the CPU. Although the ollamarunner saw an improvement in
    token generation performance, there was a large performance hit
    in prompt processing (3-10x).
    
    There is an existing heuristic to dynamically switch between these
    two modes, but in practice it doesn't have enough information to
    make that decision accurately. This change supplies the check with
    authoritative data so that it works correctly, getting the best of
    both worlds.
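    The decision described above can be sketched as a per-operation
    dispatch keyed on batch size. This is an illustrative sketch, not
    the actual ggml code: the names `useGPUForOp` and `batchThreshold`
    and the threshold value are hypothetical, and the real heuristic
    relies on data reported by the backend rather than a fixed cutoff.

    ```go
    package main

    import "fmt"

    // batchThreshold is a hypothetical cutoff: below it, small batches
    // (e.g. token generation) tend to run faster on the CPU; at or above
    // it, the cost of transferring weights to the GPU pays off (e.g.
    // prompt processing). The real heuristic is informed by backend data.
    const batchThreshold = 32

    // useGPUForOp is an illustrative stand-in for the kind of check the
    // commit describes: when op_offload is enabled, temporarily move the
    // computation to the GPU only for sufficiently large batches.
    func useGPUForOp(batchSize int, opOffloadEnabled bool) bool {
        return opOffloadEnabled && batchSize >= batchThreshold
    }

    func main() {
        fmt.Println(useGPUForOp(1, true))    // token generation: stay on CPU
        fmt.Println(useGPUForOp(512, true))  // prompt processing: offload to GPU
        fmt.Println(useGPUForOp(512, false)) // offload disabled: stay on CPU
    }
    ```
    
    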
    
    Fixes #12037