    kvcache: Use SetRows to store cache data · 53985b3c
    Jesse Gross authored
    We currently copy data into the KV cache in contiguous buffers using
    ggml_cpy(). ggml_set_rows() was introduced to allow scatter operations
    so that contiguous buffers are no longer required. The primary direct
    benefit is that we no longer need to perform defragmentation.
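
    As a rough illustration of the difference, the sketch below contrasts a
    contiguous store with a row scatter on a toy cache. It is plain Go and
    does not call GGML; the cache type, copyContiguous, and setRows are
    hypothetical names used only for this example, and rowLen is assumed
    to be positive.

```go
package main

import "fmt"

// cache is a toy stand-in for a KV cache (hypothetical, not the real type):
// fixed-width rows stored back to back in one flat buffer.
type cache struct {
	rowLen int
	data   []float32
}

func newCache(rows, rowLen int) *cache {
	return &cache{rowLen: rowLen, data: make([]float32, rows*rowLen)}
}

// copyContiguous mimics a copy-style store: every source row lands in one
// contiguous run of cache rows beginning at row start.
func (c *cache) copyContiguous(start int, src []float32) error {
	off := start * c.rowLen
	if start < 0 || len(src)%c.rowLen != 0 || off+len(src) > len(c.data) {
		return fmt.Errorf("no contiguous room for %d values at row %d", len(src), start)
	}
	copy(c.data[off:], src)
	return nil
}

// setRows mimics a scatter store: row i of src is written to cache row
// idx[i], so the destination rows do not have to be adjacent.
func (c *cache) setRows(idx []int, src []float32) error {
	if len(src) != len(idx)*c.rowLen {
		return fmt.Errorf("expected %d values, got %d", len(idx)*c.rowLen, len(src))
	}
	rows := len(c.data) / c.rowLen
	for _, r := range idx {
		if r < 0 || r >= rows {
			return fmt.Errorf("row index %d out of range", r)
		}
	}
	for i, r := range idx {
		copy(c.data[r*c.rowLen:(r+1)*c.rowLen], src[i*c.rowLen:(i+1)*c.rowLen])
	}
	return nil
}

func main() {
	c := newCache(6, 2)
	// Suppose only rows 1, 4, and 5 are free. A contiguous store of three
	// rows would need defragmentation first; a scatter fills them directly.
	src := []float32{1, 1, 2, 2, 3, 3}
	if err := c.setRows([]int{1, 4, 5}, src); err != nil {
		panic(err)
	}
	fmt.Println(c.data)
}
```

    In this toy model, copyContiguous fails whenever the free rows are
    fragmented, while setRows only needs a list of free row indices.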
    
    However, GGML recently removed an optimization for ggml_cpy(), and we
    picked up that removal in 544b6739 "ggml update to b6840 (#12791)".
    This caused a roughly 40% drop in token generation performance on
    CUDA because CUDA graphs were no longer being used. With the switch
    to ggml_set_rows(), the original optimization is no longer necessary
    and CUDA performance is restored.
    
    Fixes #13112
backend.go 10.6 KB