ml/backend.go · 53985b3c4d94f22517e4090696a5b8ecd06caedb · OpenDAS / ollama

kvcache: Use SetRows to store cache data · 53985b3c

Jesse Gross authored Aug 18, 2025

We currently copy data into the KV cache in contiguous buffers using
ggml_cpy(). ggml_set_rows() was introduced to allow scatter operations,
so contiguous buffers are no longer required. The primary direct
benefit is that we no longer need to perform defragmentation.
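
The difference between a contiguous copy and a scatter store can be sketched in plain Go. This is only an illustration of the semantics described above, not the ggml or ollama implementation: setRows, copyContiguous, and the slice-of-slices cache layout are hypothetical names and shapes chosen for the example.

```go
package main

import "fmt"

// copyContiguous is a hypothetical model of the older approach: all new
// rows must fit in one unbroken run of cache rows beginning at start.
func copyContiguous(dst, src [][]float32, start int) error {
	if start < 0 || start+len(src) > len(dst) {
		return fmt.Errorf("no contiguous run of %d rows at %d", len(src), start)
	}
	for i := range src {
		copy(dst[start+i], src[i])
	}
	return nil
}

// setRows is a hypothetical model of a scatter store: row i of src is
// written to row idx[i] of dst, so the target rows need not be adjacent.
func setRows(dst, src [][]float32, idx []int) error {
	if len(src) != len(idx) {
		return fmt.Errorf("got %d rows but %d indices", len(src), len(idx))
	}
	for i, row := range idx {
		if row < 0 || row >= len(dst) {
			return fmt.Errorf("row index %d out of range [0, %d)", row, len(dst))
		}
		copy(dst[row], src[i])
	}
	return nil
}

func main() {
	// A toy cache with 6 rows of 2 values each.
	cache := make([][]float32, 6)
	for i := range cache {
		cache[i] = make([]float32, 2)
	}
	newRows := [][]float32{{1, 1}, {2, 2}, {3, 3}}

	// Assume only rows 0, 3, and 5 are free. A contiguous copy would need
	// the cache to be defragmented first; a scatter store does not.
	if err := setRows(cache, newRows, []int{0, 3, 5}); err != nil {
		panic(err)
	}
	fmt.Println(cache)
}
```

Because the destination rows are chosen per entry, fragmented free slots can be filled directly, which is why a defragmentation pass becomes unnecessary in this model.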

However, GGML recently removed an optimization for ggml_cpy(), and
we picked up that change in 544b6739 "ggml update to b6840 (#12791)".
This caused a roughly 40% drop in token generation performance on CUDA
because CUDA graphs were no longer being used. With the switch to
ggml_set_rows(), the original optimization is no longer necessary
and CUDA performance is restored.

Fixes #13112


backend.go 10.6 KB
