    kvcache: Use Cast instead of Copy for flash attention masks · 05ccb17c
    Jesse Gross authored
    Flash attention kernels require the mask of the KV cache to be an F16
    rather than an F32. We can use the GGML operation ggml_cast to do
    this rather than doing it ourselves, which allows reuse of a
    preallocated buffer in the graph rather than allocating a new one
    for each batch. This improves token generation performance with
    flash attention by 10-30% (with gpt-oss). This also makes performance
    with flash attention better than without it, as expected.
causal.go 19.6 KB
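
A minimal sketch of the change described in the commit message, assuming a
hypothetical Go graph-building API (the Context, Tensor, DType names and their
Cast/Copy/Empty/Forward methods here are illustrative stand-ins, not Ollama's
actual ml interfaces or the code in causal.go):

    package kvcache

    // Hypothetical graph-building types, for illustration only.
    type DType int

    const (
        DTypeF32 DType = iota
        DTypeF16
    )

    type Tensor interface {
        // Cast adds a type-conversion node to the graph (ggml_cast), whose
        // output can live in a buffer preallocated once for the graph.
        Cast(ctx Context, dtype DType) Tensor
        // Copy copies this tensor into dst, which the caller must allocate.
        Copy(ctx Context, dst Tensor) Tensor
        Shape() []int
    }

    type Context interface {
        // Empty allocates a new, uninitialized tensor on every call.
        Empty(dtype DType, shape ...int) Tensor
        // Forward schedules a node for execution in the graph.
        Forward(t Tensor) Context
    }

    // maskViaCopy is the old approach: allocate a fresh F16 tensor for each
    // batch and copy the F32 mask into it.
    func maskViaCopy(ctx Context, mask Tensor) Tensor {
        out := ctx.Empty(DTypeF16, mask.Shape()...)
        ctx.Forward(mask.Copy(ctx, out))
        return out
    }

    // maskViaCast is the new approach: let the graph convert the mask with a
    // cast node, reusing a preallocated graph buffer instead of allocating a
    // new tensor per batch.
    func maskViaCast(ctx Context, mask Tensor) Tensor {
        return mask.Cast(ctx, DTypeF16)
    }

The design point is that the per-batch allocation in the copy path, not the
conversion itself, is what the cast node avoids, which is consistent with the
reported 10-30% token generation speedup with flash attention.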