    ml: Enable support for flash attention · 21aa666a
    Jesse Gross authored
    The GGML flash attention kernel has specific requirements for
    padding and permutation. This adds support in the KV cache for
    conforming to those requirements so that flash attention can be
    enabled.
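
    A minimal sketch of the padding side, assuming the flash attention
    kernel wants the cache length rounded up to a fixed granularity
    (llama.cpp historically pads to 256 cells with flash attention
    enabled, 32 otherwise). The helper name and the constant are
    illustrative, not the actual backend.go code:

        // padCacheLength rounds a KV cache length up to the kernel's
        // padding granularity so the flash attention kernel never sees
        // a ragged sequence dimension.
        func padCacheLength(length int) int {
            const granularity = 256 // assumed FA padding requirement
            return (length + granularity - 1) / granularity * granularity
        }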
    
    Flash attention can be used in the same situations as with the
    llama engine and is enabled by the user in the same way.
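    In practice that means setting the OLLAMA_FLASH_ATTENTION
    environment variable that the llama engine already honors, e.g.:

        OLLAMA_FLASH_ATTENTION=1 ollama serve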