    ggml: Disable flash attention for gemma2 · 29ddfc2c
    Jesse Gross authored
    Our new engine implementation of gemma2 doesn't support flash
    attention, which means that it also doesn't support KV cache
    quantization. Currently, both features can still be turned on
    for gemma2, which results in a crash.
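
The dependency described above (no flash attention implies no KV cache quantization) can be illustrated with a minimal Go sketch. This is not the code from this commit: the `settings` type, the `supportsFlashAttention` and `resolve` helpers, and the `"f16"` fallback value are all hypothetical names and assumptions chosen for illustration.

```go
package main

import "fmt"

// settings holds the options a user requested.
// Hypothetical type for illustration; not taken from ggml.go.
type settings struct {
	Architecture   string
	FlashAttention bool
	KVCacheType    string // e.g. "f16" (unquantized, assumed default) or "q8_0"
}

// supportsFlashAttention reports whether an architecture can use flash
// attention. In this sketch only gemma2 is excluded, matching the
// commit message; the real check may look different.
func supportsFlashAttention(arch string) bool {
	return arch != "gemma2"
}

// resolve returns settings that are safe to apply. Flash attention is
// switched off for unsupported architectures, and because KV cache
// quantization depends on flash attention, the cache type falls back
// to the assumed unquantized default whenever flash attention is off.
func resolve(req settings) settings {
	out := req
	if !supportsFlashAttention(req.Architecture) {
		out.FlashAttention = false
	}
	if !out.FlashAttention {
		out.KVCacheType = "f16"
	}
	return out
}

func main() {
	req := settings{Architecture: "gemma2", FlashAttention: true, KVCacheType: "q8_0"}
	fmt.Printf("requested: %+v\n", req)
	fmt.Printf("resolved:  %+v\n", resolve(req))
}
```

Resolving the options in one place, rather than rejecting the request, means a user who has both features enabled globally can still load gemma2 without a crash.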
ggml.go 24.4 KB