ggml: Disable flash attention for gemma2

Our new engine implementation of gemma2 doesn't support flash attention, and therefore doesn't support KV cache quantization either. It is currently possible to enable both, which results in a crash.
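The dependency described above (no flash attention implies no KV cache quantization) can be sketched as a small guard. This is only an illustration under stated assumptions: `effectiveKVCacheType` is a hypothetical helper that does not appear in this commit, and the cache type names `f16` and `q8_0` as well as the fallback to `f16` are assumptions, not behavior confirmed by the diff.

```go
package main

import "fmt"

// effectiveKVCacheType is a hypothetical helper (not part of this commit).
// It returns the KV cache type that can safely be used, assuming that any
// quantized type requires flash attention and that "f16" is the
// unquantized default.
func effectiveKVCacheType(requested string, flashAttention bool) string {
	// An empty or f16 request needs no quantization support.
	if requested == "" || requested == "f16" {
		return "f16"
	}
	// Without flash attention, fall back instead of crashing later.
	if !flashAttention {
		return "f16"
	}
	return requested
}

func main() {
	fmt.Println(effectiveKVCacheType("q8_0", true))  // q8_0
	fmt.Println(effectiveKVCacheType("q8_0", false)) // f16
}
```

With a guard of this shape, reporting flash attention as unsupported for an architecture is enough to keep a quantized cache from being selected for it.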

Commit 29ddfc2c authored Sep 09, 2025 by Jesse Gross Committed by Jesse Gross Sep 10, 2025

Showing 1 changed file with 4 additions and 0 deletions

fs/ggml/ggml.go +4 -0

--- a/fs/ggml/ggml.go
+++ b/fs/ggml/ggml.go
@@ -883,6 +883,10 @@ func (f GGML) SupportsFlashAttention() bool {
 		return false
 	}
+	if arch := f.KV().Architecture(); slices.Contains([]string{"gemma2"}, arch) {
+		return false
+	}
 	// Check head counts match and are non-zero
 	headCountK := f.KV().EmbeddingHeadCountK()
 	headCountV := f.KV().EmbeddingHeadCountV()
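
The check added in this hunk can be exercised in isolation with a minimal, self-contained sketch. The names `noFlashAttention` and `supportsFlashAttention`, the `uint64` head-count types, and the `"llama"` example value are assumptions for illustration; only the `"gemma2"` entry, the use of `slices.Contains`, and the "head counts match and are non-zero" rule come from the diff above.

```go
package main

import (
	"fmt"
	"slices"
)

// noFlashAttention lists architectures for which flash attention is
// reported as unsupported. Hypothetical name; only "gemma2" is taken
// from the diff.
var noFlashAttention = []string{"gemma2"}

// supportsFlashAttention is a simplified stand-in for the method in the
// diff: it rejects listed architectures first, then requires the K and V
// head counts to match and be non-zero.
func supportsFlashAttention(arch string, headCountK, headCountV uint64) bool {
	if slices.Contains(noFlashAttention, arch) {
		return false
	}
	return headCountK != 0 && headCountV != 0 && headCountK == headCountV
}

func main() {
	fmt.Println(supportsFlashAttention("gemma2", 128, 128)) // false
	fmt.Println(supportsFlashAttention("llama", 128, 128))  // true
	fmt.Println(supportsFlashAttention("llama", 128, 0))    // false
}
```

Keeping the architectures in a slice, as the diff does, means that excluding another architecture later only requires adding a string to the list.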