Commit 8253ad4d authored by Jesse Gross, committed by Jesse Gross

ggml: Prevent kv cache quantization on gpt-oss

KV cache quantization has a dependency on the flash attention kernel.
We currently cannot use flash attention with gpt-oss as it requires
additional operations.

The model definition does not call flash attention, so the model works
regardless of the setting, but the cache would still pick up the
quantization type. This updates the flash attention setting earlier
in the loading flow so that all downstream settings are also set correctly.

Fixes: #11671
parent fa7776fd
@@ -761,6 +761,10 @@ func (f GGML) SupportsFlashAttention() bool {
 		return false
 	}
+	if f.KV().Architecture() == "gptoss" {
+		return false
+	}
 	// Check head counts match and are non-zero
 	headCountK := f.KV().EmbeddingHeadCountK()
 	headCountV := f.KV().EmbeddingHeadCountV()
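For context, here is a minimal sketch, not Ollama's actual loading code, of how a loader might consume this check when choosing the KV cache type. The kvCacheType helper name and the f16 fallback are assumptions for illustration; only SupportsFlashAttention comes from the diff above.

	// kvCacheType is a hypothetical helper illustrating the dependency
	// described in the commit message: quantized KV caches rely on the
	// flash attention kernel, so fall back to f16 whenever flash
	// attention cannot be used (e.g. for the gptoss architecture).
	func kvCacheType(f GGML, flashAttention bool, requested string) string {
		if !flashAttention || !f.SupportsFlashAttention() {
			return "f16"
		}
		return requested // e.g. "q8_0" or "q4_0"
	}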