Commit 8253ad4d authored by Jesse Gross, committed by Jesse Gross

ggml: Prevent kv cache quantization on gpt-oss

KV cache quantization has a dependency on the flash attention kernel.
We currently cannot use flash attention with gpt-oss as it requires
additional operations.

The model definition does not call flash attention, so the model works
regardless of the setting, but the cache would still pick up the
quantization type. This updates the flash attention setting earlier
in the loading flow so that all downstream settings are also set correctly.

Fixes: #11671
parent fa7776fd
@@ -761,6 +761,10 @@ func (f GGML) SupportsFlashAttention() bool {
 		return false
 	}
+	if f.KV().Architecture() == "gptoss" {
+		return false
+	}
 	// Check head counts match and are non-zero
 	headCountK := f.KV().EmbeddingHeadCountK()
 	headCountV := f.KV().EmbeddingHeadCountV()
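For context, here is a minimal sketch, not Ollama's actual loading code, of how a loader might consume this check when choosing the KV cache type. The kvCacheType helper name and the f16 fallback are assumptions for illustration; only SupportsFlashAttention comes from the diff above.

	// kvCacheType is a hypothetical helper illustrating the dependency
	// described in the commit message: quantized KV caches rely on the
	// flash attention kernel, so fall back to f16 whenever flash
	// attention cannot be used (e.g. for the gptoss architecture).
	func kvCacheType(f GGML, flashAttention bool, requested string) string {
		if !flashAttention || !f.SupportsFlashAttention() {
			return "f16"
		}
		return requested // e.g. "q8_0" or "q4_0"
	}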