llm: Support KV cache quantization with gpt-oss

With the new version of GGML in #12245, KV cache quantization with gpt-oss no longer causes a fallback to CPU, so the check that restricted gpt-oss to non-quantized cache types is removed.

19e6796e · Jesse Gross · Jesse Gross · 33801c15 · 19e6796e
Commit 19e6796e authored Oct 03, 2025 by Jesse Gross Committed by Jesse Gross Oct 03, 2025

Showing 1 changed file with 0 additions and 5 deletions

fs/ggml/ggml.go +0 -5

--- a/fs/ggml/ggml.go
+++ b/fs/ggml/ggml.go
@@ -870,11 +870,6 @@ func (f GGML) SupportsKVCacheType(cacheType string) bool {
 		return true
 	}
-	if arch := f.KV().Architecture(); slices.Contains([]string{"gptoss", "gpt-oss"}, arch) {
-		// gpt-oss uses attention with sinks which does not support quantized cache types
-		slog.Warn("model only supports non-quantized cache types", "model", arch)
-		return false
-	}
 	return slices.Contains([]string{"q8_0", "q4_0"}, cacheType)
 }
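
To make the effect of the removal concrete, the following is a minimal standalone sketch of how the check might behave once the gpt-oss branch is gone. It is not the actual method: `supportsKVCacheType` is a hypothetical free function standing in for `GGML.SupportsKVCacheType`, and the early-return condition (empty string or `f16`) is an assumption, since the diff only shows the `return true` line and not the condition guarding it. Only the final `slices.Contains` line is taken directly from the diff.

```go
package main

import (
	"fmt"
	"slices"
)

// supportsKVCacheType is a hypothetical stand-in for
// GGML.SupportsKVCacheType as it might look after this commit.
// It no longer inspects the model architecture at all.
func supportsKVCacheType(cacheType string) bool {
	// ASSUMPTION: the diff shows only `return true` here. Treating an
	// empty value and "f16" as the non-quantized default is a guess at
	// the elided condition, not something the diff confirms.
	if cacheType == "" || cacheType == "f16" {
		return true
	}

	// From the diff: the only quantized cache types accepted.
	return slices.Contains([]string{"q8_0", "q4_0"}, cacheType)
}

func main() {
	// Print the decision for a few cache type strings.
	for _, ct := range []string{"f16", "q8_0", "q4_0", "q5_1"} {
		fmt.Printf("%-5s supported=%v\n", ct, supportsKVCacheType(ct))
	}
}
```

In this sketch the result depends only on the requested cache type, which matches the diff: with the architecture check deleted, gpt-oss models are handled the same way as any other model.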