    ggml: Prevent kv cache quantization on gpt-oss · 8253ad4d
    Jesse Gross authored
    KV cache quantization depends on the flash attention kernel. We
    currently cannot use flash attention with gpt-oss because the model
    requires additional operations.
    
    The model definition does not call flash attention, so inference works
    regardless of the setting, but the KV cache still picks up the
    quantization type. This change sets the flash attention flag earlier
    in the loading flow so that all downstream settings are also set
    correctly.
    
    Fixes: #11671
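    
    A minimal Go sketch of the ordering this change describes. The names
    here (Options, supportsFlashAttention, resolveOptions, the "gptoss"
    architecture string) are hypothetical and do not match the actual
    ggml.go API; the point is only that the flash attention flag is
    settled first and the KV cache type is derived from it afterward.
    
        package main
        
        import "fmt"
        
        // Options holds the runtime settings relevant to this fix.
        type Options struct {
            FlashAttention bool
            KVCacheType    string // e.g. "q8_0", "q4_0", "f16"
        }
        
        // supportsFlashAttention reports whether an architecture can use
        // the flash attention kernel. gpt-oss cannot, because it needs
        // additional operations.
        func supportsFlashAttention(arch string) bool {
            switch arch {
            case "gptoss":
                return false
            default:
                return true
            }
        }
        
        // resolveOptions settles the flash attention flag before deriving
        // the KV cache type from it, so a quantized cache type is only
        // honored when flash attention is actually enabled.
        func resolveOptions(arch string, opts Options) Options {
            if !supportsFlashAttention(arch) {
                opts.FlashAttention = false
            }
            // KV cache quantization depends on the flash attention
            // kernel; fall back to the unquantized f16 cache without it.
            if !opts.FlashAttention && opts.KVCacheType != "f16" {
                opts.KVCacheType = "f16"
            }
            return opts
        }
        
        func main() {
            opts := Options{FlashAttention: true, KVCacheType: "q8_0"}
            // {FlashAttention:false KVCacheType:f16}
            fmt.Printf("%+v\n", resolveOptions("gptoss", opts))
            // {FlashAttention:true KVCacheType:q8_0}
            fmt.Printf("%+v\n", resolveOptions("llama", opts))
        }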
ggml.go