Commit 7837a5bc authored by Jesse Gross, committed by Jesse Gross

ggml: Always set cache padding to 256

We currently use a cache padding of 32 when not using flash attention
and 256 with flash attention, based on the historic alignment
requirements of these kernels. Those restrictions have since been
loosened, but larger padding still has performance benefits, such as
better CUDA graph reuse.

Since the requirement is no longer kernel-specific, set the padding
uniformly to 256, matching llama.cpp.
parent 0a844f8e
@@ -687,7 +687,7 @@ func (b *Backend) CacheConfig() ml.CacheConfig {
 	if b.flashAttention {
 		return ml.CacheConfig{CachePadding: 256, MaskDType: ml.DTypeF16, MaskBatchPadding: C.GGML_KQ_MASK_PAD}
 	} else {
-		return ml.CacheConfig{CachePadding: 32, PermutedV: true}
+		return ml.CacheConfig{CachePadding: 256, PermutedV: true}
 	}
 }
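
For context, a minimal sketch of how a cache padding value like this is typically applied: the used portion of the KV cache is rounded up to a multiple of the padding, so many different sequence lengths map onto the same padded size (which is what helps CUDA graph reuse). The roundUp helper and the standalone program below are illustrative assumptions, not the actual ollama cache code.

package main

import "fmt"

// roundUp rounds n up to the nearest multiple of pad.
func roundUp(n, pad int) int {
	return ((n + pad - 1) / pad) * pad
}

func main() {
	const cachePadding = 256

	// With a padding of 256, cache lengths of 1, 100, and 256 all pad
	// to 256, and 300 and 513 pad to 512 and 768 respectively, so the
	// backend sees far fewer distinct cache sizes.
	for _, used := range []int{1, 100, 256, 300, 513} {
		fmt.Printf("used=%4d -> padded=%4d\n", used, roundUp(used, cachePadding))
	}
}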