Commit 7837a5bc authored by Jesse Gross, committed by Jesse Gross

ggml: Always set cache padding to 256

We currently use a cache padding of 32 when not using flash attention
and 256 with flash attention, based on the historic alignment
requirements of these kernels. Those restrictions have since been
loosened, but larger padding still has performance benefits, such as
better CUDA graph reuse.

Since the requirement is no longer kernel-specific, set the padding
uniformly to 256, matching llama.cpp.
parent 0a844f8e
@@ -687,7 +687,7 @@ func (b *Backend) CacheConfig() ml.CacheConfig {
 	if b.flashAttention {
 		return ml.CacheConfig{CachePadding: 256, MaskDType: ml.DTypeF16, MaskBatchPadding: C.GGML_KQ_MASK_PAD}
 	} else {
-		return ml.CacheConfig{CachePadding: 32, PermutedV: true}
+		return ml.CacheConfig{CachePadding: 256, PermutedV: true}
 	}
 }
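
For context, a minimal sketch of how a cache padding value like this is typically applied: the used portion of the KV cache is rounded up to a multiple of the padding, so many different sequence lengths map onto the same padded size (which is what helps CUDA graph reuse). The roundUp helper and the standalone program below are illustrative assumptions, not the actual ollama cache code.

package main

import "fmt"

// roundUp rounds n up to the nearest multiple of pad.
func roundUp(n, pad int) int {
	return ((n + pad - 1) / pad) * pad
}

func main() {
	const cachePadding = 256

	// With a padding of 256, cache lengths of 1, 100, and 256 all pad
	// to 256, and 300 and 513 pad to 512 and 768 respectively, so the
	// backend sees far fewer distinct cache sizes.
	for _, used := range []int{1, 100, 256, 300, 513} {
		fmt.Printf("used=%4d -> padded=%4d\n", used, roundUp(used, cachePadding))
	}
}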