[Bugfix][Kernel][ROCm] Fix triton_w4a16 scales mismatch when BLOCK_K > group_size (#39705)

Signed-off-by: JartX <sagformas@epdcenter.es>

Commit f414f906 (unverified), authored Apr 13, 2026 by JartX, committed by GitHub Apr 13, 2026

Showing with 8 additions and 0 deletions

vllm/model_executor/kernels/linear/mixed_precision/triton_w4a16.py +8 -0

--- a/vllm/model_executor/kernels/linear/mixed_precision/triton_w4a16.py
+++ b/vllm/model_executor/kernels/linear/mixed_precision/triton_w4a16.py
@@ -235,6 +235,14 @@ def triton_w4a16_gemm(
        else:
            BLOCK_M, BLOCK_N, BLOCK_K = 128, 128, 32
+    # The kernel loads scales/zeros for a single group per BLOCK_K tile
+    # (one g_idx per iteration). If BLOCK_K > group_size, rows at the tail
+    # of the tile dequantize with the wrong group's scales, silently
+    # corrupting the output. Clamp BLOCK_K to group_size to keep one
+    # scale group per tile.
+    if group_size < BLOCK_K:
+        BLOCK_K = group_size
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    triton_w4a16_gemm_kernel[grid](
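
The failure mode described in the added comment can be reproduced without a GPU. The following is a minimal pure-Python sketch, not the Triton kernel itself: it assumes the kernel picks one group index per tile as `k0 // group_size` (inferred from the "one g_idx per iteration" comment), and the helper names `dequant_reference`, `dequant_tiled`, and `clamp_block_k` are hypothetical, introduced only for illustration.

```python
def dequant_reference(q, scales, group_size):
    """Correct behavior: row k is scaled by its own group, k // group_size."""
    return [v * scales[k // group_size] for k, v in enumerate(q)]


def dequant_tiled(q, scales, group_size, block_k):
    """Mimic a kernel that loads a single group's scale per BLOCK_K tile.

    Assumption: the group index is derived from the first row of the tile.
    """
    out = []
    for k0 in range(0, len(q), block_k):
        scale = scales[k0 // group_size]  # one scale group for the whole tile
        out.extend(v * scale for v in q[k0:k0 + block_k])
    return out


def clamp_block_k(block_k, group_size):
    """Same clamp as the patch: never let a tile span more than one group."""
    return group_size if group_size < block_k else block_k


K, group_size = 64, 16
q = [1.0] * K                     # stand-in for unpacked 4-bit weights
scales = [1.0, 2.0, 3.0, 4.0]     # one scale per group of 16 rows

ref = dequant_reference(q, scales, group_size)
bad = dequant_tiled(q, scales, group_size, block_k=32)
good = dequant_tiled(q, scales, group_size,
                     block_k=clamp_block_k(32, group_size))

# Rows 16..31 and 48..63 sit at the tail of a 32-row tile and pick up the
# previous group's scale when BLOCK_K > group_size.
mismatches = [k for k in range(K) if bad[k] != ref[k]]
```

With `block_k=32` and `group_size=16`, every second group is dequantized with its neighbor's scale; after clamping, the tiled result matches the reference. The sketch only covers the case where `group_size` is a multiple of the clamped `BLOCK_K`, which holds trivially when the two are equal.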