[Bugfix][Kernel] nvfp4 cutlass MoE: fix nvfp4 experts quant out-of-bounds read...

[Bugfix][Kernel] nvfp4 cutlass MoE: fix nvfp4 experts quant out-of-bounds read for expert counts not divisible by 4 or 16 (#40351)

Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
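
Illustration (not part of the patch): a minimal host-side sketch of why a fixed-width chunked read can run past the end of an array whose length is not a multiple of the chunk width, and of the dispatch condition this commit adds. It assumes, purely for illustration, a loader that rounds the last partial chunk up to a full chunk; the real kernel's access pattern is not shown in this diff. The helper names `entries_touched`, `overread`, and `use_vectorized_path` are hypothetical and do not exist in the source file.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not in the patch): number of entries a loader touches
// when it reads `n` entries in fixed-width chunks of `chunk`, assuming the
// last partial chunk is rounded up to a full one.
inline std::size_t entries_touched(std::size_t n, std::size_t chunk) {
  assert(chunk > 0);
  return (n + chunk - 1) / chunk * chunk;
}

// Hypothetical helper: entries read past the end of an n-entry array by the
// chunked loader modeled above. Zero exactly when n is a multiple of chunk.
inline std::size_t overread(std::size_t n, std::size_t chunk) {
  return entries_touched(n, chunk) - n;
}

// Hypothetical helper mirroring the patched host-side condition
// `n_experts >= 4 && n_experts % 4 == 0` (or the 16-wide variant):
// take the vectorized kernel only when the experts form whole chunks,
// otherwise use the scalar specialization.
inline bool use_vectorized_path(std::size_t n_experts, std::size_t chunk) {
  assert(chunk > 0);
  return n_experts >= chunk && n_experts % chunk == 0;
}
```

Under this model, 6 experts with a 4-wide chunk would touch 8 entries, 2 of them past the end, so `use_vectorized_path(6, 4)` is false and the scalar path is chosen; 8 experts fit exactly and keep the vectorized path.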

Commit 6fbec8ed (unverified), authored Apr 21, 2026 by Jakub Zakrzewski; committed by GitHub, Apr 21, 2026

Showing 1 changed file with 6 additions and 2 deletions

csrc/libtorch_stable/quantization/fp4/nvfp4_experts_quant.cu +6 -2

--- a/csrc/libtorch_stable/quantization/fp4/nvfp4_experts_quant.cu
+++ b/csrc/libtorch_stable/quantization/fp4/nvfp4_experts_quant.cu
@@ -277,7 +277,9 @@ void quant_impl(void* output, void* output_scale, void* input,
      (totalWorkSize + block.x * grid.x - 1) / (block.x * grid.x);
  if (blockRepeat > 1) {
    size_t shared_mem_size = (n_experts + 1) * sizeof(uint32_t);
-    if (n_experts >= 4) {
+    // The shared-memory vectorized offset load only handles full 4-expert
+    // chunks. Use the scalar specialization for the remainder cases.
+    if (n_experts >= 4 && n_experts % 4 == 0) {
      cvt_fp16_to_fp4<T, FUSE_SILU_MUL, false, false>
          <<<grid, block, shared_mem_size, stream>>>(
              m_topk, k, reinterpret_cast<T*>(input),
@@ -299,7 +301,9 @@ void quant_impl(void* output, void* output_scale, void* input,
              n_experts);
    }
  } else {
-    if (n_experts >= 16) {
+    // The low-latency vectorized expert lookup only handles full 16-expert
+    // chunks. Fall back to the scalar lookup path for the remainder cases.
+    if (n_experts >= 16 && n_experts % 16 == 0) {
      cvt_fp16_to_fp4<T, FUSE_SILU_MUL, false, false>
          <<<grid, block, 0, stream>>>(
              m_topk, k, reinterpret_cast<T*>(input),