[ROCm][Bugfix][Model] Fix illegal memory access when running qwen3_moe models...

[ROCm][Bugfix][Model] Fix illegal memory access when running qwen3_moe models with rms_norm (Qwen3-235B-A22B, Qwen3-30B-A3B, etc.) (#26192)

Signed-off-by: Randall Smith <ransmith@amd.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

b10c64c8 · rasmith · GitHub · 0925b28a · b10c64c8
Unverified Commit b10c64c8 authored Oct 17, 2025 by rasmith Committed by GitHub Oct 17, 2025

Showing with 16 additions and 8 deletions

csrc/layernorm_kernels.cu +16 -8

--- a/csrc/layernorm_kernels.cu
+++ b/csrc/layernorm_kernels.cu
@@ -364,18 +364,26 @@ void rms_norm(torch::Tensor& out,     // [..., hidden_size]
  TORCH_CHECK(weight.is_contiguous());
  int hidden_size = input.size(-1);
-  int num_tokens = input.numel() / hidden_size;
-  int64_t input_stride = input.stride(-2);
+  // We cannot just use `input.stride(-2)` if the tensor is not row-major.
+  // Instead, we use a 2d view to get the second-innermost stride.
+  // That way the dimensions (except the last one) can be arbitrarily permuted.
+  torch::Tensor input_view = input.view({-1, hidden_size});
+  int num_tokens = input_view.numel() / hidden_size;
+  int64_t input_stride = input_view.stride(-2);
  dim3 grid(num_tokens);
  dim3 block(std::min(hidden_size, 1024));
-  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
+  const at::cuda::OptionalCUDAGuard device_guard(device_of(input_view));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
-  VLLM_DISPATCH_FLOATING_TYPES(input.scalar_type(), "rms_norm_kernel", [&] {
-    vllm::rms_norm_kernel<scalar_t><<<grid, block, 0, stream>>>(
-        out.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(), input_stride,
-        weight.data_ptr<scalar_t>(), epsilon, num_tokens, hidden_size);
-  });
+  VLLM_DISPATCH_FLOATING_TYPES(
+      input_view.scalar_type(), "rms_norm_kernel", [&] {
+        vllm::rms_norm_kernel<scalar_t><<<grid, block, 0, stream>>>(
+            out.data_ptr<scalar_t>(), input_view.data_ptr<scalar_t>(),
+            input_stride, weight.data_ptr<scalar_t>(), epsilon, num_tokens,
+            hidden_size);
+      });
 }
 #define LAUNCH_FUSED_ADD_RMS_NORM(width)                                    \
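
The comment added in the hunk above says that `input.stride(-2)` is not a safe row stride when the tensor is not row-major, and that a 2D view is used instead. The following is a minimal standalone sketch of that idea, assuming a simplified model of how leading dimensions merge into one. `collapsed_row_stride` is a hypothetical helper written for illustration; it is not part of vLLM or PyTorch and does not reproduce PyTorch's actual view logic. The shapes and strides in the usage note are likewise assumed, not taken from the failing qwen3_moe run.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper (illustration only, not a vLLM or PyTorch API).
// Given the sizes and strides of a tensor whose last dimension is
// hidden_size with stride 1, return the row stride that a
// [-1, hidden_size] view would use, or std::nullopt if the leading
// dimensions cannot be merged into one without copying.
std::optional<int64_t> collapsed_row_stride(
    const std::vector<int64_t>& sizes,
    const std::vector<int64_t>& strides) {
  assert(sizes.size() == strides.size() && sizes.size() >= 2);
  const size_t last = sizes.size() - 1;
  std::optional<int64_t> row_stride;  // stride of the innermost non-unit leading dim
  int64_t expected = 0;  // stride the next outer dim needs in order to merge
  for (size_t i = last; i-- > 0;) {
    if (sizes[i] == 1) continue;  // size-1 dims never affect addressing
    if (!row_stride) {
      row_stride = strides[i];
    } else if (strides[i] != expected) {
      return std::nullopt;  // leading dims are not mergeable
    }
    expected = strides[i] * sizes[i];
  }
  // Only one row: any stride is valid, so report the dense one.
  return row_stride.value_or(sizes[last]);
}
```

As an assumed example, take sizes `{4, 1, 8}` with strides `{24, 96, 1}`. The second-to-last dimension has size 1, so its stride of 96 is never used for addressing, yet reading `strides[1]` directly as the row stride would place the fourth row at offset 3 * 96 = 288 in a buffer that spans only 4 * 24 = 96 elements. The helper skips the size-1 dimension and returns 24, which is the stride between consecutive rows.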