Unverified Commit 05dc1e62 authored by Kevin Tong, committed by GitHub

NVFP4 Move RHT BLAS to GPU (#2275)



* CUDA RHT
Signed-off-by: Kevin Tong <kevin@augmentcode.com>

* Fix cuda graphs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix bug where RHT mask is tensor instead of int
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Kevin Tong <kevin@augmentcode.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
parent 9dd61922
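
In short, the change builds the RHT constants directly on the GPU instead of materializing them on the CPU and copying them over afterwards. A minimal before/after sketch of the pattern (assuming a CUDA-capable PyTorch build; the dimension 16 matches the Hadamard matrices in the diff below):

import torch

# Before: the matrix is built on the host, so the BLAS work that
# derives the RHT matrix runs on CPU tensors and the result is
# copied to the GPU at the end.
m = torch.eye(16, dtype=torch.float32)
rht_before = m.to(dtype=torch.bfloat16).cuda()

# After: the matrix is allocated on the device from the start, so the
# derivation runs on the GPU and no host-to-device copy is needed.
m = torch.eye(16, dtype=torch.float32, device="cuda")
rht_after = m.to(dtype=torch.bfloat16)

Since these helpers are cached with functools.lru_cache, avoiding the host round trip also means the cached tensors are created on-device from the start; presumably this is what makes the helpers safe under CUDA graph capture, per the "Fix cuda graphs" commit above.
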
@@ -29,7 +29,7 @@ aten = torch.ops.aten


 def get_no_random_sign_vector() -> torch.Tensor:
     """Non-random sign vector for Hadamard transform."""
-    return torch.tensor([1], dtype=torch.float32)
+    return torch.tensor([1], dtype=torch.float32, device="cuda")


 def get_sign_from_vector(vector: torch.Tensor) -> int:
@@ -41,7 +41,7 @@ def get_sign_from_vector(vector: torch.Tensor) -> int:
     mask = 0
     for i, v in enumerate(vector):
         mask |= (v == -1) << i
-    return mask
+    return mask.item()


 def get_wgrad_sign_vector() -> torch.Tensor:
@@ -53,6 +53,7 @@ def get_wgrad_sign_vector() -> torch.Tensor:
     return torch.tensor(
         [1, 1, 1, -1, 1, -1, -1, -1, -1, -1, -1, 1, -1, 1, -1, -1],
         dtype=torch.float32,
+        device="cuda",
     )
@@ -81,6 +82,7 @@ def get_hadamard_matrix(hadamard_dimension: int) -> torch.Tensor:
             [1, -1, -1, 1, -1, 1, 1, -1, -1, 1, 1, -1, 1, -1, -1, 1],
         ],
         dtype=torch.float32,
+        device="cuda",
     )
     * hadamard_scale
 )
@@ -94,9 +96,9 @@ def get_rht_matrix(with_random_sign_mask: bool) -> torch.Tensor:
         signs = get_wgrad_sign_vector()
     else:
         signs = get_no_random_sign_vector()
-    sign_matrix = signs * torch.eye(hadamard_dimension, dtype=torch.float32)
+    sign_matrix = signs * torch.eye(hadamard_dimension, dtype=torch.float32, device="cuda")
     rht_matrix = sign_matrix @ get_hadamard_matrix(hadamard_dimension)
-    return rht_matrix.to(dtype=torch.bfloat16).cuda()
+    return rht_matrix.to(dtype=torch.bfloat16)


 @functools.lru_cache(maxsize=None)
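
The mask fix is subtle enough to deserve a standalone illustration. A minimal sketch (only get_sign_from_vector comes from the diff above; the docstring and sample vector are illustrative additions):

import torch

def get_sign_from_vector(vector: torch.Tensor) -> int:
    """Pack the sign pattern of `vector` into a bitmask: bit i is set iff vector[i] == -1."""
    mask = 0
    for i, v in enumerate(vector):
        # `(v == -1)` is a 0-d tensor, so `|=` silently promotes
        # `mask` from a Python int to a tensor as well.
        mask |= (v == -1) << i
    # Without `.item()` the function would return that 0-d tensor,
    # violating the `-> int` annotation and breaking callers that
    # expect a plain Python integer.
    return mask.item()

signs = torch.tensor([1, 1, 1, -1], dtype=torch.float32)
print(get_sign_from_vector(signs))  # 8: only index 3 holds a -1
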