"docs/static/git@developer.sourcefind.cn:OpenDAS/nni.git" did not exist on "a4d8a4ea6b7869d7fd5b7d6bda057752204d603e"
Unverified Commit 63bf1609 authored by Lei Wang, committed by GitHub

[Bugfix] Fix fp8 dtype for some cases (#1246)

* [Enhancement] Add FP8 support and reproducibility in lighting indexer

* Introduced a manual seed in `test_fp8_lighting_indexer` to ensure reproducible performance.
* Added specializations for `cute::float_e4m3_t` and `cute::float_e5m2_t` in `gemm_mma.h` to extend FP8 support across multiple CUDA architectures (a sketch of the pattern appears below, before the diff).

* Fix typos in `fp8_lighting_indexer.py` and improve formatting in `gemm_mma.h`

* Corrected a typo in the comment for `test_fp8_lighting_indexer` to enhance clarity.
* Reformatted lines in `gemm_mma.h` for better readability by aligning template specializations across multiple CUDA architectures.

* test fix

* bug fix
parent f550a58d
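
The fp8 `DispatchInstruction` specializations mentioned in the commit message are not part of the excerpt below, which only shows the dispatch-site change in `gemm_mma.h`. As a rough, self-contained sketch of the pattern, assuming a primary-template shape matching the dispatch site and CUTLASS's SM89 fp8 MMA atom (both are assumptions, not code from this commit):

    #include <cute/tensor.hpp> // cute::float_e4m3_t, MMA_Atom, SM89 fp8 atoms

    // Primary template: left undefined so an operand-type combination with
    // no matching specialization fails at compile time.
    template <typename A_type, typename B_type, typename C_type,
              int num_warp_m, int num_warp_n, int N>
    struct DispatchInstruction;

    // Hypothetical fp8 (e4m3 x e4m3 -> f32) specialization. Whether the real
    // header selects SM89_16x8x32_F32E4M3E4M3F32_TN per architecture is an
    // assumption; the commit only states that specializations were added for
    // cute::float_e4m3_t and cute::float_e5m2_t.
    template <int num_warp_m, int num_warp_n, int N>
    struct DispatchInstruction<cute::float_e4m3_t, cute::float_e4m3_t, float,
                               num_warp_m, num_warp_n, N> {
      using MMA = cute::MMA_Atom<cute::SM89_16x8x32_F32E4M3E4M3F32_TN>;
    };

The dispatch-site change in the second hunk then keys this lookup on the raw dtypes so that such specializations are actually matched.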
fp8_lighting_indexer.py
@@ -258,6 +258,8 @@ def ref_fp8_mqa_logits(q: torch.Tensor, kv: torch.Tensor, weights: torch.Tensor,
 def test_fp8_lighting_indexer(S=4096, SKV=8192, H=32, HKV=1, D=64, kv_stride=1):
+    # initial random seed to make the performance reproducible
+    torch.manual_seed(0)
     q = torch.randn(S, H, D, device="cuda", dtype=torch.bfloat16).to(torch.bfloat16)
     kv = torch.randn(SKV, D, device="cuda", dtype=torch.bfloat16).to(torch.bfloat16)
     weights = torch.randn(S, H, device="cuda", dtype=torch.float32)
...
gemm_mma.h
@@ -273,8 +273,8 @@ public:
                              tfloat32_t, B_type_cute>::type;
   using C_type = C_type_raw;
-  using Instruction =
-      DispatchInstruction<A_type, B_type, C_type, num_warp_m, num_warp_n, N>;
+  using Instruction = DispatchInstruction<A_type_raw, B_type_raw, C_type_raw,
+                                          num_warp_m, num_warp_n, N>;
   using OperandATraits = OperandTraits<sizeof_bits<A_type>::value, M, K,
                                        !trans_A, num_warp_m, lda>;
...
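
Why key the dispatch on the raw types? `A_type`/`B_type` are post-conversion operand types (the surviving context line shows `float` being remapped to `tfloat32_t` for TF32 MMA), and template specializations match on exact arguments, so a lookup keyed on a converted type can miss a specialization written for the original dtype. A minimal standalone illustration of that matching behavior, using stand-in types rather than the real header:

    #include <cstdio>

    struct fp8_e4m3 {}; // stand-in for cute::float_e4m3_t
    struct tf32 {};     // stand-in for a converted operand type

    // Fallback picked whenever no specialization matches the key exactly.
    template <typename A_type> struct Dispatch {
      static constexpr const char *name = "fallback instruction";
    };
    // Specialization keyed on the raw fp8 type.
    template <> struct Dispatch<fp8_e4m3> {
      static constexpr const char *name = "fp8 MMA instruction";
    };

    int main() {
      std::printf("%s\n", Dispatch<fp8_e4m3>::name); // fp8 MMA instruction
      std::printf("%s\n", Dispatch<tf32>::name);     // fallback instruction
    }

Dispatching on `A_type_raw`/`B_type_raw` therefore ensures the fp8 specializations are keyed on the dtype the kernel was actually instantiated with.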