Commits · e547d247a4d45c271efa2ecd9b2dbb970afb1e79 · OpenDAS / tilelang

01 Dec, 2025 4 commits

[Bugfix] Update TIR registration for GemmSPPy to use tile operation (#1361) · e547d247
Lei Wang authored Dec 01, 2025

e547d247

[Language] support `T.gemm_sp_v2` on sm80 and sm89 (#1056) · 283a9a00

botbw authored Dec 01, 2025

* [misc] add a cpp side wrapper for gemm_sp_py

* [misc] typing

* [IR] bind GemmSPWarpPolicy

* [chore] add wrapper code

* [IR] fix GemmSPWarpPolicy

* [codegen] apply ptxas instructions

* [intrinsic] add typical (unused) mma layout

* [template] add uint16 debug func

* [intrinsic] add b matrix layout

* [gemm_sp] enable fp16/bf16 on sm8x

* [layout] refactor fp16/bf16 layout

* [gemm_sp] enable int8

* [chore] update test case dtype

* [gemm_sp] enable fp32

* [layout] refactor layouts

* [intrinsic] enable ldmatrix for mat A

* [layout] enable ldsm for matrix b

* [layout] add ldmatrix for fp32 and fp8

* [chore] refine

* [chore] refactor

* [chore] add fp8 efactor

* [chore] refactor

* [chore] add remove negative zero util

* [example] add a custom compress kernel

* [chore] minor update

* [test] refactor gemm_sp test

* [refactor] make metadata layout func

* [example] add option for using cutlass layout

* [doc] add a gemm_sp doc

* [doc] minor polish

* [chore] remove unused

* [bugfix] fix non replicate b case

* [test] refactor

* [chore] add a check

* [bugfix] fix util bug

* [wip] init a new test case for v2

* [chore] minor refactor

* [chore] minor update

* [bugfix] enable 16bit rs

* [language] enable rs

* [language] enable gemm_sp_sr

* [language] enable gemm_sp_rr

* [test] enable more tests

* [tvm] update ffi binding

* [chore] remove print

* [chore] fix benchmark script

* [lint] precommit lint

* [chore] apply feedback

* [test] use arch 8.0

* [chore] rollback ::ordered_metadata for backward compatibility

* [bugfix] fix capitalized

* [example] keep gemm_sp on hopper

* [test] fix no fp8 normal kernel

* [test] reduce matmul size to satisfy accum error

* [test] use cal_diff for assertion

* [bugfix] expand float8 type

* [lib] add make_int4 for short type

* [language] add transpose E

* [bugfix] fix wrong var

* [format] format

* [chore] refactor binding

* [chore] fix wrong passing var

283a9a00
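The bullets above mention a custom compress kernel and a metadata layout without describing them. As a rough illustration only, assuming the 2:4 structured-sparsity pattern commonly used for 16-bit operands on sparse tensor cores, compression keeps two values from every group of four plus their positions. The helper names `compress_2_4` and `decompress_2_4` are hypothetical, and the flat position list is not the packed metadata layout the hardware or TileLang uses.

```python
def compress_2_4(row):
    """Compress a row that follows a 2:4 sparsity pattern.

    Returns (values, metadata): two kept values per group of four, plus the
    position (0..3) of each kept value inside its group.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, metadata = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        kept = [i for i, v in enumerate(group) if v != 0]
        if len(kept) > 2:
            raise ValueError("group has more than 2 non-zeros")
        # Pad with unused positions so every group stores exactly two slots.
        for i in range(4):
            if len(kept) == 2:
                break
            if i not in kept:
                kept.append(i)
        kept.sort()
        values.extend(group[i] for i in kept)
        metadata.extend(kept)
    return values, metadata


def decompress_2_4(values, metadata):
    """Inverse of compress_2_4: scatter kept values back into a dense row."""
    dense = [0] * (len(values) * 2)
    for slot, (value, pos) in enumerate(zip(values, metadata)):
        group = slot // 2  # two stored slots per group of four
        dense[group * 4 + pos] = value
    return dense
```

A round trip through both helpers returns the original row, which is a convenient sanity check when experimenting with alternative metadata layouts.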

[Analysis] Enhance NestedLoopChecker with tile op cases (#1358) · b10ef75f
Chaofan Lin authored Dec 01, 2025
```
* [Analysis] Enhance NestedLoopChecker with tile op cases

* fix tileop issue
```
b10ef75f

[Refactor] Update Fragment Indexing in ParallelOpNode's InferLayout Method (#1359) · 1b42c87b

Lei Wang authored Dec 01, 2025

This commit refines the Fragment creation process in the InferLayout method of ParallelOpNode. It removes the unnecessary forward_index array and utilizes default fragment indexing for consistency with other operations. Additionally, it binds the thread range to enhance comparability across different operations.

1b42c87b

28 Nov, 2025 3 commits

[Bugfix] Disable floordiv optimization due to integer overflow risk (#1355) · a4ea7da9

LJC00118 authored Nov 28, 2025

* disable overflow-prone floordiv optimization in lower_intrin.cc

a4ea7da9
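The commit message does not say which rewrite was disabled or how it overflows. One plausible failure mode, assumed here purely for illustration, is lowering `floordiv(a, b)` for a positive `b` as `truncdiv(a + b*c, b) - c`, where `c` is chosen so that the shifted numerator is non-negative. The shift is exact in unbounded arithmetic but can wrap in a 32-bit register. The helpers below are hypothetical and simulate int32 wraparound in Python.

```python
def wrap_int32(x):
    """Reduce a Python int to the value a 32-bit signed register would hold."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x


def truncdiv(a, b):
    """C-style division that rounds toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def floordiv_shifted(a, b, c):
    """floordiv(a, b) computed as truncdiv(a + b*c, b) - c in int32.

    Correct only when a + b*c is non-negative and still fits in int32.
    """
    shifted = wrap_int32(a + wrap_int32(b * c))
    return wrap_int32(truncdiv(shifted, b) - c)


# Small values: the shift makes the numerator non-negative and nothing wraps.
small = floordiv_shifted(-7, 4, 2)

# Large values: a + b*c exceeds the int32 range, so the result is wrong.
a, b, c = 2**31 - 10, 4, 2**28
large = floordiv_shifted(a, b, c)
```

Here `small` matches Python's `-7 // 4`, while `large` differs from `a // b` because the intermediate sum wrapped.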

[Enhancement] Improve error handling and assertion messages across runtime and... · 17cfeb76

Lei Wang authored Nov 28, 2025

[Enhancement] Improve error handling and assertion messages across runtime and argument binding (#1356)

This commit enhances the error handling mechanisms in the runtime by introducing CPU-safe runtime helpers and refining assertion messages in the CodeGenCHost and ArgBinder. It includes structured packed error messages for various conditions, improving clarity in diagnostics. Additionally, the CMake configuration is updated to always include necessary runtime helpers, ensuring consistent error reporting. The changes aim to provide clearer feedback during runtime errors and improve the overall robustness of the argument binding process.

17cfeb76

[Refactor] Simplify index sign state handling in LegalizeNegativeIndex (#1354) · 36a2b2f3

Lei Wang authored Nov 28, 2025

This commit refines the logic for determining the sign state of indices in the LegalizeNegativeIndex transformation. It prioritizes vector patterns, specifically Ramp and Broadcast nodes, to avoid compile-time lane queries. The handling of scalar indices is also streamlined, ensuring clearer diagnostics when non-negativity cannot be proven. These changes enhance the robustness and clarity of index handling in the transformation pass.

36a2b2f3
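A minimal sketch of the sign-state idea described above, not the pass's actual code: vector patterns are handled first by reasoning about their scalar parts, so no per-lane query is needed. The `Bound`, `Ramp`, and `Broadcast` classes here are simplified stand-ins for the TIR nodes, and `sign_state` is a hypothetical name.

```python
from dataclasses import dataclass
from enum import Enum


class SignState(Enum):
    NON_NEGATIVE = 0
    NEGATIVE = 1
    UNKNOWN = 2


@dataclass(frozen=True)
class Bound:
    """Known inclusive range of a scalar index."""
    lo: int
    hi: int


@dataclass(frozen=True)
class Ramp:
    """Lanes base, base + stride, ..., base + stride * (lanes - 1)."""
    base: Bound
    stride: int
    lanes: int


@dataclass(frozen=True)
class Broadcast:
    """The same scalar value repeated across all lanes."""
    value: Bound
    lanes: int


def index_bound(index):
    """Range covering every lane, derived from the scalar parts only."""
    if isinstance(index, Ramp):
        span = index.stride * (index.lanes - 1)
        return Bound(index.base.lo + min(0, span), index.base.hi + max(0, span))
    if isinstance(index, Broadcast):
        return index.value
    return index  # plain scalar, already a Bound


def sign_state(index):
    bound = index_bound(index)
    if bound.lo >= 0:
        return SignState.NON_NEGATIVE
    if bound.hi < 0:
        return SignState.NEGATIVE
    return SignState.UNKNOWN  # the case where a diagnostic would be emitted
```

In this sketch a ramp whose lanes straddle zero is reported as `UNKNOWN`, which corresponds to the case where non-negativity cannot be proven.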

27 Nov, 2025 1 commit

[Refactor] Improve assertion handling in CodeGenCHost and ArgBinder (#1352) · 1e92d11c

Lei Wang authored Nov 28, 2025

* [Refactor] Improve assertion handling in CodeGenCHost and ArgBinder

This commit refines the assertion message generation in CodeGenCHost by optimizing the handling of equality checks and reducing buffer size for error messages. Additionally, it enhances the ArgBinder by introducing a nullable guard mechanism for assertions, allowing for more precise error handling when binding arguments. The changes improve the clarity and efficiency of assertion handling across the codebase.

* [Enhancement] Update matmul kernel and optimize argument binding

This commit enhances the matmul kernel by introducing additional tensor parameters and refining the pipeline stages for improved performance. It also updates the argument binding mechanism to include a flag indicating whether buffers are used, enhancing the efficiency of buffer management. Furthermore, the optimization phase in the engine is improved by adding a simplification step, ensuring better performance and clarity in the generated code.

* lint fix

* [Enhancement] Add tensor checks documentation and improve argument binding assertions

This commit introduces a new documentation page for host-side tensor checks, detailing the automatic validations performed by TileLang on kernel arguments. It enhances the ArgBinder by adding assertions for non-null pointers when arguments are used, improving error handling. Additionally, the optimization phase in the engine is updated to include a simplification step, ensuring better performance and clarity in the generated code.

* [Enhancement] Update .gitignore and refine matmul kernel for improved performance

This commit adds host checks logs to the .gitignore file to prevent unnecessary log files from being tracked. Additionally, it refines the matmul kernel by adjusting pipeline stages, updating tensor parameters, and enhancing argument handling for better performance. The changes also include improved error messages in the argument binding process, ensuring clearer diagnostics for users.

* lint fix

* lint fix

* [Refactor] Simplify tensor_null_test function and remove ptr_null_test

This commit refactors the tensor_null_test function by adding a with_bias parameter and removing the ptr_null_test function, which was previously unused. The run_test function is updated to reflect these changes, streamlining the testing process for tensor operations.

* lint fix

* fix

1e92d11c

26 Nov, 2025 5 commits

[Enhancement] Add support for k_pack in gemm_mfma (#1344) · 6bae64f6
Gongen-Ali authored Nov 26, 2025
```
* add support for k_pack

* support benchmark on ROCm

* fix format
```
6bae64f6

[Refactor] Enhance CopyNode's IterVar Creation and Range Handling (#1346) · 17718bec

Lei Wang authored Nov 26, 2025

* [Refactor] Enhance CopyNode's IterVar Creation and Range Handling

This commit refines the `MakeIterVars` method in `CopyNode` to select base ranges based on memory scope levels, ensuring that the chosen ranges are not smaller than the original source ranges. Additionally, it updates the Python `copy` function to clarify range handling, including broadcasting logic and extent alignment. These changes improve the robustness and clarity of the copy operation's implementation.

* test fix

17718bec

[Enhancement] add more dtype and fix mma.ws for fp16 for tcgen05 (#1327) · f0c721a4

Yunqian Fan authored Nov 26, 2025

* feat: add fp8 variants; add placeholder for fp6/fp4 in meta

support ld with pack for fp32 dtype

add dump

add template expand

remove unused dtype and change to rebased apis

* fix: when atom-m!=128, enable_ws

* fix: typo in tcgen05 meta; dispatch in gemm sm100

f0c721a4

[Refactor] Phaseout vmap for Tile Operators (#1334) · f5d9da46

Lei Wang authored Nov 26, 2025



* Refactor GEMM and Reduce operations by moving NormalizeToBufferRegion and MakeAccessPtrFromRegion to utils.{h,cc} for better code organization and reuse.

* lint fix

* Refactor region handling by removing the RegionOp and updating NormalizeToBufferRegion to only accept BufferLoad and BufferRegion. This change improves code organization and simplifies the handling of memory regions across various operations.

* fix

* Refactor memory region handling by introducing `tl.region` calls across various operations, including GEMM and fill functions. This change enhances the consistency of region management and improves code organization by utilizing utility functions for buffer region conversions.

* fix

* fix

* test fix

* lint fix

* Refactor GEMM operations to improve memory region handling by replacing `mbarPtr_` with `mbarRegion_` and updating related logic in both C++ and Python implementations. This change enhances the clarity and consistency of buffer region management.

* fix

* lint fix

* fix

* fix

* test fix

* lint fix

* lint fix

* minor fix

* fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

f5d9da46

[Feat] Extend LegalizeNegativeIndex to support buffer store stmts (#1339) · fac04006

ConvolutedDog authored Nov 26, 2025

This commit enhances the LegalizeNegativeIndex transformation pass to handle
both buffer load and store operations with negative indices and adds some
test cases.

fac04006
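To illustrate what handling both loads and stores means, here is a toy rewrite over a made-up tuple IR. It assumes constant indices and that a negative index `i` on a dimension of extent `n` refers to element `i + n`. The node shapes and function names are hypothetical and unrelated to the real TIR classes.

```python
def legalize_indices(indices, shape):
    """Add the dimension extent to every negative constant index."""
    return tuple(i + n if i < 0 else i for i, n in zip(indices, shape))


def legalize(node, shapes):
    """Rewrite negative indices in a toy IR.

    Nodes are tuples: ("load", buffer, indices) or
    ("store", buffer, indices, value). Anything else is returned unchanged.
    """
    if not isinstance(node, tuple) or not node:
        return node
    kind = node[0]
    if kind == "load":
        _, buf, indices = node
        return ("load", buf, legalize_indices(indices, shapes[buf]))
    if kind == "store":
        _, buf, indices, value = node
        # The stored value may itself contain loads, so recurse into it.
        return ("store", buf, legalize_indices(indices, shapes[buf]),
                legalize(value, shapes))
    return node


shapes = {"A": (8,), "B": (4, 4)}
stmt = ("store", "B", (-1, 0), ("load", "A", (-3,)))
legalized = legalize(stmt, shapes)
```

After the rewrite, both the store target and the nested load use in-bounds indices.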

25 Nov, 2025 3 commits
- [Language][UX] Semantic check for parallel fragment access (#1338) · e2b10c58
  Chaofan Lin authored Nov 25, 2025
  
  e2b10c58
- [Fix] Fix bug copying from or to local buffer (#1304) (#1324) · 2ae4f1b7
  Kuris authored Nov 25, 2025
```
* [Fix] fix copy from or to local buffer (#1304)

* fix lint error

* minor fix testing script
```
  2ae4f1b7
- [Refactor] Moving `NormalizeToBufferRegion` and `MakeAccessPtrFromRegion` to utils (#1333) · 2f34840f
  Lei Wang authored Nov 25, 2025
```
* Refactor GEMM and Reduce operations by moving NormalizeToBufferRegion and MakeAccessPtrFromRegion to utils.{h,cc} for better code organization and reuse.

* lint fix
```
  2f34840f
24 Nov, 2025 4 commits

[BugFix] Use BufferRegion in tl.cumsum to infer buffer shape (#1321) · 9dda774a

Chaofan Lin authored Nov 25, 2025



* [BugFix] Use BufferRegion in tl.cumsum to infer buffer shape

* remove debug lines

* remove rubbish

* Fix decorator syntax for atomic_different_memory_orders_program

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

9dda774a

[Enhancement] Support more dtype in `T.print` (#1329) · c30df2a1
Wenhao Xie authored Nov 25, 2025
```
* [Enhancement] Support more dtype in `T.print`

* upd

* upd
```
c30df2a1
[Feat] Support warp reduce (#1316) · caa6dd3f
Tong WU authored Nov 24, 2025
```
* [Feat] Support warp reduce

* lint

* add test

* lint
```
caa6dd3f
Revert "[WIP] support more dtypes for tcgen05 (#1229)" (#1323) · ca98cc39
Lei Wang authored Nov 24, 2025
```
This reverts commit 0d101c11.
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
```
ca98cc39

23 Nov, 2025 1 commit

[Refactor] Backup Analyzer to get the appropriate arith information (#1311) · 9f7bac4c

Lei Wang authored Nov 23, 2025

* [Refactor] Update Vectorization Functions to Accept Analyzer Parameter

- Modified `VectorizeLoop` and related functions to accept an `arith::Analyzer` parameter, enhancing their capability to perform analysis during vectorization.
- Updated multiple instances in `copy.cc`, `fill.cc`, `parallel.cc`, and layout inference files to utilize the new analyzer parameter for improved performance and correctness.
- Ensured consistency across vectorization logic by integrating the analyzer into existing workflows, facilitating better optimization opportunities.

* [Fix] Corrected PostOrderVisit call in loop_vectorize.cc

- Updated the PostOrderVisit function to analyze the body of the loop node instead of the node itself, ensuring proper handling of nested loops during vectorization analysis.

* fix

* lint fix

* fix

9f7bac4c

22 Nov, 2025 1 commit

Improve memory access safety and `T.assume` handling (#1292) · 470eb74c

LJC00118 authored Nov 22, 2025



* Improve memory access safety and T.assume handling

* bugfix

* lint fix

* bugfix

* bugfix

* refactor legalize safe memory access pass

---------
Co-authored-by: Lei Wang <leiwang1999@outlook.com>

470eb74c

21 Nov, 2025 2 commits
- [WIP] support more dtypes for tcgen05 (#1229) · 0d101c11
  Yunqian Fan authored Nov 21, 2025
```
support ld with pack for fp32 dtype

add dump

add template expand

remove unused dtype and change to rebased apis
```
  0d101c11
- [Bugfix] Fallback to the old AtomicAdd implementation for legacy architectures (#1306) · 17bbc0ca
  Lei Wang authored Nov 21, 2025
  
  17bbc0ca
20 Nov, 2025 3 commits
- [Enhancement] Shared Memory Size Can be Dynamic (#1294) · d4b6d094
  Lei Wang authored Nov 20, 2025
```
* bugfix

* lint fix

* test

* lint fix

* increase procs

* recover
```
  d4b6d094
- [Feat] Add support for using `T.Tensor(n * 2 + 1)` in function annotation (#1285) · bccb6485
  Kuris authored Nov 20, 2025
```
* [Feature] Add support for A: T.Tensor(n + 1) and A: T.Tensor(2*n)

* issue fix

* fix

* fix

* decrease nproc for debugging

---------
Co-authored-by: Lei Wang <leiwang1999@outlook.com>
```
  bccb6485
- [Compatibility] Support CUDA 11.3 (#1290) · bef7e52e
  Lei Wang authored Nov 20, 2025
  
  bef7e52e
19 Nov, 2025 3 commits

[Language][UX] Nested loop checker in pre-lowering stage (#1288) · 9e67b861
Chaofan Lin authored Nov 20, 2025
```
* [Language][UX] Nested loop checker in pre-lowering stage

* rename

* comment

* address comments
```
9e67b861

[Enhancement] Enhance CUDA compilation by integrating pass context configuration (#1283) · 551ac60d

Lei Wang authored Nov 19, 2025

- Updated the `tilelang_callback_cuda_compile` function to accept a `pass_config` parameter, allowing for more flexible compilation options.
- Introduced handling for fast math and PTXAS options based on the provided pass configuration.
- Modified the CUDA build process in `rt_mod_cuda.cc` to utilize the current pass context, improving the integration of compilation settings.
- Refactored NVCC command construction to use a dedicated function for better clarity and maintainability.

551ac60d

[Bugfix] Supply missing `T.print` for bool type (#1279) · 4c8b9ada
Lei Wang authored Nov 19, 2025
```
* fix for bool dtype

* lint fix

* fix

* ci fix
```
4c8b9ada

18 Nov, 2025 2 commits

[FFI] Use tvm ffi as the default execution backend (#1259) · 74da3696

Lei Wang authored Nov 18, 2025

* [Refactor] Update FFI type handling and simplify argument management

* Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity.
* Updated function registration in `runtime.cc` to utilize canonical names for better consistency.
* Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled.
* Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection.
* Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity.

* [Update] Sync TVM submodule and enhance kernel source handling

* Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes.
* Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging.
* Commented out the main execution call in test files to prevent unintended execution during testing.
* Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues.
* Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends.

* [Refactor] Clean up imports and improve code formatting

* Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code.
* Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency.
* Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality.

* Update execution backend options and improve resolution logic

- Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target.
- Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions.
- Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target.
- Updated documentation to reflect changes in execution backend options and their defaults.

* lint fix

* fix

* Enhance argument handling in CUDA and HIP runtime modules

- Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime.
- Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers.
- Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks.

* lint fix

* minor fix

* fix

* recover check

* Refactor argument binding and validation in `arg_binder.cc`

- Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers.
- Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards.
- Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling.
- Minor adjustments in test files to streamline kernel execution and improve readability.

* lint fix

* stride fix

* minor fix

* fix

* lint fix

* Add CUDA stream access policy window helpers and integrate with L2 persistent cache management

- Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage.
- Updated runtime files to include new FFI packed functions for managing stream attributes.
- Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown.
- Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source.

* check with symbolic

* support null ptr

* Update CMakeLists and lower.py for code generation and subproject status

- Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support.
- Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility.
- Marked the TVM subproject as dirty to indicate local modifications.

* lint fix

* Update comments for clarity in quickstart.py

74da3696

Fix various issues under `int64_t` static and dynamic shape. (#1218) · 49c85715

Elevator14B authored Nov 18, 2025



* Fix various issues under int64_t static and dynamic shape.

* Resolve reviewed issues.

* Add unit test.

* fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

49c85715

17 Nov, 2025 1 commit
- [Bugfix] Fix multiple cg definition when using T.sync_grid (#1272) · 220c3236
  Yu Cheng authored Nov 18, 2025
  
  220c3236
16 Nov, 2025 1 commit

[BugFix] Remove memory_order in atomic constexpr and fix NSA bwd (#1260) · 2de566e7

Kevinzz authored Nov 16, 2025



* fix nsa bwd and atomic

* [Lint]

* [BugFix]
- New implementation for atomicMax and atomicMin using atomicCAS
- PTX version atomicAdd for single 16-byte data
- Modify the test cases

* [Lint]

---------
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>

2de566e7
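The commit mentions implementing atomicMax and atomicMin on top of atomicCAS. The general compare-and-swap retry pattern can be sketched in Python as follows. This simulates the technique only and is not the CUDA code from the commit: the `AtomicCell` class is hypothetical, and a lock stands in for the hardware guarantee that a compare-and-swap is indivisible.

```python
import threading


class AtomicCell:
    """A single memory word that only exposes load and compare-and-swap."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        return self._value

    def compare_and_swap(self, expected, desired):
        """Store `desired` if the cell holds `expected`; return the old value."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = desired
            return old


def atomic_max(cell, candidate):
    """Raise the cell to at least `candidate`; return the value seen before."""
    old = cell.load()
    while old < candidate:
        seen = cell.compare_and_swap(old, candidate)
        if seen == old:
            break  # our swap succeeded
        old = seen  # another writer got in first; retry against its value
    return old
```

An `atomic_min` follows the same shape with the comparison reversed. The loop ends either when the swap succeeds or when the stored value already satisfies the bound.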

15 Nov, 2025 1 commit

[fix] NVRTC execution backend (#1256) · eb415744

Gabriel Wu authored Nov 15, 2025

* [fix] NVRTC execution backend

* [fmt] run pre-commit

* [fix] coderabbit reviews

* [test] add cuda-python to test dep

* [fix] coderabbit reviews

* [fix] CUDA 13 compatibility

* [fix] sm90

* [fix] CUDA 13 compatibility

* [fix] pre-commit

* [fix] always use cuda::std::__atomic_ref_impl

* [fix] restore to external API

* Revert "[fix] restore to external API"

This reverts commit 49bd875638fb631d270015f408991d38fd1e9a5d.

* [fmt] use spaces instead of tabs for py codegen

* [fix] im2col API

* [fix] revert atomic.h

* [fix] dynamic shape

* [refactor] extract common utils

* [feat] support L2 persistent map

* [fix] l2 persistent map

* [fix] pre-commit

* [fix] restore _TYPE_MAP

* [fix] pre-commit

* [fix] avoid duplicate TMA descs

* [docs] add docstring

* [fix] coderabbit

* [fix] coderabbit

* [fix] coderabbit

* [fix] coderabbit

eb415744

13 Nov, 2025 4 commits

[Refactor] Update buffer handling in copy and atomic operations (#1247) · 2c0072a8

Lei Wang authored Nov 14, 2025

* [Refactor] Update buffer handling in copy and atomic operations

* Refactored the `copy` and `atomic_add` functions to use element-wise minimum for defining copy extents, ensuring correct handling of overlapping regions.
* Updated utility functions to create `BufferLoad` instances with explicit extents, improving memory management and clarity.
* Removed unused imports from `atomic.py` and `copy.py` to streamline the codebase.
* Adjusted logging in `copy.cc` to provide clearer warnings for fallback scenarios in bulk copy operations.

* Remove obsolete .git_commit.txt file

* Add unit test for dynamic copy extent handling in TileLang

* Introduced a new test file `test_tilelang_issue_1237.py` to verify that the `T.copy` function correctly manages dynamic extents during primitive function building.
* The test reproduces a specific issue related to dynamic slice lengths and static buffer sizes, ensuring robustness in the handling of such scenarios.
* The test does not require execution of the kernel, as building the primitive function is sufficient to validate the fix.

* lint fix

* fix

* Revert "fix"

This reverts commit 828b4c1e4de76a7d11e4d4092927303fbbe00097.

* Update TVM submodule and refactor atomic and copy functions

* Updated the TVM submodule to a dirty state.
* Refactored `atomic_add` and `copy` functions to pass extents explicitly to the `_to_region` helper, improving clarity and correctness in handling buffer regions.
* Commented out the main execution call in the test example for `cast` and added a new function call to better demonstrate the example usage.

* Enhance extent handling in atomic and copy functions

* Introduced `legalize_pairwise_extents` utility to align and broadcast extent lists for `atomic_add` and `copy` functions, ensuring compatibility and correctness in buffer operations.
* Updated both functions to utilize the new utility, improving clarity and robustness in handling dynamic and static extents.
* Added comments to clarify the extent handling logic.

* Enhance `legalize_pairwise_extents` function with early-exit rule

* Added an early-exit condition to the `legalize_pairwise_extents` function to return original extents if the number of non-1 dimensions in both source and destination extents is equal, improving performance by avoiding unnecessary adjustments.
* Updated the function's documentation to clarify the new behavior and maintain clarity in the extent handling logic.

* lint fix

2c0072a8
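The commit describes `legalize_pairwise_extents` only in outline: align two extent lists, broadcast them, and return early when both have the same number of non-1 dimensions. The sketch below is a guess at that shape and not the library's implementation. It assumes integer extents, right-aligned padding with 1s, and NumPy-style broadcasting of unit dimensions.

```python
def count_non_unit(extents):
    """Number of dimensions whose extent is not 1."""
    return sum(1 for e in extents if e != 1)


def legalize_pairwise_extents(src, dst):
    """Toy version: give both extent lists the same rank and broadcast 1s."""
    src, dst = list(src), list(dst)
    # Early exit from the commit text: same number of non-1 dimensions.
    if count_non_unit(src) == count_non_unit(dst):
        return src, dst
    # Assumed alignment rule: pad the shorter list with leading 1s.
    rank = max(len(src), len(dst))
    src = [1] * (rank - len(src)) + src
    dst = [1] * (rank - len(dst)) + dst
    out_src, out_dst = [], []
    for s, d in zip(src, dst):
        # Assumed broadcast rule: a unit extent stretches to match the other.
        if s == 1:
            s = d
        elif d == 1:
            d = s
        out_src.append(s)
        out_dst.append(d)
    return out_src, out_dst
```

Under these assumptions, `([1, 64], [64])` is returned unchanged by the early exit, while `([64], [4, 64])` is padded and broadcast to `([4, 64], [4, 64])`.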

[Language][Reshape] Improve variable handling and ensure correctness during Layout Reshape (#1248) · d7164abf

Lei Wang authored Nov 13, 2025

* fix

* Refactor tensor reshaping in fp8_lighting_indexer.py

- Replaced the allocation of `s_reshaped` with a reshape operation to improve clarity and performance.
- Updated the logic in the computation of `s_reshaped` to utilize the reshaped tensor, enhancing the overall functionality of the attention mechanism.

* Refactor analyzer usage in Layout and Fragment reshaping

- Consolidated analyzer logic in the `Reshape` methods of `LayoutNode` and `FragmentNode` to utilize a fallback analyzer, improving code clarity and preventing potential null dereference issues.
- Updated variable binding and simplification calls to use the selected analyzer consistently, enhancing robustness in shape validation and index computation.

d7164abf

[Bugfix] Fix fp8 dtype for some cases (#1246) · 63bf1609

Lei Wang authored Nov 13, 2025

* [Enhancement] Add FP8 support and reproducibility in lighting indexer

* Introduced a manual seed in `test_fp8_lighting_indexer` to ensure reproducible performance.
* Added specializations for `cute::float_e4m3_t` and `cute::float_e5m2_t` in `gemm_mma.h` for enhanced FP8 support across multiple CUDA architectures, ensuring compatibility and improved functionality.

* Fix typos in `fp8_lighting_indexer.py` and improve formatting in `gemm_mma.h`

* Corrected a typo in the comment for `test_fp8_lighting_indexer` to enhance clarity.
* Reformatted lines in `gemm_mma.h` for better readability by aligning template specializations across multiple CUDA architectures.

* test fix

* bug fix

63bf1609

[Refactor] Phaseout legacy loop vectorize dynamic pass (#1245) · f550a58d

Lei Wang authored Nov 13, 2025



* Deleted the LoopVectorizeDynamic implementation from the transform module.
* Removed associated references in the phase and initialization files to streamline the codebase.
* This change simplifies the transformation pipeline by eliminating unused functionality.
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

f550a58d

12 Nov, 2025 1 commit

[Bugfix] Minor fix for tcgen05 (#1242) · 6882bd50

Lei Wang authored Nov 12, 2025



* Add correctness evaluation script for GEMM v2

- Introduced a new Python script `correctness_evaluation_tcgen05.py` for testing the correctness of GEMM v2 implementations using pytest.
- Implemented matrix multiplication and compilation checks, along with parameterized tests for various input configurations.
- Enhanced the testing framework to validate GEMM operations with different data types and configurations, ensuring robustness in the implementation.
- Updated logging in `legalize_negative_index.cc` to reduce verbosity by changing from WARNING to DLOG.
- Adjusted assertions in `tcgen05_macro_generator.py` to accommodate new warp size requirements for improved performance.
- Removed unused variable in `gemm_tcgen05.py` to streamline the codebase.

* lint fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

6882bd50