Commits · 154799581fdb9a96eed8fd0759c544f1fb7a0081 · OpenDAS / tilelang

17 Sep, 2025 1 commit
- [DSL] Support Python ternary if-then-else expression (#822) · 15479958
  Lei Wang authored Sep 17, 2025
```
* support Python ternary if-then-else expression

* lint fix
```
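A DSL generally cannot evaluate `a if cond else b` eagerly when `cond` is a symbolic value, so one common approach is to rewrite the expression into an intrinsic call before tracing. The sketch below shows that idea with the standard `ast` module. It is an illustration only, not necessarily how #822 implements it, and the `if_then_else` helper here is a hypothetical stand-in for a DSL intrinsic.

```python
import ast


class TernaryToIfThenElse(ast.NodeTransformer):
    """Rewrite `a if cond else b` into `if_then_else(cond, a, b)`."""

    def visit_IfExp(self, node: ast.IfExp) -> ast.AST:
        # Rewrite nested ternaries in the children first.
        self.generic_visit(node)
        call = ast.Call(
            func=ast.Name(id="if_then_else", ctx=ast.Load()),
            args=[node.test, node.body, node.orelse],
            keywords=[],
        )
        return ast.copy_location(call, node)


def if_then_else(cond, then_value, else_value):
    # Hypothetical stand-in: a real DSL would build an IR node here.
    return ("if_then_else", cond, then_value, else_value)


def lower_expr(source: str) -> str:
    """Parse one expression, rewrite its ternaries, and return new source."""
    tree = ast.parse(source, mode="eval")
    tree = ast.fix_missing_locations(TernaryToIfThenElse().visit(tree))
    return ast.unparse(tree)


lowered = lower_expr("x if x > 0 else -x")
print(lowered)
```

Because both branches become ordinary call arguments, they are both evaluated, which is what a select-style intrinsic needs but differs from Python's short-circuit semantics.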
15 Sep, 2025 1 commit

[feat] support gemm_sp for ampere and ada arch (#691) · 0b3683bf

botbw authored Sep 16, 2025



* [feat] add an example mma atom

* [fix] fix typo naming

* [feat] add a template to enable compilation

* [feat] add print util

* [WIP] pass on single block tile

* [feat] add sm80 metadata layout

* [chore] clean codebase

* [CI] format.sh

* [feat] add sm80 compress utils

* [bugfix] fix C fragment layout

* [refactor] use nvcc version instead of str

* [test] add test cases

* [chore] add a param check

* [chore] format a bit

* [chore] rename func to satisfy PEP 8 and appease gemini

* [chore] add check

* [feat] support sm75 layout && add assertion && chore

* [bug] fix illegal memory access when using two warps over N=32

This could be a missing check related to the CUTLASS 2.x implementation.
The CUTLASS example can't trigger this because it is bypassed by
padding the input.

For now I think it might be safe to increase the atom size and
investigate in the future.

* [chore] add example

* [chore] format

* [example] update benchmark

* [bugfix] fix namespace and format

* [bugfix] fix incorrect param passing

* [refactor] update variable declaration for clarity in gemm_layouts and gemm_sp

* [Cleanup] Remove unnecessary blank lines in metadata layout functions in gemm_sp.py

* [CI] fix arch

* [example] add torch sparse benchmark

* [misc] polish && add reference && apply review suggestions && format

* [CI] format with clang-tidy

* [Cleanup] Format and align template struct definitions in half.hpp, common.h, and gemm_sp_sm80.h

* [Update] Modify CUDA version requirements in test_gemm_sp_sm80 and mark cutlass subproject as dirty

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
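For readers unfamiliar with sparse GEMM on these architectures: the operand is stored in 2:4 structured-sparse form, meaning each group of four elements keeps at most two values plus their positions. The sketch below shows that compression conceptually in plain Python. It is not the `sm80 compress utils` from this commit: the function names are hypothetical, and the hardware-specific packing of the 2-bit indices into the metadata layout is not modeled.

```python
def compress_2_4(row):
    """Compress a row that has at most 2 nonzeros per group of 4.

    Returns (values, indices): two kept values per group and their
    positions (0..3) inside the group.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        kept = [i for i, v in enumerate(group) if v != 0]
        if len(kept) > 2:
            raise ValueError("group violates 2:4 sparsity")
        # Pad with zero-valued positions so every group keeps exactly 2.
        for i in range(4):
            if len(kept) == 2:
                break
            if i not in kept:
                kept.append(i)
        kept.sort()
        values.extend(group[i] for i in kept)
        indices.extend(kept)
    return values, indices


def decompress_2_4(values, indices):
    """Inverse of compress_2_4, used here only to check the round trip."""
    row = []
    for g in range(0, len(values), 2):
        group = [0, 0, 0, 0]
        for v, i in zip(values[g:g + 2], indices[g:g + 2]):
            group[i] = v
        row.extend(group)
    return row


dense = [0, 3, 0, 5, 7, 0, 0, 0]
vals, idx = compress_2_4(dense)
```

The compressed operand is half the size of the dense one, and the indices are what the metadata tensor encodes.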


13 Sep, 2025 1 commit
- [Lint] Add ruff config to check for useless spaces (#807) · 5e529522
  Yichen Yan authored Sep 13, 2025
```
* update lint config

* Remove spaces for blank line

* update
```
12 Sep, 2025 1 commit
- [AMD] support preshuffle weight mfma (#806) · 143b5222
  Jiaxing Ding authored Sep 12, 2025
```
Co-authored-by: Jiaxing Ding <jiaxing.ding@bytedance.com>
```
10 Sep, 2025 2 commits

[TileOp] Introduce an experimental Python-defined `T.gemm_v2` (#793) · 91a7bb2b

Lei Wang authored Sep 11, 2025

* Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability

- Removed deprecated prime factorization functions from `gemm.cc` and `gemm_sp.cc`.
- Introduced a new `GemmWarpPolicy` class to manage warp policy attributes and methods, improving encapsulation.
- Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.
- Enhanced `GetArchInt` function in `utils.cc` for better readability and type safety.
- Added new `gemm_v2` function in `gemm.py` for improved GEMM operation with additional parameters and checks.

* Refactor GEMM and frontend legalize operations for improved clarity and functionality

- Updated `gemm_py.h` to include the correct header for GEMM operations.
- Renamed `FrontendLegalizer` class to `LetInliner` and updated related methods to reflect this change, enhancing code clarity.
- Modified the pass function from `FrontendLegalize` to `LetInline` for better alignment with its purpose.
- Updated test cases to utilize the new `gemm_v2` function and adjusted the testing framework for improved output and clarity.
- Removed obsolete test file `test_tilelang_transform_frontend_legalize.py` to streamline the test suite.
- Enhanced the `LowerAndLegalize` function to utilize the new `LetInline` pass, improving the overall transformation process.

* Enhance CUDA code generation and testing for GEMM operations

- Added indentation printing in `codegen_cuda.cc` for improved assembly code formatting.
- Updated `test_tilelang_tilelibrary_gemm.py` to include additional GEMM test cases and shared memory allocation with specified scope.
- Introduced new `matmul_sr` and `run_gemm_sr` functions for GEMM operations with shared and fragment memory layouts.
- Refactored layout inference in `mma_macro_generator.py` to improve clarity and correctness in shared memory handling.
- Enhanced `gemm/__init__.py` to support new GEMM operation combinations and layout inference logic.

These changes improve the clarity, functionality, and testing coverage of GEMM operations in the TileLang framework.

* Refactor GEMM layout and testing for improved clarity and functionality

- Updated `gemm_layouts.cc` to enhance the layout generation logic for transposed and non-transposed GEMM operations.
- Renamed and modified functions in `test_tilelang_tilelibrary_gemm.py` to reflect changes in GEMM function signatures and improve test coverage.
- Introduced new GEMM operation combinations in `gemm/__init__.py` to support additional layouts and configurations.
- Enhanced layout inference in `mma_layout.py` and `mma_macro_generator.py` for better handling of shared memory layouts.

These changes improve the clarity, functionality, and testing coverage of GEMM operations in the TileLang framework.

* Refactor GEMM layout and Python integration for improved functionality

- Updated `gemm_layouts.cc` to correct the order of layout replication and repetition for transposed and non-transposed GEMM operations.
- Enhanced `gemm_py.cc` to handle block realization more robustly, ensuring correct assignment of global symbols and block attributes.
- Refactored `inject_pipeline.cc` to streamline buffer read/write region handling, improving clarity and maintainability.
- Cleaned up test cases in `test_tilelang_tilelibrary_gemm.py` by removing unnecessary print statements and adjusting function calls for better test execution flow.

These changes enhance the clarity, functionality, and robustness of GEMM operations and their testing in the TileLang framework.

* Refactor GEMM layout and testing for improved clarity and functionality

- Updated `gemm_layouts.cc` to enhance layout generation logic for transposed and non-transposed GEMM operations.
- Improved block realization handling in `gemm_py.cc` for better assignment of global symbols.
- Streamlined buffer read/write region handling in `inject_pipeline.cc` for clarity.
- Enhanced test cases in `test_tilelang_tilelibrary_gemm.py` by adjusting function calls and adding new GEMM operation combinations.

These changes improve the clarity, functionality, and robustness of GEMM operations and their testing in the TileLang framework.

* tfloat32 support.

* lint fix

* lint fix

* Refactor shared memory allocation in GEMM tests

- Removed unnecessary scope specification in shared memory allocation for matrices A and B in `test_tilelang_tilelibrary_gemm.py`.
- This change simplifies the allocation process and aligns with the updated GEMM function signatures.


[AMD] support mfma i32_16x16x32_i8 (#800) · 9fd6bb30
Jiaxing Ding authored Sep 10, 2025
```
Co-authored-by: Jiaxing Ding <jiaxing.ding@bytedance.com>
```
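As a reading aid for the instruction name: `i32_16x16x32_i8` denotes an M×N×K = 16×16×32 tile with int8 inputs and int32 accumulators. The sketch below is a plain-Python reference for that arithmetic only. It says nothing about how the operands are distributed across lanes or registers, and the function name is a hypothetical chosen for illustration.

```python
def mfma_i32_16x16x32_i8(a, b, c):
    """Reference math: C[16x16] + A[16x32] @ B[32x16].

    `a` and `b` hold int8 values, `c` holds int32 accumulators.
    Python integers do not wrap, so overflow is not modeled.
    """
    m, n, k = 16, 16, 32
    assert len(a) == m and all(len(row) == k for row in a)
    assert len(b) == k and all(len(row) == n for row in b)
    assert all(-128 <= v <= 127 for row in a for v in row)
    assert all(-128 <= v <= 127 for row in b for v in row)
    return [
        [c[i][j] + sum(a[i][x] * b[x][j] for x in range(k)) for j in range(n)]
        for i in range(m)
    ]


a = [[1] * 32 for _ in range(16)]
b = [[2] * 16 for _ in range(32)]
c = [[5] * 16 for _ in range(16)]
d = mfma_i32_16x16x32_i8(a, b, c)
```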

31 Aug, 2025 2 commits

📝 Add docstrings to `reducer_0825` (#772) · 9a869396

coderabbitai[bot] authored Aug 31, 2025

* 📝 Add docstrings to `reducer_0825`

Docstrings generation was requested by @LeiWang1999.

* https://github.com/tile-ai/tilelang/pull/757#issuecomment-3219088118



The following files were modified:

* `setup.py`
* `src/op/builtin.h`
* `src/op/finalize_reducer.cc`
* `src/op/finalize_reducer.h`
* `src/op/parallel.cc`
* `src/op/parallel.h`
* `src/op/reduce.cc`
* `src/target/codegen_cuda.cc`
* `src/tl_templates/cuda/common.h`
* `src/transform/layout_inference.cc`
* `src/transform/layout_reducer.cc`
* `src/transform/layout_reducer.h`
* `src/transform/merge_shared_memory_allocations.cc`
* `src/transform/storage_access.cc`
* `src/transform/warp_specialized_rewriter.cc`
* `testing/python/autotune/test_tilelang_autotune_with_inputs.py`
* `tilelang/engine/phase.py`
* `tilelang/language/customize.py`
* `tilelang/language/reduce.py`
* `tilelang/transform/__init__.py`

* lint fix

* lint fix

---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>


[Reducer] Introduce `alloc_reducer` to separate inter and intra warp reduction (#757) · 8eab7755

Lei Wang authored Aug 31, 2025



* [Enhancement] Introduce finalize_reducer operator and layout reducer support

- Added `FinalizeReducer` operator to handle reduction finalization in the TileLang framework, allowing for efficient reduction operations.
- Implemented layout inference for local.reducer buffers, enhancing the handling of layout mappings and reducing complexity in buffer management.
- Updated `setup.py` to include logging for build directory paths, improving build process visibility.
- Enhanced atomic operations with new functions for atomic max, min, load, and store, providing more robust atomicity control in memory operations.
- Refactored parallel loop handling to incorporate reducer information, ensuring proper management of reduction operations in parallel contexts.
- Cleaned up test cases by removing unnecessary cache disabling and optimizing test parameters for better performance.

* Refactor code formatting and improve readability in multiple files

- Cleaned up whitespace in `setup.py` to enhance logging clarity.
- Reformatted `AtomicMax` and `AtomicMin` functions in `common.h` for better alignment and readability.
- Adjusted `debug_print_var` function in `debug.h` to improve code structure and maintainability.
- Enhanced readability of the `atomic_add` function in `customize.py` by breaking long lines for better clarity.

* Remove debug print statements from `copy.cc` and `inject_tma_barrier.cc` to enhance code clarity and maintainability.

* [Enhancement] Disable reuse of small arrays in shared memory allocation

- Added logic to prevent the reuse of small arrays (<= 32 bits) in `merge_shared_memory_allocations.cc`, ensuring they are lowered to registers in LLVM for improved performance and memory management.

* Refactor `setup.py` to remove duplicate logging statements and enhance clarity. Update `finalize_reducer` function documentation in `reduce.py` to include detailed parameter and return descriptions, improving code readability and maintainability.

* Refactor `finalize_reducer` and `reduce` functions to remove redundant target checks. Simplified conditionals by retaining only the `TargetIsHopper` check, enhancing code clarity and maintainability.

* bug fix

* Add thread checks workaround for replicated cases

* Remove the is_one check

* fix lint error

* lint fix

* Update autotune tests to use smaller matrix sizes for improved performance and reliability

* [Refactor] Update FinalizeReducer to FinalizeReducerOp and adjust related methods

- Refactored FinalizeReducer class to FinalizeReducerOp, updating constructor and method signatures for consistency with the new TileOperator structure.
- Enhanced layout inference and cloning methods in FinalizeReducerOpNode.
- Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main.
- Adjusted header inclusions for improved organization and clarity across multiple files.

* [Refactor] Update atomic operations in common.h and modify test_example_flash_attention.py

- Enhanced atomic operations (Add, Min, Max) in common.h to handle half and bfloat16 types more efficiently.
- Updated test_example_flash_attention.py to call test_example_gqa_bwd instead of tilelang.testing.main, improving test organization.

* [Refactor] Simplify CopyNode::LowerBulkCopy logic and update test execution

- Removed redundant checks for contiguous memory access in CopyNode::LowerBulkCopy, streamlining the logic for TMA copy operations.
- Updated test_tilelang_kernel_gemm.py to comment out the main testing function and call a specific test for i8i8i32 tensor operations instead, improving test focus.

---------
Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
Co-authored-by: Freebase6912 <amid-gauze-racing@duck.com>
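One way to read the title is that accumulation and finalization become separate steps: each thread accumulates into a private partial, and the partials are combined once at the end, first within each warp and then across warps. The plain-Python simulation below sketches that structure under this assumption. It does not use `alloc_reducer` or `finalize_reducer`, and the names and thread counts are hypothetical.

```python
WARP_SIZE = 32


def block_sum(values, num_threads=128):
    """Simulate a block-wide sum in three stages."""
    assert num_threads % WARP_SIZE == 0

    # Stage 1: every thread accumulates its own strided slice privately.
    partials = [0] * num_threads
    for i, v in enumerate(values):
        partials[i % num_threads] += v

    # Stage 2: intra-warp reduction (shuffles on real hardware).
    warp_sums = [
        sum(partials[w:w + WARP_SIZE])
        for w in range(0, num_threads, WARP_SIZE)
    ]

    # Stage 3: inter-warp reduction (typically through shared memory).
    return sum(warp_sums)


total = block_sum(list(range(1000)))
```

Keeping stage 1 free of cross-thread communication is the point of the split: the expensive combine runs once rather than on every loop iteration.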


28 Aug, 2025 1 commit

[Bugfix] Address PassContext contamination from CI and fix incorrect rewrites in warp specialized pass (#767) · ff35fc08

Wenhao Xie authored Aug 28, 2025

* fix ci and pass bug

* fix

* try

* lint


24 Aug, 2025 2 commits

[Typo] Remove `disable_cache` in some tests (#755) · 556d411e

Lei Wang authored Aug 25, 2025

* Update test parameters and remove debug print statement

- Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
- Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.

* Refactor loop stack management in warp_specialized_rewriter

- Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
- Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
- Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.

* Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability.


[Bugfix][WS] Consider loop min extent when computing phase id (#754) · b39aaf5b

Lei Wang authored Aug 24, 2025

* Update test parameters and remove debug print statement

- Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
- Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.

* Refactor loop stack management in warp_specialized_rewriter

- Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
- Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
- Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.
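To see why the loop minimum matters, consider flattening nested loop variables into a single iteration counter: if a loop starts at a nonzero `min`, using the raw loop variable shifts the counter and therefore the derived phase. The sketch below mirrors the `LoopInfo` fields named above. The stage and parity formula is the usual convention for multi-stage barriers and is an assumption here, not something taken from the pass.

```python
from dataclasses import dataclass


@dataclass
class LoopInfo:
    loop_var: str
    extent: int
    min: int


def linear_index(loop_stack, values):
    """Flatten nested loops (outermost first) into a zero-based counter."""
    index = 0
    for info in loop_stack:
        # Subtract the loop minimum so each loop contributes 0..extent-1.
        index = index * info.extent + (values[info.loop_var] - info.min)
    return index


def stage_and_parity(linear, num_stages):
    """Assumed convention: stage cycles, parity flips once per full cycle."""
    return linear % num_stages, (linear // num_stages) % 2


loops = [LoopInfo("i", extent=3, min=0), LoopInfo("k", extent=4, min=2)]
idx = linear_index(loops, {"i": 1, "k": 2})
```

With `k` starting at 2, the first inner iteration of `i = 1` maps to counter 4. Ignoring the minimum would give 6 and, with two stages, the opposite parity.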


23 Aug, 2025 1 commit

[Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741) · 6b125028

Lei Wang authored Aug 23, 2025

* Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency.

* Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency.

* Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability.

* Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality.

* typo fix

* Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues.

* lint fix

* test fix

* Enhance CI configuration by adding verbose output to pip install command for better visibility during installation.

* use ninja instead of make

* Add CMake configuration step for Ninja build system in setup.py

* Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja.

* Enhance CI configuration by adding verbose output to pytest commands for improved test visibility.

* Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks.

* Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly.

* Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes.

* Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. Update buffer index calculations to ensure consistent simplification of range expressions.

* bugfix

* Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp.

* Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts.

* Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality.


22 Aug, 2025 1 commit

[Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746) · 5c11d245

Lei Wang authored Aug 22, 2025

* [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy

* Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
* Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
* Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.

* lint fix

* Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.

* remove bulk copy

* Refactor copy and atomic add operations to support TMA lower configuration

- Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
- Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
- Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
- Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
- Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.

* Enhance TMA bulk copy logic in `LowerBulkCopy` method

- Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
- Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.

* lint fix

* Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.

* Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions

- Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
- Added TODO comments to indicate the need for further improvements in shared memory handling.

* Update `native_sparse_attention` function to include TMA configuration options

- Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
- Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.
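A `pass_configs` mapping of the kind described in the two items above might look like the following. The key strings are assumptions recalled from tilelang's `PassConfigKey` names and should be checked against the installed version. The commented call shows where the mapping would be passed, and `program` is a hypothetical placeholder.

```python
# Assumed key spellings; verify against tilelang's PassConfigKey.
pass_configs = {
    "tl.disable_tma_lower": True,         # skip TMA lowering for copies
    "tl.disable_warp_specialized": True,  # skip the warp-specialization pass
}

# Hypothetical usage, not executed here:
# kernel = tl.compile(program, pass_configs=pass_configs)
```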

* Refactor JIT decorator formatting in `native_sparse_attention` function

- Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
- No functional changes were made; this update focuses on code clarity and maintainability.

* Enhance thread management and logging in TileLang compilation

- Added a method to check if printing is enabled during compilation, improving control over logging behavior.
- Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
- Added comments to clarify the purpose of changes and improve code readability.

* Add warp specialization scope and refactor register management in TileLang

- Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
- Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
- Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
- Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.

* Refactor test for InjectSetMaxNReg pass in TileLang

- Improved readability by restructuring conditional checks and assertions in the test cases.
- Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
- Ensured consistent formatting and spacing throughout the test functions for better maintainability.

* Enhance bulk copy and store checks in `Copy` class

- Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
- Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
- Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.

* lint fix


21 Aug, 2025 1 commit

[Refactor] Refactor barrier management (#744) · cb37bfef

Lei Wang authored Aug 21, 2025

* Introduce Barrier

* Enhance CUDA kernel with new barrier management and post-processing support

- Added a new CUDA kernel implementation in `example_mla_decode.py` for improved performance with shared memory barriers.
- Refactored barrier handling in `codegen_cuda.cc` and `codegen_hip.cc` to utilize a more flexible mbarrier structure.
- Updated intrinsic definitions from `ptx_stmatirx` to `ptx_stmatrix` across multiple files for consistency.
- Introduced additional print statements for debugging in the lowering phase of the TileLang engine.
- Enhanced the overall structure and readability of the codebase.

* Remove unused barrier handling code in CUDA and HIP code generators to streamline the implementation. This change enhances code clarity and reduces complexity in the barrier management logic.

* Enhance barrier management in TileLang

- Introduced a new intrinsic `allocate_barrier` for dynamic barrier allocation in the TileLang framework.
- Updated CUDA code generation to support the new barrier structure, allowing for improved synchronization in shared memory.
- Refactored existing barrier handling logic to accommodate the new intrinsic and streamline code.
- Added print statements for debugging purposes in various examples and the lowering phase of the TileLang engine.
- Removed deprecated memory scope handling code to enhance clarity and maintainability.

* lint fix

* lint fix

* Remove `allocate_barrier` intrinsic and related code from TileLang to streamline barrier management. This includes updates to CUDA code generation and the removal of associated Python wrappers, enhancing code clarity and maintainability.

* Refactor logging in JITKernel to improve kernel compilation tracking

- Removed unused import of `torch.backends` in the example file.
- Introduced logging for kernel compilation in `JITKernel`, replacing print statements with structured logging for better traceability and debugging.
- Added an assertion to ensure the presence of the `global_symbol` attribute in the kernel function.

* Refactor dequantization tests and update barrier function

- Removed the test for `example_dequant_gemm_bf16_fp4_hopper_serial` to streamline the testing suite.
- Updated the `mbarrier_cp_async_arrive` function to support both pointer and non-pointer types, enhancing flexibility in barrier management.

* Update CI configuration to increase pytest parallelism from 4 to 8 threads for improved test execution speed.

* Fix typos in rasterization parameters and update import path for cached module

- Corrected the spelling of `enable_rasteration` to `enable_rasterization` in the matmul function and its usage.
- Updated the import statement for the `cached` module to reflect the new path in the cache submodule.
- Added `StridedTensor` import in the language module for enhanced tensor functionality.

* Update ci.yml


19 Aug, 2025 1 commit

[Refactor] Refactor env into a more flexible version (#740) · 72be4909

Lei Wang authored Aug 19, 2025

* Fix environment variable name for compilation print setting in `env.py`

* Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity.

* lint fix

* Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py`
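Centralizing environment access behind a small class is commonly done with a descriptor, so that every read goes to `os.environ` at access time rather than being frozen at import. The sketch below illustrates the pattern. It is not the `EnvVar` class from `env.py`: the variable names and the parsing rule are hypothetical, and only the `is_cache_enabled()` method name is taken from the commit message.

```python
import os


class EnvVar:
    """Descriptor that reads an environment variable on every access."""

    def __init__(self, key, default):
        self.key = key
        self.default = default

    def __get__(self, instance, owner=None):
        return os.environ.get(self.key, self.default)

    def __set__(self, instance, value):
        # Writes go back to the process environment as strings.
        os.environ[self.key] = str(value)


class Environment:
    # Hypothetical variable names, used only for this illustration.
    CACHE_DISABLED = EnvVar("DEMO_DISABLE_CACHE", "0")
    PRINT_ON_COMPILATION = EnvVar("DEMO_PRINT_ON_COMPILATION", "1")

    def is_cache_enabled(self):
        return self.CACHE_DISABLED.lower() not in ("1", "true", "yes", "on")


env = Environment()
```

Callers such as a tuner then ask `env.is_cache_enabled()` instead of parsing the raw variable themselves, which keeps the parsing rule in one place.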


17 Aug, 2025 1 commit

[Language] Introduce `StridedTensor` to support non-contiguous torch inputs (#722) · 1b308baf

Lei Wang authored Aug 18, 2025



* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Support strided tensors

* Refactor target attribute helper functions for improved clarity

* No code changes made in proxy.py and setup.py

* lint fix

* lint fix via gemini

* lint fix

* test fix

* test fix

* lint fix

* Update wrapper.py

* test fix

* Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity.

* lint fix

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
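For context, a tensor is non-contiguous when its strides differ from the row-major strides implied by its shape, as happens after slicing or transposing. The sketch below shows the address arithmetic a kernel must then use. It is a plain-Python illustration with hypothetical helper names and does not use `StridedTensor` itself.

```python
def contiguous_strides(shape):
    """Row-major strides, in elements, for a dense tensor of this shape."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    # Stricter than torch, which ignores strides of size-1 dimensions.
    return tuple(strides) == contiguous_strides(shape)


def element_offset(index, strides):
    """Offset of `index` from the base pointer, in elements."""
    return sum(i * s for i, s in zip(index, strides))


# A (4, 8) buffer sliced as [:, :4] keeps the parent's row stride of 8.
view_shape, view_strides = (4, 4), (8, 1)
offset = element_offset((2, 3), view_strides)
```

A kernel that assumed contiguity would compute `2 * 4 + 3 = 11` for the same index and read the wrong element, which is why the strides need to be part of the tensor's declared type.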


15 Aug, 2025 1 commit
- [Chore] fix typos (#719) · d0742860
  Gabriel Wu authored Aug 15, 2025
```
* chore: fix typos

* chore: fix ruff

* chore: fix clang-format
```
13 Aug, 2025 1 commit

[Pipeline] Phase out fragment and double buffer info from pipeline pass (#711) · 49d5d80e

Lei Wang authored Aug 13, 2025

* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Refactor inject_pipeline.cc to enhance pipeline body rewriting and condition handling

- Introduced a new function to replace IfThenElse nodes with their then_case while preserving attributes.
- Streamlined the PipelineBodyRewriter to improve buffer access rewriting and async state management.
- Enhanced the handling of pipeline loop conditions and added support for predicate conditions in the pipeline body.
- Removed obsolete code and improved overall code clarity and maintainability.

* lint fix

* Refactor return statements in inject_pipeline.cc to remove unnecessary std::move calls

- Updated return statements in multiple methods to return objects directly instead of using std::move, improving code clarity and potentially avoiding unnecessary moves.
- Ensured consistent handling of BufferStore and BufferLoad nodes during pipeline transformations.

* test fix


11 Aug, 2025 1 commit

[Feat] Support mma gemm with stride (#701) · fe70549f

FeiyangChen authored Aug 11, 2025



* gemm_with_stride sm89

* fix offset issue

* bug fix

* format

* sm80 support

* add sm90

* add testing

* format

* add static_assert for wgmma

* Enhance error message for inner_box_dim validation in LowerBulkCopy

* lint fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>


10 Aug, 2025 1 commit

[Pipeline] Optimize inject software pipeline and pipeline planning pass (#706) · 376ba9eb

Lei Wang authored Aug 10, 2025

* Refactor inject_pipeline.cc to improve version handling and add unique producer head tracking

- Updated version check to allow for cases with two or more versions.
- Adjusted logic to decrement num_versions when multi-versioning is not needed.
- Introduced a helper function to ensure unique producer heads are added to the commit group.
- Removed obsolete AddAllocBuffers method to streamline code.

* lint fix

* Refactor pipeline planning logic to enhance copy stage dependency management

- Removed obsolete conditional expression handling from the pipeline planning code.
- Introduced a new structure to manage copy stage dependency reads, improving clarity and efficiency.
- Updated logic to correctly identify producer stages for copy stages, ensuring accurate pipeline stage assignment.
- Added a new block sparse matrix multiplication function in the testing suite to validate the pipeline planning changes.

* Update ci.yml

* Fix structural equality checks in AddUnique and Contains methods to compare buffer references instead of entire regions in pipeline planning.

* Refactor pipeline planning logic to improve copy stage dependency propagation

- Updated structural equality checks in AddUnique and Contains methods to use buffer reference comparison.
- Enhanced the iteration logic for managing copy stage dependencies, ensuring accurate identification of producer stages.
- Added safeguards against exceeding maximum iterations during dependency propagation.


03 Aug, 2025 1 commit

[Refactor] Rebase pipeline injector from upstream tvm (#687) · 73bf8346

Lei Wang authored Aug 03, 2025

* [Enhancement] Introduce software pipeline rewriter and refactor buffer access handling

- Added a new `PipelineOpaqueAccessRewriter` class to manage opaque buffer accesses in the software pipeline.
- Refactored the `PipelineBodyRewriter` to utilize the new rewriter for improved buffer access handling.
- Enhanced the `PipelineRewriter` to support additional fragment information and streamline pipeline construction.
- Updated tests to reflect changes in buffer management and access patterns, ensuring compatibility with the new structure.
- Removed obsolete code related to previous buffer access methods for clarity and maintainability.

* test fix


30 Jul, 2025 2 commits

[CI] Update CI workflow to use Python 3.12 (#679) · eb026b79

Lei Wang authored Jul 30, 2025

* Update CI workflow to use Python 3.12 and enable build isolation for pip installations

- Changed the Python version in the CI configuration from 3.9 to 3.12 to ensure compatibility with the latest features and improvements.
- Updated the `PIP_NO_BUILD_ISOLATION` environment variable from `0` to `1` in the CI configuration, allowing pip to install testing requirements with build isolation enabled, which enhances the installation process during CI runs.

* Update CI workflow to trigger on pull requests instead of pull_request_target

- Changed the event trigger in the CI configuration from `pull_request_target` to `pull_request` to ensure the workflow runs on pull requests, enhancing the integration process.

* Refactor CI workflow to remove unnecessary repository and token settings

- Removed the repository and token parameters from the checkout step in the CI configuration, simplifying the workflow setup and improving security by not exposing sensitive information.

* Remove pip install command from CI workflow to streamline installation process

* Refactor reshape functions and tests for shared memory operations

- Renamed and updated `reshape_test_smem` to `reshape_test_smem_1d_2_2d` and `run_reshape_smem` to `run_reshape_smem_1d_2_2d` for clarity.
- Introduced a new reshape function `reshape_test_smem_2d_2_1d` and its corresponding runner `run_reshape_smem_2d_2_1d`.
- Updated tests to reflect the new function names and added a test for the 2D to 1D reshape functionality, enhancing test coverage and clarity.

eb026b79

Refactor to support upstream tvm (#595) · a7c9a8b9

Siyuan Feng authored Jul 30, 2025



**Summary of part of the rebase PR:**

1. **Support T.thread_return() → CUDA return syntax**  
   Added support for translating `T.thread_return()` to CUDA's native `return` statement.

2. **Dynamic type support for function inputs**  
   Functions now accept dynamically typed parameters using `typing`:
   ```python
   dyn_type = T.int32  # or T.float
   @T.prim_func
   def main(
       a: dyn_type,
   ):
       ...
   ```

3. **Device Function Codegen**  
   Added support for generating `__device__` functions in CUDA:
   ```python
   @I.ir_module
   class Module:
       @T.prim_func(private=True)
       def add(a: T.int32, b: T.int32) -> T.int32:
           return a + b

       @T.prim_func
       def main(
           A: T.Buffer((128, 128), "int32"),
           B: T.Buffer((128, 128), "int32"),
           C: T.Buffer((128, 128), "int32"),
       ):
           T.func_attr({"global_symbol": "main"})
           length: T.int32 = Module.add(64, 64)  # Host call
           for bx in T.thread_binding(length, "blockIdx.x"):
               for tx in T.thread_binding(length, "threadIdx.x"):
                   C[bx, tx] = Module.add(A[bx, tx], B[bx, tx])  # Device call
   ```
   After compilation, `add` becomes a CUDA `__device__` function.

4. **Cython-based Python/C++ interop**  
   Replaced ctypes with Cython for all Python/C++ interactions:
   - Python → C++ calls
   - C++ → Cython calls  
   This improves performance by around 100x and reduces CPU overhead during compile/runtime.

5. **FP8 data type standardization**  
   Migrated `e5m2_float8` and similar types to the Torch-standardized variants, such as `float8_e5m2`.



* Refactor CMakeLists.txt to set default build type and manage dependencies for tvm_cython modules

* Update default value of `check_well_formed` parameter in `prim_func` to False for improved flexibility in TIR function parsing.

* Add StorageRewrite function to transform module

Introduced the StorageRewrite function in the tilelang.transform module, which returns a TVM transform pass. This addition enhances the functionality of the module by providing a new transformation option for users.

* Refactor null option handling in IR and layout inference

- Updated instances of `NullOpt` to `std::nullopt` in `ir.cc` and `parallel.cc` for consistency with modern C++ practices.
- Enhanced layout inference logic in `layout_inference.cc` to improve type safety by replacing `as<Fragment>().get()` with `as<FragmentNode>()`.
- Adjusted error handling in `multi_version_buffer_rewriter.cc` and `persist_threadblock.cc` to use more concise null checks.
- Cleaned up test files by commenting out `tilelang.testing.main()` and replacing it with specific test function calls for better clarity.
- Removed unused test file `test_tilelang_kernel_deepseek_nsa.py` to streamline the testing suite.

* Update TVM subproject and refactor cluster planning and tile operation handling

- Updated the TVM subproject to a dirty commit state.
- Refactored copyright headers in `cluster_planning.cc` to reflect the new licensing.
- Enhanced error handling in `lower_tile_op.cc` to check for missing padding map annotations.
- Modified test files to improve clarity and functionality, including adjustments to kernel compilation and test assertions.
- Updated various test cases to ensure proper handling of annotations and configurations in the TileLang testing framework.

* Update annotation type in warp specialized test for consistency

- Changed the annotation type in the `test_warp_specialized` function from a literal integer to `T.int32(3)` for improved type safety and consistency with the TileLang framework.

* Refactor test execution in warp specialized test

- Replaced the direct call to `test_warp_specialized()` with `tilelang.testing.main()` in the test file to standardize test execution and improve integration with the TileLang testing framework.

* refactor

* [Enhancement] Add strict layout map for improved buffer layout inference (#594)

- Introduced a `strict_layout_map` to enhance layout inference by ensuring that buffers with strict layout requirements are properly accounted for during the inference process.
- Updated the inference logic to check for the presence of buffers in the `strict_layout_map` before applying layout changes, improving the accuracy of layout assignments.
- Refactored the layout inference steps to include the copying of layouts into the new strict map, ensuring a clear separation of layout handling based on inference levels.

* [Example] Update examples to use @tilelang.jit (#597)

* [Example] Update kernel compilation in examples to use @tilelang.jit

- Refactored multiple examples to eliminate the use of `tilelang.compile` for kernel creation, directly invoking the functions instead.
- Added `@tilelang.jit` decorators with appropriate output indices to enhance performance and maintainability.
- Improved code clarity by simplifying the kernel invocation process across various examples, ensuring consistency in how kernels are defined and executed.

* format

* Update example_tilelang_sparse_gqa_decode_varlen_indice.py

* Update example_dequant_gemm_fine_grained.py

* Update example_gemm_autotune.py

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Enhancement] Refine error messaging in LowerBulkCopy for global and shared range checks (#599)

* [Enhancement] Improve error messaging for global and shared range legality checks in LowerBulkCopy

- Updated error messages in the LowerBulkCopy function to provide clearer context when global and shared ranges are illegal.
- Enhanced the readability of the error output by including tensor names, improving debugging and validation processes during bulk copy operations.

* [Enhancement] Refine error messaging in LowerBulkCopy for global and shared range checks

- Improved the clarity of error messages in the LowerBulkCopy function by enhancing the output format.
- Included additional context in error messages to aid debugging when global and shared ranges are found to be illegal, ensuring better traceability during bulk copy operations.

* [Enhancement] Introduce PassConfig `TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE` to enable aggressive shared memory reuse (#602)

* [Enhancement] Add aggressive shared memory merge option in memory allocation

- Introduced a new configuration option `tl.enable_aggressive_shared_memory_merge` to enable aggressive merging of shared memory allocations.
- Updated the `SharedMemLinearAccessPatternFinder` class to support an aggressive merge strategy, allowing for improved memory reuse.
- Modified the `MergeSharedMemoryAllocations` function to incorporate the new merging strategy based on the configuration.
- Enhanced the `PassConfigKey` enumeration to include the new aggressive merge option, ensuring it can be configured appropriately.

* lint fix

* [Enhancement] Add aggressive shared memory merge configuration option

- Introduced a new configuration option `kEnableAggressiveSharedMemoryMerge` to enable aggressive merging of shared memory allocations, enhancing memory management capabilities.

* [Enhancement] Update MergeSharedMemoryAllocations to support aggressive merge option

- Modified the `MergeSharedMemoryAllocations` function to accept an `enable_aggressive_merge` parameter, allowing for more flexible memory management.
- Introduced a new helper function `should_enable_aggressive_merge` to determine the aggressive merge configuration based on the pass context and target.
- Updated the relevant calls in the `phase.py` and `__init__.py` files to utilize the new aggressive merge functionality, enhancing the overall memory allocation strategy.

* [Refactor] Update accumulation handling in gemm_sm90.h (#603)

- Replaced the use of `tiled_mma.accumulate_ = GMMA::ScaleOut::Zero` with a call to `clear(acc)` for better clarity and maintainability in the accumulation logic.
- This change enhances the readability of the code by standardizing the approach to clearing accumulation values across multiple sections of the file.

* [Enhancement] Add tma bulk copy. (#600)

* [Bugfix] Fixed mha_bwd shape inconsistency error (#604)

* lint fix

* Update requirements-lint.txt to maintain clang-format version consistency

* [Bugfix] Avoid duplicate data access when a cross-thread buffer meets a replicated register (#606)

* [Enhancement] Improve debug output formatting in layout and fragment nodes

- Updated the `DebugOutput` methods in `LayoutNode` and `FragmentNode` to provide more structured and informative output, including transformation details and thread range information.
- Enhanced layout inference logic in `ParallelOp` to add predicates for cross-thread shared memory access, improving layout handling in parallel operations.
- Minor adjustment in `layout_inference.cc` to ensure clarity in parallel loop handling.

* lint fix

* [Enhancement] Support tf32 gemm_rs (#607)

- Added a line break in `quickstart.py` for better readability.
- Simplified the JIT kernel compilation in `quickstart.py` by removing the unused execution backend option.
- Modified `example_elementwise_add.py` to disable cache for `tilelang` and optimized the element-wise addition kernel by utilizing shared memory for input tensors, improving performance.
- Updated default values for matrix dimensions and block sizes in the argument parser to enhance usability.

* [Enhancement] Introduce option `TL_DISABLE_FAST_MATH` and `TL_ENABLE_PTXAS_VERBOSE_OUTPUT` (#609)

* [Enhancement] Introduce new PassConfig options for fast math and PTXAS verbosity

- Added `kDisableFastMath` and `kEnablePTXASVerboseOutput` configuration options to enhance control over compilation settings.
- Updated `LibraryGenerator` to utilize these new pass configurations, allowing for more flexible compilation behavior based on user preferences.
- Enhanced `PassConfigKey` enumeration to include the new options, ensuring they can be configured appropriately in the pass context.

* [Refactor] Update PTXAS verbosity configuration key in LibraryGenerator

- Changed the configuration key for PTXAS verbosity from `TL_VERBOSE_PTXAS_OUTPUT` to `TL_ENABLE_PTXAS_VERBOSE_OUTPUT` to align with the new naming convention introduced in recent enhancements.
- This update ensures consistency in the configuration options used within the `LibraryGenerator` class, improving clarity and maintainability of the code.

* lint fix

* fix build

* [Experimental][Language] add `T.GEMM_SP` for sm90 sparse tensor core (#526)

* [experimental] add a draft gemm_sp

* [3rdparty] bump cutlass to v3.9.3

* [lint] run format.sh

* [chore] rebase

* [chore] use abs path

* [gemm_sp] add metadata layout

* [ci] add more example

* [lint] run format.sh

* [chore] polish

* [chore] move gemm_sp to experimental

* [chore] polish

* [lint] run format.sh

* [Enhancement] Improve bulk copy handling and update GEMM sparse tensor test

* Added a warning log for unsupported non-swizzled global layouts in the bulk copy operation, ensuring fallback to normal copy.
* Refactored the GEMM sparse tensor test by removing unnecessary imports and simplifying the kernel compilation process.
* Updated the test to directly call the `run_gemm_sp` function, enhancing clarity and functionality.

* Implement Test

* [Enhancement] Update GEMM SP and SM89 templates for improved functionality

* Refactored GEMM SP computation to enhance warp partitioning logic, ensuring compatibility with Hopper architecture.
* Updated layout inference to support new WGMMA conditions and improved error messaging for unsupported targets.
* Modified SM89 templates to utilize new MMA atom structures, enhancing performance and compatibility with fp8 types.
* Added conditional inclusion for GEMM SP header based on CUDA architecture version.

* lint fix

* [gemm_sp] support more layout and data types

* Enhancement: sync T.gemm_sp's layout inference with T.gemm

* Enhancement: support more block_k in compress util

* [Enhancement] enable block_k=64

* [Lint] run format.sh

* [Enhancement] compressor support more dtype

* Enhancement: enable block_K=32

* [Lint] format.sh

* [Fixbug] fix shape

* Refactor: sync gemm

* [Enhancement] enable transpose

* [Enhancement] enable fp8_e4m3

* [Enhancement] enable int8

* [Lint] run format.sh

* [Benchmark] add gemm_sp benchmark

* [Example] fix 256 threads hang

* [CI] fix ci

* [Chore] resolve gemini feedback

* [Benchmark] increase search space

* [Lint] format

* [CI] skip sparse tensor core related tests as only sm90 is supported

* [CI] pass local run

* Update gemm_sm89.h

* lint fix

* lint fix

* [Enhancement] Add support for sparse GEMM and initialize CUDA architecture flags

- Introduced a new boolean flag `enable_sparse_gemm_` to control the inclusion of sparse GEMM functionality in CUDA code generation.
- Updated the `Finish` method to conditionally include the sparse GEMM header based on the new flag.
- Implemented logic in `VisitStmt_` to enable sparse GEMM when the corresponding external call is detected.
- Added a function to initialize the `TORCH_CUDA_ARCH_LIST` environment variable based on the target compute version, enhancing compatibility with PyTorch.
- Refactored the initialization function into the appropriate module and ensured it is called in the sparse utilities module.

* Update test_compress_utils.py

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Doc] Phase out legacy documentation (#610)

- Added a new entry in the README for the introduction of `T.gemm_sp` supporting 2:4 sparse tensor core.
- Removed several outdated documentation files related to convolution, flash attention, and other tutorials to streamline the documentation structure.

* [Refactor] Phase out the ParallelLoopTransformer pass (#611)

* Refactor layout inference by removing the ParallelLoopTransformer class. Updated layout inference logic to streamline buffer access collection and condition handling in parallel loops. This change simplifies the code structure and enhances maintainability.

* Update MHA backward test cases to use reduced dimensions for batch size and context length

* fix build

* [Enhancement] Update ReduceOp initialization values for integer types (#614)

* [Enhancement] Update ReduceOp initialization values for integer types

- Modified the `MakeInitValue` method in `ReduceOp` to handle integer data types correctly by returning appropriate minimum and maximum values based on the bit width.
- Added checks for integer types to ensure correct initialization for `kMax` and `kMin` reduction types, enhancing the robustness of the reduction operations.

* [Enhancement] Update ReduceOp to handle unsigned integer initialization values

- Enhanced the `MakeInitValue` method in `ReduceOp` to include support for unsigned integer data types.
- Added conditions to return appropriate initialization values for `kMax` and `kMin` reduction types based on the data type, improving the robustness of reduction operations.
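
The initial value of a max or min reduction has to be the identity element for the data type, which for integers depends on bit width and signedness. The real `MakeInitValue` is C++; the following is only a Python sketch of the arithmetic, with a hypothetical function name and signature:

```python
def reduce_init_value(kind, bits, signed=True):
    """Identity element for an integer max/min reduction of the given width."""
    if signed:
        lowest = -(1 << (bits - 1))
        highest = (1 << (bits - 1)) - 1
    else:
        lowest = 0
        highest = (1 << bits) - 1
    if kind == "max":
        return lowest  # max(lowest, v) == v for every representable v
    if kind == "min":
        return highest  # min(highest, v) == v for every representable v
    raise ValueError(f"unsupported reduction kind: {kind!r}")
```

For example, an `int8` max reduction starts from -128, while a `uint16` min reduction starts from 65535.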

* Bump transformers from 4.50.0 to 4.51.0 in /examples/bitnet-1.58b (#615)

Bumps [transformers](https://github.com/huggingface/transformers) from 4.50.0 to 4.51.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.50.0...v4.51.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.51.0
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [Refactor] refactor autotune examples (#617)

* [Refactor] Update tilelang kernel functions and remove unused imports

- Refactored the `flashattn_fwd`, `flashattn_bwd_preprocess`, and `flashattn_bwd_postprocess` functions to utilize direct kernel calls instead of cached versions, improving clarity and performance.
- Added `@tilelang.jit` decorators with specified output indices to enhance kernel compilation.
- Removed unused import of `cached` from `tilelang`, streamlining the code.
- Commented out the main testing function call in `test_tilelang_kernel_mha_bwd.py` for potential future use.

* [Refactor] Simplify configuration generation in benchmark and example scripts

- Refactored the `get_configs` functions in multiple benchmark and example scripts to utilize a dictionary-based approach for parameter configuration, improving readability and maintainability.
- Updated the `flashattn` and `chunk_scan_fwd` functions to directly accept configuration parameters, enhancing flexibility in kernel tuning.
- Removed redundant code and streamlined the configuration generation process across various files, ensuring consistency in how configurations are defined and utilized.
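
A dictionary-based configuration generator of the kind described above can be sketched as below. This is an illustration under assumptions: the helper name `expand_configs` and the parameter names in the usage note are hypothetical and are not copied from the benchmark scripts.

```python
import itertools


def expand_configs(search_space):
    """Expand {name: [candidates]} into a list of {name: value} dicts."""
    names = list(search_space)
    return [
        dict(zip(names, values))
        for values in itertools.product(*(search_space[n] for n in names))
    ]
```

Calling `expand_configs({"block_M": [64, 128], "num_stages": [2, 3]})` yields four dictionaries, one per combination, each of which can be passed to a kernel factory as keyword arguments.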

* [Refactor] Update configuration handling in benchmark scripts

- Refactored the `get_configs` functions in benchmark scripts to accept a variable argument list, improving flexibility in configuration management.
- Enhanced the `matmul` and `flashattn` functions to utilize the updated configuration approach, streamlining parameter handling for kernel tuning.
- Added `@autotune` decorators to relevant functions, ensuring consistent autotuning behavior across benchmarks.
- Cleaned up redundant code and improved overall readability in the affected files.

* [Refactor] Clean up formatting and update subproject commit

- Updated the subproject commit reference in the TVM directory to indicate a dirty state.
- Removed unnecessary blank lines and improved formatting in the `benchmark_matmul` and `benchmark_matmul_fp8` scripts for better readability.
- Streamlined the function definitions in the `flashattn` example script to enhance clarity and maintainability.

* [Refactor] Update AutoTuner configuration handling

- Modified the AutoTuner class to check if kernel parameters are set before processing tunable arguments, improving robustness in configuration handling.
- Enhanced the logic for skipping compilation when tunable parameters are already provided, ensuring efficient use of resources.
- Updated comments for clarity and maintainability.

* lint fix

* Update TVM subproject commit to indicate dirty state and modify MHA backward test cases

- Updated the subproject commit reference in the TVM directory to reflect a dirty state.
- Adjusted the `test_mha_bwd` function to use a new configuration for the MHA backward tests, changing the context size from 128 to 256.
- Uncommented the main testing function call for potential execution.

* lint fix

* Bump transformers from 4.51.0 to 4.52.1 in /examples/bitnet-1.58b (#619)

Bumps [transformers](https://github.com/huggingface/transformers) from 4.51.0 to 4.52.1.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.51.0...v4.52.1)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.52.1
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix PTXAS options flag in LibraryGenerator for consistency (#620)

* Refactor FP8 type handling across multiple files to standardize usage of "float8_e4m3" and "float8_e5m2" instead of "e4m3_float8" and "e5m2_float8". This includes updates in benchmarks, examples, tests, and internal utilities.

* [Refactor] Add parallel loop transform pass for condition extraction (#618)

* [Refactor] Add parallel loop transform

* done format check

* pull 3rdparty repo

* Refactor loop variable handling in transformation utilities

- Updated the logic in `loop_parallel_transform_utils.h` to simplify the handling of related loop variables.
- Removed the check that enforced a single related loop variable, replacing it with a return statement when multiple variables are detected, enhancing clarity and maintainability of the transformation process.

* Update loop_parallel_transform_utils.h

* Refactor loop variable handling in transformation utilities

- Enhanced the logic in `loop_parallel_transform_utils.h` to improve clarity and maintainability by simplifying the handling of related loop variables.
- Replaced the previous enforcement of a single related loop variable with a return statement for multiple variables detected.

* remove disable cache flag as commit id has been key component

* lint fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

* [Dev] Update linear attention examples to enhance performance on Hopper GPUs (#621)

* Tune linear attention examples on H100

* Add retnet fwd kernel

* fix lint

* [Enhancement] Add ahead of time cython compilation in setup.py (#622)

* [Enhancement] Add Cython support and compiler detection in setup.py

- Introduced a new `CythonExtension` class for building Cython-based extensions, enhancing the build process for Cython projects.
- Implemented functions to detect the Cython compiler and C++ compiler, improving compatibility and user experience.
- Updated the build process to handle Cython extensions alongside CMake extensions, ensuring a seamless integration for users.
- Added caching mechanisms for Cython compilation to optimize build times and reduce unnecessary recompilation.

* [Enhancement] Add Cython dependency and enable CMake extension building

- Added Cython as a required dependency in `pyproject.toml` to support Cython-based extensions.
- Updated `setup.py` to enable building CMake extensions, improving the build process for projects utilizing both Cython and CMake.
- Modified the Cython compiler detection logic to streamline installation instructions for users.

* [Enhancement] Support more flexible layout host pythonic expr (#623)

* [Refactor] Enhance expression handling in utils.py and update wrapper to use pythonic_expr

- Added support for additional TIR expressions (FloorDiv, Min, Max, Add, Sub, FloorMod) in the pythonic_expr function to improve string representation.
- Replaced the deprecated legalize_c function calls in TLCUDASourceWrapper and TLCPUSourceWrapper with pythonic_expr for better expression handling in kernel launch code.

* [Refactor] Simplify expression handling in pythonic_expr function

- Consolidated binary and min/max operation handling in the pythonic_expr function to improve readability and maintainability.
- Replaced individual checks for binary operations with a mapping approach, streamlining the code and enhancing performance in expression representation.

* [Enhancement] Improve expression representation in pythonic_expr function

- Added operator precedence handling to the pythonic_expr function, enhancing the conversion of TVM PrimExpr to Python-style strings.
- Updated the visitor logic to intelligently add parentheses based on operator precedence, improving the accuracy of expression representation.
- Included a docstring for better clarity on the function's purpose and usage.
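
Precedence-aware printing of this kind can be sketched on a toy expression tree. The code below does not use TVM and is not the actual `pythonic_expr`; the tuple encoding, the `BINARY_OPS` table, and the `render_expr` name are assumptions made for illustration:

```python
# Binary operators mapped to (symbol, precedence); higher binds tighter.
BINARY_OPS = {
    "add": ("+", 1),
    "sub": ("-", 1),
    "mul": ("*", 2),
    "floordiv": ("//", 2),
    "floormod": ("%", 2),
}
ATOM = 3  # precedence of names, constants, and call-like min/max


def _render(expr):
    """Return (text, precedence) for a tuple-encoded expression."""
    kind = expr[0]
    if kind in ("var", "const"):
        return str(expr[1]), ATOM
    if kind in ("min", "max"):
        args = ", ".join(_render(arg)[0] for arg in expr[1:])
        return f"{kind}({args})", ATOM
    symbol, prec = BINARY_OPS[kind]
    lhs, lhs_prec = _render(expr[1])
    rhs, rhs_prec = _render(expr[2])
    if lhs_prec < prec:
        lhs = f"({lhs})"
    # The operators are left-associative, so an equal-precedence right
    # operand also needs parentheses, as in a - (b - c).
    if rhs_prec <= prec:
        rhs = f"({rhs})"
    return f"{lhs} {symbol} {rhs}", prec


def render_expr(expr):
    return _render(expr)[0]
```

Parentheses are emitted only where dropping them would change the meaning, so a ceiling-division grid size renders as `(m + 127) // 128` rather than with redundant nesting.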

* test fix

* [Enhancement] support composable expression for shape with symbolic vars (#624)

* [Refactor] Enhance expression handling in utils.py and update wrapper to use pythonic_expr

- Added support for additional TIR expressions (FloorDiv, Min, Max, Add, Sub, FloorMod) in the pythonic_expr function to improve string representation.
- Replaced the deprecated legalize_c function calls in TLCUDASourceWrapper and TLCPUSourceWrapper with pythonic_expr for better expression handling in kernel launch code.

* [Refactor] Simplify expression handling in pythonic_expr function

- Consolidated binary and min/max operation handling in the pythonic_expr function to improve readability and maintainability.
- Replaced individual checks for binary operations with a mapping approach, streamlining the code and enhancing performance in expression representation.

* [Enhancement] Improve expression representation in pythonic_expr function

- Added operator precedence handling to the pythonic_expr function, enhancing the conversion of TVM PrimExpr to Python-style strings.
- Updated the visitor logic to intelligently add parentheses based on operator precedence, improving the accuracy of expression representation.
- Included a docstring for better clarity on the function's purpose and usage.

* test fix

* minor update

* 🐍 Fix the file name "test_exmaple_tilelang_nsa" (#629)

* [Enhancement] Add CPU utilization and count settings for Auto-Tuning (#630)

* [Enhancement] Add CPU utilization and count settings for Auto-Tuning

- Introduced environment variables for CPU utilization, counts, and maximum CPU count for auto-tuning.
- Updated the AutoTuner class to utilize these new settings, improving flexibility and performance in multi-threaded environments.
- Enhanced logging to provide better insights into the auto-tuning process based on the configured CPU settings.
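
One way such settings could combine into a worker count is sketched below. The environment variable names, the default utilization, and the precedence rules are all assumptions for illustration; the commit message does not spell them out.

```python
import os


def resolve_worker_count(env=None, available=None):
    """Pick a compile-worker count from (hypothetical) environment settings."""
    env = os.environ if env is None else env
    available = (os.cpu_count() or 1) if available is None else available
    # Hypothetical variable names; the real ones may differ.
    utilization = float(env.get("AUTOTUNE_CPU_UTILIZATION", "0.9"))
    explicit = int(env.get("AUTOTUNE_CPU_COUNT", "-1"))
    max_count = int(env.get("AUTOTUNE_MAX_CPU_COUNT", "-1"))

    # An explicit count wins; otherwise scale the available CPUs.
    workers = explicit if explicit > 0 else int(available * utilization)
    if max_count > 0:
        workers = min(workers, max_count)
    # Never exceed the machine and never drop below one worker.
    return max(1, min(workers, available))
```

Passing `env` and `available` explicitly keeps the function easy to test without touching the real process environment.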

* typo fix

* [AutoTune] Support `with set_autotune_inputs` to set auto tuning input tensors (#632)

* [Refactor] Simplify and modularize autotuner implementation

- Removed unused imports and extensive code sections from the autotuner module to enhance readability and maintainability.
- Modularized the code by introducing new imports for autotuning and capturing functionalities, streamlining the overall structure.
- Improved logging setup and removed redundant timeout handling functions, focusing on core autotuning logic.
- Updated the AutoTuner class to better utilize the new modular structure, ensuring efficient performance during auto-tuning processes.

* [Refactor] Clean up and enhance capture and tuner modules

- Improved code readability by removing unnecessary blank lines and organizing imports in `capture.py` and `tuner.py`.
- Enhanced logging in the `AutoTuner` class to provide clearer warnings regarding the usage of `supply_prog` in the context of auto-tuning.
- Streamlined the `CaptureStack` class for better thread-local context management.

* lint fix

* [Refactor] Simplify configuration and autotuning logic in blocksparse GEMM example

- Updated `get_configs` function to reduce the number of configurations, enhancing performance and clarity.
- Removed the `get_best_config` function, integrating its logic directly into the `blocksparse_matmul` function with the `@autotune` decorator for streamlined autotuning.
- Adjusted the main function to directly utilize the autotuned kernel, simplifying the overall structure and improving readability.
- Deleted obsolete test file for autotuning decorator, cleaning up the codebase.

* [Refactor] Improve code formatting and readability in autotune test file

- Reformatted the `matmul` function and `get_configs` function for better readability by adjusting line breaks and indentation.
- Fixed a typo in the `enable_rasteration` parameter name to ensure consistency.
- Cleaned up unnecessary blank lines to enhance overall code clarity.

* Update example_blocksparse_gemm.py

* Update capture.py

* [Pass] Introduce flag to disable cp async lowering (#633)

* [Enhancement] Update PipelinePlanner to support async copy configuration

- Modified the `Substitute` method in `PipelinePlanner` to accept a `use_async_copy` parameter, allowing for more flexible pipeline planning based on async copy requirements.
- Updated the constructor of `PipelinePlanner` to initialize the `use_async_copy_` member variable.
- Adjusted the logic in the pipeline planning process to conditionally apply async copy annotations based on the new parameter.
- Commented out the `LoopVectorizeDynamic` call in `LowerAndLegalize` to prevent unintended modifications during the legalizing phase.

* Refactor PipelinePlanning function for improved readability

- Adjusted the formatting of the `use_async_copy` variable assignment in the `PipelinePlanning` function to enhance code clarity and maintainability.

* fix typo (#635)

* [Pass][Simplify] Introduce symbolic level simplify for condition expression (#634)

* [Enhancement] Add argument simplification option to StmtSimplifier

- Introduced a new `simplify_arguments` flag in the `StmtSimplifier::Apply` method to control argument simplification behavior.
- Updated the `Simplify` function to accept the new flag, allowing for enhanced flexibility in the simplification process.
- Adjusted the `LowerAndLegalize` and `_Simplify` functions to utilize the new argument, ensuring consistent behavior across the codebase.
- Added comments to clarify the purpose of the new flag and its impact on simplification logic.

* lint fix

* [Enhancement] Improve layout inference and reduce operation handling

- Updated `ParallelOp::InferLayout` to check for pure buffer stores, enhancing layout inference logic.
- Modified `ReduceOp::Lower` to include all threads in the AllReduce operation, improving performance on specific architectures.
- Added a TODO comment in `AllReduce` to consider merging synchronization barriers for optimization.

* lint fix

* [Enhancement] Add input validation for GEMM parameters

- Introduced checks to ensure that the dimensions M and N are divisible by their respective warp sizes (kMPerWarp and kNPerWarp) in the Gemm::ComputeWarpPartition method.
- Added informative error messages to assist in debugging when the input parameters do not meet the required conditions.
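
The divisibility check can be sketched as follows. The actual check lives in the C++ `Gemm::ComputeWarpPartition`; this Python version, its function name, and its default per-warp tile sizes are assumptions chosen for illustration:

```python
def check_warp_tiling(M, N, m_per_warp=16, n_per_warp=8):
    """Raise a descriptive error if M or N cannot be split into warp tiles."""
    if M % m_per_warp != 0:
        raise ValueError(
            f"M must be divisible by {m_per_warp} (rows per warp), but got M={M}"
        )
    if N % n_per_warp != 0:
        raise ValueError(
            f"N must be divisible by {n_per_warp} (columns per warp), but got N={N}"
        )
    # Number of warp tiles along each dimension.
    return M // m_per_warp, N // n_per_warp
```

Failing early with the offending dimension in the message is cheaper to debug than an out-of-bounds access inside the generated kernel.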

* bug fix

* Enhance test coverage by adding LLVM requirement decorator to multiple function call tests. This ensures that tests for argument count, type code, null data pointer, and dimensionality checks are only executed when LLVM is available, improving test reliability and clarity.

* lint fix

* Fix software pipeline stage annotation and update optional config handling in StmtSimplifier

* Add Python executable detection in CMake configuration and update TVM submodule reference. Remove unused vectorization tests for improved clarity.

* Update TVM submodule reference and refactor FFI registration to use static initialization blocks for improved organization and clarity.

* Refactor attribute handling in layout and IR nodes to use reflection registration. This change replaces the VisitAttrs method with a RegisterReflection method for improved clarity and organization across multiple classes, including KernelLaunchFrameNode, WarpSpecializeFrameNode, LayoutNode, FragmentNode, and SwizzledLayoutNode.

* finish rebase

* tvm update

* Refactor FFI registration across tilelang modules to use the updated `tvm.ffi` namespace. This includes changes in various files to replace `tvm._ffi` with `tvm.ffi`, enhancing consistency and clarity in the codebase.

* lint fix

* Update TVM submodule reference and modify CUDA runtime argument handling to use the new runtime constants for improved clarity and consistency.

* lint fix

* Refactor tensor data type references from "e4m3_float8" and "e5m2_float8" to "float8_e4m3" and "float8_e5m2" across multiple files for consistency and clarity.

* lint fix

* Refactor forward_index initialization in Fragment class to default to an empty array instead of None, ensuring consistent handling of optional outputs.

* test fix

* lint fix

* bugfix

* lint fix

* reduce fix

* lint fix

* carver fix

* cast fix

* Update submodule and enhance kernel launch functionality with optional block size parameter; add device kernel launch transformation.

* lint fix

* bugfix

* Refactor test execution in test_tilelang_cpu_gemm.py and enhance device call checks in lower.py to exclude C packed functions from kernel launch conditions.

* lint fix

* Update runtime.cc

* phase out license

* Update subproject commit for TVM to 555cc71

* Update subproject commit for TVM to d39953fa

* Update subproject commit for TVM to 9574805f

* Update subproject commit for TVM to a08b7c3

* fix ci

* ci fix

---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: Cunxiao Ni <85601223+Cunxiao2002@users.noreply.github.com>
Co-authored-by: Yuxi Chi <cherichy@outlook.com>
Co-authored-by: Nathan Chen <120630832+Nathancgy@users.noreply.github.com>
Co-authored-by: botbw <wang1570@e.ntu.edu.sg>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: xs-keju <93414213+xs-keju@users.noreply.github.com>
Co-authored-by: Tong WU <109033598+Rachmanino@users.noreply.github.com>
Co-authored-by: Kadir Nar <kadir.nar@hotmail.com>
Co-authored-by: Yuqing Xia <35415939+xiayuqing0622@users.noreply.github.com>
Co-authored-by: xwhzz <wh.xie@outlook.com>

a7c9a8b9

23 Jul, 2025 1 commit

[Bugfix][CI] Bug fixing and migrate CI from ada to hopper (#652) · e9a608e2

Wenhao Xie authored Jul 23, 2025

* fix CI bugs in hopper

* lint fix

* Update bulk_copy.cc

* Refactor bulk copy logic in LowerBulkCopy function

- Removed unnecessary blank lines for improved code readability.
- Enhanced stride validation by checking for null pointers in global stride calculations, ensuring robustness against symbolic strides.
- Updated pass configuration handling in dynamic tile language tests to streamline dynamic alignment and TMA lower pass settings.

* test fix

* ci fix

* Update flash-attention dependencies and clean up example code

- Downgraded `flash-attn` dependency version in `requirements-test.txt` to `<=2.2.0`.
- Removed unused imports and commented-out code in various example files to enhance readability and maintainability.
- Updated the `flashattn` function signature to include default parameters for `block_M`, `block_N`, `num_stages`, and `threads`.
- Cleaned up the `example_mha_fwd_varlen.py` and `example_mha_bwd_wgmma_pipelined.py` files by removing unnecessary comments and improving code clarity.
- Deleted the `example_mha_inference.py` file as it is no longer needed.

* Update CI workflow to remove `--user` flag from pip install commands

- Removed the `--user` flag from the pip install commands in both the development and testing sections of the CI workflow to ensure proper installation of dependencies in the virtual environment.

* Update CI workflow to include `--no-user` flag in pip install commands

- Added the `--no-user` flag to the pip install commands in both the development and testing sections of the CI workflow to ensure dependencies are installed correctly within the virtual environment.

* Update CI workflow to include `--no-user` flag in pip install command for wheel mode

- Added the `--no-user` flag to the pip install command in the wheel mode section of the CI workflow to ensure dependencies are installed correctly within the virtual environment.

* test fix

* avoid conflict with system environments

* test fix

* add comments

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

13 Jul, 2025 1 commit

[AutoTune] Support `with set_autotune_inputs` to set auto tuning input tensors (#632) · eec47592

Lei Wang authored Jul 13, 2025

* [Refactor] Simplify and modularize autotuner implementation

- Removed unused imports and extensive code sections from the autotuner module to enhance readability and maintainability.
- Modularized the code by introducing new imports for autotuning and capturing functionalities, streamlining the overall structure.
- Improved logging setup and removed redundant timeout handling functions, focusing on core autotuning logic.
- Updated the AutoTuner class to better utilize the new modular structure, ensuring efficient performance during auto-tuning processes.

* [Refactor] Clean up and enhance capture and tuner modules

- Improved code readability by removing unnecessary blank lines and organizing imports in `capture.py` and `tuner.py`.
- Enhanced logging in the `AutoTuner` class to provide clearer warnings regarding the usage of `supply_prog` in the context of auto-tuning.
- Streamlined the `CaptureStack` class for better thread-local context management.
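
A thread-local capture stack behind a `with` block, as described above, might look roughly like the following. This is a generic sketch and not the contents of `capture.py`; `set_inputs`, `current_inputs`, and `_stack` are hypothetical names.

```python
import threading
from contextlib import contextmanager

_local = threading.local()  # every thread sees its own attributes on this object


def _stack() -> list:
    """Return the calling thread's stack, creating it on first use."""
    if not hasattr(_local, "stack"):
        _local.stack = []
    return _local.stack


@contextmanager
def set_inputs(*tensors):
    """Make `tensors` the active inputs inside the `with` block, then restore."""
    _stack().append(tensors)
    try:
        yield
    finally:
        _stack().pop()


def current_inputs():
    """Return the innermost captured inputs, or None outside any `with` block."""
    stack = _stack()
    return stack[-1] if stack else None
```

Keeping the stack in `threading.local()` means nested `with` blocks restore correctly and concurrent threads do not observe each other's inputs.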

* lint fix

* [Refactor] Simplify configuration and autotuning logic in blocksparse GEMM example

- Updated `get_configs` function to reduce the number of configurations, enhancing performance and clarity.
- Removed the `get_best_config` function, integrating its logic directly into the `blocksparse_matmul` function with the `@autotune` decorator for streamlined autotuning.
- Adjusted the main function to directly utilize the autotuned kernel, simplifying the overall structure and improving readability.
- Deleted obsolete test file for autotuning decorator, cleaning up the codebase.

* [Refactor] Improve code formatting and readability in autotune test file

- Reformatted the `matmul` function and `get_configs` function for better readability by adjusting line breaks and indentation.
- Fixed a typo in the `enable_rasteration` parameter name to ensure consistency.
- Cleaned up unnecessary blank lines to enhance overall code clarity.

* Update example_blocksparse_gemm.py

* Update capture.py

09 Jul, 2025 1 commit

[Refactor] Add parallel loop transform pass for condition extraction (#618) · 67b81609

xs-keju authored Jul 09, 2025



* [Refactor] Add parallel loop transform

* done format check

* pull 3rdparty repo

* Refactor loop variable handling in transformation utilities

- Updated the logic in `loop_parallel_transform_utils.h` to simplify the handling of related loop variables.
- Removed the check that enforced a single related loop variable, replacing it with a return statement when multiple variables are detected, enhancing clarity and maintainability of the transformation process.

* Update loop_parallel_transform_utils.h

* Refactor loop variable handling in transformation utilities

- Enhanced the logic in `loop_parallel_transform_utils.h` to improve clarity and maintainability by simplifying the handling of related loop variables.
- Replaced the previous enforcement of a single related loop variable with a return statement for multiple variables detected.

* remove the disable-cache flag, as the commit ID is now a key component

* lint fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

08 Jul, 2025 1 commit

[Refactor] refactor autotune examples (#617) · d110d087

Lei Wang authored Jul 08, 2025

* [Refactor] Update tilelang kernel functions and remove unused imports

- Refactored the `flashattn_fwd`, `flashattn_bwd_preprocess`, and `flashattn_bwd_postprocess` functions to utilize direct kernel calls instead of cached versions, improving clarity and performance.
- Added `@tilelang.jit` decorators with specified output indices to enhance kernel compilation.
- Removed unused import of `cached` from `tilelang`, streamlining the code.
- Commented out the main testing function call in `test_tilelang_kernel_mha_bwd.py` for potential future use.

* [Refactor] Simplify configuration generation in benchmark and example scripts

- Refactored the `get_configs` functions in multiple benchmark and example scripts to utilize a dictionary-based approach for parameter configuration, improving readability and maintainability.
- Updated the `flashattn` and `chunk_scan_fwd` functions to directly accept configuration parameters, enhancing flexibility in kernel tuning.
- Removed redundant code and streamlined the configuration generation process across various files, ensuring consistency in how configurations are defined and utilized.
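
A dictionary-based `get_configs` of the kind described above can be written as a Cartesian product over a parameter table. The search space below is an assumed example (the parameter names echo ones mentioned elsewhere in this log); the real scripts may use different keys and values.

```python
import itertools


def get_configs():
    """Expand a dict of candidate values into a list of per-run config dicts."""
    # Hypothetical search space, for illustration only.
    space = {
        "block_M": [64, 128],
        "block_N": [64, 128],
        "num_stages": [1, 2],
        "threads": [128, 256],
    }
    keys = list(space)
    return [dict(zip(keys, values)) for values in itertools.product(*space.values())]
```

Adding a tunable parameter then only requires one new dictionary entry, rather than another nested loop.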

* [Refactor] Update configuration handling in benchmark scripts

- Refactored the `get_configs` functions in benchmark scripts to accept a variable argument list, improving flexibility in configuration management.
- Enhanced the `matmul` and `flashattn` functions to utilize the updated configuration approach, streamlining parameter handling for kernel tuning.
- Added `@autotune` decorators to relevant functions, ensuring consistent autotuning behavior across benchmarks.
- Cleaned up redundant code and improved overall readability in the affected files.

* [Refactor] Clean up formatting and update subproject commit

- Updated the subproject commit reference in the TVM directory to indicate a dirty state.
- Removed unnecessary blank lines and improved formatting in the `benchmark_matmul` and `benchmark_matmul_fp8` scripts for better readability.
- Streamlined the function definitions in the `flashattn` example script to enhance clarity and maintainability.

* [Refactor] Update AutoTuner configuration handling

- Modified the AutoTuner class to check if kernel parameters are set before processing tunable arguments, improving robustness in configuration handling.
- Enhanced the logic for skipping compilation when tunable parameters are already provided, ensuring efficient use of resources.
- Updated comments for clarity and maintainability.

* lint fix

* Update TVM subproject commit to indicate dirty state and modify MHA backward test cases

- Updated the subproject commit reference in the TVM directory to reflect a dirty state.
- Adjusted the `test_mha_bwd` function to use a new configuration for the MHA backward tests, changing the context size from 128 to 256.
- Uncommented the main testing function call for potential execution.

04 Jul, 2025 1 commit

[Refactor] Phaseout Pass ParallelLoopTransformer (#611) · 42c3b452

Lei Wang authored Jul 04, 2025

* Refactor layout inference by removing the ParallelLoopTransformer class. Updated layout inference logic to streamline buffer access collection and condition handling in parallel loops. This change simplifies the code structure and enhances maintainability.

* Update MHA backward test cases to use reduced dimensions for batch size and context length

03 Jul, 2025 1 commit

[Experimental][Language] add `T.GEMM_SP` for sm90 sparse tensor core (#526) · be44758c

botbw authored Jul 04, 2025



* [experimental] add a draft gemm_sp

* [3rdparty] bump cutlass to v3.9.3

* [lint] run format.sh

* [chore] rebase

* [chore] use abs path

* [gemm_sp] add metadata layout

* [ci] add more example

* [lint] run format.sh

* [chore] polish

* [chore] move gemm_sp to experimental

* [chore] polish

* [lint] run format.sh

* [Enhancement] Improve bulk copy handling and update GEMM sparse tensor test

* Added a warning log for unsupported non-swizzled global layouts in the bulk copy operation, ensuring fallback to normal copy.
* Refactored the GEMM sparse tensor test by removing unnecessary imports and simplifying the kernel compilation process.
* Updated the test to directly call the `run_gemm_sp` function, enhancing clarity and functionality.

* Implement Test

* [Enhancement] Update GEMM SP and SM89 templates for improved functionality

* Refactored GEMM SP computation to enhance warp partitioning logic, ensuring compatibility with Hopper architecture.
* Updated layout inference to support new WGMMA conditions and improved error messaging for unsupported targets.
* Modified SM89 templates to utilize new MMA atom structures, enhancing performance and compatibility with fp8 types.
* Added conditional inclusion for GEMM SP header based on CUDA architecture version.

* lint fix

* [gemm_sp] support more layout and data types

* Enhancement: sync T.gemm_sp's layout inference with T.gemm

* Enhancement: support more block_k in compress util

* [Enhancement] enable block_k=64

* [Lint] run format.sh

* [Enhancement] compressor support more dtype

* Enhancement: enable block_K=32

* [Lint] format.sh

* [Fixbug] fix shape

* Refactor: sync gemm

* [Enhancement] enable transpose

* [Enhancement] enable fp8_e4m3

* [Enhancement] enable int8

* [Lint] run format.sh

* [Benchmark] add gemm_sp benchmark

* [Example] fix 256 threads hang

* [CI] fix ci

* [Chore] resolve gemini feedback

* [Benchmark] increase search space

* [Lint] format

* [CI] skip sparse tensor core related tests as only sm90 is supported

* [CI] pass local run

* Update gemm_sm89.h

* lint fix

* lint fix

* [Enhancement] Add support for sparse GEMM and initialize CUDA architecture flags

- Introduced a new boolean flag `enable_sparse_gemm_` to control the inclusion of sparse GEMM functionality in CUDA code generation.
- Updated the `Finish` method to conditionally include the sparse GEMM header based on the new flag.
- Implemented logic in `VisitStmt_` to enable sparse GEMM when the corresponding external call is detected.
- Added a function to initialize the `TORCH_CUDA_ARCH_LIST` environment variable based on the target compute version, enhancing compatibility with PyTorch.
- Refactored the initialization function into the appropriate module and ensured it is called in the sparse utilities module.
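
Initializing `TORCH_CUDA_ARCH_LIST` from a compute version could be sketched as below. The function name `init_torch_cuda_arch_list`, the `"major.minor"` input format, and the decision to leave an already-exported value untouched are all assumptions made for this illustration.

```python
import os


def init_torch_cuda_arch_list(compute_version: str) -> str:
    """Set TORCH_CUDA_ARCH_LIST from a version such as "9.0" unless it is already set.

    Returns the value that is in effect afterwards.
    """
    major, minor = compute_version.split(".")
    arch = f"{int(major)}.{int(minor)}"
    # setdefault keeps a value the user exported explicitly (an assumed policy).
    return os.environ.setdefault("TORCH_CUDA_ARCH_LIST", arch)
```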

* Update test_compress_utils.py

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

20 Jun, 2025 1 commit

[Enhancement] align shared memory allocations (#583) · fecc8336

Lei Wang authored Jun 20, 2025

* [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper

- Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation.
- Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter.
- Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions.

* [Feature] Implement dynamic shared memory allocation alignment

- Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency.
- Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations.
- Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures.
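
The arithmetic behind aligning allocations to a byte boundary is the usual round-up-to-multiple step. The sketch below shows that arithmetic only; it is not the `AlignDynamicSharedMemoryAllocations` pass, and `align_up`, `plan_offsets`, and the 16-byte default are assumptions.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round `offset` up to the next multiple of `alignment` (a positive integer)."""
    return (offset + alignment - 1) // alignment * alignment


def plan_offsets(sizes, alignment=16):
    """Place allocations back to back, starting each one on an aligned offset.

    Returns the list of start offsets and the total number of bytes used.
    """
    offsets, cursor = [], 0
    for size in sizes:
        cursor = align_up(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    return offsets, cursor
```

For example, three buffers of 100, 40, and 8 bytes with 16-byte alignment start at offsets 0, 112, and 160.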

* [Enhancement] Update dtype and argument defaults in GEMM autotuning example

- Changed data type from `float16` to `bfloat16`, which offers a wider dynamic range (at lower mantissa precision) in computations.
- Updated the default value of the `--with_roller` argument from `True` to `False` to modify the behavior of the autotuning process.

* [Enhancement] Improve thread range computation in storage access

- Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking.
- Updated `AccessEntry` structure to include `thread_range`.
- Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits.
- Ensured thread range is computed and stored during buffer load and store operations, enhancing memory access efficiency.

* [Refactor] Update comments for clarity in dynamic shared memory allocation alignment

- Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding.
- Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`.
- Added a blank line for improved readability in `thread_storage_sync.cc`.

* [Refactor] Enhance storage access analysis and thread range computation

- Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits.
- Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency.
- Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations.

* [Enhancement] Update default matrix dimensions in GEMM autotune example

- Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking.

* typo fix

* enhancement

* [Fix] Add conflict detection for buffer index size mismatch in thread storage sync

- Implemented a check to return true if the sizes of previous and current buffer indices do not match, indicating a conflict.

18 Jun, 2025 1 commit

[Enhancement] Update warp specialization checking (#580) · 6cede73d

Lei Wang authored Jun 19, 2025

* Fix L2 cache size calculation to handle symbolic expressions and ensure float conversion of hit ratios in annotation

* [Enhancement] Update warp specialization check in phase.py

* lint fix

* [Enhancement] Add ContainsSeqStmt method to improve statement handling in merge_shared_memory_allocations.cc

* [Refactor] Simplify memory copy operations in GEMM kernel tests

- Updated memory copy operations in `test_tilelang_kernel_gemm.py` to use shared memory allocations for both A and B matrices, improving clarity and performance.
- Adjusted the main execution block to include a new `run_gemm_rs` function call for testing, enhancing the test structure.

* revert memory reuse pass.

* revert the memory reuse and thread sync pass

* Update test_tilelang_kernel_gemm.py

* Update test_tilelang_kernel_mha_bwd.py

16 Jun, 2025 1 commit

[Refactor] Phaseout tf32 Casting from GEMM Templates (#573) · 9ba8b480

Lei Wang authored Jun 16, 2025

* [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic

- Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes.
- Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts.
- Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate.
- Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns.

* lint fix

* [Refactor] Update GEMM Layout Functions and Inference Logic

- Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature.
- Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture.
- Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection.
- Adjusted related layout functions to ensure consistent behavior across different architectures.

* [Refactor] Remove tf32 Casting Logic from GEMM Templates

- Eliminated the `cast_float_to_tf32` function from `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` templates to streamline the code.
- Removed conditional casting logic for float32 to tfloat32 conversion, enhancing clarity and maintainability.
- Updated relevant sections in GEMM operations to reflect the removal of casting, ensuring consistent behavior across templates.
- Adjusted tensor view handling to improve performance and accuracy in matrix operations.

* Update bulk_copy.cc

* Fix profiler initialization in GEMM test by removing TensorSupplyType argument for improved flexibility.

13 Jun, 2025 1 commit

[Enhancement] Include Metadata (LayoutMap etc.) into hashing (#570) · 9247a879

Lei Wang authored Jun 13, 2025

- Modified the serialization of function scripts in both KernelCache and AutoTunerCache to include metadata by calling `script(show_meta=True)` on the function before passing the result to `cloudpickle.dumps()`. This change feeds the metadata into hash key generation for kernel configurations, improving cache accuracy and consistency.
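
Why metadata belongs in the hash can be seen in a small standalone sketch: two kernels with identical script text but different layout metadata should not share a cache entry. The code below uses only `hashlib` and `json`; `cache_key` and the shape of the `metadata` dict are hypothetical and do not reproduce the KernelCache implementation.

```python
import hashlib
import json


def cache_key(script_text: str, metadata: dict) -> str:
    """Hash the kernel script together with its metadata into a hex digest."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps({"script": script_text, "meta": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```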

07 Jun, 2025 1 commit

[Feature] Support persistent kernels and add persistent GEMM examples (#559) · 225aca61

Yu Cheng authored Jun 07, 2025

* [Enhancement] Fix multi-version buffer index in nested-loop

* [Feature] Support persistent kernels and add persistent GEMM example

* lint fix

* lint fix

* [CI] Remove test_tilelang_transform_annotate_device_regions.py

01 Jun, 2025 1 commit

[AMD] Support float8 matrix core (#537) · 5872e647

Lei Wang authored Jun 02, 2025



* [Enhancement] Add support for FP8 types in CUDA and HIP code generation

* Updated `GetFP8Type` function in `codegen_cuda.cc` and `codegen_hip.cc` to handle new FP8 types, including `kFloat8_e4m3fnuz`.
* Introduced a new header file `hip_fp8.h` for FP8 type definitions in HIP.
* Modified type mappings in `dlpack.py` and `mfma_macro_generator.py` to accommodate new FP8 types.
* Enhanced type handling in `TLHIPSourceWrapper` and `tensor.py` for better integration with FP8 types.
* Added necessary includes and logic to support FP8 in the code generation process, improving performance and compatibility with FP8 data types.

* lint fix

* Update src/target/codegen_hip.cc
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update tilelang/intrinsics/mfma_macro_generator.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* workaround

* fix

* Update submodule TVM to latest commit 587028ffebfff0ded520f8f90d62f0f6b165906c

* bug fix

* Refactor tilelang matrix multiplication to support transposition and packing options. Adjusted shared memory shapes and loading logic for A and B matrices. Updated test cases to validate new functionality.

* Refactor assertion function for tilelang matrix multiplication to improve readability by formatting parameters and aligning code. Cleaned up whitespace in intrinsic layout functions for consistency.

* Update bfloat16 type definitions in common.h and gemm.h for consistency. Changed __hip_bfloat16 to hip_bfloat16 and updated MfmaTraits specialization accordingly.

* lint fix

---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

28 May, 2025 1 commit

[Autotune] Introduce cache mechanism for auto tuner (#527) · 7171aff6

Lei Wang authored May 28, 2025

* [Enhancement] Add commit ID to versioning and improve logging initialization

* Updated `get_tilelang_version` to include an optional commit ID in the version string.
* Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
* Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
* Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
* Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
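
Retrieving the commit hash and appending it to a version string might be done along these lines. The body of `get_git_commit_id` here is an assumed implementation (the commit only names the function), and `format_version` with its `+<hash>` suffix is a hypothetical helper.

```python
import subprocess
from typing import Optional


def get_git_commit_id(cwd: str = ".") -> Optional[str]:
    """Return the current commit hash, or None when git or the repository is unavailable."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=cwd, stderr=subprocess.DEVNULL)
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.decode("utf-8").strip()


def format_version(base: str, commit_id: Optional[str] = None) -> str:
    """Append a short commit hash as a local version label when one is known."""
    return f"{base}+{commit_id[:8]}" if commit_id else base
```

Falling back to the plain version keeps builds from source archives, where no `.git` directory exists, working.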

* [Refactor] Enhance AutoTuner and JITKernel for improved performance and caching

* Refactored the AutoTuner class to include new methods for setting compilation and profiling arguments, enhancing configurability.
* Introduced caching mechanisms for tuning results, allowing for faster retrieval of previously computed configurations.
* Updated JITKernel to store tuning results, including latency and configuration details, improving the kernel's performance tracking.
* Added new methods for generating cache keys and saving/loading results to/from disk, streamlining the tuning process.
* Enhanced the overall structure and readability of the autotuning logic, ensuring better maintainability and clarity.
* Minor adjustments in related modules to support the new caching and profiling features.
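
The save/load cycle for tuning results described above can be sketched with the standard library alone. Everything here is illustrative: `make_key`, `save_result`, `load_result`, and the JSON file layout are assumptions, not the AutoTuner's actual on-disk format.

```python
import hashlib
import json
import os


def make_key(kernel_name: str, args: dict) -> str:
    """Derive a stable cache key from the kernel name and its tuning arguments."""
    blob = json.dumps({"kernel": kernel_name, "args": args}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def save_result(cache_dir: str, key: str, config: dict, latency_ms: float) -> str:
    """Write the best configuration and its latency to <cache_dir>/<key>.json."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{key}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"config": config, "latency_ms": latency_ms}, f)
    return path


def load_result(cache_dir: str, key: str):
    """Return the cached result, or None on a cache miss."""
    path = os.path.join(cache_dir, f"{key}.json")
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

A tuner built this way would call `load_result` first and run the full search only when it returns `None`.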

* [Refactor] Clean up code formatting and improve readability in AutoTuner and related modules

* Consolidated import statements and removed unnecessary line breaks for better readability.
* Standardized function argument formatting across the AutoTuner and CompileArgs classes.
* Enhanced consistency in the use of whitespace and indentation throughout the codebase.
* Minor adjustments in the Profiler and JITKernel classes to improve clarity and maintainability.
* Ensured that all changes adhere to the project's coding style guidelines.

* [Refactor] Remove redundant type hints in AutoTuner modules

* Simplified import statements in `__init__.py` and `param.py` by removing unnecessary duplicate type hints for `Any`.
* Improved code readability and maintainability by streamlining type imports across the AutoTuner module.

* [Refactor] Update AutoTuner configuration for improved profiling and target detection

* Enhanced the AutoTuner configuration across multiple examples by adding `set_profile_args` to better manage profiling settings.
* Standardized the use of `target="auto"` in compile arguments to ensure automatic target detection.
* Removed redundant target specifications in certain instances to streamline the configuration process.
* Improved overall clarity and maintainability of the autotuning logic in various example scripts.

* [Refactor] Simplify code formatting and improve readability in example scripts

* Consolidated function argument formatting in `benchmark_mla_decode_amd_tilelang.py`, `example_elementwise_add.py`, and `performance.py` for better clarity.
* Removed unnecessary line breaks and standardized argument placement across multiple files.
* Enhanced overall code readability and maintainability in autotuning examples and performance scripts.

* [Refactor] Update JIT decorator usage across multiple files

* Removed redundant parameters from the JIT decorator in various benchmark and example scripts, simplifying the code.
* Standardized the import of the JIT decorator from `tilelang`, enhancing consistency across the codebase.
* Improved overall readability and maintainability by consolidating import statements and cleaning up function definitions.

* [Refactor] Standardize JIT decorator formatting across benchmark and example scripts

* Simplified the formatting of the JIT decorator in multiple files by removing unnecessary line breaks.
* Enhanced code readability and consistency in the usage of the JIT decorator across benchmark and example scripts.
* Improved overall maintainability by ensuring uniformity in function definitions and decorator usage.

22 May, 2025 1 commit

[Enhancement] Introduce padding annotation and improve memory access validation (#511) · f23c4d30

Lei Wang authored May 22, 2025

* Added a new attribute `kPaddingMap` in `builtin.h` for managing padding annotations.
* Enhanced `SafeMemorysRewriter` to utilize an annotated padding map for buffer stores, improving memory access safety.
* Implemented checks in `layout_inference.cc` to ensure buffers are correctly referenced during layout mapping.
* Introduced a new test file for validating the padding annotation functionality in TileLang.

20 May, 2025 1 commit

[Refactor] Refactor `jit` to `_JitImplementation` to support `@tilelang.jit` (#502) · 8c8d8ca2

Lei Wang authored May 20, 2025

* [Refactor] Rename `jit` class to `_JitImplementation` and improve debug path handling

* Refactored the `jit` class to `_JitImplementation` for clarity and encapsulation.
* Enhanced handling of `debug_root_path` to ensure it is correctly set as an absolute path when provided.
* Updated the public `jit` function to serve as a decorator interface, allowing for both default and configured usage.
* Added validation to ensure input tensors are contiguous in the Cython wrapper, improving error handling.
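
A decorator that works both bare and with arguments, as the public `jit` function is described above, follows a common Python pattern. The sketch below shows that pattern only: it does no compilation, and the `out_idx` keyword is used as an illustrative option rather than a statement of the real signature.

```python
import functools


def jit(func=None, *, out_idx=None):
    """Usable as `@jit` or as `@jit(out_idx=...)`."""

    def decorate(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            # A real implementation would compile here; this sketch only calls through.
            return f(*args, **kwargs)

        wrapper.out_idx = out_idx  # recorded so the configuration can be inspected
        return wrapper

    # Bare `@jit`: Python passes the decorated function directly.
    if callable(func):
        return decorate(func)
    # Configured `@jit(...)`: return the decorator to be applied next.
    return decorate
```

Making the options keyword-only is what lets the bare and configured forms be told apart reliably.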

* [Refactor] Improve formatting and handling in `_JitImplementation` and `jit` function

* Refactored the `_JitImplementation` class to enhance readability by adjusting comment formatting and consolidating conditions for setting `debug_root_path`.
* Updated the `jit` function signature for better alignment and clarity in parameter definitions.
* Ensured consistent spacing and comments throughout the code for improved maintainability.

* [Refactor] Update GEMM test parameters for performance optimization

* Set num_stages to 0 and adjusted matrix dimensions in the GEMM test function to enhance performance and consistency across tests in test_tilelang_jit_gemm.py.
* Reduced the number of threads used in the test to align with the updated configuration, improving overall test efficiency.

* [Refactor] Enhance buffer error logging in layout inference

* Updated the warning message in layout inference to provide clearer context when a buffer cannot be inferred due to its absence in the use list. This change improves the clarity of error reporting during layout inference operations.
* Refactored tensor handling in the Cython wrapper to ensure input tensors are checked for contiguity before processing, enhancing error handling and robustness in tensor management.

* bugfix

18 May, 2025 1 commit

[Refactor] refactor `tilelang.jit` to support a faster and more flexible kernel cache (#501) · 25a50f1a

Lei Wang authored May 19, 2025

* [Refactor] Update JIT kernel functions and streamline GEMM tests

* Renamed and refactored matmul and run_gemm functions to matmul_kernel_jit and run_gemm_kernel_jit for clarity.
* Removed redundant JIT decorator from the matmul function, ensuring it is applied only to the kernel function.
* Updated test function names to reflect changes in the kernel functions, enhancing consistency and readability.
* Cleaned up commented-out code and unnecessary imports to improve overall code quality.

* Update main function call in GEMM test to use tilelang testing framework

* Update README and example scripts to include JIT decorator comments

* Added comments in README.md and various example scripts to indicate the use of the @tilelang.jit decorator for returning torch functions.
* Removed redundant comments that previously instructed to add the decorator, streamlining the documentation and improving clarity.

* Update GEMM test parameters for improved performance

* Set num_stages to 0 and adjusted matrix dimensions in test functions to enhance performance and consistency across GEMM tests in test_tilelang_kernel_gemm.py.

16 May, 2025 1 commit

[Enhancement] Introduce flag to visualize shared memory merge plan (#496) · dca2fb48

Lei Wang authored May 16, 2025

* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.

* Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
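
A signal-based timeout of the kind described above could be sketched as follows. The names `TimeoutException` and `run_with_timeout` come from the commit message, but this body is an assumed implementation; it relies on `SIGALRM`, so it works only on Unix and only when called from the main thread.

```python
import signal


class TimeoutException(Exception):
    """Raised when the wrapped call exceeds its time budget."""


def run_with_timeout(func, timeout, *args, **kwargs):
    """Call func(*args, **kwargs), raising TimeoutException after `timeout` seconds."""

    def _on_alarm(signum, frame):
        raise TimeoutException(f"call exceeded {timeout} s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout)  # accepts fractional seconds
    try:
        return func(*args, **kwargs)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the earlier handler
```

Restoring the previous handler in `finally` keeps the helper from interfering with other code that also uses `SIGALRM`.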

* Add merge shared memory allocations pass and related configurations

- Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
- Registered configuration options for debugging and controlling the merging behavior.
- Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
- Adjusted import paths and added documentation for the new functionality.

* Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
