Commits · 33937683000a6ae6f4e462fd4bd5e030d51a4f50 · OpenDAS / tilelang

17 May, 2025 3 commits

[Refactor] Update GEMM layout and operand traits for improved CUDA compatibility (#500) · 33937683

Lei Wang authored May 18, 2025

* [Enhancement] Improve GEMM layout function and documentation

* Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies.
* Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations.
* Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases.

* lint fix

* [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility

* Adjusted the InferLayout method in gemm.cc to include trans_A in fragment creation, enhancing layout inference for transposed matrices.
* Updated OperandTraits in gemm_sm89.h and gemm_sm90.h to change the Copy type from SM75_U16x4_LDSM_N to SM75_U16x4_LDSM_T, optimizing memory access patterns for different warp configurations.
* Enhanced static assertions in gemm_sm90.h to clarify requirements for num_warp_m, ensuring compatibility with Hopper architecture.

* [Refactor] Clean up formatting in GEMM implementation and CUDA templates

* Simplified the formatting of the fragment creation in the InferLayout method of gemm.cc for better readability.
* Adjusted the static assertion message in gemm_sm90.h to enhance clarity regarding the num_warp_m requirement for Hopper architecture.

33937683

[Bugfix] Rename SM75_U16x8_LDSM_N into SM75_U16x8_LDSM_T for correctness (#499) · 2837878f

Lei Wang authored May 18, 2025

* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.

* Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.

* Add merge shared memory allocations pass and related configurations

- Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
- Registered configuration options for debugging and controlling the merging behavior.
- Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
- Adjusted import paths and added documentation for the new functionality.

* Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py

* Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.

* lint fix

* Update Copy type in OperandTraits for GEMM templates to use SM75_U16x4_LDSM_T and SM75_U16x8_LDSM_T for improved memory access patterns across CUDA architectures.

2837878f
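
The timeout handling named in the first two bullets of this commit message can be pictured with a small sketch. `TimeoutException` and `run_with_timeout` are the names given in the message; the signature, the use of `SIGALRM` through `signal.setitimer`, and the handler restore logic below are assumptions for illustration, not the autotuner's actual code. The approach only works on Unix and only in the main thread.

```python
import signal


class TimeoutException(Exception):
    """Raised when the wrapped call exceeds its time budget."""


def _timeout_handler(signum, frame):
    # Runs in the main thread when SIGALRM fires.
    raise TimeoutException("function call timed out")


def run_with_timeout(func, timeout, *args, **kwargs):
    """Call func(*args, **kwargs), raising TimeoutException after `timeout` seconds.

    Assumed signature: the real autotuner helper may differ.
    """
    previous = signal.signal(signal.SIGALRM, _timeout_handler)
    signal.setitimer(signal.ITIMER_REAL, timeout)
    try:
        return func(*args, **kwargs)
    finally:
        # Always disarm the timer and restore the earlier handler,
        # so a late alarm cannot leak into unrelated code.
        signal.setitimer(signal.ITIMER_REAL, 0)
        if previous is not None:
            signal.signal(signal.SIGALRM, previous)
```

Disarming the timer in `finally` is the part that matters for "signal management": without it, a call that finishes early would still receive the alarm later.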

[Enhancement] Fallback transposed_ldmatrix into `SM75_U16x4_LDSM_N` when warp_n is 8 (#498) · 68a3c4f3

Lei Wang authored May 17, 2025

* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.

* Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.

* Add merge shared memory allocations pass and related configurations

- Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
- Registered configuration options for debugging and controlling the merging behavior.
- Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
- Adjusted import paths and added documentation for the new functionality.

* Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py

* Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.

* lint fix

68a3c4f3

16 May, 2025 3 commits

[Bugfix] Fix Hopper GEMM layout for small tile size (#497) · c93e8695

Lei Wang authored May 16, 2025

* [Enhancement] Improve GEMM layout function and documentation

* Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies.
* Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations.
* Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases.

* lint fix

c93e8695

[Refactor] Update main function structure in example scripts and add tests (#475) · 73ae8087

Yu Cheng authored May 16, 2025

* [Refactor] Update example_mla_decode.py and add tests for block_sparse_attn_tilelang

* Refactor example_mla_decode.py to define a main function for better structure and clarity.
* Introduce test_example_mla_decode.py to validate the functionality of example_mla_decode.
* Refactor block_sparse_attn_tilelang.py to define a main function and add test_block_sparse_attn_tilelang.py for testing.
* Ensure all new test files are integrated with tilelang testing framework.

* [Test] Enhance test_example_mla_decode with argument mocking

* Update test_example_mla_decode.py to mock sys.argv for better test isolation.
* Ensure the main function of example_mla_decode is called with the correct arguments during testing.

73ae8087
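
Mocking `sys.argv`, as the second part of this commit message describes, might look like the sketch below. The `main` function and its `--batch` and `--heads` flags are hypothetical stand-ins; the real `example_mla_decode.py` arguments are not listed in the commit message.

```python
import argparse
import sys
from unittest import mock


def main():
    # Hypothetical example entry point that reads its settings from argv.
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch", type=int, default=1)
    parser.add_argument("--heads", type=int, default=8)
    args = parser.parse_args()
    return args.batch, args.heads


def test_main_with_mocked_argv():
    # Replace sys.argv only inside the block, so arguments passed to the
    # test runner itself never reach the example's parser.
    fake_argv = ["example.py", "--batch", "4"]
    with mock.patch.object(sys, "argv", fake_argv):
        assert main() == (4, 8)
```

Because `patch.object` restores `sys.argv` on exit, tests that run afterwards see the original arguments.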

[Enhancement] Introduce flag to visualize shared memory merge plan (#496) · dca2fb48

Lei Wang authored May 16, 2025

* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.

* Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.

* Add merge shared memory allocations pass and related configurations

- Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
- Registered configuration options for debugging and controlling the merging behavior.
- Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
- Adjusted import paths and added documentation for the new functionality.

* Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py

dca2fb48

14 May, 2025 1 commit

[Refactor] Introduce quantize components of TileLang and add testing for... · cde1886f

Lei Wang authored May 14, 2025

[Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example (#494)


* Remove deprecated example_dequant_gemm.py and add DataType import in __init__.py

* lint fix

* lint fix

* Refactor dequantization examples to use tilelang imports and update data type handling in quantization utilities

* lint fix

cde1886f

13 May, 2025 3 commits

[CI] Add Reminder Bot for pull request contributions (#491) · 31dbb471

Wenhao Xie authored May 13, 2025

* [CI] Add Reminder Bot for pull request contributions

* upd

31dbb471

[CI] Add flash_decoding example to CI (#487) · 7b66fb19

徐畅 authored May 13, 2025

* [CI] Add flash_decoding example to CI

* Add output of ref latency

* format example_gqa_decode.py

7b66fb19

[Enhancement] Support register input for gemm when trans_a or trans_b is true (#490) · d4f096ef

Lei Wang authored May 13, 2025

* [Refactor] Enhance makeGemmFragmentB to support transposition

* Updated the `makeGemmFragmentB` function to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition.
* Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation.
* Modified the function signature in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter.
* Added a new `matmul_sr` function in the test suite to validate the behavior of the updated fragment generation with transposition support.

* [Refactor] Enhance makeGemmFragmentA and makeGemmFragmentB for transposition support

* Updated the `makeGemmFragmentA` and `makeGemmFragmentB` functions to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition.
* Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation.
* Modified function signatures in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter.
* Added a new `matmul_rs` function in the test suite to validate the behavior of the updated fragment generation with transposition support.

* Improve error messaging in layout equality checks

* Enhanced the error output in layout equality checks to provide clearer context by adding line breaks for better readability in the debug output.
* This change ensures that when layouts are structurally unequal, the current and previous layouts are displayed more distinctly, aiding in debugging.

d4f096ef

12 May, 2025 2 commits

Revert "[Bugfix] Use AutoTune cache_input_tensors properly (#483)" (#488) · 39ae28e4

Lei Wang authored May 12, 2025

This reverts commit 22e6de184fa4b307640b108b779f3d46d132f96c.

39ae28e4

[Bugfix] Use AutoTune cache_input_tensors properly (#483) · a10882e0

yyttt6 authored May 12, 2025

a10882e0

11 May, 2025 2 commits

[Bugfix] Check CUDA target before checking for TMA #482 · fa0fca58

Thien Tran authored May 12, 2025

fa0fca58
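
The ordering in this commit title (check that the target is CUDA first, then ask about TMA) can be sketched as follows. Other entries in this log mention a `have_tma` function that takes a target and describe TMA as an sm90+ feature; the body below, the `TargetStub` class, and the arch string parsing are assumptions for illustration and do not reproduce the code in `nvcc.py`.

```python
import re
from dataclasses import dataclass

# Matches CUDA arch strings such as "sm_80", "sm_90a" or "sm_100".
_SM_PATTERN = re.compile(r"sm_(\d+)[a-z]?")


@dataclass(frozen=True)
class TargetStub:
    """Hypothetical stand-in for a compilation target (not tvm.target.Target)."""

    kind: str  # e.g. "cuda", "hip", "llvm"
    arch: str  # e.g. "sm_90a"; meaning depends on the target kind


def have_tma(target) -> bool:
    # Check the target kind first: a non-CUDA arch string such as "gfx942"
    # must never reach the compute-capability parsing below.
    if target.kind != "cuda":
        return False
    match = _SM_PATTERN.fullmatch(target.arch)
    return match is not None and int(match.group(1)) >= 90
```

With this ordering, `have_tma(TargetStub("hip", "gfx942"))` returns `False` instead of failing while parsing the arch string.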

[Feature] Fix Device Consistency in Autotuner Threads and Add Manual Profiler Check (#481) · 089cc0a7

yuanjypku authored May 11, 2025



* Fix Device Consistency in Autotuner Threads and Add Manual Profiler Check

* lint fix

* Update example_mla_decode.py

* Update __init__.py

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

089cc0a7

10 May, 2025 7 commits

Update version retrieval in conf.py to read from VERSION file (#478) · 2297af9a

Wenhao Xie authored May 10, 2025

2297af9a
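
Reading the version from a `VERSION` file, as this commit title describes, is commonly done along these lines. The helper name `read_version` and the assumption that the file holds a single line are illustrative; the commit itself gives no details about `conf.py`.

```python
from pathlib import Path


def read_version(repo_root) -> str:
    """Return the contents of <repo_root>/VERSION without surrounding whitespace."""
    version_file = Path(repo_root) / "VERSION"
    return version_file.read_text(encoding="utf-8").strip()


# In a Sphinx conf.py this could be used as follows (directory layout assumed):
#   release = read_version(Path(__file__).resolve().parents[1])
```

Keeping the version in one file avoids the documentation drifting from the packaged release number.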

[Refactor] Improve layout equality checks and error messaging (#471) · c2480907

Lei Wang authored May 10, 2025

* [Refactor] Simplify buffer_region_to_tile_region function in copy.py

* Removed redundant logic for handling region extents in the buffer_region_to_tile_region function, streamlining the code for better readability and maintainability.
* Enhanced error handling by focusing on essential checks while eliminating unnecessary complexity related to variable extents.

* [Refactor] Improve layout equality checks and error messaging

* Updated the `IsEqual` method in `FragmentNode` to ensure consistent evaluation of thread ranges.
* Enhanced error messaging in `ParallelOp::InferLayout` to include source buffer information for better debugging.
* Adjusted `ReduceOp::InferLayout` to set thread range during layout condensation, improving layout inference accuracy.

* lint fix

* [Refactor] Rename SetThreadRange to BindThreadRange for clarity

* Updated the `SetThreadRange` method in `FragmentNode` and related classes to `BindThreadRange`, improving method naming consistency and clarity.
* Adjusted all references to the renamed method across the codebase, ensuring proper functionality and maintaining existing behavior.
* Enhanced layout equality checks to handle thread ranges more robustly in `IsEqual` method.
* Updated layout inference methods in `Gemm`, `ParallelOp`, and `ReduceOp` to utilize the new method name, ensuring seamless integration with the updated API.

* [Refactor] Update BindThreadRange usage across layout inference methods

* Modified the implementation of `BindThreadRange` in `FragmentNode` to create a new object instance, enhancing thread range binding functionality.
* Updated all references to `BindThreadRange` in layout inference methods across `Gemm`, `ParallelOp`, and `ReduceOp` to ensure consistency with the new implementation.
* Adjusted the return statements in various layout inference functions to utilize the updated method, maintaining existing behavior while improving clarity.

* lint fix

c2480907
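
The change to `BindThreadRange` described above (return a new object instead of modifying the existing one) follows a general pattern that can be shown with a Python analogy. `FragmentSketch` and its fields are hypothetical; the real `FragmentNode` is a C++ class with a different layout.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class FragmentSketch:
    """Hypothetical, simplified stand-in for a fragment layout."""

    shape: Tuple[int, ...]
    thread_range: Optional[Tuple[int, int]] = None  # (min, extent)

    def bind_thread_range(self, thread_range: Tuple[int, int]) -> "FragmentSketch":
        # Return a copy with the range bound; the receiver stays unchanged,
        # so other holders of the original layout are not affected.
        return replace(self, thread_range=thread_range)
```

Callers therefore have to use the return value, for example `layout = layout.bind_thread_range((0, 128))`, which matches the note above about adjusting return statements in the layout inference functions.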

[Refactor] Skip patchelf if not installed (#477) · 273be768

Lei Wang authored May 10, 2025

* [Refactor] Enhance TMA barrier validation and support for additional architectures

* Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
* Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.

* Enhance logging in setup.py and refactor TMA architecture checks in phase.py

* Added logging configuration to setup.py, replacing print statements with logger for better traceability.
* Updated download and extraction functions to use logger for status messages.
* Refactored TMA architecture checks in phase.py to utilize the new `have_tma` function for improved clarity and maintainability.
* Introduced support for additional compute capabilities in nvcc.py, including TMA support checks.

* Update documentation for get_target_compute_version to reflect correct GPU compute capability range

* Refactor have_tma function to accept tvm.target.Target instead of compute_version

* Updated the `have_tma` function in nvcc.py to take a `target` parameter, improving clarity and usability.
* Adjusted calls to `have_tma` in phase.py to pass the target directly, enhancing maintainability and consistency in TMA support checks.

273be768
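
Skipping an optional tool such as `patchelf` when it is not installed, combined with the logger-based messages mentioned in this commit, could look like the sketch below. The function name, the `--set-rpath` use case, and the `which` parameter (present only to make the function testable) are assumptions; the commit does not show how `setup.py` invokes `patchelf`.

```python
import logging
import shutil

logger = logging.getLogger("setup")


def patchelf_command(lib_path, rpath, which=shutil.which):
    """Build a patchelf command line, or return None if patchelf is missing."""
    tool = which("patchelf")
    if tool is None:
        # Degrade gracefully instead of failing the whole build.
        logger.warning("patchelf not found; skipping RPATH patch for %s", lib_path)
        return None
    return [tool, "--set-rpath", rpath, lib_path]
```

A caller would run the returned command with `subprocess.run(cmd, check=True)` only when the result is not `None`.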

[CI] Add Analyzer and blocksparse_attention examples to CI (#472) · 8dec14e0

yyttt6 authored May 10, 2025



* yes

* [Bugfix] fix the unexpected keyword error of autotune

* format

* test

* [CI] Add Analyzer and blocksparse_attention examples to CI

* format

* try

* try

* try

* try

* t

* format

* d

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

8dec14e0

[Refactor] set USE_LLVM to optional. (#476) · 66dba763

Yuxuan Hu authored May 10, 2025

66dba763

[BugFix] Correct argparse for example_convolution test (#474) · 3f25bd1b

Wenhao Xie authored May 10, 2025



* add convolution example to CI

* lint fix

* Update test_example_convolution.py

* fix bug

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

3f25bd1b

[CI] Add Convolution example to CI (#473) · abe170a6

Wenhao Xie authored May 10, 2025



* add convolution example to CI

* lint fix

* Update test_example_convolution.py

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

abe170a6

09 May, 2025 10 commits

[Refactor] Simplify buffer_region_to_tile_region function in copy.py (#470) · c5a989f5

Lei Wang authored May 10, 2025

* Removed redundant logic for handling region extents in the buffer_region_to_tile_region function, streamlining the code for better readability and maintainability.
* Enhanced error handling by focusing on essential checks while eliminating unnecessary complexity related to variable extents.

c5a989f5

[Refactor] Update set_compile_args to allow None for out_idx parameter (#469) · 1f2f1554

Lei Wang authored May 09, 2025

* Modified the `set_compile_args` method in `AutoTuner` to accept `None` as a valid input for the `out_idx` parameter, enhancing flexibility in argument handling.

1f2f1554
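
Accepting `None` for a parameter like `out_idx` usually means normalizing the accepted input forms in one place. The helper below is a hypothetical sketch of that idea, not the `AutoTuner.set_compile_args` implementation; in particular, treating `None` as "no output indices specified" is an assumption.

```python
from typing import List, Optional, Sequence, Union


def normalize_out_idx(out_idx: Union[int, Sequence[int], None]) -> Optional[List[int]]:
    """Map the accepted forms of out_idx to either None or a list of ints."""
    if out_idx is None:
        # Assumed meaning: the caller did not designate any output tensors.
        return None
    if isinstance(out_idx, int):
        return [out_idx]
    return list(out_idx)
```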

[CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI (#467) · 46eb4589

Zhengju Tang authored May 09, 2025



* [Refactor] Enhance TMA barrier validation and support for additional architectures (#463)

* Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
* Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.

* [CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI.

* Lint

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

46eb4589

[Typo] Rename `power_of_int` with `pow_of_int` for consistency (#468) · c99b7056

Lei Wang authored May 09, 2025

* typo fix

* Rename `power_of_int` to `pow_of_int` in math operations and update corresponding Python API reference. Adjusted registration attributes to reflect the new naming convention.

c99b7056

[Feature] Implement fast integer power operation and related API (#466) · 1f5eb492

Lei Wang authored May 09, 2025

* [Refactor] Enhance TMA barrier validation and support for additional architectures (#463)

* Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
* Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.

* [Feature] Implement fast integer power operation and related API

* Added a new math operation `tl.power_of_int` in `math.cc` for efficient integer exponentiation.
* Introduced a corresponding Python API `pow_of_int` in `tir/op.py` to facilitate usage in TileLang.
* Enhanced `common.h` with a template function for integer power calculations.
* Updated documentation to reflect the new functionality and usage examples.

1f5eb492
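
The commit message does not say which algorithm the C++ template in `common.h` uses. One common way to make integer powers fast is exponentiation by squaring, sketched here in Python under the hypothetical name `pow_by_squaring`; it needs O(log n) multiplications instead of n - 1.

```python
def pow_by_squaring(base, exponent: int):
    """Compute base ** exponent for a non-negative integer exponent."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1
    while exponent > 0:
        if exponent & 1:
            # The current bit of the exponent is set: fold this power in.
            result *= base
        base *= base  # base now holds the next power: x, x^2, x^4, ...
        exponent >>= 1
    return result
```

For example, `pow_by_squaring(3, 5)` multiplies `3` and `3**4` because 5 is `0b101`.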

[Bugfix] Fix copy region automation for dynamic extent (#465) · 2ffbd369

Lei Wang authored May 09, 2025

* [Refactor] Enhance TMA barrier validation and support for additional architectures (#463)

* Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
* Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.

* [Refactor] Improve buffer region validation in copy.py

* Added handling for variable extents in buffer_region_to_tile_region function to enhance type checking and error handling.
* Introduced debug print statements to trace values of region extents and temporary extents during validation.
* Updated logic to account for variable extent counts when determining alignment of extents.

* [Refactor] Remove debug print statements in buffer_region_to_tile_region function

* Eliminated unnecessary print statements that were used for debugging temporary extents and region extents.
* Streamlined the code for better readability while maintaining the existing functionality of buffer region validation.

* [Refactor] Clean up whitespace in buffer_region_to_tile_region function

* Removed an unnecessary blank line in the buffer_region_to_tile_region function to improve code readability and maintain consistency in formatting.

2ffbd369

[Refactor] Enhance TMA barrier validation and support for additional architectures (#463) · f41c467c

Lei Wang authored May 09, 2025

* Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
* Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.

f41c467c

[Bugfix] Fix for T.copy with dynamic range (#462) · d946d1d4

Lei Wang authored May 09, 2025

* [Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py

* Refactored barrier functions to use new signatures for improved clarity and consistency.
* Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively.
* Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.

* [Refactor] Update warp_specialized_rewriter with license change and code cleanup

* Replaced Apache License header with MIT License in `warp_specialized_rewriter.cc`.
* Removed the `ThreadTagChecker` class to streamline the code, as it was no longer needed.
* Added `#include` for `common/collector.h` to support new functionality.
* Updated file documentation to reflect the correct filename and purpose.
* Improved overall code readability by removing unnecessary comments and sections.

* [Feature] Add thread synchronization functions in builtin.py and refine buffer region checks in copy.py

* Introduced `sync_threads` and `sync_thread_partial` functions in `builtin.py` for improved thread synchronization capabilities.
* Enhanced documentation for new synchronization functions to clarify usage and parameters.
* Updated buffer region validation in `copy.py` to ensure type checking for integer values, improving error handling for region extents.

* lint fix

* [Feature] Introduce TMA barrier injection and related utilities

* Added `inject_tma_barrier.cc` to implement TMA barrier rewriting for CUDA GPU (sm90+).
* Created `common/attr.h` and `common/collector.h` for attribute checks and information collection from the IR.
* Updated `ir.cc` to use a constant for the main block name instead of a hardcoded string.
* Cleaned up `warp_specialized_rewriter.cc` by removing unnecessary whitespace.
* Enhanced thread tag validation with `ThreadTagChecker` to ensure only `threadIdx.x` is used in TMA barrier contexts.

* lint fix

d946d1d4

[CI] Add elementwise and gemv examples to CI. (#458) · dd7eb488

Cunxiao Ni authored May 09, 2025

* [CI] Add elementwise and gemv examples to CI.

* fix lint

* test

* fix gemv lint

* fix lint

dd7eb488

[Doc] add llvm version info to installation.md. (#459) · 8d5e803e

Rinne authored May 09, 2025

8d5e803e

08 May, 2025 2 commits

[Refactor] Update barrier functions and remove argparse in... · b0122d74

Lei Wang authored May 08, 2025

[Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py (#457)

* Refactored barrier functions to use new signatures for improved clarity and consistency.
* Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively.
* Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.

b0122d74

[Refactor] Update barrier functions and add new example for GEMM with warp specialization (#456) · a91bc2a9

Lei Wang authored May 08, 2025

* Add example for warp specialization with flash attention

* Introduced a new example script `example_warp_specialize_flashmla.py` demonstrating flash attention using warp specialization in TileLang.
* Implemented the `flashattn` function with shared memory allocation and memory barrier synchronization for improved performance.
* Added a reference program for validation against PyTorch's implementation, including profiling for latency and performance metrics.
* Removed the outdated `example_warp_specialize_mla.py` to streamline examples and focus on the new implementation.

* Add memory barrier functions to builtin.py

* Introduced `barrier_wait` and `barrier_arrive` functions for memory barrier synchronization.
* Enhanced documentation with detailed docstrings for both functions, clarifying their usage and parameters.
* The `barrier_wait` function serves as a wrapper for `mbarrier_wait_parity`, supporting parity values 0 and 1.
* Improved code organization and readability by adding blank lines for better separation of logical sections.

* Enhance code readability by adding blank lines in example_warp_specialize_flashmla.py and builtin.py

* Added blank lines to improve code organization and separation of logical sections in `example_warp_specialize_flashmla.py`.
* Included blank lines in `builtin.py` around the `wait_wgmma` and `barrier_wait` functions for better readability.

* [Refactor] Update barrier functions and add new example for GEMM with warp specialization

* Refactored memory barrier functions in `example_warp_specialize_flashmla.py` to use the new `barrier_wait` and `barrier_arrive` methods for improved clarity and consistency.
* Introduced a new example script `example_warp_specialize_gemm_copy_gemm_0_1.py` demonstrating matrix multiplication with warp specialization and shared memory allocation.
* Enhanced the `layout.cc` and `elem.cc` files to improve structural equality checks and error handling in copy operations.
* Updated `warpgroup.py` to refine thread ID calculations for better performance in warp specialization scenarios.
* Added new shuffle operations in `builtin.py` for enhanced functionality in parallel computations.

* lint fix

* Update loop variable checks in SIMT loop and buffer region validation

* Modified checks in `elem.cc` to ensure loop variable sizes are less than or equal to source and destination range sizes for better error handling.
* Adjusted assertions in `copy.py` to reflect the updated logic, allowing for more flexible region extent comparisons and improved error messaging.

* lint fix

* test fix

a91bc2a9
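
The relationship between `barrier_wait` and `mbarrier_wait_parity` described in this commit (a thin wrapper, with parity restricted to 0 and 1) can be illustrated with plain Python stand-ins. Both functions below are hypothetical recorders rather than the TileLang builtins, and the parity formula assumes a pipeline in which barrier `k % num_stages` is reused once every `num_stages` iterations.

```python
calls = []


def mbarrier_wait_parity(barrier_id, parity):
    # Stand-in: record the request instead of emitting an intrinsic.
    calls.append((barrier_id, parity))


def barrier_wait(barrier_id, parity):
    """Thin wrapper that only admits the two valid phase parities."""
    if parity not in (0, 1):
        raise ValueError("parity must be 0 or 1")
    mbarrier_wait_parity(barrier_id, parity)


def phase_parity(k: int, num_stages: int) -> int:
    # Each reuse of a stage's barrier completes one more phase,
    # so the parity flips once every num_stages iterations.
    return (k // num_stages) & 1
```

With two stages, iterations 0 to 5 wait on barriers 0, 1, 0, 1, 0, 1 with parities 0, 0, 1, 1, 0, 0.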

07 May, 2025 1 commit

[Bugfix] Fix get_swizzle_layout implementation. (#455) · 8adfc117

Yuxi Chi authored May 08, 2025

* fix get_swizzle_layout implementation.

* format.

8adfc117

06 May, 2025 5 commits

[Feature] Add cache directory management functions in tilelang.cache (#453) · 0aaef97d

Lei Wang authored May 06, 2025

* [Feature] Add cache directory management functions in tilelang.cache

* Introduced `get_cache_dir` and `set_cache_dir` functions to manage the kernel cache directory.
* Updated `KernelCache` class to store cache directory as a `Path` object for improved path handling.
* Enhanced documentation with examples for new cache directory functions.

* [Refactor] Update cache imports in tilelang.__init__.py

* Added `set_cache_dir` and `get_cache_dir` functions to the import statement for improved cache directory management.
* This change enhances the accessibility of cache directory management functions within the module.

0aaef97d
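
A getter and setter pair over a cache object that stores its directory as a `Path`, as this commit describes, might be structured like the sketch below. `KernelCacheSketch`, the module-level `_cache`, and the default directory are hypothetical; the functions reuse the names `get_cache_dir` and `set_cache_dir` from the commit message but are local stand-ins, not the `tilelang.cache` implementation.

```python
import tempfile
from pathlib import Path


class KernelCacheSketch:
    """Hypothetical stand-in for a kernel cache holding its directory as a Path."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)


# Assumed default location, chosen only for this sketch.
_cache = KernelCacheSketch(Path(tempfile.gettempdir()) / "kernel_cache_sketch")


def set_cache_dir(path) -> None:
    # Accept str or Path and always store a Path.
    _cache.cache_dir = Path(path)


def get_cache_dir() -> Path:
    return _cache.cache_dir
```

Storing a `Path` lets callers write `get_cache_dir() / "kernel.so"` without caring whether the directory was set from a string.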

[Enhancement] Introduce pass_configs parameter for kernel Caching (#452) · b1ba0cc8

Lei Wang authored May 06, 2025

* [Enhancement] Introduce pass_configs parameter for kernel compilation

* Added a new `pass_configs` parameter to the `tilelang.compile` function to allow for more flexible kernel compilation configurations.
* Updated related classes and methods to accommodate the new parameter, ensuring compatibility across the codebase.
* Enhanced the `torch_assert_close` function to include customizable tensor names for better debugging output.
* Refactored input handling in example scripts to streamline the process of obtaining inputs for kernel execution.

* lint fix

b1ba0cc8
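
Since this commit ties `pass_configs` to kernel caching, one thing worth illustrating is why such options belong in a cache key: two compilations of the same source with different pass options must not share a cached kernel. The key function and the config names below are hypothetical and do not reflect how TileLang computes its keys.

```python
import hashlib
import json


def kernel_cache_key(source: str, target: str, pass_configs=None) -> str:
    """Derive a stable key from the source, the target, and the pass options."""
    payload = {
        "source": source,
        "target": target,
        "pass_configs": pass_configs or {},
    }
    # sort_keys makes the key independent of dict insertion order.
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

For instance, `kernel_cache_key(src, "cuda", {"sketch.option_a": True})` differs from the key computed without any options, so the two builds are cached separately.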

[Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP (#450) · 0a8c8b99

Lei Wang authored May 06, 2025

* [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP

* Introduced TILELANG_CHECK_LAST_ERROR macro to streamline error checking for kernel launches in both CUDA and HIP.
* Updated kernel launch code in wrapper.py to utilize the new macro, enhancing readability and maintainability.
* This change improves error reporting by providing detailed messages when kernel execution fails.

* [Refactor] Standardize error message formatting in TILELANG_CHECK_LAST_ERROR macro

* Updated the TILELANG_CHECK_LAST_ERROR macro in both CUDA and HIP implementations to ensure consistent formatting of error messages.
* Enhanced readability by aligning the error message structure across different platforms, improving maintainability of error handling code.

0a8c8b99

Update requirements.txt (#451) · 025929d8

Lei Wang authored May 06, 2025

025929d8

[Enhancement] Add new examples for warp specialization and TMA integration (#448) · b5faf25a

Lei Wang authored May 06, 2025

* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic

* Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
* Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.

* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files

* Updated function names from CamelCase to snake_case for better consistency across the codebase.
* Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
* Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.

* [Refactor] Rename operations to snake_case for consistency

* Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
* Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
* Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.

* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier

* Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
* Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
* Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
* Enhanced the TileLang API with new methods for retrieving block and thread extents.
* Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
* Improved layout inference and kernel launch logic for better performance and clarity.

* [Refactor] Clean up code formatting and improve readability

* Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
* Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
* Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
* Ensured consistent spacing and formatting across multiple files to enhance overall code readability.

* lint fix

* [Refactor] Update mbarrier functions for improved clarity and consistency

* Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
* Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
* Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
* Added detailed docstrings to clarify usage examples for memory barrier functions.

* Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.

* [Feature] Add examples for warp specialization and TMA barrier integration

* Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
* Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
* Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
* Updated `phase.py` to include TMA barrier injection in the optimization process.
* Improved documentation and comments for better clarity on usage and functionality.

* [Feature] Add example for warp specialization in GEMM with TMA barriers

* Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
* Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
* Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
* Enhanced documentation and comments for clarity on usage and functionality.

* lint fix

* [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection

* Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
* Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
* Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
* This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
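The detect-then-decide pattern described above can be sketched with a toy IR. Everything here is an assumption made for illustration: the `Node` type, the `tma_load`/`tma_store` op names, and the skip policy in `should_rewrite` are hypothetical, and the real `WarpSpecializedDetector` is a C++ visitor over TIR whose exact condition is not stated in this commit message.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy IR node: an op name plus child statements (hypothetical)."""
    op: str
    children: List["Node"] = field(default_factory=list)


# Illustrative op names; the real TIR intrinsics may be named differently.
TMA_OPS = {"tma_load", "tma_store"}
MBARRIER_OPS = {"mbarrier_arrive", "mbarrier_wait_parity"}


class ToyDetector:
    """Walk the tree once and record which kinds of ops occur."""

    def __init__(self) -> None:
        self.has_tma = False
        self.has_mbarrier = False

    def visit(self, node: Node) -> "ToyDetector":
        if node.op in TMA_OPS:
            self.has_tma = True
        if node.op in MBARRIER_OPS:
            self.has_mbarrier = True
        for child in node.children:
            self.visit(child)
        return self


def should_rewrite(body: Node) -> bool:
    """Assumed policy: skip bodies that already contain both TMA and mbarrier ops."""
    found = ToyDetector().visit(body)
    return not (found.has_tma and found.has_mbarrier)


if __name__ == "__main__":
    manual = Node("block", [Node("tma_load"), Node("loop", [Node("mbarrier_arrive")])])
    plain = Node("block", [Node("copy"), Node("gemm")])
    print(should_rewrite(manual), should_rewrite(plain))
```

Running detection as a separate read-only walk keeps the rewriting pass simple: it can decide up front whether to touch a function at all.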

* lint fix

* [Feature] Add new examples for warp specialization and TMA integration

* Introduced multiple new example scripts demonstrating warp specialization techniques, including `example_warp_specialize_flashmla.py`, `example_warp_specialize_gemm_barrierpipe_stage2.py`, `example_warp_specialize_gemm_copy_0_gemm_1.py`, `example_warp_specialize_gemm_copy_1_gemm_0.py`, and `example_warp_specialize_gemm_softpipe_stage2.py`.
* Each example showcases matrix multiplication with warp specialization and TMA barriers, implementing kernel functions with shared memory allocation and memory barrier synchronization for enhanced performance.
* Added a test suite in `test_example_warp_specialize.py` to validate the functionality of the new examples.
* Updated the TileLang API to support these examples and improve kernel compilation and testing processes.
* Removed outdated example scripts to streamline the codebase and enhance clarity on available functionalities.

* lint fix

* Remove outdated example scripts for warp specialization and TMA integration to streamline the codebase. This includes `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, `example_warp_specialize_gemm_stage2.py`, and `example_warp_specialize_mla.py`, which are no longer needed following recent updates and improvements in the TileLang API.


03 May, 2025 1 commit

[Refactor] Separate warp specialize rewriter and tma barrier injector pass (#447) · fce16b00

Lei Wang authored May 03, 2025

* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic

* Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
* Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.

* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files

* Updated function names from CamelCase to snake_case for better consistency across the codebase.
* Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
* Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.

* [Refactor] Rename operations to snake_case for consistency

* Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
* Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
* Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
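A rename of this kind is essentially a fixed lookup table applied to whole identifiers. The following sketch is a hypothetical helper, not a script from the repository; the table holds only the three pairs named in the message above, and the remaining renames ("and others") are deliberately left out rather than guessed.

```python
import re

# Old -> new names quoted in the commit message; the full list is not given there.
RENAMES = {
    "CreateTMADescriptorOp": "create_tma_descriptor",
    "CreateListofMBarrierOp": "create_list_of_mbarrier",
    "GetMBarrierOp": "get_mbarrier",
}

# Match whole identifiers only, so longer names that merely contain an
# old name are left untouched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def rename_ops(source: str) -> str:
    """Replace every old operation name in `source` with its new name."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], source)


if __name__ == "__main__":
    line = "bar = GetMBarrierOp(0); desc = CreateTMADescriptorOp(args)"
    print(rename_ops(line))
```

A table is used instead of a generic CamelCase converter because the new names are not mechanical: the `Op` suffix is dropped and `Listof` becomes `list_of`.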

* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier

* Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
* Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
* Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
* Enhanced the TileLang API with new methods for retrieving block and thread extents.
* Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
* Improved layout inference and kernel launch logic for better performance and clarity.

* [Refactor] Clean up code formatting and improve readability

* Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
* Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
* Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
* Ensured consistent spacing and formatting across multiple files to enhance overall code readability.

* lint fix

* [Refactor] Update mbarrier functions for improved clarity and consistency

* Refactored the `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
* Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
* Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
* Added detailed docstrings to clarify usage examples for memory barrier functions.

* Added blank lines in the `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to separate logical sections and improve readability.

* [Feature] Add examples for warp specialization and TMA barrier integration

* Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
* Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
* Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
* Updated `phase.py` to include TMA barrier injection in the optimization process.
* Improved documentation and comments for better clarity on usage and functionality.

* [Feature] Add example for warp specialization in GEMM with TMA barriers

* Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
* Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
* Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
* Enhanced documentation and comments for clarity on usage and functionality.

* lint fix

* [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection

* Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
* Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
* Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
* This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.

* lint fix
