- 16 Jun, 2025 2 commits
-
-
徐畅 authored
* [CI] Add flash_decoding example to CI * Add output of ref latency * format example_gqa_decode.py * [BugFix] Fix precision issue in GQA decode when block_N exceeds seqlen/num_split * format example_gqa_decode.py
-
Tong WU authored
* Update FLA import path for `prepare_token_indices` * Update FLA import path for `prepare_token_indices` * Compare versions via packaging.version.parse
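A hedged sketch of the version-gated import pattern described above; the module paths, the `__version__` attribute, and the "0.1.2" threshold are illustrative assumptions, not the values used in the actual commit.

```python
# Sketch: compare versions with packaging.version.parse instead of raw strings.
# Module paths and the threshold below are assumptions for illustration.
from packaging.version import parse

import fla  # flash-linear-attention

if parse(fla.__version__) >= parse("0.1.2"):
    from fla.ops.utils.index import prepare_token_indices  # assumed newer path
else:
    from fla.ops.utils import prepare_token_indices  # assumed older path
```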
-
- 13 Jun, 2025 1 commit
-
-
Lei Wang authored
Fix assertion in GQA backward example to ensure correct tensor comparison for gradient validation (#568)
-
- 11 Jun, 2025 1 commit
-
-
Yu Cheng authored
* [Feature] Added Support for Synchronizing Grids and Persistent Threadblock Transformation
  - Defined the sync_grid operation in builtin.cc and builtin.h, allowing synchronization of all threads within a grid.
  - Implemented support for sync_grid in codegen_cuda.cc, ensuring proper handling of this operation in the generated CUDA code.
  - Added the PersistThreadblock transformation, enabling the conversion of thread blocks to persistent thread blocks, enhancing support for persistent kernels.
  - Updated relevant documentation and comments to reflect the addition of new features and usage instructions.
* [Example] Add MLA Decode With Persistent Threadblock Example
* [Feature] Introduce Persistent Loop and Update GEMM Example
  - Added a new persistent loop construct in the TIR framework, enabling more efficient kernel execution.
  - Updated the GEMM example to utilize the new persistent primitive, enhancing performance for matrix multiplication.
  - Introduced a `loop_break` intrinsic for better control flow within persistent loops.
  - Updated relevant files to support the new features, including changes in code generation and language interface.
* lint fix
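A plain-Python illustration of the persistent-threadblock scheduling idea introduced above: a fixed pool of blocks loops over all output tiles instead of launching one block per tile, with a grid-wide sync (the new sync_grid) conceptually separating phases. Names and sizes are illustrative only, not the tilelang API.

```python
# Conceptual sketch only: how a persistent threadblock maps a fixed pool of
# blocks onto a larger tile space. A real kernel would run these "blocks" on
# SMs and call sync_grid between phases; here we just show the schedule.
NUM_PERSISTENT_BLOCKS = 4    # e.g. one block per SM (assumed)
NUM_TILES = 10               # logical work items, e.g. output tiles of a GEMM

def tiles_for_block(block_id: int) -> list[int]:
    # Each persistent block strides over the tile space until work runs out,
    # which is where a loop_break-style intrinsic would exit the loop.
    return list(range(block_id, NUM_TILES, NUM_PERSISTENT_BLOCKS))

for b in range(NUM_PERSISTENT_BLOCKS):
    print(f"persistent block {b} -> tiles {tiles_for_block(b)}")
```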
-
- 07 Jun, 2025 1 commit
-
-
Yu Cheng authored
* [Enhancement] Fix multi-version buffer index in nested-loop * [Feature] Support persistent kernels and add persistent GEMM example * lint fix * lint fix * [CI] Remove test_tilelang_transform_annotate_device_regions.py
-
- 06 Jun, 2025 1 commit
-
-
xs-keju authored
* [CI] Add CI test for flash_attention examples * Update example_gqa_fwd_bshd.py * Update example_mha_fwd_bshd_wgmma_pipelined.py * [CI] Added conditional annotations for tests in flash_attention * [CI] Added conditional annotations for tests in flash_attention --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 05 Jun, 2025 1 commit
-
-
Zhengju Tang authored
* [CI] Add FusedMoE example * Lint * Fix import bug * Fix comment bug * Update example_fusedmoe_torch.py --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 04 Jun, 2025 3 commits
-
-
alex_xiao authored
* [CI] Add norm and layout_plot
* fix lint
* Remove obsolete test files for RMS normalization and plot layout, streamlining the testing suite.
* Add make_mma_load_base_layout function to create MMA result layouts
  - Introduced a new function `make_mma_load_base_layout` for generating layout functions for storing MMA results in fragment buffers.
  - Added detailed docstring explaining parameters, return values, and potential exceptions.
  - Implemented logic for handling different data types and matrix configurations, including assertions for input validation.
  - Defined internal functions for mapping fragment indices to threads and local indices, enhancing the layout functionality.
* Enhance MMA load test with additional imports and functionality
  - Added imports for `tilelang.language`, `Literal`, `Callable`, `DataType`, `IndexMap`, and `get_mma_micro_size` to support extended functionality.
  - Improved the `make_mma_load_base_layout` function by ensuring it can handle various data types and configurations.
  - Updated the test function `test_mma_load_base_layout` to validate the layout for float16 matrix A.
* Fix formatting in test_fragment_mma_load_a.py by adding a blank line for improved readability.
* Add RMS normalization functions to test_rms_norm.py
  - Introduced `rms_norm` and `rms_norm_splitk` functions for RMS normalization, enhancing the testing capabilities.
  - Implemented kernel functions with shared memory allocation and parallel processing for improved performance.
  - Updated the test function to validate the new RMS normalization implementations.
* Add reference program for RMS normalization in test_rms_norm.py
  - Introduced `ref_program` function to provide a reference implementation for RMS normalization.
  - This addition enhances the testing framework by allowing comparisons against a known reference output.
* Enhance RMS normalization tests with additional imports and formatting
  - Added import for `tilelang.language` to support extended functionality in `test_rms_norm.py`.
  - Improved code readability by adding blank lines for better separation of code sections.
* Update RMS normalization test parameters and enhance layout plotting
  - Increased matrix dimensions in `test_rms_norm` to 8192 for improved performance testing.
  - Removed obsolete test functions in `test_fragment_mma_load_a.py` to streamline the test suite.
  - Enhanced layout plotting functionality by ensuring proper visualization of base, warp, and block layouts in `test_fragment_mma_load_a.py`.
* Refactor RMS normalization test parameters and improve layout plotting readability
  - Simplified the parameters in `test_rms_norm` by removing `blk_k` for clarity.
  - Enhanced code readability in `test_fragment_mma_load_a.py` by adjusting the formatting of the `block_layout` definition and removing the unused `warp_cols` variable.
* Enhance RMS normalization with split-k implementation and additional profiling
  - Added a new function `test_rms_norm_splitk` to test the split-k variant of RMS normalization.
  - Updated the main RMS normalization script to include profiling for the split-k implementation.
  - Ensured all checks pass with appropriate latency measurements for both reference and tile-lang implementations.
* Remove obsolete test file `test_fragment_mma_load_a.py` to streamline the test suite.
* Refactor `rms_norm.py` to streamline benchmarking output and remove redundant code. Comment out the `plot_layout` call in `fragment_mma_load_a.py` for clarity.
* Refactor `test_rms_norm.py` by removing redundant test function `test_rms_norm_splitk` to streamline the test suite and improve clarity.
--------- Co-authored-by: Your Name <you@example.com>
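The `ref_program` added in this commit is not shown in the log; below is a minimal PyTorch RMS-normalization reference of the kind such tests typically compare against (an assumption, not the example's actual code).

```python
import torch

def rms_norm_ref(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Normalize each row by the root-mean-square of its elements.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

# Shapes are illustrative; the commit mentions 8192 as the test dimension.
x = torch.randn(8, 8192)
y = rms_norm_ref(x)
print(y.shape, y.pow(2).mean(dim=-1)[:3])  # per-row mean square is ~1 after norm
```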
-
Tong WU authored
* Add linear attention examples. * Add license * Remove comments * Run yapf and ruff
-
Lei Wang authored
* Remove unused 2D continuous cumulative sum example and related functions from the cumsum module. * lint fix * fix split k example * Enable cache disabling in gemm_streamk example and add validation checks in if_stmt_binding transformation * Update gemm_streamk example to use tilelang's cdiv function for block calculations and add copyright notice
-
- 03 Jun, 2025 1 commit
-
-
Tong WU authored
* [CI] Add hadamard example to CI * Run yapf and ruff * Run yapf and ruff
-
- 01 Jun, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Add support for FP8 types in CUDA and HIP code generation
* Updated `GetFP8Type` function in `codegen_cuda.cc` and `codegen_hip.cc` to handle new FP8 types, including `kFloat8_e4m3fnuz`.
* Introduced a new header file `hip_fp8.h` for FP8 type definitions in HIP.
* Modified type mappings in `dlpack.py` and `mfma_macro_generator.py` to accommodate new FP8 types.
* Enhanced type handling in `TLHIPSourceWrapper` and `tensor.py` for better integration with FP8 types.
* Added necessary includes and logic to support FP8 in the code generation process, improving performance and compatibility with FP8 data types.
* lint fix
* Update src/target/codegen_hip.cc Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update tilelang/intrinsics/mfma_macro_generator.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* workaround
* fix
* Update submodule TVM to latest commit 587028ffebfff0ded520f8f90d62f0f6b165906c
* bug fix
* Refactor tilelang matrix multiplication to support transposition and packing options. Adjusted shared memory shapes and loading logic for A and B matrices. Updated test cases to validate new functionality.
* Refactor assertion function for tilelang matrix multiplication to improve readability by formatting parameters and aligning code. Cleaned up whitespace in intrinsic layout functions for consistency.
* Update bfloat16 type definitions in common.h and gemm.h for consistency. Changed __hip_bfloat16 to hip_bfloat16 and updated MfmaTraits specialization accordingly.
* lint fix
--------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
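A hedged sketch of the kind of string-to-dtype mapping that `dlpack.py` extends for FP8; the key names are assumptions, and availability of the `float8_e4m3fnuz` variant (the ROCm-flavoured format mentioned above) depends on the PyTorch build.

```python
import torch

# Assumed mapping from tilelang-style dtype strings to torch FP8 dtypes;
# the exact keys used by dlpack.py may differ.
FP8_TORCH_DTYPES = {
    "float8_e4m3": torch.float8_e4m3fn,
    "float8_e5m2": torch.float8_e5m2,
    "float8_e4m3fnuz": torch.float8_e4m3fnuz,  # ROCm-flavoured e4m3 variant
}

x = torch.randn(16, 16).to(FP8_TORCH_DTYPES["float8_e4m3"])
print(x.dtype)  # torch.float8_e4m3fn
```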
-
- 28 May, 2025 3 commits
-
-
yyttt6 authored
-
Lei Wang authored
* [Enhancement] Add commit ID to versioning and improve logging initialization
* Updated `get_tilelang_version` to include an optional commit ID in the version string.
* Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
* Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
* Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
* Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
* [Refactor] Enhance AutoTuner and JITKernel for improved performance and caching
* Refactored the AutoTuner class to include new methods for setting compilation and profiling arguments, enhancing configurability.
* Introduced caching mechanisms for tuning results, allowing for faster retrieval of previously computed configurations.
* Updated JITKernel to store tuning results, including latency and configuration details, improving the kernel's performance tracking.
* Added new methods for generating cache keys and saving/loading results to/from disk, streamlining the tuning process.
* Enhanced the overall structure and readability of the autotuning logic, ensuring better maintainability and clarity.
* Minor adjustments in related modules to support the new caching and profiling features.
* [Refactor] Clean up code formatting and improve readability in AutoTuner and related modules
* Consolidated import statements and removed unnecessary line breaks for better readability.
* Standardized function argument formatting across the AutoTuner and CompileArgs classes.
* Enhanced consistency in the use of whitespace and indentation throughout the codebase.
* Minor adjustments in the Profiler and JITKernel classes to improve clarity and maintainability.
* Ensured that all changes adhere to the project's coding style guidelines.
* [Refactor] Remove redundant type hints in AutoTuner modules
* Simplified import statements in `__init__.py` and `param.py` by removing unnecessary duplicate type hints for `Any`.
* Improved code readability and maintainability by streamlining type imports across the AutoTuner module.
* [Refactor] Update AutoTuner configuration for improved profiling and target detection
* Enhanced the AutoTuner configuration across multiple examples by adding `set_profile_args` to better manage profiling settings.
* Standardized the use of `target="auto"` in compile arguments to ensure automatic target detection.
* Removed redundant target specifications in certain instances to streamline the configuration process.
* Improved overall clarity and maintainability of the autotuning logic in various example scripts.
* [Refactor] Simplify code formatting and improve readability in example scripts
* Consolidated function argument formatting in `benchmark_mla_decode_amd_tilelang.py`, `example_elementwise_add.py`, and `performance.py` for better clarity.
* Removed unnecessary line breaks and standardized argument placement across multiple files.
* Enhanced overall code readability and maintainability in autotuning examples and performance scripts.
* [Refactor] Update JIT decorator usage across multiple files
* Removed redundant parameters from the JIT decorator in various benchmark and example scripts, simplifying the code.
* Standardized the import of the JIT decorator from `tilelang`, enhancing consistency across the codebase.
* Improved overall readability and maintainability by consolidating import statements and cleaning up function definitions.
* [Refactor] Standardize JIT decorator formatting across benchmark and example scripts
* Simplified the formatting of the JIT decorator in multiple files by removing unnecessary line breaks.
* Enhanced code readability and consistency in the usage of the JIT decorator across benchmark and example scripts.
* Improved overall maintainability by ensuring uniformity in function definitions and decorator usage.
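A minimal sketch of how a tuning-result cache key can be derived from a kernel configuration, as the caching described above requires; the hashing scheme and field names are assumptions, not the AutoTuner's actual implementation.

```python
import hashlib
import json

def make_cache_key(func_name: str, config: dict, arch: str) -> str:
    # Serialize deterministically so identical configs always hash the same.
    payload = json.dumps({"func": func_name, "config": config, "arch": arch},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

key = make_cache_key("matmul",
                     {"block_M": 128, "block_N": 128, "num_stages": 2},
                     "sm_90")
print(key[:16])  # short prefix usable as a cache-directory name
```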
-
Lei Wang authored
* Refactor convolution example to streamline configuration and remove unused code
* Updated the `check_hopper` function to properly check for CUDA availability and compute capability.
* Removed the `get_configs` and `get_best_config` functions, simplifying the example by eliminating unused autotuning logic.
* Adjusted argument parsing in the `main` function to directly compile the convolution kernel without autotuning options.
* Cleaned up the code for better readability and maintainability.
* Update examples/convolution/example_convolution.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
--------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
-
- 27 May, 2025 1 commit
-
-
Leslin authored
* [CI] Add gemm and gemm_fp8 example to CI * Fix lint via format.sh * Resolved the issues with profiler API usage and parse_args
-
- 24 May, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
* Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
* Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
* Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
* Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
* Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
* [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
* Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
* Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
* Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
* Add dynamic matrix multiplication example and corresponding test
* Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
* Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
* The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
* lint fix
* Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
* Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
* Updated the `__init__.py` file to include the new function in the module exports.
* lint fix
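The commit's `get_num_sms` lives in `cuda_driver.py` and queries the CUDA driver directly; a rough functional equivalent can be sketched via PyTorch's device properties (an illustration, not the actual implementation).

```python
import torch

def get_num_sms(device_id: int = 0) -> int:
    # Number of streaming multiprocessors on the given CUDA device.
    return torch.cuda.get_device_properties(device_id).multi_processor_count

if torch.cuda.is_available():
    print("SMs on device 0:", get_num_sms(0))
```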
-
- 23 May, 2025 3 commits
-
-
Taoyu Zhu authored
* fix deepgemm example * fix deepgemm example * make format * Update example_deepgemm_fp8_2xAcc.py --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Yu Cheng authored
* Introduced `example_grouped_gemm_fwd.py` and `example_grouped_gemm_bwd.py` to demonstrate grouped matrix multiplication with forward and backward operations. * Implemented functions for grouped GEMM, input construction, and validation against PyTorch's implementation. * Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions. * Included a test function to validate the functionality with various input scenarios.
-
Yu Cheng authored
* Introduced a new example script `example_grouped_gemm.py` demonstrating grouped matrix multiplication using TileLang and PyTorch. * Implemented functions for performing grouped GEMM, constructing inputs, and validating results against PyTorch's implementation. * Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions. * Included a test function to validate the grouped GEMM functionality with various input scenarios.
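Validation against PyTorch in these grouped-GEMM examples boils down to a per-group matmul; a minimal reference of that shape is sketched below (group sizes and dtypes are assumptions for illustration).

```python
import torch

def grouped_gemm_ref(a_list, b_list):
    # Reference grouped GEMM: each group multiplies its own (M_i, K) x (K, N) pair.
    return [a @ b for a, b in zip(a_list, b_list)]

group_shapes = [(64, 32), (128, 32), (96, 32)]  # assumed per-group (M_i, K)
N = 48
A = [torch.randn(m, k) for m, k in group_shapes]
B = [torch.randn(k, N) for _, k in group_shapes]
C = grouped_gemm_ref(A, B)
print([tuple(c.shape) for c in C])  # [(64, 48), (128, 48), (96, 48)]
```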
-
- 18 May, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Update JIT kernel functions and streamline GEMM tests
* Renamed and refactored matmul and run_gemm functions to matmul_kernel_jit and run_gemm_kernel_jit for clarity.
* Removed redundant JIT decorator from the matmul function, ensuring it is applied only to the kernel function.
* Updated test function names to reflect changes in the kernel functions, enhancing consistency and readability.
* Cleaned up commented-out code and unnecessary imports to improve overall code quality.
* Update main function call in GEMM test to use tilelang testing framework
* Update README and example scripts to include JIT decorator comments
* Added comments in README.md and various example scripts to indicate the use of the @tilelang.jit decorator for returning torch functions.
* Removed redundant comments that previously instructed to add the decorator, streamlining the documentation and improving clarity.
* Update GEMM test parameters for improved performance
* Set num_stages to 0 and adjusted matrix dimensions in test functions to enhance performance and consistency across GEMM tests in test_tilelang_kernel_gemm.py.
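The pattern the commit settles on, applying `@tilelang.jit` to the kernel-producing function so that calling it returns a torch-callable kernel, looks roughly like the repository's GEMM examples. The sketch below follows that style; the decorator options, block sizes, and `T.Tensor` signatures are assumptions tied to the tilelang version, not the test's exact code.

```python
import tilelang
import tilelang.language as T
import torch

@tilelang.jit(out_idx=[-1])  # assumed option: treat the last argument as the output
def matmul(M, N, K, block_M=128, block_N=128, block_K=32, dtype="float16"):

    @T.prim_func
    def main(A: T.Tensor((M, K), dtype), B: T.Tensor((K, N), dtype),
             C: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                      threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), "float")
            T.clear(C_local)
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=2):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])

    return main

kernel = matmul(1024, 1024, 1024)  # compiling returns a torch-callable kernel
a = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
b = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
c = kernel(a, b)
torch.testing.assert_close(c, (a @ b).to(c.dtype), rtol=1e-2, atol=1e-2)
```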
-
- 17 May, 2025 1 commit
-
-
Lei Wang authored
* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
* Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
* Add merge shared memory allocations pass and related configurations
  - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
  - Registered configuration options for debugging and controlling the merging behavior.
  - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
  - Adjusted import paths and added documentation for the new functionality.
* Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
* Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.
* lint fix
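A minimal sketch of the signal-based timeout guard described above; the exception name mirrors the commit's `TimeoutException`, while the function body is an illustration rather than the autotuner's actual `run_with_timeout`.

```python
import signal

class TimeoutException(Exception):
    """Raised when a profiled candidate exceeds its time budget."""

def run_with_timeout(fn, timeout_s: int, *args, **kwargs):
    # POSIX-only sketch: SIGALRM interrupts fn if it runs past timeout_s.
    def _handler(signum, frame):
        raise TimeoutException(f"call exceeded {timeout_s}s")

    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(timeout_s)
    try:
        return fn(*args, **kwargs)
    finally:
        signal.alarm(0)                        # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)

print(run_with_timeout(lambda: sum(range(1000)), timeout_s=5))
```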
-
- 16 May, 2025 2 commits
-
-
Yu Cheng authored
* [Refactor] Update example_mla_decode.py and add tests for block_sparse_attn_tilelang
* Refactor example_mla_decode.py to define a main function for better structure and clarity.
* Introduce test_example_mla_decode.py to validate the functionality of example_mla_decode.
* Refactor block_sparse_attn_tilelang.py to define a main function and add test_block_sparse_attn_tilelang.py for testing.
* Ensure all new test files are integrated with tilelang testing framework.
* [Test] Enhance test_example_mla_decode with argument mocking
* Update test_example_mla_decode.py to mock sys.argv for better test isolation.
* Ensure the main function of example_mla_decode is called with the correct arguments during testing.
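Mocking `sys.argv` for test isolation, as this commit does, typically uses `unittest.mock.patch`; below is a self-contained sketch with a stand-in `main` (the real test calls `example_mla_decode.main`, and the flag names here are assumptions).

```python
import sys
from unittest import mock

def main():
    # Stand-in for example_mla_decode.main(): just echoes the CLI args it sees.
    print("argv seen by argparse:", sys.argv[1:])

def test_main_with_mocked_argv():
    # Patch sys.argv so the example's argument parser sees controlled flags.
    with mock.patch("sys.argv", ["example_mla_decode.py", "--batch", "1"]):
        main()

test_main_with_mocked_argv()
```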
-
Lei Wang authored
* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
* Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
* Add merge shared memory allocations pass and related configurations
  - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
  - Registered configuration options for debugging and controlling the merging behavior.
  - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
  - Adjusted import paths and added documentation for the new functionality.
* Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
-
- 14 May, 2025 1 commit
-
-
Lei Wang authored
[Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example (#494) * Remove deprecated example_dequant_gemm.py and add DataType import in __init__.py * lint fix * lint fix * Refactor dequantization examples to use tilelang imports and update data type handling in quantization utilities * lint fix
-
- 13 May, 2025 1 commit
-
-
徐畅 authored
* [CI] Add flash_decoding example to CI * Add output of ref latency * format example_gqa_decode.py
-
- 10 May, 2025 3 commits
-
-
yyttt6 authored
* yes * [Bugfix] fix the unexpected keyword error of autotune * format * test * [CI] Add Analyzer and blocksparse_attention examples to CI * format * try * try * try * try * t * format * d --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Wenhao Xie authored
* add convolution example to CI * lint fix * Update test_example_convolution.py * fix bug --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Wenhao Xie authored
* add convolution example to CI * lint fix * Update test_example_convolution.py --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 09 May, 2025 2 commits
-
-
Zhengju Tang authored
* [Refactor] Enhance TMA barrier validation and support for additional architectures (#463)
* Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
* Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
* [CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI.
* Lint
--------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
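The architecture gate described above reduces to membership in a constant; a sketch of the idea follows, where the tuple contents and function name are assumptions rather than the values in `phase.py`.

```python
# Assumed contents: TMA is a Hopper-generation feature, so sm_90-class GPUs.
SUPPORTED_TMA_ARCHS = (90,)

def target_supports_tma(compute_capability_major: int) -> bool:
    # Centralizing the check makes adding future architectures a one-line change.
    return compute_capability_major in SUPPORTED_TMA_ARCHS

print(target_supports_tma(90), target_supports_tma(80))  # True False
```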
-
Cunxiao Ni authored
* [CI] Add elementwise and gemv examples to CI. * fix lint * test * fix gemv lint * fix lint
-
- 08 May, 2025 2 commits
-
-
Lei Wang authored
[Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py (#457)
* Refactored barrier functions to use new signatures for improved clarity and consistency.
* Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively.
* Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.
-
Lei Wang authored
* Add example for warp specialization with flash attention
* Introduced a new example script `example_warp_specialize_flashmla.py` demonstrating flash attention using warp specialization in TileLang.
* Implemented the `flashattn` function with shared memory allocation and memory barrier synchronization for improved performance.
* Added a reference program for validation against PyTorch's implementation, including profiling for latency and performance metrics.
* Removed the outdated `example_warp_specialize_mla.py` to streamline examples and focus on the new implementation.
* Add memory barrier functions to builtin.py
* Introduced `barrier_wait` and `barrier_arrive` functions for memory barrier synchronization.
* Enhanced documentation with detailed docstrings for both functions, clarifying their usage and parameters.
* The `barrier_wait` function serves as a wrapper for `mbarrier_wait_parity`, supporting parity values 0 and 1.
* Improved code organization and readability by adding blank lines for better separation of logical sections.
* Enhance code readability by adding blank lines in example_warp_specialize_flashmla.py and builtin.py
* Added blank lines to improve code organization and separation of logical sections in `example_warp_specialize_flashmla.py`.
* Included blank lines in `builtin.py` around the `wait_wgmma` and `barrier_wait` functions for better readability.
* [Refactor] Update barrier functions and add new example for GEMM with warp specialization
* Refactored memory barrier functions in `example_warp_specialize_flashmla.py` to use the new `barrier_wait` and `barrier_arrive` methods for improved clarity and consistency.
* Introduced a new example script `example_warp_specialize_gemm_copy_gemm_0_1.py` demonstrating matrix multiplication with warp specialization and shared memory allocation.
* Enhanced the `layout.cc` and `elem.cc` files to improve structural equality checks and error handling in copy operations.
* Updated `warpgroup.py` to refine thread ID calculations for better performance in warp specialization scenarios.
* Added new shuffle operations in `builtin.py` for enhanced functionality in parallel computations.
* lint fix
* Update loop variable checks in SIMT loop and buffer region validation
* Modified checks in `elem.cc` to ensure loop variable sizes are less than or equal to source and destination range sizes for better error handling.
* Adjusted assertions in `copy.py` to reflect the updated logic, allowing for more flexible region extent comparisons and improved error messaging.
* lint fix
* test fix
-
- 06 May, 2025 2 commits
-
-
Lei Wang authored
* [Enhancement] Introduce pass_configs parameter for kernel compilation
* Added a new `pass_configs` parameter to the `tilelang.compile` function to allow for more flexible kernel compilation configurations.
* Updated related classes and methods to accommodate the new parameter, ensuring compatibility across the codebase.
* Enhanced the `torch_assert_close` function to include customizable tensor names for better debugging output.
* Refactored input handling in example scripts to streamline the process of obtaining inputs for kernel execution.
* lint fix
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
* Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
* Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
* Updated function names from CamelCase to snake_case for better consistency across the codebase.
* Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
* Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
* [Refactor] Rename operations to snake_case for consistency
* Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
* Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
* Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
* Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
* Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
* Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
* Enhanced the TileLang API with new methods for retrieving block and thread extents.
* Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
* Improved layout inference and kernel launch logic for better performance and clarity.
* [Refactor] Clean up code formatting and improve readability
* Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
* Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
* Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
* Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
* lint fix
* [Refactor] Update mbarrier functions for improved clarity and consistency
* Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
* Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
* Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
* Added detailed docstrings to clarify usage examples for memory barrier functions.
* Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
* [Feature] Add examples for warp specialization and TMA barrier integration
* Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
* Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
* Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
* Updated the `phase.py` to include TMA barrier injection in the optimization process.
* Improved documentation and comments for better clarity on usage and functionality.
* [Feature] Add example for warp specialization in GEMM with TMA barriers
* Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
* Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
* Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
* Enhanced documentation and comments for clarity on usage and functionality.
* lint fix
* [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
* Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
* Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
* Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
* This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
* lint fix
* [Feature] Add new examples for warp specialization and TMA integration
* Introduced multiple new example scripts demonstrating warp specialization techniques, including `example_warp_specialize_flashmla.py`, `example_warp_specialize_gemm_barrierpipe_stage2.py`, `example_warp_specialize_gemm_copy_0_gemm_1.py`, `example_warp_specialize_gemm_copy_1_gemm_0.py`, and `example_warp_specialize_gemm_softpipe_stage2.py`.
* Each example showcases matrix multiplication with warp specialization and TMA barriers, implementing kernel functions with shared memory allocation and memory barrier synchronization for enhanced performance.
* Added a test suite in `test_example_warp_specialize.py` to validate the functionality of the new examples.
* Updated the TileLang API to support these examples and improve kernel compilation and testing processes.
* Removed outdated example scripts to streamline the codebase and enhance clarity on available functionalities.
* lint fix
* Remove outdated example scripts for warp specialization and TMA integration to streamline the codebase. This includes `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, `example_warp_specialize_gemm_stage2.py`, and `example_warp_specialize_mla.py`, which are no longer needed following recent updates and improvements in the TileLang API.
-
- 03 May, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
* Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
* Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
* Updated function names from CamelCase to snake_case for better consistency across the codebase.
* Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
* Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
* [Refactor] Rename operations to snake_case for consistency
* Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
* Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
* Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
* Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
* Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
* Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
* Enhanced the TileLang API with new methods for retrieving block and thread extents.
* Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
* Improved layout inference and kernel launch logic for better performance and clarity.
* [Refactor] Clean up code formatting and improve readability
* Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
* Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
* Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
* Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
* lint fix
* [Refactor] Update mbarrier functions for improved clarity and consistency
* Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
* Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
* Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
* Added detailed docstrings to clarify usage examples for memory barrier functions.
* Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
* [Feature] Add examples for warp specialization and TMA barrier integration
* Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
* Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
* Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
* Updated the `phase.py` to include TMA barrier injection in the optimization process.
* Improved documentation and comments for better clarity on usage and functionality.
* [Feature] Add example for warp specialization in GEMM with TMA barriers
* Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
* Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
* Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
* Enhanced documentation and comments for clarity on usage and functionality.
* lint fix
* [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
* Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
* Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
* Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
* This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
* lint fix
-
- 30 Apr, 2025 2 commits
-
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
* Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
* Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
* Updated function names from CamelCase to snake_case for better consistency across the codebase.
* Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
* Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
* [Refactor] Rename operations to snake_case for consistency
* Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
* Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
* Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
* Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
* Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
* Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
* Enhanced the TileLang API with new methods for retrieving block and thread extents.
* Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
* Improved layout inference and kernel launch logic for better performance and clarity.
* [Refactor] Clean up code formatting and improve readability
* Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
* Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
* Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
* Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
* lint fix
* [Refactor] Update mbarrier functions for improved clarity and consistency
* Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
* Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
* Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
* Added detailed docstrings to clarify usage examples for memory barrier functions.
* Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
-
dependabot[bot] authored
Bumps [transformers](https://github.com/huggingface/transformers) from 4.48.0 to 4.50.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.48.0...v4.50.0)
---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.50.0
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 26 Apr, 2025 1 commit
-
-
yyttt6 authored
* yes * [Bugfix] fix the unexpected keyword error of autotune * format * test
-
- 21 Apr, 2025 1 commit
-
-
Lei Wang authored
* [New Feature] Add FP8 Flash Attention Implementation (#412)
* Introduce a new example script for FP8 Flash Attention in `example_mla_decode_kv_fp8.py`, showcasing the use of tilelang for efficient attention computation.
* Implement the `flashattn` function with optimized memory management and kernel execution.
* Include a reference program for comparison and performance evaluation.
* Add command-line argument parsing for batch size, number of heads, and dimensions to facilitate testing and experimentation.
* Enhance the overall structure and readability of the code. This addition aims to improve the performance of attention mechanisms in deep learning models by leveraging FP8 precision and optimized kernel execution.
* lint fix
* optimize quick start
* lint fix
-