- 08 May, 2025 2 commits
-
-
Lei Wang authored
[Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py (#457)
* Refactored barrier functions to use new signatures for improved clarity and consistency.
* Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait`, respectively.
* Removed the argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.
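For context, a minimal before/after sketch of the renamed call sites is shown below. It is a kernel-body fragment, and the exact `T.` spellings are taken from the commit text rather than verified against the released API; the barrier handle `bar` is illustrative.

```python
import tilelang.language as T

def handoff(bar):
    # Hedged fragment meant for a TileLang kernel body; `bar` is an mbarrier handle.
    # Old spelling (before #457):
    #   T.mbarrier_arrive(bar)
    #   T.mbarrier_wait_parity(bar, 0)
    # New spelling (after #457):
    T.barrier_arrive(bar)   # signal that this warp group's work on the tile is done
    T.barrier_wait(bar, 0)  # block until the barrier reaches the expected phase (0 or 1)
```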
-
Lei Wang authored
* Add example for warp specialization with flash attention
  * Introduced a new example script `example_warp_specialize_flashmla.py` demonstrating flash attention using warp specialization in TileLang.
  * Implemented the `flashattn` function with shared memory allocation and memory barrier synchronization for improved performance.
  * Added a reference program for validation against PyTorch's implementation, including profiling for latency and performance metrics.
  * Removed the outdated `example_warp_specialize_mla.py` to streamline examples and focus on the new implementation.
* Add memory barrier functions to builtin.py
  * Introduced `barrier_wait` and `barrier_arrive` functions for memory barrier synchronization.
  * Enhanced documentation with detailed docstrings for both functions, clarifying their usage and parameters.
  * The `barrier_wait` function serves as a wrapper for `mbarrier_wait_parity`, supporting parity values 0 and 1.
  * Improved code organization and readability by adding blank lines for better separation of logical sections.
* Enhance code readability by adding blank lines in example_warp_specialize_flashmla.py and builtin.py
  * Added blank lines to improve code organization and separation of logical sections in `example_warp_specialize_flashmla.py`.
  * Included blank lines in `builtin.py` around the `wait_wgmma` and `barrier_wait` functions for better readability.
* [Refactor] Update barrier functions and add new example for GEMM with warp specialization
  * Refactored memory barrier functions in `example_warp_specialize_flashmla.py` to use the new `barrier_wait` and `barrier_arrive` methods for improved clarity and consistency.
  * Introduced a new example script `example_warp_specialize_gemm_copy_gemm_0_1.py` demonstrating matrix multiplication with warp specialization and shared memory allocation.
  * Enhanced the `layout.cc` and `elem.cc` files to improve structural equality checks and error handling in copy operations.
  * Updated `warpgroup.py` to refine thread ID calculations for better performance in warp specialization scenarios.
  * Added new shuffle operations in `builtin.py` for enhanced functionality in parallel computations.
* lint fix
* Update loop variable checks in SIMT loop and buffer region validation
  * Modified checks in `elem.cc` to ensure loop variable sizes are less than or equal to source and destination range sizes for better error handling.
  * Adjusted assertions in `copy.py` to reflect the updated logic, allowing for more flexible region extent comparisons and improved error messaging.
* lint fix
* test fix
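The wrapper relationship described above (`barrier_wait` over `mbarrier_wait_parity`, parity restricted to 0 and 1) can be pictured with the small sketch below; the real builtin.py implementation may differ, and the `T.` spelling of the underlying intrinsic is an assumption based on the commit text.

```python
import tilelang.language as T

def barrier_wait(barrier, parity: int):
    """Hedged sketch of the wrapper described above; the actual builtin.py code may differ.

    Blocks until `barrier` reaches the given phase. Only parity values 0 and 1
    are meaningful, matching the double-buffered producer/consumer pattern used
    by the warp-specialized examples.
    """
    assert parity in (0, 1), "parity encodes the mbarrier phase and must be 0 or 1"
    return T.mbarrier_wait_parity(barrier, parity)
```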
-
- 06 May, 2025 2 commits
-
-
Lei Wang authored
* [Enhancement] Introduce pass_configs parameter for kernel compilation
  * Added a new `pass_configs` parameter to the `tilelang.compile` function to allow for more flexible kernel compilation configurations.
  * Updated related classes and methods to accommodate the new parameter, ensuring compatibility across the codebase.
  * Enhanced the `torch_assert_close` function to include customizable tensor names for better debugging output.
  * Refactored input handling in example scripts to streamline the process of obtaining inputs for kernel execution.
* lint fix
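A hedged usage sketch of the new parameter follows; the dictionary key shown is a hypothetical placeholder (the commit does not list the supported pass names), and `my_gemm` / `out_idx` stand in for a prim_func and output marker defined elsewhere.

```python
import tilelang

# Hedged sketch: `my_gemm` is a TileLang prim_func defined elsewhere; the
# pass_configs key is a hypothetical placeholder, not a documented option.
kernel = tilelang.compile(
    my_gemm,
    out_idx=[-1],                                # assumed: marks the output argument
    pass_configs={"tl.some_pass_option": True},  # forwarded to the lowering pipeline
)
```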
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
  * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
  * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
  * Updated function names from CamelCase to snake_case for better consistency across the codebase.
  * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
  * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
* [Refactor] Rename operations to snake_case for consistency
  * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
  * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
  * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
  * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
  * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
  * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
  * Enhanced the TileLang API with new methods for retrieving block and thread extents.
  * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
  * Improved layout inference and kernel launch logic for better performance and clarity.
* [Refactor] Clean up code formatting and improve readability
  * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
  * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
  * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
  * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
* lint fix
* [Refactor] Update mbarrier functions for improved clarity and consistency
  * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
  * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
  * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
  * Added detailed docstrings to clarify usage examples for memory barrier functions.
  * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
* [Feature] Add examples for warp specialization and TMA barrier integration
  * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
  * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
  * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
  * Updated the `phase.py` to include TMA barrier injection in the optimization process.
  * Improved documentation and comments for better clarity on usage and functionality.
* [Feature] Add example for warp specialization in GEMM with TMA barriers
  * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
  * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
  * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
  * Enhanced documentation and comments for clarity on usage and functionality.
* lint fix
* [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
  * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
  * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
  * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
  * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
* lint fix
* [Feature] Add new examples for warp specialization and TMA integration
  * Introduced multiple new example scripts demonstrating warp specialization techniques, including `example_warp_specialize_flashmla.py`, `example_warp_specialize_gemm_barrierpipe_stage2.py`, `example_warp_specialize_gemm_copy_0_gemm_1.py`, `example_warp_specialize_gemm_copy_1_gemm_0.py`, and `example_warp_specialize_gemm_softpipe_stage2.py`.
  * Each example showcases matrix multiplication with warp specialization and TMA barriers, implementing kernel functions with shared memory allocation and memory barrier synchronization for enhanced performance.
  * Added a test suite in `test_example_warp_specialize.py` to validate the functionality of the new examples.
  * Updated the TileLang API to support these examples and improve kernel compilation and testing processes.
  * Removed outdated example scripts to streamline the codebase and enhance clarity on available functionalities.
* lint fix
* Remove outdated example scripts for warp specialization and TMA integration to streamline the codebase. This includes `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, `example_warp_specialize_gemm_stage2.py`, and `example_warp_specialize_mla.py`, which are no longer needed following recent updates and improvements in the TileLang API.
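Much of the above revolves around the explicit-parameter mbarrier calls and double-buffered parity. A minimal kernel-body fragment is sketched below, assuming the `T.mbarrier_arrive` / `T.mbarrier_wait_parity` spellings from `builtin.py`; the barrier list, loop bounds, and parity schedule are illustrative rather than the exact logic of `gemm_ws.py`.

```python
import tilelang.language as T

def double_buffered_handoff(bar_k_ready, num_iters):
    # Hedged kernel-body fragment; barrier handles and names are illustrative.
    for ko in range(num_iters):
        buf = ko % 2            # which of the two shared-memory buffers is in flight
        phase = (ko // 2) % 2   # parity flips each time the same buffer is reused
        T.mbarrier_arrive(bar_k_ready[buf])               # producer: tile data is ready
        T.mbarrier_wait_parity(bar_k_ready[buf], phase)   # consumer: wait for that tile
```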
-
- 03 May, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
  * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
  * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
  * Updated function names from CamelCase to snake_case for better consistency across the codebase.
  * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
  * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
* [Refactor] Rename operations to snake_case for consistency
  * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
  * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
  * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
  * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
  * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
  * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
  * Enhanced the TileLang API with new methods for retrieving block and thread extents.
  * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
  * Improved layout inference and kernel launch logic for better performance and clarity.
* [Refactor] Clean up code formatting and improve readability
  * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
  * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
  * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
  * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
* lint fix
* [Refactor] Update mbarrier functions for improved clarity and consistency
  * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
  * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
  * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
  * Added detailed docstrings to clarify usage examples for memory barrier functions.
  * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
* [Feature] Add examples for warp specialization and TMA barrier integration
  * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
  * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
  * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
  * Updated the `phase.py` to include TMA barrier injection in the optimization process.
  * Improved documentation and comments for better clarity on usage and functionality.
* [Feature] Add example for warp specialization in GEMM with TMA barriers
  * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
  * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
  * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
  * Enhanced documentation and comments for clarity on usage and functionality.
* lint fix
* [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
  * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
  * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
  * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
  * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
* lint fix
-
- 30 Apr, 2025 2 commits
-
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
  * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
  * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
  * Updated function names from CamelCase to snake_case for better consistency across the codebase.
  * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
  * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
* [Refactor] Rename operations to snake_case for consistency
  * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
  * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
  * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
  * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
  * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
  * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
  * Enhanced the TileLang API with new methods for retrieving block and thread extents.
  * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
  * Improved layout inference and kernel launch logic for better performance and clarity.
* [Refactor] Clean up code formatting and improve readability
  * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
  * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
  * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
  * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
* lint fix
* [Refactor] Update mbarrier functions for improved clarity and consistency
  * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
  * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
  * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
  * Added detailed docstrings to clarify usage examples for memory barrier functions.
  * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
-
dependabot[bot] authored
Bumps [transformers](https://github.com/huggingface/transformers) from 4.48.0 to 4.50.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.48.0...v4.50.0)
---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.50.0
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 26 Apr, 2025 1 commit
-
-
yyttt6 authored
* yes
* [Bugfix] fix the unexpected keyword error of autotune
* format
* test
-
- 21 Apr, 2025 1 commit
-
-
Lei Wang authored
* [New Feature] Add FP8 Flash Attention Implementation (#412)
  * Introduce a new example script for FP8 Flash Attention in `example_mla_decode_kv_fp8.py`, showcasing the use of tilelang for efficient attention computation.
  * Implement the `flashattn` function with optimized memory management and kernel execution.
  * Include a reference program for comparison and performance evaluation.
  * Add command-line argument parsing for batch size, number of heads, and dimensions to facilitate testing and experimentation.
  * Enhance the overall structure and readability of the code.
  This addition aims to improve the performance of attention mechanisms in deep learning models by leveraging FP8 precision and optimized kernel execution.
* lint fix
* optimize quick start
* lint fix
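The command-line surface the example adds can be sketched as below; the flag names and default values are assumptions, the commit only states that batch size, number of heads, and dimensions are configurable.

```python
import argparse

def parse_args():
    # Hedged sketch of the example's CLI; flag names and defaults are illustrative.
    parser = argparse.ArgumentParser(description="FP8 MLA decode flash attention example")
    parser.add_argument("--batch", type=int, default=1, help="batch size")
    parser.add_argument("--heads", type=int, default=128, help="number of attention heads")
    parser.add_argument("--dim", type=int, default=512, help="head dimension")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"batch={args.batch}, heads={args.heads}, dim={args.dim}")
```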
-
- 16 Apr, 2025 2 commits
-
-
dependabot[bot] authored
Bumps [transformers](https://github.com/huggingface/transformers) from 4.40 to 4.48.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.40.0...v4.48.0)
---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.48.0
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
Lei Wang authored
* Update copyright notice in example_mha_bwd_wgmma_pipelined.py to reflect Tile-AI Corporation ownership. * lint fix
-
- 14 Apr, 2025 1 commit
-
-
Lei Wang authored
* Update README.md for deepseek_mla: Refine performance comparison details and add acknowledgment section. Adjusted performance metrics for TileLang, highlighting its efficiency over Triton and assembly kernels. Included gratitude to the AMD ROCm team for their contributions.
* Update README.md for deepseek_mla: Clarify performance metrics for TileLang, specifying the range of performance parity with hand-optimized assembly kernels. This adjustment enhances the accuracy of the comparative analysis against Triton implementations.
-
- 13 Apr, 2025 1 commit
-
-
Zhengju Tang authored
* [Dynamic Symbolic] Add pass_config to customize vectorization and tail split
* Lint
* Only check for vectorized dimension. Add docs.
* Lint
* Update comment for cache directory in .gitignore
* Use CUTLASS convention to represent dynamic alignment. Fix bugs
* Add benchmark examples
* Add more benchmarks. Fix accumulate type bug.
* Lint
* Lint
* Test Lint
* Lint
* Test Lint
* Lint
* Fix typo
* Lint
* Lint
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 12 Apr, 2025 3 commits
-
-
Lei Wang authored
* Update legalize_safe_memory_access.cc
* Add cache path handling and file locking in Cython adapter
  - Introduced a new cache path based on the code hash for the Cython JIT adapter, enhancing cache management.
  - Added a lock file mechanism to ensure safe access during cache operations, improving concurrency handling.
  - These changes aim to optimize the compilation process and prevent race conditions during library loading.
* lint fix
* refactor
* refactor
* Add GlobalCopyPatternDetector to identify global memory copy patterns
  - Introduced a new class, GlobalCopyPatternDetector, to detect specific memory copy patterns in statements.
  - Enhanced the PipelinePlanner to utilize this detector for determining copy stages based on global and local memory scopes.
  - Improved code clarity and maintainability by encapsulating detection logic within the new class.
* Refactor copy stage detection logic in pipeline planning
  - Simplified the determination of copy stages by directly assigning the result of GlobalCopyPatternDetector to pinfo.copy_stage.
  - Removed redundant checks for read and write scopes, enhancing code clarity and maintainability.
* lint fix
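The cache-path-plus-lock-file pattern described above follows a common recipe; the sketch below is illustrative only, with the cache root, file names, and the use of the third-party `filelock` package all being assumptions rather than the adapter's actual implementation.

```python
import hashlib
import os
from filelock import FileLock  # third-party: pip install filelock

def cache_dir_for(source_code: str, root: str = os.path.expanduser("~/.tilelang_cache")) -> str:
    # Derive a stable cache directory from a hash of the generated code.
    code_hash = hashlib.sha256(source_code.encode()).hexdigest()
    path = os.path.join(root, code_hash)
    os.makedirs(path, exist_ok=True)
    return path

def compile_with_lock(source_code: str, build_fn, load_fn):
    # build_fn/load_fn are caller-provided callables (hypothetical stand-ins for
    # the real build and dlopen steps).
    path = cache_dir_for(source_code)
    lib = os.path.join(path, "kernel.so")
    # The lock file serializes concurrent processes building/loading the same library.
    with FileLock(lib + ".lock"):
        if not os.path.exists(lib):
            build_fn(source_code, lib)
        return load_fn(lib)
```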
-
Lei Wang authored
* Add deepseek_mla to documentation index (#380) * lint fix
-
Lei Wang authored
* [Add] Introduce deepseek_mla documentation for high-performance FlashMLA with TileLang
  - Added a comprehensive guide on writing high-performance kernels using TileLang, focusing on the Multi-Head Latent Attention (MLA) mechanism.
  - Included benchmark results comparing FlashMLA, TileLang, Torch, Triton, and FlashInfer, highlighting TileLang's efficiency and ease of use.
  - Detailed implementation strategies, including layout inference, threadblock swizzling, shared memory swizzling, and warp specialization.
  - Provided examples and explanations of optimization techniques to enhance performance in GPU kernel programming.
* doc update
* [Add] Enhance AMD FlashMLA implementation and documentation
  - Refactored variable names in `benchmark_mla_decode_amd_tilelang.py` for clarity, changing `Q_shared` and `Q_pe_shared` to `Q_local` and `Q_pe_local` to reflect their usage in register allocation.
  - Added a new `README.md` detailing the high-performance FlashMLA implementation on AMD MI300X accelerators, including architectural considerations, optimization strategies, and performance evaluation.
  - Introduced a performance comparison figure to illustrate the efficiency of the TileLang implementation against other frameworks.
* lint fix
* [Add] Expand deepseek_mla documentation for AMD MI300X optimization strategies
  - Introduced a new section detailing architectural differences and optimization strategies for implementing FlashMLA on AMD MI300X accelerators.
  - Highlighted key considerations such as instruction set variations, shared memory constraints, tile size flexibility, and memory bank conflict swizzling.
  - Included performance evaluation results demonstrating TileLang's efficiency compared to other frameworks.
  - Discussed future optimization opportunities for memory bank conflict mitigation and dimension parallelization.
-
- 10 Apr, 2025 2 commits
-
-
Haodong Tian authored
* [Bugfix] Adjust Autotuner threadpool `max_workers` limit to available CPUs
* [Example] Small fix on example_blocksparse_gemm.py
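The clamp described in the first bullet is essentially the following; the surrounding autotuner attribute names are not shown in the commit, so the variables here are illustrative.

```python
import os
from concurrent.futures import ThreadPoolExecutor

requested = 64  # illustrative: whatever worker count the autotuner previously requested
max_workers = min(requested, os.cpu_count() or 1)  # never exceed the available CPUs
executor = ThreadPoolExecutor(max_workers=max_workers)
```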
-
Lei Wang authored
* [Add] Introduce benchmark scripts for MLA decoding with AMD support
  - Added three new benchmark scripts: `benchmark_mla_decode_amd_tilelang.py`, `benchmark_mla_decode_amd_torch.py`, and `benchmark_mla_decode_amd_triton.py` to evaluate the performance of the MLA decoding mechanism across different frameworks.
  - Each script includes implementations for attention calculation, performance profiling, and output validation against reference implementations.
  - Enhanced command-line argument parsing for customizable input parameters, including batch size, number of heads, and dimensions.
  - Integrated performance comparison functionality to facilitate benchmarking between different implementations.
* lint fix
* lint fix
---------
Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
-
- 09 Apr, 2025 3 commits
-
-
Lei Wang authored
* [Bugfix] Correct dynamic shared memory size error handling in HIP wrapper
  - Updated the error handling logic in `PREDEF_ATTRIBUTE_SET_DYNAMIC_MEMORY_HIP` to check if the dynamic shared memory size exceeds the maximum limit of 65536.
  - Improved error message clarity by specifying the function name and the attempted size, ensuring better debugging information.
  - Ensured the function returns 0 upon successful setting of the dynamic shared memory size.
* [Add] Implement example for MLA decoding with AMD support
  - Introduced a new example script `example_mla_decode_amd.py` demonstrating the use of the flash attention mechanism with AMD hardware.
  - Implemented functions for attention calculation, including support for split processing and combining outputs.
  - Added command-line argument parsing for customizable input parameters such as batch size, number of heads, and dimensions.
  - Included a reference implementation for validation against the Tile-AI output, ensuring correctness of the implementation.
  - Enhanced performance profiling and output comparison for debugging and optimization purposes.
* lint fix
-
Yuqing Xia authored
[Example] Handle Scenarios in Which a Threadblock is Assigned Only Invalid Block Indices for Sparse Attention (#361)
* Fix issue where a threadblock with only invalid blocks produces incorrect output.
* fix score scale
* format
-
Yu Cheng authored
* Added a new example script `example_gemm_autotune.py` to demonstrate autotuning for matrix multiplication (GEMM) using TileLang.
* Implemented functions for generating configurations, selecting the best configuration, and benchmarking performance.
* Refactored the existing `matmul` function to support dynamic configuration parameters and improved kernel compilation.
* Updated the main execution block to include command-line argument parsing for matrix dimensions and autotuning options.
* Enhanced the example to validate results against a reference implementation, ensuring correctness in matrix multiplication operations.
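A hedged sketch of the config-generation half of such an example is shown below; the parameter names mirror common TileLang GEMM knobs, but the exact value sets used by `example_gemm_autotune.py` are assumptions.

```python
import itertools

def get_configs():
    # Illustrative search space; the real example's values may differ.
    block_M = [64, 128, 256]
    block_N = [64, 128, 256]
    block_K = [32, 64]
    num_stages = [2, 3]
    threads = [128, 256]
    return [
        dict(block_M=m, block_N=n, block_K=k, num_stages=s, threads=t)
        for m, n, k, s, t in itertools.product(block_M, block_N, block_K, num_stages, threads)
    ]

# Each config would be compiled and benchmarked, keeping the lowest-latency one.
print(len(get_configs()), "candidate configurations")
```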
-
- 08 Apr, 2025 2 commits
-
-
Yu Cheng authored
- Modified the `group_per_split_token_cast_to_fp8` function to support `bfloat16`, `float`, and `float16` data types.
- Updated local fragment allocations to use the new `accum_dtype` for consistency.
- Enhanced the main execution block to handle different tensor data types based on the specified `dtype`, improving flexibility in tensor operations.
-
Lei Wang authored
* [Refactor] Update import structure in benchmark_mla.py
  - Moved the import of `flash_mla` functions to the `run_flash_mla` function for better encapsulation.
  - Added a comment for `flashinfer` installation to clarify dependencies.
  - Cleaned up unused imports to enhance code readability.
* lint fix
-
- 07 Apr, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Update GEMM examples and autotuner for improved performance
  - Modified `example_gemm_intrinsics.py` to enhance matrix multiplication configurations, increasing warp sizes and adjusting data types for better performance.
  - Updated the kernel compilation process to utilize the new `tilelang.compile` method and improved latency measurement with the profiler.
  - Refactored `example_gemm.py` to include a new autotuning configuration and ensure consistency in latency checks against reference results.
  - Adjusted tensor supply generation in `tilelang/utils/tensor.py` to use `torch.randn` for better randomness in tensor initialization.
  - Enhanced the `JITContext` in `tilelang/autotuner/__init__.py` to replace the profiler with a kernel instance for performance measurement, improving the overall structure of the autotuner.
* bug fix
* fix
* [Enhancement] Update convolution tests and profiling assertions
  - Added a random seed setting for reproducibility in convolution tests.
  - Removed several redundant convolution test cases to streamline the testing process.
  - Updated the assertion in the matrix multiplication profiling to include a maximum mismatched ratio for improved accuracy in results.
  - Enabled the main testing function for better test execution.
* lint fix
-
- 06 Apr, 2025 2 commits
-
-
Yu Cheng authored
- Modified the `group_per_split_token_cast_to_fp8` function to include a conditional check for batch sizes, ensuring that the scaling factor is applied only when within the valid range. This change enhances the robustness of the FP8 conversion process for grouped per-split tokens.
-
YizhaoGao authored
* [Example] Add triton block sparse gqa decode
* lint fix
---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
-
- 05 Apr, 2025 3 commits
-
-
Yu Cheng authored
Implements FP8 type conversion functionality for grouped per-split tokens. The script includes several helper functions for handling tensor TMA alignment and FP8 conversion, enhancing support for FP8 data types and providing performance benchmarks. This change provides users with more flexible examples of FP8 operations.
-
Lei Wang authored
* [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
  - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
  - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
  - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
  - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
* [Refactor] Clean up whitespace in CUDA-related files
  - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
  - This change enhances the overall organization of the codebase without altering functionality.
* [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
  - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
  - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
  - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
  - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
* lint fix
* Update submodule and enhance FP8 type handling in CUDA codegen
  - Updated the TVM submodule to the latest commit.
  - Modified FP8 type handling in `codegen_cuda.cc` to use more descriptive type codes.
  - Improved constant printing for FP8 and bfloat16 types, ensuring correct representation in generated code.
  - Added error handling for missing configuration keys in the AutoTuner class.
* lint fix
* Remove print statement from example script
* lint fix
* fix
---------
Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
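For illustration, the two device-property helpers named above can be implemented directly against the CUDA driver API; the sketch below uses ctypes and is only an assumption about the shape of the `cuda_driver` module, which may well use different bindings internally.

```python
import ctypes

# On some systems the library is "libcuda.so.1"; adjust as needed.
_cuda = ctypes.CDLL("libcuda.so")
_cuda.cuInit(0)

CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK = 8  # CUDA driver API enum value

def get_device_name(device_id: int = 0) -> str:
    dev = ctypes.c_int()
    _cuda.cuDeviceGet(ctypes.byref(dev), device_id)
    name = ctypes.create_string_buffer(256)
    _cuda.cuDeviceGetName(name, 256, dev)
    return name.value.decode()

def get_shared_memory_per_block(device_id: int = 0) -> int:
    dev = ctypes.c_int()
    _cuda.cuDeviceGet(ctypes.byref(dev), device_id)
    value = ctypes.c_int()
    _cuda.cuDeviceGetAttribute(
        ctypes.byref(value), CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK, dev)
    return value.value
```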
-
Yuqing Xia authored
* add example gqa decode wgmma pipelined
* add sparse gqa
* support num split
* support num split
* add if condition
* add heuristic num split
* clean code
* add ref
* fix bug
* add torch ref
* fix bug
* integrate to torch
* symbolic
* clean mask
* rm actual_num_blocks
* clean code
* get num_sm via torch
* add sparse gqa decode example
* format
* rm example_gqa_decode_wgmma_pipelined.py
* Add license headers to example scripts
* format
* Remove commented-out cache disabling lines
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
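The "get num_sm via torch" step above is a one-liner; the split heuristic sketched after it is an illustrative assumption, not the example's actual policy.

```python
import torch

# Number of streaming multiprocessors on the current GPU.
num_sm = torch.cuda.get_device_properties(0).multi_processor_count

def pick_num_split(batch: int, heads: int, num_sm: int) -> int:
    # Illustrative heuristic: keep doubling the KV split until there are roughly
    # enough threadblocks to occupy every SM.
    blocks = batch * heads
    split = 1
    while blocks * split < num_sm:
        split *= 2
    return split
```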
-
- 04 Apr, 2025 3 commits
-
-
Yu Cheng authored
- Introduced a new local fragment for squared values to improve performance.
- Updated the computation of the RMS normalization to use the new fragment, enhancing memory efficiency.
- Refactored the final multiplication step to operate on the local fragment instead of shared memory.
- Added a configuration option to the kernel compilation for better control over TMA lowering.
These changes enhance the efficiency and clarity of the RMS normalization implementation.
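For reference, the semantics being computed are plain RMS normalization; a hedged PyTorch sketch (not the TileLang kernel itself) is given below, with the squared values kept in a separate temporary to mirror the local fragment mentioned above. Whether the example applies a learned weight or uses this epsilon is an assumption.

```python
import torch

def rms_norm_ref(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # The kernel accumulates x*x into a local fragment; here that is just a temporary.
    sq = x * x
    return x * torch.rsqrt(sq.mean(dim=-1, keepdim=True) + eps)
```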
-
Lei Wang authored
- Added a manual seed for reproducibility in PyTorch.
- Refactored local variable allocations for better memory management.
- Enhanced parallel processing in the flashattn function to improve performance.
- Updated layout annotations for clarity and efficiency.
These changes optimize the flash attention mechanism and ensure consistent behavior across runs.
-
Zhengju Tang authored
* [Dynamic Symbolic] Adaptively vectorize with different condition expressions
* Format
* Format
* Format
* Format
* Add MIT License headers to Python files
* Simplify return statement in loop vectorization
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 03 Apr, 2025 4 commits
-
-
Lei Wang authored
* [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
  - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
  - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
  - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
  - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
* [Refactor] Clean up whitespace in CUDA-related files
  - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
  - This change enhances the overall organization of the codebase without altering functionality.
* [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
  - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
  - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
  - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
  - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
* lint fix
---------
Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
-
Chunan Zeng authored
-
yyttt6 authored
* refactor autotune
* refactor autotune
* refactor autotune
* refactor autotune
* format init.py
* add tutorial for autotune
* merge
* merge
* format analyzer
* add readme for analyzer
* format
* [Tools] Summarize TFLOPS Information from a tilelang program
* Summarize TFLOPS Information from a tilelang program
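For a GEMM-like program, the TFLOPS summary such an analyzer reports boils down to the arithmetic below; the analyzer's actual interface is not shown in the commit, so this is only the underlying formula.

```python
def gemm_tflops(M: int, N: int, K: int, latency_ms: float) -> float:
    # One multiply and one add per inner-product term gives 2*M*N*K FLOPs.
    flops = 2.0 * M * N * K
    return flops / (latency_ms * 1e-3) / 1e12

print(f"{gemm_tflops(8192, 8192, 8192, 5.0):.2f} TFLOPS")
```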
-
Yu Cheng authored
* [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support
  * Added `example_per_token_cast_to_fp8.py` in examples/cast, providing token-wise FP8 quantization implementation.
  * Added `example_triton_cast_to_fp8.py` in examples/cast, providing Triton-based FP8 quantization implementation.
  * Added support for absolute maximum (absmax) reduction operation in reduce.cc and reduce.h.
  * Implemented `reduce_absmax` function in reduce.py, allowing absolute maximum reduction on input buffers.
  * Updated tilelang.language module to include the new `reduce_absmax` function.
  These changes enhance FP8 quantization capabilities and extend reduction operation support.
* [Enhancement] Update per_token_cast_to_fp8 for improved FP8 quantization
  * Modified the `per_token_cast_to_fp8` function to support variable block sizes and improved memory layout annotations.
  * Adjusted the handling of absolute maximum values and scaling factors for better performance and accuracy.
  * Updated the main execution block to allow for larger matrix dimensions and refined the profiler setup for benchmarking.
  These changes enhance the flexibility and efficiency of the FP8 quantization process.
* lint
* [Dev] Update per_token_cast_fp8.py
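The per-token FP8 cast these examples implement has a compact PyTorch reference, sketched below; 448 is the float8_e4m3 maximum representable magnitude, while the clamp epsilon and the exact tiling used in the TileLang version are assumptions (requires PyTorch with `torch.float8_e4m3fn` support).

```python
import torch

def per_token_cast_to_fp8_ref(x: torch.Tensor):
    # Per-token (row-wise) absmax, mirroring the new reduce_absmax primitive.
    amax = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-4)  # epsilon is illustrative
    scale = amax / 448.0
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale
```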
-
- 31 Mar, 2025 2 commits
-
-
Lei Wang authored
* Remove logging statement from LoopVectorizerDynamic Substitute method for cleaner output.
* Refactor flashattn example to improve CUDA configuration handling
  - Updated the `flashattn` function in `example_gqa_decode.py` to utilize a heuristic configuration based on CUDA device capabilities, enhancing compatibility with different architectures.
  - Replaced local variable allocations with more efficient constructs and removed unnecessary logging statements for cleaner output.
  - Adjusted the `do_bench` method call to streamline performance profiling.
* lint fix
-
Lei Wang authored
* [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
  - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
  - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
* lintfix
* [Refactor] Clean up includes in gemm_sm89.h
  - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
  - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
* [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
  - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
  - Updated the __new__ method to initialize the memory cache and added logic to check the cache before loading from disk.
  - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
  - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
* lint fix
* typofix
* [Refactor] Update matmul and flashattn function calls to return structured results
  - Modified the matmul and flashattn function calls to return a single object containing latency, configuration, and reference latency, improving code clarity and reducing the number of returned variables.
  - Updated all relevant instances in benchmark and example scripts to accommodate the new return structure, ensuring consistent usage across the codebase.
* lint fix
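The in-memory plus on-disk lookup that the KernelCache change describes follows a common two-level pattern; the sketch below is illustrative only, with the method names, key scheme, flat on-disk layout, and pickle format all assumed rather than taken from the codebase.

```python
import os
import pickle

class KernelCache:
    _memory_cache = {}  # key -> compiled kernel, shared for the process lifetime

    def get(self, key: str, cache_dir: str):
        if key in self._memory_cache:                 # 1) fast path: in-memory hit
            return self._memory_cache[key]
        path = os.path.join(cache_dir, key + ".pkl")
        if os.path.exists(path):                      # 2) slow path: load from disk
            with open(path, "rb") as f:
                kernel = pickle.load(f)
            self._memory_cache[key] = kernel          # promote to memory for next time
            return kernel
        return None                                   # miss: caller compiles and stores

    def clear_cache(self, cache_dir: str):
        # Clears both levels; a flat file layout in cache_dir is assumed.
        self._memory_cache.clear()
        for name in os.listdir(cache_dir):
            os.remove(os.path.join(cache_dir, name))
```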
-
- 30 Mar, 2025 2 commits
-
-
Leslin authored
* Update elementwise_add.py
  [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
* Update rms_norm.py
  [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
* Remove adapter argument from do_bench call
* Remove adapter argument from do_bench call
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Haodong Tian authored
* [Bugfix] Configure autotuner-specific logger for correct level handling
  - Previously, logging relied on basicConfig, which configured the root logger. This caused the named autotuner logger to ignore DEBUG messages.
  - This commit sets up a dedicated logger for the autotuner, correctly routing DEBUG messages to 'autotuner.log' and INFO+ messages to the console.
* [Bugfix] Fix tensor_supply for boolean type
  - Previously `get_tensor_supply` used `torch.randint(-2, 3)` as a fallback, which caused an error when the dtype was `torch.bool`.
  - This commit adds an `is_boolean` check in `KernelParam` and updates `get_tensor_supply` to specifically use `torch.randint(0, 2)` for boolean dtypes.
* [Bugfix] Always regenerate JIT inputs during tuning
  - Removes the caching for `self.jit_input_tensors` within `AutoTuner`. When different autotuning configurations can alter the required input tensor shapes or other properties, reusing cached inputs from a previous configuration leads to errors or incorrect assessments.
  - This change ensures that `profiler._get_inputs()` is called unconditionally for each configuration evaluation. Since `_get_inputs` is assumed to be relatively inexpensive, the potential overhead is considered acceptable.
* [Example] Update example_blocksparse_gemm for autotuner
* Run code formatter
* [Feature] Enable custom tensor supply and input caching control in Autotuner
  - Previously, tensor generation was tied to `supply_type` and input caching behavior across configurations was less explicit/controlled.
  - This commit introduces a `supply_prog` parameter to allow providing a custom function for generating input tensors, overriding the default mechanism.
  - Adds a `cache_input_tensors` flag (default True) to control input tensor caching:
    - If True, tensors are generated once per configuration and reused for repetitions, with a check for potential shape mismatches between configurations.
    - If False, tensors are regenerated for every configuration trial.
  - Refactors internal input tensor handling using supplier functions for clarity.
  - Adds a `check_tensor_list_compatibility` utility for shape comparison.
* [Example] Update example_blocksparse_gemm for autotuner
* Run code formatter
* [Example] Small fix in example_blocksparse_gemm
* [Fix] Raise error if autotuning yields no valid configuration
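The dedicated-logger fix in the first bullet corresponds to the standard-library pattern below: configure the named "autotuner" logger directly instead of relying on basicConfig, so DEBUG records reach autotuner.log while the console only sees INFO and above. The format string and `propagate = False` are assumptions, not lifted from the actual fix.

```python
import logging

logger = logging.getLogger("autotuner")
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep records out of the root logger (assumption)

file_handler = logging.FileHandler("autotuner.log")
file_handler.setLevel(logging.DEBUG)       # everything, including DEBUG, goes to the file

console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)     # console only sees INFO and above

for handler in (file_handler, console_handler):
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)

logger.debug("written only to autotuner.log")
logger.info("written to both the console and autotuner.log")
```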
-