Commits · b5faf25ab14c4f3882a3e243f23e230dcef455cd · OpenDAS / tilelang

06 May, 2025 1 commit

[Enhancement] Add new examples for warp specialization and TMA integration (#448) · b5faf25a

Lei Wang authored May 06, 2025

* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic

* Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
* Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.

* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files

* Updated function names from CamelCase to snake_case for better consistency across the codebase.
* Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
* Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.

* [Refactor] Rename operations to snake_case for consistency

* Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
* Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
* Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
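
For downstream code that still uses the old operation names, the renames listed above can be applied mechanically. The following is a minimal sketch, not part of the commit: the `RENAMES` table contains only the three pairs named in the message (the commit says there are others), and `migrate_names` is a hypothetical helper.

```python
import re

# Old -> new operation names, copied from the commit message.
# The message says other operations were renamed too; they are not listed here.
RENAMES = {
    "CreateTMADescriptorOp": "create_tma_descriptor",
    "CreateListofMBarrierOp": "create_list_of_mbarrier",
    "GetMBarrierOp": "get_mbarrier",
}

# Match whole identifiers only, so longer names that merely contain an old
# name are left untouched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_names(source: str) -> str:
    """Rewrite old CamelCase operation names to their snake_case replacements."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], source)


print(migrate_names("x = CreateTMADescriptorOp(); y = GetMBarrierOp(0)"))
```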

* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier

* Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
* Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
* Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
* Enhanced the TileLang API with new methods for retrieving block and thread extents.
* Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
* Improved layout inference and kernel launch logic for better performance and clarity.

* [Refactor] Clean up code formatting and improve readability

* Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
* Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
* Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
* Ensured consistent spacing and formatting across multiple files to enhance overall code readability.

* lint fix

* [Refactor] Update mbarrier functions for improved clarity and consistency

* Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
* Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
* Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
* Added detailed docstrings to clarify usage examples for memory barrier functions.

* Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
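
The log does not show the new signatures of `mbarrier_wait_parity` and `mbarrier_arrive`, so the sketch below only models the general phase-parity idea behind such barriers in plain, single-threaded Python. The class and method names are hypothetical and are not the TileLang API.

```python
class MBarrierModel:
    """Single-threaded model of a phase-parity barrier (illustrative only)."""

    def __init__(self, arrive_count: int) -> None:
        self.arrive_count = arrive_count  # arrivals needed to complete one phase
        self.pending = arrive_count       # arrivals still missing in this phase
        self.phase = 0                    # parity (0 or 1) of the phase in progress

    def arrive(self) -> None:
        """Record one arrival; the last arrival completes the phase."""
        self.pending -= 1
        if self.pending == 0:
            self.phase ^= 1                   # flip parity
            self.pending = self.arrive_count  # re-arm for the next phase

    def wait_parity_done(self, parity: int) -> bool:
        """Return True once the phase with the given parity has completed."""
        return self.phase != parity


barrier = MBarrierModel(arrive_count=2)
print(barrier.wait_parity_done(0))  # phase 0 is still in progress
barrier.arrive()
barrier.arrive()
print(barrier.wait_parity_done(0))  # both arrivals seen, phase 0 completed
```

In this model a waiter passes the parity of the phase it expects to finish, which is why callers typically flip the parity they wait on after every iteration.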

* [Feature] Add examples for warp specialization and TMA barrier integration

* Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
* Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
* Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
* Updated `phase.py` to include TMA barrier injection in the optimization process.
* Improved documentation and comments for better clarity on usage and functionality.

* [Feature] Add example for warp specialization in GEMM with TMA barriers

* Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
* Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
* Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
* Enhanced documentation and comments for clarity on usage and functionality.

* lint fix

* [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection

* Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
* Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
* Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
* This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.

* lint fix

* [Feature] Add new examples for warp specialization and TMA integration

* Introduced multiple new example scripts demonstrating warp specialization techniques, including `example_warp_specialize_flashmla.py`, `example_warp_specialize_gemm_barrierpipe_stage2.py`, `example_warp_specialize_gemm_copy_0_gemm_1.py`, `example_warp_specialize_gemm_copy_1_gemm_0.py`, and `example_warp_specialize_gemm_softpipe_stage2.py`.
* Each example showcases matrix multiplication with warp specialization and TMA barriers, implementing kernel functions with shared memory allocation and memory barrier synchronization for enhanced performance.
* Added a test suite in `test_example_warp_specialize.py` to validate the functionality of the new examples.
* Updated the TileLang API to support these examples and improve kernel compilation and testing processes.
* Removed outdated example scripts to streamline the codebase and enhance clarity on available functionalities.

* lint fix

* Remove outdated example scripts for warp specialization and TMA integration to streamline the codebase. This includes `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, `example_warp_specialize_gemm_stage2.py`, and `example_warp_specialize_mla.py`, which are no longer needed following recent updates and improvements in the TileLang API.

03 May, 2025 1 commit

[Refactor] Separate warp specialize rewriter and tma barrier injector pass (#447) · fce16b00

Lei Wang authored May 03, 2025

* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic

* Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
* Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.

* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files

* Updated function names from CamelCase to snake_case for better consistency across the codebase.
* Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
* Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.

* [Refactor] Rename operations to snake_case for consistency

* Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
* Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
* Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.

* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier

* Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
* Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
* Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
* Enhanced the TileLang API with new methods for retrieving block and thread extents.
* Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
* Improved layout inference and kernel launch logic for better performance and clarity.

* [Refactor] Clean up code formatting and improve readability

* Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
* Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
* Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
* Ensured consistent spacing and formatting across multiple files to enhance overall code readability.

* lint fix

* [Refactor] Update mbarrier functions for improved clarity and consistency

* Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
* Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
* Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
* Added detailed docstrings to clarify usage examples for memory barrier functions.

* Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.

* [Feature] Add examples for warp specialization and TMA barrier integration

* Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
* Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
* Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
* Updated `phase.py` to include TMA barrier injection in the optimization process.
* Improved documentation and comments for better clarity on usage and functionality.

* [Feature] Add example for warp specialization in GEMM with TMA barriers

* Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
* Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
* Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
* Enhanced documentation and comments for clarity on usage and functionality.

* lint fix

* [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection

* Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
* Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
* Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
* This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.

* lint fix

30 Apr, 2025 1 commit

[Language] Support explicit programming for identified warp groups (#445) · 6972aed7

Lei Wang authored Apr 30, 2025

* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic

* Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
* Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.

* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files

* Updated function names from CamelCase to snake_case for better consistency across the codebase.
* Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
* Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.

* [Refactor] Rename operations to snake_case for consistency

* Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
* Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
* Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.

* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier

* Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
* Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
* Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
* Enhanced the TileLang API with new methods for retrieving block and thread extents.
* Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
* Improved layout inference and kernel launch logic for better performance and clarity.

* [Refactor] Clean up code formatting and improve readability

* Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
* Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
* Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
* Ensured consistent spacing and formatting across multiple files to enhance overall code readability.

* lint fix

* [Refactor] Update mbarrier functions for improved clarity and consistency

* Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
* Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
* Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
* Added detailed docstrings to clarify usage examples for memory barrier functions.

* Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.

28 Apr, 2025 1 commit

[Enhancement] Improve layout inference accuracy in ParallelOp (#441) (#442) · 734c7fbe

Lei Wang authored Apr 28, 2025

* Added logic to use non-replicated buffers as source buffers for more accurate layout inference.
* Enhanced comments to clarify the rationale behind buffer selection in the layout inference process.

26 Apr, 2025 1 commit

[Language] Support accumulative `T.reduce_sum` (#436) · 6c737768

Lei Wang authored Apr 26, 2025

* [Enhancement] Update reduce operations to support clear option in sum and abs sum (#436)

* Modified reduce_sum and reduce_absmax functions to include a clear parameter, allowing for accumulation on existing values.
* Updated ReduceOp::Lower method to handle initialization and buffer duplication based on the clear flag for sum and abs sum operations.
* Added new tests for reduce_sum and reduce_max with clear functionality to ensure correctness in various scenarios.
* Enhanced documentation for reduce functions to clarify the behavior of the clear parameter.
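
The effect of the `clear` flag can be illustrated with a plain-Python model of a row-wise sum. This is a sketch of the semantics described above, not the TileLang implementation: `reduce_sum_rows` is a hypothetical name, and the default of `clear=True` is an assumption.

```python
from typing import List


def reduce_sum_rows(src: List[List[float]], dst: List[float], clear: bool = True) -> None:
    """Sum each row of `src` into `dst`, in place.

    clear=True  -> dst is overwritten with the row sums.
    clear=False -> the row sums are accumulated onto the existing dst values.
    """
    for i, row in enumerate(src):
        if clear:
            dst[i] = 0.0  # discard whatever the destination held before
        dst[i] += sum(row)


acc = [10.0, 20.0]
reduce_sum_rows([[1.0, 2.0], [3.0, 4.0]], acc, clear=False)
print(acc)  # [13.0, 27.0]

reduce_sum_rows([[1.0, 2.0], [3.0, 4.0]], acc, clear=True)
print(acc)  # [3.0, 7.0]
```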

* lint fix

* Update tensor type annotations in test_tilelang_transform_annotate_device_regions.py from Buffer to Tensor

* Update tensor type in reduce sum tests from float16 to float32 for improved precision

25 Apr, 2025 1 commit

[Enhancement] Support cute mma tile mxn8ky (#434) · d1c15bc5

Lei Wang authored Apr 25, 2025

* [Enhancement] Improve error handling in layout inference and update profiler type in tests

* Added a detailed error message in the layout inference for local.fragment to clarify the requirement for trans_B.
* Updated the profiler type in the cumulative sum test from TensorSupplyType.One to TensorDistributionType.Randn for better profiling accuracy.

* lint fix

* [Refactor] Update OperandTraits to include num_warp_n parameter

* Modified OperandTraits templates across gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional num_warp_n parameter for improved flexibility in layout and copy operations.
* Adjusted Copy type selection based on the new parameter to enhance performance and adaptability in various scenarios.

* lint fix

* [Refactor] Update DispatchInstruction templates to include N parameter

* Modified DispatchInstruction templates in gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional N parameter, enhancing flexibility in tile size calculations.
* Adjusted MMA_Group definitions to use std::min for improved handling of warp sizes, ensuring better performance and adaptability in various scenarios.

24 Apr, 2025 1 commit

[Enhancement] Remove DeReplicate during parallel loop layout inference (#430) · bb1a5fd8

Lei Wang authored Apr 24, 2025

* [Refactor] Adjust layout inference calculations in Gemm and ParallelOp

* Updated block size calculation in Gemm to account for the range of thread bounds, improving accuracy in layout inference.
* Simplified layout conflict error messages in ParallelOp for better clarity, enhancing debugging experience.
* Removed redundant buffer checks in ParallelOp layout inference logic, streamlining the code.

* [Refactor] Clean up layout inference logic in Gemm and ParallelOp

* Removed unnecessary warning log in Gemm related to WGMMA conditions, streamlining the layout inference process.
* Commented out redundant checks in ParallelOp's layout inference, improving code clarity while maintaining functionality.
* Enhanced error messages in ParallelOp to provide clearer context for layout conflicts, aiding in debugging efforts.

* lint fix

23 Apr, 2025 1 commit

[Layout] Enhance layout inference pass (#427) · 97d63fab

Lei Wang authored Apr 23, 2025

* [Enhancement] Improve layout inference in Copy operation (#426)

* Updated the Copy operation to infer layouts at multiple levels (kCommon, kStrict, kFree) for enhanced flexibility in layout optimization.
* Added detailed documentation for layout inference levels in ParallelOp, clarifying their purposes and use cases.
* Refactored layout inference logic to accommodate new levels, improving overall robustness and performance in parallel operations.

* lint fix

22 Apr, 2025 3 commits

[Language] Support tile operator `T.cumsum` (#423) · 88747fcd

Lei Wang authored Apr 22, 2025

* [Feature] Implement CumSum operation in TileLang

* Added CumSumOp class for cumulative sum operations, including argument validation and lowering logic.
* Introduced CumSum2D template for CUDA, supporting both forward and reverse cumulative sums.
* Created tests for CumSum functionality in shared memory and fragment contexts.
* Updated language interface to include cumsum operation, enhancing the reduction capabilities of TileLang.
* Refactored reduce.py to support cumsum functionality with appropriate memory allocation and copying mechanisms.
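
As a reference for what forward and reverse cumulative sums compute, here is a one-dimensional sketch in plain Python. The function name and the `reverse` parameter are illustrative and are not taken from the `T.cumsum` signature, which the log does not show.

```python
from typing import List


def cumsum(values: List[float], reverse: bool = False) -> List[float]:
    """Inclusive cumulative sum; with reverse=True the scan runs right to left."""
    out = [0.0] * len(values)
    running = 0.0
    order = range(len(values) - 1, -1, -1) if reverse else range(len(values))
    for i in order:
        running += values[i]
        out[i] = running  # each output holds the sum of everything scanned so far
    return out


print(cumsum([1.0, 2.0, 3.0, 4.0]))                # [1.0, 3.0, 6.0, 10.0]
print(cumsum([1.0, 2.0, 3.0, 4.0], reverse=True))  # [10.0, 9.0, 7.0, 4.0]
```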

* lint fix

[Refactor] Enhance layout inference logic in ParallelOp (#420) · bf27e641

Yu Cheng authored Apr 22, 2025

* Updated the layout inference in ParallelOp to improve the selection of source buffers for layout accuracy.
* Introduced logic to choose the read source buffer based on the number of indices, ensuring more precise layout inference.
* Refactored the loop handling to maintain clarity and improve the overall robustness of the layout inference process.

[Enhancement] Support Auto Layout Inference and Parallelism with variable constraint (#417) · 73a6cb8b

Lei Wang authored Apr 22, 2025

* [Enhancement] Introduce thread range management in layout and operation handling

* Added `SetThreadRange` method to `FragmentNode` for managing thread ranges.
* Updated `LayoutNode::Inverse` to provide more informative error messages.
* Refactored layout inference and operation lowering to utilize `thread_bounds` instead of `block_size`, enhancing flexibility for thread management.
* Introduced new tests for tilelang operations to validate thread range functionality and ensure correctness in parallel execution scenarios.

* lint fix

* [Refactor] Improve thread variable handling in layout inference and operation lowering

* Removed workaround for undefined thread_var in layout inference, ensuring proper handling of thread bounds.
* Updated logic to define thread bounds based on the presence of thread_var, enhancing robustness in thread management.
* Refactored thread_var initialization in lower_tile_op to maintain consistency across the codebase.

* [Refactor] Update thread variable handling in layout inference and operation lowering

* Refactored thread variable checks to ensure bounds are only accessed when defined, improving safety and clarity.
* Initialized thread_var with a default range to prevent undefined behavior.
* Updated logic in lower_tile_op to align with new thread variable handling, enhancing consistency across the codebase.

21 Apr, 2025 1 commit

[Bugfix] Support larger than 256 box size tma copy (#413) · bf824406

Lei Wang authored Apr 21, 2025

* [New Feature] Add FP8 Flash Attention Implementation (#412)

* Introduce a new example script for FP8 Flash Attention in `example_mla_decode_kv_fp8.py`, showcasing the use of tilelang for efficient attention computation.
* Implement the `flashattn` function with optimized memory management and kernel execution.
* Include a reference program for comparison and performance evaluation.
* Add command-line argument parsing for batch size, number of heads, and dimensions to facilitate testing and experimentation.
* Enhance the overall structure and readability of the code.

This addition aims to improve the performance of attention mechanisms in deep learning models by leveraging FP8 precision and optimized kernel execution.

* lint fix

* optimize quick start

* lint fix

16 Apr, 2025 2 commits

[Enhancement] Move T.any_of and T.all_of op registration from python into cpp (#398) · 7c266adf

Cunxiao Ni authored Apr 17, 2025

* [Enhancement] Move T.any_of and T.all_of op registration from python into cpp

* format

* add license

[Enhancement] Introduce a smarter warp partition strategy (#396) · ca730c0a

Lei Wang authored Apr 16, 2025

* make it python 3.8- happy

* [Enhancement] Improve loop partitioning and vectorization logic in layout inference and loop vectorization

- Enhanced the VisitStmt_ method to support local buffer handling in parallel loops, allowing for register usage without explicit thread binding.
- Updated loop vectorization logic to simplify expressions and ensure accurate vector size calculations, improving performance and clarity in the vectorization process.

* lint fix

* [Refactor] Update warp size checks and enhance warp partitioning logic in GEMM

- Changed warp_n size check from 16 to 8 in gemm_layouts.cc to improve compatibility with specific configurations.
- Refactored warp partitioning logic in gemm.cc to prioritize N dimension for better performance based on aspect ratio.
- Introduced a new CompileArgs dataclass in autotuner to streamline compile argument management and improve code clarity.

* lint fix

* [Enhancement] Initialize jit_compile in AutoTuner class

- Added initialization for jit_compile attribute in the AutoTuner class to ensure it is set to None by default.
- Updated the assignment logic for jit_compile to prevent overwriting an existing compile function, enhancing the flexibility of the AutoTuner's compilation process.
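
The "default to None, never overwrite" pattern described for `jit_compile` might look roughly like this. It is a minimal sketch: `AutoTunerSketch` and `set_default_compile` are hypothetical names, not the real `AutoTuner` interface.

```python
from typing import Callable, Optional


class AutoTunerSketch:
    """Illustrates the default-None / no-overwrite handling of a compile hook."""

    def __init__(self) -> None:
        # Always defined, so later code can test it without hasattr().
        self.jit_compile: Optional[Callable] = None

    def set_default_compile(self, fn: Callable) -> None:
        # Install the default only when the user has not supplied a function.
        if self.jit_compile is None:
            self.jit_compile = fn


def user_compile(cfg):
    return ("user", cfg)


def default_compile(cfg):
    return ("default", cfg)


tuner = AutoTunerSketch()
tuner.jit_compile = user_compile
tuner.set_default_compile(default_compile)
print(tuner.jit_compile("cfg0"))  # ('user', 'cfg0')
```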

15 Apr, 2025 1 commit

[Enhancement] Report Error Body in ParallelOp Layout Inference (#394) · 192a3995

Yu Cheng authored Apr 15, 2025

Added detailed error messages in the InferLayout method to provide better context when layout conflicts occur. This includes the body of the operation that triggered the error, aiding in debugging and layout validation.

13 Apr, 2025 1 commit

[Dynamic Symbolic] Add pass_config to customize vectorization and tail split (#383) · 280e6627

Zhengju Tang authored Apr 13, 2025



* [Dynamic Symbolic] Add pass_config to customize vectorization and tail split

* Lint

* Only check for vectorized dimension. Add docs.

* Lint

* Update comment for cache directory in .gitignore

* Use CUTLASS convention to represent dynamic alignment. Fix bugs

* Add benchmark examples

* Add more benchmarks. Fix accumulate type bug.

* Lint

* Lint

* Test Lint

* Lint

* Test Lint

* Lint

* Fix typo

* Lint

* Lint

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

09 Apr, 2025 1 commit

[Bugfix] Fix compilation issues for amd cdna element size check (#364) · d627fd58

Lei Wang authored Apr 09, 2025

* [Refactor] Update AutoTuner run method and timeout handling

- Modified the `run` method to reduce the default timeout from 100 to 30 seconds for improved responsiveness.
- Changed the `get_input_tensors_supply` call to disable output generation, enhancing performance during tensor supply retrieval.
- Refactored the latency measurement to streamline the benchmarking process, ensuring proper timeout handling with `ThreadPoolExecutor`.
- Added logging for timeout occurrences to aid in debugging and performance analysis.
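
Timeout handling with `ThreadPoolExecutor`, as described above, can be sketched with the standard library alone. `run_with_timeout` is a hypothetical helper and not the actual `AutoTuner.run` code; only the 30-second default comes from the commit message.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def run_with_timeout(benchmark: Callable[[], float], timeout: float = 30.0) -> Optional[float]:
    """Run `benchmark` in a worker thread; return None if it exceeds `timeout` seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(benchmark)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        logger.warning("benchmark timed out after %.1f s", timeout)
        return None
    finally:
        # Do not block on a worker that may still be running.
        executor.shutdown(wait=False)


def slow_benchmark() -> float:
    time.sleep(0.3)
    return 1.0


print(run_with_timeout(lambda: 0.42))                 # 0.42
print(run_with_timeout(slow_benchmark, timeout=0.05))  # None, and a warning is logged
```

One limitation of this approach is that Python cannot cancel a thread that is already running, so a timed-out benchmark keeps executing in the background until it finishes.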

* bug fix

* lint fix

08 Apr, 2025 1 commit

[Enhancement] Support pass config `disable_warp_specialize` to disable auto... · 7fdcedd0

Lei Wang authored Apr 08, 2025

[Enhancement] Support pass config `disable_warp_specialize` to disable auto specialization on hopper (#357)

* [Enhancement] Add warp specialization configuration option and update related functionality

* [Add] Introduced a new pass configuration option `kDisableWarpSpecialized` to control warp specialization behavior.
* [Refactor] Updated `WarpSpecializedRewriter` and `WSCodeEmitter` to utilize the new configuration option, allowing for more flexible optimization strategies.
* [Update] Modified the optimization pipeline in `phase.py` to include pipeline planning when warp specialization is disabled, enhancing performance with async copy.
* [Documentation] Updated JIT compilation parameters to reflect the new configuration option for better clarity.

* lint fix

* [Add] Implement test for GEMM with warp specialization configuration

* Introduced a new test file `test_tilelang_pass_config_disable_warp_specialized.py` to validate the functionality of the warp specialization configuration option.
* Added a `run_gemm` function to execute matrix multiplication tests with and without warp specialization, ensuring correctness through profiling against reference results.
* Included a specific test case for GEMM with float16 data types, enhancing test coverage for the new configuration feature.

* [Refactor] Improve formatting in test_tilelang_pass_config_disable_warp_specialized.py

* Reformatted the `tilelang.compile` call in the `run_gemm` function for better readability by breaking it into multiple lines.
* Added a blank line for improved code structure and clarity in the `test_gemm_f16f16f16_nn` function.

06 Apr, 2025 2 commits

[Enhancement] Support index bit width configuration (#343) · 70546adc

Lei Wang authored Apr 06, 2025

* [Refactor] Clean up whitespace in CUDA-related files

- Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
- This change enhances the overall organization of the codebase without altering functionality.

* [Benchmark] Add FP8 Matrix Multiplication Benchmark Script

- Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
- The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
- Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
- The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.

* lint fix

* Enhance variable creation by associating data types in IR and layout files, and introduce ExpandIndexDataType transformation

- Updated variable creation in `ir.cc`, `gemm_layouts.cc`, and `elem.cc` to include data types for better type safety.
- Added a new transformation `ExpandIndexDataType` to promote integer types to int64 where necessary, improving compatibility and performance.
- Integrated the new transformation into the optimization pipeline in `phase.py`.
- Documented the new transformation in `__init__.py` for clarity.

* lint fix

* Add configuration option for index bitwidth and remove ExpandIndexDataType transformation

- Introduced a new pass configuration option `kConfigIndexBitwidth` to allow customization of index bitwidth.
- Updated the optimization pipeline in `phase.py` to utilize the new configuration option instead of the removed `ExpandIndexDataType` transformation.
- Documented the new configuration option in the JIT compilation function's parameters for clarity.
- Removed the `ExpandIndexDataType` transformation implementation from the codebase to streamline the transformation process.

* lint fix

* Refactor index bitwidth configuration handling

- Updated the `ConfigIndexBitwidth` pass to only apply the bitwidth transformation if the configuration option is defined, preventing potential errors with undefined values.
- Changed the default value of `tl.config_index_bitwidth` in the JIT compilation function's parameters from 32 to None for better clarity and flexibility.
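
The "apply only if defined" behaviour can be modelled as follows. This is a sketch under stated assumptions: the key `tl.config_index_bitwidth` comes from the commit message, while the function name and the representation of index types as dtype strings are invented for illustration.

```python
from typing import Dict, List, Optional

CONFIG_KEY = "tl.config_index_bitwidth"  # option name from the commit message


def config_index_bitwidth(index_dtypes: List[str],
                          pass_config: Dict[str, Optional[int]]) -> List[str]:
    """Rewrite index dtypes to the configured bit width, if one is configured."""
    bits = pass_config.get(CONFIG_KEY)  # None means "not configured"
    if bits is None:
        return list(index_dtypes)       # leave the program unchanged
    return [f"int{bits}" for _ in index_dtypes]


print(config_index_bitwidth(["int32", "int32"], {}))                # ['int32', 'int32']
print(config_index_bitwidth(["int32", "int32"], {CONFIG_KEY: 64}))  # ['int64', 'int64']
```

With a default of `None` instead of 32, an unset option can be distinguished from an explicit request for 32-bit indices.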

* lint fix

---------
Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>

[Enhancement] Support region padding when convert buffer load to buffer region (#342) · 10804a0d

Lei Wang authored Apr 06, 2025

* Enhance error checking in RegionOp and buffer_load_to_tile_region

- Added detailed error messages to the index size check in `RegionOp` to aid debugging.
- Implemented a check in `buffer_load_to_tile_region` to ensure the length of indices matches extents, with a fallback to expand extents if necessary. This improves robustness in handling buffer loads with mismatched dimensions.
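
One plausible reading of the "expand extents" fallback is that missing leading dimensions are padded with an extent of 1 until the extents match the number of indices. The sketch below shows that interpretation only; `expand_extents` is a hypothetical helper, and padding on the leading side is an assumption rather than something the log states.

```python
from typing import List


def expand_extents(indices: List[int], extents: List[int]) -> List[int]:
    """Pad `extents` with leading 1s so that it has one entry per index."""
    if len(extents) > len(indices):
        raise ValueError(
            f"got {len(extents)} extents for only {len(indices)} indices"
        )
    missing = len(indices) - len(extents)
    # Assumption: dimensions without an explicit extent cover a single element.
    return [1] * missing + list(extents)


# A load indexed as buf[b, i, j] with a 16x32 tile becomes a 1x16x32 region.
print(expand_extents([0, 4, 8], [16, 32]))  # [1, 16, 32]
```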

* lint fix

04 Apr, 2025 1 commit

[AMD] Adapt rocm and support `T.gemm` with transpose_b=False for amd backend (#327) · eab47249

Lei Wang authored Apr 04, 2025



* [Enhancement] Update GEMM and ROCm Integration

- Removed the restriction on transposing matrix B for CDNA in `gemm.cc`, allowing for more flexible matrix operations.
- Added a new debug header file `debug.h` for enhanced debugging capabilities in ROCm kernels.
- Updated `codegen_hip.cc` to include the new debug header and improved handling of float16 and bfloat16 types in vector element stores.
- Refactored `rt_mod_hip.cc` to return a ROCM module directly from `BuildTileLangHIPWithoutCompile`, enhancing the module creation process.
- Introduced a new ROCm utility in `rocm.py` for linking and managing ROCm paths, improving the build process for ROCm applications.
- Updated tests to reflect changes in GEMM configurations and ensure compatibility with the new features.

These changes enhance the flexibility and debugging capabilities of the GEMM operations and improve the integration with the ROCm backend.

* [Fix] Corrected syntax error in pyproject.toml and improved error message formatting in rocm.py

- Added missing quotation mark for "HSA" in the `select` section of `pyproject.toml`.
- Simplified the error message formatting in `get_rocm_arch` function of `rocm.py` for better readability and consistency.

* lint fix

* Update tilelang/jit/adapter/wrapper.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* lint fix

---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

eab47249

03 Apr, 2025 1 commit

[Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support (#320) · 4b705eb2

Yu Cheng authored Apr 03, 2025

* [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support

* Added `example_per_token_cast_to_fp8.py` in examples/cast, providing token-wise FP8 quantization implementation.
* Added `example_triton_cast_to_fp8.py` in examples/cast, providing Triton-based FP8 quantization implementation.
* Added support for absolute maximum (absmax) reduction operation in reduce.cc and reduce.h.
* Implemented `reduce_absmax` function in reduce.py, allowing absolute maximum reduction on input buffers.
* Updated tilelang.language module to include the new `reduce_absmax` function.

These changes enhance FP8 quantization capabilities and extend reduction operation support.
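
To illustrate how an absmax reduction feeds token-wise FP8 scaling, here is a plain-Python sketch. It is not the TileLang kernel: `absmax`, `per_token_scale`, and the `eps` floor are hypothetical, and no actual FP8 rounding is performed. The constant 448 is the largest finite value of the float8 e4m3fn format.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8 e4m3fn value


def absmax(row):
    """Absolute-maximum reduction over one row (one token)."""
    return max((abs(v) for v in row), default=0.0)


def per_token_scale(matrix, eps=1e-4):
    """Scale each row so its largest magnitude maps to FP8_E4M3_MAX.

    `eps` (an assumed safeguard) keeps the scale non-zero for all-zero rows.
    Returns the scaled rows and the per-row scale factors.
    """
    scaled, scales = [], []
    for row in matrix:
        scale = max(absmax(row), eps) / FP8_E4M3_MAX
        scales.append(scale)
        scaled.append([v / scale for v in row])
    return scaled, scales
```

Multiplying a scaled row by its scale factor recovers the original values, which is how the dequantization side would use the stored scales.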

* [Enhancement] Update per_token_cast_to_fp8 for improved FP8 quantization

* Modified the `per_token_cast_to_fp8` function to support variable block sizes and improved memory layout annotations.
* Adjusted the handling of absolute maximum values and scaling factors for better performance and accuracy.
* Updated the main execution block to allow for larger matrix dimensions and refined the profiler setup for benchmarking.

These changes enhance the flexibility and efficiency of the FP8 quantization process.

* lint

* [Dev] Update per_token_cast_fp8.py

4b705eb2

01 Apr, 2025 1 commit

[Bugfix] Fix logic error in ReduceOp when handling CUDA architecture (#316) · 19c85907

Yu Cheng authored Apr 01, 2025

* [Bugfix] Fixed the handling logic of IfThenElseNode in if_stmt_binding

* [Bugfix] Fix logic error in ReduceOp when handling CUDA architecture

- Added a check for the existence of the target attribute "arch" to ensure that there is no undefined behavior when handling the specific architecture "sm_90". This change improves the robustness and compatibility of the code.
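
The guard described above amounts to checking that the attribute exists before comparing it. A minimal sketch, using a plain dictionary to stand in for the target attributes (`is_sm90` is a hypothetical helper):

```python
def is_sm90(target_attrs: dict) -> bool:
    """Hypothetical helper: True only when "arch" is present and equals "sm_90"."""
    arch = target_attrs.get("arch")  # None when the attribute is missing
    return arch == "sm_90"
```

A target without an "arch" attribute simply takes the non-Hopper path instead of triggering an error on a missing value.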

19c85907

26 Mar, 2025 1 commit

[Feature] Introduce NoSetMaxNReg for warp specialization (#289) · 76435ca8

Yu Cheng authored Mar 26, 2025

- Added NoSetMaxNReg as a new TIR built-in to indicate no register hint for warp-specialized branches.
- Updated the warp specialization rewriter to handle the new NoSetMaxNReg operation, allowing for improved register management.
- Enhanced the Python interface to include NoSetMaxNReg for consistency with TIR operations.

76435ca8

24 Mar, 2025 2 commits

[Refactor] Improve flash attention example and layout comparison logic (#270) · 5f5bf53c

Lei Wang authored Mar 24, 2025

* [Refactor] Improve flash attention example and layout comparison logic

- Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
- Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
- Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
- Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.

* lint fix

* [Enhancement] Add support for shared memory scope in Fill operation

- Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
- Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
- Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.

5f5bf53c

[Bugfix] Support `T.clear` for let binding (#268) · 47caf219

Lei Wang authored Mar 24, 2025

* Fix indentation in JIT adapter wrapper to ensure consistent formatting of return statement in generated C code.

* Enhance Fill Operation in TileLang

- Updated the Fill constructor to support BufferLoad instances, adding checks for ramp indices and ensuring only stride 1 ramps are processed.
- Introduced a region array to manage the bounds of the fill operation, improving error checking for static regions.
- Modified the MakeSIMTLoop method to utilize the new region array for loop variable bounds, enhancing flexibility in kernel generation.
- Updated the fill and clear functions in fill.py to accept both tir.Buffer and tir.BufferRegion types, improving usability and type handling.

* Refactor Fill Operation and Improve Readability

- Simplified the Fill constructor by enhancing the handling of BufferLoad instances and ensuring proper checks for ramp indices.
- Improved error messages for region size checks to enhance clarity.
- Cleaned up formatting in the Fill method for better readability.
- Added a blank line in the matmul function test to improve code organization.
- Introduced a blank line in the fill function to enhance readability in fill.py.

* Add matrix multiplication functionality and test in TileLang

- Introduced a new test file `test_tilelang_language_clear.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated the `__init__.py` in the utils module to include `map_torch_type`, enhancing type handling for tensor operations.

* lint fix

47caf219

20 Mar, 2025 1 commit

[Refactor] Phase out LLVM Dependency by Making it Optional (#247) · f2e99180

Lei Wang authored Mar 20, 2025

* remove llvm build

* [Refactor] Update kernel compilation and profiling in examples

- Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
- Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
- Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.

* lint fix

* License Update

* [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files

- Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
- Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.

* [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files

- Improved comment alignment and readability in `cuda.h`.
- Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.

* lint fix

* fix

* License update

* [Enhancement] Update JITKernel to use artifact for kernel source

- Assigned the generated artifact to `self.artifact` for better management.
- Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.

* lint fix

* Add @tilelang.testing.requires_llvm decorator to vectorization tests

* Enhance setup.py and env.py for library management

- Added functionality to remove original files after copying in CMakeBuild.
- Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.

* Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py

* Refactor CMakeBuild file handling in setup.py

- Added a check to ensure the target library directory exists before copying .so files.
- Improved the logic for creating the target directory and copying files to enhance robustness.
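
The directory check can be sketched with the standard library as follows. `copy_shared_libs` is a hypothetical function written for illustration, not the code in `setup.py`.

```python
import glob
import os
import shutil


def copy_shared_libs(build_dir, target_dir):
    """Hypothetical helper: copy every .so file from build_dir into target_dir."""
    # Create the destination first so the copy cannot fail on a missing directory.
    os.makedirs(target_dir, exist_ok=True)
    copied = []
    for src in sorted(glob.glob(os.path.join(build_dir, "*.so"))):
        copied.append(shutil.copy(src, target_dir))
    return copied
```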

* bugfix

* Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.

* lint fix

* Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.

* lint fix

* Add support for C target in device code generation

- Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.

* [Enhancement] Implement auto-clear cache feature based on environment variable

* Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
* Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
* Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
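
A sketch of environment-controlled cache clearing is given below. Only the variable name `TILELANG_CLEAR_CACHE` comes from the commit message; the accepted truthy spellings and the helper names `should_clear_cache` and `init_cache` are assumptions.

```python
import os
import shutil


def should_clear_cache(environ=os.environ) -> bool:
    """True when TILELANG_CLEAR_CACHE is set to a truthy value (assumed spellings)."""
    value = environ.get("TILELANG_CLEAR_CACHE", "0")
    return value.strip().lower() in ("1", "true", "yes", "on")


def init_cache(cache_dir, environ=os.environ):
    """Hypothetical cache initialization: optionally wipe, then ensure the directory exists."""
    if should_clear_cache(environ) and os.path.isdir(cache_dir):
        shutil.rmtree(cache_dir)
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir
```

In a CI job, exporting `TILELANG_CLEAR_CACHE=1` before running the tests would force every run to start from an empty cache under these assumptions.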

* [Refactor] Update kernel invocation and import paths in tests and cache

* Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
* Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
* Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.

* [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py

* Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
* Enhanced overall code formatting to align with project standards.

* [Enhancement] Add bfloat16 test case and improve kernel caching logic

* Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
* Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
* Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
* Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
* Improved code formatting and readability across several files.

* lint fix

* Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage

f2e99180

19 Mar, 2025 1 commit

[Enhancement] Add zero initialization option to GEMM operations (#246) · 701e9234

Yu Cheng authored Mar 19, 2025

* [Enhancement] Add zero initialization option to GEMM operations

- Introduced a new `zero_init` parameter to the GEMM function, allowing for optional zero initialization of the accumulator.
- Updated the GEMM implementation across various CUDA architectures to support the new parameter.
- Modified the Python interface for GEMM to include the `zero_init` argument, enhancing flexibility in kernel execution.
- Ensured compatibility with existing functionality while improving initialization control for performance optimization.

* rename zero_init to clear_accum
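
The effect of the flag can be shown with a reference GEMM in plain Python. This sketches the semantics only, assuming `clear_accum=True` means the accumulator is zeroed before accumulation; the `gemm` function here is illustrative and unrelated to the CUDA implementation.

```python
def gemm(a, b, c, clear_accum=False):
    """Reference C (+)= A @ B on nested lists.

    With clear_accum=True the previous contents of `c` are ignored,
    so no separate zero-fill of the accumulator is needed.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    for i in range(rows):
        for j in range(cols):
            acc = 0.0 if clear_accum else c[i][j]
            for p in range(inner):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c
```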

* lint

701e9234

18 Mar, 2025 2 commits

[Dev] Implement FlashAttention3 Backward (#244) · c264f37f

Yu Cheng authored Mar 18, 2025

* [BugFix] Fix bug of missing MBarrierExpectTX

* [Dev] Implement FlashAttention3 Backward

- Added a new example for Flash Attention using pipelined WGMMA, including forward and backward pass implementations.
- Introduced functions for forward and backward processing, leveraging tilelang for optimized tensor operations.
- Enhanced the attention mechanism with support for both causal and non-causal configurations.
- Included command-line arguments for batch size, number of heads, context size, and head dimension for flexibility in testing.
- Updated GEMM operations to support a new `wg_wait` parameter for improved synchronization in kernel execution.

c264f37f

[Refactor] Refactor for Better Layout Conflict Handling (#240) · 2a286ae6

Lei Wang authored Mar 18, 2025

* [Feature] Add reduce_max functionality and corresponding tests

* Introduced a new test file for the reduce_max operation in the tilelang language module.
* Implemented the reduce_max functionality using T.prim_func, including local memory allocation and result copying.
* Added tests for various input sizes and data types to ensure correctness of the reduce_max implementation.
* Enhanced profiling assertions to validate the output against reference implementations.

* Fix whitespace issues in reduce_max test file for improved readability

* [Refactor] Update DebugOutput methods to return strings instead of void

* Modified DebugOutput methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to return std::string instead of void, enhancing usability for logging and debugging.
* Updated corresponding header files to reflect the new return types.
* Improved layout inference error messages by incorporating DebugOutput for better clarity in layout conflicts.

* lint fix

* Fix typo in matmul function: changed loop from T.Parallel to T.grid for correct parallel execution in webgpu code generation tests.

* [Enhancement] Improve layout inference conflict handling in ParallelOp

* Updated the layout inference logic in ParallelOp to better handle conflicts for local.fragment buffers.
* Added checks to ensure that layout conflicts are reported only when both source and destination buffers are defined, improving clarity in error messages.
* Enhanced the overall robustness of the layout inference process by addressing specific cases where conflicts may arise.

* [Feature] Add IsEqual methods for layout comparison

* Introduced IsEqual methods in LayoutNode, FragmentNode, and SwizzledLayoutNode to facilitate structural equality checks, allowing for optional index comparison.
* Enhanced layout inference logic in Copy and ParallelOp to utilize the new IsEqual methods for better conflict detection in local.fragment layouts.
* Improved error messages for layout conflicts to provide clearer guidance on potential issues.

* [Refactor] Update profiler usage in benchmark_nsa_fwd.py and improve layout inference in elem.cc and parallel.cc

* Modified the profiler call in benchmark_nsa_fwd.py to streamline latency measurement.
* Updated layout inference logic in elem.cc and parallel.cc to use const pointers for FragmentNode, enhancing type safety and clarity.
* Improved error messages in layout conflict checks to provide better guidance on potential issues.

* [Refactor] Clean up pointer formatting in layout inference files

* Standardized pointer formatting for FragmentNode in elem.cc and parallel.cc to improve code readability.
* Minor adjustments to error message formatting in layout conflict checks for better clarity.

2a286ae6

16 Mar, 2025 1 commit

[Bugfix] Fix mismatch of shared memory layout and mma atom on Hopper (#224) · c5bbc608

zqh-wz authored Mar 16, 2025



* add test for issue 101

* use ss_smem_selector from cutlass

* fix mismatch between smem layout and mma

* only fix for sm90

* Add CUDA requirements to GEMM thread tests

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

c5bbc608

14 Mar, 2025 1 commit

[Enhancement] Allow mma fallback when wgmma is not supported (#206) · 45559a1f

Lei Wang authored Mar 14, 2025

* Enhance error message for constant size stack allocation in CUDA codegen. Include the actual constant size and buffer variable name in the error output for better debugging.

* Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture

- Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support.
- Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types.
- Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture.
- Include necessary header files for built-in operations and layout inference improvements.

* lint fix

* Remove unused builtin.h include directive

* Update include path for builtin.h

45559a1f

13 Mar, 2025 1 commit

[Feature] Upgrade cutlass version and support fp8 T.gemm (#202) · 2cccf1f5

zqh-wz authored Mar 13, 2025



* upgrade cutlass to upstream v3.8.0

* Implement fp8 gemm and add example script

* Fix dtype retrieval with map_torch_type for fp8 inputs

* Disable vectorization of fp8 values

* Make MMA declaration compatible with cutlass 3.4.0+

* Add test for fp8 T.gemm

* fix indent

* fix indent

* Add copyright and license header

* Add copyright and license header

* lint fix

* Refactor matmul_nt and assert_matmul_correctness functions for improved readability by consolidating parameter definitions and adjusting formatting.

* clang format lint

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

2cccf1f5

12 Mar, 2025 2 commits

[Feature] Add TMA Store Synchronization Support (#195) · eba7dd5a

Yu Cheng authored Mar 12, 2025

- Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization
- Add new builtin operations in op/builtin.cc and op/builtin.h
- Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls
- Update CUDA codegen to support new TMA store synchronization intrinsics
- Add Python language bindings for new TMA store synchronization operations

eba7dd5a

[Bugfix] Fix `T.copy` for scalar datatypes (#190) · 454248c7

Lei Wang authored Mar 12, 2025

* Optimize CMake build process with dynamic job count calculation

- Modify build_csrc function to use 90% of available CPU cores
- Ensure at least one job is used during compilation
- Improve build performance by dynamically adjusting parallel job count

* Optimize build_csrc function with multiprocessing module

- Replace os.cpu_count() with multiprocessing.cpu_count()
- Maintain existing 90% CPU utilization logic
- Improve CPU core count calculation for build process
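
The job-count rule described in the two items above (90% of the cores, never fewer than one) can be written as a one-liner. `build_job_count` is a hypothetical name; `multiprocessing.cpu_count()` is the standard-library call the commit mentions.

```python
import multiprocessing


def build_job_count(fraction=0.9):
    """Number of parallel build jobs: a fraction of the cores, at least one."""
    return max(1, int(multiprocessing.cpu_count() * fraction))
```

The result could then be passed to the build tool, for example as the value of a `-j` flag.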

* Add dynamic shape support with out_idx in Cython JIT kernel compilation

- Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
- Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
- Add support for resolving dynamic shape dimensions using input tensor references
- Enhance flexibility of JIT kernel compilation with symbolic shape handling

* Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel

- Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
- Improve debugging by providing context about missing symbolic dimensions
- Maintain existing dynamic shape resolution logic
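
The resolution step and its error message can be sketched as below. The name `dynamic_symbolic_map` comes from the commit message, but its layout here (symbol name mapped to an input tensor index and a dimension index) is an assumption, as is the helper `resolve_output_shape`.

```python
def resolve_output_shape(out_shape, dynamic_symbolic_map, input_shapes):
    """Replace symbolic dimensions in out_shape with sizes read from the inputs.

    Assumed layout: dynamic_symbolic_map[symbol] == (input_index, dim_index).
    """
    resolved = []
    for dim in out_shape:
        if isinstance(dim, int):
            resolved.append(dim)  # static dimension, use as-is
            continue
        if dim not in dynamic_symbolic_map:
            raise ValueError(
                f"Dynamic symbolic dimension {dim!r} not found in "
                f"dynamic_symbolic_map; known symbols: {sorted(dynamic_symbolic_map)}"
            )
        input_index, dim_index = dynamic_symbolic_map[dim]
        resolved.append(input_shapes[input_index][dim_index])
    return resolved
```

Listing the known symbols in the message gives the context about the missing dimension that the commit describes.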

* Fix Copy operation handling for scalar and multi-dimensional tensors

- Add special handling for scalar tensor copy operations
- Enhance error reporting in MakeIndices method with more detailed diagnostic information
- Improve SIMT loop generation to support zero-dimensional tensors
- Add explicit check and handling for scalar tensor scenarios

* Refactor Copy operation code formatting and improve readability

- Improve code formatting in MakeIndices and MakeSIMTLoop methods
- Add line breaks to enhance readability of complex ICHECK statements
- Simplify code structure in scalar tensor handling
- Remove unnecessary whitespace and improve code alignment

454248c7

09 Mar, 2025 1 commit

[Feat] Append Pass Context and TMA lowering configuration option (#175) · fb6b101c

Lei Wang authored Mar 09, 2025

* Add TMA lowering configuration option and update copyright notices

This commit introduces a new configuration option to disable TMA (Tensor Memory Accelerator) lowering and updates copyright notices across multiple files. Key changes include:

- Add `kDisableTMALower` configuration option in builtin.h and builtin.cc
- Update copyright notices from Microsoft Corporation to Tile-AI Corporation
- Modify `LowerArgs` struct to include `disable_tma_lower` flag
- Update JIT compilation interfaces to support pass configuration
- Enhance error reporting in bulk copy lowering
- Propagate pass configuration through various adapter layers

* lint fix

fb6b101c

05 Mar, 2025 1 commit

[Dev] Adjust computation logic to avoid precision loss when casting acc_s from... · e1d82bf3

Yu Cheng authored Mar 05, 2025

[Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16 (#141)

- Remove redundant `acc_s_0` fragment in flash attention kernel
- Simplify memory copy and reduction operations
- Reorder memory copy and scaling steps for improved performance
- Add Hopper-specific synchronization method in CUDA reduce template
- Update reduce operation to use architecture-specific synchronization

e1d82bf3

27 Jan, 2025 1 commit

[CI][Test] Add test cases for tilelang transform LowerHopperIntrin (#59) · 7d4156df

Yu Cheng authored Jan 27, 2025

* [Dev] Add FlashDecoding example

* [CI][Test] Add test cases for tilelang kernel convolution

* [CI][Test] Add test cases for tilelang kernel FlashAttention

* Reduce the number of stages to ensure the shared memory allocation is valid

* Temporarily remove the dim128 case

* lint

* update einops in requirements-dev.txt

* update einops in requirements-test.txt

* remove einops in requirements-dev.txt

* [CI][Test] Add test cases for tilelang transform ClusterPlanning

* [CI][Test] Add test cases for tilelang transform LowerHopperIntrin

7d4156df

17 Jan, 2025 1 commit

[CPU] Support CPU Code generation (#17) · 913d14f2

Lei Wang authored Jan 18, 2025

* README.md fixed

* test fix

* cpu backend update

* cpu test case

913d14f2

11 Jan, 2025 1 commit

[Lint] Overall Typo and Linting Fixes (#13) · fa511857

Lei Wang authored Jan 11, 2025

* README.md fixed

* update test ci

* Lint and Typo Fix

* Clang Format Lint Fix

fa511857