Commits · e7b97be297885c493507dbc3e55da2f4af46ae39 · OpenDAS / tilelang

11 Jun, 2025 1 commit

[Feature] Introduce Persistent Loop and Update GEMM Example (#563) · e7b97be2

Yu Cheng authored Jun 11, 2025

* [Feature] Added Support for Synchronizing Grids and Persistent Threadblock Transformation

- Defined the sync_grid operation in builtin.cc and builtin.h, allowing synchronization of all threads within a grid.
- Implemented support for sync_grid in codegen_cuda.cc, ensuring proper handling of this operation in the generated CUDA code.
- Added the PersistThreadblock transformation, enabling the conversion of thread blocks to persistent thread blocks, enhancing support for persistent kernels.
- Updated relevant documentation and comments to reflect the addition of new features and usage instructions.

* [Example] Add MLA Decode With Persistent Threadblock Example

* [Feature] Introduce Persistent Loop and Update GEMM Example

- Added a new persistent loop construct in the TIR framework, enabling more efficient kernel execution.
- Updated the GEMM example to utilize the new persistent primitive, enhancing performance for matrix multiplication.
- Introduced a `loop_break` intrinsic for better control flow within persistent loops.
- Updated relevant files to support the new features, including changes in code generation and language interface.

* lint fix
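
To illustrate the persistent-kernel idea behind this commit, the sketch below simulates in plain Python how a fixed number of resident thread blocks can loop over more tiles than there are blocks. It is not TileLang code: `persistent_tiles`, `num_blocks`, and the round-robin tile assignment are assumptions chosen for illustration.

```python
def persistent_tiles(block_id: int, num_blocks: int, num_tiles: int):
    """Yield the tile indices that one persistent block would process.

    Assumption: tiles are assigned round-robin, so block `b` handles
    tiles b, b + num_blocks, b + 2 * num_blocks, ...
    """
    tile = block_id
    while True:
        if tile >= num_tiles:
            break  # plays the role of an early exit from the persistent loop
        yield tile
        tile += num_blocks


# Four resident blocks cover ten output tiles.
schedule = {b: list(persistent_tiles(b, 4, 10)) for b in range(4)}
```

Each block stays resident and walks its own strided sequence of tiles, so the launch size can match the number of SMs instead of the number of tiles.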

07 Jun, 2025 2 commits

[Feature] Support persistent kernels and add persistent GEMM examples (#559) · 225aca61

Yu Cheng authored Jun 07, 2025

* [Enhancement] Fix multi-version buffer index in nested-loop

* [Feature] Support persistent kernels and add persistent GEMM example

* lint fix

* lint fix

* [CI] Remove test_tilelang_transform_annotate_device_regions.py

[Bugfix] Add tf32 casting to GEMM templates (#556) · 8cc8db52

Lei Wang authored Jun 07, 2025

* Add tf32 casting functionality to GEMM templates

- Introduced a `cast_float_to_tf32` function to convert float32 values to tfloat32 format across gemm_sm80, gemm_sm89, and gemm_sm90 templates.
- Implemented conditional casting in relevant sections of the GEMM operations to ensure compatibility with tfloat32 types.
- Enhanced the handling of tensor views to support the new casting logic, improving performance and accuracy in matrix operations.

* lint fix

* Refactor tfloat32 casting logic in GEMM templates

- Replaced the `is_tfloat32` boolean with `need_tfloat32_cast` to improve clarity and accuracy in determining when to cast float32 to tfloat32.
- Updated relevant sections in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to utilize the new casting logic, enhancing compatibility with tfloat32 types.
- Ensured consistent application of casting across tensor views, improving performance and correctness in matrix operations.

* Refactor GEMM template functions for improved readability

- Simplified the function signature of `body_rs` in both `gemm_sm80` and `gemm_sm90` templates for better clarity.
- Adjusted the casting logic in `gemm_sm90` to ensure consistent application of `cast_float_to_tf32` across tensor views, enhancing performance and maintainability.

* Enhance tf32 casting logic in GEMM templates

- Updated the `cast_float_to_tf32` function in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to conditionally apply the casting only if the input is finite, improving robustness.
- Simplified the `need_tfloat32_cast` logic to clarify the conditions under which tfloat32 casting is required, enhancing code readability and maintainability.

* Refactor GEMM template functions and layout inference logic

- Removed the `cast_float_to_tf32` function from `gemm_sm90` and updated the `body_sr` function to streamline the casting process for tensor views, enhancing code clarity and maintainability.
- Improved layout inference in `layout_inference.cc` by adding checks for the layout map's definition, ensuring robustness in handling layout annotations.
- Simplified the handling of layout maps in the `annotate_layout` function, allowing for more flexible layout definitions and error handling.
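
The finite-only tf32 cast mentioned in this commit can be sketched in Python as follows. This is a minimal illustration, not the template code: the rounding scheme (add half of the dropped range, then clear the low 13 mantissa bits) is an assumption, and the input is assumed to be representable as float32.

```python
import math
import struct


def cast_float_to_tf32(x: float) -> float:
    """Round a float32 value to TF32 precision (10 explicit mantissa bits)."""
    if not math.isfinite(x):
        return x  # inf and nan are passed through unchanged
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Assumed rounding: add half of the dropped range, then clear 13 low bits.
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Skipping non-finite inputs matters because adding the rounding bias to the bit pattern of an infinity or NaN would otherwise change its payload.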

05 Jun, 2025 1 commit

[Enhancement] Add nvrtc execution backend (#461) · 17f7394f

Gabriel Wu authored Jun 05, 2025



* [wip] feat: add nvrtc backend

* [wip] fix: handle out_idx

* [wip] refactor: move lib logic to libgen

* feat: cache for nvrtc backend

* fmt: run format

* fix: handle cuda bindings import error

* fix: handle cuda bindings import error

* fix: handle cuda bindings import error

* fix: handle cuda bindings import error

* fix: get kernel source

* refactor: speedup pyimport

* Improve error handling for missing cuda-python dependency in nvrtc backend. Raise ImportError with detailed installation instructions instead of logging a warning.

* Enhance nvrtc backend error handling by introducing a flag to check for cuda-python availability. Raise ImportError with detailed installation instructions during initialization if the nvrtc backend is unavailable, improving user experience and clarity.

* Update README.md to include recent NVRTC Backend addition, highlighting reduced compilation time for CUDA templates.

* fix tl_templates

* ensure CUDA context

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
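
The availability-flag pattern described in this commit (detect the optional dependency once, then fail at initialization with install instructions) might look roughly like the sketch below. The names `is_nvrtc_available` and `NvrtcBackendSketch` are hypothetical, and the assumption is that cuda-python is importable as the top-level package `cuda`.

```python
import importlib.util

# Hypothetical flag: True when the optional `cuda` package (cuda-python) is found.
is_nvrtc_available = importlib.util.find_spec("cuda") is not None


class NvrtcBackendSketch:
    """Hypothetical stand-in for a backend that needs cuda-python."""

    def __init__(self, available: bool = is_nvrtc_available):
        if not available:
            # Fail loudly at construction time instead of logging a warning.
            raise ImportError(
                "cuda-python is not available, so the NVRTC backend cannot be "
                "used. Install it with `pip install cuda-python`."
            )
```

Checking once at import time keeps module import cheap, while raising in `__init__` ensures users only see the error when they actually request this backend.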

04 Jun, 2025 3 commits

[Bugfix] Enhance layout inference pass for flexibility (#550) · 444b7c4e

Lei Wang authored Jun 04, 2025

* Enhance Layout

* strict update

* lint fix

* Refactor layout inference by removing unnecessary logging statements in `parallel.cc` and `layout_inference.cc`. This cleanup enhances code readability and reduces log clutter during layout inference steps.

* lint fix

* Refactor file copying logic in setup.py to simplify directory creation and file copying process. Removed unnecessary existence check before copying source files to the target directory.

[AMD][Enhancement] Add support for Vectorized FP8 DataPacking (#542) · 319bc6b1

Lei Wang authored Jun 04, 2025

* [Enhancement] Add support for new FP8 types in HIP code generation

* Updated `PrintConst` function in `codegen_hip.cc` to handle `float8_e4m3fnuz` type.
* Introduced new functions in `hip_fp8.h` for creating FP8 types, including `make_fp8_e4_4_t` and `make_fp8_e4_8_t`, enhancing type handling for FP8 data structures.
* Improved overall compatibility and performance for FP8 data types in HIP.

* workaround for competition

* enhance autotune

* autotune cache fix

* Implement validation for unused keys in AutoTuner configuration

* Added a check in the AutoTuner class to raise a ValueError if there are unused keys in the configuration, enhancing error handling and ensuring configuration integrity.

* lint fix

* revert changes of threads

* Update pipelining in `example_mla_decode.py` to improve performance

* Changed the number of stages in the pipelined loop from 0 to 2, enhancing the efficiency of the attention mechanism in the decoding process.

* Enhance Cython kernel validation by adding tensor attribute checks

* Updated the `CythonKernelWrapper` to include dedicated methods for validating tensor device, dtype, and static shape.
* Modified the `forward` method to utilize these new validation methods, improving error handling and ensuring input integrity.
* Updated the `lambda_forward` function in `CythonKernelAdapter` to reflect changes in validation parameters.
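
The unused-key validation added to the AutoTuner in this commit can be pictured with the following sketch. It is an illustration only: `check_unused_keys` and its arguments are hypothetical names, not the AutoTuner's actual interface.

```python
def check_unused_keys(config: dict, accepted_keys: set) -> None:
    """Raise ValueError if `config` contains keys the kernel does not accept."""
    unused = sorted(set(config) - set(accepted_keys))
    if unused:
        raise ValueError(f"Unused keys in config: {unused}")


# A config whose keys all match the (hypothetical) kernel parameters passes silently.
check_unused_keys({"block_M": 128, "block_N": 128}, {"block_M", "block_N", "num_stages"})
```

Rejecting unknown keys up front catches typos such as `blok_M` that would otherwise be ignored and silently tune nothing.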

[Refactor] Include several examples into ci (#531) · 3ca3a8af

Lei Wang authored Jun 04, 2025

* Remove unused 2D continuous cumulative sum example and related functions from the cumsum module.

* lint fix

* fix split k example

* Enable cache disabling in gemm_streamk example and add validation checks in if_stmt_binding transformation

* Update gemm_streamk example to use tilelang's cdiv function for block calculations and add copyright notice
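
For readers unfamiliar with the ceiling-division helper used for block calculations, a minimal stand-in is shown below. `ceil_div` is a hypothetical name used here for illustration and is not the tilelang function itself.

```python
def ceil_div(a: int, b: int) -> int:
    """Return the smallest integer n such that n * b >= a (for a >= 0, b > 0)."""
    return (a + b - 1) // b


# Number of 128-wide blocks needed to cover 1000 rows.
num_blocks = ceil_div(1000, 128)
```

Plain floor division would give 7 blocks here and leave the last 104 rows uncovered, which is why block counts are rounded up.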

01 Jun, 2025 1 commit

[AMD] Support float8 matrix core (#537) · 5872e647

Lei Wang authored Jun 02, 2025



* [Enhancement] Add support for FP8 types in CUDA and HIP code generation

* Updated `GetFP8Type` function in `codegen_cuda.cc` and `codegen_hip.cc` to handle new FP8 types, including `kFloat8_e4m3fnuz`.
* Introduced a new header file `hip_fp8.h` for FP8 type definitions in HIP.
* Modified type mappings in `dlpack.py` and `mfma_macro_generator.py` to accommodate new FP8 types.
* Enhanced type handling in `TLHIPSourceWrapper` and `tensor.py` for better integration with FP8 types.
* Added necessary includes and logic to support FP8 in the code generation process, improving performance and compatibility with FP8 data types.

* lint fix

* Update src/target/codegen_hip.cc
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update tilelang/intrinsics/mfma_macro_generator.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* workaround

* fix

* Update submodule TVM to latest commit 587028ffebfff0ded520f8f90d62f0f6b165906c

* bug fix

* Refactor tilelang matrix multiplication to support transposition and packing options. Adjusted shared memory shapes and loading logic for A and B matrices. Updated test cases to validate new functionality.

* Refactor assertion function for tilelang matrix multiplication to improve readability by formatting parameters and aligning code. Cleaned up whitespace in intrinsic layout functions for consistency.

* Update bfloat16 type definitions in common.h and gemm.h for consistency. Changed __hip_bfloat16 to hip_bfloat16 and updated MfmaTraits specialization accordingly.

* lint fix

---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

31 May, 2025 1 commit

[Bugfix] Fix a bug when simplifying warp combination for T.gemm (#540) · 1940b3c9

Lei Wang authored May 31, 2025

29 May, 2025 1 commit

[Language] Support `T.annotate_l2_hit_ratio` via `cudaStreamSetAttribute` (#539) · a65f481e

Lei Wang authored May 30, 2025

* Refactor OptimizeForTarget function by removing redundant buffer allocation step and cleaning up code

* Removed the PlanAndUpdateBufferAllocationLocation step from the OptimizeForTarget function to streamline the optimization process.
* Cleaned up unnecessary whitespace in the function for improved readability.
* Enhanced the overall clarity and maintainability of the code.

* Refactor AllocateNode handling in vectorize_loop.cc

* Simplified the VisitStmt_ method for AllocateNode by removing the complex extent mutation logic.
* Streamlined the allocation process to directly call the base class method, enhancing code clarity and maintainability.
* Improved overall readability by eliminating unnecessary comments and code related to extent handling.

* Remove `tl_kernel.c` file, eliminating the backward kernel implementation and associated error handling functions. This cleanup enhances code maintainability by removing unused components related to the backward kernel processing.

* Add buffer allocation planning step in OptimizeForTarget function

* Introduced the PlanAndUpdateBufferAllocationLocation step to the OptimizeForTarget function, enhancing the optimization process.
* This addition improves the overall efficiency of buffer allocation during the target optimization phase, ensuring better resource management.

* Update submodule TVM to latest commit db50d4e, ensuring alignment with upstream changes.

* Add L2 persistent annotation support and related functionality

* Introduced a new file `lower_l2_persistent_annotation.cc` to handle the lowering of L2 persistent annotations.
* Added functions to annotate L2 hit ratios for buffers, ensuring compatibility with global buffer requirements.
* Updated the `LowerAndLegalize` function to include the new L2 persistent map lowering step.
* Enhanced CUDA driver with a function to retrieve the maximum size of the persisting L2 cache.
* Modified the `TLCUDASourceWrapper` class to integrate L2 persistent map handling during kernel launches.

These changes improve the framework's ability to manage L2 cache optimizations, enhancing performance for CUDA applications.

* lint fix

28 May, 2025 1 commit

[Refactor] Disable legacy vectorization for buffer allocation (#535) · e71c7a17

Lei Wang authored May 29, 2025

* Refactor OptimizeForTarget function by removing redundant buffer allocation step and cleaning up code

* Removed the PlanAndUpdateBufferAllocationLocation step from the OptimizeForTarget function to streamline the optimization process.
* Cleaned up unnecessary whitespace in the function for improved readability.
* Enhanced the overall clarity and maintainability of the code.

* Refactor AllocateNode handling in vectorize_loop.cc

* Simplified the VisitStmt_ method for AllocateNode by removing the complex extent mutation logic.
* Streamlined the allocation process to directly call the base class method, enhancing code clarity and maintainability.
* Improved overall readability by eliminating unnecessary comments and code related to extent handling.

* Remove `tl_kernel.c` file, eliminating the backward kernel implementation and associated error handling functions. This cleanup enhances code maintainability by removing unused components related to the backward kernel processing.

* Add buffer allocation planning step in OptimizeForTarget function

* Introduced the PlanAndUpdateBufferAllocationLocation step to the OptimizeForTarget function, enhancing the optimization process.
* This addition improves the overall efficiency of buffer allocation during the target optimization phase, ensuring better resource management.

27 May, 2025 1 commit

[Enhancement] Add warp specialization attribute handling in IR and rewriter (#518) · 41bc15cb

Yu Cheng authored May 27, 2025

* Introduced an `AttrFrame` for warp specialization in the IR, enhancing the handling of warp-specific optimizations.
* Refactored the `VisitStmt_` method in `warp_specialized_rewriter.cc` to check for the new warp specialization attribute, improving the detection of warp specialization conditions.
* Removed outdated code related to condition checks in `IfThenElseNode`, streamlining the specialization logic.

26 May, 2025 3 commits

[Enhancement] Add commit ID to versioning and improve logging initialization (#524) · 62a8d7f0

Lei Wang authored May 27, 2025

* Updated `get_tilelang_version` to include an optional commit ID in the version string.
* Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
* Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
* Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
* Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
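
A rough sketch of how a commit hash can be retrieved and appended to a version string is given below. The function names and the `+<commit>` suffix format are assumptions for illustration and may differ from what `version.py` does.

```python
import subprocess
from typing import Optional


def get_git_commit_id_sketch(cwd: Optional[str] = None) -> Optional[str]:
    """Return the current git commit hash, or None if it cannot be determined."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=cwd, stderr=subprocess.DEVNULL
        )
    except (subprocess.CalledProcessError, OSError):
        return None  # not a git checkout, or git is not installed
    return out.decode("ascii").strip()


def version_with_commit(base: str, commit_id: Optional[str]) -> str:
    """Append the commit id as a local version suffix when one is available."""
    return f"{base}+{commit_id}" if commit_id else base
```

Keeping the commit id optional lets builds from source tarballs, which have no `.git` directory, fall back to the plain version string.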

[Enhancement] Add atomicAdd for FLOAT16x2 and FLOAT16x4 (#522) · 46798f25

Lei Wang authored May 26, 2025

* [Enhancement] Add atomic addition functions for FLOAT16x2 and FLOAT16x4 in CUDA

* Introduced `AtomicAddx2` and `AtomicAddx4` functions for performing atomic addition on packed FLOAT16x2 and FLOAT16x4 values in CUDA.
* Updated `customize.py` to include the new `atomic_addx4` function for external calls.
* Modified `__init__.py` to export the new atomic addition function, ensuring accessibility in the module.

* lint fix

[Refactor] Replace default fp8 dtype with cute to perform fast cast (#520) · 6addc509

Lei Wang authored May 26, 2025

* [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)

* Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
* Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
* Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
* Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
* Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.

* [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature

* Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
* Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
* Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.

* Add dynamic matrix multiplication example and corresponding test

* Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
* Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
* The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.

* lint fix

* Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device

* Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
* Updated the `__init__.py` file to include the new function in the module exports.

* lint fix

* Add global barrier state and expectation handling in CUDA code generation

* Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
* Updated `Finish` method to declare the global barrier state if needed.
* Implemented handling for `EvaluateNode` to initialize the barrier expectation.
* Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
* Enhanced CUDA FP8 type definitions for better alignment and structure.

* Enhance CUDA FP8 type handling and debug printing

* Updated `cuda_fp8.h` to replace NVidia's FP8 types with Cute's FP8 types for better compatibility and structure.
* Added specializations for `debug_print_var` and `debug_print_buffer_value` functions to support the new FP8 types, improving debugging capabilities for these data types.
* Updated `debug.h` to include the new `cuda_fp8.h` header for access to the FP8 type definitions.

* Refactor CUDA code generation to remove unnecessary managed qualifier for global barrier state

* Updated the `Finish` method in `codegen_cuda.cc` to declare the global barrier state without the `__managed__` qualifier, simplifying the declaration.
* Added a new `sync_global` function in `builtin.py` to synchronize all threads in a block, enhancing synchronization capabilities in the TileLang framework.

* Remove deprecated CUDA kernel and Python script for FP8 E4M3 casting

* Deleted the `cast_to_fp8_e4m3_kernel` CUDA kernel implementation and its corresponding Python script, streamlining the codebase by removing unused components related to FP8 E4M3 type casting.
* This cleanup enhances maintainability and reduces potential confusion regarding obsolete code.

* lint fix

25 May, 2025 1 commit

[Enhancement] Support auto synchronization for global memory access (#519) · 623edf4c

Lei Wang authored May 25, 2025

* [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)

* Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
* Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
* Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
* Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
* Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.

* [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature

* Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
* Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
* Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.

* Add dynamic matrix multiplication example and corresponding test

* Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
* Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
* The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.

* lint fix

* Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device

* Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
* Updated the `__init__.py` file to include the new function in the module exports.

* lint fix

* Add global barrier state and expectation handling in CUDA code generation

* Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
* Updated `Finish` method to declare the global barrier state if needed.
* Implemented handling for `EvaluateNode` to initialize the barrier expectation.
* Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
* Enhanced CUDA FP8 type definitions for better alignment and structure.

24 May, 2025 1 commit

[Refactor] Support auto index bitwidth casting (#517) · 6ad73f6f

Lei Wang authored May 24, 2025

* [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)

* Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
* Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
* Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
* Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
* Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.

* [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature

* Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
* Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
* Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.

* Add dynamic matrix multiplication example and corresponding test

* Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
* Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
* The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.

* lint fix

* Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device

* Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
* Updated the `__init__.py` file to include the new function in the module exports.

* lint fix
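
The auto-check behaviour described for the index bitwidth pass can be sketched as follows. This is only an illustration under stated assumptions: `choose_index_bits` is a hypothetical helper, and the criterion for "necessary" (the largest flat offset no longer fits in a signed 32-bit integer) is a guess at the intent rather than the pass's actual rule.

```python
INT32_MAX = 2**31 - 1


def choose_index_bits(shape, configured_bits=None) -> int:
    """Pick an index bitwidth for a buffer of the given shape.

    A configured bitwidth wins; otherwise fall back to an automatic check.
    """
    if configured_bits is not None:
        return configured_bits
    num_elements = 1
    for extent in shape:
        num_elements *= extent
    # Assumed criterion: the last valid flat offset must fit in int32.
    return 64 if num_elements - 1 > INT32_MAX else 32
```

For example, a 65536 x 65536 buffer has flat offsets up to 2^32 - 1, so 32-bit indices would overflow and 64-bit indices are selected.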

23 May, 2025 1 commit

[Refactor] Enhance MergeSharedMemoryAllocations Pass for Improved Liveness... · 0fdefe2b

Lei Wang authored May 23, 2025

[Refactor] Enhance MergeSharedMemoryAllocations Pass for Improved Liveness Analysis and Scope Management (#508)

* Introduced a new StmtAttr structure to track the scope level of statements, enhancing the liveness analysis process.
* Updated the UpdateStmtAttr function to manage statement attributes effectively during memory allocation visits.
* Modified the VisitStmt_ methods to utilize the new scope level tracking, ensuring accurate memory access patterns.
* Refactored the LivenessAnalysis and PlanMemory functions to incorporate statement attributes, improving the handling of gen and kill points in memory management.
* Added a new helper function allow_warp_specialized in phase.py to conditionally enable warp specialization based on pass context and target, addressing potential bugs in the MergeSharedMemoryAllocations pass.
* Enhanced the OptimizeForTarget function to conditionally apply the MergeSharedMemoryAllocations pass based on warp specialization settings, improving robustness in memory allocation strategies.

22 May, 2025 3 commits

[Enhancement] Introduce padding annotation and improve memory access validation (#511) · f23c4d30

Lei Wang authored May 22, 2025

* Added a new attribute `kPaddingMap` in `builtin.h` for managing padding annotations.
* Enhanced `SafeMemorysRewriter` to utilize an annotated padding map for buffer stores, improving memory access safety.
* Implemented checks in `layout_inference.cc` to ensure buffers are correctly referenced during layout mapping.
* Introduced a new test file for validating the padding annotation functionality in TileLang.

[Bugfix] Enhance smem copy selector for uncommon shape (#510) · dbe8689f

Lei Wang authored May 22, 2025

* [Refactor] Enhance GEMM warp partitioning logic for improved performance and flexibility

* Updated the warp partitioning logic in `Gemm::ComputeWarpPartition` to better handle various GEMM policies, including FullRow, FullCol, and Square.
* Implemented checks to dynamically adjust warp allocation based on matrix dimensions, ensuring optimal performance.
* Introduced a new `SelectCopy` template to streamline memory access patterns in CUDA templates, enhancing compatibility across different architectures.
* Refactored the Python `GemmWarpPolicy` class to align with the updated C++ logic, improving clarity and maintainability in warp allocation strategies.

* [Refactor] Optimize matrix multiplication parameters and performance in quickstart example

* Updated thread count in the kernel context from 256 to 128 to enhance performance.
* Increased the matrix dimensions (M, N) to 1024 and the block sizes (block_M, block_N) to 128, improving computational efficiency.
* Adjusted the pipeline stages in the GEMM loop from 0 to 3 for better parallel execution.
* Cleaned up comments for clarity and corrected a typo in the memory copy comment.

* [Refactor] Simplify Copy type selection in OperandTraits for improved clarity

* Replaced the conditional Copy type definition with a new SelectCopy template in OperandTraits, enhancing readability and maintainability of the code.
* This change streamlines the logic for selecting memory copy patterns based on matrix dimensions and warp configurations.
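
To make the FullRow, FullCol, and Square policies named in this commit more concrete, here is a simplified Python sketch of splitting a warp count into an (m_warp, n_warp) grid. It is not the logic of `Gemm::ComputeWarpPartition`: the interpretation of each policy and the "most square tile" heuristic are assumptions for illustration.

```python
def compute_warp_partition_sketch(num_warps: int, policy: str, M: int, N: int):
    """Return (m_warp, n_warp) with m_warp * n_warp == num_warps."""
    if policy == "FullRow":
        return num_warps, 1  # assumed: all warps split the M dimension
    if policy == "FullCol":
        return 1, num_warps  # assumed: all warps split the N dimension
    if policy != "Square":
        raise ValueError(f"unknown policy: {policy}")
    # Assumed heuristic: choose the factor pair whose per-warp tile is closest to square.
    best = None
    for m in range(1, num_warps + 1):
        if num_warps % m:
            continue
        n = num_warps // m
        score = abs(M / m - N / n)
        if best is None or score < best[0]:
            best = (score, m, n)
    return best[1], best[2]
```

With 4 warps on a 128 x 128 tile, the Square branch picks a 2 x 2 grid, while a 256 x 64 tile pushes more warps onto the M dimension.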

[Refactor] Update buffer handling in layout transformation functions (#509) · 094796b6

Lei Wang authored May 22, 2025

* Modified `makeBufferWithLayout` to include a `var_remap` parameter for improved variable remapping during buffer creation.
* Enhanced buffer load and store operations to utilize the new variable remapping logic, ensuring correct buffer references.
* Commented out a check in `ThreadExtent` for clarity, maintaining functionality while improving code readability.

21 May, 2025 1 commit

[Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling... · 41d4988b

Lei Wang authored May 22, 2025

[Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling and initialization (#507)

* [Refactor] Update reduce functions to support default dimension values and improve dimension handling

* Added a helper function `_legalize_dim` to handle negative dimension values in reduction operations.
* Updated `reduce_max`, `reduce_min`, `reduce_sum`, `reduce_abssum`, and `reduce_absmax` functions to accept a default dimension value of -1, enhancing usability and flexibility in buffer reduction operations.
* Ensured consistent dimension handling across all reduction functions for improved clarity and correctness.

* Update submodule `tvm` to latest commit c2921fd, ensuring compatibility with recent changes.

* [Refactor] Enhance ReduceOp and JITKernel for improved dimension handling and initialization

* Updated ReduceOp to handle 1D reduction cases and ensure correct dimension checks, improving robustness in reduction operations.
* Initialized prim_func in JITKernel to enhance clarity and prevent potential null reference issues.
* Added whitespace for better code readability in reduce.py.
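
The negative-dimension handling described for the reduction functions follows a common convention, sketched below. `legalize_dim` here is an illustrative stand-in and its signature is an assumption, not the actual `_legalize_dim` helper.

```python
def legalize_dim(ndim: int, dim: int = -1) -> int:
    """Map a possibly negative `dim` to its non-negative equivalent."""
    if not -ndim <= dim < ndim:
        raise ValueError(f"dim {dim} is out of range for a {ndim}-D buffer")
    return dim + ndim if dim < 0 else dim


# With the default of -1, a reduction over a 3-D buffer targets the last axis.
last_axis = legalize_dim(3)
```

Normalizing once at the entry point means the rest of the reduction code only ever sees indices in the range 0 to ndim - 1.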

20 May, 2025 3 commits

[Refactor] Update GlobalMemChecker to Detect Lower Bound illegal memory access automatically (#505) · 84ddb9e1

Lei Wang authored May 20, 2025

* [Refactor] Update GlobalMemChecker to use IRVisitorWithAnalyzer for improved analysis (#505)

* Refactored GlobalMemChecker to inherit from IRVisitorWithAnalyzer, enhancing its capabilities for expression analysis.
* Updated condition checks to utilize the new analyzer interface, improving clarity and correctness in memory access validation.
* Added additional lower bound condition checks to ensure comprehensive validation of memory access indices.

* [Refactor] Update GlobalMemChecker to use StmtExprVisitor for improved memory access validation

* Refactored GlobalMemChecker to inherit from StmtExprVisitor, enhancing its capabilities for expression analysis.
* Updated condition checks to utilize the new analyzer interface, improving clarity and correctness in memory access validation.
* Ensured that the analyzer is passed correctly during instantiation, maintaining consistency in condition checks.
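
The idea of adding lower-bound checks alongside upper-bound checks can be illustrated with the sketch below. It is a simplification: `required_guards` is a hypothetical helper, and it takes precomputed index bounds, whereas the actual checker derives them through the analyzer.

```python
def required_guards(index_min: int, index_max: int, extent: int) -> list:
    """List the bound checks that cannot be proven unnecessary.

    `index_min` and `index_max` are the inclusive bounds of the index expression.
    """
    guards = []
    if index_min < 0:
        guards.append("index >= 0")  # lower-bound check
    if index_max >= extent:
        guards.append(f"index < {extent}")  # upper-bound check
    return guards


# Index expression `i - 1` with i in [0, 16) ranges over [-1, 14].
guards = required_guards(-1, 14, 16)
```

An upper-bound-only checker would treat the access above as safe, even though the first iteration reads one element before the start of the buffer.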

[Refactor] Adjust GEMM fragment layout for improved clarity and performance (#504) · c59e1aab

Lei Wang authored May 20, 2025

* Reordered the layout construction in makeGemmFragmentB so that the Replicate method is called before Repeat, for better readability and performance.
* This change improves the logical flow of fragment creation, aligning with best practices for GEMM layout management.

[Refactor] Refactor `jit` to `_JitImplementation` to support `@tilelang.jit` (#502) · 8c8d8ca2

Lei Wang authored May 20, 2025

* [Refactor] Rename `jit` class to `_JitImplementation` and improve debug path handling

* Refactored the `jit` class to `_JitImplementation` for clarity and encapsulation.
* Enhanced handling of `debug_root_path` to ensure it is correctly set as an absolute path when provided.
* Updated the public `jit` function to serve as a decorator interface, allowing for both default and configured usage.
* Added validation to ensure input tensors are contiguous in the Cython wrapper, improving error handling.

* [Refactor] Improve formatting and handling in `_JitImplementation` and `jit` function

* Refactored the `_JitImplementation` class to enhance readability by adjusting comment formatting and consolidating conditions for setting `debug_root_path`.
* Updated the `jit` function signature for better alignment and clarity in parameter definitions.
* Ensured consistent spacing and comments throughout the code for improved maintainability.

* [Refactor] Update GEMM test parameters for performance optimization

* Set num_stages to 0 and adjusted matrix dimensions in the GEMM test function to enhance performance and consistency across tests in test_tilelang_jit_gemm.py.
* Reduced the number of threads used in the test to align with the updated configuration, improving overall test efficiency.

* [Refactor] Enhance buffer error logging in layout inference

* Updated the warning message in layout inference to provide clearer context when a buffer cannot be inferred due to its absence in the use list. This change improves the clarity of error reporting during layout inference operations.
* Refactored tensor handling in the Cython wrapper to ensure input tensors are checked for contiguity before processing, enhancing error handling and robustness in tensor management.

* bugfix
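
The decorator pattern described in this commit (a public `jit` function that works both bare and with configuration, backed by an implementation class) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the TileLang source: `_JitSketch`, the attribute stored on the wrapper, and the exact `jit` signature are all hypothetical, and no compilation takes place.

```python
import functools
import os
from typing import Callable, Optional


class _JitSketch:
    """Hypothetical stand-in for an implementation class that holds config."""

    def __init__(self, debug_root_path: Optional[str] = None):
        # Store the debug path as an absolute path when one is provided.
        self.debug_root_path = (
            os.path.abspath(debug_root_path) if debug_root_path else None
        )

    def __call__(self, func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # A real implementation would compile and cache here.
            return func(*args, **kwargs)

        wrapper.debug_root_path = self.debug_root_path
        return wrapper


def jit(func: Optional[Callable] = None, *, debug_root_path: Optional[str] = None):
    """Usable as `@jit` or as `@jit(debug_root_path=...)`."""
    impl = _JitSketch(debug_root_path=debug_root_path)
    if callable(func):
        return impl(func)  # bare form: the function is passed directly
    return impl            # configured form: return the decorator itself


@jit
def add(a, b):
    return a + b


@jit(debug_root_path="debug_out")
def mul(a, b):
    return a * b


print(add(2, 3), mul(2, 3))
print(os.path.isabs(mul.debug_root_path))
```

Making the configuration keyword-only keeps the two call forms unambiguous: a positional callable always means the bare form.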

8c8d8ca2

17 May, 2025 3 commits

[Refactor] Update GEMM layout and operand traits for improved CUDA compatibility (#500) · 33937683

Lei Wang authored May 18, 2025

* [Enhancement] Improve GEMM layout function and documentation

* Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies.
* Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations.
* Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases.

* lint fix

* [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility

* Adjusted the InferLayout method in gemm.cc to include trans_A in fragment creation, enhancing layout inference for transposed matrices.
* Updated OperandTraits in gemm_sm89.h and gemm_sm90.h to change the Copy type from SM75_U16x4_LDSM_N to SM75_U16x4_LDSM_T, optimizing memory access patterns for different warp configurations.
* Enhanced static assertions in gemm_sm90.h to clarify requirements for num_warp_m, ensuring compatibility with Hopper architecture.

* [Refactor] Clean up formatting in GEMM implementation and CUDA templates

* Simplified the formatting of the fragment creation in the InferLayout method of gemm.cc for better readability.
* Adjusted the static assertion message in gemm_sm90.h to enhance clarity regarding the num_warp_m requirement for Hopper architecture.

33937683

[Bugfix] Rename SM75_U16x8_LDSM_N into SM75_U16x8_LDSM_T for correctness (#499) · 2837878f

Lei Wang authored May 18, 2025

* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.

* Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.

* Add merge shared memory allocations pass and related configurations

- Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
- Registered configuration options for debugging and controlling the merging behavior.
- Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
- Adjusted import paths and added documentation for the new functionality.

* Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py

* Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.

* lint fix

* Update Copy type in OperandTraits for GEMM templates to use SM75_U16x4_LDSM_T and SM75_U16x8_LDSM_T for improved memory access patterns across CUDA architectures.

2837878f

[Enhancement] Fallback transposed_ldmatrix into `SM75_U16x4_LDSM_N` when warp_n is 8 (#498) · 68a3c4f3

Lei Wang authored May 17, 2025

* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.

* Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.

* Add merge shared memory allocations pass and related configurations

- Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
- Registered configuration options for debugging and controlling the merging behavior.
- Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
- Adjusted import paths and added documentation for the new functionality.

* Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py

* Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.

* lint fix
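
The timeout handler mentioned in this commit message pairs a custom `TimeoutException` with a `run_with_timeout` function and signal management. A minimal sketch of that general technique is shown below, assuming a Unix system and a call from the main thread. The signature and body here are assumptions for illustration, not the autotuner's actual code.

```python
import signal
import time


class TimeoutException(Exception):
    """Raised when the wrapped call exceeds its time budget."""


def _raise_timeout(signum, frame):
    raise TimeoutException("function call timed out")


def run_with_timeout(func, timeout, *args, **kwargs):
    """Run func(*args, **kwargs), raising TimeoutException after `timeout` seconds.

    Relies on SIGALRM, so it works only on Unix and only in the main thread.
    """
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, timeout)
    try:
        return func(*args, **kwargs)
    finally:
        # Always cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


try:
    run_with_timeout(time.sleep, 0.05, 1.0)
except TimeoutException as exc:
    print("timed out:", exc)
```

Restoring the previous handler in the `finally` block is what keeps one timed-out candidate from affecting the next call.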

68a3c4f3

16 May, 2025 2 commits

[Bugfix] Fix Hopper GEMM layout for small tile size (#497) · c93e8695

Lei Wang authored May 16, 2025

* [Enhancement] Improve GEMM layout function and documentation

* Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies.
* Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations.
* Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases.

* lint fix

c93e8695

[Enhancement] Introduce flag to visualize shared memory merge plan (#496) · dca2fb48

Lei Wang authored May 16, 2025

* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.

* Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.

* Add merge shared memory allocations pass and related configurations

- Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
- Registered configuration options for debugging and controlling the merging behavior.
- Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
- Adjusted import paths and added documentation for the new functionality.

* Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
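
To illustrate what merging shared memory allocations can achieve, the sketch below assigns offsets inside one merged region so that buffers whose live ranges do not overlap reuse the same bytes. It is a simplified greedy planner written for this note. The `Buf` type, `plan_offsets`, and the first-fit strategy are assumptions and are not taken from the pass itself.

```python
from typing import Dict, List, NamedTuple


class Buf(NamedTuple):
    name: str
    size: int   # bytes
    start: int  # index of the first statement using the buffer
    end: int    # index of the last statement using the buffer (inclusive)


def plan_offsets(bufs: List[Buf]) -> Dict[str, int]:
    """Greedy first-fit placement, largest buffers first."""
    placed = []  # (buffer, offset) pairs already assigned
    plan = {}
    for b in sorted(bufs, key=lambda x: -x.size):
        # Address ranges held by buffers that are live at the same time as b.
        busy = sorted(
            (off, off + p.size)
            for p, off in placed
            if not (p.end < b.start or b.end < p.start)
        )
        offset = 0
        for lo, hi in busy:
            if offset + b.size <= lo:
                break  # b fits in the gap before this range
            offset = max(offset, hi)
        placed.append((b, offset))
        plan[b.name] = offset
    return plan


bufs = [Buf("A", 1024, 0, 2), Buf("B", 1024, 3, 5), Buf("C", 512, 1, 4)]
plan = plan_offsets(bufs)
total = max(plan[b.name] + b.size for b in bufs)
print(plan, total)
```

In this example `A` and `B` are never live together, so they share offset 0 and the merged region needs 1536 bytes where separate allocations would need 2560. Printing the plan in this form is one simple way to inspect a merge result.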

dca2fb48

13 May, 2025 1 commit

[Enhancement] Support register input for gemm when trans_a or trans_b is true (#490) · d4f096ef

Lei Wang authored May 13, 2025

* [Refactor] Enhance makeGemmFragmentB to support transposition

* Updated the `makeGemmFragmentB` function to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition.
* Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation.
* Modified the function signature in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter.
* Added a new `matmul_sr` function in the test suite to validate the behavior of the updated fragment generation with transposition support.

* [Refactor] Enhance makeGemmFragmentA and makeGemmFragmentB for transposition support

* Updated the `makeGemmFragmentA` and `makeGemmFragmentB` functions to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition.
* Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation.
* Modified function signatures in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter.
* Added a new `matmul_rs` function in the test suite to validate the behavior of the updated fragment generation with transposition support.

* Improve error messaging in layout equality checks

* Enhanced the error output in layout equality checks to provide clearer context by adding line breaks for better readability in the debug output.
* This change ensures that when layouts are structurally unequal, the current and previous layouts are displayed more distinctly, aiding in debugging.

d4f096ef

10 May, 2025 1 commit

[Refactor] Improve layout equality checks and error messaging (#471) · c2480907

Lei Wang authored May 10, 2025

* [Refactor] Simplify buffer_region_to_tile_region function in copy.py

* Removed redundant logic for handling region extents in the buffer_region_to_tile_region function, streamlining the code for better readability and maintainability.
* Enhanced error handling by focusing on essential checks while eliminating unnecessary complexity related to variable extents.

* [Refactor] Improve layout equality checks and error messaging

* Updated the `IsEqual` method in `FragmentNode` to ensure consistent evaluation of thread ranges.
* Enhanced error messaging in `ParallelOp::InferLayout` to include source buffer information for better debugging.
* Adjusted `ReduceOp::InferLayout` to set thread range during layout condensation, improving layout inference accuracy.

* lint fix

* [Refactor] Rename SetThreadRange to BindThreadRange for clarity

* Updated the `SetThreadRange` method in `FragmentNode` and related classes to `BindThreadRange`, improving method naming consistency and clarity.
* Adjusted all references to the renamed method across the codebase, ensuring proper functionality and maintaining existing behavior.
* Enhanced layout equality checks to handle thread ranges more robustly in `IsEqual` method.
* Updated layout inference methods in `Gemm`, `ParallelOp`, and `ReduceOp` to utilize the new method name, ensuring seamless integration with the updated API.

* [Refactor] Update BindThreadRange usage across layout inference methods

* Modified the implementation of `BindThreadRange` in `FragmentNode` to create a new object instance, enhancing thread range binding functionality.
* Updated all references to `BindThreadRange` in layout inference methods across `Gemm`, `ParallelOp`, and `ReduceOp` to ensure consistency with the new implementation.
* Adjusted the return statements in various layout inference functions to utilize the updated method, maintaining existing behavior while improving clarity.

* lint fix
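
The change to `BindThreadRange` in this commit, returning a new object where it previously modified the existing one, follows a common copy-on-bind pattern. The Python sketch below shows the idea with a frozen dataclass. `FragmentSketch` and its fields are hypothetical and only loosely modelled on the description above.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class FragmentSketch:
    """Hypothetical, immutable stand-in for a fragment layout."""

    shape: Tuple[int, ...]
    thread_range: Optional[Tuple[int, int]] = None  # (min, extent)

    def bind_thread_range(self, thread_range: Tuple[int, int]) -> "FragmentSketch":
        # Return a copy with the range bound; the receiver is left untouched.
        return replace(self, thread_range=thread_range)


base = FragmentSketch(shape=(16, 16))
bound = base.bind_thread_range((0, 128))
print(base.thread_range, bound.thread_range)
```

Because the original object is never mutated, a layout shared between several inference steps cannot pick up a thread range by accident. Callers must use the returned value, which matches the note above about updating return statements.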

c2480907

09 May, 2025 4 commits

[Typo] Rename `power_of_int` with `pow_of_int` for consistency (#468) · c99b7056

Lei Wang authored May 09, 2025

* typo fix

* Rename `power_of_int` to `pow_of_int` in math operations and update corresponding Python API reference. Adjusted registration attributes to reflect the new naming convention.

c99b7056

[Feature] Implement fast integer power operation and related API (#466) · 1f5eb492

Lei Wang authored May 09, 2025

* [Refactor] Enhance TMA barrier validation and support for additional architectures (#463)

* Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
* Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.

* [Feature] Implement fast integer power operation and related API

* Added a new math operation `tl.power_of_int` in `math.cc` for efficient integer exponentiation.
* Introduced a corresponding Python API `pow_of_int` in `tir/op.py` to facilitate usage in TileLang.
* Enhanced `common.h` with a template function for integer power calculations.
* Updated documentation to reflect the new functionality and usage examples.
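
The commit message does not say how the integer power is computed. One widely used technique for fast integer exponentiation is exponentiation by squaring, sketched below in Python. The function name `int_pow_sketch` is hypothetical, and the code is not the template added to `common.h`.

```python
def int_pow_sketch(x: float, y: int) -> float:
    """Compute x**y for a non-negative integer y with O(log y) multiplications."""
    if y < 0:
        raise ValueError("exponent must be non-negative")
    result = 1.0
    base = x
    while y > 0:
        if y & 1:
            # The current bit of the exponent is set: fold this power in.
            result *= base
        base *= base  # square for the next bit
        y >>= 1
    return result


print(int_pow_sketch(2.0, 10))
print(int_pow_sketch(1.5, 3))
```

When the exponent is known at compile time, the same loop can be fully unrolled, which is one reason an integer-exponent operation can be cheaper than a general floating-point `pow`.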

1f5eb492

[Refactor] Enhance TMA barrier validation and support for additional architectures (#463) · f41c467c

Lei Wang authored May 09, 2025

* Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
* Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.

f41c467c

[Bugfix] Fix for T.copy with dynamic range (#462) · d946d1d4

Lei Wang authored May 09, 2025

* [Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py

* Refactored barrier functions to use new signatures for improved clarity and consistency.
* Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait` respectively.
* Removed argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.

* [Refactor] Update warp_specialized_rewriter with license change and code cleanup

* Replaced Apache License header with MIT License in `warp_specialized_rewriter.cc`.
* Removed the `ThreadTagChecker` class to streamline the code, as it was no longer needed.
* Added `#include` for `common/collector.h` to support new functionality.
* Updated file documentation to reflect the correct filename and purpose.
* Improved overall code readability by removing unnecessary comments and sections.

* [Feature] Add thread synchronization functions in builtin.py and refine buffer region checks in copy.py

* Introduced `sync_threads` and `sync_thread_partial` functions in `builtin.py` for improved thread synchronization capabilities.
* Enhanced documentation for new synchronization functions to clarify usage and parameters.
* Updated buffer region validation in `copy.py` to ensure type checking for integer values, improving error handling for region extents.

* lint fix

* [Feature] Introduce TMA barrier injection and related utilities

* Added `inject_tma_barrier.cc` to implement TMA barrier rewriting for CUDA GPU (sm90+).
* Created `common/attr.h` and `common/collector.h` for attribute checks and information collection from the IR.
* Updated `ir.cc` to use a constant for the main block name instead of a hardcoded string.
* Cleaned up `warp_specialized_rewriter.cc` by removing unnecessary whitespace.
* Enhanced thread tag validation with `ThreadTagChecker` to ensure only `threadIdx.x` is used in TMA barrier contexts.

* lint fix

d946d1d4

08 May, 2025 1 commit

[Refactor] Update barrier functions and add new example for GEMM with warp specialization (#456) · a91bc2a9

Lei Wang authored May 08, 2025

* Add example for warp specialization with flash attention

* Introduced a new example script `example_warp_specialize_flashmla.py` demonstrating flash attention using warp specialization in TileLang.
* Implemented the `flashattn` function with shared memory allocation and memory barrier synchronization for improved performance.
* Added a reference program for validation against PyTorch's implementation, including profiling for latency and performance metrics.
* Removed the outdated `example_warp_specialize_mla.py` to streamline examples and focus on the new implementation.

* Add memory barrier functions to builtin.py

* Introduced `barrier_wait` and `barrier_arrive` functions for memory barrier synchronization.
* Enhanced documentation with detailed docstrings for both functions, clarifying their usage and parameters.
* The `barrier_wait` function serves as a wrapper for `mbarrier_wait_parity`, supporting parity values 0 and 1.
* Improved code organization and readability by adding blank lines for better separation of logical sections.

* Enhance code readability by adding blank lines in example_warp_specialize_flashmla.py and builtin.py

* Added blank lines to improve code organization and separation of logical sections in `example_warp_specialize_flashmla.py`.
* Included blank lines in `builtin.py` around the `wait_wgmma` and `barrier_wait` functions for better readability.

* [Refactor] Update barrier functions and add new example for GEMM with warp specialization

* Refactored memory barrier functions in `example_warp_specialize_flashmla.py` to use the new `barrier_wait` and `barrier_arrive` methods for improved clarity and consistency.
* Introduced a new example script `example_warp_specialize_gemm_copy_gemm_0_1.py` demonstrating matrix multiplication with warp specialization and shared memory allocation.
* Enhanced the `layout.cc` and `elem.cc` files to improve structural equality checks and error handling in copy operations.
* Updated `warpgroup.py` to refine thread ID calculations for better performance in warp specialization scenarios.
* Added new shuffle operations in `builtin.py` for enhanced functionality in parallel computations.

* lint fix

* Update loop variable checks in SIMT loop and buffer region validation

* Modified checks in `elem.cc` to ensure loop variable sizes are less than or equal to source and destination range sizes for better error handling.
* Adjusted assertions in `copy.py` to reflect the updated logic, allowing for more flexible region extent comparisons and improved error messaging.

* lint fix

* test fix

a91bc2a9

06 May, 2025 2 commits

[Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP (#450) · 0a8c8b99

Lei Wang authored May 06, 2025

* [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP

* Introduced TILELANG_CHECK_LAST_ERROR macro to streamline error checking for kernel launches in both CUDA and HIP.
* Updated kernel launch code in wrapper.py to utilize the new macro, enhancing readability and maintainability.
* This change improves error reporting by providing detailed messages when kernel execution fails.

* [Refactor] Standardize error message formatting in TILELANG_CHECK_LAST_ERROR macro

* Updated the TILELANG_CHECK_LAST_ERROR macro in both CUDA and HIP implementations to ensure consistent formatting of error messages.
* Enhanced readability by aligning the error message structure across different platforms, improving maintainability of error handling code.

0a8c8b99

[Enhancement] Add new examples for warp specialization and TMA integration (#448) · b5faf25a

Lei Wang authored May 06, 2025

* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic

* Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
* Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.

* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files

* Updated function names from CamelCase to snake_case for better consistency across the codebase.
* Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
* Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.

* [Refactor] Rename operations to snake_case for consistency

* Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
* Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
* Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.

* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier

* Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
* Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
* Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
* Enhanced the TileLang API with new methods for retrieving block and thread extents.
* Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
* Improved layout inference and kernel launch logic for better performance and clarity.

* [Refactor] Clean up code formatting and improve readability

* Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
* Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
* Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
* Ensured consistent spacing and formatting across multiple files to enhance overall code readability.

* lint fix

* [Refactor] Update mbarrier functions for improved clarity and consistency

* Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
* Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
* Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
* Added detailed docstrings to clarify usage examples for memory barrier functions.

* Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.

* [Feature] Add examples for warp specialization and TMA barrier integration

* Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
* Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
* Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
* Updated `phase.py` to include TMA barrier injection in the optimization process.
* Improved documentation and comments for better clarity on usage and functionality.

* [Feature] Add example for warp specialization in GEMM with TMA barriers

* Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
* Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
* Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
* Enhanced documentation and comments for clarity on usage and functionality.

* lint fix

* [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection

* Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
* Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
* Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
* This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.

* lint fix

* [Feature] Add new examples for warp specialization and TMA integration

* Introduced multiple new example scripts demonstrating warp specialization techniques, including `example_warp_specialize_flashmla.py`, `example_warp_specialize_gemm_barrierpipe_stage2.py`, `example_warp_specialize_gemm_copy_0_gemm_1.py`, `example_warp_specialize_gemm_copy_1_gemm_0.py`, and `example_warp_specialize_gemm_softpipe_stage2.py`.
* Each example showcases matrix multiplication with warp specialization and TMA barriers, implementing kernel functions with shared memory allocation and memory barrier synchronization for enhanced performance.
* Added a test suite in `test_example_warp_specialize.py` to validate the functionality of the new examples.
* Updated the TileLang API to support these examples and improve kernel compilation and testing processes.
* Removed outdated example scripts to streamline the codebase and enhance clarity on available functionalities.

* lint fix

* Remove outdated example scripts for warp specialization and TMA integration to streamline the codebase. This includes `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, `example_warp_specialize_gemm_stage2.py`, and `example_warp_specialize_mla.py`, which are no longer needed following recent updates and improvements in the TileLang API.

b5faf25a

03 May, 2025 1 commit

[Refactor] Separate warp specialize rewriter and tma barrier injector pass (#447) · fce16b00

Lei Wang authored May 03, 2025

* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic

* Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
* Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.

* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files

* Updated function names from CamelCase to snake_case for better consistency across the codebase.
* Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
* Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.

* [Refactor] Rename operations to snake_case for consistency

* Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
* Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
* Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.

* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier

* Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
* Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
* Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
* Enhanced the TileLang API with new methods for retrieving block and thread extents.
* Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
* Improved layout inference and kernel launch logic for better performance and clarity.

* [Refactor] Clean up code formatting and improve readability

* Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
* Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
* Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
* Ensured consistent spacing and formatting across multiple files to enhance overall code readability.

* lint fix

* [Refactor] Update mbarrier functions for improved clarity and consistency

* Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
* Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
* Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
* Added detailed docstrings to clarify usage examples for memory barrier functions.

* Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.

* [Feature] Add examples for warp specialization and TMA barrier integration

* Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
* Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
* Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
* Updated `phase.py` to include TMA barrier injection in the optimization process.
* Improved documentation and comments for better clarity on usage and functionality.

* [Feature] Add example for warp specialization in GEMM with TMA barriers

* Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
* Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
* Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
* Enhanced documentation and comments for clarity on usage and functionality.

* lint fix

* [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection

* Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
* Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
* Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
* This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.

* lint fix
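
The `WarpSpecializedDetector` described in this commit message walks a TIR statement and records whether TMA operations and memory barrier operations occur. The Python sketch below shows the same visitor idea on a toy tree. The `Node` type, the two op-name sets, and `DetectorSketch` are illustrative assumptions. How the real pass acts on the result is not specified in the message and is not modelled here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy IR node: an operation name plus child nodes."""

    op: str
    children: List["Node"] = field(default_factory=list)


# Illustrative op-name sets, chosen for this sketch.
TMA_OPS = {"tma_load", "tma_store"}
MBARRIER_OPS = {"get_mbarrier", "mbarrier_wait_parity", "mbarrier_arrive"}


class DetectorSketch:
    def __init__(self) -> None:
        self.has_tma = False
        self.has_mbarrier = False

    def visit(self, node: Node) -> None:
        # Record what this node is, then recurse into its children.
        if node.op in TMA_OPS:
            self.has_tma = True
        if node.op in MBARRIER_OPS:
            self.has_mbarrier = True
        for child in node.children:
            self.visit(child)


body = Node("for", [
    Node("mbarrier_wait_parity"),
    Node("seq", [Node("tma_load"), Node("gemm")]),
])
detector = DetectorSketch()
detector.visit(body)
print(detector.has_tma, detector.has_mbarrier)
```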

fce16b00