Commits · 48e70e68ce3461279fc52440a61f2a7d78e2aa01 · OpenDAS / tilelang

17 Dec, 2025 2 commits

[Enhancement] Update examples and tests for improved type handling functionality (#1448) · c750fb8a

Lei Wang authored Dec 17, 2025

* [Enhancement] Update examples and tests for improved type handling and functionality

- Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
- Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
- Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
- Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.

* [Refactor] Update accumulation data type to float32 across examples

- Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
- This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.

* [Refactor] Standardize data type usage across benchmark scripts

- Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
- Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
- Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.

* [Refactor] Standardize data type usage in templates and scripts

- Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
- Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
- This change aims to streamline type handling and improve compatibility with existing workflows.

* [Refactor] Standardize data type usage in examples and benchmarks

- Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
- Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
- Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.

* [Refactor] Import dtypes from language.v2 module

- Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
- This change aims to streamline data type management and improve overall code clarity.

* fix

* [Refactor] Standardize data type usage across scripts

- Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
- Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
- This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.

* [Refactor] Update data type handling for consistency and clarity

- Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
- Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
- Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
- This refactor aims to streamline data type management and improve overall code clarity and maintainability.

* [Enhancement] Improve data type handling and error messaging

- Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
- Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
- Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.

* [Fix] Correct boolean flag in GEMM SP test case

- Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
- This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.

* [Refactor] Standardize data type usage across scripts

- Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
- Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
- This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.

* [Refactor] Standardize data type usage in various modules

- Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
- Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
- This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.

* [Refactor] Update argument parsing for data types in benchmarks

- Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
- This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.

* [Refactor] Update data type handling in benchmark and example scripts

- Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
- Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
- This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.

* [Refactor] Fix data type conversion in multiple scripts

- Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
- This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.

* [Refactor] Update float8 data type usage across multiple scripts

- Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
- This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.

* [Refactor] Enhance float8 data type handling in CUDA code generation

- Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
- Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
- Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
- This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.

* [Refactor] Streamline float8 data type handling in CUDA and related modules

- Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
- Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
- This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.

* [Refactor] Remove unnecessary cache disabling in float8 example script

- Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
- This change enhances clarity and maintainability of the example script without affecting its functionality.

* [Refactor] Update data type usage in debug print tests

- Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
- This update enhances consistency in data type handling within the test suite, improving clarity and maintainability.

* lint fix

* Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples

* Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.

c750fb8a

[Language] Introduce `T.annotate_restrict_buffers` (#1428) · 0814b171

Lei Wang authored Dec 17, 2025

* [Enhancement] Introduce non-restrict parameter support in code generation

- Added a new PrimFunc-level attribute `tl.non_restrict_params` to specify handle Vars that should not be marked with the restrict qualifier during code generation.
- Updated `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to handle non-restrict parameters, ensuring proper treatment of overlapping buffer aliases.
- Implemented a new annotation function `annotate_restrict_buffers` to facilitate the marking of buffer parameters as non-restrict.
- Enhanced the `SplitHostDevice` transformation to propagate non-restrict parameters from host to device functions.
- Added a new transform function `HoistNonRestrictParams` to manage non-restrict parameters effectively.

* [Enhancement] Improve HoistNonRestrictParams transformation

- Updated the HoistNonRestrictParams function to recursively collect all `tl.non_restrict_params` annotations from nested blocks, enhancing flexibility in annotation placement.
- Introduced a new NonRestrictCollector class to manage the collection and deduplication of non-restrict parameters.
- Modified the SplitHostDevice transformation to remove the non-restrict attribute from the host-side PrimFunc after propagation to device kernels.
- Adjusted the LowerAndLegalize function to directly apply the HoistNonRestrictParams transformation without exception handling, streamlining the process.

* [Refactor] Simplify non-restrict parameter handling in code generation

- Removed unnecessary normalization logic and associated data structures from `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP`.
- Streamlined the handling of non-restrict parameters by directly inserting them into the `non_restrict` set, improving code clarity and maintainability.
- Updated conditional checks to eliminate redundant checks against normalized names, enhancing performance and readability.

* [Dependency] Update TVM subproject to latest commit 68aa8461

- Updated the TVM subproject to the latest commit, ensuring compatibility with recent changes and improvements.
- Refactored non-restrict parameter handling in `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to enhance code clarity and maintainability.
- Adjusted the `SplitHostDevice` transformation to streamline the propagation of non-restrict parameters.

* fix

0814b171

16 Dec, 2025 1 commit

[Refactor] Reduce direct dependency on PyTorch due to its limited type support (#1444) · dda45126

Lei Wang authored Dec 16, 2025

* [Enhancement] Update KernelParam to use tvm.DataType directly and add torch_dtype conversion method

- Changed dtype in KernelParam from torch.dtype to tvm.DataType to support a wider range of data types and prevent information loss during conversions.
- Added a new method, torch_dtype, to convert tvm.DataType back to torch.dtype for tensor creation.
- Updated various adapters to utilize the new torch_dtype method for parameter type conversion during initialization.

* [Enhancement] Refactor CUDA type handling and add support for FP4 and FP8 types

- Renamed functions for clarity: GetFP8Type, GetFP6Type, and GetFP4Type are now GetTileLangFP8Type, GetTileLangFP6Type, and GetTileLangFP4Type respectively.
- Enhanced FP4 type handling to support additional lane sizes (2, 4, 8, 16, 32, 64).
- Updated CUDA code generation to include new FP8 and FP4 types, ensuring proper type handling in PrintType and related functions.
- Introduced new structures for FP8 types in cuda_fp8.h to facilitate better memory management and type packing.
- Added methods in KernelParam and tensor utilities to recognize and handle float4 types, improving compatibility with PyTorch.
- Enhanced logging for debugging purposes in various CUDA functions to track type handling and memory operations more effectively.

* lint fix

* Remove unnecessary logging statements from CUDA code generation and delete obsolete matrix multiplication test file.

* [Enhancement] Add support for FP4 and FP8 types in CUDA code generation

- Enhanced PrintVecElemLoad and PrintVecElemStore functions to handle new FP4 types.
- Updated arg_binder to allow float4 to match int8 at runtime, improving compatibility with PyTorch.
- Modified loop_vectorize to account for buffer dtype lanes in vectorization calculations.
- Refactored tensor type mapping to support new float4 and float8 types, ensuring correct type handling in tensor operations.
- Added tests for FP4 and FP8 copy operations to validate functionality and integration with existing workflows.

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

dda45126

15 Dec, 2025 1 commit
- Enhance vectorized conversion support (#1438) · bcae814e
  Xiangwen Wang authored Dec 15, 2025
  
  bcae814e
13 Dec, 2025 1 commit

[CUDA] Add read-only parameter annotation for CUDA codegen (#1416) · 00dd7388

Lei Wang authored Dec 14, 2025

* [Enhancement] Add read-only parameter annotation for CUDA codegen

* Introduced the `AnnotateReadOnlyParams` transformation to annotate read-only handle parameters in PrimFuncs, enabling the generation of `const` qualifiers in CUDA codegen.
* Updated `PrintFunctionSignature` and `AddFunction` methods to utilize the new attribute `tl.readonly_param_indices`, enhancing performance by allowing read-only cache loads.
* Modified the optimization pipeline to include the new annotation step, improving the overall efficiency of the code generation process.

* lint fix

* [Dependency] Update apache-tvm-ffi version to >=0.1.3

* Updated the version of apache-tvm-ffi in pyproject.toml, requirements.txt, and requirements-dev.txt to ensure compatibility with the latest features and fixes.
* Made adjustments in CUDA and HIP template files to use `const` qualifiers for global pointer parameters, enhancing code safety and clarity.

* lint fix

* [Enhancement] Refactor ReadWriteMarker for improved parameter handling

* Updated the ReadWriteMarker class to accept a set of parameter or data variables, enhancing its ability to track written variables.
* Introduced a new method, ResolveDataVarFromPtrArg, to resolve underlying buffer data from pointer-like arguments, improving accuracy in identifying written variables.
* Modified the MarkReadOnlyParams function to gather handle parameters and their corresponding buffer data variables, streamlining the process of determining read-only parameters.
* Enhanced the logic for identifying written variables to account for aliased data variables, ensuring comprehensive tracking of modifications.

* lint fix

* Update tma_load function to use const qualifier for global memory pointer

* Changed the parameter type of gmem_ptr in the tma_load function from void* to void const* to enhance type safety and clarity in memory operations.
* This modification ensures that the function correctly handles read-only global memory pointers, aligning with best practices in CUDA programming.

* Remove commented-out code and reorder transformations in OptimizeForTarget function for clarity

* Refactor buffer marking logic in annotate_read_only_params.cc to improve accuracy in identifying written variables. Update OptimizeForTarget function to reorder transformations for better clarity.

00dd7388

12 Dec, 2025 1 commit

[Enhancement] Introduce `T.__ldg` (#1414) · 6f67da84

Lei Wang authored Dec 12, 2025

* [Enhancement] Add __ldg intrinsic for CUDA read-only cache loads

* Introduced the __ldg intrinsic to enable explicit read-only cached loads from global memory in CUDA.
* Updated the corresponding documentation and added support in both CUDA and HIP code generation.
* Enhanced the Python interface for __ldg to accept BufferLoad and Buffer types, improving usability.

* [Enhancement] Update formatting and linting rules in pyproject.toml; minor test adjustment

* Added new formatting rules in pyproject.toml to enforce consistent code style, including hanging indents and argument splitting.
* Updated test_tilelang_language_intrinsics_codegen.py to improve readability by adding a blank line before the main execution block.
* Refactored error messages in builtin.py for better clarity and consistency, ensuring proper formatting in function definitions and raising ValueErrors.

* lint fix

6f67da84

11 Dec, 2025 1 commit
- [AMD] Enable FA2 fwd on AMD MI300X (#1406) · 53be59dc
  danielhua23 authored Dec 11, 2025
```
* enable FA2 on AMD MI300X

* make lint happy
```
  53be59dc
01 Dec, 2025 1 commit

[Enhancement] Implement dynamic unroll factor in CUDA code generation (#1360) · 388ee7ee

Lei Wang authored Dec 02, 2025

* [Enhancement] Implement dynamic unroll factor in CUDA code generation

This commit introduces support for specifying a dynamic unroll factor in the CUDA code generation. The `unroll_factor` map is added to store unroll factors for loop variables, allowing for more flexible and optimized loop unrolling. Additionally, the `unroll` function is integrated into the loop language, enabling users to define unroll factors directly in their code. This enhancement improves performance by allowing tailored unrolling strategies based on specific loop characteristics.

* lint fix

* [Bugfix] Correct initialization of non-zero counters in custom compress kernel and update TIR registration for gemm_sp_py to use the correct tile operation

388ee7ee

28 Nov, 2025 1 commit

[Enhancement] Improve error handling and assertion messages across runtime and... · 17cfeb76

Lei Wang authored Nov 28, 2025

[Enhancement] Improve error handling and assertion messages across runtime and argument binding (#1356)

This commit enhances the error handling mechanisms in the runtime by introducing CPU-safe runtime helpers and refining assertion messages in the CodeGenCHost and ArgBinder. It includes structured packed error messages for various conditions, improving clarity in diagnostics. Additionally, the CMake configuration is updated to always include necessary runtime helpers, ensuring consistent error reporting. The changes aim to provide clearer feedback during runtime errors and improve the overall robustness of the argument binding process.

17cfeb76

27 Nov, 2025 1 commit

[Refactor] Improve assertion handling in CodeGenCHost and ArgBinder (#1352) · 1e92d11c

Lei Wang authored Nov 28, 2025

* [Refactor] Improve assertion handling in CodeGenCHost and ArgBinder

This commit refines the assertion message generation in CodeGenCHost by optimizing the handling of equality checks and reducing buffer size for error messages. Additionally, it enhances the ArgBinder by introducing a nullable guard mechanism for assertions, allowing for more precise error handling when binding arguments. The changes improve the clarity and efficiency of assertion handling across the codebase.

* [Enhancement] Update matmul kernel and optimize argument binding

This commit enhances the matmul kernel by introducing additional tensor parameters and refining the pipeline stages for improved performance. It also updates the argument binding mechanism to include a flag indicating whether buffers are used, enhancing the efficiency of buffer management. Furthermore, the optimization phase in the engine is improved by adding a simplification step, ensuring better performance and clarity in the generated code.

* lint fix

* [Enhancement] Add tensor checks documentation and improve argument binding assertions

This commit introduces a new documentation page for host-side tensor checks, detailing the automatic validations performed by TileLang on kernel arguments. It enhances the ArgBinder by adding assertions for non-null pointers when arguments are used, improving error handling. Additionally, the optimization phase in the engine is updated to include a simplification step, ensuring better performance and clarity in the generated code.

* [Enhancement] Update .gitignore and refine matmul kernel for improved performance

This commit adds host checks logs to the .gitignore file to prevent unnecessary log files from being tracked. Additionally, it refines the matmul kernel by adjusting pipeline stages, updating tensor parameters, and enhancing argument handling for better performance. The changes also include improved error messages in the argument binding process, ensuring clearer diagnostics for users.

* lint fix

* lint fix

* [Refactor] Simplify tensor_null_test function and remove ptr_null_test

This commit refactors the tensor_null_test function by adding a with_bias parameter and removing the ptr_null_test function, which was previously unused. The run_test function is updated to reflect these changes, streamlining the testing process for tensor operations.

* lint fix

* fix

1e92d11c

24 Nov, 2025 1 commit
- [Feat] Support warp reduce (#1316) · caa6dd3f
  Tong WU authored Nov 24, 2025
```
* [Feat] Support warp reduce

* lint

* add test

* lint
```
  caa6dd3f
19 Nov, 2025 1 commit

[Enhancement] Enhance CUDA compilation by integrating pass context configuration (#1283) · 551ac60d

Lei Wang authored Nov 19, 2025

- Updated the `tilelang_callback_cuda_compile` function to accept a `pass_config` parameter, allowing for more flexible compilation options.
- Introduced handling for fast math and PTXAS options based on the provided pass configuration.
- Modified the CUDA build process in `rt_mod_cuda.cc` to utilize the current pass context, improving the integration of compilation settings.
- Refactored NVCC command construction to use a dedicated function for better clarity and maintainability.

551ac60d

18 Nov, 2025 1 commit

[FFI] Use tvm ffi as the default execution backend (#1259) · 74da3696

Lei Wang authored Nov 18, 2025

* [Refactor] Update FFI type handling and simplify argument management

* Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity.
* Updated function registration in `runtime.cc` to utilize canonical names for better consistency.
* Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled.
* Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection.
* Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity.

* [Update] Sync TVM submodule and enhance kernel source handling

* Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes.
* Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging.
* Commented out the main execution call in test files to prevent unintended execution during testing.
* Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues.
* Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends.

* [Refactor] Clean up imports and improve code formatting

* Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code.
* Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency.
* Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality.

* Update execution backend options and improve resolution logic

- Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target.
- Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions.
- Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target.
- Updated documentation to reflect changes in execution backend options and their defaults.

* lint fix

* fix

* Enhance argument handling in CUDA and HIP runtime modules

- Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime.
- Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers.
- Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks.

* lint fix

* minor fix

* fix

* recover check

* Refactor argument binding and validation in `arg_binder.cc`

- Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers.
- Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards.
- Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling.
- Minor adjustments in test files to streamline kernel execution and improve readability.

* lint fix

* stride fix

* minor fix

* fix

* lint fix

* Add CUDA stream access policy window helpers and integrate with L2 persistent cache management

- Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage.
- Updated runtime files to include new FFI packed functions for managing stream attributes.
- Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown.
- Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source.

* check with symbolic

* support null ptr

* Update CMakeLists and lower.py for code generation and subproject status

- Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support.
- Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility.
- Marked the TVM subproject as dirty to indicate local modifications.

* lint fix

* Update comments for clarity in quickstart.py

74da3696

17 Nov, 2025 1 commit
- [Bugfix] Fix multiple cg defination when using T.sync_grid (#1272) · 220c3236
  Yu Cheng authored Nov 18, 2025
  
  220c3236
12 Nov, 2025 1 commit

[Refactor] Add kernel selection option for GEMM v1 in environment settings (#1200) · 8fbe1b3a

Lei Wang authored Nov 12, 2025

* Add kernel selection option for GEMM v1 in environment settings

- Introduced `TILELANG_USE_GEMM_V1` environment variable to control the selection of GEMM version.
- Added `use_gemm_v1` method in the `Environment` class to determine if GEMM v1 should be used based on the environment variable.
- Updated GEMM function assignment to default to v2, allowing for v1 to be forced via the new environment variable.

* bug fix

* Add kernel selection option for GEMM in environment settings

- Introduced `TILELANG_USE_GEMM_V1` environment variable to allow users to select between GEMM v1 and v2 implementations.
- Updated `gemm` function to default to v2 but switch to v1 if the environment variable is set to a truthy value.
- Added a method `use_gemm_v1` in the `Environment` class to facilitate this selection based on the environment variable.

* Refactor GEMM macro generator to use BufferRegion instead of Buffer

- Updated `wgmma` and `wgmma_rs` methods in `TensorCoreIntrinEmitter` to accept `BufferRegion` parameters instead of `Buffer`.
- Adjusted related calls in `GemmWGMMA` to ensure compatibility with the new parameter types.
- Simplified buffer access logic for better clarity and maintainability.

* Refactor GEMM functions to utilize BufferRegion for improved memory handling

- Updated `run_gemm`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` functions to set `num_stages` based on block dimensions, enhancing performance for larger matrices.
- Simplified calls to GEMM functions by removing redundant parameters and ensuring compatibility with BufferRegion.
- Introduced utility functions for converting between Buffer, BufferLoad, and BufferRegion, improving code clarity and maintainability.
- Enhanced error handling for full region checks in GEMM operations to ensure correctness in memory access.

* Refactor GEMM code for improved readability and consistency

- Cleaned up formatting and spacing in GEMM-related files for better readability.
- Standardized comments and code structure across various GEMM functions and macros.
- Enhanced error messages for clarity in buffer region checks.
- Removed redundant lines and improved overall code maintainability.

* Update GEMM correctness evaluation and macro generator for improved functionality

- Modified `N_VALUES` in `correctness_evaluation_sm70.py` to include only relevant sizes for tests.
- Updated test function call in `correctness_evaluation.py` to use `test_gemm_false_true` for better accuracy in testing.
- Refactored buffer handling in `mma_sm70_macro_generator.py` to improve clarity and consistency in shared buffer access.
- Enhanced `gemm_mma_sm70.py` to ensure full region checks for input and output buffers, improving correctness in GEMM operations.

* Refactor GEMM and intrinsic files for improved clarity and functionality

- Removed unused variable `A_stride_last` in `mma_sm70_macro_generator.py` to streamline code.
- Adjusted function signature formatting in `swizzle.py` for better readability.
- Restored the return of `GemmWGMMA` in `__init__.py` for correct GEMM instantiation.
- Removed unused variable `B_buf` in `gemm_mma_sm70.py` to enhance code cleanliness.
- Improved function signature formatting in `language.py` for consistency.

* Enhance GEMM and MMA functionality for FP64 support

- Refactored `GemmNode` to streamline the decision-making process for GEMM instruction selection.
- Added support for FP64 inputs in the MMA dispatcher, enabling new tensor operations.
- Introduced a new layout function for FP64 in `mma_layout.py` to facilitate shared memory storage.
- Updated `TensorCoreIntrinEmitter` to handle FP64 data types, including adjustments for micro tile dimensions and loading mechanisms.
- Enhanced utility functions to accommodate FP64 index mapping for shared memory operations.

* lint fix

* Refactor GEMM correctness evaluation and shared memory alignment handling

- Reverted the GEMM function call in `correctness_evaluation.py` to the original implementation for consistency.
- Added a helper function in `merge_shared_memory_allocations.cc` to streamline the marking of shared variables under alignment scope.
- Enhanced the `VisitExpr_` methods to ensure proper handling of shared memory alignment for `BufferLoadNode` and `VarNode` types.
- Cleaned up commented-out test code in `correctness_evaluation.py` for better readability.

* Enhance GEMM and MMA implementations with region-based memory handling

- Updated GEMM and MMA classes to utilize BufferRegion for input and output buffers, improving memory management and supporting strided GEMM operations.
- Added checks to ensure full region compliance for input buffers, enhancing correctness in matrix multiplication.
- Implemented clear accumulation functionality to reset output buffers before accumulation, ensuring accurate results in GEMM operations.

* Refactor test_tilelang_example_deepseek_v32.py to improve import structure and function calls

- Updated import statements to directly reference modules instead of individual test functions, enhancing clarity.
- Modified function calls to use the new module structure for better organization and maintainability in testing examples.

* Enhance OnArrayDeclaration method to handle repeated buffer declarations

- Updated the OnArrayDeclaration method to merge metadata for buffers that may appear in multiple Allocate statements, improving robustness against upstream transformations.
- Added logic to prefer concrete element data types and record extents when previously unknown, enhancing the handling of buffer declarations.

* Add abbreviation for bfloat16 data type in mfma_macro_generator.py

- Introduced a new abbreviation "bf16" for the bfloat16 data type in the mfma_macro_generator.py file, enhancing clarity and consistency in data type representation.

* Refactor CodeGenTileLangHIP to enhance dtype handling and mfma call generation

- Introduced a mapping function to normalize input data types to their corresponding scalar types, improving compatibility with MfmaTraits.
- Updated the mfma call generation to utilize the new mapping, streamlining the code and enhancing clarity.
- Removed outdated dtype mapping and replaced it with a more flexible approach to support additional data types like FP8.

* lint fix

* Enhance backend configuration in CMakeLists.txt and improve dtype handling in CodeGenTileLangHIP

- Introduced a macro to define backend options for CUDA, ROCM, and Metal, allowing user overrides and caching of settings.
- Updated logic to track user-selected backends and conditionally enable defaults based on environment variables.
- Refactored dtype handling in CodeGenTileLangHIP to streamline mfma call generation and improve clarity.
- Added support for bfloat16 in the mfma_macro_generator.py, enhancing data type representation consistency.

* Update bfloat16 handling in CodeGenTileLangHIP and mfma_macro_generator.py

- Changed the representation of bfloat16 in CodeGenTileLangHIP from "bfloat16x4" to "bfloat16x4_vec" for improved clarity.
- Adjusted the mfma_suffix generation in mfma_macro_generator.py to remove the underscore before "bf16", aligning with HIP intrinsic requirements.

* Change logging level from WARNING to DLOG in LegalizeNegativeIndex for non-negative index checks to reduce log verbosity.

* Refactor attention sink examples to simplify index calculations

- Updated index handling in `example_gqa_sink_bwd_bhsd.py` and `example_mha_sink_bwd_bhsd.py` to eliminate unnecessary local allocations and streamline logic for determining start and end indices.
- Improved readability by using direct calculations instead of local variables for index bounds in pipelined loops.

* Refactor attention sink examples to streamline index calculations

- Simplified index handling in `example_gqa_sink_bwd_bhsd.py`, `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py`, `example_mha_sink_bwd_bhsd.py`, `example_mha_sink_fwd_bhsd_wgmma_pipelined.py`, and `example_mha_sink_fwd_bhsd.py` by removing unnecessary local allocations for start and end indices.
- Enhanced readability by directly calculating index bounds for pipelined loops, improving overall code clarity.

* lint fix

* bugfix

* Refactor reduce operation handling in CUDA and Python

- Removed outdated shared memory reduction logic from `reduce.cc`.
- Introduced fragment allocation and improved buffer handling in `reduce.py` to support shared and fragment scopes.
- Updated CUDA header to define a wider accumulator type for better numerical accuracy.
- Enhanced error handling for buffer scope validation in the reduction process.

* Fix ReduceOpNode to correctly compute AbsMax by using absolute values of inputs

* Enhance unit loop handling by refining annotation checks

- Updated the condition for identifying effectively empty annotations in unit loops to include cases where only the `pragma_unroll_explicit` hint is present.
- Introduced a new method, `IsEffectivelyEmptyAnnotation`, to encapsulate this logic, improving code clarity and maintainability.

* clean clode

8fbe1b3a

05 Nov, 2025 1 commit

[SM70] Refactor and minor fix for SM70 (#1195) · 4a9cb470

Lei Wang authored Nov 06, 2025

* [Feature] Add support for SM70 tensor core MMA instructions

- Introduced new intrinsic `ptx_mma_sm70` for Volta GPUs, enabling m16n16k4 shape with FP16 inputs and FP16/FP32 accumulation.
- Added `GemmMMASm70` class for handling GEMM operations specific to SM70 architecture.
- Implemented layout functions for Volta swizzled layouts and updated existing GEMM layout inference logic.
- Updated `requirements-dev.txt` to include `apache-tvm-ffi` dependency.
- Added correctness evaluation script for testing GEMM operations on SM70.

* [Refactor] Update formatting and installation commands in scripts

- Modified `format.sh` to install `pre-commit` and `clang-tidy` with the `--user` flag for user-specific installations.
- Improved readability in `correctness_evaluation_sm70.py` by adjusting the formatting of pytest parameters.
- Cleaned up spacing and formatting in various C++ source files for better consistency and readability.
- Removed unnecessary comments and improved layout function definitions in `mma_sm70_layout.py` and `mma_sm70_macro_generator.py` for clarity.
- Ensured consistent formatting in layout initialization and swizzle functions.

* typo fix

4a9cb470

03 Nov, 2025 1 commit

[Bugfix] Legalize Datatype for mma intrinisc codegen (#1179) · 7c61d31a

Lei Wang authored Nov 03, 2025

* fix

* lint fix

* Enhance CUDA code generation by updating register type handling for float data types. Introduced a workaround for TF32 type compatibility and improved the registration of MMA register types for A and B operands.

7c61d31a

02 Nov, 2025 2 commits

[Language] Add Correctness and performance check scripts for V2 (#1174) · d99853b6
Lei Wang authored Nov 03, 2025
```
* fix

* lint fix

* fix

* lint fix

* fix

* upd
```
d99853b6

[Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) · aef0a6bb

Lei Wang authored Nov 03, 2025



* remove debug print

* pipeline fix

* use the correct buffer access scope

* rs support

* warp warpgroup_fence_operand

* fix

* fp8 dtype ptx enhance

* mma fix

* TCGEN05 Interface

* tcgen05 support

* rebase

* update

* Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.

* lint fix

* Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.

* wgmma fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

aef0a6bb

31 Oct, 2025 1 commit

[FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi (#1108) · 10911e28

Lei Wang authored Oct 31, 2025



* 3rdparty tvm bump

* bump tvm into v0.22.0

* lint fix

* rebase tvm

* Update submodule tvm to latest commit 3085bc4

* Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang

* test fix

* add requirement

* atomic_fix

* atomic_fix

* phaseout py39

* optimize

* optimize

* lint fix

* do not clean cache

* do not clean cache

* [Minor] Minor update for Python versions and dependencies

* [Lint] fix lint for py39

* [Lint] fix lint for ROCm

* [Build][CI] Sync CI changes from upstream/sdist

* [Lint] fix lint for ROCm

* [Build][CI] Update `repair-wheel-command`

* [Minor] update abi3audit result format

* [Lint] fix lint for ROCm

* [BugFix] fix build

* [Lint] fix lint for ROCm

* [BugFix] set rpath for libtvm and libtvm_runtime

* [Deps] pin apache-tvm-ffi version

* [Build] set Python 3.9 Limited API for Cython target

* [Build] set Python 3.9 Limited API for Cython target

* [Deps] Restore Python 3.8 support

* [Build] use `apache-tvm-ffi`'s `libtvm_ffi`

* [BugFix] use `;` as delimiter for RPATH on macOS

* [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel`

* [Build] support `sccache` if available

* [Build] add CIBW import test

* [Build][CI] enable ccache for CIBW on Linux

* [BugFix] set rpath for libtvm and libtvm_runtime

* Revert "[Build][CI] enable ccache for CIBW on Linux"

This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3.

* [CI] fix perfbench bot

* [BugFix] use Python 3.9 to build wheel

* [Minor] update perfbench bot envs

* [BugFix] fix CIBW environment on Linux

* [CI] skip import test on CentOS 7

* [CI] use Python urllib to download file instead of Wget

---------
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>

10911e28

29 Oct, 2025 2 commits
- [Enhancement] Enhance Cast operations Vectorization (#1156) · feef9ef6
  LJC00118 authored Oct 29, 2025
```
* Enhance Cast vectorized

* Add Parallel vectorized cast test

* code lint

* merge newest commit
```
  feef9ef6
- [Refactor]:Move device_assert from extern_call to intrin_call (#1134) · 198f22b3
  Yuqi Dong authored Oct 29, 2025
```
* update

* Update codegen_cuda.cc
```
  198f22b3
28 Oct, 2025 1 commit

[BugFix] Implement bfloat16 support in CUDA code generation with min/max... · c70b2697

Tong WU authored Oct 29, 2025

[BugFix] Implement bfloat16 support in CUDA code generation with min/max functions and inf/nan values (#1143)

* Implement bfloat16 support in CUDA code generation with min/max functions and inf/nan values

* refactor

* fix prev typo

* bugfix

* lint

* bugfix

c70b2697

23 Oct, 2025 1 commit

[Feature] Enhance vectorized conversion support in CUDA codegen (#1095) · a148d62a

Tong WU authored Oct 23, 2025

* [Feature] Add vectorized float16 and float32 conversion support in CUDA codegen

* Implemented handling for conversions between float16 and float32 types, specifically for vectorized operations using __half22float2 and __float22half2_rn.
* Enhanced the existing code to support both directions of conversion based on the lane count.
* Improved overall type handling in the VisitExpr_ method for better compatibility with TileLang.

* [Feature] Add float32 to float8 conversion support in CUDA codegen

* Implemented handling for conversion from float32 to float8 (E4M3/E5M2) in the VisitExpr_ method.
* Added vectorized conversion support using __nv_cvt_float2_to_fp8x2 for float2 to fp8x2 transformations.
* Enhanced type handling for better compatibility with TileLang, particularly for float8 types.

* lint

* fix a bug

* [Enhancement] Support lanes=4 cases and add unit test for vectorized cast

* lint

* [Feature] Refactor bf16 convertion operations and remove legacy compile flags

* lint

a148d62a

22 Oct, 2025 1 commit

[CI][Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow (#1044) · 5683e6a6

Xuehai Pan authored Oct 22, 2025

* [Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow

* chore: update clang-tidy settings

* chore: upgrade clang-format and clang-tidy version

* lint: resolve clang-tidy errors

* [Maint] restore format.sh

* [CI] pre-commit autoupdate

* [Minor] fix `command -v` usage

5683e6a6

21 Oct, 2025 1 commit

[Language] Support tilelang `alloc_var(dtype, init=x)` (#1092) · bddb125e

Lei Wang authored Oct 21, 2025

* - carry existing local-var initializer map into OpaqueBlockLower, reattach it to
    generated Allocates and the PrimFunc attrs
  - thread the map through FlattenBuffer and StorageRewrite so flattened/merged
    allocations keep their tl.local_var_init annotations
  - teach annotation handling to accept scalar initializers, resolve buffers, and merge
    with existing stat

* lint fix

* enhance

* lint fix

* lint fix

bddb125e

15 Oct, 2025 2 commits

fix bug&add amd examples (#966) · 80665cd1

alex_xiao authored Oct 15, 2025



* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Enhance AMD example script and update CI workflows

- Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
- Added new CI workflows for AMD and documentation publishing.
- Updated various requirements files to include necessary dependencies.
- Introduced new test cases and examples for better coverage and functionality.
- Refactored existing code for improved readability and maintainability.

* Remove redundant tool cache cleanup step in AMD CI workflow

* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

* Add new AMD FlashAttention example and test script

- Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
- Added `test.sh` script to facilitate running the new example with specified parameters.
- Enhanced the overall structure and organization of the example for better clarity and usability.

* Update configurations in `example_amd_flash_attn_fwd.py` for autotuner

- Reduced the number of threads and `num_split_q` options for improved performance.
- Adjusted `panel_size` options to streamline configuration settings.

* Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217

* Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c

* Add example for AMD Flash Attention backward pass implementation

- Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
- Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
- Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
- Included reference implementation for validation against PyTorch's attention mechanism.

This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.

* Enhance AMD Flash Attention example with additional testing capabilities

- Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
- Improved the main function to allow for better parameter configuration and benchmarking.
- Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.

This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.

* Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

* Refactor HIP intrinsic rules to CUDA

- Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
- Adjusted include paths for better organization and clarity in the code structure.

* Update AMD CI workflow to uninstall specific PyTorch packages before installation

- Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
- Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.

* Remove unused shared memory allocations in AMD Flash Attention backward example

- Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
- This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.

* Remove unnecessary pip uninstall command from AMD CI workflow

- Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
- This change simplifies the CI process and reduces potential overhead during package management.

* Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules

- Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
- Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.

* Refactor formatting of HIP intrinsic rule registrations

- Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
- No functional changes were made; this update focuses on code style improvements to enhance maintainability.

* Update file name and documentation for HIP intrinsic rules

- Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
- Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.

* Enhance DispatchHIPShuffle function with clang-analyzer comments

- Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
- This change improves code clarity and maintains compliance with static analysis tools.

* lint fix

* fix

* Enhance autotuner configurations in example_amd_flash_attn_fwd.py by adding new block sizes, stages, and panel sizes. Update test script to use relative Python path and adjust parameters for consistency.

* Add backward attention example to test script

- Extended the test.sh script to include a new backward attention example using example_amd_flash_attn_bwd.py.
- Added parameters for batch size, context length, and head dimensions to ensure consistency with the forward example.
- Updated the command for the backward tile example to match the new configuration.

* Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py

- Introduced new functions for forward and backward configurations to enhance autotuning capabilities.
- Updated the FlashAttention forward and backward functions to improve performance and maintainability.
- Adjusted test script parameters for consistency and clarity, including the addition of group handling.
- Enhanced the autotuner configurations by refining block sizes and stages for better performance tuning.
- Updated the main function to reflect changes in parameter names and types for better usability.

* Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

- Updated the backward function to return additional outputs, including log-sum-exp (LSE) values for improved gradient calculations.
- Refined autotuner configurations by adding new block sizes and adjusting parameters for better performance tuning.
- Improved shared memory usage in the backward pass to optimize memory access patterns and enhance computational efficiency.
- Updated the main function to reflect changes in parameter handling and ensure consistency with the forward pass.
- Enhanced correctness checks in the main function to include LSE validation alongside gradient checks.

* Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

- Introduced a scaling factor for improved numerical stability in gradient calculations.
- Optimized shared memory usage by adding new shared buffers for intermediate calculations.
- Refined the handling of tensor fragments to improve performance and maintainability.
- Updated the main function to ensure compatibility with the new output parameters for backward operations.
- Removed unnecessary parameters from the test script to streamline execution.

* Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_mha_bwd.py

- Updated the forward and backward functions to improve numerical stability and performance.
- Enhanced shared memory usage by optimizing buffer allocations and reducing unnecessary parameters.
- Adjusted autotuner configurations for better performance tuning and compatibility with new output parameters.
- Added debugging and benchmarking functions for improved correctness verification and performance analysis.
- Updated the main function to reflect changes in parameter handling and ensure consistency across examples.

* Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

- Updated scaling factor application for improved numerical stability in gradient calculations.
- Refined tensor handling to ensure consistency with forward pass operations.
- Optimized atomic operations for writing gradients to dK and dV using fp32 for better precision.
- Adjusted comments for clarity and alignment with standard implementation practices.

* Expand autotuner configurations in example_amd_flash_attn_bwd.py and update test.sh

- Increased the range of block sizes and stages for forward and backward configurations to enhance performance tuning.
- Adjusted the test script to include additional parameters for batch size and head dimensions, ensuring consistency with the forward example.
- Improved comments for clarity and alignment with the updated configurations.

* Enhance performance calculations and benchmarking in example_amd_flash_attn_bwd.py

- Updated FLOPs calculation to account for both forward and backward passes, clarifying the total computational cost.
- Modified benchmarking functions to evaluate the complete forward and backward performance of both reference and Tile-lang implementations.
- Improved comments for better understanding of the performance metrics and implementation details.
- Removed unnecessary parameter from test.sh to streamline execution.

* Remove forward attention test commands from test.sh and retain backward attention execution for streamlined testing.

* Refactor FlashAttention forward and backward implementations in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py

- Updated the forward function to return both output and log-sum-exp (LSE) values for improved gradient calculations.
- Enhanced autotuner configurations for forward pass, including new parameters for better performance tuning.
- Refined scaling factor calculations for numerical stability in both forward and backward passes.
- Improved comments and documentation for clarity and consistency across implementations.
- Adjusted main function to reflect changes in parameter handling and ensure compatibility with new output requirements.

* Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py

- Removed outdated comments and improved clarity in the code.
- Enhanced the forward function to consistently return output and log-sum-exp (LSE) values.
- Updated autotuner configurations to include new parameters for better performance tuning.
- Refined tensor handling and scaling factor calculations for improved numerical stability.
- Adjusted the main function to ensure compatibility with updated output requirements and parameter handling.

* Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

- Updated configuration parameters for backward calculations, including new options for block sizes, threads, and rasterization.
- Added new parameters (k_pack, qk_coalesced_width, v_coalesced_width) to improve performance tuning and memory access patterns.
- Modified tensor copy operations to utilize coalesced widths for optimized memory loads.
- Enhanced GEMM operations with k_pack for improved computational efficiency.
- Refined the configuration generation logic to accommodate the new parameters, ensuring comprehensive coverage for backward pass scenarios.

* Refactor configuration and tensor operations in example_amd_flash_attn_bwd.py

- Updated backward configuration parameters to include larger block sizes and a wider range of threads for enhanced performance tuning.
- Removed unnecessary parameters (k_pack, qk_coalesced_width, v_coalesced_width) from function signatures and tensor operations to simplify the implementation.
- Optimized tensor copy operations by eliminating coalesced width specifications, streamlining memory access patterns.
- Adjusted GEMM operations to improve computational efficiency without the use of k_pack.

* Enhance HIP code generation and FP8 type support

- Added support for additional FP8 types (e4m3, e4m3b11fnuz, e5m2fnuz, e8m0) in codegen_hip.cc to improve compatibility.
- Updated error logging to include unsupported FP8 type details for better debugging.
- Implemented handling for loop break and no-op register management in HIP within VisitExpr_ method.
- Introduced new FP8 vector types (e5 and e8) in hip_fp8.h for enhanced functionality.
- Added overloads for AtomicAdd in common.h to support both pointer and value arguments.

* Enhance FP8 type support and clarify accumulator handling in HIP

- Expanded FP8 type support in codegen_hip.cc to include additional float8 formats.
- Updated gemm.h to clarify the handling of the accumulator when clear_accum is true.
- Added comments in hip_fp8.h to indicate that E8M0 types are not supported in the current HIP version.

* Remove deprecated files and update print statements for clarity in example_amd_flash_attn_bwd.py

* Update print statement formatting for clarity in example_amd_flash_attn_bwd.py

* Remove redundant verification results summary print statement in example_amd_flash_attn_bwd.py for cleaner output.

* Fix formatting inconsistencies in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py by adding spaces for improved readability in configuration parameters and print statements.

* Refactor and enhance HIP code generation for improved FP8 support

- Reorganized and cleaned up code in codegen_hip.cc for better readability and maintainability.
- Enhanced handling of FP8 types, including additional formats and improved error logging for unsupported types.
- Updated AtomicAdd function in common.h to streamline its implementation.
- Refined the PrintVecElemLoadExpr method to handle volatile loads more effectively.
- Added function to manage the addition of new functions in the code generation process.

* Fix formatting issue in HIP code generation for MFMA call

- Adjusted the indentation of the MFMA call code block in codegen_hip.cc for improved readability and consistency.

* Refactor HIP code generation and enhance FP8 type handling

- Reintroduced necessary includes and reorganized code in codegen_hip.cc for improved structure and readability.
- Enhanced the GetFP8Type function to support additional FP8 formats and improved error handling for unsupported types.
- Updated PrintType and PrintVecElemLoadExpr methods to better manage type conversions and vector element loading.
- Refined the AddFunction method to streamline function addition in the code generation process.

* Remove unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.

* Refactor backward attention implementation in example_amd_flash_attn_bwd.py

- Updated the GEMM operation to use shared memory for improved performance.
- Adjusted parallelization parameters to enhance efficiency in the backward pass.

* Fix formatting by removing an unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.

* Add additional test cases for `assert_tl_matmul_correctness` with `float8_e4m3fnuz` and various configurations

* Refactor test case formatting for `assert_tl_matmul_correctness` in `test_tilelang_gemm_mfma_intrinsic.py`

---------
Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

80665cd1

[Language] Expose `T.get_warp_idx_sync` and `T.shuffle_elect` for efficient thread election (#989) · b78d8404

Lei Wang authored Oct 15, 2025



* Expose CUDA warp/lane intrinsics in TileLang frontend

* generalize warp indexing intrinsics and add coverage

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

b78d8404

11 Oct, 2025 2 commits

[Feature][Example] Support TMA reduce operation and update GQA bwd example (#969) · 05507037

Yu Cheng authored Oct 11, 2025



* [Feature][Example] Support TMA reduce operation and update GQA bwd example

* move GQA bwd with TMA reduce to new example

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

05507037

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group... · ddfaac36

Lei Wang authored Oct 11, 2025

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend (#977)

* • InjectFenceProxy docs and tests

  - annotate proxy fence injector with context comments for async/generic detection
  - add compiler internals doc covering the pass mechanics and link it in docs index
  - repair fence proxy test by fixing descriptor init usage and fence counter logic

* do not consider call_extern as async.

* doc update.

* reduce test size for sparse mla

ddfaac36

10 Oct, 2025 1 commit

[CI] add `pre-commit` integration (#955) · 8fe35402

Xuehai Pan authored Oct 10, 2025



* chore: misc cleanup

* feat: add pre-commit config

* chore: update lint dependencies

* style: fix lint issues

* feat: add pre-commit hooks

* fix: fix typos

* chore: update .gitattributes

* [Lint]: [pre-commit.ci] auto fixes [...]

* docs: update CONTRIBUTING.md

* chore: update default venv name

* chore: revert and exclude CUDA files

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

8fe35402

09 Oct, 2025 2 commits

[TileOp] Implement WGMMA for T.gemm_v2 (#813) · a13cde28

Lei Wang authored Oct 10, 2025

* [Feature] Introduce WGMMA support and enhance GEMM layout handling

- Added support for the WGMMA intrinsic in the TileLang framework, enabling efficient matrix multiplication on newer architectures.
- Refactored GEMM layout functions to accept a boolean parameter for K dimension handling, improving flexibility in layout generation.
- Updated layout inference logic to accommodate new WGMMA configurations and ensure compatibility with existing GEMM operations.
- Enhanced Python bindings for layout functions, allowing for better integration and usability in user-defined operations.
- Improved documentation for layout functions and GEMM operations to clarify usage and parameters.

These changes enhance the performance and usability of GEMM operations, particularly for advanced architectures, while maintaining backward compatibility with existing implementations.

* [Refactor] Clean up code formatting and enhance layout function readability

- Improved code formatting across multiple files for better readability, including consistent indentation and line breaks.
- Updated layout function signatures to enhance clarity, particularly in `gemm_layouts.cc`, `layout.cc`, and `layout.h`.
- Refactored lambda functions in `builtin.cc` and `gemm_py.cc` for improved structure and maintainability.
- Enhanced comments and documentation in layout-related files to clarify usage and parameters.

These changes contribute to a cleaner codebase and improved maintainability of layout functions in the TileLang framework.

* [Feature] Add descriptor initialization and offset manipulation for WGMMA

- Introduced new TileLang builtins `initialize_descriptor` and `increase_descriptor_offset` to facilitate descriptor management for WGMMA operations.
- Updated `builtin.cc` and `builtin.h` to define and document the new builtins, enhancing the framework's capabilities for descriptor handling.
- Modified `codegen_cuda.cc` and `ptx.cc` to integrate the new builtins into the code generation process, ensuring proper assembly generation for WGMMA operations.
- Enhanced the `GemmWGMMA` class to utilize the new descriptor functionalities, improving the efficiency of matrix multiplication operations.
- Updated related tests and documentation to reflect the new features and ensure comprehensive coverage.

These changes enhance the TileLang framework's support for advanced matrix operations on newer architectures, improving performance and usability.

* [Refactor] Improve code formatting and readability in various files

- Enhanced code formatting across multiple files for better readability, including consistent indentation and line breaks.
- Updated function signatures and comments in `builtin.h`, `codegen_cuda.cc`, and `ptx.cc` to improve clarity.
- Refactored descriptor initialization and offset manipulation functions in `builtin.py` and `wgmma_macro_generator.py` for improved structure.
- Cleaned up unnecessary whitespace and improved alignment in `common.h` and `allocate.py`.

These changes contribute to a cleaner and more maintainable codebase in the TileLang framework.

* [Update] Update subproject commit and refactor layout function call

- Updated the subproject commit for `cutlass` to indicate a dirty state.
- Refactored the `UpdateAnalyzer` function in `layout.cc` to call `LayoutNode::getVarMap()` instead of `getVarMap()`, improving clarity and ensuring proper context for variable mapping.

These changes enhance the maintainability and clarity of the layout handling in the TileLang framework.

* support more data types

* gemm_rs support

* lint fix

* wgmma wrapper

* Remove debug logging for wgmma assembly code and refactor swizzle byte size calculations in wgmma macro generator. Enhanced handling of leading and stride byte offsets based on swizzle mode, improving clarity and performance in tensor core intrinsic emissions.

* Refactor GEMM layout functions to replace 'kfactor' with 'k_inner' for improved clarity and consistency. Update includes necessary changes in error messages for Hopper and Sm100 layouts. Additionally, include a new header for CUTE utilities in common.h.

* Comprehensively support WGMMA GEMM SS

* remove debug print

* lint fix

* remove debug print

* reduce bwd test shape

* lint fix

* clear cache for pytest

* lint fix

* Update sparse MLA examples to support SKV adjustment and correctness checks

- Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
- Added check_correctness parameter to test functions for validation of outputs.
- Updated test cases to reflect new SKV values and correctness checks.

* test fix

* adjust test case

* test fix

* skip some test currently

a13cde28

Modify the SM architecture number to support Thor’s sm110. (#957) · 07f62104
Shawn Liu authored Oct 09, 2025

07f62104

07 Oct, 2025 1 commit

[Backend] Add metal backend (#799) · 7fb06776

Yichen Yan authored Oct 07, 2025



* Reset

* Fix other CUDA issue

* fmt

* fmt

* fix cuda error

* fix

* fix

* fmt

* cleanup

* fix

* remove copyright

* trivial update

* readme update

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

7fb06776

02 Oct, 2025 1 commit

[Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa

Zhiwen Mo authored Oct 03, 2025

* Implements tcgen05.ld instruction support for copying from shared.tmem
  to local.fragment on SM100/Blackwell architecture. Adds layout inference
  and lowering logic for tensor memory operations with proper physical
  coordinate range analysis and warpgroup alignment checks.

  Changes:
  - Add kTMemLoad and kTMemStore to CopyInst enumeration
  - Implement CheckTMemLoad() and CheckTMemStore() validation functions
  - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
  - Add tmem layout inference in InferLayout() using expandTcgen05Layout
  - Support multiple instruction variants (32dp32b/64b/128b/256b)
  - Add physical layout bounds analysis for tmem coordinates
  - Change clear_accum from bool to PrimExpr in GEMM operations
  - Fix std::optional access checks in layout_inference.cc
  - Add tmem_allocate/deallocate PTX intrinsic support
  - Fix cooperative_groups grid.sync() code generation

* fix

* pipeline fix

* bug fix

* bool fix

5ccac4fa

28 Sep, 2025 1 commit

[SM100] Add sm100 GEMM layouts and tcgen05 support (#887) · f58bcd43

Zhiwen Mo authored Sep 28, 2025

* update sm100 related utcmma, tmem, ld/st256 in src
* update sm100 related utcmma, tmem, ld/st256 in tilelang
* Remove deprecated GEMM examples and related README documentation for SM100 architecture support
* Update GEMM implementation to replace UTCMMA with TCGEN5MMA across relevant files
* Remove gemm_umma.py example and update README to reflect TCGEN5MMA terminology changes
* Update README.md for gemm_sm100 example by removing outdated API sections and streamlining documentation
* Update README and source files to reflect TCGEN5.MMA terminology changes
* Refactor CUDA GEMM header for improved readability

f58bcd43

26 Sep, 2025 2 commits

[Precision] Introduce `T.ieee_rsqrt` and related high precision op (#882) · a58bf9b6

Lei Wang authored Sep 26, 2025

* Add fast math operations for CUDA: exp, exp10, log, log2, log10, tan, cos, and sin (#865)

* Refactor fast math operation definitions for consistency and readability in CUDA code. Consolidated multiple definitions into single lines and improved formatting in related test files for better clarity.

* Remove unnecessary pass configurations for warp specialization and TMA lowering in fast math operation tests for CUDA. This simplifies the test setup while maintaining the focus on fast math functionality.

* Update fastmath tests to reflect that tl.* intrinsics generate no fastmath versions and disable cache in main execution.

* Fix formatting in fastmath test comments for clarity on tl.* intrinsics behavior.

* Add precision comparison tool for CUDA operations

This commit introduces a new Python script and CUDA source file for a precision comparison tool that evaluates the accuracy of various CUDA operations (including division, reciprocal, exponential, logarithmic, and trigonometric functions) across different implementations: CUDA Precise, CUDA Fast, Triton, Triton LibDevice, and TileLang. The tool generates test data, executes the operations, and summarizes the error statistics for each implementation against a double precision reference. Additionally, a README file is added to document the results of the comparisons for various operations.

* Add precision comparison tool for CUDA operations

This commit introduces a new precision comparison tool implemented in Python and CUDA, designed to evaluate the accuracy of various mathematical operations (division, reciprocal, exponential, logarithmic, trigonometric, square root, etc.) across different frameworks including CUDA Precise/Fast, Triton, Triton LibDevice, PyTorch, and TileLang. The tool includes functionality for generating test data, executing operations, and summarizing error statistics for each implementation. Additionally, it provides a comprehensive README with error metrics for each operation tested.

* Add IEEE-compliant mathematical operations and refactor fast math module

This commit introduces new high precision mathematical operations including ieee_add, ieee_sub, ieee_mul, ieee_fmaf, ieee_frcp, ieee_fsqrt, ieee_frsqrt, and ieee_fdiv to the TileLang framework. The fast math module has been refactored to remove the deprecated fastmath.py file and update the import paths accordingly. Additionally, the CUDA code generation has been enhanced to support these new operations, ensuring compatibility with IEEE standards for floating-point arithmetic.

* debug removed

* Refactor IEEE math tests for improved readability and consistency

This commit enhances the formatting of the `test_ieee_math.py` and `test_mathops_fastmath.py` files by adjusting line breaks for better clarity. It also removes unnecessary comments and ensures that the main execution of tests is streamlined. These changes aim to improve the overall maintainability of the test code.

* Update README.md to enhance formatting of precision comparison results

This commit reformats the precision comparison results in the README.md file, converting the error statistics tables into a more structured markdown format. This change improves readability and accessibility of the data for various mathematical operations across different implementations, including FP32 Precise, Triton, TileLang, and CUDA.

a58bf9b6

[FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit... · 95c373f5

Lei Wang authored Sep 26, 2025

[FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit fastmath op to invoke (#875)

* Add fast math operations for CUDA: exp, exp10, log, log2, log10, tan, cos, and sin (#865)

* Update fastmath tests to reflect that tl.* intrinsics generate no fastmath versions and disable cache in main execution.

* Fix formatting in fastmath test comments for clarity on tl.* intrinsics behavior.

* Add precision comparison tool for CUDA operations

95c373f5

18 Sep, 2025 1 commit
- [AMD] fix bf16x2 dtype codegen (#847) · 6efeb743
  Jiaxing Ding authored Sep 18, 2025
  
  6efeb743
15 Sep, 2025 1 commit

[feat] support gemm_sp for ampere and ada arch (#691) · 0b3683bf

botbw authored Sep 16, 2025



* [feat] add an example mma atom

* [fix] fix typo naming

* [feat] add a template to enable compilation

* [feat] add print util

* [WIP] pass on single block tile

* [feat] add sm80 metadata layout

* [chore] clean codebase

* [CI] format.sh

* [feat] add sm80 compress utils

* [bugfix] fix C fragment layout

* [refactor] use nvcc version instead of str

* [test] add test cases

* [chore] add a param check

* [chore] format a bit

* [chore] rename func to satisfy PEP 8 and appease gemini

* [chore] add check

* [feat] support sm75 layout && add assertion && chore

* [bug] fix illegal memory access when using two warps over N=32

This could be a missing check related to cutlass 2.x implementation.
Using the cutlass example can't trigger this cause it's bypassed by
padding the input.

For now I think it might be safe to increase the atom size and inve-
sgate in the future.

* [chore] add example

* [chore] format

* [example] update benchmark

* [bugfix] fix namespace and format

* [bugfix] fix incorrect param passing

* [refactor] update variable declaration for clarity in gemm_layouts and gemm_sp

* [Cleanup] Remove unnecessary blank lines in metadata layout functions in gemm_sp.py

* [CI] fix arch

* [example] add torch sparse benchmark

* [misc] polish && add reference && apply review suggestionsi && format

* [CI] format with clang-tidy

* [Cleanup] Format and align template struct definitions in half.hpp, common.h, and gemm_sp_sm80.h

* [Update] Modify CUDA version requirements in test_gemm_sp_sm80 and mark cutlass subproject as dirty

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

0b3683bf