Commits · 95e3b5a7160da6679e9507602f801866c3672e6b · OpenDAS / tilelang

19 Dec, 2025 2 commits
- [Refactor] Remove triton dependence in testing & move triton baseline into examples (#1470) · 95e3b5a7
  silentCoder-dev authored Dec 19, 2025
```
* remove triton dependence in testing & move triton baseline into example

* use ceildiv and handles arbitrary M correctly for triton
```
  95e3b5a7
- [Language] Make TL scripts friendly to Python syntax highlights (#1466) · 1a3a64fb
  Chaofan Lin authored Dec 19, 2025
```
* Language] Make TL scripts friendly to Python syntax highlights

* add comments

* fix submodule
```
  1a3a64fb
17 Dec, 2025 2 commits

[Enhancement] Update examples and tests for improved type handling functionality (#1448) · c750fb8a

Lei Wang authored Dec 17, 2025

* [Enhancement] Update examples and tests for improved type handling and functionality

- Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
- Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
- Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
- Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.

* [Refactor] Update accumulation data type to float32 across examples

- Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
- This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.

* [Refactor] Standardize data type usage across benchmark scripts

- Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
- Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
- Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.

* [Refactor] Standardize data type usage in templates and scripts

- Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
- Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
- This change aims to streamline type handling and improve compatibility with existing workflows.

* [Refactor] Standardize data type usage in examples and benchmarks

- Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
- Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
- Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.

* [Refactor] Import dtypes from language.v2 module

- Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
- This change aims to streamline data type management and improve overall code clarity.

* fix

* [Refactor] Standardize data type usage across scripts

- Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
- Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
- This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.

* [Refactor] Update data type handling for consistency and clarity

- Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
- Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
- Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
- This refactor aims to streamline data type management and improve overall code clarity and maintainability.

* [Enhancement] Improve data type handling and error messaging

- Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
- Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
- Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.

* [Fix] Correct boolean flag in GEMM SP test case

- Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
- This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.

* [Refactor] Standardize data type usage across scripts

- Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
- Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
- This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.

* [Refactor] Standardize data type usage in various modules

- Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
- Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
- This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.

* [Refactor] Update argument parsing for data types in benchmarks

- Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
- This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.

* [Refactor] Update data type handling in benchmark and example scripts

- Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
- Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
- This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.

* [Refactor] Fix data type conversion in multiple scripts

- Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
- This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.

* [Refactor] Update float8 data type usage across multiple scripts

- Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
- This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.

* [Refactor] Enhance float8 data type handling in CUDA code generation

- Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
- Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
- Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
- This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.

* [Refactor] Streamline float8 data type handling in CUDA and related modules

- Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
- Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
- This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.

* [Refactor] Remove unnecessary cache disabling in float8 example script

- Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
- This change enhances clarity and maintainability of the example script without affecting its functionality.

* [Refactor] Update data type usage in debug print tests

- Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
- This update enhances consistency in data type handling within the test suite, improving clarity and maintainability.

* lint fix

* Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples

* Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.

c750fb8a

[Bugfix] Improve autotune from elementwise_add function in examples (#1445) · f4f87f46

senlyu163 authored Dec 17, 2025

* Remove JIT decorator from elementwise_add function in examples

* fix kernel compilation without autotune

* Refactor main function to accept parameters and update tests for autotune option

* Refactor autotune test function for morden style

f4f87f46

15 Dec, 2025 1 commit

[Refactor] Phase out the primitives folder since its design has been merged into tileop (#1429) · 89521e63

Lei Wang authored Dec 15, 2025

* Phase out primitives

* revert changes

* Refactor GemmWarpPolicy method signature for clarity

Updated the `from_warp_partition` method in the `GemmWarpPolicy` class to return the type `GemmWarpPolicy` instead of a string, enhancing type safety and clarity in the codebase. Removed an unnecessary blank line for improved readability.

* fix

89521e63

13 Dec, 2025 1 commit

[CUDA] Add read-only parameter annotation for CUDA codegen (#1416) · 00dd7388

Lei Wang authored Dec 14, 2025

* [Enhancement] Add read-only parameter annotation for CUDA codegen

* Introduced the `AnnotateReadOnlyParams` transformation to annotate read-only handle parameters in PrimFuncs, enabling the generation of `const` qualifiers in CUDA codegen.
* Updated `PrintFunctionSignature` and `AddFunction` methods to utilize the new attribute `tl.readonly_param_indices`, enhancing performance by allowing read-only cache loads.
* Modified the optimization pipeline to include the new annotation step, improving the overall efficiency of the code generation process.

* lint fix

* [Dependency] Update apache-tvm-ffi version to >=0.1.3

* Updated the version of apache-tvm-ffi in pyproject.toml, requirements.txt, and requirements-dev.txt to ensure compatibility with the latest features and fixes.
* Made adjustments in CUDA and HIP template files to use `const` qualifiers for global pointer parameters, enhancing code safety and clarity.

* lint fix

* [Enhancement] Refactor ReadWriteMarker for improved parameter handling

* Updated the ReadWriteMarker class to accept a set of parameter or data variables, enhancing its ability to track written variables.
* Introduced a new method, ResolveDataVarFromPtrArg, to resolve underlying buffer data from pointer-like arguments, improving accuracy in identifying written variables.
* Modified the MarkReadOnlyParams function to gather handle parameters and their corresponding buffer data variables, streamlining the process of determining read-only parameters.
* Enhanced the logic for identifying written variables to account for aliased data variables, ensuring comprehensive tracking of modifications.

* lint fix

* Update tma_load function to use const qualifier for global memory pointer

* Changed the parameter type of gmem_ptr in the tma_load function from void* to void const* to enhance type safety and clarity in memory operations.
* This modification ensures that the function correctly handles read-only global memory pointers, aligning with best practices in CUDA programming.

* Remove commented-out code and reorder transformations in OptimizeForTarget function for clarity

* Refactor buffer marking logic in annotate_read_only_params.cc to improve accuracy in identifying written variables. Update OptimizeForTarget function to reorder transformations for better clarity.

00dd7388

12 Dec, 2025 1 commit
- [Lint] Phaseout Yapf format and embrace ruff format (#1417) · 29051439
  Lei Wang authored Dec 12, 2025
  
  29051439
11 Dec, 2025 1 commit
- [AMD] Enable FA2 fwd on AMD MI300X (#1406) · 53be59dc
  danielhua23 authored Dec 11, 2025
```
* enable FA2 on AMD MI300X

* make lint happy
```
  53be59dc
08 Dec, 2025 1 commit

[BugFix] Fix split kernel layout bug of GQA decode (#1386) · 242b43bb

Zhengju Tang authored Dec 08, 2025

* [BugFix] Fix split kernel layout bug of GQA decode

* [BugFix] Avoid local with Parallel; use robust fragment instead

242b43bb

06 Dec, 2025 2 commits

[Language] Tilelang LazyJIT Experimental Version (#1337) · 0921328d

Kuris authored Dec 06, 2025



* initial step

* modify builder

* scratch version of new frontend

* write some tests

* add many tests

* add typing stub for tir.ir

* remove idents

* minor update

* minor update

* First version of jitv2 (renamed to LazyJIT)

* fix pre-commit error

* minor fix

* fix lint error

* fix lint error

* Fix conditional check for PrimFunc instance

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

0921328d

[Tool] Provide layout visualization tool (#1353) · 924225ed

Cunxiao Ni authored Dec 06, 2025

* Provide layout visualization tool

Adds a layout visualization tool to TileLang, which helps users understand and debug the layout transformations applied during compilation.

This tool visualizes the memory layout of tensors at different stages of the compilation process, allowing developers to identify potential inefficiencies and optimize their code for better performance.

The visualization can be enabled via a pass config option.

* format

* add layout visual example

* Adds vis extra with matplotlib dependency

* rafactor pass config name

* fix lint

* Enables configurable layout visualization formats

Allows users to specify the output formats (png, pdf, svg) for layout visualization through a pass config option.

This change provides more flexibility in how layout visualizations are generated, allowing users to choose the formats that best suit their needs.

It also fixes a bug where layout visualization was not correctly disabled when the config option was set to "false".

* Adds visual layout inference tool docs

* fix lint

* fix lint

* Rafactor configurable layout visualization formats

* fix lint

* fix typo

* add some comments

* fix lints

* add some warnings for user

* Moves layout visualization

* Refactors layout visualization pass configuration

Updates the layout visualization pass configuration to use boolean flag for enabling and a string for specifying formats.

* Enables multiple layout visualization formats

* Updates layout visualization docs

* Moves layout visualization to analysis

924225ed

05 Dec, 2025 1 commit

[Layout] Enhance Free Layout Inference (#1375) · 6654064d

Lei Wang authored Dec 05, 2025

* [Refactor] Update condition for benchmarking in example_gemv.py and simplify cached library path handling in sparse.py

* [Enhancement] Extend support for float8 data types in GEMM operations

- Updated GEMM operations to recognize additional float8 data types: `float8_e4m3fn` and `float8_e5m2fnuz`.
- Refactored condition checks in `checkWgmma` methods to simplify float8 type handling.
- Adjusted test cases to ensure compatibility with the new float8 types in tile language examples.

* lint fix

* [Enhancement] Add injective layout detection and exception handling

- Introduced `DetectInjective` method in `FragmentNode` to check for injective layouts.
- Added `LoopLayoutInjectiveException` to handle errors related to non-injective layouts.
- Updated `InferLayout` methods in `ParallelOpNode` to utilize injective checks and log relevant information.
- Refactored layout inference queue management to use `std::deque` for improved performance and added prioritization logic for buffer layouts.

* remove debug print

* minor layout fix

* fix for T.view

* [Enhancement] Improve injective layout detection in FragmentNode

- Updated the `DetectInjective` method to handle symbolic dimensions more effectively by introducing a mechanism to collect symbolic shapes and adjust the detection level accordingly.
- Added logging for cases where the layout detection falls back to NoCheck due to symbolic dimensions.
- Minor update to the test file to include the tilelang testing module.

* [Refactor] Simplify layout inference for bulk copy operations

- Removed unnecessary conditions for bulk load/store operations in the layout inference logic.
- Streamlined the handling of layout application for bulk copy instances to enhance clarity and maintainability.

* remove debug print

* [Enhancement] Introduce layout-related exceptions and improve error handling

- Added `LayoutConflictException` and `LoopLayoutInjectiveException` classes for better exception management in layout operations.
- Updated `InferLayout` method in `ParallelOpNode` to throw `LoopLayoutInjectiveException` with detailed error information when injective layout checks fail.
- Removed redundant exception class definitions from `parallel.h` to streamline code organization.

6654064d

03 Dec, 2025 1 commit

[Refactor] Generalize fp8 process (#1372) · 92121fc6

Lei Wang authored Dec 03, 2025

* [Refactor] Update condition for benchmarking in example_gemv.py and simplify cached library path handling in sparse.py

* [Enhancement] Extend support for float8 data types in GEMM operations

- Updated GEMM operations to recognize additional float8 data types: `float8_e4m3fn` and `float8_e5m2fnuz`.
- Refactored condition checks in `checkWgmma` methods to simplify float8 type handling.
- Adjusted test cases to ensure compatibility with the new float8 types in tile language examples.

* lint fix

92121fc6

02 Dec, 2025 1 commit

[Refactor] Update condition for benchmarking in example_gemv.py and simplify... · 6501bd07

Lei Wang authored Dec 02, 2025

[Refactor] Update condition for benchmarking in example_gemv.py and simplify cached library path handling in sparse.py (#1365)

6501bd07

01 Dec, 2025 2 commits

[Enhancement] Implement dynamic unroll factor in CUDA code generation (#1360) · 388ee7ee

Lei Wang authored Dec 02, 2025

* [Enhancement] Implement dynamic unroll factor in CUDA code generation

This commit introduces support for specifying a dynamic unroll factor in the CUDA code generation. The `unroll_factor` map is added to store unroll factors for loop variables, allowing for more flexible and optimized loop unrolling. Additionally, the `unroll` function is integrated into the loop language, enabling users to define unroll factors directly in their code. This enhancement improves performance by allowing tailored unrolling strategies based on specific loop characteristics.

* lint fix

* [Bugfix] Correct initialization of non-zero counters in custom compress kernel and update TIR registration for gemm_sp_py to use the correct tile operation

388ee7ee

[Language] support `T.gemm_sp_v2` on sm80 and sm89 (#1056) · 283a9a00

botbw authored Dec 01, 2025

* [misc] add a cpp side wrapper for gemm_sp_py

* [misc] typing

* [IR] bind GemmSPWarpPolicy

* [chore] add wrapper code

* [IR] fix GemmSPWarpPolicy

* [codegen] apply ptxas instructions

* [intrinsic] add typical (unused) mma layout

* [template] add uint16 debug func

* [intrinsic] add b matrix layout

* [gemm_sp] enable fp16/bf16 on sm8x

* [layout] refactor fp16/bf16 layout

* [gemm_sp] enable int8

* [chore] update test case dtype

* [gemm_sp] enable fp32

* [layout] refactor layouts

* [intrinsic] enable ldmatrix for mat A

* [layout] enable ldsm for matrix b

* [layout] add ldmatrix for fp32 and fp8

* [chore] refine

* [chore] refactor

* [chore] add fp8 efactor

* [chore] refactor

* [chore] add remove negative zero util

* [example] add a custom compress kernel

* [chore] minor update

* [test] refactor gemm_sp test

* [refactor] make metadata layout func

* [example] add option for using cutlass layout

* [doc] add a gemm_sp doc

* [doc] minor polish

* [chore] remove unused

* [bugfix] fix non replicate b case

* [test] refactor

* [chore] add a check

* [bugfix] fix util bug

* [wip] init a new test case for v2

* [chore] minor refactor

* [chore] minor update

* [bugfix] enable 16bit rs

* [language] enable rs

* [language] enable gemm_sp_sr

* [language] enable gemm_sp_rr

* [test] enable more tests

* [tvm] update ffi binding

* [chore] remove print

* [chore] fix benchmark script

* [lint] precommit lint

* [chore] apply feedback

* [test] use arch 8.0

* [chore] rollback ::ordered_metadata for backward compatibility

* [bugfix] fix captialized

* [example] keep gemm_sp on hopper

* [test] fix no fp8 normal kernel

* [test] reduce matmul size to satisfy accum error

* [test] use cal_diff for assertion

* [bugfix] expand float8 type

* [lib] add make_int4 for short type

* [language] add transpose E

* [bugfix] fix wrong var

* [format] format

* [chore] refactor binding

* [chore] fix wrong passing var

283a9a00

27 Nov, 2025 2 commits

[Refactor] Improve assertion handling in CodeGenCHost and ArgBinder (#1352) · 1e92d11c

Lei Wang authored Nov 28, 2025

* [Refactor] Improve assertion handling in CodeGenCHost and ArgBinder

This commit refines the assertion message generation in CodeGenCHost by optimizing the handling of equality checks and reducing buffer size for error messages. Additionally, it enhances the ArgBinder by introducing a nullable guard mechanism for assertions, allowing for more precise error handling when binding arguments. The changes improve the clarity and efficiency of assertion handling across the codebase.

* [Enhancement] Update matmul kernel and optimize argument binding

This commit enhances the matmul kernel by introducing additional tensor parameters and refining the pipeline stages for improved performance. It also updates the argument binding mechanism to include a flag indicating whether buffers are used, enhancing the efficiency of buffer management. Furthermore, the optimization phase in the engine is improved by adding a simplification step, ensuring better performance and clarity in the generated code.

* lint fix

* [Enhancement] Add tensor checks documentation and improve argument binding assertions

This commit introduces a new documentation page for host-side tensor checks, detailing the automatic validations performed by TileLang on kernel arguments. It enhances the ArgBinder by adding assertions for non-null pointers when arguments are used, improving error handling. Additionally, the optimization phase in the engine is updated to include a simplification step, ensuring better performance and clarity in the generated code.

* [Enhancement] Update .gitignore and refine matmul kernel for improved performance

This commit adds host checks logs to the .gitignore file to prevent unnecessary log files from being tracked. Additionally, it refines the matmul kernel by adjusting pipeline stages, updating tensor parameters, and enhancing argument handling for better performance. The changes also include improved error messages in the argument binding process, ensuring clearer diagnostics for users.

* lint fix

* lint fix

* [Refactor] Simplify tensor_null_test function and remove ptr_null_test

This commit refactors the tensor_null_test function by adding a with_bias parameter and removing the ptr_null_test function, which was previously unused. The run_test function is updated to reflect these changes, streamlining the testing process for tensor operations.

* lint fix

* fix

1e92d11c

Add sparse fine-tuning kernel for deepseek sparse attention to example (#1296) · b8240b7a
Yuxuan Hu authored Nov 27, 2025
```
* [EXAMPLE] add example for dsa sparse finetuning

* [Refactor]
```
b8240b7a

26 Nov, 2025 2 commits

[Enhancement] add more dtype and fix mma.ws for fp16 for tcgen05 (#1327) · f0c721a4

Yunqian Fan authored Nov 26, 2025

* feat: add fp8 variants; add placeholder for fp6/fp4 in meta

support ld with pack for fp32 dtype

add dump

add tempalte expand

remove unused dtype and change to rebased apis

* fix: when atom-m!=128, enable_ws

* fix: typo in tcgen05 meta; dispatch in gemm sm100

f0c721a4

[Refactor] Phaseout vmap for Tile Operators (#1334) · f5d9da46

Lei Wang authored Nov 26, 2025



* Refactor GEMM and Reduce operations by moving NormalizeToBufferRegion and MakeAccessPtrFromRegion to utils.{h,cc} for better code organization and reuse.

* lint fix

* Refactor region handling by removing the RegionOp and updating NormalizeToBufferRegion to only accept BufferLoad and BufferRegion. This change improves code organization and simplifies the handling of memory regions across various operations.

* fix

* Refactor memory region handling by introducing `tl.region` calls across various operations, including GEMM and fill functions. This change enhances the consistency of region management and improves code organization by utilizing utility functions for buffer region conversions.

* fix

* fix

* test fix

* lint fix

* Refactor GEMM operations to improve memory region handling by replacing `mbarPtr_` with `mbarRegion_` and updating related logic in both C++ and Python implementations. This change enhances the clarity and consistency of buffer region management.

* fix

* lint fix

* fix

* fix

* test fix

* lint fix

* lint fix

* minor fix

* fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

f5d9da46

24 Nov, 2025 1 commit
- Revert "[WIP] support more dtypes for tcgen05 (#1229)" (#1323) · ca98cc39
  Lei Wang authored Nov 24, 2025
```
This reverts commit 0d101c11

.
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
```
  ca98cc39
21 Nov, 2025 1 commit

[WIP] support more dtypes for tcgen05 (#1229) · 0d101c11

Yunqian Fan authored Nov 21, 2025

support ld with pack for fp32 dtype

add dump

add tempalte expand

remove unused dtype and change to rebased apis

0d101c11

20 Nov, 2025 1 commit

[Feat] Add support for using `T.Tensor(n * 2 + 1)` in function annotation (#1285) · bccb6485

Kuris authored Nov 20, 2025



* [Feature] Add support for A: T.Tensor(n + 1) and A: T.Tensor(2*n)

* issue fix

* fix

* fix

* decreate nproc for debugging

---------
Co-authored-by: Lei Wang <leiwang1999@outlook.com>

bccb6485

19 Nov, 2025 1 commit
- Fix the bug in issue #1266 (#1284) · 49f35393
  liu yuhao authored Nov 19, 2025
```
Co-authored-by: cheeryBloosm <liu_yu_hao@126.com>
```
  49f35393
18 Nov, 2025 2 commits

[FFI] Use tvm ffi as the default execution backend (#1259) · 74da3696

Lei Wang authored Nov 18, 2025

* [Refactor] Update FFI type handling and simplify argument management

* Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity.
* Updated function registration in `runtime.cc` to utilize canonical names for better consistency.
* Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled.
* Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection.
* Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity.

* [Update] Sync TVM submodule and enhance kernel source handling

* Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes.
* Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging.
* Commented out the main execution call in test files to prevent unintended execution during testing.
* Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues.
* Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends.

* [Refactor] Clean up imports and improve code formatting

* Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code.
* Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency.
* Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality.

* Update execution backend options and improve resolution logic

- Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target.
- Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions.
- Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target.
- Updated documentation to reflect changes in execution backend options and their defaults.

* lint fix

* fix

* Enhance argument handling in CUDA and HIP runtime modules

- Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime.
- Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers.
- Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks.

* lint fix

* minor fix

* fix

* recover check

* Refactor argument binding and validation in `arg_binder.cc`

- Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers.
- Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards.
- Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling.
- Minor adjustments in test files to streamline kernel execution and improve readability.

* lint fix

* stride fix

* minor fix

* fix

* lint fix

* Add CUDA stream access policy window helpers and integrate with L2 persistent cache management

- Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage.
- Updated runtime files to include new FFI packed functions for managing stream attributes.
- Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown.
- Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source.

* check with symbolic

* support null ptr

* Update CMakeLists and lower.py for code generation and subproject status

- Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support.
- Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility.
- Marked the TVM subproject as dirty to indicate local modifications.

* lint fix

* Update comments for clarity in quickstart.py

74da3696

Bug fix for Gated Delta Net benchmark script (#1267) · 0f980f15

Jay Zhuang authored Nov 18, 2025



* fix argument order for fla chunk_gated_delta_rule_fwd_h

* explicit import assert_similar from utils

* rename utils module to avoid name clash

* set store_final_state and save_new_value to True

* fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

0f980f15

17 Nov, 2025 2 commits

[Enhancement] Keep max score attention across blocks in FlashAttention for... · 3ab93cd7

Tong WU authored Nov 17, 2025


[Enhancement] Keep max score attention across blocks in FlashAttention for better numerical stablity (#1269)

* Implement max score retention across blocks in FlashAttention for improved stability

* fix manual pipeline parameters

* Update examples/flash_attention/example_gqa_fwd_varlen.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix typo

* more

* fix a previous typo

---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

3ab93cd7

[EXAMPLE] In the flash attention example keep the max of all blocks seen in... · a2a27814

Varuna Jayasiri authored Nov 17, 2025

[EXAMPLE] In the flash attention example keep the max of all blocks seen in scores_max numerical stability (#1148)

* Keep the max of all blocks seen in scores_max for stability

* ruff formatting

a2a27814

16 Nov, 2025 2 commits

[Example] Add GQA decoding kernel with varlen page table (#1265) · 716dbef5

Zhengju Tang authored Nov 17, 2025

* [Example] Add page table for gqa decode

* [Example] Page table for varlen decoding

* [Lint]

* [Refactor] Remove redundant code

* [Lint]

* [Lint]

* [Lint]

716dbef5

[BugFix] Remove memory_order in atomic constexpr and fix NSA bwd (#1260) · 2de566e7

Kevinzz authored Nov 16, 2025



* fix nsa bwd and atomic

* [Lint]

* [BugFix]
- New implementation for atomicMax and atomicMin using atomicCAS
- PTX version atomicAdd for single 16-byte data
- Modify the test cases

* [Lint]

---------
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>

2de566e7

15 Nov, 2025 1 commit

[BugFix] Refactor attention kernel to handle OOB positions by filling with... · 0af3fd7c

Tong WU authored Nov 15, 2025

[BugFix] Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators. (#1222)

* Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators.

* lint

* pre-commit

* Update imports in flash attention test file to use new backward and forward examples for better clarity and consistency.

0af3fd7c

14 Nov, 2025 1 commit
- [BugFix] Add autotune and exp2 for GDN kernel (#1258) · eac96cd7
  Zhengju Tang authored Nov 14, 2025
```
* [BugFix] Add autotune and exp2 for GDN kernel

* [Lint]

* [Lint]
```
  eac96cd7
13 Nov, 2025 2 commits

[Language][Reshape] Improve variable handling and ensure correctness during Layout Reshape (#1248) · d7164abf

Lei Wang authored Nov 13, 2025

* fix

* Refactor tensor reshaping in fp8_lighting_indexer.py

- Replaced the allocation of `s_reshaped` with a reshape operation to improve clarity and performance.
- Updated the logic in the computation of `s_reshaped` to utilize the reshaped tensor, enhancing the overall functionality of the attention mechanism.

* Refactor analyzer usage in Layout and Fragment reshaping

- Consolidated analyzer logic in the `Reshape` methods of `LayoutNode` and `FragmentNode` to utilize a fallback analyzer, improving code clarity and preventing potential null dereference issues.
- Updated variable binding and simplification calls to use the selected analyzer consistently, enhancing robustness in shape validation and index computation.

d7164abf

[Bugfix] Fix fp8 dtype for some cases (#1246) · 63bf1609

Lei Wang authored Nov 13, 2025

* [Enhancement] Add FP8 support and reproducibility in lighting indexer

* Introduced a manual seed in `test_fp8_lighting_indexer` to ensure reproducible performance.
* Added specializations for `cute::float_e4m3_t` and `cute::float_e5m2_t` in `gemm_mma.h` for enhanced FP8 support across multiple CUDA architectures, ensuring compatibility and improved functionality.ix

* Fix typos in `fp8_lighting_indexer.py` and improve formatting in `gemm_mma.h`

* Corrected a typo in the comment for `test_fp8_lighting_indexer` to enhance clarity.
* Reformatted lines in `gemm_mma.h` for better readability by aligning template specializations across multiple CUDA architectures.

* test fix

* bug fix

63bf1609

12 Nov, 2025 2 commits

RMSNorm epsilon refine in the example (#1243) · 468b1b70

pengxin99 authored Nov 13, 2025

* Fix division by zero in RMS normalization

* Fix rsqrt calculation to avoid division by zero

468b1b70

[Refactor] Add kernel selection option for GEMM v1 in environment settings (#1200) · 8fbe1b3a

Lei Wang authored Nov 12, 2025

* Add kernel selection option for GEMM v1 in environment settings

- Introduced `TILELANG_USE_GEMM_V1` environment variable to control the selection of GEMM version.
- Added `use_gemm_v1` method in the `Environment` class to determine if GEMM v1 should be used based on the environment variable.
- Updated GEMM function assignment to default to v2, allowing for v1 to be forced via the new environment variable.

* bug fix

* Add kernel selection option for GEMM in environment settings

- Introduced `TILELANG_USE_GEMM_V1` environment variable to allow users to select between GEMM v1 and v2 implementations.
- Updated `gemm` function to default to v2 but switch to v1 if the environment variable is set to a truthy value.
- Added a method `use_gemm_v1` in the `Environment` class to facilitate this selection based on the environment variable.

* Refactor GEMM macro generator to use BufferRegion instead of Buffer

- Updated `wgmma` and `wgmma_rs` methods in `TensorCoreIntrinEmitter` to accept `BufferRegion` parameters instead of `Buffer`.
- Adjusted related calls in `GemmWGMMA` to ensure compatibility with the new parameter types.
- Simplified buffer access logic for better clarity and maintainability.

* Refactor GEMM functions to utilize BufferRegion for improved memory handling

- Updated `run_gemm`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` functions to set `num_stages` based on block dimensions, enhancing performance for larger matrices.
- Simplified calls to GEMM functions by removing redundant parameters and ensuring compatibility with BufferRegion.
- Introduced utility functions for converting between Buffer, BufferLoad, and BufferRegion, improving code clarity and maintainability.
- Enhanced error handling for full region checks in GEMM operations to ensure correctness in memory access.

* Refactor GEMM code for improved readability and consistency

- Cleaned up formatting and spacing in GEMM-related files for better readability.
- Standardized comments and code structure across various GEMM functions and macros.
- Enhanced error messages for clarity in buffer region checks.
- Removed redundant lines and improved overall code maintainability.

* Update GEMM correctness evaluation and macro generator for improved functionality

- Modified `N_VALUES` in `correctness_evaluation_sm70.py` to include only relevant sizes for tests.
- Updated test function call in `correctness_evaluation.py` to use `test_gemm_false_true` for better accuracy in testing.
- Refactored buffer handling in `mma_sm70_macro_generator.py` to improve clarity and consistency in shared buffer access.
- Enhanced `gemm_mma_sm70.py` to ensure full region checks for input and output buffers, improving correctness in GEMM operations.

* Refactor GEMM and intrinsic files for improved clarity and functionality

- Removed unused variable `A_stride_last` in `mma_sm70_macro_generator.py` to streamline code.
- Adjusted function signature formatting in `swizzle.py` for better readability.
- Restored the return of `GemmWGMMA` in `__init__.py` for correct GEMM instantiation.
- Removed unused variable `B_buf` in `gemm_mma_sm70.py` to enhance code cleanliness.
- Improved function signature formatting in `language.py` for consistency.

* Enhance GEMM and MMA functionality for FP64 support

- Refactored `GemmNode` to streamline the decision-making process for GEMM instruction selection.
- Added support for FP64 inputs in the MMA dispatcher, enabling new tensor operations.
- Introduced a new layout function for FP64 in `mma_layout.py` to facilitate shared memory storage.
- Updated `TensorCoreIntrinEmitter` to handle FP64 data types, including adjustments for micro tile dimensions and loading mechanisms.
- Enhanced utility functions to accommodate FP64 index mapping for shared memory operations.

* lint fix

* Refactor GEMM correctness evaluation and shared memory alignment handling

- Reverted the GEMM function call in `correctness_evaluation.py` to the original implementation for consistency.
- Added a helper function in `merge_shared_memory_allocations.cc` to streamline the marking of shared variables under alignment scope.
- Enhanced the `VisitExpr_` methods to ensure proper handling of shared memory alignment for `BufferLoadNode` and `VarNode` types.
- Cleaned up commented-out test code in `correctness_evaluation.py` for better readability.

* Enhance GEMM and MMA implementations with region-based memory handling

- Updated GEMM and MMA classes to utilize BufferRegion for input and output buffers, improving memory management and supporting strided GEMM operations.
- Added checks to ensure full region compliance for input buffers, enhancing correctness in matrix multiplication.
- Implemented clear accumulation functionality to reset output buffers before accumulation, ensuring accurate results in GEMM operations.

* Refactor test_tilelang_example_deepseek_v32.py to improve import structure and function calls

- Updated import statements to directly reference modules instead of individual test functions, enhancing clarity.
- Modified function calls to use the new module structure for better organization and maintainability in testing examples.

* Enhance OnArrayDeclaration method to handle repeated buffer declarations

- Updated the OnArrayDeclaration method to merge metadata for buffers that may appear in multiple Allocate statements, improving robustness against upstream transformations.
- Added logic to prefer concrete element data types and record extents when previously unknown, enhancing the handling of buffer declarations.

* Add abbreviation for bfloat16 data type in mfma_macro_generator.py

- Introduced a new abbreviation "bf16" for the bfloat16 data type in the mfma_macro_generator.py file, enhancing clarity and consistency in data type representation.

* Refactor CodeGenTileLangHIP to enhance dtype handling and mfma call generation

- Introduced a mapping function to normalize input data types to their corresponding scalar types, improving compatibility with MfmaTraits.
- Updated the mfma call generation to utilize the new mapping, streamlining the code and enhancing clarity.
- Removed outdated dtype mapping and replaced it with a more flexible approach to support additional data types like FP8.

* lint fix

* Enhance backend configuration in CMakeLists.txt and improve dtype handling in CodeGenTileLangHIP

- Introduced a macro to define backend options for CUDA, ROCM, and Metal, allowing user overrides and caching of settings.
- Updated logic to track user-selected backends and conditionally enable defaults based on environment variables.
- Refactored dtype handling in CodeGenTileLangHIP to streamline mfma call generation and improve clarity.
- Added support for bfloat16 in the mfma_macro_generator.py, enhancing data type representation consistency.

* Update bfloat16 handling in CodeGenTileLangHIP and mfma_macro_generator.py

- Changed the representation of bfloat16 in CodeGenTileLangHIP from "bfloat16x4" to "bfloat16x4_vec" for improved clarity.
- Adjusted the mfma_suffix generation in mfma_macro_generator.py to remove the underscore before "bf16", aligning with HIP intrinsic requirements.

* Change logging level from WARNING to DLOG in LegalizeNegativeIndex for non-negative index checks to reduce log verbosity.

* Refactor attention sink examples to simplify index calculations

- Updated index handling in `example_gqa_sink_bwd_bhsd.py` and `example_mha_sink_bwd_bhsd.py` to eliminate unnecessary local allocations and streamline logic for determining start and end indices.
- Improved readability by using direct calculations instead of local variables for index bounds in pipelined loops.

* Refactor attention sink examples to streamline index calculations

- Simplified index handling in `example_gqa_sink_bwd_bhsd.py`, `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py`, `example_mha_sink_bwd_bhsd.py`, `example_mha_sink_fwd_bhsd_wgmma_pipelined.py`, and `example_mha_sink_fwd_bhsd.py` by removing unnecessary local allocations for start and end indices.
- Enhanced readability by directly calculating index bounds for pipelined loops, improving overall code clarity.

* lint fix

* bugfix

* Refactor reduce operation handling in CUDA and Python

- Removed outdated shared memory reduction logic from `reduce.cc`.
- Introduced fragment allocation and improved buffer handling in `reduce.py` to support shared and fragment scopes.
- Updated CUDA header to define a wider accumulator type for better numerical accuracy.
- Enhanced error handling for buffer scope validation in the reduction process.

* Fix ReduceOpNode to correctly compute AbsMax by using absolute values of inputs

* Enhance unit loop handling by refining annotation checks

- Updated the condition for identifying effectively empty annotations in unit loops to include cases where only the `pragma_unroll_explicit` hint is present.
- Introduced a new method, `IsEffectivelyEmptyAnnotation`, to encapsulate this logic, improving code clarity and maintainability.

* clean clode

8fbe1b3a

11 Nov, 2025 1 commit

[GQA] Add varlen decoding kernel with logits saving (#1223) · eb6e8973

Zhengju Tang authored Nov 11, 2025

* [Example] Add GQA varlen decoding kernel with logits return

* [Example] Support Sink for GQA varlen decoding

* [Example] Add for no-varlen support

* [Tune] Add high performance logits saving

* [Lint]

* [Lint]

* [Rename]

eb6e8973

05 Nov, 2025 2 commits

[Example] Update GQA varlen fwd (#1173) · a9d823b8
Yu Cheng authored Nov 05, 2025
```
* [Example] Update GQA varlen fwd

* fix
```
a9d823b8

[GQA] Use TMA in GQA bwd kernel to boost performance (#1176) · 298ab480

Zhengju Tang authored Nov 05, 2025



* [Test] Add cp async to avoid register spill

* [BugFix] GQA fwd and bwd
- Fix the undefined behavior of -inf in acc_s
- Fix the causal loop range in varlen scenario

* [TMA] Move on to TMA and locate the register spill issue

* [Debug] Not the reason of zero-assignment. Probably the combination of Parallel op & conditional qkT

* [Debug] The SIMT copy in producer occupies too many registers

* [BugFix] Use 3D lse and delta to avoid illegal instruction

* [Perf] Relaxed order for dQ and SIMT store for dKdV

* [Feat] For atomic add version

* [Lint]

* [Bugfix] Enable code lowering with producer‑copy‑only program (#1168)

* bugfix

* lint fix

* Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns.

* Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic.

* Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions.

* [Bugfix] Support 16bits shfl_sync (#1169)

* Add type-safe warp shuffle helpers for 16-bit float types in common.h

- Introduced generic passthrough functions for warp shuffle operations: `shfl_xor_sync`, `shfl_down_sync`, `shfl_up_sync`, and `shfl_sync`.
- Added specializations for `cutlass::half_t` and `cutlass::bfloat16_t` to ensure type safety during shuffle operations.
- Updated `reduce.h` to utilize the new shuffle functions, enhancing code clarity and maintainability.

* lint fix

* [Testing] Move TMA 1D and test for its functionality (#1167)

* [Testing] Move TMA 1D and test for its functionality

* [Lint]

* [Refactor]: Change the params in pytest to avoid oom error during ci (#1170)

* [Refactor]: Change the params in pytest to avoid oom error during ci

* format

* fix

* Update test_example_cast.py

* Update parameters in test_example_cast

* Update test_example_flash_attention.py

* update

* format

* fix

* fix

* format

* [Bugfix] Fix tvm import path for editable build (#1172)

* [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986)

* remove debug print

* pipeline fix

* use the correct buffer access scope

* rs support

* warp warpgroup_fence_operand

* fix

* fp8 dtype ptx enhance

* mma fix

* TCGEN05 Interface

* tcgen05 support

* rebase

* update

* Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.

* lint fix

* Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.

* wgmma fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

* [Language] Add Correctness and performance check scripts for V2 (#1174)

* fix

* lint fix

* fix

* lint fix

* fix

* upd

* [Bugfix] Legalize Datatype for mma intrinisc codegen  (#1179)

* fix

* lint fix

* Enhance CUDA code generation by updating register type handling for float data types. Introduced a workaround for TF32 type compatibility and improved the registration of MMA register types for A and B operands.

* [Perf] Add layout and use_tma to boost performance

* [Lint]

* [Note]

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: Yuqi Dong <134183314+yyttt6@users.noreply.github.com>
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

298ab480

04 Nov, 2025 1 commit

[Refactor] Improve Python3.9 compatibility for ParamSpec and Self (#1190) · 7d961892

Lei Wang authored Nov 04, 2025

* [Feature] Enhance fill operation to support various buffer types

- Added support for `BufferLoad` in the `fill` function to handle different buffer types.
- Updated `Fill` class to process region descriptors and buffer regions, improving flexibility in buffer handling.
- Introduced checks for static bounds in region definitions to ensure safety during operations.
- Refactored loop induction variable handling in `FillNode` to accommodate sliced regions.

* lint fix

* [Refactor] Improve Python compatibility for ParamSpec and Self

- Added compatibility handling for ParamSpec and Self to support Python versions below 3.10 and 3.11 respectively.
- Updated type annotations across multiple files to ensure consistent usage of typing features.

* [Update] Require Python 3.9 and enhance type annotations

- Updated the minimum required Python version from 3.8 to 3.9 in `pyproject.toml`.
- Removed references to Python 3.8 in classifiers.
- Changed type annotations from `int | None` to `Optional[int]` in multiple example files for better clarity and compatibility.
- Improved import statements to use `collections.abc` for `Iterable` and `contextlib` for `AbstractContextManager` in relevant files.

* [Refactor] Update import statements to enhance type annotations

- Replaced imports from `typing` with `collections.abc` for `Iterable` and `Mapping` in relevant files to improve compatibility and clarity.
- Updated the caching decorator from `functools.lru_cache` to `functools.cache` for better performance in the C++ compiler retrieval function.
- Adjusted import statements in the language proxy file to maintain consistency in type annotations.

* disable rocm rs nt test.

* lint fix

7d961892