Commits · ca98cc391790d160cffcb0b997c2380c276b8e2e · OpenDAS / tilelang

24 Nov, 2025 1 commit
- Revert "[WIP] support more dtypes for tcgen05 (#1229)" (#1323) · ca98cc39
  Lei Wang authored Nov 24, 2025
```
This reverts commit 0d101c11.
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
```
  ca98cc39
23 Nov, 2025 1 commit

[Refactor] Backup Analyzer to get the appropriate arith information (#1311) · 9f7bac4c

Lei Wang authored Nov 23, 2025

* [Refactor] Update Vectorization Functions to Accept Analyzer Parameter

- Modified `VectorizeLoop` and related functions to accept an `arith::Analyzer` parameter, enhancing their capability to perform analysis during vectorization.
- Updated multiple instances in `copy.cc`, `fill.cc`, `parallel.cc`, and layout inference files to utilize the new analyzer parameter for improved performance and correctness.
- Ensured consistency across vectorization logic by integrating the analyzer into existing workflows, facilitating better optimization opportunities.

* [Fix] Corrected PostOrderVisit call in loop_vectorize.cc

- Updated the PostOrderVisit function to analyze the body of the loop node instead of the node itself, ensuring proper handling of nested loops during vectorization analysis.

* fix

* lint fix

* fix

9f7bac4c

22 Nov, 2025 2 commits

[Bugfix] Fix autotune cache (#1315) · 721baedb
Lei Wang authored Nov 22, 2025

721baedb

Improve memory access safety and `T.assume` handling (#1292) · 470eb74c

LJC00118 authored Nov 22, 2025



* Improve memory access safety and T.assume handling

* Improve memory access safety and T.assume handling

* bugfix

* lint fix

* bugfix

* bugfix

* refactor legalize safe memory access pass

---------
Co-authored-by: Lei Wang <leiwang1999@outlook.com>

470eb74c

21 Nov, 2025 4 commits

[WIP] support more dtypes for tcgen05 (#1229) · 0d101c11

Yunqian Fan authored Nov 21, 2025

support ld with pack for fp32 dtype

add dump

add template expand

remove unused dtype and change to rebased apis

0d101c11

[Fix] Fix frame scope error in T.macro (#1308) · bf90a5f5

Kuris authored Nov 21, 2025



* [Fix] Fix #1307 by adding macro inside function

* fix lint error

* add comments and fix lint error

* Remove debug print from enter_frame method

Removed debug print statement from enter_frame method.

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

bf90a5f5

[Bugfix] Fallback to the old AtomicAdd implementation for legacy architectures (#1306) · 17bbc0ca
Lei Wang authored Nov 21, 2025

17bbc0ca

[Fix] Remove unused let_bindings_ in CodeGenC to fix #1300 (#1305) · 2426090f

Kuris authored Nov 21, 2025

* [Feat] add missing support of uint32x2

* [Feat] Add `T.Ref` annotation and tests

* fix lint error

* minor update for error message on twice decl

* Remove unused let_bindings_ in CodeGenC to fix #1300

2426090f

20 Nov, 2025 4 commits
- [Enhancement] Shared Memory Size Can be Dynamic (#1294) · d4b6d094
  Lei Wang authored Nov 20, 2025
```
* bugfix

* lint fix

* test

* lint fix

* increase procs

* recover
```
  d4b6d094
- [Feat] add support for passing reference in T.Var annotation (#1291) · dd7fdb8e
  Kuris authored Nov 20, 2025
  
  dd7fdb8e
- [Feat] Add support for using `T.Tensor(n * 2 + 1)` in function annotation (#1285) · bccb6485
  Kuris authored Nov 20, 2025
```
* [Feature] Add support for A: T.Tensor(n + 1) and A: T.Tensor(2*n)

* issue fix

* fix

* fix

* decrease nproc for debugging

---------
Co-authored-by: Lei Wang <leiwang1999@outlook.com>
```
  bccb6485
- [Compatibility] Support CUDA 11.3 (#1290) · bef7e52e
  Lei Wang authored Nov 20, 2025
  
  bef7e52e
19 Nov, 2025 5 commits

[Language][UX] Nested loop checker in pre-lowering stage (#1288) · 9e67b861
Chaofan Lin authored Nov 20, 2025
```
* [Language][UX] Nested loop checker in pre-lowering stage

* rename

* comment

* address comments
```
9e67b861
Fix the bug in issue #1266 (#1284) · 49f35393
liu yuhao authored Nov 19, 2025
```
Co-authored-by: cheeryBloosm <liu_yu_hao@126.com>
```
49f35393

[Enhancement] Enhance CUDA compilation by integrating pass context configuration (#1283) · 551ac60d

Lei Wang authored Nov 19, 2025

- Updated the `tilelang_callback_cuda_compile` function to accept a `pass_config` parameter, allowing for more flexible compilation options.
- Introduced handling for fast math and PTXAS options based on the provided pass configuration.
- Modified the CUDA build process in `rt_mod_cuda.cc` to utilize the current pass context, improving the integration of compilation settings.
- Refactored NVCC command construction to use a dedicated function for better clarity and maintainability.

551ac60d

[Fix] Fix memory leak bug (#1281) · cd681e63

Kuris authored Nov 19, 2025

* add typing stub for tir.ir

* remove idents

* minor update

* [Refactor] add numpy conversion for dtype

* fix lint error

* remove unused np.float_ in dtype conversion

* fix type in np.int_

* fix typo

* minor fix

* remove debug files

* fix memory leak bug

* fix lint error

* add comments

* fix lint error

* remove duplicated entries, because tilelang doesn't depend on deprecated ones

cd681e63

[Bugfix] Supply missing `T.print` for bool type (#1279) · 4c8b9ada
Lei Wang authored Nov 19, 2025
```
* fix for bool dtype

* lint fix

* fix

* ci fix
```
4c8b9ada

18 Nov, 2025 7 commits

[FFI] Use tvm ffi as the default execution backend (#1259) · 74da3696

Lei Wang authored Nov 18, 2025

* [Refactor] Update FFI type handling and simplify argument management

* Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity.
* Updated function registration in `runtime.cc` to utilize canonical names for better consistency.
* Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled.
* Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection.
* Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity.

* [Update] Sync TVM submodule and enhance kernel source handling

* Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes.
* Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging.
* Commented out the main execution call in test files to prevent unintended execution during testing.
* Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues.
* Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends.

* [Refactor] Clean up imports and improve code formatting

* Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code.
* Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency.
* Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality.

* Update execution backend options and improve resolution logic

- Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target.
- Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions.
- Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target.
- Updated documentation to reflect changes in execution backend options and their defaults.

* lint fix

* fix

* Enhance argument handling in CUDA and HIP runtime modules

- Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime.
- Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers.
- Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks.

* lint fix

* minor fix

* fix

* recover check

* Refactor argument binding and validation in `arg_binder.cc`

- Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers.
- Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards.
- Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling.
- Minor adjustments in test files to streamline kernel execution and improve readability.

* lint fix

* stride fix

* minor fix

* fix

* lint fix

* Add CUDA stream access policy window helpers and integrate with L2 persistent cache management

- Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage.
- Updated runtime files to include new FFI packed functions for managing stream attributes.
- Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown.
- Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source.

* check with symbolic

* support null ptr

* Update CMakeLists and lower.py for code generation and subproject status

- Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support.
- Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility.
- Marked the TVM subproject as dirty to indicate local modifications.

* lint fix

* Update comments for clarity in quickstart.py

74da3696

[Language] Add shape check in `T.view/reshape` (#1277) · 921b96a3
Chaofan Lin authored Nov 18, 2025
```
* [Language] Add shape check in T.view/reshape

* address comments
```
921b96a3
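
The shape check named in this commit can be pictured with a minimal sketch. It assumes static integer shapes and assumes the check compares total element counts for a reshape and total bits for a dtype-changing view; the function names `check_reshape` and `check_view` are hypothetical and not taken from the tilelang source.

```python
from math import prod


def check_reshape(src_shape, dst_shape):
    """Raise ValueError unless both static shapes hold the same number of elements."""
    if prod(src_shape) != prod(dst_shape):
        raise ValueError(
            f"cannot reshape {tuple(src_shape)} into {tuple(dst_shape)}: "
            f"{prod(src_shape)} != {prod(dst_shape)} elements"
        )


def check_view(src_shape, src_bits, dst_shape, dst_bits):
    """Same idea for a dtype-changing view: compare total bits instead of elements."""
    if prod(src_shape) * src_bits != prod(dst_shape) * dst_bits:
        raise ValueError("view changes the total size of the buffer")
```

Under these assumptions, `check_reshape((4, 8), (32,))` passes silently, while `check_reshape((4, 8), (5, 6))` raises `ValueError`.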
[Bugfix] Minor fix for some cases (#1278) · 1b0efb65
Lei Wang authored Nov 18, 2025

1b0efb65

Bug fix for Gated Delta Net benchmark script (#1267) · 0f980f15

Jay Zhuang authored Nov 18, 2025



* fix argument order for fla chunk_gated_delta_rule_fwd_h

* explicit import assert_similar from utils

* rename utils module to avoid name clash

* set store_final_state and save_new_value to True

* fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

0f980f15

Fix various issues under `int64_t` static and dynamic shape. (#1218) · 49c85715

Elevator14B authored Nov 18, 2025



* Fix various issues under int64_t static and dynamic shape.

* Resolve reviewed issues.

* Add unit test.

* fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

49c85715

[BugFix] Adding extra parameters into autotune hashkey (#1274) · e805f8e5
Chaofan Lin authored Nov 18, 2025
```
* [BugFix] Adding extra parameters into autotune hashkey

* lint

* None check

* check serializable
```
e805f8e5
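
The three steps listed in this commit (extra parameters in the hash key, a `None` check, a serializability check) can be sketched as follows. This is an illustration only: `make_cache_key` and its parameters are hypothetical names, and the fallback to `repr()` is an assumption, not tilelang's actual behaviour.

```python
import hashlib
import json


def make_cache_key(func_source, tuned_config, extra_params=None):
    """Build a hex digest that changes whenever any of the inputs changes.

    All names are hypothetical; this is not tilelang's actual key function.
    """
    # None check: treat a missing parameter dict as an empty one.
    extra = {} if extra_params is None else dict(extra_params)

    # Serializability check: keep JSON-encodable values as they are and fall
    # back to repr() for anything else, so that hashing never raises.
    safe_extra = {}
    for name, value in extra.items():
        try:
            json.dumps(value)
            safe_extra[name] = value
        except (TypeError, ValueError):
            safe_extra[name] = repr(value)

    # sort_keys makes the digest independent of dict insertion order.
    payload = json.dumps(
        {"source": func_source, "config": tuned_config, "extra": safe_extra},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two calls that differ only in `extra_params` produce different keys, so a cached tuning result is not reused when one of those parameters changes.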
[Minor] Remove from __future__ import annotations for python 3.8 (#1273) · b1922518
Yichen Yan authored Nov 18, 2025

b1922518

17 Nov, 2025 5 commits

[Bugfix] Fix multiple cg definition when using T.sync_grid (#1272) · 220c3236
Yu Cheng authored Nov 18, 2025

220c3236

[Enhancement] Keep max score attention across blocks in FlashAttention for... · 3ab93cd7

Tong WU authored Nov 17, 2025


[Enhancement] Keep max score attention across blocks in FlashAttention for better numerical stability (#1269)

* Implement max score retention across blocks in FlashAttention for improved stability

* fix manual pipeline parameters

* Update examples/flash_attention/example_gqa_fwd_varlen.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix typo

* more

* fix a previous typo

---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

3ab93cd7
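
The idea of keeping the maximum score across blocks can be illustrated with a small pure-Python sketch of a blockwise (online) softmax. It is a scalar stand-in for the tiled kernel, not the code from the example files; the function name and the flat list inputs are assumptions, and the inputs are assumed to be non-empty.

```python
import math


def blockwise_softmax_weighted_sum(scores, values, block_size):
    """Compute sum_i softmax(scores)_i * values_i one block at a time."""
    running_max = -math.inf   # maximum over every block seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = 0.0                 # softmax numerator, relative to running_max

    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]

        # Keep the maximum across blocks, not only the current block's maximum.
        new_max = max(running_max, max(block_scores))

        # Rescale what was accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale

        for s, v in zip(block_scores, block_values):
            p = math.exp(s - new_max)
            running_sum += p
            acc += p * v

        running_max = new_max

    return acc / running_sum
```

Because every exponent is taken relative to the largest score seen so far, no call to `math.exp` receives a positive argument, which is what keeps the accumulation from overflowing.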

[Docs] Improve Installation Guide (#1270) · b3d6f03c
Chaofan Lin authored Nov 17, 2025
```
* [Docs] Improve installation guide

* address comments
```
b3d6f03c

[EXAMPLE] In the flash attention example keep the max of all blocks seen in... · a2a27814

Varuna Jayasiri authored Nov 17, 2025

[EXAMPLE] In the flash attention example keep the max of all blocks seen in scores_max for numerical stability (#1148)

* Keep the max of all blocks seen in scores_max for stability

* ruff formatting

a2a27814

[Refactor] add support for numpy dtype conversion (#1255) · 041d4a06

Kuris authored Nov 17, 2025

* add typing stub for tir.ir

* remove idents

* minor update

* [Refactor] add numpy conversion for dtype

* fix lint error

* remove unused np.float_ in dtype conversion

* fix type in np.int_

* fix typo

* minor fix

* remove debug files

041d4a06

16 Nov, 2025 2 commits

[Example] Add GQA decoding kernel with varlen page table (#1265) · 716dbef5

Zhengju Tang authored Nov 17, 2025

* [Example] Add page table for gqa decode

* [Example] Page table for varlen decoding

* [Lint]

* [Refactor] Remove redundant code

* [Lint]

* [Lint]

* [Lint]

716dbef5

[BugFix] Remove memory_order in atomic constexpr and fix NSA bwd (#1260) · 2de566e7

Kevinzz authored Nov 16, 2025



* fix nsa bwd and atomic

* [Lint]

* [BugFix]
- New implementation for atomicMax and atomicMin using atomicCAS
- PTX version atomicAdd for single 16-byte data
- Modify the test cases

* [Lint]

---------
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>

2de566e7
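
The compare-and-swap retry loop behind an `atomicMax` built on `atomicCAS` can be sketched in Python. This is a simulation only: the commit itself targets CUDA, the `Cell` class and `atomic_max` function are hypothetical, and a lock stands in for the hardware CAS instruction.

```python
import threading


class Cell:
    """A single memory word with a compare-and-swap primitive (simulated with a lock)."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`; return the value seen."""
        with self._lock:
            seen = self._value
            if seen == expected:
                self._value = new
            return seen


def atomic_max(cell, value):
    """Raise the cell to at least `value`; return the value seen before the update."""
    old = cell.load()
    while old < value:
        seen = cell.compare_and_swap(old, value)
        if seen == old:
            break          # our swap went through
        old = seen         # another thread changed the cell; retry against its value
    return old
```

An `atomic_min` follows the same pattern with the comparison reversed.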

15 Nov, 2025 3 commits

[AMD] Update CK for ROCm7 (#1262) · 729e66ca
Jiaxing Ding authored Nov 15, 2025

729e66ca

[fix] NVRTC execution backend (#1256) · eb415744

Gabriel Wu authored Nov 15, 2025

* [fix] NVRTC execution backend

* [fmt] run pre-commit

* [fix] coderabbit reviews

* [test] add cuda-python to test dep

* [fix] coderabbit reviews

* [fix] CUDA 13 compatibility

* [fix] sm90

* [fix] CUDA 13 compatibility

* [fix] pre-commit

* [fix] always use cuda::std::__atomic_ref_impl

* [fix] restore to external API

* Revert "[fix] restore to external API"

This reverts commit 49bd875638fb631d270015f408991d38fd1e9a5d.

* [fmt] use spaces instead of tabs for py codegen

* [fix] im2col API

* [fix] revert atomic.h

* [fix] dynamic shape

* [refactor] extract common utils

* [feat] support L2 persistent map

* [fix] l2 persistent map

* [fix] pre-commit

* [fix] restore _TYPE_MAP

* [fix] pre-commit

* [fix] avoid duplicate TMA descs

* [docs] add docstring

* [fix] coderabbit

* [fix] coderabbit

* [fix] coderabbit

* [fix] coderabbit

eb415744

[BugFix] Refactor attention kernel to handle OOB positions by filling with... · 0af3fd7c

Tong WU authored Nov 15, 2025

[BugFix] Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators. (#1222)

* Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators.

* lint

* pre-commit

* Update imports in flash attention test file to use new backward and forward examples for better clarity and consistency.

0af3fd7c
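
Why filling out-of-bounds positions with `-inf` works can be seen in a minimal pure-Python sketch. It is a one-row stand-in for the attention kernel, with a hypothetical function name, and it assumes at least one position is in bounds.

```python
import math


def masked_softmax(scores, valid_len):
    """Softmax over scores[:valid_len]; positions past valid_len get weight 0."""
    # Fill out-of-bounds positions with -inf so that exp() maps them to 0.
    filled = [s if i < valid_len else -math.inf for i, s in enumerate(scores)]
    row_max = max(filled)
    weights = [math.exp(s - row_max) for s in filled]
    total = sum(weights)
    return [w / total for w in weights]
```

Padded positions contribute exactly zero to both the maximum and the sum, so whatever garbage they held never reaches the result and the accumulator needs no separate clearing step.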

14 Nov, 2025 2 commits
- [BugFix] Add autotune and exp2 for GDN kernel (#1258) · eac96cd7
  Zhengju Tang authored Nov 14, 2025
```
* [BugFix] Add autotune and exp2 for GDN kernel

* [Lint]

* [Lint]
```
  eac96cd7
- [Language] Add missing while statement (#1254) · 5eb30a4f
  Kuris authored Nov 14, 2025
```
* add typing stub for tir.ir

* remove idents

* minor update

* [Language] Add missing while statement

* add test
```
  5eb30a4f
13 Nov, 2025 4 commits

[Refactor] Update buffer handling in copy and atomic operations (#1247) · 2c0072a8

Lei Wang authored Nov 14, 2025

* [Refactor] Update buffer handling in copy and atomic operations

* Refactored the `copy` and `atomic_add` functions to use element-wise minimum for defining copy extents, ensuring correct handling of overlapping regions.
* Updated utility functions to create `BufferLoad` instances with explicit extents, improving memory management and clarity.
* Removed unused imports from `atomic.py` and `copy.py` to streamline the codebase.
* Adjusted logging in `copy.cc` to provide clearer warnings for fallback scenarios in bulk copy operations.

* Remove obsolete .git_commit.txt file

* Add unit test for dynamic copy extent handling in TileLang

* Introduced a new test file `test_tilelang_issue_1237.py` to verify that the `T.copy` function correctly manages dynamic extents during primitive function building.
* The test reproduces a specific issue related to dynamic slice lengths and static buffer sizes, ensuring robustness in the handling of such scenarios.
* The test does not require execution of the kernel, as building the primitive function is sufficient to validate the fix.

* lint fix

* fix

* Revert "fix"

This reverts commit 828b4c1e4de76a7d11e4d4092927303fbbe00097.

* Update TVM submodule and refactor atomic and copy functions

* Updated the TVM submodule to a dirty state.
* Refactored `atomic_add` and `copy` functions to pass extents explicitly to the `_to_region` helper, improving clarity and correctness in handling buffer regions.
* Commented out the main execution call in the test example for `cast` and added a new function call to better demonstrate the example usage.

* Enhance extent handling in atomic and copy functions

* Introduced `legalize_pairwise_extents` utility to align and broadcast extent lists for `atomic_add` and `copy` functions, ensuring compatibility and correctness in buffer operations.
* Updated both functions to utilize the new utility, improving clarity and robustness in handling dynamic and static extents.
* Added comments to clarify the extent handling logic.

* Enhance `legalize_pairwise_extents` function with early-exit rule

* Added an early-exit condition to the `legalize_pairwise_extents` function to return original extents if the number of non-1 dimensions in both source and destination extents is equal, improving performance by avoiding unnecessary adjustments.
* Updated the function's documentation to clarify the new behavior and maintain clarity in the extent handling logic.

* lint fix

2c0072a8
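
The element-wise minimum mentioned in the first bullet of this commit can be sketched as below. The sketch covers only that bullet, not the later `legalize_pairwise_extents` logic; it assumes plain integer extents of equal rank, whereas the real code also deals with symbolic extents, and `copy_extents` is a hypothetical name.

```python
def copy_extents(src_extents, dst_extents):
    """Clamp each dimension of a copy to what both regions can hold."""
    if len(src_extents) != len(dst_extents):
        raise ValueError("this sketch assumes source and destination have equal rank")
    return [min(s, d) for s, d in zip(src_extents, dst_extents)]
```

For a source region of extents `[128, 64]` and a destination of `[96, 64]`, the copy is limited to `[96, 64]`.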

[Language][Reshape] Improve variable handling and ensure correctness during Layout Reshape (#1248) · d7164abf

Lei Wang authored Nov 13, 2025

* fix

* Refactor tensor reshaping in fp8_lighting_indexer.py

- Replaced the allocation of `s_reshaped` with a reshape operation to improve clarity and performance.
- Updated the logic in the computation of `s_reshaped` to utilize the reshaped tensor, enhancing the overall functionality of the attention mechanism.

* Refactor analyzer usage in Layout and Fragment reshaping

- Consolidated analyzer logic in the `Reshape` methods of `LayoutNode` and `FragmentNode` to utilize a fallback analyzer, improving code clarity and preventing potential null dereference issues.
- Updated variable binding and simplification calls to use the selected analyzer consistently, enhancing robustness in shape validation and index computation.

d7164abf

[Minor] Remove git_commit.txt (#1249) · c1398550
Chaofan Lin authored Nov 13, 2025

c1398550

[Bugfix] Fix fp8 dtype for some cases (#1246) · 63bf1609

Lei Wang authored Nov 13, 2025

* [Enhancement] Add FP8 support and reproducibility in lighting indexer

* Introduced a manual seed in `test_fp8_lighting_indexer` to ensure reproducible performance.
* Added specializations for `cute::float_e4m3_t` and `cute::float_e5m2_t` in `gemm_mma.h` for enhanced FP8 support across multiple CUDA architectures, ensuring compatibility and improved functionality.

* Fix typos in `fp8_lighting_indexer.py` and improve formatting in `gemm_mma.h`

* Corrected a typo in the comment for `test_fp8_lighting_indexer` to enhance clarity.
* Reformatted lines in `gemm_mma.h` for better readability by aligning template specializations across multiple CUDA architectures.

* test fix

* bug fix

63bf1609