Commits · e805f8e5a96a0c63342bdf0420941737dcbdc469 · OpenDAS / tilelang

18 Nov, 2025 2 commits
- [BugFix] Adding extra parameters into autotune hashkey (#1274) · e805f8e5
  Chaofan Lin authored Nov 18, 2025
```
* [BugFix] Adding extra parameters into autotune hashkey

* lint

* None check

* check serializable
```
  e805f8e5
- [Minor] Remove from __future__ import annotations for python 3.8 (#1273) · b1922518
  Yichen Yan authored Nov 18, 2025
  
  b1922518
17 Nov, 2025 5 commits

[Bugfix] Fix multiple cg definition when using T.sync_grid (#1272) · 220c3236
Yu Cheng authored Nov 18, 2025

220c3236

[Enhancement] Keep max score attention across blocks in FlashAttention for... · 3ab93cd7

Tong WU authored Nov 17, 2025


[Enhancement] Keep max score attention across blocks in FlashAttention for better numerical stability (#1269)

* Implement max score retention across blocks in FlashAttention for improved stability

* fix manual pipeline parameters

* Update examples/flash_attention/example_gqa_fwd_varlen.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix typo

* more

* fix a previous typo

---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

3ab93cd7
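
The idea named in this commit, keeping the maximum score across blocks, can be illustrated with a plain-Python streaming softmax. This is a minimal sketch and not the TileLang kernel: the function name, the scalar values, and the block layout are assumptions made for illustration, and it assumes the first block contains at least one finite score.

```python
import math

def online_softmax_weighted_sum(scores, values, block_size):
    """Compute softmax(scores) . values while reading scores block by block."""
    running_max = -math.inf  # largest score seen in any block so far
    denom = 0.0              # running sum of exp(score - running_max)
    acc = 0.0                # running weighted sum of values
    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        # Keep the max across blocks: it never drops below the previous value.
        new_max = max(running_max, max(block))
        # Rescale earlier partial results to the new reference max.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(block, values[start:start + block_size]):
            p = math.exp(s - new_max)
            denom += p
            acc += p * v
        running_max = new_max
    return acc / denom
```

Because every exponent is taken relative to the largest score seen so far, no `exp` argument is ever positive, which is what keeps the accumulation from overflowing.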

[Docs] Improve Installation Guide (#1270) · b3d6f03c
Chaofan Lin authored Nov 17, 2025
```
* [Docs] Improve installation guide

* address comments
```
b3d6f03c

[EXAMPLE] In the flash attention example keep the max of all blocks seen in... · a2a27814

Varuna Jayasiri authored Nov 17, 2025

[EXAMPLE] In the flash attention example keep the max of all blocks seen in scores_max for numerical stability (#1148)

* Keep the max of all blocks seen in scores_max for stability

* ruff formatting

a2a27814

[Refactor] add support for numpy dtype conversion (#1255) · 041d4a06

Kuris authored Nov 17, 2025

* add typing stub for tir.ir

* remove idents

* minor update

* [Refactor] add numpy conversion for dtype

* fix lint error

* remove unused np.float_ in dtype conversion

* fix type in np.int_

* fix typo

* minor fix

* remove debug files

041d4a06

16 Nov, 2025 2 commits

[Example] Add GQA decoding kernel with varlen page table (#1265) · 716dbef5

Zhengju Tang authored Nov 17, 2025

* [Example] Add page table for gqa decode

* [Example] Page table for varlen decoding

* [Lint]

* [Refactor] Remove redundant code

* [Lint]

* [Lint]

* [Lint]

716dbef5

[BugFix] Remove memory_order in atomic constexpr and fix NSA bwd (#1260) · 2de566e7

Kevinzz authored Nov 16, 2025



* fix nsa bwd and atomic

* [Lint]

* [BugFix]
- New implementation for atomicMax and atomicMin using atomicCAS
- PTX version atomicAdd for single 16-byte data
- Modify the test cases

* [Lint]

---------
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>

2de566e7

15 Nov, 2025 3 commits

[AMD] Update CK for ROCm7 (#1262) · 729e66ca
Jiaxing Ding authored Nov 15, 2025

729e66ca

[fix] NVRTC execution backend (#1256) · eb415744

Gabriel Wu authored Nov 15, 2025

* [fix] NVRTC execution backend

* [fmt] run pre-commit

* [fix] coderabbit reviews

* [test] add cuda-python to test dep

* [fix] coderabbit reviews

* [fix] CUDA 13 compatibility

* [fix] sm90

* [fix] CUDA 13 compatibility

* [fix] pre-commit

* [fix] always use cuda::std::__atomic_ref_impl

* [fix] restore to external API

* Revert "[fix] restore to external API"

This reverts commit 49bd875638fb631d270015f408991d38fd1e9a5d.

* [fmt] use spaces instead of tabs for py codegen

* [fix] im2col API

* [fix] revert atomic.h

* [fix] dynamic shape

* [refactor] extract common utils

* [feat] support L2 persistent map

* [fix] l2 persistent map

* [fix] pre-commit

* [fix] restore _TYPE_MAP

* [fix] pre-commit

* [fix] avoid duplicate TMA descs

* [docs] add docstring

* [fix] coderabbit

* [fix] coderabbit

* [fix] coderabbit

* [fix] coderabbit

eb415744

[BugFix] Refactor attention kernel to handle OOB positions by filling with... · 0af3fd7c

Tong WU authored Nov 15, 2025

[BugFix] Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators. (#1222)

* Refactor attention kernel to handle OOB positions by filling with `-inf` instead of clearing accumulators.

* lint

* pre-commit

* Update imports in flash attention test file to use new backward and forward examples for better clarity and consistency.

0af3fd7c
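
Why filling out-of-bounds positions with `-inf` works can be seen in a small plain-Python sketch: a score of `-inf` contributes `exp(-inf) = 0` to the softmax, so padded positions receive zero weight. The helper names and the list-based layout below are hypothetical and are not taken from the kernel.

```python
import math

def mask_out_of_bounds(scores, block_start, seq_len):
    """Replace scores whose absolute position is >= seq_len with -inf."""
    return [
        s if block_start + i < seq_len else -math.inf
        for i, s in enumerate(scores)
    ]

def softmax(scores):
    """Numerically stable softmax; assumes at least one finite score."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Block of 4 positions, but only the first 2 are inside the sequence.
probs = softmax(mask_out_of_bounds([0.5, 1.5, 9.0, 9.0], block_start=0, seq_len=2))
```

The two padded entries end up with probability exactly zero, and the remaining probabilities still sum to one.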

14 Nov, 2025 2 commits
- [BugFix] Add autotune and exp2 for GDN kernel (#1258) · eac96cd7
  Zhengju Tang authored Nov 14, 2025
```
* [BugFix] Add autotune and exp2 for GDN kernel

* [Lint]

* [Lint]
```
  eac96cd7
- [Language] Add missing while statement (#1254) · 5eb30a4f
  Kuris authored Nov 14, 2025
```
* add typing stub for tir.ir

* remove idents

* minor update

* [Language] Add missing while statement

* add test
```
  5eb30a4f
13 Nov, 2025 6 commits

[Refactor] Update buffer handling in copy and atomic operations (#1247) · 2c0072a8

Lei Wang authored Nov 14, 2025

* [Refactor] Update buffer handling in copy and atomic operations

* Refactored the `copy` and `atomic_add` functions to use element-wise minimum for defining copy extents, ensuring correct handling of overlapping regions.
* Updated utility functions to create `BufferLoad` instances with explicit extents, improving memory management and clarity.
* Removed unused imports from `atomic.py` and `copy.py` to streamline the codebase.
* Adjusted logging in `copy.cc` to provide clearer warnings for fallback scenarios in bulk copy operations.

* Remove obsolete .git_commit.txt file

* Add unit test for dynamic copy extent handling in TileLang

* Introduced a new test file `test_tilelang_issue_1237.py` to verify that the `T.copy` function correctly manages dynamic extents during primitive function building.
* The test reproduces a specific issue related to dynamic slice lengths and static buffer sizes, ensuring robustness in the handling of such scenarios.
* The test does not require execution of the kernel, as building the primitive function is sufficient to validate the fix.

* lint fix

* fix

* Revert "fix"

This reverts commit 828b4c1e4de76a7d11e4d4092927303fbbe00097.

* Update TVM submodule and refactor atomic and copy functions

* Updated the TVM submodule to a dirty state.
* Refactored `atomic_add` and `copy` functions to pass extents explicitly to the `_to_region` helper, improving clarity and correctness in handling buffer regions.
* Commented out the main execution call in the test example for `cast` and added a new function call to better demonstrate the example usage.

* Enhance extent handling in atomic and copy functions

* Introduced `legalize_pairwise_extents` utility to align and broadcast extent lists for `atomic_add` and `copy` functions, ensuring compatibility and correctness in buffer operations.
* Updated both functions to utilize the new utility, improving clarity and robustness in handling dynamic and static extents.
* Added comments to clarify the extent handling logic.

* Enhance `legalize_pairwise_extents` function with early-exit rule

* Added an early-exit condition to the `legalize_pairwise_extents` function to return original extents if the number of non-1 dimensions in both source and destination extents is equal, improving performance by avoiding unnecessary adjustments.
* Updated the function's documentation to clarify the new behavior and maintain clarity in the extent handling logic.

* lint fix

2c0072a8

[Language][Reshape] Improve variable handling and ensure correctness during Layout Reshape (#1248) · d7164abf

Lei Wang authored Nov 13, 2025

* fix

* Refactor tensor reshaping in fp8_lighting_indexer.py

- Replaced the allocation of `s_reshaped` with a reshape operation to improve clarity and performance.
- Updated the logic in the computation of `s_reshaped` to utilize the reshaped tensor, enhancing the overall functionality of the attention mechanism.

* Refactor analyzer usage in Layout and Fragment reshaping

- Consolidated analyzer logic in the `Reshape` methods of `LayoutNode` and `FragmentNode` to utilize a fallback analyzer, improving code clarity and preventing potential null dereference issues.
- Updated variable binding and simplification calls to use the selected analyzer consistently, enhancing robustness in shape validation and index computation.

d7164abf

[Minor] Remove git_commit.txt (#1249) · c1398550
Chaofan Lin authored Nov 13, 2025

c1398550

[Bugfix] Fix fp8 dtype for some cases (#1246) · 63bf1609

Lei Wang authored Nov 13, 2025

* [Enhancement] Add FP8 support and reproducibility in lighting indexer

* Introduced a manual seed in `test_fp8_lighting_indexer` to ensure reproducible performance.
* Added specializations for `cute::float_e4m3_t` and `cute::float_e5m2_t` in `gemm_mma.h` for enhanced FP8 support across multiple CUDA architectures, ensuring compatibility and improved functionality.

* Fix typos in `fp8_lighting_indexer.py` and improve formatting in `gemm_mma.h`

* Corrected a typo in the comment for `test_fp8_lighting_indexer` to enhance clarity.
* Reformatted lines in `gemm_mma.h` for better readability by aligning template specializations across multiple CUDA architectures.

* test fix

* bug fix

63bf1609

[Refactor] Phaseout legacy loop vectorize dynamic pass (#1245) · f550a58d

Lei Wang authored Nov 13, 2025



* Deleted the LoopVectorizeDynamic implementation from the transform module.
* Removed associated references in the phase and initialization files to streamline the codebase.
* This change simplifies the transformation pipeline by eliminating unused functionality.
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

f550a58d

[AMD] enable amd ci test & fix bug & fix dockerfile (#1244) · b10d49b2
Jiaxing Ding authored Nov 13, 2025

b10d49b2

12 Nov, 2025 8 commits

RMSNorm epsilon refine in the example (#1243) · 468b1b70

pengxin99 authored Nov 13, 2025

* Fix division by zero in RMS normalization

* Fix rsqrt calculation to avoid division by zero

468b1b70
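
The division-by-zero issue mentioned in this commit can be sketched in plain Python: adding a small epsilon inside the square root keeps the reciprocal finite even for an all-zero input. The function name and the default `eps` value are assumptions for illustration, not the values used in the example.

```python
import math

def rms_norm(x, eps=1e-12):
    """Scale x by 1 / sqrt(mean(x^2) + eps); eps guards against a zero mean."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # finite even when mean_sq == 0
    return [v * inv_rms for v in x]
```

Without `eps`, an all-zero row would evaluate `1.0 / math.sqrt(0.0)` and raise `ZeroDivisionError` in Python (or produce `inf` on a GPU).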

[Bugfix] Minor fix for tcgen05 (#1242) · 6882bd50

Lei Wang authored Nov 12, 2025



* Add correctness evaluation script for GEMM v2

- Introduced a new Python script `correctness_evaluation_tcgen05.py` for testing the correctness of GEMM v2 implementations using pytest.
- Implemented matrix multiplication and compilation checks, along with parameterized tests for various input configurations.
- Enhanced the testing framework to validate GEMM operations with different data types and configurations, ensuring robustness in the implementation.
- Updated logging in `legalize_negative_index.cc` to reduce verbosity by changing from WARNING to DLOG.
- Adjusted assertions in `tcgen05_macro_generator.py` to accommodate new warp size requirements for improved performance.
- Removed unused variable in `gemm_tcgen05.py` to streamline the codebase.

* lint fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

6882bd50

[Enhancement] Support Layout/Fragment Reshape (#1241) · 4370309b

Lei Wang authored Nov 12, 2025



* Update layout handling and introduce reshape functionality

- Updated the `LayoutNode` class to include a new `Reshape` method, allowing for dynamic reshaping of layouts based on input shapes.
- Enhanced the `OutputShape` method to provide better handling of cases where the analyzer cannot form an `IntervalSet`, implementing fallback mechanisms to ensure safe extents.
- Refactored the `ReduceOpNode` to utilize `BufferRegion` for improved memory handling during reduction operations.
- Added tests for reshaping functionality and layout transformations to ensure correctness and performance in various scenarios.

* lint fix

* Revert tvm submodule pointer to 1815c3e0b6ec4ead36370bbd1562025d8529017c; keep src unchanged

* Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1

* Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1

* remove useless prove

* remove comment

---------
Co-authored-by: tilelang-bot <bot@tilelang>

4370309b

[Language] Add type stubs for tir op (#1239) · 02cfc2a3
Kuris authored Nov 12, 2025
```
* add typing stub for tir.ir

* remove idents

* minor update
```
02cfc2a3
[Bugfix] Minor fix in `builder.py` (#1235) · 30d8dedd
LJC00118 authored Nov 12, 2025

30d8dedd

[Refactor] Add kernel selection option for GEMM v1 in environment settings (#1200) · 8fbe1b3a

Lei Wang authored Nov 12, 2025

* Add kernel selection option for GEMM v1 in environment settings

- Introduced `TILELANG_USE_GEMM_V1` environment variable to control the selection of GEMM version.
- Added `use_gemm_v1` method in the `Environment` class to determine if GEMM v1 should be used based on the environment variable.
- Updated GEMM function assignment to default to v2, allowing for v1 to be forced via the new environment variable.

* bug fix

* Add kernel selection option for GEMM in environment settings

- Introduced `TILELANG_USE_GEMM_V1` environment variable to allow users to select between GEMM v1 and v2 implementations.
- Updated `gemm` function to default to v2 but switch to v1 if the environment variable is set to a truthy value.
- Added a method `use_gemm_v1` in the `Environment` class to facilitate this selection based on the environment variable.

* Refactor GEMM macro generator to use BufferRegion instead of Buffer

- Updated `wgmma` a...

8fbe1b3a
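
A minimal sketch of the selection logic described above, assuming a free function rather than the `Environment` method and assuming a particular set of truthy strings (the commit message does not say which values count as truthy):

```python
import os

# Assumption: which strings count as "truthy" is not stated in the commit.
_TRUTHY = {"1", "true", "yes", "on"}

def use_gemm_v1(environ=None):
    """Return True when TILELANG_USE_GEMM_V1 is set to a truthy value."""
    environ = os.environ if environ is None else environ
    value = environ.get("TILELANG_USE_GEMM_V1", "0")
    return value.strip().lower() in _TRUTHY

def select_gemm(environ=None):
    """Default to v2; fall back to v1 only when explicitly requested."""
    return "gemm_v1" if use_gemm_v1(environ) else "gemm_v2"
```

Passing a dictionary instead of reading `os.environ` directly is only a convenience for testing the sketch.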

[Fix] Fix a type that make wrong T.macro backtrace (#1234) · 2b1f5990
Kuris authored Nov 12, 2025

2b1f5990

[Feature] Add Release Plan issue template for structured release management (#1231) · 454a9df6

Lei Wang authored Nov 12, 2025

* Introduced a new issue template for planning releases, including fields for version, milestone, scope, tasks, readiness checks, and additional notes.
* This template aims to streamline the release planning process and ensure all necessary information is captured for each release.

454a9df6

11 Nov, 2025 5 commits

[Enhancement] Extend type mappings and unify CPU backend initialization (#1230) · 9eaa708f

Lei Wang authored Nov 11, 2025

* Added new type mappings for int8, uint8, int16, uint16, int64, uint64, float64, bool, and uchar to the TLCPUSourceWrapper class.
* Updated the initialization function to use a common format for the CPU backend, ensuring consistency and improved error handling with the addition of get_last_error().
* Refactored the get_cpu_init_func method to return the updated initialization function, enhancing clarity and maintainability.

9eaa708f

[Enhancement] Refactor version retrieval logic in tilelang package (#1227) · e2c5906e

Lei Wang authored Nov 11, 2025



* Introduced a new function, _compute_version, to determine the package version with a clear preference order, enhancing version management.
* The function checks for a VERSION file in the source checkout, falls back to importlib.metadata for installed distributions, and defaults to a development version if all else fails.
* Updated the __version__ variable assignment to utilize the new function, improving clarity and maintainability of version handling.
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

e2c5906e
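
The preference order described above (VERSION file, then installed metadata, then a development default) might look roughly like the following. This is a sketch under assumptions: the function signature, the distribution name `tilelang`, and the fallback string `0.0.dev0` are illustrative and not confirmed by the commit.

```python
from importlib import metadata
from pathlib import Path

def compute_version(source_root, dist_name="tilelang", default="0.0.dev0"):
    """Resolve a version string using a fixed preference order."""
    # 1. Prefer a VERSION file in the source checkout.
    version_file = Path(source_root) / "VERSION"
    if version_file.is_file():
        text = version_file.read_text().strip()
        if text:
            return text
    # 2. Fall back to the metadata of an installed distribution.
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # 3. Nothing found: report a development version.
        return default
```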

[Refactor] Simplify logic in the `CompleteBufferFragment` (#1226) · 7045f1d6

Lei Wang authored Nov 11, 2025



* fix

* Fix logging level in LayoutNode::InverseWithLevel method from WARNING to DLOG for symbolic layout fallback.

* lint fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

7045f1d6

[Enhancement] Add thread count validation for ReduceOp fragment layout inference (#1225) · 67cc8611

Lei Wang authored Nov 11, 2025

* [Enhancement] Add thread count validation for ReduceOp fragment layout inference

* Introduced a check to ensure that the thread count is divisible by the replicate extent during layout inference in ReduceOpNode. This validation prevents layout inference failures and provides detailed error messages to guide users in resolving issues related to thread block sizes and fragment layouts.
* Updated tests to remove unsupported configurations that could lead to layout inference errors, ensuring more robust testing scenarios.

* lint fix

67cc8611
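
The divisibility check described above can be sketched as a small Python function. The name, the error wording, and the return value are hypothetical; the actual check lives in the C++ `ReduceOpNode` layout inference.

```python
def check_replicate_extent(num_threads, replicate_extent):
    """Validate that the thread count is divisible by the replicate extent."""
    if replicate_extent <= 0:
        raise ValueError(f"replicate extent must be positive, got {replicate_extent}")
    if num_threads % replicate_extent != 0:
        raise ValueError(
            f"thread count {num_threads} is not divisible by replicate extent "
            f"{replicate_extent}; adjust the thread block size or fragment layout"
        )
    # Number of threads that share each replica.
    return num_threads // replicate_extent
```

Failing early with both numbers in the message tells the user which quantity to change, instead of surfacing a later layout inference failure.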

[GQA] Add varlen decoding kernel with logits saving (#1223) · eb6e8973

Zhengju Tang authored Nov 11, 2025

* [Example] Add GQA varlen decoding kernel with logits return

* [Example] Support Sink for GQA varlen decoding

* [Example] Add support for no-varlen

* [Tune] Add high performance logits saving

* [Lint]

* [Lint]

* [Rename]

eb6e8973

10 Nov, 2025 6 commits

[Language] Refactor reduce and support shared memory as its in/out (#1219) · 47039f06

Lei Wang authored Nov 10, 2025

* [Refactor] Update ReduceOpNode to use absolute values in Max computation and remove unused shared memory reduction logic

* Changed Max computation for AbsMax type to use absolute values of lhs and rhs.
* Removed unused shared memory reduction logic and related checks for buffer dimensions and thread extents, simplifying the Lower method.
* Added a fatal log for unsupported buffer scope reductions.

* reduce fix

* [Fix] Update type check for eval value in Builder class

* Changed the type check for eval values to raise a TypeError for unsupported types, specifically excluding instances of tvm.tir.Buffer. This improves error handling and clarity in the Builder class.

47039f06
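
The AbsMax change described above, taking absolute values of both operands before the max, amounts to the following combiner. This is a plain-Python sketch with hypothetical names; the identity value `0.0` is an assumption that is valid because absolute values are never negative.

```python
from functools import reduce

def absmax_combine(lhs, rhs):
    """Combine two partial results by the larger absolute value."""
    return max(abs(lhs), abs(rhs))

def absmax(values):
    """Reduce a sequence to its largest absolute value (0.0 for an empty input)."""
    return reduce(absmax_combine, values, 0.0)
```

Applying `abs` to both sides matters because the left operand may be a raw input element rather than an already-reduced partial result.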

[Enhancement] Improve iterator handling in layout utilities and parallel operations (#1221) · 2957afca

Lei Wang authored Nov 10, 2025

* [Enhancement] Improve iterator handling in layout utilities and parallel operations

* Added a new function, DivideUnusedIterators, to detect per-iterator gaps in fused index expressions, enhancing the accuracy of unused iterator detection.
* Updated CompleteBufferFragment to prefer direct inversion for bijective index mappings and introduced a fallback mechanism for non-bijective cases, improving layout inversion robustness.
* Added a new test for layout inference in fused kernels to ensure correct compilation and execution without layout inversion failures.

* lint fix

2957afca

[Bugfix] Improve error handling in LayoutNode::InverseWithLevel (#1215) (#1220) · cf46b7bd

Lei Wang authored Nov 10, 2025

* Added logging and exception handling for layout errors in InverseWithLevel method.
* Replaced direct error check with a throw statement to enhance error reporting and debugging capabilities.

cf46b7bd

[Utils] Add source export, NVCC-based PTX/SASS dump, logging (#1216) · 7e5b1cd2

Lei Wang authored Nov 10, 2025

* [Enhancement] Add NVCC support for PTX and SASS generation in TileLang

* Introduced functions to compile CUDA C++ source to PTX and SASS formats, enhancing the ability to generate intermediate representations for CUDA kernels.
* Added default compile options for NVCC, including paths for TileLang templates, CUTLASS, and CUDA includes.
* Implemented methods to export and display generated PTX and SASS code, improving usability for developers working with CUDA targets.
* Updated JITKernel class to integrate new NVCC functionalities for PTX and SASS handling, ensuring compatibility with existing workflows.

* [Fix] Improve error handling in get_sass_from_source function

* Added contextlib to suppress exceptions when removing temporary files, enhancing robustness.
* Fixed formatting of error message for clarity when CUDA tools are not found, ensuring better user feedback.

* [Enhancement] Preserve user flags in NVCC compile options

* Updated the default_compile_options function to preserve user-specified compile flags, including repeated tokens, by utilizing shlex for proper tokenization.
* This enhancement improves the flexibility and accuracy of NVCC compile options, ensuring that all user inputs are correctly handled.

7e5b1cd2
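
Preserving user flags with `shlex`, as described in the last item above, can be sketched as follows. The function name and the way defaults are passed in are assumptions; only the use of `shlex` for tokenization comes from the commit message.

```python
import shlex

def merge_compile_options(defaults, user_flags):
    """Append user flags to the defaults without deduplicating or reordering."""
    options = list(defaults)
    if user_flags:
        # shlex.split honours quoting, so paths with spaces stay one token,
        # and repeated tokens such as "-Xptxas" are kept as given.
        options.extend(shlex.split(user_flags))
    return options

opts = merge_compile_options(
    ["-std=c++17"],
    '-I "/opt/my include" -Xptxas -v -Xptxas -O3',
)
```

A naive `str.split()` would break the quoted include path into two tokens, and a set-based merge would drop the second `-Xptxas`.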

[Build] Explicitly add `libtvm` as a dep of `libtilelang` (#1215) · 2bc45bc3
Yichen Yan authored Nov 10, 2025

2bc45bc3
[Fix] Fix buffer re-import typo in tilelang.language (#1214) · d5fda276
Kuris authored Nov 10, 2025
```
* Fix Buffer re-import typo in tilelang.language

* fix lint error
```
d5fda276

09 Nov, 2025 1 commit

[Bugfix] Enhance LetStmt Handling in Pipeline Transform (#1212) · 85218bd9

Lei Wang authored Nov 09, 2025

* [Enhancement] Introduce LetWrapper for handling loop variable substitutions in pipeline rewriting

* Added LetWrapper struct to encapsulate variable and value pairs for loop variable substitutions.
* Updated PipelineRewriter to accept a vector of LetWrapper instances, allowing for proper handling of Let statements that depend on the pipeline loop variable.
* Enhanced the BuildPipeline method to incorporate LetWrapper instances into rewritten blocks, ensuring correct substitutions during pipeline execution.
* Refactored logic for processing Let statements to differentiate between those that use the loop variable and those that do not, improving the flexibility of the pipeline transformation.

* Refactor lambda expression for clarity in loop variable usage check in inject_pipeline.cc

* [Test] Add regression test for loop variable handling in kernel compilation

* Introduced a new test case to verify correct handling of loop variables in the kernel compilation process, addressing a regression issue with InjectSoftwarePipeline.
* The test ensures that the loop variable is not left as a free variable, which previously caused failures in MakePackedAPI.
* Configurations are set to disable warp specialization and TMA lowering to align with the original issue reproduction.

* Remove unused import in regression test for loop variable handling in kernel compilation

85218bd9