- 27 Oct, 2025 3 commits
-
-
Yu Cheng authored
* [Enhancement] Add missing primitive after mbarrier init * lint
-
dependabot[bot] authored
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 5 to 6.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v5...v6)
---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
dependabot[bot] authored
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 5.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v5)
---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 25 Oct, 2025 1 commit
-
-
Zhengju Tang authored
* [Feature] Add memory_order PTX for vectorized (2x) atomic add
* [Feature] Add memory_order PTX for all vectorized atomic add
* [Lint]
* test
* [BugFix] Fix init optional argument in alloc_var
* bug fix
* bug fix
* lint fix
* lint fix
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 24 Oct, 2025 1 commit
-
-
Lei Wang authored
* fix int32 dtype issue
* lint fix
* lint
* lint fix
---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
-
- 23 Oct, 2025 4 commits
-
-
Wenhao Xie authored
* [Feature] Support None type as input for T.ptr and T.Tensor * lint * lint * lint * lint fix
-
Tong WU authored
* [Feature] Add vectorized float16 and float32 conversion support in CUDA codegen
* Implemented handling for conversions between float16 and float32 types, specifically for vectorized operations using __half22float2 and __float22half2_rn.
* Enhanced the existing code to support both directions of conversion based on the lane count.
* Improved overall type handling in the VisitExpr_ method for better compatibility with TileLang.
* [Feature] Add float32 to float8 conversion support in CUDA codegen
* Implemented handling for conversion from float32 to float8 (E4M3/E5M2) in the VisitExpr_ method.
* Added vectorized conversion support using __nv_cvt_float2_to_fp8x2 for float2 to fp8x2 transformations.
* Enhanced type handling for better compatibility with TileLang, particularly for float8 types.
* lint
* fix a bug
* [Enhancement] Support lanes=4 cases and add unit test for vectorized cast
* lint
* [Feature] Refactor bf16 conversion operations and remove legacy compile flags
* lint
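Illustrative note (not part of the commit): these packed conversions are reached when the lowered IR contains a vectorized cast between float32 and float16/float8 lanes. The TileLang-style sketch below shows the kind of dtype-changing tile copy such a path targets; the kernel scaffolding, and the assumption that this particular copy vectorizes into packed conversions such as __float22half2_rn, are illustrative guesses rather than details taken from the commit.
```python
# Sketch only: a fp32 -> fp16 tile cast of the kind the vectorized
# conversion path in the CUDA codegen is meant to cover. The API usage and
# the assumed lowering to packed conversions are for illustration.
import tilelang
import tilelang.language as T


@tilelang.jit
def cast_fp32_to_fp16(M=1024, N=1024, block_M=128, block_N=128):
    @T.prim_func
    def main(A: T.Tensor((M, N), "float32"), B: T.Tensor((M, N), "float16")):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_local = T.alloc_fragment((block_M, block_N), "float32")
            B_local = T.alloc_fragment((block_M, block_N), "float16")
            T.copy(A[by * block_M, bx * block_N], A_local)
            # dtype-changing copy: pairs (or quads) of fp32 lanes can be
            # narrowed with packed conversions when the cast is vectorized
            T.copy(A_local, B_local)
            T.copy(B_local, B[by * block_M, bx * block_N])

    return main
```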
-
Lei Wang authored
* [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic
* Refactored CopyNode::MakeSIMTLoop to handle scalar cases more efficiently by moving the scalar check to the end of the function.
* Updated loop_partition.cc to set a default DataType for thread and vector extents, ensuring compatibility when loop_vars_ is empty.
* lint fix
* remove debug print
-
Yichen Yan authored
* update rules * ruff check * other fixes * fmt * do not touch examples * fmt
-
- 22 Oct, 2025 7 commits
-
-
Yu Cheng authored
-
Yu Cheng authored
-
Xuehai Pan authored
* [Maint] Remove pre-commit install in `format.sh` * [Maint] Update uncommitted change detection command * [Minor] update warning messages
-
Yu Cheng authored
-
Xuehai Pan authored
* [Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow
* chore: update clang-tidy settings
* chore: upgrade clang-format and clang-tidy version
* lint: resolve clang-tidy errors
* [Maint] restore format.sh
* [CI] pre-commit autoupdate
* [Minor] fix `command -v` usage
-
Lei Wang authored
-
Lei Wang authored
* add alloc_reducer gemv example * test
-
- 21 Oct, 2025 9 commits
-
-
Zhengju Tang authored
* [Lint]
* [BugFix] Freeze the memory order of all atomic_add operations
* [Lint]
* [Atomic] Move on to regional atomic add
* [Lint]
-
Yu Cheng authored
-
Lei Wang authored
* carry existing local-var initializer map into OpaqueBlockLower, reattach it to generated Allocates and the PrimFunc attrs
* thread the map through FlattenBuffer and StorageRewrite so flattened/merged allocations keep their tl.local_var_init annotations
* teach annotation handling to accept scalar initializers, resolve buffers, and merge with existing stat
* lint fix
* enhance
* lint fix
* lint fix
-
Lei Wang authored
* Enable configurable StorageRewrite inplace detection
  - Add kStorageRewriteDetectInplace constant and register the flag with PassContext so C++ code no longer hard-codes the key.
  - Wire StorageRewrite to include TileLang builtin constants and honor the new config toggle when deciding inplace reuse.
  - Document the flag across Python surfaces (PassConfigKey, JIT/autotuner docs) with usage guidance and simplified IR examples.
* lint fix
* add test
* lint fix
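Illustrative note (not part of the commit): because the flag is registered with PassContext, it should be reachable from Python through the usual pass-config plumbing. The sketch below assumes a config key string and the `pass_configs` keyword on `tilelang.jit`; the commit only names the C++ constant kStorageRewriteDetectInplace, so consult PassConfigKey for the real key name.
```python
# Sketch only: toggling the StorageRewrite inplace-detection flag from
# Python. The key string is a guess at what kStorageRewriteDetectInplace
# registers; the actual name lives in PassConfigKey.
import tilelang
import tilelang.language as T


@tilelang.jit(pass_configs={"tl.storage_rewrite_detect_inplace": True})  # assumed key
def roundtrip(M=1024, N=1024, block=128):
    @T.prim_func
    def main(A: T.Tensor((M, N), "float16"), B: T.Tensor((M, N), "float16")):
        with T.Kernel(T.ceildiv(N, block), T.ceildiv(M, block), threads=128) as (bx, by):
            buf = T.alloc_shared((block, block), "float16")
            T.copy(A[by * block, bx * block], buf)
            T.copy(buf, B[by * block, bx * block])

    return main
```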
-
Tong WU authored
* [Cleanup] Remove `tilelang.disable_cache()` calls from example scripts * lint * lint
-
Lei Wang authored
* Improve target docs and helper messaging
  - add SUPPORTED_TARGETS metadata and expose describe_supported_targets()
  - relax target validation to accept option suffixes and upgrade error messages
  - document target usage and compute capability mapping in docs/get_started/targets.md
  - note preference for string targets when caching and link the new guide in docs/index.md
* remove american english spelling
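Illustrative note (not part of the commit): a hedged sketch of how the new helper and string targets might be used. Only the helper's name and the stated preference for string targets come from the commit; the import path and the kernel scaffolding are assumptions.
```python
# Sketch only: listing supported targets and compiling against a plain
# string target. The import path below is an assumption; the commit names
# SUPPORTED_TARGETS and describe_supported_targets() but not their module.
import tilelang
import tilelang.language as T
from tilelang.utils.target import describe_supported_targets  # assumed path

print(describe_supported_targets())  # expected: per-target usage notes


@tilelang.jit(target="cuda")  # plain string target, preferred for caching
def fill_ones(N=1024, block=128):
    @T.prim_func
    def main(A: T.Tensor((N,), "float32")):
        with T.Kernel(T.ceildiv(N, block), threads=block) as bx:
            for i in T.Parallel(block):
                A[bx * block + i] = 1.0

    return main
```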
-
Lei Wang authored
* refactor cython wrapper * optimize * fix installations
-
Zhengju Tang authored
* [BugFix] Add memory order argument for non-vectorized atomic add
* [Lint]
* [BugFix] Memory order
* [Lint]
* [BugFix] Argument in cuda template
* [Lint]
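Illustrative note (not part of the commit): together with the vectorized-atomic changes above, this exposes a memory-order knob on atomic adds. The sketch below assumes a `memory_order` keyword on T.atomic_add and a "release" value read off the commit titles; the documented signature may differ.
```python
# Sketch only: accumulating per-row partial sums with T.atomic_add.
# The memory_order keyword and the "release" value are assumptions based on
# the commit titles, not a documented signature.
import tilelang
import tilelang.language as T


@tilelang.jit
def row_sum(M=1024, N=1024, block_M=64, block_N=256):
    @T.prim_func
    def main(A: T.Tensor((M, N), "float32"), Out: T.Tensor((M,), "float32")):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_local = T.alloc_fragment((block_M, block_N), "float32")
            partial = T.alloc_fragment((block_M,), "float32")
            T.copy(A[by * block_M, bx * block_N], A_local)
            T.reduce_sum(A_local, partial, dim=1)
            for i in T.Parallel(block_M):
                # assumed keyword added by these commits
                T.atomic_add(Out[by * block_M + i], partial[i], memory_order="release")

    return main
```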
-
Zhengju Tang authored
* [Feature] Add GQA backward kernel with varlen input
* [Lint]
* [BugFix] Freeze the memory order of all atomic_add operations
* [Lint]
* [Lint]
* [BugFix] Use release order to boost performance
-
- 20 Oct, 2025 11 commits
-
-
Tong WU authored
* [Enhancement] Update async intrinsic handling in inject_fence_proxy
* Added support for wgmma async intrinsics in IsAsyncIntrinsic function.
* Changed handling of unknown externs to treat them as Generic instead of Async, improving accuracy in proxy kind determination.
* test fix
* Update testing/python/transform/test_tilelang_transform_inject_fence_proxy.py
  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
-
Yu Cheng authored
-
Lei Wang authored
* Support reduce ss * lint fix * test fix * lint fix
-
Lei Wang authored
* recommend using T.dynamic instead of T.symbolic * lint fix * lint fix
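Illustrative note (not part of the commit): a minimal sketch of the recommended spelling, assuming T.dynamic declares a symbolic dimension the same way T.symbolic did; the surrounding kernel is illustrative only.
```python
# Sketch only: declaring a dynamic dimension with the recommended T.dynamic
# spelling (formerly T.symbolic). The kernel body is illustrative.
import tilelang
import tilelang.language as T


@tilelang.jit
def scale_by_two(block=256):
    N = T.dynamic("n")  # previously written as T.symbolic("n")

    @T.prim_func
    def main(A: T.Tensor((N,), "float32"), B: T.Tensor((N,), "float32")):
        with T.Kernel(T.ceildiv(N, block), threads=block) as bx:
            for i in T.Parallel(block):
                if bx * block + i < N:
                    B[bx * block + i] = A[bx * block + i] * 2.0

    return main
```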
-
Lei Wang authored
- extend matmul autotune test suite with a symbolic M case and allow run_autotune to accept concrete values for symbolic dims
- sanitize _kernel_parameters when generating cache keys so symbolic vars serialize deterministically
-
Zhengju Tang authored
* [Feature] Support Reduce operators for bitwise and/or/xor * [Lint]
-
Lei Wang authored
-
Lei Wang authored
-
Lei Wang authored
* Allow dynamic extents in loop partition; warn when layout inversion falls back to NoCheck
* add test and introduce predicate
* test fix
* fix
* enhance
* inverse with level
* test fix
* bug fix
-
Yu Cheng authored
-
dependabot[bot] authored
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v5)
---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 19 Oct, 2025 4 commits
-
-
Lei Wang authored
-
Tong WU authored
[Enhancement] Deprecate split&sum in attn bwd examples on Hopper and migrate to vectorized atomic add (#1065)
-
Tong WU authored
* [Refactor][Example] Update linear attention examples and add tests
  - Refactored the backward and forward linear attention kernels to use shared memory and atomic additions for improved performance.
  - Introduced L2 normalization in the main functions of both examples.
  - Added a new test suite for the linear attention examples to ensure correctness and performance.
  - Updated argument parsing in the main functions for better usability.
* upd docstring for tma atomic add
* lint
* Add flash-linear-attention dependency to requirements.txt
* Rename main function to chunk_linear_attn_bwd
* Rename main function to chunk_linear_attn_fwd
* chore
---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Xuehai Pan authored
-