Commits · 72111642a6cd28c30778d40856c32781f2586ebe · OpenDAS / tilelang

17 Oct, 2025 7 commits

[Refactor] Refactor Pass `LegalizeSafeMemoryAccess` to support recursive load/store rewrite (#1050) · 72111642

Chaofan Lin authored Oct 17, 2025



* [Refactor] Refactor Pass `LegalizeSafeMemoryAccess` to support recursive load/store rewrite

* lint

* recursive collect conds for call_extern

* fix name

* [Lint]: [pre-commit.ci] auto fixes [...]

* lint

* [Lint]: [pre-commit.ci] auto fixes [...]

* lint

* [Lint]: [pre-commit.ci] auto fixes [...]

* address comment

* rename pad_value to safe_value

* lint

* add oob store test

* [Lint]: [pre-commit.ci] auto fixes [...]

* fix

* fix

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

72111642

[Enhancement] Introduce a workaround for layout inference for local buffer store (#1055) · 278c0fbf

Lei Wang authored Oct 17, 2025



* [Enhancement] Improve layout inference for local buffer handling in parallel operations

* Added logic to check if a loop only manipulates "local" buffers, which affects thread binding decisions.
* Updated the condition for determining parallel loop execution to account for local buffer stores.
* Cleaned up comments for clarity and future considerations.

* [Refactor] Clean up parallel loop condition formatting in layout inference

* Reformatted the condition for determining parallel loop execution for better readability.
* Maintained existing logic while enhancing code clarity for future modifications.

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

278c0fbf

[Enhancement] Improve CUDA compiler detection in CMake (#1054) · 37b3dbde
LJC00118 authored Oct 17, 2025
```
* improve CUDA compiler detection in CMake

* Minor fix
```
37b3dbde
[CI] Disable autofix for pre-commit CI (#1053) · 1281d6f8
Lei Wang authored Oct 17, 2025

1281d6f8

[Enhancement] Remove constraint requiring last dimension stride to be 1 (#1040) · 35cf8885

LJC00118 authored Oct 17, 2025



* remove last dimension stride must be 1 constraint

* add vectorize test

* minor fix

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

35cf8885

Automatically initialize submodule if missing (#1052) · fd1493be
Lei Wang authored Oct 17, 2025

fd1493be

[Enhancement] Add support for symbolic dimensions in Cython kernel adapter and... · cc00fb65

Tong WU authored Oct 17, 2025

[Enhancement] Add support for symbolic dimensions in Cython kernel adapter and improve static shape validation in wrapper (#1024)

* [Enhancement] Add support for symbolic dimensions in Cython kernel adapter and improve static shape validation in wrapper

* [BugFix] Fix shape mismatch and deprecate `T.if()` in fused_moe example

* [Fix] Add `is_symbolic_expr` function to check for symbolic expressions in TIR

- Introduced a new utility function `is_symbolic_expr` to determine if an expression is a symbolic expression, enhancing type checking capabilities.
- Updated shape handling in `CythonKernelAdapter` to utilize the new function, improving handling for symbolic shapes.

cc00fb65

16 Oct, 2025 4 commits
- [CI] Fix ROCm CI (#1043) · a79bc5c6
  Xuehai Pan authored Oct 16, 2025
```
* [CI] fix ROCm CI

* feat: add a hook to error out on no test runs
```
  a79bc5c6
- [Bugfix] Improves compatibility when checking for MPS availability in... · 1f4ffdb8
  Lei Wang authored Oct 16, 2025
```
[Bugfix] Improves compatibility when checking for MPS availability in different PyTorch builds. (#1051)
```
  1f4ffdb8
- Allow mma gemm for all cuda (#1047) · e3742d33
  Yichen Yan authored Oct 16, 2025
  
  e3742d33
- [Feature]: Add test for atomicadd auto vectorize and remove useless code (#1019) · 0ff4f427
  Yuqi Dong authored Oct 16, 2025
```
* update

* format

* rabbit
```
  0ff4f427
15 Oct, 2025 8 commits

[Refactor] Use `has_simt_copy` to decide whether to insert `set_max_nreg` (#982) · bd1c7b39
Yu Cheng authored Oct 16, 2025

bd1c7b39

[BugFix] Phase out dependency of Triton in sink examples to make CI happy (#1045) · 8f001e02

Tong WU authored Oct 16, 2025



* [BugFix] Phase out dependency of Triton in sink examples to make CI happy

- Added `benchmark_gqa_sink_fwd.py` and `benchmark_mha_sink_fwd.py` to evaluate performance of GQA and MHA attention mechanisms using Triton.
- Refactored existing attention sink implementations to remove Triton kernel definitions from the reference programs, streamlining the code.
- Updated input generation and benchmarking logic to enhance configurability and performance measurement.
- Improved overall structure and organization of the examples for better clarity and usability.

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

8f001e02

[CI][Refactor] Merge test CI workflow files into one (#973) · 8ce27782

Xuehai Pan authored Oct 15, 2025



* refactor: merge test CI workflow files into one

* chore: set `UV_INDEX_STRATEGY=unsafe-best-match`

* feat: add AST test with Python 3.8

* feat: implement manual caching mechanism for self-hosted runners

* refactor: simplify cache logic for self-hosted runners

* chore: clear uv cache on failure

* chore: print format.sh output to logs

* chore: improve uv caching

* chore: disable parallel test

* chore: use `PYTHONDEVMODE=1` in CI

* feat: enable coredump generation

* fix: fix perfbench condition

* Revert "feat: enable coredump generation"

This reverts commit c52da65cb572932e09905d08c43a39ec3cf47c54.

* chore: move example CI down

* Revert "chore: move example CI down"

This reverts commit 9d8e65055e01d955c5268a9a6705d270c2de0d57.

* chore: skip example `test_example_mha_sink_bwd_bhsd`

* chore: skip example `test_example_gqa_sink_bwd_bhsd`

* fix: fix example argument passing

* fix: loosen test criteria

* chore: rename `CMAKE_CONFIGURE_OPTIONS` -> `CLANG_TIDY_CMAKE_OPTIONS` for clarity

* feat: enable parallel testing

* chore: update pytest options

* remove skipped test as it has now been resolved

* chore: empty commit to re-trigger ci

* test for n 1

* chore: remove ` --numprocesses=1` option in example

* chore: disable failfast

* chore: update cibw selection

* fix: fix git submodule clone

* chore: update cibw commands

* fix: fix yapf multiprocessing

* chore: setup ccache for CIBW on macOS only

* chore: update comments

* chore: update artifact listing

* fix: do not fail if nvcc is not found in PATH

* fix: fix flash-attn installation

* chore: update dist workflow trigger

* chore: remove outdated comments

* chore(workflows/dist): simplify build matrix strategy

* fix: fix CUDA path finding

* fix: fix CUDA path finding

* chore: increase CI timeout

* ci: disable failfast

* fix: hide path prefix

* chore: more verbose

* chore: disable PR trigger for dist workflow

* fix: seed for tests

* fix: use nightly torch for ROCm tests

* chore: enable PR trigger for dist workflow

* chore: stop uploading debug wheels as artifacts in PR

* chore: do not run workflows in forks

* chore: housekeep requirements

* chore: use Nightly-ROCm-6.3 for CI

* chore: use Nightly-ROCm-6.4 for CI

* Update ROCm toolkit version to 7.0

* chore: restore previous rocm-ci.yml for test

* fix: cleanup PYTHONPATH

* chore: remove previous rocm-ci.yml

* ci fix

* chore: remove previous rocm-ci.yml

* chore: enable parallel example run

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: alex_xiao <xinyuxiao2024@gmail.com>

8ce27782

Fix bug and add AMD examples (#966) · 80665cd1

alex_xiao authored Oct 15, 2025

* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example...

80665cd1

[Language] Expose `T.get_warp_idx_sync` and `T.shuffle_elect` for efficient thread election (#989) · b78d8404

Lei Wang authored Oct 15, 2025



* Expose CUDA warp/lane intrinsics in TileLang frontend

* generalize warp indexing intrinsics and add coverage

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

b78d8404

[CUDA] Add pack functions for FP8 types (#967) · 32ddc1ac

LJC00118 authored Oct 15, 2025

* Remove an incorrect check

* add fp8 pack function

* code lint

* minor fix

* minor fix

* minor fix

* Minor fix

* Minor fix

32ddc1ac

[Env] Optimize the mechanism for locating `TL_LIBS` (#1038) · c67f73b0
Lei Wang authored Oct 15, 2025

c67f73b0

[TIR] Revert some changes of Pass `LowerIntrin` (#1035) · e5399527

Lei Wang authored Oct 15, 2025



* keep >> instead of /

* rethink replicate

* lint fix

* handle const int buffers

* rep fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

e5399527

14 Oct, 2025 8 commits

[CI] Disable buggy(maybe) warp specialized kernel ci test for H20 (#1033) · 5767475a
Lei Wang authored Oct 14, 2025

5767475a

[Bugfix] Recover code for flexible parallel (#1032) · eed320f5

Lei Wang authored Oct 14, 2025



* recover flex parallel process

* lint fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

eed320f5

[Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation (#1023) · 1e8f0b18

Tong WU authored Oct 14, 2025



* [Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation

* [Lint]: [pre-commit.ci] auto fixes [...]

* optimize amd ci

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

1e8f0b18

[CI] Removes debug print statements from the example. (#1030) · 2ada4eca

Cunxiao Ni authored Oct 14, 2025



* [CI] Removes debug print statements from the example.

* add parse args

* [Lint]: [pre-commit.ci] auto fixes [...]

* format

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

2ada4eca

[Language] Support chained assignments like 'a = b = c = 1' (#992) · e59e7f9a

Lei Wang authored Oct 14, 2025



* chained assignments

* test update

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

e59e7f9a
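
For readers unfamiliar with how a chained assignment differs from a plain one at the syntax level, the following is a minimal sketch using only Python's standard `ast` module. It does not show TileLang's parser; the helper name `assignment_targets` is hypothetical and exists only for this illustration.

```python
import ast


def assignment_targets(source):
    """Return the simple target names of the first statement in `source`.

    Hypothetical helper for illustration only: Python parses
    `a = b = c = 1` into a single `ast.Assign` node whose `targets`
    list holds one entry per name on the left-hand side.
    """
    node = ast.parse(source).body[0]
    if not isinstance(node, ast.Assign):
        raise ValueError("expected a plain assignment statement")
    # Keep only bare names; tuple or subscript targets are ignored here.
    return [t.id for t in node.targets if isinstance(t, ast.Name)]


print(assignment_targets("a = b = c = 1"))  # ['a', 'b', 'c']
print(assignment_targets("x = 1"))          # ['x']
```

A frontend that only handled the first entry of `targets` would accept `x = 1` but not `a = b = c = 1`, which is the kind of gap a change like this one would need to close.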

[Build] Prefer libs from local build dir (#1027) · 0f515b86

Yichen Yan authored Oct 14, 2025

* Load libs from build dir, if present, to support faster rebuild.

* typo

* upd

* refine check

* md lint

0f515b86

[Lint] Prefer American English spelling (#1022) · d684094b
Xuehai Pan authored Oct 14, 2025
```
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
```
d684094b
[Transform] Migrate `LowerIntrin` from tvm into tilelang (#999) · 7a5077e4
Lei Wang authored Oct 14, 2025
```
* Do not lower ceildiv to >>

* lint fix

* test fix

* fallback ceildiv changes
```
7a5077e4

13 Oct, 2025 4 commits

[CI] Removes redundant environment variable (#1020) · eb37e459

Cunxiao Ni authored Oct 14, 2025

* [CI] Removes redundant environment variable
Removes the `UV_INDEX_URL`

* trigger CI

* trigger CI

* trigger CI

* trigger CI

eb37e459

[Build] Migrate to scikit-build-core (#939) · d89ba5b8

Yichen Yan authored Oct 13, 2025



* cleanup

* init

* build first wheel that may not work

* build cython ext

* fix tvm build

* use sabi

* update rpath to support auditwheel

* pass editable build

* update ci

* fix warnings

* do not use ccache in self-hosted runner

* test local uv cache

* test pip index

* update lib search to respect new lib location

* fix

* update ci

* enable cuda by default

* update src map

* fix

* fix

* fix

* Generate version with backend and git information at build time

* copy tvm_cython to wheels

* fix tvm lib search

* fmt

* remove unused

* auto detect ccache

* add back backend-related files

* remove jit cython adaptor to simplify code

* fmt

* fix ci

* ci fix 2

* ci fix 3

* workaround metal

* ci fix 4

* fmt

* fmt

* Revert "ci fix 4"

This reverts commit d1de8291c3e40927955f3ad3cf87a75c78813676.

* tmp

* fix metal

* trivial cleanup

* add detailed build-time version for cuda

* add back mlc

* Restore wheel info and other trivial updates

* update

* fix cuda

* upd

* fix metal ci

* test for ga build

* test for nvidia/cuda

* test ubuntu 20

* fix

* fix

* Do not use `uv build`

* fix

* fix

* log toolchain version

* merge wheel

* update

* debug

* fix

* update

* skip rocm

* update artifacts each

* fix

* fix

* add mac

* fix cache

* fix cache

* fix cache

* reset and add comment

* upd

* fix git version

* update deps

* trivial update

* use in-tree build dir and install to src to speedup editable build

* Revert "use in-tree build dir and install to src to speedup editable build"

This reverts commit 6ab87b05c5eed811210136b8dca4fc3677dd51f2.

* add build-dir

* update docs

* remove old scripts

* [1/n] cleanup scripts

* [Lint]: [pre-commit.ci] auto fixes [...]

* fix and update

* wait for tvm fix

* revert some tmp fix

* fix

* fix

* spell

* doc update

* test cibuildwheel

* fix and test macos on ci

* Update .github/workflows/dist.yml
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>

* fix

* test ga event

* cleanup

* bump tvm to support api3

* test final version

* add cron

* Update .github/workflows/dist.yml
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>

* fix

* test ccache for metal cibuildwheel

* test newer macos

* finish

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>

d89ba5b8

[CI] Speed up sparse tensor core test via vectorized generating sparse data (#1009) · bab57f23
Lei Wang authored Oct 13, 2025

bab57f23
[Bugfix] Fix atomicadd auto vectorize identify var error (#883) · 340bfc50
Yuqi Dong authored Oct 13, 2025
```
* update

* update

* update

* update
```
340bfc50

12 Oct, 2025 3 commits

[Bugfix] Fallback `torch.accelerator.synchronize()` to `torch.cuda.synchronize()` (#987) · 4a229ddb
Yuqi Dong authored Oct 12, 2025
```
* [Refactor]: Add support for torch versions lower than 2.6.0

* update
```
4a229ddb
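
The fallback named in this commit title can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: the helper name `resolve_synchronize` is hypothetical, and the module is passed in as an argument so the sketch runs without PyTorch installed.

```python
from types import SimpleNamespace


def resolve_synchronize(torch_module):
    """Pick a device-synchronize callable from a torch-like module.

    Hypothetical helper: prefers `torch.accelerator.synchronize` when the
    `accelerator` namespace exists (PyTorch 2.6.0 and later, per the commit
    message) and otherwise falls back to `torch.cuda.synchronize`.
    """
    accelerator = getattr(torch_module, "accelerator", None)
    sync = getattr(accelerator, "synchronize", None)
    if callable(sync):
        return sync
    return torch_module.cuda.synchronize


# Stand-ins for a newer and an older PyTorch build (illustration only).
calls = []
newer = SimpleNamespace(
    accelerator=SimpleNamespace(synchronize=lambda: calls.append("accelerator")),
    cuda=SimpleNamespace(synchronize=lambda: calls.append("cuda")),
)
older = SimpleNamespace(
    cuda=SimpleNamespace(synchronize=lambda: calls.append("cuda")),
)

resolve_synchronize(newer)()
resolve_synchronize(older)()
print(calls)  # ['accelerator', 'cuda']
```

With PyTorch installed, the same helper could be called as `resolve_synchronize(torch)()`.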
[BugFix] Robust gemm policy for sparse_mla_fwd in Hopper and Ada Lovelace architectures (#984) · fc41463c
Zhengju Tang authored Oct 12, 2025
```
* [BugFix] Robust gemm policy for sparse_mla_fwd in Hopper and Ada Lovelace architectures

* [Lint]
```
fc41463c

[Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974) (#976) · b0b5347a

Degeneracy-Evil authored Oct 12, 2025



* [Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974)

Enhanced CUDA detection to recognize NVIDIA HPC SDK installations:
- Added path check for nvhpc in nvcc binary path
- Added fallback scan for default nvhpc paths:
  /opt/nvidia/hpc_sdk/Linux_x86_64
- Maintained backward compatibility with standard CUDA installations

Verification:
- Tested on Ubuntu 24.04 with NVIDIA HPC SDK 25.7
- Confirmed detection works without manual CUDA_HOME or CUDA_PATH setting

Fixes #974

* [Bugfix] Fix CUDA home detection logic

* [Bugfix] Safely handle None cuda_home during CUDA detection

Adds a check for None before validating the CUDA home path to prevent errors when the path is not set.

* [Bugfix] Fix CUDA detection edge cases in nvhpc support (#974)

- Improved nvhpc path detection logic
- Added None check for cuda_home to avoid crashes
- Maintained existing CUDA installation compatibility

Fixes #974

* chore: rerun CI

---------
Co-authored-by: NaNExist <138002947+NaNExist@users.noreply.github.com>

b0b5347a
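
The detection order described in the commit above (nvcc path check, default nvhpc path scan, `None` guard) can be sketched roughly as below. This is not the project's implementation. The function names are hypothetical, and the assumed HPC SDK layout (`<version>/compilers/bin/nvcc` next to `<version>/cuda`) is an assumption made for this sketch. The filesystem calls are injectable so the logic can be exercised without a CUDA installation.

```python
import os
import shutil

# Default install root quoted in the commit message.
NVHPC_DEFAULT_ROOT = "/opt/nvidia/hpc_sdk/Linux_x86_64"


def looks_like_nvhpc(nvcc_path):
    """Return True if the nvcc path appears to live inside an HPC SDK tree."""
    lowered = nvcc_path.lower()
    return "nvhpc" in lowered or "hpc_sdk" in lowered


def find_cuda_home(env, which=shutil.which, isdir=os.path.isdir,
                   listdir=os.listdir):
    """Hypothetical helper: return a CUDA home directory, or None."""
    # 1. Explicit user settings take priority.
    cuda_home = env.get("CUDA_HOME") or env.get("CUDA_PATH") or None
    if cuda_home is None:
        nvcc_path = which("nvcc")
        if nvcc_path is not None:
            bin_dir = os.path.dirname(nvcc_path)
            if looks_like_nvhpc(nvcc_path):
                # Assumed layout: <version>/compilers/bin/nvcc, <version>/cuda
                version_dir = os.path.dirname(os.path.dirname(bin_dir))
                cuda_home = os.path.join(version_dir, "cuda")
            else:
                # Standard layout: <cuda_home>/bin/nvcc
                cuda_home = os.path.dirname(bin_dir)
        elif isdir(NVHPC_DEFAULT_ROOT):
            # 2. Fallback: scan the default HPC SDK root, newest name first.
            for version in sorted(listdir(NVHPC_DEFAULT_ROOT), reverse=True):
                candidate = os.path.join(NVHPC_DEFAULT_ROOT, version, "cuda")
                if isdir(candidate):
                    cuda_home = candidate
                    break
    # 3. Guard against None before validating the path.
    if cuda_home is None or not isdir(cuda_home):
        return None
    return cuda_home
```

Passing `which`, `isdir`, and `listdir` as parameters is a design choice for this sketch only; it keeps the lookup order testable on machines that have neither CUDA nor the HPC SDK.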

11 Oct, 2025 6 commits

[Feature][Example] Support TMA reduce operation and update GQA bwd example (#969) · 05507037

Yu Cheng authored Oct 11, 2025



* [Feature][Example] Support TMA reduce operation and update GQA bwd example

* move GQA bwd with TMA reduce to new example

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

05507037

[Bugfix] Use `access_ptr("r")` instead of `access_ptr("w")` for correct pipeline analysis (#983) · 77b9d08e
Lei Wang authored Oct 11, 2025
```
* remove debug print

* pipeline fix

* use the correct buffer access scope
```
77b9d08e
[Typo] Remove debug print (#980) · 117f2b81
Lei Wang authored Oct 11, 2025

117f2b81

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group... · ddfaac36

Lei Wang authored Oct 11, 2025

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend (#977)

* InjectFenceProxy docs and tests

  - annotate proxy fence injector with context comments for async/generic detection
  - add compiler internals doc covering the pass mechanics and link it in docs index
  - repair fence proxy test by fixing descriptor init usage and fence counter logic

* do not consider call_extern as async.

* doc update.

* reduce test size for sparse mla

ddfaac36

[Language] Enhance `T.alloc_var` for AugAssign and AnnAssign (#979) · 77e31e52
Lei Wang authored Oct 11, 2025
```
* feat: add parser overrides for local.var aug assign.

* lint fix
```
77e31e52
[TileOp] Implement `CumSum1D` (#978) · 747381ae
Lei Wang authored Oct 11, 2025
```
* support cumsum-1d

* cumsum 1d support
```
747381ae