Commits · bab57f23c8c92c53de0ff8054a0777284ff9e9fd · OpenDAS / tilelang


13 Oct, 2025 2 commits
- [CI] Speed up sparse tensor core test via vectorized generating sparse data (#1009) · bab57f23
  Lei Wang authored Oct 13, 2025
  
  bab57f23
- [Bugfix] Fix atomicadd auto vectorize identify var error (#883) · 340bfc50
  Yuqi Dong authored Oct 13, 2025
```
* update

* update

* update

* update
```
  340bfc50
12 Oct, 2025 3 commits

[Bugfix] Fallback `torch.accelerator.synchronize()` to `torch.cuda.synchronize()` (#987) · 4a229ddb
Yuqi Dong authored Oct 12, 2025
```
* [Refactor]:Add support for torch version lower than 2.6.0

* update
```
4a229ddb
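A minimal sketch of the fallback named in the title of #987, assuming that `torch.accelerator` is simply absent on torch versions below 2.6.0 (the commit note says only that older versions are now supported). The helper name `pick_synchronize` and the stand-in modules are hypothetical and are not taken from the tilelang source; they let the sketch run without torch installed.

```python
from types import SimpleNamespace


def pick_synchronize(torch_module):
    """Return a synchronize callable from a torch-like module.

    Prefers `accelerator.synchronize` when it exists and falls back to
    `cuda.synchronize` otherwise. (Hypothetical helper, for illustration.)
    """
    accelerator = getattr(torch_module, "accelerator", None)
    sync = getattr(accelerator, "synchronize", None)
    if callable(sync):
        return sync
    return torch_module.cuda.synchronize


# Stand-ins for an older and a newer torch module (assumptions, not real torch).
calls = []
old_torch = SimpleNamespace(
    cuda=SimpleNamespace(synchronize=lambda: calls.append("cuda")),
)
new_torch = SimpleNamespace(
    accelerator=SimpleNamespace(synchronize=lambda: calls.append("accelerator")),
    cuda=SimpleNamespace(synchronize=lambda: calls.append("cuda")),
)

pick_synchronize(old_torch)()
pick_synchronize(new_torch)()
print(calls)  # ['cuda', 'accelerator']
```

With a real installation, one would pass the imported `torch` module itself to the helper.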
[BugFix] Robust gemm policy for sparse_mla_fwd in Hopper and Ada Lovelace architectures (#984) · fc41463c
Zhengju Tang authored Oct 12, 2025
```
* [BugFix] Robust gemm policy for sparse_mla_fwd in Hopper and Ada Lovelace architectures

* [Lint]
```
fc41463c

[Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974) (#976) · b0b5347a

Degeneracy-Evil authored Oct 12, 2025



* [Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974)

Enhanced CUDA detection to recognize NVIDIA HPC SDK installations:
- Added path check for nvhpc in nvcc binary path
- Added fallback scan for default nvhpc paths:
  /opt/nvidia/hpc_sdk/Linux_x86_64
- Maintained backward compatibility with standard CUDA installations

Verification:
- Tested on Ubuntu 24.04 with NVIDIA HPC SDK 25.7
- Confirmed detection works without manual CUDA_HOME or CUDA_PATH setting

Fixes #974

* [Bugfix] Fix CUDA home detection logic

* [Bugfix] Safely handle None cuda_home during CUDA detection

Adds a check for None before validating the CUDA home path to prevent errors when the path is not set.

* [Bugfix] Fix CUDA detection edge cases in nvhpc support (#974)

- Improved nvhpc path detection logic
- Added None check for cuda_home to avoid crashes
- Maintained existing CUDA installation compatibility

Fixes #974

* chore: rerun CI

---------
Co-authored-by: NaNExist <138002947+NaNExist@users.noreply.github.com>

b0b5347a
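The detection steps described in #976 can be sketched roughly as follows. This is not the tilelang implementation: the function names are hypothetical, and the assumed HPC SDK layout (`<root>/<version>/compilers/bin/nvcc` next to `<root>/<version>/cuda`) is an illustration that should be checked against a real installation. The fallback scan of the default nvhpc path is left out for brevity.

```python
import posixpath

# Default HPC SDK root quoted in the commit message.
NVHPC_DEFAULT_ROOT = "/opt/nvidia/hpc_sdk/Linux_x86_64"


def guess_cuda_home(nvcc_path):
    """Guess a CUDA home directory from the path of an nvcc binary.

    Returns None when no nvcc path is known, so callers can fall back
    to other detection strategies instead of crashing.
    """
    if nvcc_path is None:
        return None
    bin_dir = posixpath.dirname(nvcc_path)
    prefix = posixpath.dirname(bin_dir)
    if "hpc_sdk" in nvcc_path or "nvhpc" in nvcc_path:
        # Assumed layout: <version>/compilers/bin/nvcc beside <version>/cuda.
        return posixpath.join(posixpath.dirname(prefix), "cuda")
    # Standard layout: <cuda_home>/bin/nvcc.
    return prefix


def is_valid_cuda_home(cuda_home, exists=posixpath.isdir):
    """Check for None before touching the filesystem."""
    return cuda_home is not None and exists(cuda_home)


print(guess_cuda_home("/usr/local/cuda/bin/nvcc"))
print(guess_cuda_home(NVHPC_DEFAULT_ROOT + "/25.7/compilers/bin/nvcc"))
print(is_valid_cuda_home(None))
```

The `exists` parameter is injectable only so the check can be exercised without a real CUDA installation.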

11 Oct, 2025 7 commits

[Feature][Example] Support TMA reduce operation and update GQA bwd example (#969) · 05507037

Yu Cheng authored Oct 11, 2025



* [Feature][Example] Support TMA reduce operation and update GQA bwd example

* move GQA bwd with TMA reduce to new example

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

05507037

[Bugfix] Use `access_ptr("r")` instead of `access_ptr("w")` for correct pipeline analysis (#983) · 77b9d08e
Lei Wang authored Oct 11, 2025
```
* remove debug print

* pipeline fix

* use the correct buffer access scope
```
77b9d08e
[Typo] Remove debug print (#980) · 117f2b81
Lei Wang authored Oct 11, 2025

117f2b81

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group... · ddfaac36

Lei Wang authored Oct 11, 2025

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend (#977)

* InjectFenceProxy docs and tests

  - annotate proxy fence injector with context comments for async/generic detection
  - add compiler internals doc covering the pass mechanics and link it in docs index
  - repair fence proxy test by fixing descriptor init usage and fence counter logic

* do not consider call_extern as async.

* doc update.

* reduce test size for sparse mla

ddfaac36

[Language] Enhance `T.alloc_var` for AugAssign and AnnAssign (#979) · 77e31e52
Lei Wang authored Oct 11, 2025
```
* feat: add parser overrides for local.var aug assign.

* lint fix
```
77e31e52
[TileOp] Implement `CumSum1D` (#978) · 747381ae
Lei Wang authored Oct 11, 2025
```
* support cumsum-1d

* cumsum 1d support
```
747381ae

[CI][Refactor] Refactor non-test CI workflow files (#971) · 0ae183db

Xuehai Pan authored Oct 11, 2025



* chore: rename CI workflow files

* chore: rename perfbench bot file

* refactor: rewrite comment passing via step output and post with github-script

* chore: rename pr-reminder bot file

* chore: use `pre-commit` instead of `format.sh`

* chore: rename docs workflow file

* refactor: rewrite docs workflow file

* chore: use `git clean -dxf -e <exclude>`
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix: fix perfbench condition

---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

0ae183db

10 Oct, 2025 6 commits

[Bugfix] Fix dummy kernel compilation (#962) · 7913fb1d

Chaofan Lin authored Oct 10, 2025



* [Bugfix] Fix visit EvaluateNode in BufferGemmCollector

* address comment

* lint

* fix

* Add TileLang SplitHostDevice pass and tighten issue 830 test names

* lint fix

* enhance for kernel value unpacking.

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

7913fb1d

[Doc] Install docs add docker install method (#961) · 6031416f
Xiaoyu Zhang authored Oct 10, 2025

6031416f

[CI] add `pre-commit` integration (#955) · 8fe35402

Xuehai Pan authored Oct 10, 2025



* chore: misc cleanup

* feat: add pre-commit config

* chore: update lint dependencies

* style: fix lint issues

* feat: add pre-commit hooks

* fix: fix typos

* chore: update .gitattributes

* [Lint]: [pre-commit.ci] auto fixes [...]

* docs: update CONTRIBUTING.md

* chore: update default venv name

* chore: revert and exclude CUDA files

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

8fe35402

[Bugfix] Do not force inline let stmt (#947) · f8ae600c

Lei Wang authored Oct 10, 2025

* remove debug print

* Remove inline let expressions from the LowerAndLegalize function in phase.py

* add test

* Update sparse MLA examples to support SKV adjustment and correctness checks

- Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
- Added check_correctness parameter to test functions for validation of outputs.
- Updated test cases to reflect new SKV values and correctness checks.

* reduce test shape

* Update documentation structure and refactor main function parameters in example_fusedmoe_tilelang.py

- Added a new section for compiler internals in the documentation.
- Refactored the main function in example_fusedmoe_tilelang.py to accept parameters for hidden dimensions, expert configurations, and batch/sequence sizes, improving flexibility and readability.

* Update buffer access checks in merge_shared_memory_allocations.cc

- Changed the condition for buffer access from less than (<) to less than or equal to (<=) to allow access at the same scope level.
- Adjusted the logic for determining the access level when touching buffers to ensure correct handling of scope levels.

* lint fix

* Support pipeline with LetStmt

* lint fix

* Fix LowerTileOp let handling to avoid LetInline dependency

  - inline let-bound BufferLoad nodes via resolver helpers and structured return
  - remap layouts/buffers using original data vars and only rewrite when needed
  - update pipeline planner to understand let-bound address_of buffers
  - document the new inline behaviour in docs/let_inline_fix.md

* fix for wgmma pipeline with let binding

* lint fix

* test fix

* reduce smem usage.

* let binding enhancement

* fix for dpgm

* fix simplify

* lint fix

* use tilelang.Simplify instead of tir.Simplify

* Add TL_FORCE_LET_INLINE pass config and gate eager LetInline usage

  - register the new config in builtin headers/registration
  - add helper to pipeline enabling LetInline based on pass context
  - document LetStmt inlining controls and usage

f8ae600c

[Example] Add support for `bfloat16` and user-defined `sm_scale` in attention sink examples (#924) · 7cd0da99

Tong WU authored Oct 10, 2025



* revert split+sum template for MHA backward

* lint

* Update example_mha_bwd.py

* Update example_mha_bwd_wgmma_pipelined.py

* Refactor attention sink examples to support bf16 and user-defined softmax scale

* fix typos

* Adding compile flags for fast math optimizations and enabling BF16 support in both GQA and MHA backward implementations.

* Update backward configuration for GQA and MHA examples to align with flash attention

* Refactor GQA backward implementation to improve atomic add performance

* Allow for slightly larger numerical error for bf16

* upd readme to show bf16 benchmark results

* lint

* fix ci and lint

* fix comments and lint

* refactor atomic add

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

7cd0da99

[Docs] add CODE_OF_CONDUCT.md (#965) · 8f07b9b0

Xuehai Pan authored Oct 10, 2025



* [Docs] add CODE_OF_CONDUCT.md

* Update CODE_OF_CONDUCT.md

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

8f07b9b0

09 Oct, 2025 10 commits

[TileOp] Implement WGMMA for T.gemm_v2 (#813) · a13cde28

Lei Wang authored Oct 10, 2025

* [Feature] Introduce WGMMA support and enhance GEMM layout handling

- Added support for the WGMMA intrinsic in the TileLang framework, enabling efficient matrix multiplication on newer architectures.
- Refactored GEMM layout functions to accept a boolean parameter for K dimension handling, improving flexibility in layout generation.
- Updated layout inference logic to accommodate new WGMMA configurations and ensure compatibility with existing GEMM operations.
- Enhanced Python bindings for layout functions, allowing for better integration and usability in user-defined operations.
- Improved documentation for layout functions and GEMM operations to clarify usage and parameters.

These changes enhance the performance and usability of GEMM operations, particularly for advanced architectures, while maintaining backward compatibility with existing implementations.

* [Refactor] Clean up code formatting and enhance layout function readability

- Improved code formatting across multiple files for better readability, including consistent indentation and line breaks.
- Updated layout function signatures to enhance clarity, particularly in `gemm_layouts.cc`, `layout.cc`, and `layout.h`.
- Refactored lambda functions in `builtin.cc` and `gemm_py.cc` for improved structure and maintainability.
- Enhanced comments and documentation in layout-related files to clarify usage and parameters.

These changes contribute to a cleaner codebase and improved maintainability of layout functions in the TileLang framework.

* [Feature] Add descriptor initialization and offset manipulation for WGMMA

- Introduced new TileLang builtins `initialize_descriptor` and `increase_descriptor_offset` to facilitate descriptor management for WGMMA operations.
- Updated `builtin.cc` and `builtin.h` to define and document the new builtins, enhancing the framework's capabilities for descriptor handling.
- Modified `codegen_cuda.cc` and `ptx.cc` to integrate the new builtins into the code generation process, ensuring proper assembly generation for WGMMA operations.
- Enhanced the `GemmWGMMA` class to utilize the new descriptor functionalities, improving the efficiency of matrix multiplication operations.
- Updated related tests and documentation to reflect the new features and ensure comprehensive coverage.

These changes enhance the TileLang framework's support for advanced matrix operations on newer architectures, improving performance and usability.

* [Refactor] Improve code formatting and readability in various files

- Enhanced code formatting across multiple files for better readability, including consistent indentation and line breaks.
- Updated function signatures and comments in `builtin.h`, `codegen_cuda.cc`, and `ptx.cc` to improve clarity.
- Refactored descriptor initialization and offset manipulation functions in `builtin.py` and `wgmma_macro_generator.py` for improved structure.
- Cleaned up unnecessary whitespace and improved alignment in `common.h` and `allocate.py`.

These changes contribute to a cleaner and more maintainable codebase in the TileLang framework.

* [Update] Update subproject commit and refactor layout function call

- Updated the subproject commit for `cutlass` to indicate a dirty state.
- Refactored the `UpdateAnalyzer` function in `layout.cc` to call `LayoutNode::getVarMap()` instead of `getVarMap()`, improving clarity and ensuring proper context for variable mapping.

These changes enhance the maintainability and clarity of the layout handling in the TileLang framework.

* support more data types

* gemm_rs support

* lint fix

* wgmma wrapper

* Remove debug logging for wgmma assembly code and refactor swizzle byte size calculations in wgmma macro generator. Enhanced handling of leading and stride byte offsets based on swizzle mode, improving clarity and performance in tensor core intrinsic emissions.

* Refactor GEMM layout functions to replace 'kfactor' with 'k_inner' for improved clarity and consistency. Update includes necessary changes in error messages for Hopper and Sm100 layouts. Additionally, include a new header for CUTE utilities in common.h.

* Comprehensively support WGMMA GEMM SS

* remove debug print

* lint fix

* remove debug print

* reduce bwd test shape

* lint fix

* clear cache for pytest

* lint fix

* Update sparse MLA examples to support SKV adjustment and correctness checks

- Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
- Added check_correctness parameter to test functions for validation of outputs.
- Updated test cases to reflect new SKV values and correctness checks.

* test fix

* adjust test case

* test fix

* skip some test currently

a13cde28

[CI]: Bump actions/checkout from 2 to 5 (#953) · 10adb79f

dependabot[bot] authored Oct 09, 2025

Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 5.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v2...v5)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

10adb79f

[CI]: Bump actions/github-script from 7 to 8 (#954) · 5d881a57

dependabot[bot] authored Oct 09, 2025

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 8.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v8)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

5d881a57

[CI]: Bump astral-sh/setup-uv from 6 to 7 (#952) · b6f90d25

dependabot[bot] authored Oct 09, 2025

Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from 6 to 7.
- [Release notes](https://github.com/astral-sh/setup-uv/releases)
- [Commits](https://github.com/astral-sh/setup-uv/compare/v6...v7)

---
updated-dependencies:
- dependency-name: astral-sh/setup-uv
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

b6f90d25

[CI]: Bump actions/setup-python from 2 to 6 (#951) · d8fedc17

dependabot[bot] authored Oct 09, 2025

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 2 to 6.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v2...v6)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

d8fedc17

[Bugfix][Doc] Add astroid version constraint to requirements.txt (#958) · 2dea17e5
Wenhao Xie authored Oct 09, 2025

2dea17e5
[Bugfix] Fix type object is not subscriptable in py38 (#959) · 6b2bb310
Xiaoyu Zhang authored Oct 09, 2025

6b2bb310
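For context on #959: on Python 3.8, subscripting a builtin type at runtime (for example `list[str]` in an annotation) raises `TypeError: 'type' object is not subscriptable`, because builtin generics became subscriptable only in Python 3.9. The commit does not say which fix was applied; the sketch below shows one common option, the `typing` aliases, on a hypothetical function.

```python
from typing import Dict, List


# On Python 3.8, `dict[int, list[str]]` here would fail at definition time;
# the typing aliases work on 3.8 and later.
def group_by_length(words: List[str]) -> Dict[int, List[str]]:
    """Group words by their length, keeping input order within each group."""
    groups: Dict[int, List[str]] = {}
    for word in words:
        groups.setdefault(len(word), []).append(word)
    return groups


print(group_by_length(["a", "bb", "c"]))  # {1: ['a', 'c'], 2: ['bb']}
```

Another option is `from __future__ import annotations`, which defers annotation evaluation, although it does not help where the subscripted type is evaluated at runtime outside an annotation.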
[CI] auto-cancel in-progress PR CI when new commits are pushed (#956) · 9a7cda42
Xuehai Pan authored Oct 09, 2025

9a7cda42
Modify the SM architecture number to support Thor’s sm110. (#957) · 07f62104
Shawn Liu authored Oct 09, 2025

07f62104
[CI] enable dependabot for GHA workflows (#950) · f6d4bd3a
Xuehai Pan authored Oct 09, 2025
```
* chore: add .editorconfig

* feat: enable dependabot for GHA workflows
```
f6d4bd3a

07 Oct, 2025 3 commits

[Backend] Add metal backend (#799) · 7fb06776

Yichen Yan authored Oct 07, 2025



* Reset

* Fix other CUDA issue

* fmt

* fmt

* fix cuda error

* fix

* fix

* fmt

* cleanup

* fix

* remove copyright

* trivial update

* readme update

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

7fb06776

[Refactor] Refine nvrtc compile related check style (#945) · 394e17d0
Xiaoyu Zhang authored Oct 07, 2025
```
* unify nvrtc check style

* unify nvrtc check style

* unify nvrtc check style
```
394e17d0

[Enhancement] Add buffer load copy functions and improve copy logic in tilelang (#946) · c61971e8

Lei Wang authored Oct 07, 2025

- Introduced new functions for buffer load copy with stride and parallel execution.
- Enhanced the copy logic in `copy.py` to simplify nested if statements for BufferLoad nodes.
- Added corresponding test cases for the new buffer load functionalities.

c61971e8

06 Oct, 2025 3 commits

[Profiler] Adds CUPTI profiler support (#936) · 91d5ef54

Cunxiao Ni authored Oct 06, 2025



* [Profiler]Adds CUPTI profiler support

* format

* refactor cupti profiler

* format

* refactor

* refactor

* fix lint

* fix lint

* refactor

* add profiler tests

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

91d5ef54

[Example] Add sparse mla bwd example for deepseek_v32 (#919) · ac8c9afc

Zhichen Zeng authored Oct 06, 2025



* Add sparse mla bwd example

* add bwd into test

* Update README with bwd impl

* comment

* format fix

* lint fix

* fwd fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

ac8c9afc

[Example] Revert the atomic/split&sum templates in MHA backward examples (#943) · 481cae42

Tong WU authored Oct 06, 2025



* revert split+sum template for MHA backward

* lint

* Update example_mha_bwd.py

* Update example_mha_bwd_wgmma_pipelined.py

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

481cae42

05 Oct, 2025 3 commits
- [Example] Disable TMA and enable FastMath for NSA Examples (#941) · 3aecab8f
  Lei Wang authored Oct 05, 2025
```
* tma disable

* int64 cast fix.
```
  3aecab8f
- [Example] Introduce split+sum template, and optimize `atomic_add` performance... · 557589ff
  Lei Wang authored Oct 05, 2025
```
[Example] Introduce split+sum template, and optimize `atomic_add` performance for bwd examples (#940)

* example fix

* lint fix

* bug fix

* reduce test size.
```
  557589ff
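The two accumulation strategies named in the title of #940 can be contrasted in a small sketch. It rests on an assumption about what the terms mean here: in "atomic add" style every contributor adds into one shared buffer, while in "split+sum" style each contributor writes a private partial buffer that a separate reduction sums afterwards. The function names are hypothetical and the code is plain Python, not the tilelang examples.

```python
def accumulate_atomic(partials, size):
    """All contributors add into a single shared buffer."""
    out = [0.0] * size
    for part in partials:
        for i, value in enumerate(part):
            out[i] += value  # on a GPU this would be an atomic add
    return out


def accumulate_split_sum(partials, size):
    """Each contributor keeps its own slot; a later pass sums the slots."""
    slots = [list(part) for part in partials]  # "split": private buffers
    return [sum(slot[i] for slot in slots) for i in range(size)]  # "sum"


# Three contributors (e.g. heads sharing one gradient), two elements each.
partials = [[1.0, 2.0], [0.5, 0.25], [3.0, 4.0]]
print(accumulate_atomic(partials, 2))     # [4.5, 6.25]
print(accumulate_split_sum(partials, 2))  # [4.5, 6.25]
```

Under this reading, split+sum trades extra memory for the partial buffers against contention on the shared buffer.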
- [Enhancement] Fix lint to improve grouped GEMM performance with TMA (#938) · 95170ab7
  Cunxiao Ni authored Oct 05, 2025
```
* [Example] Fix lint to improve grouped GEMM performance with TMA

* fix lint
```
  95170ab7
04 Oct, 2025 3 commits

[Enhancement] Enhance and add new GQA backward examples for Hopper (#930) · b31de0ce

Tong WU authored Oct 05, 2025

* [Enhancement] Enhance the GQA backward kernel by calculating `dq` and `dv` via copy&sum

* [Example] Implement GQA backward example for Hopper with customized tiling and pipeline

* [Example] Add relevant tests

* Fix all typos of wrong shape of `V_shared` in macros

b31de0ce

[Example] Add correctness assert into dsa example (#937) · d5c88afa
Lei Wang authored Oct 04, 2025

d5c88afa

[Example] Optimize online_softmax example (#934) · 242cb457

lijinpei authored Oct 04, 2025



* [Example] Optimize online_softmax example

- Y should be output in float16.
- BN needs to be equal to N to be really online.
- On my H100 machine, this increases the speedup from 1.424x to 2.788x.

* enhance

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

242cb457
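As background for #934, the general online-softmax technique can be sketched in plain Python: the input is consumed block by block while a running maximum and a rescaled running sum are maintained, so the normalizer is ready after a single pass. This is an illustration of the technique only, not the tilelang example; the function name and the `block` parameter are hypothetical.

```python
import math


def online_softmax(xs, block=4):
    """Softmax whose normalizer is built in one blockwise pass."""
    running_max = -math.inf
    running_sum = 0.0
    for start in range(0, len(xs), block):
        chunk = xs[start:start + block]
        new_max = max(running_max, max(chunk))
        # Rescale the old sum to the new maximum, then add this block.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in chunk
        )
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in xs]


values = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5]
probs = online_softmax(values, block=2)

# Reference: the usual two-pass softmax with the global maximum.
peak = max(values)
denom = sum(math.exp(v - peak) for v in values)
reference = [math.exp(v - peak) / denom for v in values]
print(all(math.isclose(p, r) for p, r in zip(probs, reference)))
```

Subtracting the running maximum before exponentiating keeps the intermediate values bounded, which is the usual motivation for this formulation.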