Commits · b0b5347a141439f3b122f54d12212e6ecd1b5a24 · OpenDAS / tilelang

12 Oct, 2025 1 commit

[Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974) (#976) · b0b5347a

Degeneracy-Evil authored Oct 12, 2025



* [Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974)

Enhanced CUDA detection to recognize NVIDIA HPC SDK installations:
- Added path check for nvhpc in nvcc binary path
- Added fallback scan for default nvhpc paths:
  /opt/nvidia/hpc_sdk/Linux_x86_64
- Maintained backward compatibility with standard CUDA installations

Verification:
- Tested on Ubuntu 24.04 with NVIDIA HPC SDK 25.7
- Confirmed detection works without manual CUDA_HOME or CUDA_PATH setting

Fixes #974

* [Bugfix] Fix CUDA home detection logic

* [Bugfix] Safely handle None cuda_home during CUDA detection

Adds a check for None before validating the CUDA home path to prevent errors when the path is not set.

* [Bugfix] Fix CUDA detection edge cases in nvhpc support (#974)

- Improved nvhpc path detection logic
- Added None check for cuda_home to avoid crashes
- Maintained existing CUDA installation compatibility

Fixes #974

* chore: rerun CI

---------
Co-authored-by: NaNExist <138002947+NaNExist@users.noreply.github.com>

b0b5347a
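
The detection order described in the commit above (environment variables, then the `nvcc` location with an HPC SDK path check, then a scan of the default HPC SDK root) can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper names `cuda_home_from_nvcc` and `find_cuda_home`, the `hpc_sdk`/`nvhpc` substring test, and the assumed `<root>/<version>/compilers/bin/nvcc` and `<root>/<version>/cuda` layout are all assumptions.

```python
import os
import shutil
from pathlib import Path
from typing import Optional

# Default HPC SDK root named in the commit message.
NVHPC_DEFAULT_ROOT = Path("/opt/nvidia/hpc_sdk/Linux_x86_64")


def cuda_home_from_nvcc(nvcc_path: str) -> str:
    """Derive a CUDA home from an nvcc path (hypothetical helper).

    Standard layout:          <cuda_home>/bin/nvcc
    HPC SDK layout (assumed): <root>/<version>/compilers/bin/nvcc,
                              with the toolkit in <root>/<version>/cuda
    """
    nvcc = Path(nvcc_path)
    if "hpc_sdk" in nvcc.parts or "nvhpc" in nvcc_path:
        # compilers/bin/nvcc -> go up to <root>/<version>, then into cuda/
        return str(nvcc.parents[2] / "cuda")
    # bin/nvcc -> go up to <cuda_home>
    return str(nvcc.parents[1])


def find_cuda_home(env=None, which=shutil.which,
                   nvhpc_root: Path = NVHPC_DEFAULT_ROOT) -> Optional[str]:
    """Return a CUDA home directory, or None if nothing usable is found."""
    env = os.environ if env is None else env

    # 1. Explicit configuration wins.
    cuda_home = env.get("CUDA_HOME") or env.get("CUDA_PATH")

    # 2. Otherwise derive it from the nvcc found on PATH.
    if cuda_home is None:
        nvcc = which("nvcc")
        if nvcc is not None:
            cuda_home = cuda_home_from_nvcc(nvcc)

    # 3. Fallback scan of the default HPC SDK root. This picks the
    #    lexicographically last version directory; a real implementation
    #    would likely compare version numbers instead.
    if cuda_home is None and nvhpc_root.is_dir():
        candidates = sorted(p / "cuda" for p in nvhpc_root.iterdir()
                            if (p / "cuda").is_dir())
        if candidates:
            cuda_home = str(candidates[-1])

    # Guard against None before validating the path.
    if cuda_home is not None and not os.path.isdir(cuda_home):
        return None
    return cuda_home
```

Passing `env` and `which` as parameters keeps the sketch testable without touching the real environment; `find_cuda_home()` with no arguments uses `os.environ` and `shutil.which`.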

11 Oct, 2025 7 commits

[Feature][Example] Support TMA reduce operation and update GQA bwd example (#969) · 05507037

Yu Cheng authored Oct 11, 2025



* [Feature][Example] Support TMA reduce operation and update GQA bwd example

* move GQA bwd with TMA reduce to new example

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

05507037

[Bugfix] Use `access_ptr("r")` instead of `access_ptr("w")` for correct pipeline analysis (#983) · 77b9d08e
Lei Wang authored Oct 11, 2025
```
* remove debug print

* pipeline fix

* use the correct buffer access scope
```
77b9d08e
[Typo] Remove debug print (#980) · 117f2b81
Lei Wang authored Oct 11, 2025

117f2b81

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group... · ddfaac36

Lei Wang authored Oct 11, 2025

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend (#977)

* InjectFenceProxy docs and tests

  - annotate proxy fence injector with context comments for async/generic detection
  - add compiler internals doc covering the pass mechanics and link it in docs index
  - repair fence proxy test by fixing descriptor init usage and fence counter logic

* do not consider call_extern as async.

* doc update.

* reduce test size for sparse mla

ddfaac36

[Language] Enhance `T.alloc_var` for AugAssign and AnnAssign (#979) · 77e31e52
Lei Wang authored Oct 11, 2025
```
* feat: add parser overrides for local.var aug assign.

* lint fix
```
77e31e52
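
`AugAssign` and `AnnAssign` in the commit title above are Python AST node names. The snippet below uses only the standard `ast` module to show which source forms produce them; it says nothing about how TileLang's parser handles these nodes, which the commit message does not describe.

```python
import ast

source = """
x: int = 0      # annotated assignment
x += 1          # augmented assignment
"""

tree = ast.parse(source)

# Each top-level statement is one AST node; report its node type.
kinds = [type(stmt).__name__ for stmt in tree.body]
print(kinds)  # ['AnnAssign', 'AugAssign']

# The augmented assignment records its operator as a separate node.
op_name = type(tree.body[1].op).__name__
print(op_name)  # Add
```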
[TileOp] Implement `CumSum1D` (#978) · 747381ae
Lei Wang authored Oct 11, 2025
```
* support cumsum-1d

* cumsum 1d support
```
747381ae
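
For reference, a 1D cumulative sum is an inclusive prefix sum: output element `i` is the sum of input elements `0..i`. The plain-Python sketch below only pins down those semantics; the function name `cumsum_1d` is hypothetical and this is not the TileLang operator.

```python
from itertools import accumulate
from typing import List, Sequence


def cumsum_1d(values: Sequence[float]) -> List[float]:
    """Inclusive prefix sum: out[i] = values[0] + ... + values[i]."""
    return list(accumulate(values))


print(cumsum_1d([1, 2, 3, 4]))  # [1, 3, 6, 10]
```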

[CI][Refactor] Refactor non-test CI workflow files (#971) · 0ae183db

Xuehai Pan authored Oct 11, 2025



* chore: rename CI workflow files

* chore: rename perfbench bot file

* refactor: rewrite comment passing via step output and post with github-script

* chore: rename pr-reminder bot file

* chore: use `pre-commit` instead of `format.sh`

* chore: rename docs workflow file

* refactor: rewrite docs workflow file

* chore: use `git clean -dxf -e <exclude>`
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix: fix perfbench condition

---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

0ae183db

10 Oct, 2025 6 commits

[Bugfix] Fix dummy kernel compilation (#962) · 7913fb1d

Chaofan Lin authored Oct 10, 2025



* [Bugfix] Fix visit EvaluateNode in BufferGemmCollector

* address comment

* lint

* fix

* Add TileLang SplitHostDevice pass and tighten issue 830 test names

* lint fix

* enhance for kernel value unpacking.

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

7913fb1d

[Doc] Install docs add docker install method (#961) · 6031416f
Xiaoyu Zhang authored Oct 10, 2025

6031416f

[CI] add `pre-commit` integration (#955) · 8fe35402

Xuehai Pan authored Oct 10, 2025



* chore: misc cleanup

* feat: add pre-commit config

* chore: update lint dependencies

* style: fix lint issues

* feat: add pre-commit hooks

* fix: fix typos

* chore: update .gitattributes

* [Lint]: [pre-commit.ci] auto fixes [...]

* docs: update CONTRIBUTING.md

* chore: update default venv name

* chore: revert and exclude CUDA files

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

8fe35402

[Bugfix] Do not force inline let stmt (#947) · f8ae600c

Lei Wang authored Oct 10, 2025

* remove debug print

* Remove inline let expressions from the LowerAndLegalize function in phase.py

* add test

* Update sparse MLA examples to support SKV adjustment and correctness checks

- Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
- Added check_correctness parameter to test functions for validation of outputs.
- Updated test cases to reflect new SKV values and correctness checks.

* reduce test shape

* Update documentation structure and refactor main function parameters in example_fusedmoe_tilelang.py

- Added a new section for compiler internals in the documentation.
- Refactored the main function in example_fusedmoe_tilelang.py to accept parameters for hidden dimensions, expert configurations, and batch/sequence sizes, improving flexibility and readability.

* Update buffer access checks in merge_shared_memory_allocations.cc

- Changed the condition for buffer access from less than (<) to less than or equal to (<=) to allow access at the same scope level.
- Adjusted the logic for determining the access level when touching buffers to ensure correct handling of scope levels.

* lint fix

* Support pipeline with LetStmt

* lint fix

* Fix LowerTileOp let handling to avoid LetInline dependency

  - inline let-bound BufferLoad nodes via resolver helpers and structured return
  - remap layouts/buffers using original data vars and only rewrite when needed
  - update pipeline planner to understand let-bound address_of buffers
  - document the new inline behaviour in docs/let_inline_fix.md

* fix for wgmma pipeline with let binding

* lint fix

* test fix

* reduce smem usage.

* let binding enhancement

* fix for dpgm

* fix simplify

* lint fix

* use tilelang.Simplify instead of tir.Simplify

* Add TL_FORCE_LET_INLINE pass config and gate eager LetInline usage

  - register the new config in builtin headers/registration
  - add helper to pipeline enabling LetInline based on pass context
  - document LetStmt inlining controls and usage

f8ae600c

[Example] Add support for `bfloat16` and user-defined `sm_scale` in attention sink examples (#924) · 7cd0da99

Tong WU authored Oct 10, 2025



* revert split+sum template for MHA backward

* lint

* Update example_mha_bwd.py

* Update example_mha_bwd_wgmma_pipelined.py

* Refactor attention sink examples to support bf16 and user-defined softmax scale

* fix typos

* Adding compile flags for fast math optimizations and enabling BF16 support in both GQA and MHA backward implementations.

* Update backward configuration for GQA and MHA examples to align with flash attention

* Refactor GQA backward implementation to improve atomic add performance

* Allow for slightly larger numerical error for bf16

* upd readme to show bf16 benchmark results

* lint

* fix ci and lint

* fix comments and lint

* refactor atomic add

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

7cd0da99
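
A user-defined softmax scale usually means the caller can override the factor applied to the attention scores before the softmax. The sketch below assumes the common convention of defaulting to `1 / sqrt(head_dim)` when no scale is given; whether the examples in this commit use that default is an assumption, and `resolve_sm_scale` is a hypothetical helper name.

```python
import math
from typing import Optional


def resolve_sm_scale(head_dim: int, sm_scale: Optional[float] = None) -> float:
    """Return the caller's scale, or 1/sqrt(head_dim) when none is given."""
    if sm_scale is None:
        return 1.0 / math.sqrt(head_dim)
    return sm_scale


print(resolve_sm_scale(64))       # 0.125
print(resolve_sm_scale(64, 0.5))  # 0.5
```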

[Docs] add CODE_OF_CONDUCT.md (#965) · 8f07b9b0

Xuehai Pan authored Oct 10, 2025



* [Docs] add CODE_OF_CONDUCT.md

* Update CODE_OF_CONDUCT.md

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

8f07b9b0

09 Oct, 2025 10 commits

[TileOp] Implement WGMMA for T.gemm_v2 (#813) · a13cde28

Lei Wang authored Oct 10, 2025

* [Feature] Introduce WGMMA support and enhance GEMM layout handling

- Added support for the WGMMA intrinsic in the TileLang framework, enabling efficient matrix multiplication on newer architectures.
- Refactored GEMM layout functions to accept a boolean parameter for K dimension handling, improving flexibility in layout generation.
- Updated layout inference logic to accommodate new WGMMA configurations and ensure compatibility with existing GEMM operations.
- Enhanced Python bindings for layout functions, allowing for better integration and usability in user-defined operations.
- Improved documentation for layout functions and GEMM operations to clarify usage and parameters.

These changes enhance the performance and usability of GEMM operations, particularly for advanced architectures, while maintaining backward compatibility with existing implementations.

* [Refactor] Clean up code formatting and enhance layout function readability

- Improved code formatting across multiple files for better readability, including consistent indentation and line breaks.
- Updated layout function signatures to enhance clarity, particularly in `gemm_layouts.cc`, `layout.cc`, and `layout.h`.
- Refactored lambda functions in `builtin.cc` and `gemm_py.cc` for improved structure and maintainability.
- Enhanced comments and documentation in layout-related files to clarify usage and parameters.

These changes contribute to a cleaner codebase and improved maintainability of layout functions in the TileLang framework.

* [Feature] Add descriptor initialization and offset manipulation for WGMMA

- Introduced new TileLang builtins `initialize_descriptor` and `increase_descriptor_offset` to facilitate descriptor management for WGMMA operations.
- Updated `builtin.cc` and `builtin.h` to define and document the new builtins, enhancing the framework's capabilities for descriptor handling.
- Modified `codegen_cuda.cc` and `ptx.cc` to integrate the new builtins into the code generation process, ensuring proper assembly generation for WGMMA operations.
- Enhanced the `GemmWGMMA` class to utilize the new descriptor functionalities, improving the efficiency of matrix multiplication operations.
- Updated related tests and documentation to reflect the new features and ensure comprehensive coverage.

These changes enhance the TileLang framework's support for advanced matrix operations on newer architectures, improving performance and usability.

* [Refactor] Improve code formatting and readability in various files

- Enhanced code formatting across multiple files for better readability, including consistent indentation and line breaks.
- Updated function signatures and comments in `builtin.h`, `codegen_cuda.cc`, and `ptx.cc` to improve clarity.
- Refactored descriptor initialization and offset manipulation functions in `builtin.py` and `wgmma_macro_generator.py` for improved structure.
- Cleaned up unnecessary whitespace and improved alignment in `common.h` and `allocate.py`.

These changes contribute to a cleaner and more maintainable codebase in the TileLang framework.

* [Update] Update subproject commit and refactor layout function call

- Updated the subproject commit for `cutlass` to indicate a dirty state.
- Refactored the `UpdateAnalyzer` function in `layout.cc` to call `LayoutNode::getVarMap()` instead of `getVarMap()`, improving clarity and ensuring proper context for variable mapping.

These changes enhance the maintainability and clarity of the layout handling in the TileLang framework.

* support more data types

* gemm_rs support

* lint fix

* wgmma wrapper

* Remove debug logging for wgmma assembly code and refactor swizzle byte size calculations in wgmma macro generator. Enhanced handling of leading and stride byte offsets based on swizzle mode, improving clarity and performance in tensor core intrinsic emissions.

* Refactor GEMM layout functions to replace 'kfactor' with 'k_inner' for improved clarity and consistency. Update includes necessary changes in error messages for Hopper and Sm100 layouts. Additionally, include a new header for CUTE utilities in common.h.

* Comprehensively support WGMMA GEMM SS

* remove debug print

* lint fix

* remove debug print

* reduce bwd test shape

* lint fix

* clear cache for pytest

* lint fix

* Update sparse MLA examples to support SKV adjustment and correctness checks

- Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
- Added check_correctness parameter to test functions for validation of outputs.
- Updated test cases to reflect new SKV values and correctness checks.

* test fix

* adjust test case

* test fix

* skip some test currently

a13cde28

[CI]: Bump actions/checkout from 2 to 5 (#953) · 10adb79f

dependabot[bot] authored Oct 09, 2025

Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 5.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v2...v5)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

10adb79f

[CI]: Bump actions/github-script from 7 to 8 (#954) · 5d881a57

dependabot[bot] authored Oct 09, 2025

Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 8.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](https://github.com/actions/github-script/compare/v7...v8)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

5d881a57

[CI]: Bump astral-sh/setup-uv from 6 to 7 (#952) · b6f90d25

dependabot[bot] authored Oct 09, 2025

Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from 6 to 7.
- [Release notes](https://github.com/astral-sh/setup-uv/releases)
- [Commits](https://github.com/astral-sh/setup-uv/compare/v6...v7)

---
updated-dependencies:
- dependency-name: astral-sh/setup-uv
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

b6f90d25

[CI]: Bump actions/setup-python from 2 to 6 (#951) · d8fedc17

dependabot[bot] authored Oct 09, 2025

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 2 to 6.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v2...v6)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

d8fedc17

[Bugfix][Doc] Add astroid version constraint to requirements.txt (#958) · 2dea17e5
Wenhao Xie authored Oct 09, 2025

2dea17e5
[Bugfix] Fix type object is not subscriptable in py38 (#959) · 6b2bb310
Xiaoyu Zhang authored Oct 09, 2025

6b2bb310
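
As background for the commit above: on Python 3.8, subscripting a built-in type at runtime (for example `list[str]`) raises `TypeError: 'type' object is not subscriptable`, because built-in generics only arrived in Python 3.9. The commit message does not show the actual fix; a common one, sketched here with a hypothetical function, is to use the `typing` aliases.

```python
from typing import Dict, List


def group_by_length(words: List[str]) -> Dict[int, List[str]]:
    """Group words by length, using annotations that also work on Python 3.8."""
    groups: Dict[int, List[str]] = {}
    for word in words:
        groups.setdefault(len(word), []).append(word)
    return groups


print(group_by_length(["a", "bb", "c"]))  # {1: ['a', 'c'], 2: ['bb']}
```

`from __future__ import annotations` is another option, but it only defers annotations; a subscripted built-in used in a runtime expression still fails on 3.8.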
[CI] auto-cancel in-progress PR CI when new commits are pushed (#956) · 9a7cda42
Xuehai Pan authored Oct 09, 2025

9a7cda42
Modify the SM architecture number to support Thor’s sm110. (#957) · 07f62104
Shawn Liu authored Oct 09, 2025

07f62104
[CI] enable dependabot for GHA workflows (#950) · f6d4bd3a
Xuehai Pan authored Oct 09, 2025
```
* chore: add .editorconfig

* feat: enable dependabot for GHA workflows
```
f6d4bd3a

07 Oct, 2025 3 commits

[Backend] Add metal backend (#799) · 7fb06776

Yichen Yan authored Oct 07, 2025



* Reset

* Fix other CUDA issue

* fmt

* fmt

* fix cuda error

* fix

* fix

* fmt

* cleanup

* fix

* remove copyright

* trivial update

* readme update

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

7fb06776

[Refactor] Refine nvrtc compile related check style (#945) · 394e17d0
Xiaoyu Zhang authored Oct 07, 2025
```
* unify nvrtc check style

* unify nvrtc check style

* unify nvrtc check style
```
394e17d0

[Enhancement] Add buffer load copy functions and improve copy logic in tilelang (#946) · c61971e8

Lei Wang authored Oct 07, 2025

- Introduced new functions for buffer load copy with stride and parallel execution.
- Enhanced the copy logic in `copy.py` to simplify nested if statements for BufferLoad nodes.
- Added corresponding test cases for the new buffer load functionalities.

c61971e8

06 Oct, 2025 3 commits

[Profiler] Adds CUPTI profiler support (#936) · 91d5ef54

Cunxiao Ni authored Oct 06, 2025



* [Profiler]Adds CUPTI profiler support

* format

* refactor cupti profiler

* format

* refactor

* refactor

* fix lint

* fix lint

* refactor

* add profiler tests

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

91d5ef54

[Example] Add sparse mla bwd example for deepseek_v32 (#919) · ac8c9afc

Zhichen Zeng authored Oct 06, 2025



* Add sparse mla bwd example

* add bwd into test

* Update README with bwd impl

* comment

* format fix

* lint fix

* fwd fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

ac8c9afc

[Example] Revert the atomic/split&sum templates in MHA backward examples (#943) · 481cae42

Tong WU authored Oct 06, 2025



* revert split+sum template for MHA backward

* lint

* Update example_mha_bwd.py

* Update example_mha_bwd_wgmma_pipelined.py

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

481cae42

05 Oct, 2025 3 commits
- [Example] Disable TMA and enable FastMath for NSA Examples (#941) · 3aecab8f
  Lei Wang authored Oct 05, 2025
```
* tma disable

* int64 cast fix.
```
  3aecab8f
- [Example] Introduce split+sum template, and optimize `atomic_add` performance... · 557589ff
  Lei Wang authored Oct 05, 2025
```
[Example] Introduce split+sum template, and optimize `atomic_add` performance for bwd examples (#940)

* example fix

* lint fix

* bug fix

* reduce test size.
```
  557589ff
- [Enhancement] Fix lint to improve grouped GEMM performance with TMA (#938) · 95170ab7
  Cunxiao Ni authored Oct 05, 2025
```
* [Example] Fix lint to improve grouped GEMM performance with TMA

* fix lint
```
  95170ab7

04 Oct, 2025 3 commits

[Enhancement] Enhance and add new GQA backward examples for Hopper (#930) · b31de0ce

Tong WU authored Oct 05, 2025

* [Enhancement] Enhance the GQA backward kernel by calculating `dq` and `dv` via copy&sum

* [Example] Implement GQA backward example for Hopper with customized tiling and pipeline

* [Example] Add relevant tests

* Fix all typos of wrong shape of `V_shared` in macros

b31de0ce

[Example] Add correctness assert into dsa example (#937) · d5c88afa
Lei Wang authored Oct 04, 2025

d5c88afa

[Example] Optimize online_softmax example (#934) · 242cb457

lijinpei authored Oct 04, 2025



* [Example] Optimize online_softmax example

- Y should be output in float16.
- BN needs to be equal to N to be really online.
- On my H100 machine, this increases the speedup from 1.424x to 2.788x.

* enhance

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

242cb457
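
The commit above does not include the kernel, so as background only: an online softmax keeps a running maximum and a running sum of exponentials, rescaling the sum whenever the maximum grows, so the normalizer is obtained in a single pass over the input. The scalar Python sketch below illustrates that recurrence for finite inputs; it is not the TileLang example, and `online_softmax` is a hypothetical name.

```python
import math
from typing import List, Sequence


def online_softmax(xs: Sequence[float]) -> List[float]:
    """Softmax whose normalizer is computed in one pass over xs."""
    running_max = -math.inf
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new maximum,
        # then add the contribution of the current element.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + math.exp(x - new_max))
        running_max = new_max
    # Second loop only normalizes; the statistics are already final.
    return [math.exp(x - running_max) / running_sum for x in xs]


probs = online_softmax([1.0, 2.0, 3.0])
print(sum(probs))
```

The result matches the usual two-pass formulation (find the maximum, then sum the shifted exponentials) up to floating-point rounding.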

02 Oct, 2025 2 commits

[Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa

Zhiwen Mo authored Oct 03, 2025

* Implements tcgen05.ld instruction support for copying from shared.tmem
  to local.fragment on SM100/Blackwell architecture. Adds layout inference
  and lowering logic for tensor memory operations with proper physical
  coordinate range analysis and warpgroup alignment checks.

  Changes:
  - Add kTMemLoad and kTMemStore to CopyInst enumeration
  - Implement CheckTMemLoad() and CheckTMemStore() validation functions
  - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
  - Add tmem layout inference in InferLayout() using expandTcgen05Layout
  - Support multiple instruction variants (32dp32b/64b/128b/256b)
  - Add physical layout bounds analysis for tmem coordinates
  - Change clear_accum from bool to PrimExpr in GEMM operations
  - Fix std::optional access checks in layout_inference.cc
  - Add tmem_allocate/deallocate PTX intrinsic support
  - Fix cooperative_groups grid.sync() code generation

* fix

* pipeline fix

* bug fix

* bool fix

5ccac4fa

[Layout] Strict annotate completed replicated layout for fragment with constant index (#929) · fc4bd452

Lei Wang authored Oct 02, 2025

* [Layout] Add IsCompletedReplicated method and enhance layout inference in ParallelOpNode

- Introduced IsCompletedReplicated method in FragmentNode to check if a buffer is fully replicated.
- Enhanced InferLayout in ParallelOpNode to handle layout inference for replicated buffers, ensuring only fragment[0] access is allowed.
- Updated error handling for non-zero index access in fragment buffers to improve robustness.

* [Layout] Improve code formatting and readability in layout.cc and parallel.cc

- Enhanced formatting in FragmentNode's IsCompletedReplicated method for better clarity.
- Updated InferLayout method in ParallelOpNode to improve code readability by adjusting line breaks and indentation.
- Ensured consistent formatting across conditional statements and comments for improved maintainability.

* updt

* optimize const index related op

* bug fix

* reduce gdn test

* test fix

* lintfix

* lint fix

* test fix

fc4bd452

01 Oct, 2025 2 commits
- [CI] Fix documentation runner by adding 'nvidia' tag · f09e91e3
  Wenhao Xie authored Oct 01, 2025
  
  f09e91e3
- [Example] Add MLA decode ws example (#928) · 8150e47e
  Yu Cheng authored Oct 01, 2025
  
  8150e47e