Commits · b6f90d25f0853a5e92036d988a47541ff8c3c15e · OpenDAS / tilelang

09 Oct, 2025 7 commits

[CI]: Bump astral-sh/setup-uv from 6 to 7 (#952) · b6f90d25

dependabot[bot] authored Oct 09, 2025

Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from 6 to 7.
- [Release notes](https://github.com/astral-sh/setup-uv/releases)
- [Commits](https://github.com/astral-sh/setup-uv/compare/v6...v7)

---
updated-dependencies:
- dependency-name: astral-sh/setup-uv
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

b6f90d25

[CI]: Bump actions/setup-python from 2 to 6 (#951) · d8fedc17

dependabot[bot] authored Oct 09, 2025

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 2 to 6.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v2...v6)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

d8fedc17

[Bugfix][Doc] Add astroid version constraint to requirements.txt (#958) · 2dea17e5
Wenhao Xie authored Oct 09, 2025

2dea17e5
[Bugfix] Fix type object is not subscriptable in py38 (#959) · 6b2bb310
Xiaoyu Zhang authored Oct 09, 2025

6b2bb310
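
For context, a minimal sketch of the class of error named in the title above. On Python 3.8, built-in containers such as `list` cannot be subscripted at runtime (that arrived in Python 3.9), so annotations are typically written with `typing` generics instead. The function names below are hypothetical and are not taken from the commit.

```python
from typing import Dict, List


def count_flags(flags: List[str]) -> Dict[str, int]:
    """Count occurrences of each flag; these annotations also work on Python 3.8."""
    counts: Dict[str, int] = {}
    for flag in flags:
        counts[flag] = counts.get(flag, 0) + 1
    return counts


def builtin_generics_supported() -> bool:
    """Return True when `list[int]` can be evaluated at runtime."""
    try:
        list[int]  # on Python 3.8: TypeError: 'type' object is not subscriptable
        return True
    except TypeError:
        return False
```

Writing `List[str]` rather than `list[str]` keeps a module importable on 3.8 without relying on deferred annotation evaluation.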
[CI] auto-cancel in-progress PR CI when new commits are pushed (#956) · 9a7cda42
Xuehai Pan authored Oct 09, 2025

9a7cda42
Modify the SM architecture number to support Thor’s sm110. (#957) · 07f62104
Shawn Liu authored Oct 09, 2025

07f62104
[CI] enable dependabot for GHA workflows (#950) · f6d4bd3a
Xuehai Pan authored Oct 09, 2025
```
* chore: add .editorconfig

* feat: enable dependabot for GHA workflows
```
f6d4bd3a

07 Oct, 2025 3 commits

[Backend] Add metal backend (#799) · 7fb06776

Yichen Yan authored Oct 07, 2025



* Reset

* Fix other CUDA issue

* fmt

* fmt

* fix cuda error

* fix

* fix

* fmt

* cleanup

* fix

* remove copyright

* trivial update

* readme update

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

7fb06776

[Refactor] Refine nvrtc compile related check style (#945) · 394e17d0
Xiaoyu Zhang authored Oct 07, 2025
```
* unify nvrtc check style

* unify nvrtc check style

* unify nvrtc check style
```
394e17d0

[Enhancement] Add buffer load copy functions and improve copy logic in tilelang (#946) · c61971e8

Lei Wang authored Oct 07, 2025

- Introduced new functions for buffer load copy with stride and parallel execution.
- Enhanced the copy logic in `copy.py` to simplify nested if statements for BufferLoad nodes.
- Added corresponding test cases for the new buffer load functionalities.

c61971e8

06 Oct, 2025 3 commits

[Profiler] Adds CUPTI profiler support (#936) · 91d5ef54

Cunxiao Ni authored Oct 06, 2025



* [Profiler]Adds CUPTI profiler support

* format

* refactor cupti profiler

* format

* refactor

* refactor

* fix lint

* fix lint

* refactor

* add profiler tests

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

91d5ef54

[Example] Add sparse mla bwd example for deepseek_v32 (#919) · ac8c9afc

Zhichen Zeng authored Oct 06, 2025



* Add sparse mla bwd example

* add bwd into test

* Update README with bwd impl

* comment

* format fix

* lint fix

* fwd fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

ac8c9afc

[Example] Revert the atomic/split&sum templates in MHA backward examples (#943) · 481cae42

Tong WU authored Oct 06, 2025



* revert split+sum template for MHA backward

* lint

* Update example_mha_bwd.py

* Update example_mha_bwd_wgmma_pipelined.py

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

481cae42
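
As a rough illustration of the two accumulation styles these example commits name, the sketch below contrasts adding every contribution into one shared buffer (the atomic style) with giving each producer its own slice and reducing afterwards (split+sum). This is plain Python under an assumed reading of the terms; the function names are hypothetical and nothing here is the TileLang template itself.

```python
from typing import List


def accumulate_in_place(contribs: List[List[float]]) -> List[float]:
    """Atomic-add style: every producer adds into one shared output buffer."""
    out = [0.0] * len(contribs[0])
    for part in contribs:  # on a GPU these additions would need to be atomic
        for i, value in enumerate(part):
            out[i] += value
    return out


def split_then_sum(contribs: List[List[float]]) -> List[float]:
    """Split+sum style: each producer owns a private slice; a final pass reduces."""
    partial = [list(part) for part in contribs]  # one private buffer per producer
    return [sum(column) for column in zip(*partial)]
```

The second form trades extra memory for contention-free writes; which one is faster depends on the kernel and hardware.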

05 Oct, 2025 3 commits
- [Example] Disable TMA and enable FastMath for NSA Examples (#941) · 3aecab8f
  Lei Wang authored Oct 05, 2025
```
* tma disable

* int64 cast fix.
```
  3aecab8f
- [Example] Introduce split+sum template, and optimize `atomic_add` performance... · 557589ff
  Lei Wang authored Oct 05, 2025
```
[Example] Introduce split+sum template, and optimize `atomic_add` performance for bwd examples (#940)

* example fix

* lint fix

* bug fix

* reduce test size.
```
  557589ff
- [Enhancement] Fix lint to improve grouped GEMM performance with TMA (#938) · 95170ab7
  Cunxiao Ni authored Oct 05, 2025
```
* [Example] Fix lint to improve grouped GEMM performance with TMA

* fix lint
```
  95170ab7
04 Oct, 2025 3 commits

[Enhancement] Enhance and add new GQA backward examples for Hopper (#930) · b31de0ce

Tong WU authored Oct 05, 2025

* [Enhancement] Enhance the GQA backward kernel by calculating `dq` and `dv` via copy&sum

* [Example] Implement GQA backward example for Hopper with customized tiling and pipeline

* [Example] Add relevant tests

* Fix all typos of wrong shape of `V_shared` in macros

b31de0ce

[Example] Add correctness assert into dsa example (#937) · d5c88afa
Lei Wang authored Oct 04, 2025

d5c88afa

[Example] Optimize online_softmax example (#934) · 242cb457

lijinpei authored Oct 04, 2025



* [Example] Optimize online_softmax example

- Y should be output in float16.
- BN needs to be equal to N to be really online.
- On my H100 machine, this increases the speedup from 1.424x to 2.788x.

* enhance

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

242cb457
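
For readers unfamiliar with the technique the example is named after, the online softmax recurrence can be sketched in plain Python: a running maximum and a running sum are updated block by block, and the earlier sum is rescaled whenever the maximum grows. The `block` parameter here is illustrative and is not the `BN` of the commit; this is a reference sketch, not the TileLang kernel.

```python
import math
from typing import List, Sequence


def online_softmax(x: Sequence[float], block: int) -> List[float]:
    """Softmax computed with a running max and running sum over blocks of x."""
    running_max = -math.inf
    running_sum = 0.0
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        new_max = max(running_max, max(chunk))
        # Rescale the previous sum to the new max before adding this block.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(v - new_max) for v in chunk)
        running_max = new_max
    return [math.exp(v - running_max) / running_sum for v in x]
```

The result should match a two-pass softmax up to floating-point rounding, regardless of the block size chosen.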

02 Oct, 2025 2 commits

[Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa

Zhiwen Mo authored Oct 03, 2025

* Implements tcgen05.ld instruction support for copying from shared.tmem
  to local.fragment on SM100/Blackwell architecture. Adds layout inference
  and lowering logic for tensor memory operations with proper physical
  coordinate range analysis and warpgroup alignment checks.

  Changes:
  - Add kTMemLoad and kTMemStore to CopyInst enumeration
  - Implement CheckTMemLoad() and CheckTMemStore() validation functions
  - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
  - Add tmem layout inference in InferLayout() using expandTcgen05Layout
  - Support multiple instruction variants (32dp32b/64b/128b/256b)
  - Add physical layout bounds analysis for tmem coordinates
  - Change clear_accum from bool to PrimExpr in GEMM operations
  - Fix std::optional access checks in layout_inference.cc
  - Add tmem_allocate/deallocate PTX intrinsic support
  - Fix cooperative_groups grid.sync() code generation

* fix

* pipeline fix

* bug fix

* bool fix

5ccac4fa

[Layout] Strict annotate completed replicated layout for fragment with constant index (#929) · fc4bd452

Lei Wang authored Oct 02, 2025

* [Layout] Add IsCompletedReplicated method and enhance layout inference in ParallelOpNode

- Introduced IsCompletedReplicated method in FragmentNode to check if a buffer is fully replicated.
- Enhanced InferLayout in ParallelOpNode to handle layout inference for replicated buffers, ensuring only fragment[0] access is allowed.
- Updated error handling for non-zero index access in fragment buffers to improve robustness.

* [Layout] Improve code formatting and readability in layout.cc and parallel.cc

- Enhanced formatting in FragmentNode's IsCompletedReplicated method for better clarity.
- Updated InferLayout method in ParallelOpNode to improve code readability by adjusting line breaks and indentation.
- Ensured consistent formatting across conditional statements and comments for improved maintainability.

* updt

* optimize const index related op

* bug fix

* reduce gdn test

* test fix

* lintfix

* lint fix

* test fix

fc4bd452

01 Oct, 2025 5 commits

[CI] Fix documentation runner by adding 'nvidia' tag · f09e91e3
Wenhao Xie authored Oct 01, 2025

f09e91e3
[Example] Add MLA decode ws example (#928) · 8150e47e
Yu Cheng authored Oct 01, 2025

8150e47e

[CI] Refactor import paths in dequantization examples to use dequantize_utils (#914) · 9d382973

Lei Wang authored Oct 01, 2025

* Update requirements and refactor benchmark script for deepseek_nsa example

- Updated the requirements.txt to specify a fixed commit for the flash-linear-attention repository.
- Refactored import paths in benchmark_nsa_fwd.py for better organization.
- Added a new function to generate configurations for autotuning.
- Modified the tilelang_sparse_attention function to accept parameters for block size, number of stages, and threads, enhancing flexibility.
- Changed allocation of shared memory for accumulators to optimize performance.

* Refactor import paths in dequantization examples to use dequantize_utils

- Updated import statements in multiple dequantization example scripts to replace references to the removed utils.py file with the new dequantize_utils module.
- Ensured consistency across example scripts for better organization and maintainability.

9d382973

[Bugfix] Fix saving kernel source code where JITKernel.artifact is None (#921) · 1b4cd386

M.D_v2.5 authored Oct 01, 2025

In cases where `JITKernel.artifact` is None, saving the kernel source fails with this error:
```
2025-10-01 01:06:18  [TileLang:tilelang:ERROR]: Error saving kernel source
code to disk: 'NoneType' object has no attribute 'kernel_source'
```

Looking at properties of JITKernel, it seems that `JITKernel.kernel_source`
is a better way to achieve this.
Ref:
https://github.com/tile-ai/tilelang/blob/main/tilelang/jit/kernel.py#L453-L455

Co-authored-by: Dylan <miaoding.dai@plus.ai>

1b4cd386
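
The failure mode described in this commit can be reproduced with a small stand-in. The classes below are hypothetical and only mimic the shape of the problem (an optional `artifact` with a `kernel_source` attribute); they are not the TileLang implementation, and the fallback field is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Artifact:
    kernel_source: str


class Kernel:
    """Hypothetical stand-in for a JIT kernel whose artifact may be absent."""

    def __init__(self, artifact: Optional[Artifact], fallback_source: str = ""):
        self.artifact = artifact
        self._fallback_source = fallback_source

    @property
    def kernel_source(self) -> str:
        # Guarded accessor: never dereferences a missing artifact.
        if self.artifact is not None:
            return self.artifact.kernel_source
        return self._fallback_source
```

Reading `kernel.artifact.kernel_source` raises `AttributeError` when `artifact` is None, while going through the guarded property does not.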

[Enhancement] Include compile flags into the hash key of cached kernels (#911) · f737fa97

Tong WU authored Oct 01, 2025

* [Cache] Add compile_flags parameter to KernelCache hash keys

* [Cache] Update compile_flags parameter to accept both List[str] and str types

* lint

* [Refactor] Update compile_flags parameter to accept Union[List[str], str] type

f737fa97
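
A minimal sketch of the idea in this commit, assuming a cache keyed by a hash of the kernel inputs: once the compile flags are part of the hashed payload, the same source compiled with different flags no longer collides in the cache. The function names, the payload fields, and the choice of SHA-256 are assumptions for illustration and are not taken from `KernelCache`.

```python
import hashlib
import json
from typing import List, Optional, Union


def normalize_flags(compile_flags: Optional[Union[List[str], str]]) -> List[str]:
    """Accept None, a single string, or a list of strings; return a list."""
    if compile_flags is None:
        return []
    if isinstance(compile_flags, str):
        return [compile_flags]
    return list(compile_flags)


def cache_key(source: str, target: str,
              compile_flags: Optional[Union[List[str], str]] = None) -> str:
    """Hypothetical cache key: any change in flags yields a different digest."""
    payload = json.dumps(
        {
            "source": source,
            "target": target,
            "compile_flags": normalize_flags(compile_flags),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Normalizing before hashing means `"-O3"` and `["-O3"]` map to the same key, which is one way to handle the `Union[List[str], str]` parameter type mentioned above.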

30 Sep, 2025 3 commits

[CI] optimize CI time for sparse gemm (#906) · a35ac496

botbw authored Sep 30, 2025

* [CI] optimize CI time

* [CI] fix transpose && format

* [misc] apply coderabbit suggestions && fix typo

a35ac496

[Example] Specify a fixed commit for the flash-linear-attention repository and... · 3ad6202d

Lei Wang authored Sep 30, 2025

[Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples (#913)

- Updated the requirements.txt to specify a fixed commit for the flash-linear-attention repository.
- Refactored import paths in benchmark_nsa_fwd.py for better organization.
- Added a new function to generate configurations for autotuning.
- Modified the tilelang_sparse_attention function to accept parameters for block size, number of stages, and threads, enhancing flexibility.
- Changed allocation of shared memory for accumulators to optimize performance.

3ad6202d
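
The configuration-generating function mentioned in this commit is not shown on this page. A common shape for such a helper, sketched here with hypothetical names and illustrative values only, is a Cartesian product over the tunable parameters (block size, number of stages, threads):

```python
import itertools
from typing import Dict, List


def get_configs() -> List[Dict[str, int]]:
    """Hypothetical search space; the values are illustrative only."""
    space = {
        "block_size": [32, 64],
        "num_stages": [1, 2, 3],
        "threads": [128, 256],
    }
    keys = list(space)
    # One dict per combination of parameter values.
    return [dict(zip(keys, values)) for values in itertools.product(*space.values())]
```

Each resulting dict could then be passed as keyword arguments to a kernel factory that accepts those parameters.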

[Typo] Fix branch name & link for AscendNPU IR in latest news (#907) · f92de932
Wenhao Xie authored Sep 30, 2025

f92de932

29 Sep, 2025 8 commits

[Example] Add topk into sparse mla example and append some docs (#901) · 6021ef32

Lei Wang authored Sep 30, 2025

* Remove unused `fp8_mqa_logits.py` file and update README.md to reflect new directory structure and file descriptions for deepseek_v32 example. Added sections for architecture overview, Lightning Indexer, Top-k Selector, and Sparse MLA Forward implementations.

* Update linting configurations and improve code formatting in deepseek_v32 example scripts

- Added per-file ignores for the inference directory in `pyproject.toml`.
- Refactored code in `topk_selector.py`, `convert.py`, `generate.py`, `kernel.py`, and `model.py` to enhance readability by adjusting spacing and line breaks.
- Ensured consistent formatting across function definitions and assertions for better clarity.

* Refactor test functions in deepseek_v32 example scripts for improved clarity and consistency

- Updated `fp8_lighting_indexer.py` to define a dedicated test function for the lighting indexer.
- Refactored `sparse_mla_fwd_pipelined.py` and `sparse_mla_fwd.py` to standardize test function parameters and improve readability.
- Enhanced `topk_selector.py` by introducing a test function with parameters for batch size and sequence length.
- Ensured all test functions are invoked correctly in the main execution block.

* Enhance test functions in deepseek_v32 example scripts with CUDA requirements and parameterization

- Added CUDA requirements decorators to `test_example_sparse_mla_fwd` and `test_example_sparse_mla_fwd_pipelined`.
- Parameterized test functions to use specific small shapes for testing, improving test coverage and clarity.

* lint fix

* Update README.md to correct image path for DeepSeek V3.2 architecture diagram

6021ef32

[Bugfix] Fix flops comp and softmax scale in mla (#900) · 16561159
Wenxuan Tan authored Sep 29, 2025
```
* fix flops comp and softmax scale

* format
```
16561159
[CI] Legalize math related test (#899) · 54fc6ba0
Lei Wang authored Sep 30, 2025

54fc6ba0
[Typo] Fix backend name for Huawei Ascend (#898) · d19fe1ae
Wenhao Xie authored Sep 30, 2025
```
* [Typo] Fix backend name for Huawei Ascend chips

* update
```
d19fe1ae

[Example] Add sparse mla examples (#896) · 65ac7454

Lei Wang authored Sep 29, 2025

* Update README.md to include directory structure and file descriptions for deepseek_v32 example

* Refactor and clean up deepseek_v32 example scripts

- Removed unused imports and functions from `fp8_mqa_logits.py` to streamline the code.
- Improved formatting and readability in `sparse_mla_fwd_pipelined.py` and `sparse_mla_fwd.py` by adjusting function signatures and indentation.
- Added `# ruff: noqa` comments to suppress linting warnings in multiple files.
- Enhanced the `generate_random_cu_seqlens` function in `utils.py` for better clarity and organization.
- Updated print statements for consistency in output formatting.

65ac7454

[News] Add announcement of support for Huawei Ascend chips (#895) · 78664e24
Wenhao Xie authored Sep 29, 2025

78664e24

[Example] Add example (#894) · 4424fa9a

Lei Wang authored Sep 29, 2025

* [Refactor] Enhance CopyNode Lower method to support disable_tma flag and improve flash attention implementation

- Updated the CopyNode Lower method to correctly include the disable_tma flag in the GetCopyInst call.
- Refactored the flash attention implementation to selectively disable TMA for specific copy operations while allowing it for others.
- Addressed linting issues for improved code quality.

* sparse mla kernels

* Remove deprecated sparse MLA and utility files to streamline the codebase.

4424fa9a

[Layout] fix plot layout (#890) · 6c67a77f
Jiaxing Ding authored Sep 29, 2025

6c67a77f

28 Sep, 2025 2 commits

[Bugfix] Fix CopyNode Lower method to include disable_tma flag in GetCopyInst (#888) · 599264ca

Tong WU authored Sep 29, 2025

* Fix CopyNode Lower method to include disable_tma flag in GetCopyInst call

* Refactor flash attention implementation to disable TMA for specific copy and allow TMA for other operations

* attempt to fix lint

599264ca

[SM100] Add sm100 GEMM layouts and tcgen05 support (#887) · f58bcd43

Zhiwen Mo authored Sep 28, 2025

* update sm100 related utcmma, tmem, ld/st256 in src
* update sm100 related utcmma, tmem, ld/st256 in tilelang
* Remove deprecated GEMM examples and related README documentation for SM100 architecture support
* Update GEMM implementation to replace UTCMMA with TCGEN5MMA across relevant files
* Remove gemm_umma.py example and update README to reflect TCGEN5MMA terminology changes
* Update README.md for gemm_sm100 example by removing outdated API sections and streamlining documentation
* Update README and source files to reflect TCGEN5.MMA terminology changes
* Refactor CUDA GEMM header for improved readability

f58bcd43

26 Sep, 2025 1 commit

[Layout] Introduce Flexible Parallel to Support T.serial and local buffers... · c382dcbc

Lei Wang authored Sep 27, 2025


[Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop (#844)

* Support T.serial and local buffers inside T.Parallel loop.

* Fix reducer layout in T.Parallel nested inside other loops

* Debug output with LOG(INFO)

* Add disable option for WGMMA.

* fix

* Use DLOG; fix missing registration for new pass config

* bug fix

* lint fix

* Enhance GEMM instruction set with UTCMMA and improve local buffer handling in casting example

* Update format.sh shebang, improve logging in layout inference, and enhance buffer store wrapper with detailed comments

* Enhance GEMM instantiation logic and improve layout inference for local buffer detection

- Updated the GEMM instantiation logic to include a check for WGMMA compatibility, ensuring that the conditions for using WGMMA are more robust.
- Refined the layout inference process to better identify when loops manipulate only local buffers, improving the accuracy of thread binding decisions in parallel loops.

---------
Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>

c382dcbc