Commits · 9a7cda42b28502d3a74bc46737c8946fedc745ae · OpenDAS / tilelang

09 Oct, 2025 3 commits
- [CI] auto-cancel in-progress PR CI when new commits are pushed (#956) · 9a7cda42
  Xuehai Pan authored Oct 09, 2025
  
  9a7cda42
- Modify the SM architecture number to support Thor’s sm110. (#957) · 07f62104
  Shawn Liu authored Oct 09, 2025
  
  07f62104
- [CI] enable dependabot for GHA workflows (#950) · f6d4bd3a
  Xuehai Pan authored Oct 09, 2025
```
* chore: add .editorconfig

* feat: enable dependabot for GHA workflows
```
  f6d4bd3a
07 Oct, 2025 3 commits

[Backend] Add metal backend (#799) · 7fb06776

Yichen Yan authored Oct 07, 2025



* Reset

* Fix other CUDA issue

* fmt

* fmt

* fix cuda error

* fix

* fix

* fmt

* cleanup

* fix

* remove copyright

* trivial update

* readme update

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

7fb06776

[Refactor] Refine nvrtc compile related check style (#945) · 394e17d0
Xiaoyu Zhang authored Oct 07, 2025
```
* unify nvrtc check style

* unify nvrtc check style

* unify nvrtc check style
```
394e17d0

[Enhancement] Add buffer load copy functions and improve copy logic in tilelang (#946) · c61971e8

Lei Wang authored Oct 07, 2025

- Introduced new functions for buffer load copy with stride and parallel execution.
- Enhanced the copy logic in `copy.py` to simplify nested if statements for BufferLoad nodes.
- Added corresponding test cases for the new buffer load functionalities.

c61971e8

06 Oct, 2025 3 commits

[Profiler] Adds CUPTI profiler support (#936) · 91d5ef54

Cunxiao Ni authored Oct 06, 2025



* [Profiler]Adds CUPTI profiler support

* format

* rafactor cupti profiler

* format

* rafactor

* rafactor

* fix lint

* fix lint

* refactor

* add profiler tests

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

91d5ef54

[Example] Add sparse mla bwd example for deepseek_v32 (#919) · ac8c9afc

Zhichen Zeng authored Oct 06, 2025



* Add sparse mla bwd example

* add bwd into test

* Update README with bwd impl

* comment

* format fix

* lint fix

* fwd fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

ac8c9afc

[Example] Revert the atomic/split&sum templates in MHA backward examples (#943) · 481cae42

Tong WU authored Oct 06, 2025



* revert split+sum template for MHA backward

* lint

* Update example_mha_bwd.py

* Update example_mha_bwd_wgmma_pipelined.py

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

481cae42

05 Oct, 2025 3 commits
- [Example] Disable TMA and enable FastMath for NSA Examples (#941) · 3aecab8f
  Lei Wang authored Oct 05, 2025
```
* tma disable

* int64 cast fix.
```
  3aecab8f
- [Example] Introduce split+sum template, and optimize `atomic_add` performance... · 557589ff
  Lei Wang authored Oct 05, 2025
```
[Example] Introduce split+sum template, and optimize `atomic_add` performance for bwd examples (#940)

* example fix

* lint fix

* bug fix

* reduce test size.
```
  557589ff
- [Enhancement] Fix lint to improve grouped GEMM performance with TMA (#938) · 95170ab7
  Cunxiao Ni authored Oct 05, 2025
```
* [Example]  Fix lint  to improve grouped GEMM performance with TMA

* fix lint
```
  95170ab7
04 Oct, 2025 3 commits

[Enhancement] Enhance and add new GQA backward examples for Hopper (#930) · b31de0ce

Tong WU authored Oct 05, 2025

* [Enhancement] Enhance the GQA backward kernel by calculating `dq` and `dv` via copy&sum

* [Example] Implement GQA backward example for Hopper with customized tiling and pipeline

* [Example] Add relevant tests

* Fix all typos of wrong shape of `V_shared` in macros

b31de0ce

[Example] Add correctness assert into dsa example (#937) · d5c88afa
Lei Wang authored Oct 04, 2025

d5c88afa

[Example] Optimize online_softmax example (#934) · 242cb457

lijinpei authored Oct 04, 2025



* [Example] Optimize online_softmax example

- Y should be output in float16.
- BN needs to be equal to N to be really online.
- On my H100 machine, this increase speedup from 1.424x to 2.788x.

* enhance

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

242cb457

02 Oct, 2025 2 commits

[Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa

Zhiwen Mo authored Oct 03, 2025

* Implements tcgen05.ld instruction support for copying from shared.tmem
  to local.fragment on SM100/Blackwell architecture. Adds layout inference
  and lowering logic for tensor memory operations with proper physical
  coordinate range analysis and warpgroup alignment checks.

  Changes:
  - Add kTMemLoad and kTMemStore to CopyInst enumeration
  - Implement CheckTMemLoad() and CheckTMemStore() validation functions
  - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
  - Add tmem layout inference in InferLayout() using expandTcgen05Layout
  - Support multiple instruction variants (32dp32b/64b/128b/256b)
  - Add physical layout bounds analysis for tmem coordinates
  - Change clear_accum from bool to PrimExpr in GEMM operations
  - Fix std::optional access checks in layout_inference.cc
  - Add tmem_allocate/deallocate PTX intrinsic support
  - Fix cooperative_groups grid.sync() code generation

* fix

* pipeline fix

* bug fix

* bool fix

5ccac4fa

[Layout] Strict annotate completed replicated layout for fragment with constant index (#929) · fc4bd452

Lei Wang authored Oct 02, 2025

* [Layout] Add IsCompletedReplicated method and enhance layout inference in ParallelOpNode

- Introduced IsCompletedReplicated method in FragmentNode to check if a buffer is fully replicated.
- Enhanced InferLayout in ParallelOpNode to handle layout inference for replicated buffers, ensuring only fragment[0] access is allowed.
- Updated error handling for non-zero index access in fragment buffers to improve robustness.

* [Layout] Improve code formatting and readability in layout.cc and parallel.cc

- Enhanced formatting in FragmentNode's IsCompletedReplicated method for better clarity.
- Updated InferLayout method in ParallelOpNode to improve code readability by adjusting line breaks and indentation.
- Ensured consistent formatting across conditional statements and comments for improved maintainability.

* updt

* optimize const index related op

* bug fix

* reduce gdn test

* test fix

* lintfix

* lint fix

* test fix

fc4bd452

01 Oct, 2025 5 commits

[CI] Fix documentation runner by adding 'nvidia' tag · f09e91e3
Wenhao Xie authored Oct 01, 2025

f09e91e3
[Example] Add MLA decode ws example (#928) · 8150e47e
Yu Cheng authored Oct 01, 2025

8150e47e

[CI] Refactor import paths in dequantization examples to use dequantize_utils (#914) · 9d382973

Lei Wang authored Oct 01, 2025

* Update requirements and refactor benchmark script for deepseek_nsa example

- Updated the requirements.txt to specify a fixed commit for the flash-linear-attention repository.
- Refactored import paths in benchmark_nsa_fwd.py for better organization.
- Added a new function to generate configurations for autotuning.
- Modified the tilelang_sparse_attention function to accept parameters for block size, number of stages, and threads, enhancing flexibility.
- Changed allocation of shared memory for accumulators to optimize performance.

* Refactor import paths in dequantization examples to use dequantize_utils

- Updated import statements in multiple dequantization example scripts to replace references to the removed utils.py file with the new dequantize_utils module.
- Ensured consistency across example scripts for better organization and maintainability.

9d382973

[Bugfix] Fix saving kernel source code where JITKernel.artifact is None (#921) · 1b4cd386

M.D_v2.5 authored Oct 01, 2025

In cases where JITKernel.artifact is None, it'll spit error -
```
2025-10-01 01:06:18  [TileLang:tilelang:ERROR]: Error saving kernel source
code to disk: 'NoneType' object has no attribute 'kernel_source'
```

Looking at properties of JITKernel, it seems that `JITKernel.kernel_source`
is a better way to achieve this.
Ref:
https://github.com/tile-ai/tilelang/blob/main/tilelang/jit/kernel.py#L453-L455

Co-authored-by: Dylan <miaoding.dai@plus.ai>

1b4cd386

[Enhancement] Include compile flags into the hash key of cached kernels (#911) · f737fa97

Tong WU authored Oct 01, 2025

* [Cache] Add compile_flags parameter to KernelCache hash keys

* [Cache] Update compile_flags parameter to accept both List[str] and str types

* lint

* [Refactor] Update compile_flags parameter to accept Union[List[str], str] type

f737fa97

30 Sep, 2025 3 commits

[CI] optimize CI time for sparse gemm (#906) · a35ac496

botbw authored Sep 30, 2025

* [CI] optimize CI time

* [CI] fix transpose && format

* [misc] apply coderabbit suggestions && fix typo

a35ac496

[Example] Specify a fixed commit for the flash-linear-attention repository and... · 3ad6202d

Lei Wang authored Sep 30, 2025

[Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples (#913)

- Updated the requirements.txt to specify a fixed commit for the flash-linear-attention repository.
- Refactored import paths in benchmark_nsa_fwd.py for better organization.
- Added a new function to generate configurations for autotuning.
- Modified the tilelang_sparse_attention function to accept parameters for block size, number of stages, and threads, enhancing flexibility.
- Changed allocation of shared memory for accumulators to optimize performance.

3ad6202d

[Typo] Fix branch name & link for AscendNPU IR in latest news (#907) · f92de932
Wenhao Xie authored Sep 30, 2025

f92de932

29 Sep, 2025 8 commits

[Example] Add topk into sparse mla example and append some docs (#901) · 6021ef32

Lei Wang authored Sep 30, 2025

* Remove unused `fp8_mqa_logits.py` file and update README.md to reflect new directory structure and file descriptions for deepseek_v32 example. Added sections for architecture overview, Lightning Indexer, Top-k Selector, and Sparse MLA Forward implementations.

* Update linting configurations and improve code formatting in deepseek_v32 example scripts

- Added per-file ignores for the inference directory in `pyproject.toml`.
- Refactored code in `topk_selector.py`, `convert.py`, `generate.py`, `kernel.py`, and `model.py` to enhance readability by adjusting spacing and line breaks.
- Ensured consistent formatting across function definitions and assertions for better clarity.

* Refactor test functions in deepseek_v32 example scripts for improved clarity and consistency

- Updated `fp8_lighting_indexer.py` to define a dedicated test function for the lighting indexer.
- Refactored `sparse_mla_fwd_pipelined.py` and `sparse_mla_fwd.py` to standardize test function parameters and improve readability.
- Enhanced `topk_selector.py` by introducing a test function with parameters for batch size and sequence length.
- Ensured all test functions are invoked correctly in the main execution block.

* Enhance test functions in deepseek_v32 example scripts with CUDA requirements and parameterization

- Added CUDA requirements decorators to `test_example_sparse_mla_fwd` and `test_example_sparse_mla_fwd_pipelined`.
- Parameterized test functions to use specific small shapes for testing, improving test coverage and clarity.

* lint fix

* Update README.md to correct image path for DeepSeek V3.2 architecture diagram

6021ef32

[Bugfix] Fix flops comp and softmax scale in mla (#900) · 16561159
Wenxuan Tan authored Sep 29, 2025
```
* fix flops comp and softmax scale

* format
```
16561159
[CI] Legalize math related test (#899) · 54fc6ba0
Lei Wang authored Sep 30, 2025

54fc6ba0
[Typo] Fix backend name for Huawei Ascend (#898) · d19fe1ae
Wenhao Xie authored Sep 30, 2025
```
* [Typo] Fix backend name for Huawei Ascend chips

* update
```
d19fe1ae

[Example] Add sparse mla examples (#896) · 65ac7454

Lei Wang authored Sep 29, 2025

* Update README.md to include directory structure and file descriptions for deepseek_v32 example

* Refactor and clean up deepseek_v32 example scripts

- Removed unused imports and functions from `fp8_mqa_logits.py` to streamline the code.
- Improved formatting and readability in `sparse_mla_fwd_pipelined.py` and `sparse_mla_fwd.py` by adjusting function signatures and indentation.
- Added `# ruff: noqa` comments to suppress linting warnings in multiple files.
- Enhanced the `generate_random_cu_seqlens` function in `utils.py` for better clarity and organization.
- Updated print statements for consistency in output formatting.

65ac7454

[News] Add announcement of support for Huawei Ascend chips (#895) · 78664e24
Wenhao Xie authored Sep 29, 2025

78664e24

[Example] Add example (#894) · 4424fa9a

Lei Wang authored Sep 29, 2025

* [Refactor] Enhance CopyNode Lower method to support disable_tma flag and improve flash attention implementation

* Updated the CopyNode Lower method to correctly include the disable_tma flag in the GetCopyInst call.
* Refactored the flash attention implementation to selectively disable TMA for specific copy operations while allowing it for others.
* Addressed linting issues for improved code quality

* sparse mla kernels

* Remove deprecated sparse MLA and utility files to streamline the codebase.

4424fa9a

[Layout] fix plot layout (#890) · 6c67a77f
Jiaxing Ding authored Sep 29, 2025

6c67a77f

28 Sep, 2025 2 commits

[Bugfix] Fix CopyNode Lower method to include disable_tma flag in GetCopyInst (#888) · 599264ca

Tong WU authored Sep 29, 2025

* Fix CopyNode Lower method to include disable_tma flag in GetCopyInst call

* Refactor flash attention implementation to disable TMA for specific copy and allow TMA for other operations

* attempt to fix lint

599264ca

[SM100] Add sm100 GEMM layouts and tcgen05 support (#887) · f58bcd43

Zhiwen Mo authored Sep 28, 2025

* update sm100 related utcmma, tmem, ld/st256 in src
* update sm100 related utcmma, tmem, ld/st256 in tilelang
* Remove deprecated GEMM examples and related README documentation for SM100 architecture support
* Update GEMM implementation to replace UTCMMA with TCGEN5MMA across relevant files
* Remove gemm_umma.py example and update README to reflect TCGEN5MMA terminology changes
* Update README.md for gemm_sm100 example by removing outdated API sections and streamlining documentation
* Update README and source files to reflect TCGEN5.MMA terminology changes
* Refactor CUDA GEMM header for improved readability

f58bcd43

26 Sep, 2025 5 commits

[Layout] Introduce Flexible Parallel to Support T.serial and local buffers... · c382dcbc

Lei Wang authored Sep 27, 2025


[Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop (#844)

* Support T.serial and local buffers inside T.Parallel loop.

* Fix reducer layout in T.Parallel nested inside other loops

* Debug output with LOG(INFO)

* Add disable option for WGMMA.

* fix

* Use DLOG; fix missing registration for new pass config

* bug fix

* lint fix

* Enhance GEMM instruction set with UTCMMA and improve local buffer handling in casting example

* Update format.sh shebang, improve logging in layout inference, and enhance buffer store wrapper with detailed comments

* Enhance GEMM instantiation logic and improve layout inference for local buffer detection

- Updated the GEMM instantiation logic to include a check for WGMMA compatibility, ensuring that the conditions for using WGMMA are more robust.
- Refined the layout inference process to better identify when loops manipulate only local buffers, improving the accuracy of thread binding decisions in parallel loops.

---------
Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>

c382dcbc

[Example] Optimize sink attention forward via swizzled layout and report benchmark results (#885) · bf67fb19

Tong WU authored Sep 27, 2025



* Enhance attention sink examples with swizzled layout and performance metrics

- Added `make_swizzled_layout` annotations for shared tensors in the `flashattn` function across MHA and GQA examples to optimize memory access patterns.
- Updated benchmark outputs to include speedup calculations comparing Triton and TileLang implementations.

* Add README for Attention Sink example with algorithm details and benchmark results

- Introduced a new README.md file for the Attention Sink example, outlining the forward and backward algorithms, including the computation of `dsinks`.
- Provided benchmark results comparing performance metrics of the optimized implementation against Triton, highlighting speedup across various configurations.

* Update README.md for Attention Sink example to include link to Triton implementation

* Update examples/attention_sink/README.md
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* typo

---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

bf67fb19

[Dist] Provide an option to include commit ID in version (#884) · c861d8a2

Lei Wang authored Sep 26, 2025

* Update MANIFEST.in and setup.py to include commit ID in versioning and adjust included files

- Modified MANIFEST.in to include shared library files `libtvm.so` and `libtvm_runtime.so`.
- Updated setup.py to conditionally include the commit ID in the package version based on the `WITH_COMMITID` environment variable.
- Enhanced versioning logic in version.py to use a truncated commit ID for better compatibility.

* Update setup.py and related scripts to enable commit ID inclusion in package metadata

- Changed the default value of the `WITH_COMMITID` environment variable in setup.py to "True".
- Updated tox.ini to set `WITH_COMMITID` to "TRUE" for the testing environment and "FALSE" for the build environment.
- Modified pypi_distribution.sh to pass `WITH_COMMITID=FALSE` during the wheel build process.

* Update MANIFEST.in to include additional files and directories for packaging

- Added VERSION, CMakeLists.txt, and various requirements files to the package.
- Included recursive inclusion of source files and third-party libraries, while excluding specific clang and llvm directories.

c861d8a2

[Precision] Introduce `T.ieee_rsqrt` and related high precision op (#882) · a58bf9b6

Lei Wang authored Sep 26, 2025

* Add fast math operations for CUDA: exp, exp10, log, log2, log10, tan, cos, and sin (#865)

* Refactor fast math operation definitions for consistency and readability in CUDA code. Consolidated multiple definitions into single lines and improved formatting in related test files for better clarity.

* Remove unnecessary pass configurations for warp specialization and TMA lowering in fast math operation tests for CUDA. This simplifies the test setup while maintaining the focus on fast math functionality.

* Update fastmath tests to reflect that tl.* intrinsics generate no fastmath versions and disable cache in main execution.

* Fix formatting in fastmath test comments for clarity on tl.* intrinsics behavior.

* Add precision comparison tool for CUDA operations

This commit introduces a new Python script and CUDA source file for a precision comparison tool that evaluates the accuracy of various CUDA operations (including division, reciprocal, exponential, logarithmic, and trigonometric functions) across different implementations: CUDA Precise, CUDA Fast, Triton, Triton LibDevice, and TileLang. The tool generates test data, executes the operations, and summarizes the error statistics for each implementation against a double precision reference. Additionally, a README file is added to document the results of the comparisons for various operations.

* Add precision comparison tool for CUDA operations

This commit introduces a new precision comparison tool implemented in Python and CUDA, designed to evaluate the accuracy of various mathematical operations (division, reciprocal, exponential, logarithmic, trigonometric, square root, etc.) across different frameworks including CUDA Precise/Fast, Triton, Triton LibDevice, PyTorch, and TileLang. The tool includes functionality for generating test data, executing operations, and summarizing error statistics for each implementation. Additionally, it provides a comprehensive README with error metrics for each operation tested.

* Add IEEE-compliant mathematical operations and refactor fast math module

This commit introduces new high precision mathematical operations including ieee_add, ieee_sub, ieee_mul, ieee_fmaf, ieee_frcp, ieee_fsqrt, ieee_frsqrt, and ieee_fdiv to the TileLang framework. The fast math module has been refactored to remove the deprecated fastmath.py file and update the import paths accordingly. Additionally, the CUDA code generation has been enhanced to support these new operations, ensuring compatibility with IEEE standards for floating-point arithmetic.

* debug removed

* Refactor IEEE math tests for improved readability and consistency

This commit enhances the formatting of the `test_ieee_math.py` and `test_mathops_fastmath.py` files by adjusting line breaks for better clarity. It also removes unnecessary comments and ensures that the main execution of tests is streamlined. These changes aim to improve the overall maintainability of the test code.

* Update README.md to enhance formatting of precision comparison results

This commit reformats the precision comparison results in the README.md file, converting the error statistics tables into a more structured markdown format. This change improves readability and accessibility of the data for various mathematical operations across different implementations, including FP32 Precise, Triton, TileLang, and CUDA.

a58bf9b6

[Example] Add efficient attention sink backward implementations and tests (#877) · ec24561a

Tong WU authored Sep 26, 2025

* [Example] Add a new example to support attention sink for MHA

- Introduced a new example script for multi-head attention (MHA) with sliding window attention and sink tokens.
- Added a reference attention function to validate the implementation against PyTorch.
- Included argument parsing for command-line execution of the example.

* [Example] Replace MHA sink forward example with updated implementation

- Removed the old example script for multi-head attention (MHA) with sliding window attention and sink tokens.
- Introduced a new example script that modifies the attention mechanism to enhance performance and maintainability.
- Updated argument parsing and reference functions to align with the new implementation.

* Enhance MHA sink example with sliding window support

- Added a `window_size` parameter to the `flashattn` function to enable sliding window attention.
- Implemented assertions to ensure `window_size` is compatible with `block_N`.
- Updated the main function to include a `tune` option for performance tuning.
- Introduced a new test file to validate both full attention and sliding window scenarios.
- Adjusted FLOPS calculation to account for the sliding window configuration.

* lint

* [Fix] Add checkinf process to fix the bug of swa

* Migrate to BSHD layout to align with triton baselines

* lint

* fix typo

* Refactor MHA sink example to use seq_q and seq_kv parameters to accommodate the new sequence length parameters.

* Add GQA sink example for optimized attention mechanism & lint fix

* fix several typos and bugs

* lint

* fix speed issues of swa

* Add flash attention example with backward pass for BHSD layout and corresponding test cases

* Add backward pass implementation for flash attention with sinks and corresponding test case

* fix lint and typo

* Optimze the calculation of `dsinks`

* Add support for swa backward and update examples

* fix previous typos

* Add example for GQA sink backward pass and update tests for both MHA and GQA sinks

* fix lint

* fix previous typos

* typo

ec24561a