Commits · fa4fd0b73fb1b9ad36b960f1881fc883674cd005 · OpenDAS / tilelang

24 Sep, 2025 1 commit
- [Parser] Adapt Parser to work with Python 3.8 in some cases (#869) · fa4fd0b7
  Lei Wang authored Sep 24, 2025
  
  fa4fd0b7
23 Sep, 2025 6 commits

[Example] Add examples to support efficient attention sink forward process (#853) · d9a171ce

Tong WU authored Sep 23, 2025



* [Example] Add a new example to support attention sink for MHA

- Introduced a new example script for multi-head attention (MHA) with sliding window attention and sink tokens.
- Added a reference attention function to validate the implementation against PyTorch.
- Included argument parsing for command-line execution of the example.

* [Example] Replace MHA sink forward example with updated implementation

- Removed the old example script for multi-head attention (MHA) with sliding window attention and sink tokens.
- Introduced a new example script that modifies the attention mechanism to enhance performance and maintainability.
- Updated argument parsing and reference functions to align with the new implementation.

* Enhance MHA sink example with sliding window support

- Added a `window_size` parameter to the `flashattn` function to enable sliding window attention.
- Implemented assertions to ensure `window_size` is compatible with `block_N`.
- Updated the main function to include a `tune` option for performance tuning.
- Introduced a new test file to validate both full attention and sliding window scenarios.
- Adjusted FLOPS calculation to account for the sliding window configuration.

* lint

* [Fix] Add a checkinf step to fix the sliding window attention (SWA) bug

* Migrate to BSHD layout to align with triton baselines

* lint

* fix typo

* Refactor MHA sink example to take separate `seq_q` and `seq_kv` sequence-length parameters.

* Add GQA sink example for optimized attention mechanism & lint fix

* fix several typos and bugs

* lint

* fix speed issues of swa

* Update examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update examples/attention_sink/example_mha_sink_fwd_bhsd_wgmma_pipelined.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

d9a171ce
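
The sliding window and sink behaviour described in this commit can be sketched in plain Python. This is a minimal illustration and not the example script itself. It assumes one common sink formulation, in which an extra per-head logit joins the softmax normalisation but contributes no value, and the names `visible`, `softmax_with_sink` and `attend_row` are hypothetical.

```python
import math
from typing import List, Optional


def visible(q_pos: int, k_pos: int, window_size: Optional[int]) -> bool:
    """Causal visibility, optionally limited to the last `window_size` keys."""
    if k_pos > q_pos:
        return False
    return window_size is None or k_pos > q_pos - window_size


def softmax_with_sink(scores: List[float], sink_logit: float) -> List[float]:
    """Softmax whose denominator also includes exp(sink_logit) (assumed formulation)."""
    m = max(max(scores), sink_logit)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]


def attend_row(q_pos: int, scores: List[float],
               window_size: Optional[int], sink_logit: float) -> List[float]:
    """Attention weights of one query row over the keys it is allowed to see."""
    kept = [s for k, s in enumerate(scores) if visible(q_pos, k, window_size)]
    return softmax_with_sink(kept, sink_logit)
```

Because the sink absorbs part of the probability mass, the weights over real keys sum to less than one.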

[Autotune][Conv] optimize convolution examples to use autotune (#866) · b4483090
Lei Wang authored Sep 23, 2025

b4483090

[Layout] Support layout forward with multi dimension (#867) · 9cbbbbc6

Lei Wang authored Sep 23, 2025

* Enhance LayoutNode::Forward method to handle variable transformations more robustly

- Updated the method to check for a minimum number of input dimensions.
- Introduced a mechanism to transform the last InputDim() elements of the input variables.
- Concatenated transformed variables with the remaining input variables for a comprehensive output.

* Refactor LayoutNode::Forward method for improved readability

- Removed unnecessary whitespace to enhance code clarity.
- Maintained existing functionality while streamlining the transformation process of input variables.

9cbbbbc6
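
The multi-dimension forward described above can be sketched as follows. This is an illustrative Python model, not the C++ `LayoutNode::Forward` code. It assumes the untouched leading indices stay in front of the transformed ones, and `forward`, `input_dim` and `index_map` are hypothetical names.

```python
from typing import Callable, List, Sequence


def forward(indices: Sequence[int], input_dim: int,
            index_map: Callable[..., Sequence[int]]) -> List[int]:
    """Apply `index_map` to the last `input_dim` indices and keep the rest as-is."""
    if len(indices) < input_dim:
        raise ValueError(
            f"expected at least {input_dim} indices, got {len(indices)}")
    split = len(indices) - input_dim
    leading, trailing = indices[:split], indices[split:]
    # Leading (e.g. batch) indices pass through; only the tail is transformed.
    return list(leading) + list(index_map(*trailing))


# A hypothetical 2-D layout that tiles the first index by 8.
def tile_by_8(i: int, j: int) -> Sequence[int]:
    return (i // 8, j, i % 8)
```

For instance, `forward([5, 17, 3], 2, tile_by_8)` leaves the leading `5` alone and maps `(17, 3)` to `(2, 3, 1)`.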

Add fast sine and cosine definitions in common.h for CUDA templates (#865) · 86aaf3c1
Tong WU authored Sep 23, 2025

86aaf3c1
[AMD] refactor MatrixCoreIntrinEmitter (#860) · 48c9a352
Jiaxing Ding authored Sep 23, 2025

48c9a352
[Bugfix] Ensure correct handling for cases where `seq_q<seq_kv` in flash attention examples (#864) · b12a63cf
Tong WU authored Sep 23, 2025
```
* fix flash attention examples for `seqlen_q<seqlen_kv` cases

* lint
```
b12a63cf
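
When `seq_q < seq_kv`, a causal mask has to say which key each query lines up with. The sketch below assumes the bottom-right-aligned convention, where the last query sees every key. The commit does not state which convention the examples adopt, so treat this as one plausible reading. `causal_mask` is a hypothetical helper.

```python
from typing import List


def causal_mask(seq_q: int, seq_kv: int) -> List[List[bool]]:
    """mask[q][k] is True when query q may attend to key k."""
    # Assumption: queries are aligned with the *end* of the key sequence.
    offset = seq_kv - seq_q
    return [[k <= q + offset for k in range(seq_kv)] for q in range(seq_q)]
```

With `seq_q == seq_kv` the offset is zero and this reduces to the usual lower-triangular mask.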

22 Sep, 2025 4 commits

[AMD][MLA] Fix mla autotune for rocm (#861) · 3b21a67d

Lei Wang authored Sep 23, 2025

* Refactor matmul example to include ReLU activation and update batch size in benchmark script

* lint fix

* Enhance autotuning capabilities in benchmark script and update argument defaults

- Introduced a new `get_configs` function to generate autotuning configurations for the benchmark.
- Updated the default batch size and kv context length in the argument parser for improved performance.
- Renamed the `--auto_tune` argument to `--autotune` for consistency.
- Modified the kernel invocation logic to support autotuning based on the new configurations.

* lint fix

3b21a67d
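
A `get_configs` helper of the kind mentioned above is often written as a Cartesian product over a small search space. The following is a sketch under that assumption. The parameter names and values are hypothetical and are not taken from the benchmark script.

```python
import itertools
from typing import Dict, List


def get_configs() -> List[Dict[str, int]]:
    """Enumerate every combination of the tunable parameters."""
    # Hypothetical search space; a real benchmark would choose its own knobs.
    space = {
        "block_N": [32, 64],
        "num_split": [1, 2],
        "threads": [128, 256],
    }
    keys = list(space)
    return [dict(zip(keys, values))
            for values in itertools.product(*space.values())]
```

Each returned dictionary can then be handed to an autotuner as one candidate configuration.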

[TMA] Bugfix when a shared buffer is both issued with tma store and tma load (#857) · b9a51c43

Lei Wang authored Sep 22, 2025

- Updated `init_desc_arg_map` to use `Var` as the key instead of `String` in `lower_hopper_intrin.cc`.
- Enhanced `func_call_args` method in `TLCUDASourceWrapper` to accept additional parameters for better argument mapping.
- Added assertions to ensure consistency between function parameters and arguments during kernel launches.
- Modified `generate_tma_descriptor_args` to utilize a mapping of variable names for TMA descriptor initialization.

b9a51c43

[Doc] Optimize the quickstart guide for clarity and not just for CUDA (#858) · 058a670b
Lei Wang authored Sep 22, 2025
```
* Refactor matmul example to include ReLU activation and update batch size in benchmark script

* lint fix
```
058a670b
[Analyzer] Enhance ConstIntBoundAnalyzer and IntervalSet with modular set analysis (#856) · bd168654
Lei Wang authored Sep 22, 2025

bd168654

21 Sep, 2025 1 commit

[PATCH] Static libg++ linking fix (#854) · a3497ebc

Lei Wang authored Sep 22, 2025

* bump version to 0.1.6

* phaseout py38

* py39

* Update submodule 'tvm' to latest commit adc0e48

* [Build] Update CMake and Python environment settings

- Added static linking flags for GCC and libstdc++ in CMakeLists.txt to enhance library linking.
- Removed the cmake version requirement from pyproject.toml to allow for broader compatibility.
- Updated the tox command in the Docker distribution script to include Python 3.8 for testing environments.

* [Build] Update Python version requirements in scripts and documentation

- Changed Python version requirement in README.md from 3.9+ to 3.8+.
- Updated installation and testing scripts to use Python 3.8 instead of 3.9, ensuring compatibility with the new minimum version.
- Adjusted tox commands in local and PyPI distribution scripts to include Python 3.8 in the testing environments.

* [Build] Update Python and CMake requirements in Dockerfile and pyproject.toml

- Added CMake version requirement (>=3.26) to pyproject.toml for build compatibility.
- Created a Python 3.8 environment in the Dockerfile and added a symlink for easier access to the Python 3.8 executable.

* [Build] Update CMake and Dockerfile for improved compatibility

- Removed static linking flags from CMakeLists.txt to simplify build configuration.
- Updated Dockerfile to use Ubuntu 20.04 and streamlined the installation of dependencies, removing gcc-9 and g++-9.
- Adjusted symlink creation for Python environments to use the `-sf` option for safer linking.

* [Build] Bump version to 0.1.6.post1 for post-release updates

* [Build] Remove static linking flags from CMakeLists.txt

- Eliminated static linking flags for GCC and libstdc++ to simplify build configuration and avoid potential conflicts with Python extensions.

* [Build] Update Docker distribution scripts for manylinux compatibility

- Changed base image from `tilelang-builder:18.04` to `tilelang-builder:manylinux` in both local and PyPI distribution scripts.
- Updated Dockerfile references to use `pypi.manylinux.Dockerfile`.
- Added `--gpus all` flag to the Docker run command to enable GPU support during execution.

* lint fix

* add cmake

a3497ebc

19 Sep, 2025 3 commits

[Release] Bump Version to 0.1.6 (#818) · 1ad6e461

Lei Wang authored Sep 19, 2025

* bump version to 0.1.6

* phaseout py38

* py39

* Update submodule 'tvm' to latest commit adc0e48

* [Build] Update CMake and Python environment settings

- Added static linking flags for GCC and libstdc++ in CMakeLists.txt to enhance library linking.
- Removed the cmake version requirement from pyproject.toml to allow for broader compatibility.
- Updated the tox command in the Docker distribution script to include Python 3.8 for testing environments.

* [Build] Update Python version requirements in scripts and documentation

- Changed Python version requirement in README.md from 3.9+ to 3.8+.
- Updated installation and testing scripts to use Python 3.8 instead of 3.9, ensuring compatibility with the new minimum version.
- Adjusted tox commands in local and PyPI distribution scripts to include Python 3.8 in the testing environments.

* [Build] Update Python and CMake requirements in Dockerfile and pyproject.toml

- Added CMake version requirement (>=3.26) to pyproject.toml for build compatibility.
- Created a Python 3.8 environment in the Dockerfile and added a symlink for easier access to the Python 3.8 executable.

1ad6e461

[Refactor] Enhance buffer store transformation in TIR pass (#851) · 094e2298

Lei Wang authored Sep 19, 2025

- Updated the `AddWrapperForSingleBufStore` function to improve the handling of buffer stores by adding detailed checks for fragment buffer accesses and ensuring only index 0 is used.
- Introduced new helper functions for collecting buffer accesses and indices, enhancing code readability and maintainability.
- Refined the logic for determining tile operations and thread bindings to ensure accurate transformations without affecting existing parallel structures.

094e2298

[Py38] Revert typing and parser updates for Python 3.8 compatibility (#850) · bc9623fc

Lei Wang authored Sep 19, 2025

* Update submodule TVM to commit 872e32c1 and adjust type hints in nvcc.py and utils.py for compatibility with Python typing standards.

* Update requirements.txt to list ml_dtypes without a version constraint, while noting that fp4 support needs a version newer than 0.5.1.

bc9623fc

18 Sep, 2025 6 commits

[TIR] Refactor division simplification in RewriteSimplifier (#849) · 8cc2ab22
Lei Wang authored Sep 18, 2025

8cc2ab22
[Typing] Fallback from Python 3.10+ type syntax for compatibility (#848) · c36a7eee
Lei Wang authored Sep 18, 2025

c36a7eee
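
As an illustration of this kind of fallback, the sketch below writes a signature with `typing` generics instead of the Python 3.10+ `X | Y` syntax. The function `normalize_arch` is hypothetical and only serves to show the annotation style.

```python
from typing import List, Optional, Union

# Python 3.10+ spelling (fails at definition time on older interpreters):
#     def normalize_arch(arch: str | int | None = None) -> list[str]: ...


def normalize_arch(arch: Optional[Union[str, int]] = None) -> List[str]:
    """Return the architecture as a list of strings; empty when unset."""
    if arch is None:
        return []
    return [str(arch)]
```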
[AMD] fix bf16x2 dtype codegen (#847) · 6efeb743
Jiaxing Ding authored Sep 18, 2025

6efeb743

[Refactor] Turn off `ENABLE_FAST_MATH` by default (#846) · e7e38355

Lei Wang authored Sep 18, 2025

* [Enhancement] Enable fast math optimization in tilelang JIT configurations

- Updated multiple examples and kernel functions to include `pass_configs` for enabling fast math optimization.
- Added support for the `TL_ENABLE_FAST_MATH` configuration option in the built-in operations.
- Enhanced the `LibraryGenerator` to handle the new fast math configuration, ensuring compatibility with existing settings.
- Updated documentation to reflect the changes in fast math handling and deprecation of the `TL_DISABLE_FAST_MATH` option.

* lint fix

* [Refactor] Introduce deprecated_warning utility for improved deprecation handling

- Added a new `deprecated_warning` function to streamline deprecation messages.
- Updated the `LibraryGenerator` to utilize the new function for warning about the deprecated `TL_DISABLE_FAST_MATH` configuration.
- Enhanced the `deprecated` decorator to support phaseout version messaging, improving clarity for users.

e7e38355
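
A `deprecated_warning` helper and a `deprecated` decorator with phaseout messaging could look roughly like this. The signatures are assumptions made for illustration. The commit names the utilities but does not show their parameters.

```python
import functools
import warnings
from typing import Any, Callable, Optional


def deprecated_warning(name: str, replacement: str,
                       phaseout_version: Optional[str] = None,
                       stacklevel: int = 2) -> None:
    """Emit one consistently worded DeprecationWarning."""
    message = f"`{name}` is deprecated; use `{replacement}` instead."
    if phaseout_version is not None:
        message += f" It will be removed in version {phaseout_version}."
    warnings.warn(message, DeprecationWarning, stacklevel=stacklevel)


def deprecated(replacement: str,
               phaseout_version: Optional[str] = None) -> Callable:
    """Decorator that warns on every call of the wrapped function."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # stacklevel=3 attributes the warning to the caller of `func`.
            deprecated_warning(func.__name__, replacement,
                               phaseout_version, stacklevel=3)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated("add_one", phaseout_version="0.2.0")
def old_add_one(x: int) -> int:
    return x + 1
```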

[CI] Test Fix: Handle BufferLoad nodes when T.gemm input has a stride (#843) · ebea77d9
Lei Wang authored Sep 18, 2025
```
* bugfix

* fix

* test fix
```
ebea77d9

[Refactor] Refactor some build related configurations (#827) · 232782dd

Lei Wang authored Sep 18, 2025

* bugfix

* [Build] Update build dependencies and Dockerfile configuration

- Updated `pyproject.toml` and `requirements-build.txt` to specify Cython version as `Cython>=3.0.0`.
- Removed unnecessary dependencies from the build system.
- Enhanced `pypi.Dockerfile` to install gcc-9 and g++-9, and added ninja-build for improved build performance.
- Updated conda environment creation to include Python 3.9 to 3.12, while removing the Python 3.8 environment.

* cmake fix

* fix

* fix

232782dd

17 Sep, 2025 5 commits

[CMake] Added support for statically linked system libc library (#825) · 2f7dc52e
Lei Wang authored Sep 17, 2025

2f7dc52e

[Enhancement] Add a MXFP4 grouped GEMM example for FusedMoE (#811) · 8554cb01

Tong WU authored Sep 17, 2025



* [Enhancement] Enhance dequantization examples and utilities

- Added a new example for grouped matrix multiplication with experts in `example_dequant_groupgemm_bf16_mxfp4_hopper.py`.
- Improved dequantization logic in existing examples by replacing nested loops with vectorized operations for better performance.
- Updated `torch_convert_bit_twiddling` function in `utils.py` to utilize parallel processing, enhancing efficiency and clarity in the conversion process.
Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>

* fix typos in docstrings

* remove redundant code

* [Format] Unreproducible debug with T.print

* [BugFix] Correct dtype in ref dequantize; larger data distribution

* [Format]

* [Refactor] Clean up and optimize example_dequant_groupgemm_bf16_mxfp4_hopper.py and utils.py

- Removed unnecessary cache disabling and manual seed setting in the example.
- Simplified nested loops into parallelized operations for better readability and performance.
- Updated the assertion function in utils.py to print detailed error messages.
- Adjusted tensor sizes in examples

* [Refactor] Update import path in example_dequant_gemm_fine_grained.py

- Changed the import statement for `_tir_packed_to_unsigned_convert` from `bitblas.quantization` to `tilelang.quantize` to reflect the new module structure.

* lint

* rename and add test

* lint

* [Feature] Enhance autotuning and configuration generation in example_dequant_groupedgemm_bf16_mxfp4_hopper.py

- Added a new function `get_configs()` to generate hyperparameter configurations for tuning.
- Updated the `matmul` function to utilize autotuning with the new configurations.
- Improved kernel performance via vectorization and threadblock swizzle.
- Enhanced the main function to support the new autotuning inputs and updated parameters for better performance.

* lint

* fix typo

* fix typo and lint

* make ci format check happy

* fix ci

---------
Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>

8554cb01

[Bugfix] Skip fp4 dtype binding when using older versions of ml_dtypes (#824) · e4a346fe
Lei Wang authored Sep 17, 2025
```
* bug fix when git is not installed

* ml_dtypes_fix
```
e4a346fe
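
One way to skip a dtype binding on older `ml_dtypes` releases is to bind only the attributes the installed module actually exposes. The sketch below uses stand-in namespaces so it runs without `ml_dtypes`. The helper name `bind_available_dtypes` is hypothetical, and `float4_e2m1fn` is assumed to be the fp4 attribute name in recent releases.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def bind_available_dtypes(module: Any, names: Iterable[str]) -> Dict[str, Any]:
    """Return {name: dtype} for the names that `module` actually defines."""
    return {name: getattr(module, name)
            for name in names if hasattr(module, name)}


# Stand-ins for an older and a newer ml_dtypes; real code would pass the
# imported module instead.
old_ml_dtypes = SimpleNamespace(bfloat16="bf16", float8_e4m3fn="fp8")
new_ml_dtypes = SimpleNamespace(bfloat16="bf16", float8_e4m3fn="fp8",
                                float4_e2m1fn="fp4")

WANTED = ("bfloat16", "float8_e4m3fn", "float4_e2m1fn")
```

With the older stand-in, the fp4 entry is simply absent from the result instead of raising an `AttributeError`.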
[Bugfix] Bug fix when git command is not installed (#823) · a57f8270
Lei Wang authored Sep 17, 2025

a57f8270
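
A defensive lookup of the current commit can be sketched as below, assuming the failure mode is that the `git` executable is missing. `get_git_commit_id` and its `git` parameter are hypothetical names, not the project's own helper.

```python
import subprocess
from typing import Optional


def get_git_commit_id(cwd: Optional[str] = None,
                      git: str = "git") -> Optional[str]:
    """Return the current commit hash, or None when it cannot be determined."""
    try:
        out = subprocess.check_output([git, "rev-parse", "HEAD"],
                                      cwd=cwd, stderr=subprocess.DEVNULL)
    except (OSError, subprocess.CalledProcessError):
        # OSError covers a missing executable; CalledProcessError covers
        # running outside a git repository.
        return None
    return out.decode().strip()
```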
[DSL] Support python ternary if then else expression (#822) · 15479958
Lei Wang authored Sep 17, 2025
```
* support python ternary if then else expression

* lint fix
```
15479958
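
To illustrate what supporting a ternary in a Python-embedded DSL involves, the sketch below uses the standard `ast` module to rewrite `a if cond else b` into a call to a hypothetical `if_then_else(cond, a, b)` intrinsic. It is not the TileLang parser, only a model of the general technique.

```python
import ast
from typing import Any, Dict, List

calls: List[tuple] = []  # records each lowered ternary, for inspection


def if_then_else(cond: Any, then_value: Any, else_value: Any) -> Any:
    """Hypothetical intrinsic; unlike Python, both branches are evaluated."""
    calls.append((cond, then_value, else_value))
    return then_value if cond else else_value


class TernaryLowering(ast.NodeTransformer):
    def visit_IfExp(self, node: ast.IfExp) -> ast.AST:
        self.generic_visit(node)  # lower nested ternaries first
        call = ast.Call(func=ast.Name(id="if_then_else", ctx=ast.Load()),
                        args=[node.test, node.body, node.orelse],
                        keywords=[])
        return ast.copy_location(call, node)


def evaluate(src: str, env: Dict[str, Any]) -> Any:
    """Parse an expression, lower its ternaries, then evaluate it."""
    tree = TernaryLowering().visit(ast.parse(src, mode="eval"))
    ast.fix_missing_locations(tree)
    code = compile(tree, "<dsl>", "eval")
    return eval(code, {"if_then_else": if_then_else}, env)
```

Calling `evaluate("a if a > b else b", {"a": 3, "b": 5})` goes through the intrinsic and no longer uses Python's own short-circuiting conditional.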

16 Sep, 2025 3 commits

[Example] Remove redundant param (#821) · 907c3ff0
botbw authored Sep 16, 2025

907c3ff0
[CI] fix rocm ci (#819) · d3e75b70
Cunxiao Ni authored Sep 16, 2025
```
* [CI] fix rocm ci

* Trigger CI
```
d3e75b70

[Example] add w4a8 gemm kernel (#815) · 4bcb1593

Cunxiao Ni authored Sep 16, 2025

* [Bugfix] fix autotune bug

* [Example] add w4a8 gemm kernel

* fix lint: pinned the version of `ml_dtypes`
Pin ml_dtypes in the dependency specification. With too old a version, errors such as fp4 not being defined can occur.

* Renames example for dequantization GEMM

* format

* add w4a8 example to ci

* fix lint

4bcb1593

15 Sep, 2025 4 commits

[Refactor] Update TVM subproject and streamline buffer store handling (#816) · 85d1a6b3

Yu Cheng authored Sep 16, 2025

- Updated the TVM subproject to the latest commit for improved functionality.
- Refactored `warp_specialized_rewriter.cc` to replace placeholder implementations for `BlockNode` and `BlockRealizeNode` with proper role filtering, enhancing code clarity and maintainability.
- Ensured consistent handling of the `cp_async_barrier_noinc` function in `builtin.py` by adding a newline at the end of the file.

85d1a6b3

[Refactor] Reopen #794 Fix lower bug when buffer store is not guarded by any tile op (#817) · 5c869bc7
Kurisu authored Sep 16, 2025
```
* [Refactor] Rewrite AddWrapper pass by ir_transform
PyStmtExprVisitor and PyStmtExprMutator seem buggy

* fix lint error
```
5c869bc7

[Refactor] Update TVM subproject and refactor BlockNode handling in... · 8b005226

Yu Cheng authored Sep 16, 2025

[Refactor] Update TVM subproject and refactor BlockNode handling in warp_specialized_rewriter.cc (#812)

* [Feature] Introduce custom warp specialization attribute and enhance warp group register allocation

- Added a new attribute `kCustomWarpSpecialization` to support custom warp specialization in the TileLang framework.
- Updated the `Collect` method in `SetMaxNRegCollector` to handle cases where warp specialization is detected, returning an empty array accordingly.
- Enhanced the `SetMaxNRegInjector` to skip processing when no registers are needed, improving efficiency.
- Modified the `WarpSpecialized` pass to include the new attribute in the function body when warp specialization is enabled, ensuring proper handling in transformations.

* lint

* lint

8b005226

[feat] support gemm_sp for ampere and ada arch (#691) · 0b3683bf

botbw authored Sep 16, 2025



* [feat] add an example mma atom

* [fix] fix typo naming

* [feat] add a template to enable compilation

* [feat] add print util

* [WIP] pass on single block tile

* [feat] add sm80 metadata layout

* [chore] clean codebase

* [CI] format.sh

* [feat] add sm80 compress utils

* [bugfix] fix C fragment layout

* [refactor] use nvcc version instead of str

* [test] add test cases

* [chore] add a param check

* [chore] format a bit

* [chore] rename func to satisfy PEP 8 and appease gemini

* [chore] add check

* [feat] support sm75 layout && add assertion && chore

* [bug] fix illegal memory access when using two warps over N=32

This could be a missing check related to the cutlass 2.x implementation.
Using the cutlass example can't trigger this because it's bypassed by
padding the input.

For now I think it might be safe to increase the atom size and
investigate in the future.

* [chore] add example

* [chore] format

* [example] update benchmark

* [bugfix] fix namespace and format

* [bugfix] fix incorrect param passing

* [refactor] update variable declaration for clarity in gemm_layouts and gemm_sp

* [Cleanup] Remove unnecessary blank lines in metadata layout functions in gemm_sp.py

* [CI] fix arch

* [example] add torch sparse benchmark

* [misc] polish && add reference && apply review suggestions && format

* [CI] format with clang-tidy

* [Cleanup] Format and align template struct definitions in half.hpp, common.h, and gemm_sp_sm80.h

* [Update] Modify CUDA version requirements in test_gemm_sp_sm80 and mark cutlass subproject as dirty

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

0b3683bf

14 Sep, 2025 2 commits

[Fix] Fix lower bug when buffer store is not guarded by any tile op (#794) · f0d66698

Kurisu authored Sep 15, 2025

* [Fix] Fix lower bug when buffer store is not guarded by any tile op

* fix lint error

* Fix typo in  pass

* fix lint error

* Ignore custom thread binding

f0d66698

[Feature] Add ptx_cp_async_barrier_noinc intrinsic and related functionality (#809) · ae9b7063

Yu Cheng authored Sep 14, 2025

- Introduced a new intrinsic `ptx_cp_async_barrier_noinc` for handling the `cp.async.mbarrier.arrive.noinc` operation in TileLang.
- Updated the CUDA code generation to support the new barrier operation.
- Added a corresponding function in the TileLang Python API for ease of use.
- Enhanced the barrier handling in CUDA templates to include the new no-increment operation, improving synchronization capabilities in parallel execution contexts.

ae9b7063

13 Sep, 2025 1 commit
- [Lint] Add ruff config to check for useless spaces (#807) · 5e529522
  Yichen Yan authored Sep 13, 2025
```
* update lint config

* Remove spaces for blank line

* update
```
  5e529522
12 Sep, 2025 2 commits
- Add pytest-durations to requirements for ROCm (#810) · 4d54854b
  alex_xiao authored Sep 12, 2025
  
  4d54854b
- [AMD] support preshuffle weight mfma (#806) · 143b5222
  Jiaxing Ding authored Sep 12, 2025
```
Co-authored-by: Jiaxing Ding <jiaxing.ding@bytedance.com>
```
  143b5222
11 Sep, 2025 2 commits

[AMD] support fp8 T.gemm (#804) · 409ab83d

Tang Xinsheng authored Sep 11, 2025



* [AMD] support fp8 T.gemm

* format

---------
Co-authored-by: tangxinsheng.txs <tangxinsheng.txs@alibaba-inc.com>

409ab83d

[Refactor] Use new namespace and enhance dispatch macros for mma (#801) · b62a0b43

Lei Wang authored Sep 11, 2025

* Refactor CUDA GEMM operations to use new namespace and enhance dispatch macros

- Moved GEMM-related dispatch instructions to the `cute::tl_mma` namespace for better organization.
- Introduced `TL_DISPATCH_MMA` and `TL_DISPATCH_MMA_TEMPLATE` macros to streamline the definition of dispatch instructions for various data types and architectures.
- Updated the handling of CUDA architecture checks to include additional support for newer architectures.
- Improved clarity and maintainability of the code by restructuring the layout and organization of dispatch instructions.
- Ensured consistent usage of tensor views and memory clearing operations across different GEMM implementations.

* Remove deprecated `DispatchInstruction` templates and `tl_mma` namespace from CUDA GEMM implementation. This cleanup enhances code clarity and maintainability by eliminating unused structures and streamlining the overall organization of the GEMM operations.

b62a0b43