"src/vscode:/vscode.git/clone" did not exist on "ff35fc088173cf3a2bd5af994ee907ddeb6c2f6b"
  1. 04 Oct, 2025 1 commit
  2. 02 Oct, 2025 2 commits
    • [Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa
      Zhiwen Mo authored
      * Implements tcgen05.ld instruction support for copying from shared.tmem
        to local.fragment on SM100/Blackwell architecture. Adds layout inference
        and lowering logic for tensor memory operations with proper physical
        coordinate range analysis and warpgroup alignment checks.
      
        Changes:
        - Add kTMemLoad and kTMemStore to CopyInst enumeration
        - Implement CheckTMemLoad() and CheckTMemStore() validation functions
        - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
        - Add tmem layout inference in InferLayout() using expandTcgen05Layout
        - Support multiple instruction variants (32dp32b/64b/128b/256b)
        - Add physical layout bounds analysis for tmem coordinates
        - Change clear_accum from bool to PrimExpr in GEMM operations
        - Fix std::optional access checks in layout_inference.cc
        - Add tmem_allocate/deallocate PTX intrinsic support
        - Fix cooperative_groups grid.sync() code generation
      
      * fix
      
      * pipeline fix
      
      * bug fix
      
      * bool fix
    • [Layout] Strict annotate completed replicated layout for fragment with constant index (#929) · fc4bd452
      Lei Wang authored
      * [Layout] Add IsCompletedReplicated method and enhance layout inference in ParallelOpNode
      
      - Introduced IsCompletedReplicated method in FragmentNode to check if a buffer is fully replicated.
      - Enhanced InferLayout in ParallelOpNode to handle layout inference for replicated buffers, ensuring only fragment[0] access is allowed.
      - Updated error handling for non-zero index access in fragment buffers to improve robustness.
      
      * [Layout] Improve code formatting and readability in layout.cc and parallel.cc
      
      - Enhanced formatting in FragmentNode's IsCompletedReplicated method for better clarity.
      - Updated InferLayout method in ParallelOpNode to improve code readability by adjusting line breaks and indentation.
      - Ensured consistent formatting across conditional statements and comments for improved maintainability.
      
      * updt
      
      * optimize const index related op
      
      * bug fix
      
      * reduce gdn test
      
      * test fix
      
      * lintfix
      
      * lint fix
      
      * test fix
  3. 01 Oct, 2025 2 commits
    • [Example] Add MLA decode ws example (#928) · 8150e47e
      Yu Cheng authored
    • [CI] Refactor import paths in dequantization examples to use dequantize_utils (#914) · 9d382973
      Lei Wang authored
      * Update requirements and refactor benchmark script for deepseek_nsa example
      
      - Updated the requirements.txt to specify a fixed commit for the flash-linear-attention repository.
      - Refactored import paths in benchmark_nsa_fwd.py for better organization.
      - Added a new function to generate configurations for autotuning.
      - Modified the tilelang_sparse_attention function to accept parameters for block size, number of stages, and threads, enhancing flexibility.
      - Changed allocation of shared memory for accumulators to optimize performance.
      
      * Refactor import paths in dequantization examples to use dequantize_utils
      
      - Updated import statements in multiple dequantization example scripts to replace references to the removed utils.py file with the new dequantize_utils module.
      - Ensured consistency across example scripts for better organization and maintainability.
  4. 30 Sep, 2025 2 commits
    • [CI] optimize CI time for sparse gemm (#906) · a35ac496
      botbw authored
      * [CI] optimize CI time
      
      * [CI] fix transpose && format
      
      * [misc] apply coderabbit suggestions && fix typo
    • [Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples (#913) · 3ad6202d
      Lei Wang authored
      
      - Updated the requirements.txt to specify a fixed commit for the flash-linear-attention repository.
      - Refactored import paths in benchmark_nsa_fwd.py for better organization.
      - Added a new function to generate configurations for autotuning.
      - Modified the tilelang_sparse_attention function to accept parameters for block size, number of stages, and threads, enhancing flexibility.
      - Changed allocation of shared memory for accumulators to optimize performance.
  5. 29 Sep, 2025 5 commits
    • [Example] Add topk into sparse mla example and append some docs (#901) · 6021ef32
      Lei Wang authored
      * Remove unused `fp8_mqa_logits.py` file and update README.md to reflect new directory structure and file descriptions for deepseek_v32 example. Added sections for architecture overview, Lightning Indexer, Top-k Selector, and Sparse MLA Forward implementations.
      
      * Update linting configurations and improve code formatting in deepseek_v32 example scripts
      
      - Added per-file ignores for the inference directory in `pyproject.toml`.
      - Refactored code in `topk_selector.py`, `convert.py`, `generate.py`, `kernel.py`, and `model.py` to enhance readability by adjusting spacing and line breaks.
      - Ensured consistent formatting across function definitions and assertions for better clarity.
      
      * Refactor test functions in deepseek_v32 example scripts for improved clarity and consistency
      
      - Updated `fp8_lighting_indexer.py` to define a dedicated test function for the lighting indexer.
      - Refactored `sparse_mla_fwd_pipelined.py` and `sparse_mla_fwd.py` to standardize test function parameters and improve readability.
      - Enhanced `topk_selector.py` by introducing a test function with parameters for batch size and sequence length.
      - Ensured all test functions are invoked correctly in the main execution block.
      
      * Enhance test functions in deepseek_v32 example scripts with CUDA requirements and parameterization
      
      - Added CUDA requirements decorators to `test_example_sparse_mla_fwd` and `test_example_sparse_mla_fwd_pipelined`.
      - Parameterized test functions to use specific small shapes for testing, improving test coverage and clarity.
      
      * lint fix
      
      * Update README.md to correct image path for DeepSeek V3.2 architecture diagram
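      The test refactor above gates the deepseek_v32 examples on CUDA and pins small shapes. Below is a hedged sketch of that pattern only; `tilelang.testing.requires_cuda` is assumed to behave like `tvm.testing.requires_cuda`, and the helper is a stand-in, not the example's real entry point.

      ```python
      import torch
      import tilelang.testing   # assumed to provide requires_cuda, as in the upstream test suite

      def run_topk_selector(batch: int, seq_len: int, topk: int):
          # stand-in reference computation; the real example builds a TileLang kernel
          scores = torch.randn(batch, seq_len, device="cuda")
          _, idx = torch.topk(scores, topk, dim=-1)
          assert idx.shape == (batch, topk)

      @tilelang.testing.requires_cuda          # skip on machines without CUDA
      def test_topk_selector_small_shape():
          # small batch/seq keep CI runtime low, as the commit message describes
          run_topk_selector(batch=1, seq_len=256, topk=64)
      ```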
    • [Bugfix] Fix flops comp and softmax scale in mla (#900) · 16561159
      Wenxuan Tan authored
      * fix flops comp and softmax scale
      
      * format
    • [Example] Add sparse mla examples (#896) · 65ac7454
      Lei Wang authored
      * Update README.md to include directory structure and file descriptions for deepseek_v32 example
      
      * Refactor and clean up deepseek_v32 example scripts
      
      - Removed unused imports and functions from `fp8_mqa_logits.py` to streamline the code.
      - Improved formatting and readability in `sparse_mla_fwd_pipelined.py` and `sparse_mla_fwd.py` by adjusting function signatures and indentation.
      - Added `# ruff: noqa` comments to suppress linting warnings in multiple files.
      - Enhanced the `generate_random_cu_seqlens` function in `utils.py` for better clarity and organization.
      - Updated print statements for consistency in output formatting.
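      For context on the `generate_random_cu_seqlens` helper mentioned above, here is a hedged sketch of what such a utility typically computes (cumulative sequence offsets for variable-length attention); the example's actual implementation may differ.

      ```python
      import torch

      def generate_random_cu_seqlens(batch: int, max_seqlen: int, device: str = "cuda"):
          # random per-sample lengths, then an exclusive prefix sum as offsets
          lens = torch.randint(1, max_seqlen + 1, (batch,), device=device, dtype=torch.int32)
          cu_seqlens = torch.zeros(batch + 1, device=device, dtype=torch.int32)
          cu_seqlens[1:] = torch.cumsum(lens, dim=0)
          return cu_seqlens
      ```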
    • [Example] Add example (#894) · 4424fa9a
      Lei Wang authored
      * [Refactor] Enhance CopyNode Lower method to support disable_tma flag and improve flash attention implementation
      
      * Updated the CopyNode Lower method to correctly include the disable_tma flag in the GetCopyInst call.
      * Refactored the flash attention implementation to selectively disable TMA for specific copy operations while allowing it for others.
      * Addressed linting issues for improved code quality
      
      * sparse mla kernels
      
      * Remove deprecated sparse MLA and utility files to streamline the codebase.
    • [Layout] fix plot layout (#890) · 6c67a77f
      Jiaxing Ding authored
  6. 28 Sep, 2025 2 commits
    • [Bugfix] Fix CopyNode Lower method to include disable_tma flag in GetCopyInst (#888) · 599264ca
      Tong WU authored
      * Fix CopyNode Lower method to include disable_tma flag in GetCopyInst call
      
      * Refactor flash attention implementation to disable TMA for specific copy and allow TMA for other operations
      
      * attempt to fix lint
    • [SM100] Add sm100 GEMM layouts and tcgen05 support (#887) · f58bcd43
      Zhiwen Mo authored
      * update sm100 related utcmma, tmem, ld/st256 in src
      * update sm100 related utcmma, tmem, ld/st256 in tilelang
      * Remove deprecated GEMM examples and related README documentation for SM100 architecture support
      * Update GEMM implementation to replace UTCMMA with TCGEN5MMA across relevant files
      * Remove gemm_umma.py example and update README to reflect TCGEN5MMA terminology changes
      * Update README.md for gemm_sm100 example by removing outdated API sections and streamlining documentation
      * Update README and source files to reflect TCGEN5.MMA terminology changes
      * Refactor CUDA GEMM header for improved readability
  7. 26 Sep, 2025 3 commits
    • [Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop (#844) · c382dcbc
      Lei Wang authored
      
      * Support T.serial and local buffers inside T.Parallel loop.
      
      * Fix reducer layout in T.Parallel nested inside other loops
      
      * Debug output with LOG(INFO)
      
      * Add disable option for WGMMA.
      
      * fix
      
      * Use DLOG; fix missing registration for new pass config
      
      * bug fix
      
      * lint fix
      
      * Enhance GEMM instruction set with UTCMMA and improve local buffer handling in casting example
      
      * Update format.sh shebang, improve logging in layout inference, and enhance buffer store wrapper with detailed comments
      
      * Enhance GEMM instantiation logic and improve layout inference for local buffer detection
      
      - Updated the GEMM instantiation logic to include a check for WGMMA compatibility, ensuring that the conditions for using WGMMA are more robust.
      - Refined the layout inference process to better identify when loops manipulate only local buffers, improving the accuracy of thread binding decisions in parallel loops.
      
      ---------
      Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
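      A minimal sketch of the pattern #844 enables: a local buffer and a T.serial loop nested inside a T.Parallel loop. It assumes the usual TileLang kernel skeleton (T.Kernel / T.alloc_shared / T.Tensor annotations); names, shapes, and the reduction itself are illustrative, not taken from the PR.

      ```python
      import tilelang
      import tilelang.language as T

      def row_sum(M=128, K=64, block=128, dtype="float32"):
          @T.prim_func
          def main(A: T.Tensor((M, K), dtype), B: T.Tensor((M,), dtype)):
              with T.Kernel(T.ceildiv(M, block), threads=block) as bx:
                  A_s = T.alloc_shared((block, K), dtype)
                  T.copy(A[bx * block, 0], A_s)
                  for i in T.Parallel(block):
                      tmp = T.alloc_local((1,), dtype)   # local buffer inside T.Parallel
                      tmp[0] = 0.0
                      for k in T.serial(K):              # serial loop inside T.Parallel
                          tmp[0] += A_s[i, k]
                      B[bx * block + i] = tmp[0]
          return main
      ```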
    • [Example] Optimize sink attention forward via swizzled layout and report benchmark results (#885) · bf67fb19
      Tong WU authored
      
      
      * Enhance attention sink examples with swizzled layout and performance metrics
      
      - Added `make_swizzled_layout` annotations for shared tensors in the `flashattn` function across MHA and GQA examples to optimize memory access patterns.
      - Updated benchmark outputs to include speedup calculations comparing Triton and TileLang implementations.
      
      * Add README for Attention Sink example with algorithm details and benchmark results
      
      - Introduced a new README.md file for the Attention Sink example, outlining the forward and backward algorithms, including the computation of `dsinks`.
      - Provided benchmark results comparing performance metrics of the optimized implementation against Triton, highlighting speedup across various configurations.
      
      * Update README.md for Attention Sink example to include link to Triton implementation
      
      * Update examples/attention_sink/README.md
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * Update examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * typo
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
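      A hedged sketch of the swizzled-layout annotation described above. `make_swizzled_layout` and `T.annotate_layout` are used here as in other TileLang attention examples; the import path, buffer names, and tile sizes are assumptions.

      ```python
      import tilelang.language as T
      from tilelang.layout import make_swizzled_layout   # import path assumed

      def attention_tile(block_M=64, block_N=64, dim=64, dtype="float16"):
          @T.prim_func
          def main(Q: T.Tensor((block_M, dim), dtype), K: T.Tensor((block_N, dim), dtype)):
              with T.Kernel(1, threads=128):
                  Q_s = T.alloc_shared((block_M, dim), dtype)
                  K_s = T.alloc_shared((block_N, dim), dtype)
                  # swizzle the shared tiles to avoid bank conflicts, as the PR does
                  T.annotate_layout({
                      Q_s: make_swizzled_layout(Q_s),
                      K_s: make_swizzled_layout(K_s),
                  })
                  T.copy(Q, Q_s)
                  T.copy(K, K_s)
          return main
      ```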
    • [Example] Add efficient attention sink backward implementations and tests (#877) · ec24561a
      Tong WU authored
      * [Example] Add a new example to support attention sink for MHA
      
      - Introduced a new example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Added a reference attention function to validate the implementation against PyTorch.
      - Included argument parsing for command-line execution of the example.
      
      * [Example] Replace MHA sink forward example with updated implementation
      
      - Removed the old example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Introduced a new example script that modifies the attention mechanism to enhance performance and maintainability.
      - Updated argument parsing and reference functions to align with the new implementation.
      
      * Enhance MHA sink example with sliding window support
      
      - Added a `window_size` parameter to the `flashattn` function to enable sliding window attention.
      - Implemented assertions to ensure `window_size` is compatible with `block_N`.
      - Updated the main function to include a `tune` option for performance tuning.
      - Introduced a new test file to validate both full attention and sliding window scenarios.
      - Adjusted FLOPS calculation to account for the sliding window configuration.
      
      * lint
      
      * [Fix] Add checkinf process to fix the bug of swa
      
      * Migrate to BSHD layout to align with triton baselines
      
      * lint
      
      * fix typo
      
      * Refactor MHA sink example to use seq_q and seq_kv parameters to accommodate the new sequence length parameters.
      
      * Add GQA sink example for optimized attention mechanism & lint fix
      
      * fix several typos and bugs
      
      * lint
      
      * fix speed issues of swa
      
      * Add flash attention example with backward pass for BHSD layout and corresponding test cases
      
      * Add backward pass implementation for flash attention with sinks and corresponding test case
      
      * fix lint and typo
      
      * Optimize the calculation of `dsinks`
      
      * Add support for swa backward and update examples
      
      * fix previous typos
      
      * Add example for GQA sink backward pass and update tests for both MHA and GQA sinks
      
      * fix lint
      
      * fix previous typos
      
      * typo
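      The sliding-window bookkeeping described above (the window must tile evenly by block_N, and the FLOPS estimate shrinks with the window) can be sketched as follows; the formula and defaults are illustrative assumptions, not copied from the example.

      ```python
      def attention_flops(batch, heads, seq_q, seq_kv, dim, block_N=128, window_size=None):
          """Rough FLOP count for the two attention GEMMs (QK^T and PV), 2 FLOPs per MAC."""
          if window_size is not None:
              # mirrors the assertion the example adds for sliding-window attention
              assert window_size % block_N == 0, "window_size must be a multiple of block_N"
              kv_per_query = min(window_size, seq_kv)
          else:
              kv_per_query = seq_kv
          return 4 * batch * heads * seq_q * kv_per_query * dim
      ```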
  8. 25 Sep, 2025 1 commit
    • [Language] Support atomic add with ret (#870) · aa0b1090
      Lei Wang authored
      * Add atomic operations for CUDA templates in new atomic.h file
      
      - Introduced atomic functions including AtomicMax, AtomicMin, AtomicAdd, and their return variants for various data types.
      - Implemented support for half, bfloat16, and float types with appropriate memory ordering.
      - Moved atomic-related utilities from common.h to the new atomic.h file for better organization.
      - Added Python bindings for atomic operations in tilelang, including atomic_max, atomic_min, atomic_add, and their vectorized counterparts.
      - Updated customize.py to utilize the new atomic functions, enhancing modularity and maintainability.
      
      * Refactor atomic operations in CUDA templates for improved readability
      
      - Reformatted atomic operation implementations in atomic.h for better code clarity.
      - Adjusted function signatures in tilelang's atomic.py to enhance readability by aligning parameters.
      - Cleaned up unnecessary whitespace and comments in customize.py to streamline the codebase.
      
      * Add thread storage synchronization configuration option
      
      - Introduced a new configuration option `tl.disable_thread_storage_sync` to control the automatic insertion of thread synchronization barriers in shared memory access.
      - Updated the `ThreadSync` pass to check this configuration and bypass synchronization if disabled.
      - Enhanced documentation in `builtin.h` and `pass_config.py` to clarify the purpose and usage of the new option.
      
      * Refactor thread storage sync configuration retrieval
      
      - Simplified the retrieval of the thread storage sync configuration in the `ThreadSync` pass by removing unnecessary intermediate variables.
      - Ensured that the inclusion of `builtin.h` is consistent by moving it to the appropriate location in the file.
      
      * test fix
      
      * Update atomic operations and tests for improved functionality
      
      - Updated atomic operations in CUDA templates to remove unnecessary address_of calls, enhancing performance and readability.
      - Refactored atomic operation signatures in tilelang's atomic.py to accept references instead of pointers.
      - Added new atomic operations and corresponding test cases for atomic add, max, min, and load/store functionalities in the testing suite.
      - Updated the TVM subproject to the latest commit for better compatibility.
      
      * Update attention sink examples to use 32 heads
      
      - Modified the `heads` parameter in both `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py` and `example_mha_sink_fwd_bhsd_wgmma_pipelined.py` from 1 to 32 to enhance performance in attention mechanisms.
      - Ensured consistency across example scripts for improved usability and testing.
      
      * Refactor atomic add handling in vectorization
      
      - Simplified the extraction of buffer loads for atomic add operations by removing unnecessary address_of calls, improving code clarity and performance.
      - Updated the data type retrieval for vectorization size calculation to directly access the buffer load node, enhancing efficiency.
      
      * Add loop break functionality and enhance thread synchronization
      
      - Introduced a new `loop_break` function in `customize.py` to allow breaking out of loops, returning a call to the `tl.loop_break` intrinsic.
      - Updated the `sync_threads` function in `builtin.py` to accept optional parameters for `barrier_id` and `arrive_count`, improving its flexibility for thread synchronization.
      - Added necessary imports in `__init__.py` to include the new `loop_break` function for broader accessibility.
      
      * test fix
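      A hedged usage sketch for the atomic intrinsics listed above. The entry states that `T.atomic_add` now takes a buffer reference rather than a pointer; how the "with ret" variant is spelled is not shown there, so the sketch sticks to the plain form. The `tl.disable_thread_storage_sync` key is quoted from the entry, but wiring it through `pass_configs` is an assumption.

      ```python
      import tilelang
      import tilelang.language as T

      # the new pass config from this PR would be threaded through pass_configs, e.g.
      # @tilelang.jit(pass_configs={"tl.disable_thread_storage_sync": True})
      def global_sum(N=4096, block=128, dtype="float32"):
          @T.prim_func
          def main(A: T.Tensor((N,), dtype), S: T.Tensor((1,), dtype)):
              with T.Kernel(T.ceildiv(N, block), threads=block) as bx:
                  for i in T.Parallel(block):
                      # reference-style call (this PR switches the API from pointers to references)
                      T.atomic_add(S[0], A[bx * block + i])
          return main
      ```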
  9. 23 Sep, 2025 3 commits
    • [Example] Add examples to support efficient attention sink forward process (#853) · d9a171ce
      Tong WU authored
      
      
      * [Example] Add a new example to support attention sink for MHA
      
      - Introduced a new example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Added a reference attention function to validate the implementation against PyTorch.
      - Included argument parsing for command-line execution of the example.
      
      * [Example] Replace MHA sink forward example with updated implementation
      
      - Removed the old example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Introduced a new example script that modifies the attention mechanism to enhance performance and maintainability.
      - Updated argument parsing and reference functions to align with the new implementation.
      
      * Enhance MHA sink example with sliding window support
      
      - Added a `window_size` parameter to the `flashattn` function to enable sliding window attention.
      - Implemented assertions to ensure `window_size` is compatible with `block_N`.
      - Updated the main function to include a `tune` option for performance tuning.
      - Introduced a new test file to validate both full attention and sliding window scenarios.
      - Adjusted FLOPS calculation to account for the sliding window configuration.
      
      * lint
      
      * [Fix] Add checkinf process to fix the bug of swa
      
      * Migrate to BSHD layout to align with triton baselines
      
      * lint
      
      * fix typo
      
      * Refactor MHA sink example to use seq_q and seq_kv parameters to accommodate the new sequence length parameters.
      
      * Add GQA sink example for optimized attention mechanism & lint fix
      
      * fix several typos and bugs
      
      * lint
      
      * fix speed issues of swa
      
      * Update examples/attention_sink/example_gqa_sink_fwd_bhsd_wgmma_pipelined.py
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      
      * Update examples/attention_sink/example_mha_sink_fwd_bhsd_wgmma_pipelined.py
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      
      ---------
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    • [Bugfix] Ensure correct handling for cases where `seq_q<seq_kv` in flash attention examples (#864) · b12a63cf
      Tong WU authored
      * fix flash attention examples for `seqlen_q<seqlen_kv` cases
      
      * lint
  10. 22 Sep, 2025 2 commits
    • [AMD][MLA] Fix mla autotune for rocm (#861) · 3b21a67d
      Lei Wang authored
      * Refactor matmul example to include ReLU activation and update batch size in benchmark script
      
      * lint fix
      
      * Enhance autotuning capabilities in benchmark script and update argument defaults
      
      - Introduced a new `get_configs` function to generate autotuning configurations for the benchmark.
      - Updated the default batch size and kv context length in the argument parser for improved performance.
      - Renamed the `--auto_tune` argument to `--autotune` for consistency.
      - Modified the kernel invocation logic to support autotuning based on the new configurations.
      
      * lint fix
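      The `get_configs` helper and the renamed `--autotune` flag mentioned above follow a common pattern in these benchmark scripts; the tuned dimensions and candidate values below are illustrative assumptions, not the script's actual search space.

      ```python
      import argparse
      import itertools

      def get_configs():
          # cartesian product of candidate kernel parameters for the autotuner
          block_N = [64, 128]
          num_stages = [1, 2, 3]
          threads = [128, 256]
          return [
              {"block_N": n, "num_stages": s, "threads": t}
              for n, s, t in itertools.product(block_N, num_stages, threads)
          ]

      parser = argparse.ArgumentParser()
      parser.add_argument("--batch", type=int, default=128)
      parser.add_argument("--autotune", action="store_true")  # renamed from --auto_tune
      args = parser.parse_args()
      configs = get_configs() if args.autotune else None
      ```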
    • [Doc] Optimize the quickstart guide for clarity and not just for CUDA (#858) · 058a670b
      Lei Wang authored
      * Refactor matmul example to include ReLU activation and update batch size in benchmark script
      
      * lint fix
  11. 18 Sep, 2025 1 commit
    • [Refactor] Turn off `ENABLE_FAST_MATH` by default (#846) · e7e38355
      Lei Wang authored
      * [Enhancement] Enable fast math optimization in tilelang JIT configurations
      
      - Updated multiple examples and kernel functions to include `pass_configs` for enabling fast math optimization.
      - Added support for the `TL_ENABLE_FAST_MATH` configuration option in the built-in operations.
      - Enhanced the `LibraryGenerator` to handle the new fast math configuration, ensuring compatibility with existing settings.
      - Updated documentation to reflect the changes in fast math handling and deprecation of the `TL_DISABLE_FAST_MATH` option.
      
      * lint fix
      
      * [Refactor] Introduce deprecated_warning utility for improved deprecation handling
      
      - Added a new `deprecated_warning` function to streamline deprecation messages.
      - Updated the `LibraryGenerator` to utilize the new function for warning about the deprecated `TL_DISABLE_FAST_MATH` configuration.
      - Enhanced the `deprecated` decorator to support phaseout version messaging, improving clarity for users.
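      With fast math now off by default, examples opt back in through `pass_configs` on the JIT decorator, as described above. The key string below is an assumption based on the `TL_ENABLE_FAST_MATH` option named in the entry, and the kernel is a toy.

      ```python
      import tilelang
      import tilelang.language as T

      @tilelang.jit(pass_configs={"tl.enable_fast_math": True})  # key string assumed
      def exp_kernel(M=1024, N=1024, block=128, dtype="float16"):
          @T.prim_func
          def main(A: T.Tensor((M, N), dtype), B: T.Tensor((M, N), dtype)):
              with T.Kernel(T.ceildiv(M, block), threads=128) as bx:
                  for i, j in T.Parallel(block, N):
                      B[bx * block + i, j] = T.exp(A[bx * block + i, j])
          return main
      ```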
  12. 17 Sep, 2025 1 commit
    • [Enhancement] Add a MXFP4 grouped GEMM example for FusedMoE (#811) · 8554cb01
      Tong WU authored
      
      
      * [Enhancement] Enhance dequantization examples and utilities
      
      - Added a new example for grouped matrix multiplication with experts in `example_dequant_groupgemm_bf16_mxfp4_hopper.py`.
      - Improved dequantization logic in existing examples by replacing nested loops with vectorized operations for better performance.
      - Updated `torch_convert_bit_twiddling` function in `utils.py` to utilize parallel processing, enhancing efficiency and clarity in the conversion process.
      Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
      
      * fix typos in docstrings
      
      * remove redundant code
      
      * [Format] Unreproducible debug with T.print
      
      * [BugFix] Correct dtype in ref dequantize; larger data distribution
      
      * [Format]
      
      * [Refactor] Clean up and optimize example_dequant_groupgemm_bf16_mxfp4_hopper.py and utils.py
      
      - Removed unnecessary cache disabling and manual seed setting in the example.
      - Simplified nested loops into parallelized operations for better readability and performance.
      - Updated the assertion function in utils.py to print detailed error messages.
      - Adjusted tensor sizes in examples
      
      * [Refactor] Update import path in example_dequant_gemm_fine_grained.py
      
      - Changed the import statement for `_tir_packed_to_unsigned_convert` from `bitblas.quantization` to `tilelang.quantize` to reflect the new module structure.
      
      * lint
      
      * rename and add test
      
      * lint
      
      * [Feature] Enhance autotuning and configuration generation in example_dequant_groupedgemm_bf16_mxfp4_hopper.py
      
      - Added a new function `get_configs()` to generate hyperparameter configurations for tuning.
      - Updated the `matmul` function to utilize autotuning with the new configurations.
      - Improve kernel performance via vectorization and threadblock swizzle.
      - Enhanced the main function to support the new autotuning inputs and updated parameters for better performance.
      
      * lint
      
      * fix typo
      
      * fix typo and lint
      
      * make ci format check happy
      
      * fix ci
      
      ---------
      Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
      Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
  13. 16 Sep, 2025 2 commits
    • [Example] Remove redundant param (#821) · 907c3ff0
      botbw authored
    • [Example] add w4a8 gemm kernel (#815) · 4bcb1593
      Cunxiao Ni authored
      * [Bugfix] fix autotune bug
      
      * [Example] add w4a8 gemm kernel
      
      * fix lint: pinned the version of `ml_dtypes`
      The version of ml_dtypes should be pinned in the dependency specification. If the version of ml_dtypes is too low, it may result in errors such as fp4 not being defined.
      
      * Renames example for dequantization GEMM
      
      * format
      
      * add w4a8 example to ci
      
      * fix lint
  14. 15 Sep, 2025 1 commit
    • [feat] support gemm_sp for ampere and ada arch (#691) · 0b3683bf
      botbw authored
      
      
      * [feat] add an example mma atom
      
      * [fix] fix typo naming
      
      * [feat] add a template to enable compilation
      
      * [feat] add print util
      
      * [WIP] pass on single block tile
      
      * [feat] add sm80 metadata layout
      
      * [chore] clean codebase
      
      * [CI] format.sh
      
      * [feat] add sm80 compress utils
      
      * [bugfix] fix C fragment layout
      
      * [refactor] use nvcc version instead of str
      
      * [test] add test cases
      
      * [chore] add a param check
      
      * [chore] format a bit
      
      * [chore] rename func to satisfy PEP 8 and appease gemini
      
      * [chore] add check
      
      * [feat] support sm75 layout && add assertion && chore
      
      * [bug] fix illegal memory access when using two warps over N=32
      
      This could be a missing check related to the cutlass 2.x implementation.
      Using the cutlass example can't trigger this because it's bypassed by
      padding the input.

      For now I think it might be safe to increase the atom size and
      investigate in the future.
      
      * [chore] add example
      
      * [chore] format
      
      * [example] update benchmark
      
      * [bugfix] fix namespace and format
      
      * [bugfix] fix incorrect param passing
      
      * [refactor] update variable declaration for clarity in gemm_layouts and gemm_sp
      
      * [Cleanup] Remove unnecessary blank lines in metadata layout functions in gemm_sp.py
      
      * [CI] fix arch
      
      * [example] add torch sparse benchmark
      
      * [misc] polish && add reference && apply review suggestions && format
      
      * [CI] format with clang-tidy
      
      * [Cleanup] Format and align template struct definitions in half.hpp, common.h, and gemm_sp_sm80.h
      
      * [Update] Modify CUDA version requirements in test_gemm_sp_sm80 and mark cutlass subproject as dirty
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
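      As a companion to the "torch sparse benchmark" mentioned above, here is a hedged sketch of a PyTorch 2:4 semi-structured baseline (`to_sparse_semi_structured` is from `torch.sparse` in recent PyTorch; the shapes, mask construction, and tolerances are illustrative assumptions).

      ```python
      import torch
      from torch.sparse import to_sparse_semi_structured

      M, N, K = 4096, 4096, 4096
      # build a weight that already satisfies 2:4 sparsity along K (keep 2 of every 4)
      mask = torch.tensor([1, 1, 0, 0], dtype=torch.bool, device="cuda").repeat(M, K // 4)
      A = torch.randn(M, K, dtype=torch.float16, device="cuda") * mask
      B = torch.randn(K, N, dtype=torch.float16, device="cuda")

      A_sp = to_sparse_semi_structured(A)   # compresses values + metadata (Ampere/Ada+)
      C = A_sp @ B                          # dispatches to the sparse GEMM kernel
      C_ref = A @ B
      torch.testing.assert_close(C, C_ref, rtol=1e-2, atol=1e-2)
      ```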
  15. 13 Sep, 2025 1 commit
  16. 11 Sep, 2025 1 commit
  17. 05 Sep, 2025 1 commit
  18. 04 Sep, 2025 1 commit
    • [AMD] Fix amd tir&add examples (#784) · f07f31c1
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Enhance AMD example script and update CI workflows
      
      - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
      - Added new CI workflows for AMD and documentation publishing.
      - Updated various requirements files to include necessary dependencies.
      - Introduced new test cases and examples for better coverage and functionality.
      - Refactored existing code for improved readability and maintainability.
      
      * Remove redundant tool cache cleanup step in AMD CI workflow
      
      * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
      
      * Add new AMD FlashAttention example and test script
      
      - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
      - Added `test.sh` script to facilitate running the new example with specified parameters.
      - Enhanced the overall structure and organization of the example for better clarity and usability.
      
      * Update configurations in `example_amd_flash_attn_fwd.py` for autotuner
      
      - Reduced the number of threads and `num_split_q` options for improved performance.
      - Adjusted `panel_size` options to streamline configuration settings.
      
      * Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217
      
      * Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c
      
      * Add example for AMD Flash Attention backward pass implementation
      
      - Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
      - Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
      - Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
      - Included reference implementation for validation against PyTorch's attention mechanism.
      
      This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.
      
      * Enhance AMD Flash Attention example with additional testing capabilities
      
      - Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
      - Improved the main function to allow for better parameter configuration and benchmarking.
      - Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.
      
      This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.
      
      * Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
      
      * Refactor HIP intrinsic rules to CUDA
      
      - Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
      - Adjusted include paths for better organization and clarity in the code structure.
      
      * Update AMD CI workflow to uninstall specific PyTorch packages before installation
      
      - Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
      - Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.
      
      * Remove unused shared memory allocations in AMD Flash Attention backward example
      
      - Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
      - This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.
      
      * Remove unnecessary pip uninstall command from AMD CI workflow
      
      - Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
      - This change simplifies the CI process and reduces potential overhead during package management.
      
      * Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules
      
      - Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
      - Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.
      
      * Refactor formatting of HIP intrinsic rule registrations
      
      - Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
      - No functional changes were made; this update focuses on code style improvements to enhance maintainability.
      
      * Update file name and documentation for HIP intrinsic rules
      
      - Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
      - Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.
      
      * Enhance DispatchHIPShuffle function with clang-analyzer comments
      
      - Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
      - This change improves code clarity and maintains compliance with static analysis tools.
      
      * lint fix
      
      * fix
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
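      The example above is described as validating against PyTorch's attention; a hedged sketch of such a reference check is below (tensor layout and tolerances are assumptions, not taken from the example).

      ```python
      import torch
      import torch.nn.functional as F

      def check_against_torch(kernel_out, q, k, v, causal=True):
          # q, k, v: (batch, heads, seq, dim) half-precision tensors on the GPU
          ref = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
          torch.testing.assert_close(kernel_out, ref.to(kernel_out.dtype), rtol=1e-2, atol=1e-2)
      ```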
  19. 02 Sep, 2025 1 commit
  20. 28 Aug, 2025 3 commits
    • [MXFP4] Add 1D TMA copy for Scale tensor in MXFP4 GEMM (#766) · ea548301
      Zhengju Tang authored
      * [TMA] Add 1D TMA copy for Scale tensor
      
      * [Lint]
      
      * [Test] Add test for kernel
      
      * [BugFix]
    • [Example] Add vertical slash sparse attention pattern (#762) · 37051417
      Wenhao Xie authored
      * upd sparse attn
      
      * lint
      
      * rename
      
      * update test file
      
      * update benchmark
      
      * lint
      
      * update benchmark
    • [Feature] Add 1D TMA support (#761) · 1774a1aa
      Zhengju Tang authored
      
      
      * [Feature] Add 1D TMA support
      - Check the contiguous conditions of 1D TMA copy
      - Add new interface and params order of `tma_load` and `tma_store` call
      - Add 1D `tma_store` interface in sm90 template
      - Add elementwise kernel for 1D TMA example
      
      * [Lint]
      
      * [BugFix] Add conditions for 1D TMA copy on non-swizzle shared tensors
      
      * [Lint]
      
      * [BugFix] 1D TMA load
      
      * [README] Update GDN README for clarity and add acknowledgements (#758)
      
      - Improved formatting and clarity of the GDN kernel implementation description.
      - Updated requirement section to list dependencies in a clearer format.
      - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.
      
      * cutlass v4.2.0 supporting cuda 13 (#760)
      
      * [Lint]
      
      * [Lint]
      
      * [MXFP4] Add test for bf16&mxfp4 gemm
      
      * [BugFix]
      
      * [Lint]
      
      ---------
      Co-authored-by: Yu Cheng <54519279+chengyupku@users.noreply.github.com>
      Co-authored-by: Johnny <johnnync13@gmail.com>
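      A hedged sketch of the kind of elementwise kernel the 1D TMA path above targets: contiguous 1-D global-to-shared copies on non-swizzled shared tensors. The T.copy start-index form for 1-D buffers and the tile sizes are assumptions.

      ```python
      import tilelang
      import tilelang.language as T

      def elementwise_add(N=1 << 20, block=4096, dtype="float16"):
          @T.prim_func
          def main(A: T.Tensor((N,), dtype), B: T.Tensor((N,), dtype), C: T.Tensor((N,), dtype)):
              with T.Kernel(T.ceildiv(N, block), threads=128) as bx:
                  A_s = T.alloc_shared((block,), dtype)
                  B_s = T.alloc_shared((block,), dtype)
                  T.copy(A[bx * block], A_s)   # contiguous 1-D tile: candidate for 1D TMA
                  T.copy(B[bx * block], B_s)
                  for i in T.Parallel(block):
                      C[bx * block + i] = A_s[i] + B_s[i]
          return main
      ```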
  21. 25 Aug, 2025 1 commit
    • [README] Update GDN README for clarity and add acknowledgements (#758) · e0cf5fee
      Yu Cheng authored
      - Improved formatting and clarity of the GDN kernel implementation description.
      - Updated requirement section to list dependencies in a clearer format.
      - Added an acknowledgements section to credit the developers and the Xiaomi LLM-Core Team for their contributions.
  22. 24 Aug, 2025 1 commit
  23. 23 Aug, 2025 1 commit
  24. 22 Aug, 2025 1 commit
    • [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746) · 5c11d245
      Lei Wang authored
      * [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy
      
      * Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
      * Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
      * Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.
      
      * lint fix
      
      * Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.
      
      * remove bulk copy
      
      * Refactor copy and atomic add operations to support TMA lower configuration
      
      - Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
      - Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
      - Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
      - Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
      - Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.
      
      * Enhance TMA bulk copy logic in `LowerBulkCopy` method
      
      - Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
      - Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.
      
      * lint fix
      
      * Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.
      
      * Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions
      
      - Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
      - Added TODO comments to indicate the need for further improvements in shared memory handling.
      
      * Update `native_sparse_attention` function to include TMA configuration options
      
      - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
      - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.
      
      * Refactor JIT decorator formatting in `native_sparse_attention` function
      
      - Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
      - No functional changes were made; this update focuses on code clarity and maintainability.
      
      * Enhance thread management and logging in TileLang compilation
      
      - Added a method to check if printing is enabled during compilation, improving control over logging behavior.
      - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
      - Added comments to clarify the purpose of changes and improve code readability.
      
      * Add warp specialization scope and refactor register management in TileLang
      
      - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
      - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
      - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
      - Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.
      
      * Refactor test for InjectSetMaxNReg pass in TileLang
      
      - Improved readability by restructuring conditional checks and assertions in the test cases.
      - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
      - Ensured consistent formatting and spacing throughout the test functions for better maintainability.
      
      * Enhance bulk copy and store checks in `Copy` class
      
      - Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
      - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
      - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.
      
      * lint fix
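      The workaround applied above to the reshape and native_sparse_attention examples threads `pass_configs` through compilation to turn off TMA lowering and warp specialization. A hedged sketch follows; the key strings are assumed to use the usual `tl.*` spellings, and the kernel is a toy reshape.

      ```python
      import tilelang
      import tilelang.language as T

      @tilelang.jit(
          pass_configs={
              "tl.disable_tma_lower": True,        # skip TMA lowering for this kernel
              "tl.disable_warp_specialized": True, # and skip warp specialization
          })
      def reshape_2d_to_1d(M=64, N=64, dtype="float16"):
          @T.prim_func
          def main(A: T.Tensor((M, N), dtype), B: T.Tensor((M * N,), dtype)):
              with T.Kernel(1, threads=128):
                  A_s = T.alloc_shared((M, N), dtype)
                  T.copy(A, A_s)
                  for i, j in T.Parallel(M, N):
                      B[i * N + j] = A_s[i, j]
          return main
      ```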