Commits · c70b269738b97ee38aac9b7522612893b547eb54 · OpenDAS / tilelang

28 Oct, 2025 1 commit

[BugFix] Implement bfloat16 support in CUDA code generation with min/max... · c70b2697

Tong WU authored Oct 29, 2025

[BugFix] Implement bfloat16 support in CUDA code generation with min/max functions and inf/nan values (#1143)

* Implement bfloat16 support in CUDA code generation with min/max functions and inf/nan values

* refactor

* fix prev typo

* bugfix

* lint

* bugfix

c70b2697
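
As a rough illustration of the inf/nan values this commit refers to, the sketch below shows the bfloat16 bit patterns in plain Python. It assumes the usual bfloat16 layout (the upper 16 bits of an IEEE float32). The helper names are hypothetical and are not taken from the TileLang code generator.

```python
import math
import struct


def float_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to its upper 16 bits (no rounding; illustration only)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def bf16_bits_to_float(bits16: int) -> float:
    """Widen a bfloat16 bit pattern to float32 by zero-filling the low 16 bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


# Special values: exponent bits all ones, mantissa zero (inf) or non-zero (nan).
BF16_POS_INF = 0x7F80
BF16_NEG_INF = 0xFF80
BF16_QUIET_NAN = 0x7FC0

assert math.isinf(bf16_bits_to_float(BF16_POS_INF))
assert math.isnan(bf16_bits_to_float(BF16_QUIET_NAN))
```

Comparisons against NaN are always false, which is one reason min/max on low-precision types usually needs explicit handling rather than a plain `<` comparison.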

27 Oct, 2025 4 commits
- [Bugfix] Correctly construct the argument list for atomic add based on the vector size (#1137) · 7d389a43
  Lei Wang authored Oct 28, 2025
```
* atomic_fix

* atomic_fix
```
  7d389a43
- Add int2 and longlong4 pack functions (#1129) · 4c9da81a
  LJC00118 authored Oct 27, 2025
```
* Remove an incorrect check

* add fp8 pack function

* code lint

* minor fix

* minor fix

* minor fix

* Minor fix

* Minor fix

* add pack function

* code lint

* code lint
```
  4c9da81a
- [Feature]: Add device assert (#1116) · 5475f8e7
  Yuqi Dong authored Oct 27, 2025
```
* update

* update
```
  5475f8e7
- [Enhancement] Add missing `fence_barrier_init` primitive after mbarrier init (#1121) · 17a63976
  Yu Cheng authored Oct 27, 2025
```
* [Enhancement] Add missing `fence_barrier_init` primitive after mbarrier init

* lint
```
  17a63976
25 Oct, 2025 1 commit

[Feature] Add memory_order PTX for vectorized atomic add (#1112) · 59865bdf

Zhengju Tang authored Oct 25, 2025



* [Feature] Add memory_order PTX for vectorized (2x) atomic add

* [Feature] Add memory_order PTX for all vectorized atomic add

* [Lint]

* test

* [BugFix] Fix init optional argument in alloc_var

* bug fix

* bug fix

* lint fix

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

59865bdf

24 Oct, 2025 1 commit
- [Bugfix] Resolve mixed stride dtype issue (inconsistent int32/int64 values) (#1119) · 65c4711f
  Lei Wang authored Oct 24, 2025
```
* fix int32 dtype issue

* lint fix

* lint

* lint fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
```
  65c4711f
23 Oct, 2025 2 commits

[Feature] Enhance vectorized conversion support in CUDA codegen (#1095) · a148d62a

Tong WU authored Oct 23, 2025

* [Feature] Add vectorized float16 and float32 conversion support in CUDA codegen

* Implemented handling for conversions between float16 and float32 types, specifically for vectorized operations using __half22float2 and __float22half2_rn.
* Enhanced the existing code to support both directions of conversion based on the lane count.
* Improved overall type handling in the VisitExpr_ method for better compatibility with TileLang.

* [Feature] Add float32 to float8 conversion support in CUDA codegen

* Implemented handling for conversion from float32 to float8 (E4M3/E5M2) in the VisitExpr_ method.
* Added vectorized conversion support using __nv_cvt_float2_to_fp8x2 for float2 to fp8x2 transformations.
* Enhanced type handling for better compatibility with TileLang, particularly for float8 types.

* lint

* fix a bug

* [Enhancement] Support lanes=4 cases and add unit test for vectorized cast

* lint

* [Feature] Refactor bf16 conversion operations and remove legacy compile flags

* lint

a148d62a
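
The commit above describes converting float16 and float32 values two lanes at a time with `__half22float2` and `__float22half2_rn`. The sketch below only mimics that pairwise pattern on the host using Python's `struct` module and its IEEE binary16 format code `e`. The function names are hypothetical and it does not call any CUDA intrinsic.

```python
import struct


def float2_to_half2(pair):
    """Pack two floats into 4 bytes holding two IEEE binary16 values."""
    return struct.pack("<2e", *pair)


def half2_to_float2(packed):
    """Unpack two binary16 values back into Python floats."""
    return struct.unpack("<2e", packed)


def convert_lanes(values):
    """Round-trip an even-length vector through binary16, two lanes at a time."""
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of lanes")
    out = []
    for i in range(0, len(values), 2):
        out.extend(half2_to_float2(float2_to_half2(values[i:i + 2])))
    return out


# A lanes=4 vector is handled as two independent pairs.
result = convert_lanes([1.0, -2.5, 0.1, 1024.0])
```

Values that are exactly representable in binary16 (such as `1.0` or `-2.5`) survive the round trip unchanged, while `0.1` comes back slightly perturbed.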

[Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic (#1111) · 86c8bb46

Lei Wang authored Oct 23, 2025

* [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic

* Refactored CopyNode::MakeSIMTLoop to handle scalar cases more efficiently by moving the scalar check to the end of the function.
* Updated loop_partition.cc to set a default DataType for thread and vector extents, ensuring compatibility when loop_vars_ is empty.

* lint fix

* remove debug print

86c8bb46

22 Oct, 2025 3 commits
- [Refactor] Use forceinline in `ldmatrix` and update mamba scan kernel (#1104) · 8a5eb569
  Yu Cheng authored Oct 22, 2025
  
  8a5eb569
- [CI][Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow (#1044) · 5683e6a6
  Xuehai Pan authored Oct 22, 2025
```
* [Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow

* chore: update clang-tidy settings

* chore: upgrade clang-format and clang-tidy version

* lint: resolve clang-tidy errors

* [Maint] restore format.sh

* [CI] pre-commit autoupdate

* [Minor] fix `command -v` usage
```
  5683e6a6
- [Refactor] Optimize debug message for parallel inference (#1096) · 151d9e6b
  Lei Wang authored Oct 22, 2025
  
  151d9e6b
21 Oct, 2025 4 commits

[Bugfix] Fix missing host cuTensorMapEncodeIm2col call (#1094) · 5cb5c068
Yu Cheng authored Oct 22, 2025

5cb5c068

[Language] Support tilelang `alloc_var(dtype, init=x)` (#1092) · bddb125e

Lei Wang authored Oct 21, 2025

* - carry existing local-var initializer map into OpaqueBlockLower, reattach it to
    generated Allocates and the PrimFunc attrs
  - thread the map through FlattenBuffer and StorageRewrite so flattened/merged
    allocations keep their tl.local_var_init annotations
  - teach annotation handling to accept scalar initializers, resolve buffers, and merge
    with existing stat

* lint fix

* enhance

* lint fix

* lint fix

bddb125e

[PassConfig] Introduce PassConfig `TL_STORAGE_REWRITE_DETECT_INPLACE` (#1089) · cdc67fc4

Lei Wang authored Oct 21, 2025

* Enable configurable StorageRewrite inplace detection

  - Add kStorageRewriteDetectInplace constant and register the flag with PassContext so C++ code no longer hard-codes the key.
  - Wire StorageRewrite to include TileLang builtin constants and honor the new config toggle when deciding inplace reuse.
  - Document the flag across Python surfaces (PassConfigKey, JIT/autotuner docs) with usage guidance and simplified IR examples.

* lint fix

* add test

* lint fix

cdc67fc4

[BugFix] Add memory order argument for non-vectorized atomic add (#1081) · 1d4b7180

Zhengju Tang authored Oct 21, 2025

* [BugFix] Add memory order argument for non-vectorized atomic add

* [Lint]

* [BugFix] Memory order

* [Lint]

* [BugFix] Argument in cuda template

* [Lint]

1d4b7180

20 Oct, 2025 6 commits

[Enhancement] Update async intrinsic handling in inject_fence_proxy (#1068) · bb8b3cd7

Tong WU authored Oct 21, 2025



* [Enhancement] Update async intrinsic handling in inject_fence_proxy

* Added support for wgmma async intrinsics in IsAsyncIntrinsic function.
* Changed handling of unknown externs to treat them as Generic instead of Async, improving accuracy in proxy kind determination.

* test fix

* Update testing/python/transform/test_tilelang_transform_inject_fence_proxy.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

bb8b3cd7

[Bugfix] Fix missing reg alloc in custom warp specialization (#1084) · f8d3e73e
Yu Cheng authored Oct 21, 2025

f8d3e73e
[Language] Efficient `T.reduce_` with shared memory input/output (#1080) · bc37ea69
Lei Wang authored Oct 20, 2025
```
* Support reduce ss

* lint fix

* test fix

* lint fix
```
bc37ea69
[Feature] Support Reduce operators for bitwise and/or/xor (#1074) · ba410ae3
Zhengju Tang authored Oct 20, 2025
```
* [Feature] Support Reduce operators for bitwise and/or/xor

* [Lint]
```
ba410ae3
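
For readers unfamiliar with bitwise reductions, here is a minimal host-side sketch of the semantics of and/or/xor reduction. The identity values and the `reduce_bitwise` name are assumptions made for illustration. The actual TileLang operators work on device buffers and are not shown here.

```python
from functools import reduce
import operator

# Each reduction pairs an operator with its identity element.
# For Python integers, -1 behaves as "all bits set", the identity for AND.
_BITWISE_OPS = {
    "and": (operator.and_, -1),
    "or": (operator.or_, 0),
    "xor": (operator.xor, 0),
}


def reduce_bitwise(values, op):
    """Fold a sequence of integers with the chosen bitwise operator."""
    fn, identity = _BITWISE_OPS[op]
    return reduce(fn, values, identity)


row = [0b1100, 0b1010, 0b1110]
and_result = reduce_bitwise(row, "and")  # bits set in every element
or_result = reduce_bitwise(row, "or")    # bits set in any element
xor_result = reduce_bitwise(row, "xor")  # bits set an odd number of times
```
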
[Layout] Utilizing IsEqual instead of StructuralEqual (#1073) · 6a388c0e
Lei Wang authored Oct 20, 2025

6a388c0e

[Parallel] Support `T.Parallel` with dynamic extents (#990) · 27701c3d

Lei Wang authored Oct 20, 2025

* Allow dynamic extents in loop partition; warn when layout inversion falls back to NoCheck

* add test and introduce predicate

* test fix

* fix

* enhance

* inverse with level

* test fix

* bug fix

27701c3d
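
The commit above mentions dynamic extents and an added predicate. One common way to handle a loop extent that is only known at run time is to round the trip count up and guard each iteration with a bounds check. The sketch below shows that general idea in plain Python. The function name and the round-robin mapping are assumptions and do not describe TileLang's actual loop partitioning.

```python
def partition_loop(extent, num_threads):
    """Map iterations 0..extent-1 onto threads, guarding the ragged tail."""
    trips = -(-extent // num_threads)  # ceiling division
    assignment = {t: [] for t in range(num_threads)}
    for t in range(num_threads):
        for k in range(trips):
            i = k * num_threads + t
            if i < extent:  # predicate: skip iterations past the dynamic extent
                assignment[t].append(i)
    return assignment


plan = partition_loop(extent=10, num_threads=4)
```

With an extent of 10 and 4 threads, every thread runs 3 trips, but the last trip is masked off for threads 2 and 3.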

17 Oct, 2025 2 commits

[Refactor] Refactor Pass `LegalizeSafeMemoryAccess` to support recursive load/store rewrite (#1050) · 72111642

Chaofan Lin authored Oct 17, 2025



* [Refactor] Refactor Pass `LegalizeSafeMemoryAccess` to support recursive load/store rewrite

* lint

* recursive collect conds for call_extern

* fix name

* [Lint]: [pre-commit.ci] auto fixes [...]

* lint

* [Lint]: [pre-commit.ci] auto fixes [...]

* lint

* [Lint]: [pre-commit.ci] auto fixes [...]

* address comment

* rename pad_value to safe_value

* lint

* add oob store test

* [Lint]: [pre-commit.ci] auto fixes [...]

* fix

* fix

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

72111642
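
The commit messages above mention a `safe_value` and an out-of-bounds store test. A guarded access of that kind can be sketched in plain Python as follows. The helper names and the choice to silently skip out-of-bounds stores are assumptions for illustration, not a description of what the pass emits.

```python
def safe_load(buf, idx, safe_value=0):
    """Return buf[idx] when idx is in bounds, otherwise safe_value."""
    return buf[idx] if 0 <= idx < len(buf) else safe_value


def safe_store(buf, idx, value):
    """Write buf[idx] only when idx is in bounds; report whether it happened."""
    if 0 <= idx < len(buf):
        buf[idx] = value
        return True
    return False


table = [10, 20, 30]
indices = [2, 5]

# Nested (indirect) load: the inner load is guarded, then the outer one.
in_bounds = safe_load(table, safe_load(indices, 0))                     # table[2]
out_of_bounds = safe_load(table, safe_load(indices, 1), safe_value=-1)  # table[5]
```
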

[Enhancement] Introduce a workaround for layout inference for local buffer store (#1055) · 278c0fbf

Lei Wang authored Oct 17, 2025



* [Enhancement] Improve layout inference for local buffer handling in parallel operations

* Added logic to check if a loop only manipulates "local" buffers, which affects thread binding decisions.
* Updated the condition for determining parallel loop execution to account for local buffer stores.
* Cleaned up comments for clarity and future considerations.

* [Refactor] Clean up parallel loop condition formatting in layout inference

* Reformatted the condition for determining parallel loop execution for better readability.
* Maintained existing logic while enhancing code clarity for future modifications.

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

278c0fbf

16 Oct, 2025 2 commits
- Allow mma gemm for all cuda (#1047) · e3742d33
  Yichen Yan authored Oct 16, 2025
  
  e3742d33
- [Feature]: Add test for atomicadd auto vectorize and remove useless code (#1019) · 0ff4f427
  Yuqi Dong authored Oct 16, 2025
```
* update

* format

* rabbit
```
  0ff4f427
15 Oct, 2025 5 commits

[Refactor] Use `has_simt_copy` to decide whether to insert `set_max_nreg` (#982) · bd1c7b39
Yu Cheng authored Oct 16, 2025

bd1c7b39

Fix bug and add AMD examples (#966) · 80665cd1

alex_xiao authored Oct 15, 2025



* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Enhance AMD example script and update CI workflows

- Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
- Added new CI workflows for AMD and documentation publishing.
- Updated various requirements files to include necessary dependencies.
- Introduced new test cases and examples for better coverage and functionality.
- Refactored existing code for improved readability and maintainability.

* Remove redundant tool cache cleanup step in AMD CI workflow

* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

* Add new AMD FlashAttention example and test script

- Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
- Added `test.sh` script to facilitate running the new example with specified parameters.
- Enhanced the overall structure and organization of the example for better clarity and usability.

* Update configurations in `example_amd_flash_attn_fwd.py` for autotuner

- Reduced the number of threads and `num_split_q` options for improved performance.
- Adjusted `panel_size` options to streamline configuration settings.

* Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217

* Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c

* Add example for AMD Flash Attention backward pass implementation

- Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
- Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
- Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
- Included reference implementation for validation against PyTorch's attention mechanism.

This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.

* Enhance AMD Flash Attention example with additional testing capabilities

- Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
- Improved the main function to allow for better parameter configuration and benchmarking.
- Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.

This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.

* Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

* Refactor HIP intrinsic rules to CUDA

- Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
- Adjusted include paths for better organization and clarity in the code structure.

* Update AMD CI workflow to uninstall specific PyTorch packages before installation

- Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
- Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.

* Remove unused shared memory allocations in AMD Flash Attention backward example

- Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
- This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.

* Remove unnecessary pip uninstall command from AMD CI workflow

- Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
- This change simplifies the CI process and reduces potential overhead during package management.

* Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules

- Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
- Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.

* Refactor formatting of HIP intrinsic rule registrations

- Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
- No functional changes were made; this update focuses on code style improvements to enhance maintainability.

* Update file name and documentation for HIP intrinsic rules

- Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
- Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.

* Enhance DispatchHIPShuffle function with clang-analyzer comments

- Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
- This change improves code clarity and maintains compliance with static analysis tools.

* lint fix

* fix

* Enhance autotuner configurations in example_amd_flash_attn_fwd.py by adding new block sizes, stages, and panel sizes. Update test script to use relative Python path and adjust parameters for consistency.

* Add backward attention example to test script

- Extended the test.sh script to include a new backward attention example using example_amd_flash_attn_bwd.py.
- Added parameters for batch size, context length, and head dimensions to ensure consistency with the forward example.
- Updated the command for the backward tile example to match the new configuration.

* Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py

- Introduced new functions for forward and backward configurations to enhance autotuning capabilities.
- Updated the FlashAttention forward and backward functions to improve performance and maintainability.
- Adjusted test script parameters for consistency and clarity, including the addition of group handling.
- Enhanced the autotuner configurations by refining block sizes and stages for better performance tuning.
- Updated the main function to reflect changes in parameter names and types for better usability.

* Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

- Updated the backward function to return additional outputs, including log-sum-exp (LSE) values for improved gradient calculations.
- Refined autotuner configurations by adding new block sizes and adjusting parameters for better performance tuning.
- Improved shared memory usage in the backward pass to optimize memory access patterns and enhance computational efficiency.
- Updated the main function to reflect changes in parameter handling and ensure consistency with the forward pass.
- Enhanced correctness checks in the main function to include LSE validation alongside gradient checks.

* Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

- Introduced a scaling factor for improved numerical stability in gradient calculations.
- Optimized shared memory usage by adding new shared buffers for intermediate calculations.
- Refined the handling of tensor fragments to improve performance and maintainability.
- Updated the main function to ensure compatibility with the new output parameters for backward operations.
- Removed unnecessary parameters from the test script to streamline execution.

* Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_mha_bwd.py

- Updated the forward and backward functions to improve numerical stability and performance.
- Enhanced shared memory usage by optimizing buffer allocations and reducing unnecessary parameters.
- Adjusted autotuner configurations for better performance tuning and compatibility with new output parameters.
- Added debugging and benchmarking functions for improved correctness verification and performance analysis.
- Updated the main function to reflect changes in parameter handling and ensure consistency across examples.

* Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

- Updated scaling factor application for improved numerical stability in gradient calculations.
- Refined tensor handling to ensure consistency with forward pass operations.
- Optimized atomic operations for writing gradients to dK and dV using fp32 for better precision.
- Adjusted comments for clarity and alignment with standard implementation practices.

* Expand autotuner configurations in example_amd_flash_attn_bwd.py and update test.sh

- Increased the range of block sizes and stages for forward and backward configurations to enhance performance tuning.
- Adjusted the test script to include additional parameters for batch size and head dimensions, ensuring consistency with the forward example.
- Improved comments for clarity and alignment with the updated configurations.

* Enhance performance calculations and benchmarking in example_amd_flash_attn_bwd.py

- Updated FLOPs calculation to account for both forward and backward passes, clarifying the total computational cost.
- Modified benchmarking functions to evaluate the complete forward and backward performance of both reference and TileLang implementations.
- Improved comments for better understanding of the performance metrics and implementation details.
- Removed unnecessary parameter from test.sh to streamline execution.

* Remove forward attention test commands from test.sh and retain backward attention execution for streamlined testing.

* Refactor FlashAttention forward and backward implementations in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py

- Updated the forward function to return both output and log-sum-exp (LSE) values for improved gradient calculations.
- Enhanced autotuner configurations for forward pass, including new parameters for better performance tuning.
- Refined scaling factor calculations for numerical stability in both forward and backward passes.
- Improved comments and documentation for clarity and consistency across implementations.
- Adjusted main function to reflect changes in parameter handling and ensure compatibility with new output requirements.

* Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py

- Removed outdated comments and improved clarity in the code.
- Enhanced the forward function to consistently return output and log-sum-exp (LSE) values.
- Updated autotuner configurations to include new parameters for better performance tuning.
- Refined tensor handling and scaling factor calculations for improved numerical stability.
- Adjusted the main function to ensure compatibility with updated output requirements and parameter handling.

* Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py

- Updated configuration parameters for backward calculations, including new options for block sizes, threads, and rasterization.
- Added new parameters (k_pack, qk_coalesced_width, v_coalesced_width) to improve performance tuning and memory access patterns.
- Modified tensor copy operations to utilize coalesced widths for optimized memory loads.
- Enhanced GEMM operations with k_pack for improved computational efficiency.
- Refined the configuration generation logic to accommodate the new parameters, ensuring comprehensive coverage for backward pass scenarios.

* Refactor configuration and tensor operations in example_amd_flash_attn_bwd.py

- Updated backward configuration parameters to include larger block sizes and a wider range of threads for enhanced performance tuning.
- Removed unnecessary parameters (k_pack, qk_coalesced_width, v_coalesced_width) from function signatures and tensor operations to simplify the implementation.
- Optimized tensor copy operations by eliminating coalesced width specifications, streamlining memory access patterns.
- Adjusted GEMM operations to improve computational efficiency without the use of k_pack.

* Enhance HIP code generation and FP8 type support

- Added support for additional FP8 types (e4m3, e4m3b11fnuz, e5m2fnuz, e8m0) in codegen_hip.cc to improve compatibility.
- Updated error logging to include unsupported FP8 type details for better debugging.
- Implemented handling for loop break and no-op register management in HIP within VisitExpr_ method.
- Introduced new FP8 vector types (e5 and e8) in hip_fp8.h for enhanced functionality.
- Added overloads for AtomicAdd in common.h to support both pointer and value arguments.

* Enhance FP8 type support and clarify accumulator handling in HIP

- Expanded FP8 type support in codegen_hip.cc to include additional float8 formats.
- Updated gemm.h to clarify the handling of the accumulator when clear_accum is true.
- Added comments in hip_fp8.h to indicate that E8M0 types are not supported in the current HIP version.

* Remove deprecated files and update print statements for clarity in example_amd_flash_attn_bwd.py

* Update print statement formatting for clarity in example_amd_flash_attn_bwd.py

* Remove redundant verification results summary print statement in example_amd_flash_attn_bwd.py for cleaner output.

* Fix formatting inconsistencies in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py by adding spaces for improved readability in configuration parameters and print statements.

* Refactor and enhance HIP code generation for improved FP8 support

- Reorganized and cleaned up code in codegen_hip.cc for better readability and maintainability.
- Enhanced handling of FP8 types, including additional formats and improved error logging for unsupported types.
- Updated AtomicAdd function in common.h to streamline its implementation.
- Refined the PrintVecElemLoadExpr method to handle volatile loads more effectively.
- Added function to manage the addition of new functions in the code generation process.

* Fix formatting issue in HIP code generation for MFMA call

- Adjusted the indentation of the MFMA call code block in codegen_hip.cc for improved readability and consistency.

* Refactor HIP code generation and enhance FP8 type handling

- Reintroduced necessary includes and reorganized code in codegen_hip.cc for improved structure and readability.
- Enhanced the GetFP8Type function to support additional FP8 formats and improved error handling for unsupported types.
- Updated PrintType and PrintVecElemLoadExpr methods to better manage type conversions and vector element loading.
- Refined the AddFunction method to streamline function addition in the code generation process.

* Remove unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.

* Refactor backward attention implementation in example_amd_flash_attn_bwd.py

- Updated the GEMM operation to use shared memory for improved performance.
- Adjusted parallelization parameters to enhance efficiency in the backward pass.

* Fix formatting by removing an unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.

* Add additional test cases for `assert_tl_matmul_correctness` with `float8_e4m3fnuz` and various configurations

* Refactor test case formatting for `assert_tl_matmul_correctness` in `test_tilelang_gemm_mfma_intrinsic.py`

---------
Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

80665cd1
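
Several entries in the commit above mention returning log-sum-exp (LSE) values from the forward pass. The sketch below shows, for a single row of attention scores, why saving the LSE is useful: the softmax probabilities can be rebuilt from the scores and one scalar. This is a generic numerical illustration with hypothetical function names, not code from the AMD example scripts.

```python
import math


def softmax_with_lse(scores):
    """Numerically stable softmax that also returns log(sum(exp(scores)))."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    lse = m + math.log(total)
    return [e / total for e in exps], lse


def softmax_from_lse(scores, lse):
    """Recompute the same probabilities later from the saved LSE alone."""
    return [math.exp(s - lse) for s in scores]


row = [1.0, 2.0, 3.0]
probs, lse = softmax_with_lse(row)
recomputed = softmax_from_lse(row, lse)
```

A backward pass that stores only the LSE per row can recompute the probabilities on the fly instead of keeping the full attention matrix in memory.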

[Language] Expose `T.get_warp_idx_sync` and `T.shuffle_elect` for efficient thread election (#989) · b78d8404

Lei Wang authored Oct 15, 2025



* Expose CUDA warp/lane intrinsics in TileLang frontend

* generalize warp indexing intrinsics and add coverage

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

b78d8404

[CUDA] Add pack functions for FP8 types (#967) · 32ddc1ac

LJC00118 authored Oct 15, 2025

* Remove an incorrect check

* add fp8 pack function

* code lint

* minor fix

* minor fix

* minor fix

* Minor fix

* Minor fix

32ddc1ac

[TIR] Revert some changes of Pass `LowerIntrin` (#1035) · e5399527

Lei Wang authored Oct 15, 2025



* keep >> instead of /

* rethink replicate

* lint fix

* handle const int buffers

* rep fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

e5399527

14 Oct, 2025 3 commits

[Bugfix] Recover code for flexible parallel (#1032) · eed320f5

Lei Wang authored Oct 14, 2025



* recover flex parallel process

* lint fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

eed320f5

[Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation (#1023) · 1e8f0b18

Tong WU authored Oct 14, 2025



* [Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation

* [Lint]: [pre-commit.ci] auto fixes [...]

* optimize amd ci

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

1e8f0b18

[Transform] Migrate `LowerIntrin` from tvm into tilelang (#999) · 7a5077e4
Lei Wang authored Oct 14, 2025
```
* Do not lower ceildiv to >>

* lint fix

* test fix

* fallback ceildiv changes
```
7a5077e4
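
The messages above concern how division is lowered to shifts. The arithmetic behind that can be checked in plain Python: an arithmetic right shift matches floor division by a power of two, but not ceiling division or C-style truncating division. The helper names are hypothetical, and the sketch says nothing about what `LowerIntrin` itself does.

```python
def ceildiv(a, b):
    """Ceiling division for integers with b > 0."""
    return -(-a // b)


def trunc_div(a, b):
    """C-style division that truncates toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


# For non-negative values, >> 1 agrees with floor and truncating division by 2.
floor_pos, shift_pos, ceil_pos = 7 // 2, 7 >> 1, ceildiv(7, 2)

# For negative values, >> rounds toward negative infinity, like floor division.
floor_neg, shift_neg, trunc_neg = -7 // 2, -7 >> 1, trunc_div(-7, 2)
```
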

13 Oct, 2025 1 commit
- [Bugfix] Fix atomicadd auto vectorize identify var error (#883) · 340bfc50
  Yuqi Dong authored Oct 13, 2025
```
* update

* update

* update

* update
```
  340bfc50
11 Oct, 2025 3 commits

[Feature][Example] Support TMA reduce operation and update GQA bwd example (#969) · 05507037

Yu Cheng authored Oct 11, 2025



* [Feature][Example] Support TMA reduce operation and update GQA bwd example

* move GQA bwd with TMA reduce to new example

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

05507037

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group... · ddfaac36

Lei Wang authored Oct 11, 2025

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend (#977)

* InjectFenceProxy docs and tests

  - annotate proxy fence injector with context comments for async/generic detection
  - add compiler internals doc covering the pass mechanics and link it in docs index
  - repair fence proxy test by fixing descriptor init usage and fence counter logic

* do not consider call_extern as async.

* doc update.

* reduce test size for sparse mla

ddfaac36

[TileOp] Implement `CumSum1D` (#978) · 747381ae
Lei Wang authored Oct 11, 2025
```
* support cumsum-1d

* cumsum 1d support
```
747381ae
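
As a reference for what a one-dimensional cumulative sum computes, here is a sequential Python sketch of an inclusive prefix sum. The `reverse` flag and the function name are assumptions added for illustration. The commit does not describe how `CumSum1D` is implemented on the device.

```python
def cumsum_1d(values, reverse=False):
    """Inclusive prefix sum; with reverse=True, accumulate from the right."""
    seq = list(reversed(values)) if reverse else list(values)
    out = []
    running = 0
    for v in seq:
        running += v
        out.append(running)
    return list(reversed(out)) if reverse else out


forward = cumsum_1d([1, 2, 3, 4])
backward = cumsum_1d([1, 2, 3, 4], reverse=True)
```
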

10 Oct, 2025 2 commits

[Bugfix] Fix dummy kernel compilation (#962) · 7913fb1d

Chaofan Lin authored Oct 10, 2025



* [Bugfix] Fix visit EvaluateNode in BufferGemmCollector

* address comment

* lint

* fix

* Add TileLang SplitHostDevice pass and tighten issue 830 test names

* lint fix

* enhance for kernel value unpacking.

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

7913fb1d

[CI] add `pre-commit` integration (#955) · 8fe35402

Xuehai Pan authored Oct 10, 2025



* chore: misc cleanup

* feat: add pre-commit config

* chore: update lint dependencies

* style: fix lint issues

* feat: add pre-commit hooks

* fix: fix typos

* chore: update .gitattributes

* [Lint]: [pre-commit.ci] auto fixes [...]

* docs: update CONTRIBUTING.md

* chore: update default venv name

* chore: revert and exclude CUDA files

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

8fe35402