Commits · 094e22983e2b891f685a1d5046fbfac3a6ff2a20 · OpenDAS / tilelang

19 Sep, 2025 2 commits

[Refactor] Enhance buffer store transformation in TIR pass (#851) · 094e2298

Lei Wang authored Sep 19, 2025

- Updated the `AddWrapperForSingleBufStore` function to improve the handling of buffer stores by adding detailed checks for fragment buffer accesses and ensuring only index 0 is used.
- Introduced new helper functions for collecting buffer accesses and indices, enhancing code readability and maintainability.
- Refined the logic for determining tile operations and thread bindings to ensure accurate transformations without affecting existing parallel structures.

094e2298

[Py38] Revert typing and parser updates for Python 3.8 compatibility (#850) · bc9623fc

Lei Wang authored Sep 19, 2025

* Update submodule TVM to commit 872e32c1 and adjust type hints in nvcc.py and utils.py for compatibility with Python typing standards.

* Update requirements.txt to specify ml_dtypes without a version constraint, indicating that versions greater than 0.5.1 are needed for fp4 support.

bc9623fc

18 Sep, 2025 6 commits

[TIR] Refactor division simplification in RewriteSimplifier (#849) · 8cc2ab22
Lei Wang authored Sep 18, 2025

8cc2ab22
[Typing] Fallback from Python 3.10+ type syntax for compatibility (#848) · c36a7eee
Lei Wang authored Sep 18, 2025

c36a7eee
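
For readers unfamiliar with the issue named in the title above: the `X | Y` union syntax in annotations only evaluates on Python 3.10 and later, so code that must import on older interpreters usually spells the same types with `typing`. The sketch below is illustrative only; the function `scale` is hypothetical and is not taken from the commit.

```python
from typing import List, Optional, Union

# Python 3.10+ spelling (fails at import time on 3.8):
#   def scale(x: int | float, factor: float | None = None) -> list[float]: ...

# Portable spelling that also works on Python 3.8:
def scale(x: Union[int, float], factor: Optional[float] = None) -> List[float]:
    """Return [x, x * factor], treating a missing factor as 1.0."""
    f = 1.0 if factor is None else factor
    return [float(x), float(x) * f]
```
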
[AMD] fix bf16x2 dtype codegen (#847) · 6efeb743
Jiaxing Ding authored Sep 18, 2025

6efeb743

[Refactor] Turn off `ENABLE_FAST_MATH` by default (#846) · e7e38355

Lei Wang authored Sep 18, 2025

* [Enhancement] Enable fast math optimization in tilelang JIT configurations

- Updated multiple examples and kernel functions to include `pass_configs` for enabling fast math optimization.
- Added support for the `TL_ENABLE_FAST_MATH` configuration option in the built-in operations.
- Enhanced the `LibraryGenerator` to handle the new fast math configuration, ensuring compatibility with existing settings.
- Updated documentation to reflect the changes in fast math handling and deprecation of the `TL_DISABLE_FAST_MATH` option.

* lint fix

* [Refactor] Introduce deprecated_warning utility for improved deprecation handling

- Added a new `deprecated_warning` function to streamline deprecation messages.
- Updated the `LibraryGenerator` to utilize the new function for warning about the deprecated `TL_DISABLE_FAST_MATH` configuration.
- Enhanced the `deprecated` decorator to support phaseout version messaging, improving clarity for users.

e7e38355
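
The deprecation helpers described in the commit above might look roughly like the following. This is a minimal sketch under stated assumptions: only the names `deprecated_warning` and `deprecated` come from the commit message, while the signatures, the message wording, and the use of `DeprecationWarning` are guesses made for illustration.

```python
import functools
import warnings
from typing import Callable, Optional


def deprecated_warning(name: str, replacement: str,
                       phaseout_version: Optional[str] = None) -> None:
    """Emit one DeprecationWarning naming the old option and its replacement."""
    msg = f"`{name}` is deprecated; use `{replacement}` instead."
    if phaseout_version is not None:
        # Hypothetical wording for the phase-out notice.
        msg += f" It is scheduled for removal in version {phaseout_version}."
    warnings.warn(msg, DeprecationWarning, stacklevel=2)


def deprecated(replacement: str, phaseout_version: Optional[str] = None) -> Callable:
    """Decorator form: warn every time the wrapped function is called."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deprecated_warning(func.__name__, replacement, phaseout_version)
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

A caller that still sets `TL_DISABLE_FAST_MATH` could then be warned with `deprecated_warning("TL_DISABLE_FAST_MATH", "TL_ENABLE_FAST_MATH")`.
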

[CI] Test Fix: Handle BufferLoad nodes when T.gemm input has a stride (#843) · ebea77d9
Lei Wang authored Sep 18, 2025
```
* bugfix

* fix

* test fix
```
ebea77d9

[Refactor] Refactor some build related configurations (#827) · 232782dd

Lei Wang authored Sep 18, 2025

* bugfix

* [Build] Update build dependencies and Dockerfile configuration

- Updated `pyproject.toml` and `requirements-build.txt` to specify Cython version as `Cython>=3.0.0`.
- Removed unnecessary dependencies from the build system.
- Enhanced `pypi.Dockerfile` to install gcc-9 and g++-9, and added ninja-build for improved build performance.
- Updated conda environment creation to include Python 3.9 to 3.12, while removing the Python 3.8 environment.

* cmake fix

* fix

* fix

232782dd

17 Sep, 2025 5 commits

[CMake] Added support for statically linked system libc library (#825) · 2f7dc52e
Lei Wang authored Sep 17, 2025

2f7dc52e

[Enhancement] Add a MXFP4 grouped GEMM example for FusedMoE (#811) · 8554cb01

Tong WU authored Sep 17, 2025



* [Enhancement] Enhance dequantization examples and utilities

- Added a new example for grouped matrix multiplication with experts in `example_dequant_groupgemm_bf16_mxfp4_hopper.py`.
- Improved dequantization logic in existing examples by replacing nested loops with vectorized operations for better performance.
- Updated `torch_convert_bit_twiddling` function in `utils.py` to utilize parallel processing, enhancing efficiency and clarity in the conversion process.
Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>

* fix typos in docstrings

* remove redundant code

* [Format] Unreproducible debug with T.print

* [BugFix] Correct dtype in ref dequantize; larger data distribution

* [Format]

* [Refactor] Clean up and optimize example_dequant_groupgemm_bf16_mxfp4_hopper.py and utils.py

- Removed unnecessary cache disabling and manual seed setting in the example.
- Simplified nested loops into parallelized operations for better readability and performance.
- Updated the assertion function in utils.py to print detailed error messages.
- Adjusted tensor sizes in examples

* [Refactor] Update import path in example_dequant_gemm_fine_grained.py

- Changed the import statement for `_tir_packed_to_unsigned_convert` from `bitblas.quantization` to `tilelang.quantize` to reflect the new module structure.

* lint

* rename and add test

* lint

* [Feature] Enhance autotuning and configuration generation in example_dequant_groupedgemm_bf16_mxfp4_hopper.py

- Added a new function `get_configs()` to generate hyperparameter configurations for tuning.
- Updated the `matmul` function to utilize autotuning with the new configurations.
- Improve kernel performance via vectorization and threadblock swizzle.
- Enhanced the main function to support the new autotuning inputs and updated parameters for better performance.

* lint

* fix typo

* fix typo and lint

* make ci format check happy

* fix ci

---------
Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>

8554cb01
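
The `get_configs()` function mentioned in the autotuning item above generates hyperparameter configurations for tuning. A common way to write such a function is a Cartesian product over candidate values, sketched below. The parameter names and value lists here are hypothetical; the actual example may tune different knobs.

```python
import itertools
from typing import Dict, List


def get_configs() -> List[Dict[str, int]]:
    """Enumerate every combination of the candidate values below."""
    # Hypothetical search space, chosen only for illustration.
    space = {
        "block_M": [64, 128],
        "block_N": [64, 128],
        "num_stages": [1, 2, 3],
        "threads": [128, 256],
    }
    keys = list(space)
    return [dict(zip(keys, values))
            for values in itertools.product(*space.values())]
```

With this search space the function returns 2 × 2 × 3 × 2 = 24 configurations, each a dictionary that an autotuner can pass to the kernel as keyword arguments.
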

[Bugfix] Skip fp4 dtype binding when using older versions of ml_dtypes (#824) · e4a346fe
Lei Wang authored Sep 17, 2025
```
* bug fix when git is not installed

* ml_dtypes_fix
```
e4a346fe
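
One way to skip an fp4 binding on older `ml_dtypes` releases is feature detection rather than version parsing. The sketch below is not the commit's code: the helper name is hypothetical, and the attribute name `float4_e2m1fn` is an assumption about how recent `ml_dtypes` releases expose the type.

```python
import importlib


def fp4_dtype_or_none():
    """Return the ml_dtypes fp4 type if it is available, else None."""
    try:
        ml_dtypes = importlib.import_module("ml_dtypes")
    except ImportError:
        return None  # ml_dtypes is not installed at all
    # Assumed attribute name; older releases do not define it.
    return getattr(ml_dtypes, "float4_e2m1fn", None)


fp4 = fp4_dtype_or_none()
if fp4 is None:
    print("fp4 unavailable; skipping fp4 dtype binding")
```
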
[Bugfix] Bug fix when git command is not installed (#823) · a57f8270
Lei Wang authored Sep 17, 2025

a57f8270
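
The log does not say where git was being invoked. A typical case is stamping a build with the current commit hash, and a defensive version of that lookup can be sketched as follows. The function name is hypothetical and this is not the commit's implementation.

```python
import shutil
import subprocess
from typing import Optional


def get_git_commit_id(cwd: str = ".") -> Optional[str]:
    """Return the current commit hash, or None when it cannot be determined."""
    if shutil.which("git") is None:
        return None  # git is not on PATH
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=cwd, stderr=subprocess.DEVNULL)
    except (subprocess.CalledProcessError, OSError):
        return None  # not a repository, or git failed to run
    return out.decode().strip()
```
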
[DSL] Support python ternary if then else expression (#822) · 15479958
Lei Wang authored Sep 17, 2025
```
* support python ternary if then else expression

* lint fix
```
15479958
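
For context, a Python ternary is a single expression node in the standard `ast` module, which is what any parser-based DSL front end has to handle. The snippet below shows only that input side. How TileLang lowers the node is not described in this log, and the variable names are placeholders.

```python
import ast

source = "y = a if cond else b"
assign = ast.parse(source).body[0]
node = assign.value  # right-hand side of the assignment

# The ternary is one IfExp node with three children:
# the condition, the value when true, and the value when false.
print(type(node).__name__)                      # IfExp
print(node.test.id, node.body.id, node.orelse.id)  # cond a b
```
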

16 Sep, 2025 3 commits

[Example] Remove redundant param (#821) · 907c3ff0
botbw authored Sep 16, 2025

907c3ff0
[CI] fix rocm ci (#819) · d3e75b70
Cunxiao Ni authored Sep 16, 2025
```
* [CI] fix rocm ci

* Trigger CI
```
d3e75b70

[Example] add w4a8 gemm kernel (#815) · 4bcb1593

Cunxiao Ni authored Sep 16, 2025

* [Bugfix] fix autotune bug

* [Example] add w4a8 gemm kernel

* fix lint: pinned the version of `ml_dtypes`
The version of ml_dtypes should be pinned in the dependency specification. If the version of ml_dtypes is too low, it may result in errors such as fp4 not being defined.

* Renames example for dequantization GEMM

* format

* add w4a8 example to ci

* fix lint

4bcb1593

15 Sep, 2025 4 commits

[Refactor] Update TVM subproject and streamline buffer store handling (#816) · 85d1a6b3

Yu Cheng authored Sep 16, 2025

- Updated the TVM subproject to the latest commit for improved functionality.
- Refactored `warp_specialized_rewriter.cc` to replace placeholder implementations for `BlockNode` and `BlockRealizeNode` with proper role filtering, enhancing code clarity and maintainability.
- Ensured consistent handling of the `cp_async_barrier_noinc` function in `builtin.py` by adding a newline at the end of the file.

85d1a6b3

[Refactor] Reopen #794 Fix lower bug when buffer store is not guarded by any tile op (#817) · 5c869bc7
Kurisu authored Sep 16, 2025
```
* [Refactor] Rewrite AddWrapper pass by ir_transform
PyStmtExprVisitor and PyStmtExprMutator seem buggy

* fix lint error
```
5c869bc7

[Refactor] Update TVM subproject and refactor BlockNode handling in... · 8b005226

Yu Cheng authored Sep 16, 2025

[Refactor] Update TVM subproject and refactor BlockNode handling in warp_specialized_rewriter.cc (#812)

* [Feature] Introduce custom warp specialization attribute and enhance warp group register allocation

- Added a new attribute `kCustomWarpSpecialization` to support custom warp specialization in the TileLang framework.
- Updated the `Collect` method in `SetMaxNRegCollector` to handle cases where warp specialization is detected, returning an empty array accordingly.
- Enhanced the `SetMaxNRegInjector` to skip processing when no registers are needed, improving efficiency.
- Modified the `WarpSpecialized` pass to include the new attribute in the function body when warp specialization is enabled, ensuring proper handling in transformations.

* lint

* lint

8b005226

[feat] support gemm_sp for ampere and ada arch (#691) · 0b3683bf

botbw authored Sep 16, 2025



* [feat] add an example mma atom

* [fix] fix typo naming

* [feat] add a template to enable compilation

* [feat] add print util

* [WIP] pass on single block tile

* [feat] add sm80 metadata layout

* [chore] clean codebase

* [CI] format.sh

* [feat] add sm80 compress utils

* [bugfix] fix C fragment layout

* [refactor] use nvcc version instead of str

* [test] add test cases

* [chore] add a param check

* [chore] format a bit

* [chore] rename func to satisfy PEP 8 and appease gemini

* [chore] add check

* [feat] support sm75 layout && add assertion && chore

* [bug] fix illegal memory access when using two warps over N=32

This could be a missing check related to cutlass 2.x implementation.
Using the cutlass example can't trigger this cause it's bypassed by
padding the input.

For now I think it might be safe to increase the atom size and
investigate in the future.

* [chore] add example

* [chore] format

* [example] update benchmark

* [bugfix] fix namespace and format

* [bugfix] fix incorrect param passing

* [refactor] update variable declaration for clarity in gemm_layouts and gemm_sp

* [Cleanup] Remove unnecessary blank lines in metadata layout functions in gemm_sp.py

* [CI] fix arch

* [example] add torch sparse benchmark

* [misc] polish && add reference && apply review suggestions && format

* [CI] format with clang-tidy

* [Cleanup] Format and align template struct definitions in half.hpp, common.h, and gemm_sp_sm80.h

* [Update] Modify CUDA version requirements in test_gemm_sp_sm80 and mark cutlass subproject as dirty

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

0b3683bf

14 Sep, 2025 2 commits

[Fix] Fix lower bug when buffer store is not guarded by any tile op (#794) · f0d66698

Kurisu authored Sep 15, 2025

* [Fix] Fix lower bug when buffer store is not guarded by any tile op

* fix lint error

* Fix typo in  pass

* fix lint error

* Ignore custom thread binding

f0d66698

[Feature] Add ptx_cp_async_barrier_noinc intrinsic and related functionality (#809) · ae9b7063

Yu Cheng authored Sep 14, 2025

- Introduced a new intrinsic `ptx_cp_async_barrier_noinc` for handling the `cp.async.mbarrier.arrive.noinc` operation in TileLang.
- Updated the CUDA code generation to support the new barrier operation.
- Added a corresponding function in the TileLang Python API for ease of use.
- Enhanced the barrier handling in CUDA templates to include the new no-increment operation, improving synchronization capabilities in parallel execution contexts.

ae9b7063

13 Sep, 2025 1 commit

[Lint] Add ruff config to check for useless spaces (#807) · 5e529522
Yichen Yan authored Sep 13, 2025
```
* update lint config

* Remove spaces for blank line

* update
```
5e529522

12 Sep, 2025 2 commits

Add pytest-durations to requirements for ROCm (#810) · 4d54854b
alex_xiao authored Sep 12, 2025

4d54854b

[AMD] support preshuffle weight mfma (#806) · 143b5222
Jiaxing Ding authored Sep 12, 2025
```
Co-authored-by: Jiaxing Ding <jiaxing.ding@bytedance.com>
```
143b5222

11 Sep, 2025 3 commits

[AMD] support fp8 T.gemm (#804) · 409ab83d

Tang Xinsheng authored Sep 11, 2025



* [AMD] support fp8 T.gemm

* format

---------
Co-authored-by: tangxinsheng.txs <tangxinsheng.txs@alibaba-inc.com>

409ab83d

[Refactor] Use new namespace and enhance dispatch macros for mma (#801) · b62a0b43

Lei Wang authored Sep 11, 2025

* Refactor CUDA GEMM operations to use new namespace and enhance dispatch macros

- Moved GEMM-related dispatch instructions to the `cute::tl_mma` namespace for better organization.
- Introduced `TL_DISPATCH_MMA` and `TL_DISPATCH_MMA_TEMPLATE` macros to streamline the definition of dispatch instructions for various data types and architectures.
- Updated the handling of CUDA architecture checks to include additional support for newer architectures.
- Improved clarity and maintainability of the code by restructuring the layout and organization of dispatch instructions.
- Ensured consistent usage of tensor views and memory clearing operations across different GEMM implementations.

* Remove deprecated `DispatchInstruction` templates and `tl_mma` namespace from CUDA GEMM implementation. This cleanup enhances code clarity and maintainability by eliminating unused structures and streamlining the overall organization of the GEMM operations.

b62a0b43

[Bugfix] Expose alloc_reducer definition to the python side (#802) · 55293631

Lei Wang authored Sep 11, 2025

- Introduced a new function `alloc_reducer` to allocate a reducer buffer with specified shape, data type, and reduction operation (sum, max, min).
- Added detailed documentation for the function, including usage instructions and parameter descriptions.
- Ensured that the function supports replication strategies and includes assertions for valid operation types and replication options.

This enhancement improves the functionality of buffer management in TileLang, facilitating efficient reduction operations in parallel loops.

55293631

10 Sep, 2025 2 commits

[TileOp] Introduce an experimental Python-defined `T.gemm_v2` (#793) · 91a7bb2b

Lei Wang authored Sep 11, 2025

* Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability

- Removed deprecated prime factorization functions from `gemm.cc` and `gemm_sp.cc`.
- Introduced a new `GemmWarpPolicy` class to manage warp policy attributes and methods, improving encapsulation.
- Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.
- Enhanced `GetArchInt` function in `utils.cc` for better readability and type safety.
- Added new `gemm_v2` function in `gemm.py` for improved GEMM operation with additional parameters and checks.

* Refactor GEMM and frontend legalize operations for improved clarity and functionality

- Updated `gemm_py.h` to include the correct header for GEMM operations.
- Renamed `FrontendLegalizer` class to `LetInliner` and updated related methods to reflect this change, enhancing code clarity.
- Modified the pass function from `FrontendLegalize` to `LetInline` for better alignment with its purpose.
- Updated test cases to utilize the new `gemm_v2` function and adjusted the testing framework for improved output and clarity.
- Removed obsolete test file `test_tilelang_transform_frontend_legalize.py` to streamline the test suite.
- Enhanced the `LowerAndLegalize` function to utilize the new `LetInline` pass, improving the overall transformation process.

* Enhance CUDA code generation and testing for GEMM operations

- Added indentation printing in `codegen_cuda.cc` for improved assembly code formatting.
- Updated `test_tilelang_tilelibrary_gemm.py` to include additional GEMM test cases and shared memory allocation with specified scope.
- Introduced new `matmul_sr` and `run_gemm_sr` functions for GEMM operations with shared and fragment memory layouts.
- Refactored layout inference in `mma_macro_generator.py` to improve clarity and correctness in shared memory handling.
- Enhanced `gemm/__init__.py` to support new GEMM operation combinations and layout inference logic.

These changes improve the clarity, functionality, and testing coverage of GEMM operations in the TileLang framework.

* Refactor GEMM layout and testing for improved clarity and functionality

- Updated `gemm_layouts.cc` to enhance the layout generation logic for transposed and non-transposed GEMM operations.
- Renamed and modified functions in `test_tilelang_tilelibrary_gemm.py` to reflect changes in GEMM function signatures and improve test coverage.
- Introduced new GEMM operation combinations in `gemm/__init__.py` to support additional layouts and configurations.
- Enhanced layout inference in `mma_layout.py` and `mma_macro_generator.py` for better handling of shared memory layouts.

These changes improve the clarity, functionality, and testing coverage of GEMM operations in the TileLang framework.

* Refactor GEMM layout and Python integration for improved functionality

- Updated `gemm_layouts.cc` to correct the order of layout replication and repetition for transposed and non-transposed GEMM operations.
- Enhanced `gemm_py.cc` to handle block realization more robustly, ensuring correct assignment of global symbols and block attributes.
- Refactored `inject_pipeline.cc` to streamline buffer read/write region handling, improving clarity and maintainability.
- Cleaned up test cases in `test_tilelang_tilelibrary_gemm.py` by removing unnecessary print statements and adjusting function calls for better test execution flow.

These changes enhance the clarity, functionality, and robustness of GEMM operations and their testing in the TileLang framework.

* Refactor GEMM layout and testing for improved clarity and functionality

- Updated `gemm_layouts.cc` to enhance layout generation logic for transposed and non-transposed GEMM operations.
- Improved block realization handling in `gemm_py.cc` for better assignment of global symbols.
- Streamlined buffer read/write region handling in `inject_pipeline.cc` for clarity.
- Enhanced test cases in `test_tilelang_tilelibrary_gemm.py` by adjusting function calls and adding new GEMM operation combinations.

These changes improve the clarity, functionality, and robustness of GEMM operations and their testing in the TileLang framework.

* tfloat32 support.

* lint fix

* lint fix

* Refactor shared memory allocation in GEMM tests

- Removed unnecessary scope specification in shared memory allocation for matrices A and B in `test_tilelang_tilelibrary_gemm.py`.
- This change simplifies the allocation process and aligns with the updated GEMM function signatures.

91a7bb2b

[AMD] support mfma i32_16x16x32_i8 (#800) · 9fd6bb30
Jiaxing Ding authored Sep 10, 2025
```
Co-authored-by: Jiaxing Ding <jiaxing.ding@bytedance.com>
```
9fd6bb30

09 Sep, 2025 2 commits

Refactor index handling in BufferStore and BufferLoad to promote 64-bit integers (#796) · 54aaec98

Lei Wang authored Sep 09, 2025

- Updated index processing in `BufferStore` and `BufferLoad` to ensure that integer indices with less than 64 bits are promoted to 64-bit integers.
- Introduced a new array to store the modified indices before updating the original indices, enhancing clarity and maintainability of the code.

54aaec98
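
The commit message does not state why indices are promoted. One plausible motivation, assumed here, is that flattened offsets into large buffers can exceed the 32-bit signed range. The standalone sketch below uses `ctypes` fixed-width integers to show the effect. It does not use TIR, and the sizes are made up.

```python
import ctypes

row, stride, col = 70_000, 40_000, 5

# Exact offset, computed with Python's arbitrary-precision integers.
exact = row * stride + col

# The same value stored in 32-bit and 64-bit signed integers.
as_int32 = ctypes.c_int32(exact).value  # wraps around, becomes negative
as_int64 = ctypes.c_int64(exact).value  # value is preserved

print(exact, as_int32, as_int64)
```
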

Resolve reference cycle. (#795) · 7467f2b3
Lei Wang authored Sep 09, 2025
```
Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
```
7467f2b3

06 Sep, 2025 3 commits

[CI] Adds pytest timeout to CI (#792) · bcfc8343

Cunxiao Ni authored Sep 06, 2025

* [CI] Adds pytest timeout to CI

Adds a timeout to pytest runs in CI to prevent jobs from hanging indefinitely.
This also adds `pytest-timeout` to the test requirements.

* fix lint

bcfc8343

[TMA] Automatically lower 1d tma in appropriate cases (#788) · 9d7d45be

Lei Wang authored Sep 06, 2025

* Enhance layout inference and copy operations with 1D TMA support

- Updated `CopyNode` to introduce separate handling for 1D bulk load/store operations, including new methods for checking and lowering these operations.
- Modified `InferLayout` and `GetCopyInst` to accommodate additional parameters for layout maps and analyzers.
- Enhanced `AtomicAddNode` and `FillNode` to utilize the updated layout inference logic.
- Improved buffer out-of-bounds checks during layout inference to ensure safe memory access.

This update improves the efficiency and correctness of memory operations in the TileLang framework.

* Refactor layout inference calls for improved readability

- Updated `InferLayout` calls in `AtomicAddNode`, `CopyNode`, and `FillNode` to enhance code clarity by formatting parameters across multiple lines.
- Cleaned up whitespace and formatting in `copy.h` and `layout_inference.cc` to adhere to coding standards and improve maintainability.

This refactor aims to streamline the layout inference logic and improve overall code organization.

* Fix shared tensor check in CopyNode for bulk copy operations

- Updated the condition in `CheckBulkCopy1D` to verify contiguity of `shared_tensor` instead of `dst`, ensuring correct handling of shared memory layouts during bulk copy operations.
- This change enhances the accuracy of memory operations in the TileLang framework.

* Update test_example_gdn_compilation.py to invoke test function directly

- Commented out the call to `tilelang.testing.main()` in `test_example_gdn_compilation.py` and replaced it with a direct call to `test_example_chunk_delta_bwd_compilation()`. This change simplifies the test execution flow and focuses on the specific test case.

* Enhance bulk load/store checks in CopyNode with last dimension validation

- Updated `CheckBulkLoad` and `CheckBulkStore` methods in `CopyNode` to include an optional parameter for validating the last dimension during bulk copy operations.
- Adjusted related methods `CheckBulkLoad1D` and `CheckBulkStore1D` to pass the new parameter, improving the accuracy of bulk copy checks.
- This change enhances the robustness of memory operations in the TileLang framework by ensuring compliance with dimensional requirements.

* Refactor CheckBulkLoad and CheckBulkStore methods for improved readability

- Reformatted the parameter lists of `CheckBulkLoad` and `CheckBulkStore` methods in `CopyNode` to enhance code clarity by aligning parameters across multiple lines.
- This change improves the maintainability of the code and adheres to coding standards.

9d7d45be

[AMD] fix mfma op interface (#791) · b6b02dab
Jiaxing Ding authored Sep 06, 2025
```
Co-authored-by: Jiaxing Ding <jiaxing.ding@bytedance.com>
```
b6b02dab

05 Sep, 2025 3 commits

[AMD] fix bugs in warp shuffle (#790) · cda5ea15

Tang Xinsheng authored Sep 05, 2025



* [AMD] fix bugs in warp shuffle

* format

---------
Co-authored-by: tangxinsheng.txs <tangxinsheng.txs@alibaba-inc.com>

cda5ea15

[Bugfix] Fix incorrect synchronization bug in minference example (#786) · 013adca0
Wenhao Xie authored Sep 05, 2025
```
* fix

* lint
```
013adca0

[Feat] Add tilelang T.assume support and assume injection for buffer shapes (#787) · e5b61e9b

Kurisu authored Sep 05, 2025

* Add InjectAssumes pass to speedup tvm prover

* Fix lint errors

* remove debug statements

* [Feat] add assume attr and assume support in tilelang

* Add conversion from tir.assume to tilelang assume

* [Fix] Add missing With constraint in IRMutator

* Fix typo in ir mutator

e5b61e9b

04 Sep, 2025 2 commits

[Nvidia][SM121] Add intrin.h include to gemm_mma.h for sm120+ (#785) · 6e0c3500
Hao Kang authored Sep 04, 2025
```
To make sm120 arch runnable.
```
6e0c3500

[AMD] Fix AMD TIR & add examples (#784) · f07f31c1

alex_xiao authored Sep 04, 2025



* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

* Enhance AMD example script and update CI workflows

- Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
- Added new CI workflows for AMD and documentation publishing.
- Updated various requirements files to include necessary dependencies.
- Introduced new test cases and examples for better coverage and functionality.
- Refactored existing code for improved readability and maintainability.

* Remove redundant tool cache cleanup step in AMD CI workflow

* Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.

* Add new AMD FlashAttention example and test script

- Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
- Added `test.sh` script to facilitate running the new example with specified parameters.
- Enhanced the overall structure and organization of the example for better clarity and usability.

* Update configurations in `example_amd_flash_attn_fwd.py` for autotuner

- Reduced the number of threads and `num_split_q` options for improved performance.
- Adjusted `panel_size` options to streamline configuration settings.

* Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217

* Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c

* Add example for AMD Flash Attention backward pass implementation

- Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
- Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
- Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
- Included reference implementation for validation against PyTorch's attention mechanism.

This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.

* Enhance AMD Flash Attention example with additional testing capabilities

- Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
- Improved the main function to allow for better parameter configuration and benchmarking.
- Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.

This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.

* Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a

* Refactor HIP intrinsic rules to CUDA

- Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
- Adjusted include paths for better organization and clarity in the code structure.

* Update AMD CI workflow to uninstall specific PyTorch packages before installation

- Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
- Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.

* Remove unused shared memory allocations in AMD Flash Attention backward example

- Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
- This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.

* Remove unnecessary pip uninstall command from AMD CI workflow

- Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
- This change simplifies the CI process and reduces potential overhead during package management.

* Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules

- Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
- Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.

* Refactor formatting of HIP intrinsic rule registrations

- Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
- No functional changes were made; this update focuses on code style improvements to enhance maintainability.

* Update file name and documentation for HIP intrinsic rules

- Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
- Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.

* Enhance DispatchHIPShuffle function with clang-analyzer comments

- Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
- This change improves code clarity and maintains compliance with static analysis tools.

* lint fix

* fix

---------
Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

f07f31c1