  1. 19 Sep, 2025 2 commits
    • [Refactor] Enhance buffer store transformation in TIR pass (#851) · 094e2298
      Lei Wang authored
      - Updated the `AddWrapperForSingleBufStore` function to improve the handling of buffer stores by adding detailed checks for fragment buffer accesses and ensuring only index 0 is used.
      - Introduced new helper functions for collecting buffer accesses and indices, enhancing code readability and maintainability.
      - Refined the logic for determining tile operations and thread bindings to ensure accurate transformations without affecting existing parallel structures.
      094e2298
    • [Py38] Revert typing and parser updates for Python 3.8 compatibility (#850) · bc9623fc
      Lei Wang authored
      * Update submodule TVM to commit 872e32c1 and adjust type hints in nvcc.py and utils.py for compatibility with Python typing standards.
      
      * Update requirements.txt to specify ml_dtypes without a version constraint, indicating that versions greater than 0.5.1 are needed for fp4 support.
      bc9623fc
  2. 18 Sep, 2025 6 commits
    • [AMD] fix bf16x2 dtype codegen (#847) · 6efeb743
      Jiaxing Ding authored
      6efeb743
    • [Refactor] Turn off `ENABLE_FAST_MATH` by default (#846) · e7e38355
      Lei Wang authored
      * [Enhancement] Enable fast math optimization in tilelang JIT configurations
      
      - Updated multiple examples and kernel functions to include `pass_configs` for enabling fast math optimization.
      - Added support for the `TL_ENABLE_FAST_MATH` configuration option in the built-in operations.
      - Enhanced the `LibraryGenerator` to handle the new fast math configuration, ensuring compatibility with existing settings.
      - Updated documentation to reflect the changes in fast math handling and deprecation of the `TL_DISABLE_FAST_MATH` option.
      
      * lint fix
      
      * [Refactor] Introduce deprecated_warning utility for improved deprecation handling
      
      - Added a new `deprecated_warning` function to streamline deprecation messages.
      - Updated the `LibraryGenerator` to utilize the new function for warning about the deprecated `TL_DISABLE_FAST_MATH` configuration.
      - Enhanced the `deprecated` decorator to support phaseout version messaging, improving clarity for users.
      e7e38355
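
A deprecation helper of the kind this commit describes might look like the minimal sketch below. It is not TileLang's implementation: the signatures of `deprecated_warning` and `deprecated`, the `phaseout_version` parameter name, and the `old_api` function are all assumptions made for illustration.

```python
import functools
import warnings


def deprecated_warning(name, reason, phaseout_version=None):
    """Emit a DeprecationWarning for `name` (hypothetical signature)."""
    message = f"{name} is deprecated: {reason}"
    if phaseout_version is not None:
        message += f" It is scheduled for removal in {phaseout_version}."
    # stacklevel=3 attributes the warning to the caller of the decorated
    # function when this helper is invoked from the wrapper below.
    warnings.warn(message, DeprecationWarning, stacklevel=3)
    return message


def deprecated(reason, phaseout_version=None):
    """Decorator that warns every time the wrapped function is called."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deprecated_warning(func.__name__, reason, phaseout_version)
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical deprecated function, used only to exercise the decorator.
@deprecated("use the replacement option instead", phaseout_version="v1.0")
def old_api(x):
    return x * 2
```

Routing both the decorator and direct calls through one helper keeps the message format in a single place, which appears to be the motivation stated in the commit.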
    • [CI] Test Fix: Handle BufferLoad nodes when T.gemm input has a stride (#843) · ebea77d9
      Lei Wang authored
      * bugfix
      
      * fix
      
      * test fix
      ebea77d9
    • [Refactor] Refactor some build related configurations (#827) · 232782dd
      Lei Wang authored
      * bugfix
      
      * [Build] Update build dependencies and Dockerfile configuration
      
      - Updated `pyproject.toml` and `requirements-build.txt` to specify Cython version as `Cython>=3.0.0`.
      - Removed unnecessary dependencies from the build system.
      - Enhanced `pypi.Dockerfile` to install gcc-9 and g++-9, and added ninja-build for improved build performance.
      - Updated conda environment creation to include Python 3.9 to 3.12, while removing the Python 3.8 environment.
      
      * cmake fix
      
      * fix
      
      * fix
      232782dd
  3. 17 Sep, 2025 5 commits
    • [Enhancement] Add a MXFP4 grouped GEMM example for FusedMoE (#811) · 8554cb01
      Tong WU authored
      
      
      * [Enhancement] Enhance dequantization examples and utilities
      
      - Added a new example for grouped matrix multiplication with experts in `example_dequant_groupgemm_bf16_mxfp4_hopper.py`.
      - Improved dequantization logic in existing examples by replacing nested loops with vectorized operations for better performance.
      - Updated `torch_convert_bit_twiddling` function in `utils.py` to utilize parallel processing, enhancing efficiency and clarity in the conversion process.
      Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
      
      * fix typos in docstrings
      
      * remove redundant code
      
      * [Format] Unreproducible debug with T.print
      
      * [BugFix] Correct dtype in ref dequantize; larger data distribution
      
      * [Format]
      
      * [Refactor] Clean up and optimize example_dequant_groupgemm_bf16_mxfp4_hopper.py and utils.py
      
      - Removed unnecessary cache disabling and manual seed setting in the example.
      - Simplified nested loops into parallelized operations for better readability and performance.
      - Updated the assertion function in utils.py to print detailed error messages.
      - Adjusted tensor sizes in examples
      
      * [Refactor] Update import path in example_dequant_gemm_fine_grained.py
      
      - Changed the import statement for `_tir_packed_to_unsigned_convert` from `bitblas.quantization` to `tilelang.quantize` to reflect the new module structure.
      
      * lint
      
      * rename and add test
      
      * lint
      
      * [Feature] Enhance autotuning and configuration generation in example_dequant_groupedgemm_bf16_mxfp4_hopper.py
      
      - Added a new function `get_configs()` to generate hyperparameter configurations for tuning.
      - Updated the `matmul` function to utilize autotuning with the new configurations.
      - Improved kernel performance via vectorization and threadblock swizzle.
      - Enhanced the main function to support the new autotuning inputs and updated parameters for better performance.
      
      * lint
      
      * fix typo
      
      * fix typo and lint
      
      * make ci format check happy
      
      * fix ci
      
      ---------
      Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
      Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
      8554cb01
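
The `get_configs()` helper mentioned in this commit generates candidate configurations for autotuning. A common way to build such a list is a Cartesian product over per-parameter candidates, sketched below. The parameter names and values are hypothetical and are not taken from the example file.

```python
import itertools


def get_configs():
    """Return every combination of the candidate tuning values as a dict."""
    # Hypothetical search space; the real example's parameters may differ.
    search_space = {
        "block_M": [64, 128],
        "block_N": [64, 128],
        "num_stages": [2, 3],
        "threads": [128, 256],
    }
    keys = list(search_space)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*search_space.values())
    ]


configs = get_configs()
print(len(configs))  # 2 * 2 * 2 * 2 = 16 configurations
```

Each dictionary can then be handed to a tuner as one trial. The list grows multiplicatively with every added parameter, so the candidate sets are usually kept small.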
    • [Bugfix] Skip fp4 dtype binding when using older versions of ml_dtypes (#824) · e4a346fe
      Lei Wang authored
      * bug fix when git is not installed
      
      * ml_dtypes_fix
      e4a346fe
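
One way to skip a dtype binding on older package versions is to probe for the attribute instead of assuming it exists. The sketch below illustrates that pattern only; it is not the code from this commit. The helper name `fp4_dtype_or_none` is hypothetical, and the attribute name `float4_e2m1fn` is an assumption about what recent `ml_dtypes` releases expose.

```python
import importlib


def fp4_dtype_or_none(module_name="ml_dtypes", attr="float4_e2m1fn"):
    """Return the fp4 dtype if the installed package exposes it, else None."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # the package is not installed at all
    # Older releases lack the attribute, so getattr falls back to None.
    return getattr(module, attr, None)


fp4 = fp4_dtype_or_none()
if fp4 is None:
    print("fp4 unavailable; skipping fp4 dtype binding")
```

Feature detection of this kind avoids hard-coding a version threshold, at the cost of depending on the attribute name staying stable.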
    • Lei Wang's avatar
      a57f8270
    • [DSL] Support Python ternary if-then-else expression (#822) · 15479958
      Lei Wang authored
      * support Python ternary if-then-else expression
      
      * lint fix
      15479958
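
For reference, the ternary form is Python's conditional expression `a if cond else b`. The snippet below is plain Python rather than TileLang DSL code, and the function names are illustrative. How the DSL parser lowers the expression is not shown here.

```python
def clamp_negative(x):
    # Conditional expression: evaluates to x when x > 0, otherwise 0.
    return x if x > 0 else 0


def clamp_negative_stmt(x):
    # Equivalent statement form of the same logic.
    if x > 0:
        return x
    return 0
```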
  4. 16 Sep, 2025 3 commits
    • [Example] Remove redundant param (#821) · 907c3ff0
      botbw authored
      907c3ff0
    • [CI] fix rocm ci (#819) · d3e75b70
      Cunxiao Ni authored
      * [CI] fix rocm ci
      
      * Trigger CI
      d3e75b70
    • [Example] add w4a8 gemm kernel (#815) · 4bcb1593
      Cunxiao Ni authored
      * [Bugfix] fix autotune bug
      
      * [Example] add w4a8 gemm kernel
      
      * fix lint: pinned the version of `ml_dtypes`
      The version of ml_dtypes should be pinned in the dependency specification. If the version of ml_dtypes is too low, it may result in errors such as fp4 not being defined.
      
      * Renames example for dequantization GEMM
      
      * format
      
      * add w4a8 example to ci
      
      * fix lint
      4bcb1593
  5. 15 Sep, 2025 4 commits
    • [Refactor] Update TVM subproject and streamline buffer store handling (#816) · 85d1a6b3
      Yu Cheng authored
      - Updated the TVM subproject to the latest commit for improved functionality.
      - Refactored `warp_specialized_rewriter.cc` to replace placeholder implementations for `BlockNode` and `BlockRealizeNode` with proper role filtering, enhancing code clarity and maintainability.
      - Ensured consistent handling of the `cp_async_barrier_noinc` function in `builtin.py` by adding a newline at the end of the file.
      85d1a6b3
    • [Refactor] Reopen #794 Fix lower bug when buffer store is not guarded by any tile op (#817) · 5c869bc7
      Kurisu authored
      * [Refactor] Rewrite AddWrapper pass by ir_transform
      PyStmtExprVisitor and PyStmtExprMutator seem buggy
      
      * fix lint error
      5c869bc7
    • [Refactor] Update TVM subproject and refactor BlockNode handling in... · 8b005226
      Yu Cheng authored
      [Refactor] Update TVM subproject and refactor BlockNode handling in warp_specialized_rewriter.cc (#812)
      
      * [Feature] Introduce custom warp specialization attribute and enhance warp group register allocation
      
      - Added a new attribute `kCustomWarpSpecialization` to support custom warp specialization in the TileLang framework.
      - Updated the `Collect` method in `SetMaxNRegCollector` to handle cases where warp specialization is detected, returning an empty array accordingly.
      - Enhanced the `SetMaxNRegInjector` to skip processing when no registers are needed, improving efficiency.
      - Modified the `WarpSpecialized` pass to include the new attribute in the function body when warp specialization is enabled, ensuring proper handling in transformations.
      
      * lint
      
      * lint
      8b005226
    • [feat] support gemm_sp for ampere and ada arch (#691) · 0b3683bf
      botbw authored
      
      
      * [feat] add an example mma atom
      
      * [fix] fix typo naming
      
      * [feat] add a template to enable compilation
      
      * [feat] add print util
      
      * [WIP] pass on single block tile
      
      * [feat] add sm80 metadata layout
      
      * [chore] clean codebase
      
      * [CI] format.sh
      
      * [feat] add sm80 compress utils
      
      * [bugfix] fix C fragment layout
      
      * [refactor] use nvcc version instead of str
      
      * [test] add test cases
      
      * [chore] add a param check
      
      * [chore] format a bit
      
      * [chore] rename func to satisfy PEP 8 and appease gemini
      
      * [chore] add check
      
      * [feat] support sm75 layout && add assertion && chore
      
      * [bug] fix illegal memory access when using two warps over N=32
      
      This could be a missing check related to the cutlass 2.x implementation.
      The cutlass example can't trigger it because padding the input
      bypasses it.

      For now it is probably safe to increase the atom size and
      investigate in the future.
      
      * [chore] add example
      
      * [chore] format
      
      * [example] update benchmark
      
      * [bugfix] fix namespace and format
      
      * [bugfix] fix incorrect param passing
      
      * [refactor] update variable declaration for clarity in gemm_layouts and gemm_sp
      
      * [Cleanup] Remove unnecessary blank lines in metadata layout functions in gemm_sp.py
      
      * [CI] fix arch
      
      * [example] add torch sparse benchmark
      
      * [misc] polish && add reference && apply review suggestions && format
      
      * [CI] format with clang-tidy
      
      * [Cleanup] Format and align template struct definitions in half.hpp, common.h, and gemm_sp_sm80.h
      
      * [Update] Modify CUDA version requirements in test_gemm_sp_sm80 and mark cutlass subproject as dirty
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      0b3683bf
  6. 14 Sep, 2025 2 commits
    • [Fix] Fix lower bug when buffer store is not guarded by any tile op (#794) · f0d66698
      Kurisu authored
      * [Fix] Fix lower bug when buffer store is not guarded by any tile op
      
      * fix lint error
      
      * Fix typo in  pass
      
      * fix lint error
      
      * Ignore custom thread binding
      f0d66698
    • [Feature] Add ptx_cp_async_barrier_noinc intrinsic and related functionality (#809) · ae9b7063
      Yu Cheng authored
      - Introduced a new intrinsic `ptx_cp_async_barrier_noinc` for handling the `cp.async.mbarrier.arrive.noinc` operation in TileLang.
      - Updated the CUDA code generation to support the new barrier operation.
      - Added a corresponding function in the TileLang Python API for ease of use.
      - Enhanced the barrier handling in CUDA templates to include the new no-increment operation, improving synchronization capabilities in parallel execution contexts.
      ae9b7063
  7. 13 Sep, 2025 1 commit
  8. 12 Sep, 2025 2 commits
  9. 11 Sep, 2025 3 commits
    • [AMD] support fp8 T.gemm (#804) · 409ab83d
      Tang Xinsheng authored
      
      
      * [AMD] support fp8 T.gemm
      
      * format
      
      ---------
      Co-authored-by: tangxinsheng.txs <tangxinsheng.txs@alibaba-inc.com>
      409ab83d
    • [Refactor] Use new namespace and enhance dispatch macros for mma (#801) · b62a0b43
      Lei Wang authored
      * Refactor CUDA GEMM operations to use new namespace and enhance dispatch macros
      
      - Moved GEMM-related dispatch instructions to the `cute::tl_mma` namespace for better organization.
      - Introduced `TL_DISPATCH_MMA` and `TL_DISPATCH_MMA_TEMPLATE` macros to streamline the definition of dispatch instructions for various data types and architectures.
      - Updated the handling of CUDA architecture checks to include additional support for newer architectures.
      - Improved clarity and maintainability of the code by restructuring the layout and organization of dispatch instructions.
      - Ensured consistent usage of tensor views and memory clearing operations across different GEMM implementations.
      
      * Remove deprecated `DispatchInstruction` templates and `tl_mma` namespace from CUDA GEMM implementation. This cleanup enhances code clarity and maintainability by eliminating unused structures and streamlining the overall organization of the GEMM operations.
      b62a0b43
    • [Bugfix] Expose alloc_reducer definition to the python side (#802) · 55293631
      Lei Wang authored
      - Introduced a new function `alloc_reducer` to allocate a reducer buffer with specified shape, data type, and reduction operation (sum, max, min).
      - Added detailed documentation for the function, including usage instructions and parameter descriptions.
      - Ensured that the function supports replication strategies and includes assertions for valid operation types and replication options.
      
      This enhancement improves the functionality of buffer management in TileLang, facilitating efficient reduction operations in parallel loops.
      55293631
  10. 10 Sep, 2025 2 commits
    • [TileOp] Introduce an experimental Python-defined `T.gemm_v2` (#793) · 91a7bb2b
      Lei Wang authored
      * Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability
      
      - Removed deprecated prime factorization functions from `gemm.cc` and `gemm_sp.cc`.
      - Introduced a new `GemmWarpPolicy` class to manage warp policy attributes and methods, improving encapsulation.
      - Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.
      - Enhanced `GetArchInt` function in `utils.cc` for better readability and type safety.
      - Added new `gemm_v2` function in `gemm.py` for improved GEMM operation with additional parameters and checks.
      
      * Refactor GEMM and frontend legalize operations for improved clarity and functionality
      
      - Updated `gemm_py.h` to include the correct header for GEMM operations.
      - Renamed `FrontendLegalizer` class to `LetInliner` and updated related methods to reflect this change, enhancing code clarity.
      - Modified the pass function from `FrontendLegalize` to `LetInline` for better alignment with its purpose.
      - Updated test cases to utilize the new `gemm_v2` function and adjusted the testing framework for improved output and clarity.
      - Removed obsolete test file `test_tilelang_transform_frontend_legalize.py` to streamline the test suite.
      - Enhanced the `LowerAndLegalize` function to utilize the new `LetInline` pass, improving the overall transformation process.
      
      * Enhance CUDA code generation and testing for GEMM operations
      
      - Added indentation printing in `codegen_cuda.cc` for improved assembly code formatting.
      - Updated `test_tilelang_tilelibrary_gemm.py` to include additional GEMM test cases and shared memory allocation with specified scope.
      - Introduced new `matmul_sr` and `run_gemm_sr` functions for GEMM operations with shared and fragment memory layouts.
      - Refactored layout inference in `mma_macro_generator.py` to improve clarity and correctness in shared memory handling.
      - Enhanced `gemm/__init__.py` to support new GEMM operation combinations and layout inference logic.
      
      These changes improve the clarity, functionality, and testing coverage of GEMM operations in the TileLang framework.
      
      * Refactor GEMM layout and testing for improved clarity and functionality
      
      - Updated `gemm_layouts.cc` to enhance the layout generation logic for transposed and non-transposed GEMM operations.
      - Renamed and modified functions in `test_tilelang_tilelibrary_gemm.py` to reflect changes in GEMM function signatures and improve test coverage.
      - Introduced new GEMM operation combinations in `gemm/__init__.py` to support additional layouts and configurations.
      - Enhanced layout inference in `mma_layout.py` and `mma_macro_generator.py` for better handling of shared memory layouts.
      
      These changes improve the clarity, functionality, and testing coverage of GEMM operations in the TileLang framework.
      
      * Refactor GEMM layout and Python integration for improved functionality
      
      - Updated `gemm_layouts.cc` to correct the order of layout replication and repetition for transposed and non-transposed GEMM operations.
      - Enhanced `gemm_py.cc` to handle block realization more robustly, ensuring correct assignment of global symbols and block attributes.
      - Refactored `inject_pipeline.cc` to streamline buffer read/write region handling, improving clarity and maintainability.
      - Cleaned up test cases in `test_tilelang_tilelibrary_gemm.py` by removing unnecessary print statements and adjusting function calls for better test execution flow.
      
      These changes enhance the clarity, functionality, and robustness of GEMM operations and their testing in the TileLang framework.
      
      * Refactor GEMM layout and testing for improved clarity and functionality
      
      - Updated `gemm_layouts.cc` to enhance layout generation logic for transposed and non-transposed GEMM operations.
      - Improved block realization handling in `gemm_py.cc` for better assignment of global symbols.
      - Streamlined buffer read/write region handling in `inject_pipeline.cc` for clarity.
      - Enhanced test cases in `test_tilelang_tilelibrary_gemm.py` by adjusting function calls and adding new GEMM operation combinations.
      
      These changes improve the clarity, functionality, and robustness of GEMM operations and their testing in the TileLang framework.
      
      * tfloat32 support.
      
      * lint fix
      
      * lint fix
      
      * Refactor shared memory allocation in GEMM tests
      
      - Removed unnecessary scope specification in shared memory allocation for matrices A and B in `test_tilelang_tilelibrary_gemm.py`.
      - This change simplifies the allocation process and aligns with the updated GEMM function signatures.
      91a7bb2b
    • Jiaxing Ding's avatar
      9fd6bb30
  11. 09 Sep, 2025 2 commits
  12. 06 Sep, 2025 3 commits
    • [CI] Adds pytest timeout to CI (#792) · bcfc8343
      Cunxiao Ni authored
      * [CI] Adds pytest timeout to CI
      
      Adds a timeout to pytest runs in CI to prevent jobs from hanging indefinitely.
      This also adds `pytest-timeout` to the test requirements.
      
      * fix lint
      bcfc8343
    • [TMA] Automatically lower 1d tma in appropriate cases (#788) · 9d7d45be
      Lei Wang authored
      * Enhance layout inference and copy operations with 1D TMA support
      
      - Updated `CopyNode` to introduce separate handling for 1D bulk load/store operations, including new methods for checking and lowering these operations.
      - Modified `InferLayout` and `GetCopyInst` to accommodate additional parameters for layout maps and analyzers.
      - Enhanced `AtomicAddNode` and `FillNode` to utilize the updated layout inference logic.
      - Improved buffer out-of-bounds checks during layout inference to ensure safe memory access.
      
      This update improves the efficiency and correctness of memory operations in the TileLang framework.
      
      * Refactor layout inference calls for improved readability
      
      - Updated `InferLayout` calls in `AtomicAddNode`, `CopyNode`, and `FillNode` to enhance code clarity by formatting parameters across multiple lines.
      - Cleaned up whitespace and formatting in `copy.h` and `layout_inference.cc` to adhere to coding standards and improve maintainability.
      
      This refactor aims to streamline the layout inference logic and improve overall code organization.
      
      * Fix shared tensor check in CopyNode for bulk copy operations
      
      - Updated the condition in `CheckBulkCopy1D` to verify contiguity of `shared_tensor` instead of `dst`, ensuring correct handling of shared memory layouts during bulk copy operations.
      - This change enhances the accuracy of memory operations in the TileLang framework.
      
      * Update test_example_gdn_compilation.py to invoke test function directly
      
      - Commented out the call to `tilelang.testing.main()` in `test_example_gdn_compilation.py` and replaced it with a direct call to `test_example_chunk_delta_bwd_compilation()`. This change simplifies the test execution flow and focuses on the specific test case.
      
      * Enhance bulk load/store checks in CopyNode with last dimension validation
      
      - Updated `CheckBulkLoad` and `CheckBulkStore` methods in `CopyNode` to include an optional parameter for validating the last dimension during bulk copy operations.
      - Adjusted related methods `CheckBulkLoad1D` and `CheckBulkStore1D` to pass the new parameter, improving the accuracy of bulk copy checks.
      - This change enhances the robustness of memory operations in the TileLang framework by ensuring compliance with dimensional requirements.
      
      * Refactor CheckBulkLoad and CheckBulkStore methods for improved readability
      
      - Reformatted the parameter lists of `CheckBulkLoad` and `CheckBulkStore` methods in `CopyNode` to enhance code clarity by aligning parameters across multiple lines.
      - This change improves the maintainability of the code and adheres to coding standards.
      9d7d45be
    • [AMD] fix mfma op interface (#791) · b6b02dab
      Jiaxing Ding authored
      
      Co-authored-by: Jiaxing Ding <jiaxing.ding@bytedance.com>
      b6b02dab
  13. 05 Sep, 2025 3 commits
  14. 04 Sep, 2025 2 commits
    • [Nvidia][SM121] Add intrin.h include to gemm_mma.h for sm120+ (#785) · 6e0c3500
      Hao Kang authored
      To make sm120 arch runnable.
      6e0c3500
    • [AMD] Fix AMD TIR and add examples (#784) · f07f31c1
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Enhance AMD example script and update CI workflows
      
      - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
      - Added new CI workflows for AMD and documentation publishing.
      - Updated various requirements files to include necessary dependencies.
      - Introduced new test cases and examples for better coverage and functionality.
      - Refactored existing code for improved readability and maintainability.
      
      * Remove redundant tool cache cleanup step in AMD CI workflow
      
      * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
      
      * Add new AMD FlashAttention example and test script
      
      - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
      - Added `test.sh` script to facilitate running the new example with specified parameters.
      - Enhanced the overall structure and organization of the example for better clarity and usability.
      
      * Update configurations in `example_amd_flash_attn_fwd.py` for autotuner
      
      - Reduced the number of threads and `num_split_q` options for improved performance.
      - Adjusted `panel_size` options to streamline configuration settings.
      
      * Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217
      
      * Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c
      
      * Add example for AMD Flash Attention backward pass implementation
      
      - Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
      - Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
      - Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
      - Included reference implementation for validation against PyTorch's attention mechanism.
      
      This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.
      
      * Enhance AMD Flash Attention example with additional testing capabilities
      
      - Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
      - Improved the main function to allow for better parameter configuration and benchmarking.
      - Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.
      
      This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.
      
      * Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
      
      * Refactor HIP intrinsic rules to CUDA
      
      - Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
      - Adjusted include paths for better organization and clarity in the code structure.
      
      * Update AMD CI workflow to uninstall specific PyTorch packages before installation
      
      - Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
      - Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.
      
      * Remove unused shared memory allocations in AMD Flash Attention backward example
      
      - Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
      - This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.
      
      * Remove unnecessary pip uninstall command from AMD CI workflow
      
      - Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
      - This change simplifies the CI process and reduces potential overhead during package management.
      
      * Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules
      
      - Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
      - Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.
      
      * Refactor formatting of HIP intrinsic rule registrations
      
      - Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
      - No functional changes were made; this update focuses on code style improvements to enhance maintainability.
      
      * Update file name and documentation for HIP intrinsic rules
      
      - Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
      - Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.
      
      * Enhance DispatchHIPShuffle function with clang-analyzer comments
      
      - Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
      - This change improves code clarity and maintains compliance with static analysis tools.
      
      * lint fix
      
      * fix
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      f07f31c1