"include/ck/utility/container_helper.hpp" did not exist on "11ec07e9d13c41ea8c1512f86414fd0096a0e095"
  1. 04 Sep, 2025 1 commit
    • [AMD] Fix amd tir&add examples (#784) · f07f31c1
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
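Why a swizzled layout helps with bank conflicts can be sketched in a few lines. This is a generic XOR swizzle, not the layout TileLang actually generates; the bank count, tile width, and all function names here are assumptions chosen for illustration.

```python
# Sketch only: a generic XOR swizzle, not TileLang's actual layout function.
NUM_BANKS = 32   # assumed number of shared-memory (LDS) banks
TILE_COLS = 32   # assumed tile width, in 4-byte words


def bank_linear(row: int, col: int) -> int:
    """Bank hit by element (row, col) in a plain row-major layout."""
    return (row * TILE_COLS + col) % NUM_BANKS


def bank_swizzled(row: int, col: int) -> int:
    """Bank hit when the column index is XOR-ed with the row index."""
    return (row * TILE_COLS + (col ^ (row % TILE_COLS))) % NUM_BANKS


def worst_conflict(bank_of, col: int, rows: int = 32) -> int:
    """Largest number of rows in one column that land in the same bank."""
    hits = [0] * NUM_BANKS
    for row in range(rows):
        hits[bank_of(row, col)] += 1
    return max(hits)


# 32 threads each read one row of the same column.
linear_conflict = worst_conflict(bank_linear, col=5)
swizzled_conflict = worst_conflict(bank_swizzled, col=5)
```

Under these assumptions the row-major layout sends every row of a column to the same bank, while the XOR-ed layout spreads the same column across all banks.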
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Enhance AMD example script and update CI workflows
      
      - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
      - Added new CI workflows for AMD and documentation publishing.
      - Updated various requirements files to include necessary dependencies.
      - Introduced new test cases and examples for better coverage and functionality.
      - Refactored existing code for improved readability and maintainability.
      
      * Remove redundant tool cache cleanup step in AMD CI workflow
      
      * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
      
      * Add new AMD FlashAttention example and test script
      
      - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
      - Added `test.sh` script to facilitate running the new example with specified parameters.
      - Enhanced the overall structure and organization of the example for better clarity and usability.
      
      * Update configurations in `example_amd_flash_attn_fwd.py` for autotuner
      
      - Reduced the number of threads and `num_split_q` options for improved performance.
      - Adjusted `panel_size` options to streamline configuration settings.
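The effect of trimming option lists on the autotuner's search space can be illustrated with a small sketch. The parameter names come from the commit messages, but every candidate value below is a placeholder and this `get_configs` is a hypothetical stand-in, not the function in `example_amd_flash_attn_fwd.py`.

```python
import itertools


def get_configs():
    """Return every combination of the candidate tuning values as a dict."""
    # All candidate values are placeholders, not the example's real lists.
    search_space = {
        "threads": [128, 256],
        "num_split_q": [32, 64],
        "panel_size": [7, 8],
        "num_stages": [0, 1],
        "enable_rasterization": [True],
        "k_pack": [2],
    }
    keys = list(search_space)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*search_space.values())
    ]


configs = get_configs()
```

Because the grid is a Cartesian product, removing a single value from one list shrinks the number of kernels to compile and benchmark multiplicatively.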
      
      * Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217
      
      * Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c
      
      * Add example for AMD Flash Attention backward pass implementation
      
      - Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
      - Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps.
      - Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
      - Included reference implementation for validation against PyTorch's attention mechanism.
      
      This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.
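The idea of validating a tiled attention kernel against a reference can be sketched without TileLang or PyTorch. The following is a minimal pure-Python illustration for a single query vector: a plain softmax attention and a block-by-block version using a running maximum and sum, as Flash Attention style kernels do. The function names, block size, and input data are assumptions for illustration and are not taken from `example_amd_flash_attn_bwd.py`.

```python
import math


def reference_attention(q, keys, values):
    """Plain softmax attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block=2):
    """Same result, computed block by block with a running max and sum."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Made-up inputs: one query, five key/value pairs.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [2.0, -1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

expected = reference_attention(q, keys, values)
actual = tiled_attention(q, keys, values, block=2)
max_error = max(abs(a - b) for a, b in zip(expected, actual))
```

A real check on GPU outputs would compare tensors with a dtype-appropriate tolerance rather than the tight bound that works for this float64 toy case.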
      
      * Enhance AMD Flash Attention example with additional testing capabilities
      
      - Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
      - Improved the main function to allow for better parameter configuration and benchmarking.
      - Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.
      
      This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.
      
      * Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
      
      * Refactor HIP intrinsic rules to CUDA
      
      - Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
      - Adjusted include paths for better organization and clarity in the code structure.
      
      * Update AMD CI workflow to uninstall specific PyTorch packages before installation
      
      - Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
      - Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.
      
      * Remove unused shared memory allocations in AMD Flash Attention backward example
      
      - Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
      - This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.
      
      * Remove unnecessary pip uninstall command from AMD CI workflow
      
      - Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
      - This change simplifies the CI process and reduces potential overhead during package management.
      
      * Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules
      
      - Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
      - Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.
      
      * Refactor formatting of HIP intrinsic rule registrations
      
      - Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
      - No functional changes were made; this update focuses on code style improvements to enhance maintainability.
      
      * Update file name and documentation for HIP intrinsic rules
      
      - Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
      - Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.
      
      * Enhance DispatchHIPShuffle function with clang-analyzer comments
      
      - Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
      - This change improves code clarity and maintains compliance with static analysis tools.
      
      * lint fix
      
      * fix
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  2. 15 Aug, 2025 1 commit
    • [CI][AMD] Add AMD GPU CI and fix some related bugs (#694) · 8e1b88f3
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Update AMD FlashAttention example and TVM submodule
      
      - Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
      - Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
      - Updated the TVM submodule to the latest commit for improved functionality.
      - Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.
      
      * Add CI workflow for automated format checking and testing
      
      - Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
      - The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
      - Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.
      
      * Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.
      
      * Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management.
      
      * Update AMD CI workflow to install pytest directly instead of using requirements-test.txt
      
      * Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt
      
      * Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation
      
      * Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.
      
      * Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.
      
      * Add installation of ROCm in AMD CI workflow
      
      - Included a step to execute the `install_rocm.sh` script for improved setup.
      - Removed unnecessary blank line for better readability in the workflow script.
      
      * Remove installation step for ROCm in AMD CI workflow to simplify the setup process.
      
      * Update AMD CI workflow to run specific test file with verbose output instead of all tests.
      
      * Add new tilelang built-in operations for AMD architecture
      
      - Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
      - Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.
      
      * Enhance autotuner configurations and GEMM operations in AMD example
      
      - Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
      - Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.
      
      * Update autotuner configurations in AMD example for enhanced performance
      
      - Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
      - Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.
      
      * Enhance autotuner configurations and memory handling in AMD example
      
      - Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
      - Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.
      
      * Refine autotuner configurations and memory usage in AMD example
      
      - Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
      - Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.
      
      * Update autotuner configurations in AMD example for enhanced performance
      
      - Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
      - Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.
      
      * Enhance autotuner configurations and GEMM operations in AMD example
      
      - Expanded thread counts in `get_configs` to include higher values for improved autotuning.
      - Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.
      
      * Update AMD CI workflow and remove obsolete test script
      
      - Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
      - Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.
      
      * Remove TVM subproject from 3rdparty directory
      
      * Refactor configuration generation and accumulation logic in AMD example
      
      - Reformatted the `get_configs` function for improved readability by aligning parameters.
      - Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.
      
      * Enhance AMD CI workflow with additional logging and setup steps
      
      - Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
      - Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.
      
      * Comment out package copying in AMD CI workflow to prevent potential issues during environment setup
      
      * Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps
      
      * Enhance BuildTileLangHIP function by adding whitespace for improved readability
      
      * Refactor kTVMGridConstant definition for clarity and remove unnecessary comment
      
      * Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
      
      * lint fix
      
      * Update AMD CI workflow to use requirements-rocm.txt for dependency installation
      
      * fix ci
      
      * Remove dependency on format-check from AMD CI workflow
      
      * fix ci
      
      * fix ci
      
      * fix ci
      
      * Remove format-check job from AMD CI workflow
      
      * Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow
      
      * Add dependency on format-check job in AMD CI workflow
      
      * Add format-check job to AMD CI workflow
      
      * Update format-check job in AMD CI workflow to run on self-hosted environment
      
      * Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes
      
      * Update amd_ci.yml
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  3. 31 Jul, 2025 1 commit
    • Add Flash Attn example on amd mi300 series (#682) · adcba275
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  4. 29 Jul, 2025 2 commits