"test/gemm/run_gemm_test.inc" did not exist on "485ea46a40f6ed9310443a33541b494d042c57a8"
  1. 15 Aug, 2025 1 commit
    • [CI][AMD] Add AMD GPU CI and fix some related bugs (#694) · 8e1b88f3
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Update AMD FlashAttention example and TVM submodule
      
      - Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
      - Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
      - Updated the TVM submodule to the latest commit for improved functionality.
      - Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.
      
      * Add CI workflow for automated format checking and testing
      
      - Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
      - The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
      - Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.
      
      * Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.
      
      * Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management.
      
      * Update AMD CI workflow to install pytest directly instead of using requirements-test.txt
      
      * Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt
      
      * Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation
      
      * Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.
      
      * Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.
      
      * Add installation of ROCm in AMD CI workflow
      
      - Included a step to execute the `install_rocm.sh` script for improved setup.
      - Removed unnecessary blank line for better readability in the workflow script.
      
      * Remove installation step for ROCm in AMD CI workflow to simplify the setup process.
      
      * Update AMD CI workflow to run specific test file with verbose output instead of all tests.
      
      * Add new tilelang built-in operations for AMD architecture
      
      - Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
      - Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.
      
      * Enhance autotuner configurations and GEMM operations in AMD example
      
      - Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
      - Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.
      
      * Update autotuner configurations in AMD example for enhanced performance
      
      - Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
      - Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.
      
      * Enhance autotuner configurations and memory handling in AMD example
      
      - Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
      - Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.
      
      * Refine autotuner configurations and memory usage in AMD example
      
      - Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
      - Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.
      
      * Update autotuner configurations in AMD example for enhanced performance
      
      - Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
      - Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.
      
      * Enhance autotuner configurations and GEMM operations in AMD example
      
      - Expanded thread counts in `get_configs` to include higher values for improved autotuning.
      - Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.
      
      * Update AMD CI workflow and remove obsolete test script
      
      - Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
      - Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.
      
      * Remove TVM subproject from 3rdparty directory
      
      * Refactor configuration generation and accumulation logic in AMD example
      
      - Reformatted the `get_configs` function for improved readability by aligning parameters.
      - Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.
      
      * Enhance AMD CI workflow with additional logging and setup steps
      
      - Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
      - Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.
      
      * Comment out package copying in AMD CI workflow to prevent potential issues during environment setup
      
      * Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps
      
      * Enhance BuildTileLangHIP function by adding whitespace for improved readability
      
      * Refactor kTVMGridConstant definition for clarity and remove unnecessary comment
      
      * Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
      
      * lint fix
      
      * Update AMD CI workflow to use requirements-rocm.txt for dependency installation
      
      * fix ci
      
      * Remove dependency on format-check from AMD CI workflow
      
      * fix ci
      
      * fix ci
      
      * fix ci
      
      * Remove format-check job from AMD CI workflow
      
      * Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow
      
      * Add dependency on format-check job in AMD CI workflow
      
      * Add format-check job to AMD CI workflow
      
      * Update format-check job in AMD CI workflow to run on self-hosted environment
      
      * Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes
      
      * Update amd_ci.yml
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  2. 31 Jul, 2025 1 commit
    • Add Flash Attn example on amd mi300 series (#682) · adcba275
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  3. 29 Jul, 2025 2 commits