    [CI][AMD] Add AMD GPU CI and fix some related bugs (#694) · 8e1b88f3
    alex_xiao authored
    
    
    * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
    
    - Enhanced buffer index handling to address precision issues by removing redundant operations.
    - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
    - Updated related documentation to reflect changes in buffer management practices.
    
    * Remove obsolete test script for AMD example, streamlining the examples directory.
    
    * Remove unused dtype_size variable in AMD example script to streamline code.
    
    * Add input configuration file and update AMD example script for enhanced flexibility
    
    - Introduced a new input.txt file for configurable parameters.
    - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
    - Streamlined the main function for better clarity and organization.
    - Added a new test script to facilitate running the example with specified parameters.
    
    * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
    
    - Deleted input.txt and test.sh files as they are no longer needed.
    - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
    - Reintroduced swizzle usage in the kernel for better performance.
    
    * Refactor AMD example script for FlashAttention-2
    
    - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
    - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
    - Removed outdated comments and improved code organization for better readability.
    
    * Refactor formatting in AMD FlashAttention example script
    
    - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
    - Streamlined the `main` function parameter formatting for consistency.
    - Removed unnecessary blank lines to enhance overall code organization.
    
    * Update example_amd_flash_attn_fwd.py
    
    * Update AMD FlashAttention example and TVM submodule
    
    - Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
    - Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
    - Updated the TVM submodule to the latest commit for improved functionality.
    - Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.
    
    * Add CI workflow for automated format checking and testing
    
    - Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
    - The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
    - Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.
    
    * Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.
    
    * Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management.
    
    * Update AMD CI workflow to install pytest directly instead of using requirements-test.txt
    
    * Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt
    
    * Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation
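
The filtering step described above could look roughly like the sketch below. It is an illustration only, not the command the workflow actually runs: the helper name `drop_requirement` and the sample requirement lines are hypothetical, and the real CI step may use a shell one-liner instead.

```python
import re


def drop_requirement(lines, package):
    """Return requirement lines with every entry for `package` removed."""
    # Match the package name at the start of a line, followed by the end of
    # the line or by a character that can start a version specifier, extras,
    # an environment marker, a URL reference, or a comment.
    pattern = re.compile(
        rf"^\s*{re.escape(package)}\s*($|[=<>!~\[;#@ ])", re.IGNORECASE
    )
    return [line for line in lines if not pattern.match(line)]


# Hypothetical contents of a requirements file, one entry per line.
requirements = ["pytest", "flash-attn>=2.0", "numpy", "flash-attn-extras"]
filtered = drop_requirement(requirements, "flash-attn")
print(filtered)
```

Matching on the package name plus a delimiter, rather than on a bare substring, keeps unrelated packages whose names merely begin with the same prefix.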
    
    * Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.
    
    * Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.
    
    * Add installation of ROCm in AMD CI workflow
    
    - Included a step to execute the `install_rocm.sh` script for improved setup.
    - Removed unnecessary blank line for better readability in the workflow script.
    
    * Remove installation step for ROCm in AMD CI workflow to simplify the setup process.
    
    * Update AMD CI workflow to run specific test file with verbose output instead of all tests.
    
    * Add new tilelang built-in operations for AMD architecture
    
    - Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
    - Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.
    
    * Enhance autotuner configurations and GEMM operations in AMD example
    
    - Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
    - Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.
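
As a rough illustration of how a `get_configs`-style helper can enumerate an autotuning search space, the sketch below takes the cartesian product of candidate values. Only `get_configs` and `num_split_q` are named in this commit message; the other parameter names (`block_M`, `block_N`, `threads`) and all numeric values are placeholders assumed for the example, not the values used in the example script.

```python
import itertools


def get_configs():
    """Build one config dict per combination of candidate parameter values."""
    # Placeholder search space: names other than num_split_q and all values
    # are assumptions for illustration.
    space = {
        "block_M": [64, 128],
        "block_N": [32, 64],
        "num_split_q": [32, 64],
        "threads": [128, 256],
    }
    keys = list(space)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*space.values())
    ]


configs = get_configs()
print(len(configs), configs[0])
```

Each extra candidate value multiplies the number of configurations the autotuner has to compile and benchmark, which is one reason a search space may be narrowed as well as expanded over time.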
    
    * Update autotuner configurations in AMD example for enhanced performance
    
    - Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
    - Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.
    
    * Enhance autotuner configurations and memory handling in AMD example
    
    - Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
    - Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.
    
    * Refine autotuner configurations and memory usage in AMD example
    
    - Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
    - Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.
    
    * Update autotuner configurations in AMD example for enhanced performance
    
    - Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
    - Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.
    
    * Enhance autotuner configurations and GEMM operations in AMD example
    
    - Expanded thread counts in `get_configs` to include higher values for improved autotuning.
    - Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.
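
For readers unfamiliar with the causal condition mentioned above, the following plain-Python sketch shows the usual rule for one tile of attention scores: a query position may attend only to key positions at or before it. This is a generic illustration under assumed names (`causal_mask_tile`, `q_block`, `k_block`, `block_M`, `block_N`), not the kernel code changed in this commit.

```python
def causal_mask_tile(q_block, k_block, block_M, block_N):
    """Return a block_M x block_N tile: 0.0 where attention is allowed, -inf where masked."""
    neg_inf = float("-inf")
    tile = []
    for i in range(block_M):
        q_idx = q_block * block_M + i  # global query position
        row = []
        for j in range(block_N):
            k_idx = k_block * block_N + j  # global key position
            # Allowed when the key is not in the query's future.
            row.append(0.0 if q_idx >= k_idx else neg_inf)
        tile.append(row)
    return tile


tile = causal_mask_tile(q_block=0, k_block=0, block_M=2, block_N=3)
print(tile)
```

Initialising the score accumulator with such a tile before adding the Q·K product means masked entries stay at negative infinity and contribute zero after the softmax.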
    
    * Update AMD CI workflow and remove obsolete test script
    
    - Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
    - Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.
    
    * Remove TVM subproject from 3rdparty directory
    
    * Refactor configuration generation and accumulation logic in AMD example
    
    - Reformatted the `get_configs` function for improved readability by aligning parameters.
    - Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.
    
    * Enhance AMD CI workflow with additional logging and setup steps
    
    - Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
    - Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.
    
    * Comment out package copying in AMD CI workflow to prevent potential issues during environment setup
    
    * Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps
    
    * Enhance BuildTileLangHIP function by adding whitespace for improved readability
    
    * Refactor kTVMGridConstant definition for clarity and remove unnecessary comment
    
    * Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
    
    * lint fix
    
    * Update AMD CI workflow to use requirements-rocm.txt for dependency installation
    
    * fix ci
    
    * Remove dependency on format-check from AMD CI workflow
    
    * fix ci
    
    * fix ci
    
    * fix ci
    
    * Remove format-check job from AMD CI workflow
    
    * Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow
    
    * Add dependency on format-check job in AMD CI workflow
    
    * Add format-check job to AMD CI workflow
    
    * Update format-check job in AMD CI workflow to run on self-hosted environment
    
    * Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes
    
    * Update amd_ci.yml
    
    ---------
    Co-authored-by: xinxyxiao <xinyxiao@amd.com>
    Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
builtin.cc 6.26 KB