    fix bug&add amd examples (#966) · 80665cd1
    alex_xiao authored
    
    
    * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
    
    - Enhanced buffer index handling to address precision issues by removing redundant operations.
    - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
    - Updated related documentation to reflect changes in buffer management practices.
    
    * Remove obsolete test script for AMD example, streamlining the examples directory.
    
    * Remove unused dtype_size variable in AMD example script to streamline code.
    
    * Add input configuration file and update AMD example script for enhanced flexibility
    
    - Introduced a new input.txt file for configurable parameters.
    - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
    - Streamlined the main function for better clarity and organization.
    - Added a new test script to facilitate running the example with specified parameters.
    
    * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
    
    - Deleted input.txt and test.sh files as they are no longer needed.
    - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
    - Reintroduced swizzle usage in the kernel for better performance.
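
    A minimal, self-contained sketch of what such a swizzle annotation looks like in TileLang (illustrative kernel and buffer names, assuming the current `T.annotate_layout` / `make_swizzled_layout` API; this is not the exact code from `example_amd_flash_attn_fwd.py`):

    ```python
    import tilelang
    import tilelang.language as T
    from tilelang.layout import make_swizzled_layout


    @tilelang.jit
    def swizzled_copy(M=128, N=64, block_M=64, dtype="float16"):

        @T.prim_func
        def main(A: T.Tensor((M, N), dtype), B: T.Tensor((M, N), dtype)):
            with T.Kernel(T.ceildiv(M, block_M), threads=128) as bx:
                A_shared = T.alloc_shared((block_M, N), dtype)
                # Swizzle the shared-memory tile so column-wise accesses are
                # spread across LDS banks instead of conflicting.
                T.annotate_layout({A_shared: make_swizzled_layout(A_shared)})
                T.copy(A[bx * block_M, 0], A_shared)
                T.copy(A_shared, B[bx * block_M, 0])

        return main
    ```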
    
    * Refactor AMD example script for FlashAttention-2
    
    - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
    - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
    - Removed outdated comments and improved code organization for better readability.
    
    * Refactor formatting in AMD FlashAttention example script
    
    - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
    - Streamlined the `main` function parameter formatting for consistency.
    - Removed unnecessary blank lines to enhance overall code organization.
    
    * Update example_amd_flash_attn_fwd.py
    
    * Enhance AMD example script and update CI workflows
    
    - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
    - Added new CI workflows for AMD and documentation publishing.
    - Updated various requirements files to include necessary dependencies.
    - Introduced new test cases and examples for better coverage and functionality.
    - Refactored existing code for improved readability and maintainability.
    
    * Remove redundant tool cache cleanup step in AMD CI workflow
    
    * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
    
    * Add new AMD FlashAttention example and test script
    
    - Introduced `example_amd_flash_attn_bwd.py` for backward attention computation using TileLang.
    - Added `test.sh` script to facilitate running the new example with specified parameters.
    - Enhanced the overall structure and organization of the example for better clarity and usability.
    
    * Update configurations in `example_amd_flash_attn_fwd.py` for autotuner
    
    - Reduced the number of threads and `num_split_q` options for improved performance.
    - Adjusted `panel_size` options to streamline configuration settings.
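
    For reference, such a search space is typically expressed as a list of keyword dictionaries handed to the autotuner; the values below are illustrative placeholders, not the exact options committed here:

    ```python
    import itertools


    def get_configs():
        # Illustrative search space: fewer thread-count and num_split_q options,
        # and a trimmed set of panel_size values.
        block_M = [64, 128]
        block_N = [64, 128]
        threads = [128, 256]
        num_split_q = [1, 2]
        panel_size = [7, 8]
        return [
            dict(block_M=m, block_N=n, threads=t, num_split_q=s, panel_size=p)
            for m, n, t, s, p in itertools.product(block_M, block_N, threads,
                                                   num_split_q, panel_size)
        ]
    ```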
    
    * Update submodule 'tvm' to commit 6ccc74f622c7ec4ac25d430d0f6546e7b9edb217
    
    * Update submodule 'tvm' to commit 14ff70ab142b9e5a31bbf9c7923c8a697d41e86c
    
    * Add example for AMD Flash Attention backward pass implementation
    
    - Introduced a new example script `example_amd_flash_attn_bwd.py` demonstrating the forward and backward operations of Flash Attention using TileLang.
    - Implemented JIT-compiled functions for both forward and backward passes, including preprocessing and postprocessing steps (see the autograd wiring sketch after this entry).
    - Added a main function to facilitate testing and benchmarking of the attention mechanism with configurable parameters.
    - Included reference implementation for validation against PyTorch's attention mechanism.
    
    This addition enhances the examples directory by providing a comprehensive guide for users to understand and utilize Flash Attention in their applications.
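
    A typical way such JIT-compiled forward/backward kernels are exposed to PyTorch is through a `torch.autograd.Function`. The sketch below uses plain torch math as a stand-in for the TileLang kernels, so it runs as written; the example itself replaces this math with its compiled kernels and separate pre/post-processing steps:

    ```python
    import math
    import torch


    class FlashAttnRef(torch.autograd.Function):
        """Autograd wiring sketch: plain torch math stands in for the TileLang kernels."""

        @staticmethod
        def forward(ctx, q, k, v):
            scale = 1.0 / math.sqrt(q.shape[-1])
            s = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
            lse = torch.logsumexp(s, dim=-1)            # kept for the backward pass
            p = torch.exp(s - lse.unsqueeze(-1))
            o = torch.einsum("bhqk,bhkd->bhqd", p, v)
            ctx.save_for_backward(q, k, v, o, lse)
            return o

        @staticmethod
        def backward(ctx, do):
            q, k, v, o, lse = ctx.saved_tensors
            scale = 1.0 / math.sqrt(q.shape[-1])
            s = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
            p = torch.exp(s - lse.unsqueeze(-1))        # recompute probabilities from LSE
            dv = torch.einsum("bhqk,bhqd->bhkd", p, do)
            dp = torch.einsum("bhqd,bhkd->bhqk", do, v)
            delta = (do * o).sum(-1, keepdim=True)      # rowwise correction ("preprocess")
            ds = p * (dp - delta) * scale
            dq = torch.einsum("bhqk,bhkd->bhqd", ds, k)
            dk = torch.einsum("bhqk,bhqd->bhkd", ds, q)
            return dq, dk, dv
    ```

    Calling `FlashAttnRef.apply(q, k, v)` and backpropagating through the result exercises the same data flow that the example validates and benchmarks.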
    
    * Enhance AMD Flash Attention example with additional testing capabilities
    
    - Updated `example_amd_flash_attn_bwd.py` to include more comprehensive testing features for the Flash Attention implementation.
    - Improved the main function to allow for better parameter configuration and benchmarking.
    - Added validation checks against PyTorch's attention mechanism to ensure accuracy and reliability of the example.
    
    This update aims to provide users with a more robust tool for understanding and utilizing Flash Attention in their applications.
    
    * Update submodule TVM to commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
    
    * Refactor HIP intrinsic rules to CUDA
    
    - Updated file name from `intrin_rule_hip.cc` to `intrin_rule_cuda.cc` to reflect the change in focus from HIP to CUDA intrinsic rules.
    - Adjusted include paths for better organization and clarity in the code structure.
    
    * Update AMD CI workflow to uninstall specific PyTorch packages before installation
    
    - Removed the installation of `flash_attn==2.5.8` to streamline the CI process.
    - Added a step to uninstall `torch`, `torchvision`, and `torchaudio` prior to installing pre-release versions, ensuring compatibility and reducing potential conflicts.
    
    * Remove unused shared memory allocations in AMD Flash Attention backward example
    
    - Eliminated the allocation of shared memory for `dv_shared` and `dk_shared` in `example_amd_flash_attn_bwd.py` to streamline memory usage and improve performance.
    - This change focuses on optimizing the backward pass implementation by reducing unnecessary memory overhead.
    
    * Remove unnecessary pip uninstall command from AMD CI workflow
    
    - Eliminated the step to uninstall `torch`, `torchvision`, and `torchaudio` in the AMD CI workflow, as it is no longer required for the installation of pre-release versions.
    - This change simplifies the CI process and reduces potential overhead during package management.
    
    * Refactor DispatchHIPWarpActiveMask function in HIP intrinsic rules
    
    - Updated the return statement to use std::string for concatenation in the case of 16-bit types, improving code clarity.
    - Added a null check for the CallNode pointer in DispatchHIPWarpActiveMask to enhance robustness and prevent potential dereferencing issues.
    
    * Refactor formatting of HIP intrinsic rule registrations
    
    - Adjusted the formatting of TVM_REGISTER_OP calls for better readability by aligning method chaining.
    - No functional changes were made; this update focuses on code style improvements to enhance maintainability.
    
    * Update file name and documentation for HIP intrinsic rules
    
    - Renamed the file from `intrin_rule_cuda.cc` to `intrin_rule_hip.cc` to accurately reflect the focus on HIP intrinsic rules.
    - Updated the file documentation to clarify its purpose as related to HIP rather than CUDA.
    
    * Enhance DispatchHIPShuffle function with clang-analyzer comments
    
    - Added NOLINTBEGIN and NOLINTEND comments to the DispatchHIPShuffle function to suppress clang-analyzer warnings related to inner pointer usage.
    - This change improves code clarity and maintains compliance with static analysis tools.
    
    * lint fix
    
    * fix
    
    * Enhance autotuner configurations in example_amd_flash_attn_fwd.py

    - Added new block sizes, stages, and panel sizes to the search space.
    - Updated the test script to use a relative Python path and adjusted parameters for consistency.
    
    * Add backward attention example to test script
    
    - Extended the test.sh script to include a new backward attention example using example_amd_flash_attn_bwd.py.
    - Added parameters for batch size, context length, and head dimensions to ensure consistency with the forward example.
    - Updated the command for the backward tile example to match the new configuration.
    
    * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py
    
    - Introduced new functions for forward and backward configurations to enhance autotuning capabilities.
    - Updated the FlashAttention forward and backward functions to improve performance and maintainability.
    - Adjusted test script parameters for consistency and clarity, including the addition of group handling.
    - Enhanced the autotuner configurations by refining block sizes and stages for better performance tuning.
    - Updated the main function to reflect changes in parameter names and types for better usability.
    
    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
    
    - Updated the backward function to return additional outputs, including log-sum-exp (LSE) values for improved gradient calculations.
    - Refined autotuner configurations by adding new block sizes and adjusting parameters for better performance tuning.
    - Improved shared memory usage in the backward pass to optimize memory access patterns and enhance computational efficiency.
    - Updated the main function to reflect changes in parameter handling and ensure consistency with the forward pass.
    - Enhanced correctness checks in the main function to include LSE validation alongside gradient checks.
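
    A minimal sketch of what an LSE check against a torch reference can look like (shapes, tolerances, and the `lse_tl` name are illustrative; the example's actual check may differ):

    ```python
    import math
    import torch

    b, h, s_len, d = 1, 4, 128, 64
    q, k, v = (torch.randn(b, h, s_len, d) for _ in range(3))
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / math.sqrt(d)
    lse_ref = torch.logsumexp(scores, dim=-1)
    # lse_tl would be the log-sum-exp returned by the TileLang forward kernel:
    # torch.testing.assert_close(lse_tl, lse_ref, rtol=2e-2, atol=2e-2)
    ```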
    
    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
    
    - Introduced a scaling factor for improved numerical stability in gradient calculations.
    - Optimized shared memory usage by adding new shared buffers for intermediate calculations.
    - Refined the handling of tensor fragments to improve performance and maintainability.
    - Updated the main function to ensure compatibility with the new output parameters for backward operations.
    - Removed unnecessary parameters from the test script to streamline execution.
    
    * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py and example_mha_bwd.py
    
    - Updated the forward and backward functions to improve numerical stability and performance.
    - Enhanced shared memory usage by optimizing buffer allocations and reducing unnecessary parameters.
    - Adjusted autotuner configurations for better performance tuning and compatibility with new output parameters.
    - Added debugging and benchmarking functions for improved correctness verification and performance analysis.
    - Updated the main function to reflect changes in parameter handling and ensure consistency across examples.
    
    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
    
    - Updated scaling factor application for improved numerical stability in gradient calculations.
    - Refined tensor handling to ensure consistency with forward pass operations.
    - Optimized atomic operations for writing gradients to dK and dV using fp32 for better precision (see the sketch after this list).
    - Adjusted comments for clarity and alignment with standard implementation practices.
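
    A minimal, self-contained sketch of fp32 atomic accumulation in TileLang (not the example's kernel; it assumes the current `T.atomic_add` API and uses illustrative shapes). Every program instance adds its partial tile into a shared fp32 output, which is the same pattern used for the dK/dV gradients:

    ```python
    import tilelang
    import tilelang.language as T


    @tilelang.jit
    def atomic_accumulate(G=8, block_M=64, dim=64):

        @T.prim_func
        def main(X: T.Tensor((G, block_M, dim), "float32"),
                 Out: T.Tensor((block_M, dim), "float32")):
            with T.Kernel(G, threads=128) as bx:
                x_shared = T.alloc_shared((block_M, dim), "float32")
                T.copy(X[bx, :, :], x_shared)
                # fp32 atomics keep the sum precise when many blocks update the
                # same tile, which is why dK/dV are accumulated in fp32, not fp16.
                for i, j in T.Parallel(block_M, dim):
                    T.atomic_add(Out[i, j], x_shared[i, j])

        return main
    ```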
    
    * Expand autotuner configurations in example_amd_flash_attn_bwd.py and update test.sh
    
    - Increased the range of block sizes and stages for forward and backward configurations to enhance performance tuning.
    - Adjusted the test script to include additional parameters for batch size and head dimensions, ensuring consistency with the forward example.
    - Improved comments for clarity and alignment with the updated configurations.
    
    * Enhance performance calculations and benchmarking in example_amd_flash_attn_bwd.py
    
    - Updated FLOPs calculation to account for both forward and backward passes, clarifying the total computational cost (the accounting is sketched after this list).
    - Modified benchmarking functions to evaluate the complete forward and backward performance of both the reference and TileLang implementations.
    - Improved comments for better understanding of the performance metrics and implementation details.
    - Removed unnecessary parameter from test.sh to streamline execution.
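
    The accounting follows the usual FlashAttention convention of two GEMMs in the forward pass and five in the backward pass (a sketch of the formula, not the example's exact code):

    ```python
    def attention_total_flops(batch, heads, seq_len, dim):
        """Standard accounting: 2 matmuls in the forward pass, 5 in the backward pass."""
        flops_per_matmul = 2.0 * batch * heads * seq_len * seq_len * dim
        fwd_flops = 2 * flops_per_matmul      # QK^T and P @ V
        bwd_flops = 5 * flops_per_matmul      # recompute QK^T, dV, dP, dQ, dK
        return fwd_flops + bwd_flops
    ```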
    
    * Remove forward attention test commands from test.sh and retain backward attention execution for streamlined testing.
    
    * Refactor FlashAttention forward and backward implementations in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py
    
    - Updated the forward function to return both output and log-sum-exp (LSE) values for improved gradient calculations.
    - Enhanced autotuner configurations for forward pass, including new parameters for better performance tuning.
    - Refined scaling factor calculations for numerical stability in both forward and backward passes.
    - Improved comments and documentation for clarity and consistency across implementations.
    - Adjusted main function to reflect changes in parameter handling and ensure compatibility with new output requirements.
    
    * Refactor FlashAttention implementation in example_amd_flash_attn_bwd.py
    
    - Removed outdated comments and improved clarity in the code.
    - Enhanced the forward function to consistently return output and log-sum-exp (LSE) values.
    - Updated autotuner configurations to include new parameters for better performance tuning.
    - Refined tensor handling and scaling factor calculations for improved numerical stability.
    - Adjusted the main function to ensure compatibility with updated output requirements and parameter handling.
    
    * Enhance FlashAttention backward implementation in example_amd_flash_attn_bwd.py
    
    - Updated configuration parameters for backward calculations, including new options for block sizes, threads, and rasterization.
    - Added new parameters (k_pack, qk_coalesced_width, v_coalesced_width) to improve performance tuning and memory access patterns (see the sketch after this list).
    - Modified tensor copy operations to utilize coalesced widths for optimized memory loads.
    - Enhanced GEMM operations with k_pack for improved computational efficiency.
    - Refined the configuration generation logic to accommodate the new parameters, ensuring comprehensive coverage for backward pass scenarios.
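
    A rough sketch of how such knobs typically surface in a TileLang kernel, assuming `T.copy` accepts a `coalesced_width` argument and `T.gemm` accepts `k_pack`; the plain GEMM below is illustrative, not the attention kernel itself:

    ```python
    import tilelang
    import tilelang.language as T


    @tilelang.jit(out_idx=[2])
    def gemm_kpack(M=256, N=256, K=256, block_M=64, block_N=64, block_K=32,
                   k_pack=2, coalesced_width=4, dtype="float16"):

        @T.prim_func
        def main(A: T.Tensor((M, K), dtype), B: T.Tensor((N, K), dtype),
                 C: T.Tensor((M, N), "float32")):
            with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                          threads=256) as (bx, by):
                A_shared = T.alloc_shared((block_M, block_K), dtype)
                B_shared = T.alloc_shared((block_N, block_K), dtype)
                C_local = T.alloc_fragment((block_M, block_N), "float32")
                T.clear(C_local)
                for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=2):
                    # coalesced_width requests wider per-thread global loads
                    T.copy(A[by * block_M, k * block_K], A_shared,
                           coalesced_width=coalesced_width)
                    T.copy(B[bx * block_N, k * block_K], B_shared,
                           coalesced_width=coalesced_width)
                    # k_pack packs multiple k-steps per MFMA issue
                    T.gemm(A_shared, B_shared, C_local, transpose_B=True, k_pack=k_pack)
                T.copy(C_local, C[by * block_M, bx * block_N])

        return main
    ```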
    
    * Refactor configuration and tensor operations in example_amd_flash_attn_bwd.py
    
    - Updated backward configuration parameters to include larger block sizes and a wider range of threads for enhanced performance tuning.
    - Removed unnecessary parameters (k_pack, qk_coalesced_width, v_coalesced_width) from function signatures and tensor operations to simplify the implementation.
    - Optimized tensor copy operations by eliminating coalesced width specifications, streamlining memory access patterns.
    - Adjusted GEMM operations to improve computational efficiency without the use of k_pack.
    
    * Enhance HIP code generation and FP8 type support
    
    - Added support for additional FP8 types (e4m3, e4m3b11fnuz, e5m2fnuz, e8m0) in codegen_hip.cc to improve compatibility.
    - Updated error logging to include unsupported FP8 type details for better debugging.
    - Implemented handling for loop break and no-op register management in HIP within VisitExpr_ method.
    - Introduced new FP8 vector types (e5 and e8) in hip_fp8.h for enhanced functionality.
    - Added overloads for AtomicAdd in common.h to support both pointer and value arguments.
    
    * Enhance FP8 type support and clarify accumulator handling in HIP
    
    - Expanded FP8 type support in codegen_hip.cc to include additional float8 formats.
    - Updated gemm.h to clarify the handling of the accumulator when clear_accum is true.
    - Added comments in hip_fp8.h to indicate that E8M0 types are not supported in the current HIP version.
    
    * Remove deprecated files and update print statements for clarity in example_amd_flash_attn_bwd.py
    
    * Update print statement formatting for clarity in example_amd_flash_attn_bwd.py
    
    * Remove redundant verification results summary print statement in example_amd_flash_attn_bwd.py for cleaner output.
    
    * Fix formatting inconsistencies in example_amd_flash_attn_bwd.py and example_amd_flash_attn_fwd.py by adding spaces for improved readability in configuration parameters and print statements.
    
    * Refactor and enhance HIP code generation for improved FP8 support
    
    - Reorganized and cleaned up code in codegen_hip.cc for better readability and maintainability.
    - Enhanced handling of FP8 types, including additional formats and improved error logging for unsupported types.
    - Updated AtomicAdd function in common.h to streamline its implementation.
    - Refined the PrintVecElemLoadExpr method to handle volatile loads more effectively.
    - Added function to manage the addition of new functions in the code generation process.
    
    * Fix formatting issue in HIP code generation for MFMA call
    
    - Adjusted the indentation of the MFMA call code block in codegen_hip.cc for improved readability and consistency.
    
    * Refactor HIP code generation and enhance FP8 type handling
    
    - Reintroduced necessary includes and reorganized code in codegen_hip.cc for improved structure and readability.
    - Enhanced the GetFP8Type function to support additional FP8 formats and improved error handling for unsupported types.
    - Updated PrintType and PrintVecElemLoadExpr methods to better manage type conversions and vector element loading.
    - Refined the AddFunction method to streamline function addition in the code generation process.
    
    * Remove unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.
    
    * Refactor backward attention implementation in example_amd_flash_attn_bwd.py
    
    - Updated the GEMM operation to use shared memory for improved performance.
    - Adjusted parallelization parameters to enhance efficiency in the backward pass.
    
    * Fix formatting by removing an unnecessary blank line in example_amd_flash_attn_bwd.py for improved code cleanliness.
    
    * Add additional test cases for `assert_tl_matmul_correctness` with `float8_e4m3fnuz` and various configurations
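
    A sketch of what one of the added cases might look like; `assert_tl_matmul_correctness` lives in `test_tilelang_gemm_mfma_intrinsic.py`, and the argument names and order below are hypothetical rather than copied from the file:

    ```python
    # Hypothetical float8 case (argument names and order are illustrative):
    assert_tl_matmul_correctness(128, 256, 256,
                                 in_dtype="float8_e4m3fnuz",
                                 out_dtype="float32",
                                 accum_dtype="float32")
    ```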
    
    * Refactor test case formatting for `assert_tl_matmul_correctness` in `test_tilelang_gemm_mfma_intrinsic.py`
    
    ---------
    Co-authored-by: xinxyxiao <xinyxiao@amd.com>
    Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>