    [Refactor] Add kernel selection option for GEMM v1 in environment settings (#1200) · 8fbe1b3a
    Lei Wang authored
    * Add kernel selection option for GEMM v1 in environment settings
    
    - Introduced `TILELANG_USE_GEMM_V1` environment variable to control the selection of GEMM version.
    - Added `use_gemm_v1` method in the `Environment` class to determine if GEMM v1 should be used based on the environment variable.
    - Updated GEMM function assignment to default to v2, allowing for v1 to be forced via the new environment variable.
    
    * bug fix
    
    * Add kernel selection option for GEMM in environment settings
    
    - Introduced `TILELANG_USE_GEMM_V1` environment variable to allow users to select between GEMM v1 and v2 implementations.
    - Updated `gemm` function to default to v2 but switch to v1 if the environment variable is set to a truthy value.
    - Added a method `use_gemm_v1` in the `Environment` class to facilitate this selection based on the environment variable.
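
    A minimal sketch of the described selection logic, assuming helper names such as `gemm_v1`/`gemm_v2` and the truthy spellings below (the actual tilelang `Environment` class carries many more settings):

    ```python
    import os


    class Environment:
        """Minimal sketch; the real Environment holds many more settings."""

        def use_gemm_v1(self) -> bool:
            # Any of the usual truthy spellings selects the legacy v1 kernel.
            value = os.environ.get("TILELANG_USE_GEMM_V1", "0").strip().lower()
            return value in ("1", "true", "on", "yes")


    def gemm_v1(*args, **kwargs):
        """Placeholder for the legacy GEMM implementation."""


    def gemm_v2(*args, **kwargs):
        """Placeholder for the default GEMM implementation."""


    # Default to v2 and only fall back to v1 when the variable is set.
    gemm = gemm_v1 if Environment().use_gemm_v1() else gemm_v2
    ```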
    
    * Refactor GEMM macro generator to use BufferRegion instead of Buffer
    
    - Updated `wgmma` and `wgmma_rs` methods in `TensorCoreIntrinEmitter` to accept `BufferRegion` parameters instead of `Buffer`.
    - Adjusted related calls in `GemmWGMMA` to ensure compatibility with the new parameter types.
    - Simplified buffer access logic for better clarity and maintainability.
    
    * Refactor GEMM functions to utilize BufferRegion for improved memory handling
    
    - Updated `run_gemm`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` functions to set `num_stages` based on block dimensions, enhancing performance for larger matrices.
    - Simplified calls to GEMM functions by removing redundant parameters and ensuring compatibility with BufferRegion.
    - Introduced utility functions for converting between Buffer, BufferLoad, and BufferRegion, improving code clarity and maintainability.
    - Enhanced error handling for full region checks in GEMM operations to ensure correctness in memory access.
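
    A hedged sketch of what such conversion and full-region helpers could look like on top of TVM's TIR objects; the helper names are illustrative, not tilelang's actual utilities:

    ```python
    from tvm import tir
    from tvm.ir import Range


    def buffer_to_region(buf: tir.Buffer) -> tir.BufferRegion:
        """Wrap a Buffer as a BufferRegion covering its full extent."""
        return tir.BufferRegion(buf, [Range.from_min_extent(0, s) for s in buf.shape])


    def load_to_region(load: tir.BufferLoad) -> tir.BufferRegion:
        """Turn a scalar BufferLoad into a unit-extent BufferRegion."""
        return tir.BufferRegion(
            load.buffer, [Range.from_min_extent(i, 1) for i in load.indices]
        )


    def is_full_region(region: tir.BufferRegion) -> bool:
        """Check that every range starts at 0 and spans the whole dimension."""
        for rng, dim in zip(region.region, region.buffer.shape):
            if not (isinstance(rng.min, tir.IntImm) and rng.min.value == 0):
                return False
            if not tir.analysis.expr_deep_equal(rng.extent, dim):
                return False
        return True
    ```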
    
    * Refactor GEMM code for improved readability and consistency
    
    - Cleaned up formatting and spacing in GEMM-related files for better readability.
    - Standardized comments and code structure across various GEMM functions and macros.
    - Enhanced error messages for clarity in buffer region checks.
    - Removed redundant lines and improved overall code maintainability.
    
    * Update GEMM correctness evaluation and macro generator for improved functionality
    
    - Modified `N_VALUES` in `correctness_evaluation_sm70.py` to include only relevant sizes for tests.
    - Updated test function call in `correctness_evaluation.py` to use `test_gemm_false_true` for better accuracy in testing.
    - Refactored buffer handling in `mma_sm70_macro_generator.py` to improve clarity and consistency in shared buffer access.
    - Enhanced `gemm_mma_sm70.py` to ensure full region checks for input and output buffers, improving correctness in GEMM operations.
    
    * Refactor GEMM and intrinsic files for improved clarity and functionality
    
    - Removed unused variable `A_stride_last` in `mma_sm70_macro_generator.py` to streamline code.
    - Adjusted function signature formatting in `swizzle.py` for better readability.
    - Restored the return of `GemmWGMMA` in `__init__.py` for correct GEMM instantiation.
    - Removed unused variable `B_buf` in `gemm_mma_sm70.py` to enhance code cleanliness.
    - Improved function signature formatting in `language.py` for consistency.
    
    * Enhance GEMM and MMA functionality for FP64 support
    
    - Refactored `GemmNode` to streamline the decision-making process for GEMM instruction selection.
    - Added support for FP64 inputs in the MMA dispatcher, enabling new tensor operations.
    - Introduced a new layout function for FP64 in `mma_layout.py` to facilitate shared memory storage.
    - Updated `TensorCoreIntrinEmitter` to handle FP64 data types, including adjustments for micro tile dimensions and loading mechanisms.
    - Enhanced utility functions to accommodate FP64 index mapping for shared memory operations.
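
    For context, a hedged illustration only (not the emitter's real dispatch): double-precision tensor-core MMA uses the m8n8k4 instruction shape, so an FP64 path has to pick a smaller micro tile than the half-precision paths.

    ```python
    def micro_tile_shape(dtype: str) -> tuple:
        """Illustrative MMA tile shapes; real dispatch also depends on the target arch."""
        if dtype == "float64":
            return (8, 8, 4)   # mma.sync.aligned.m8n8k4 for f64
        return (16, 8, 16)     # e.g. m16n8k16 for f16/bf16
    ```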
    
    * lint fix
    
    * Refactor GEMM correctness evaluation and shared memory alignment handling
    
    - Reverted the GEMM function call in `correctness_evaluation.py` to the original implementation for consistency.
    - Added a helper function in `merge_shared_memory_allocations.cc` to streamline the marking of shared variables under alignment scope.
    - Enhanced the `VisitExpr_` methods to ensure proper handling of shared memory alignment for `BufferLoadNode` and `VarNode` types.
    - Cleaned up commented-out test code in `correctness_evaluation.py` for better readability.
    
    * Enhance GEMM and MMA implementations with region-based memory handling
    
    - Updated GEMM and MMA classes to utilize BufferRegion for input and output buffers, improving memory management and supporting strided GEMM operations.
    - Added checks to ensure full region compliance for input buffers, enhancing correctness in matrix multiplication.
    - Implemented clear accumulation functionality to reset output buffers before accumulation, ensuring accurate results in GEMM operations.
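
    A NumPy analogue of the intent, not the tilelang API: the GEMM reads sub-regions of larger buffers and can reset the accumulator region before the first multiply-accumulate.

    ```python
    import numpy as np


    def gemm_region(A, B, C, m, n, k, clear_accum=False):
        """Accumulate A[:m, :k] @ B[:k, :n] into C[:m, :n] (illustrative only)."""
        if clear_accum:
            C[:m, :n] = 0
        C[:m, :n] += A[:m, :k] @ B[:k, :n]


    # Example: the output tile is cleared once, then accumulated over.
    A = np.ones((8, 4), dtype=np.float32)
    B = np.ones((4, 8), dtype=np.float32)
    C = np.full((8, 8), 7.0, dtype=np.float32)
    gemm_region(A, B, C, m=4, n=4, k=4, clear_accum=True)  # C[:4, :4] == 4.0
    ```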
    
    * Refactor test_tilelang_example_deepseek_v32.py to improve import structure and function calls
    
    - Updated import statements to directly reference modules instead of individual test functions, enhancing clarity.
    - Modified function calls to use the new module structure for better organization and maintainability in testing examples.
    
    * Enhance OnArrayDeclaration method to handle repeated buffer declarations
    
    - Updated the OnArrayDeclaration method to merge metadata for buffers that may appear in multiple Allocate statements, improving robustness against upstream transformations.
    - Added logic to prefer concrete element data types and record extents when previously unknown, enhancing the handling of buffer declarations.
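
    A hedged Python rendition of the merge rule (the actual method is part of the C++ pass, and the field names here are invented for illustration):

    ```python
    def merge_array_declaration(existing: dict, incoming: dict) -> dict:
        """Merge metadata from a repeated Allocate of the same buffer."""
        merged = dict(existing)
        # Prefer a concrete element dtype over an unknown/opaque one.
        if not merged.get("dtype") and incoming.get("dtype"):
            merged["dtype"] = incoming["dtype"]
        # Record the extent only if it was previously unknown.
        if merged.get("extent") is None and incoming.get("extent") is not None:
            merged["extent"] = incoming["extent"]
        return merged
    ```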
    
    * Add abbreviation for bfloat16 data type in mfma_macro_generator.py
    
    - Introduced a new abbreviation "bf16" for the bfloat16 data type in the mfma_macro_generator.py file, enhancing clarity and consistency in data type representation.
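
    A sketch of the kind of abbreviation table described; the dictionary name and the surrounding entries are assumptions, only the bf16 mapping comes from the commit:

    ```python
    # Illustrative dtype -> abbreviation table used when building intrinsic names.
    dtype_abbrv = {
        "float16": "fp16",
        "bfloat16": "bf16",  # newly added abbreviation
        "float32": "fp32",
        "int8": "int8",
    }
    ```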
    
    * Refactor CodeGenTileLangHIP to enhance dtype handling and mfma call generation
    
    - Introduced a mapping function to normalize input data types to their corresponding scalar types, improving compatibility with MfmaTraits.
    - Updated the mfma call generation to utilize the new mapping, streamlining the code and enhancing clarity.
    - Removed outdated dtype mapping and replaced it with a more flexible approach to support additional data types like FP8.
    
    * lint fix
    
    * Enhance backend configuration in CMakeLists.txt and improve dtype handling in CodeGenTileLangHIP
    
    - Introduced a macro to define backend options for CUDA, ROCM, and Metal, allowing user overrides and caching of settings.
    - Updated logic to track user-selected backends and conditionally enable defaults based on environment variables.
    - Refactored dtype handling in CodeGenTileLangHIP to streamline mfma call generation and improve clarity.
    - Added support for bfloat16 in the mfma_macro_generator.py, enhancing data type representation consistency.
    
    * Update bfloat16 handling in CodeGenTileLangHIP and mfma_macro_generator.py
    
    - Changed the representation of bfloat16 in CodeGenTileLangHIP from "bfloat16x4" to "bfloat16x4_vec" for improved clarity.
    - Adjusted the mfma_suffix generation in mfma_macro_generator.py to remove the underscore before "bf16", aligning with HIP intrinsic requirements.
    
    * Change logging level from WARNING to DLOG in LegalizeNegativeIndex for non-negative index checks to reduce log verbosity.
    
    * Refactor attention sink examples to simplify index calculations
    
    - Updated index handling in `example_gqa_sink_bwd_bhsd.py` and `example_mha_sink_bwd_bhsd.py` to eliminate unnecessary local allocations and streamline logic for determining start and end indices.
    - Improved readability by using direct calculations instead of local variables for index bounds in pipelined loops.
    
    * Refactor attention sink examples to streamline index calculations
    
    - Simplified index handling in `example_gqa_sink_bwd_bhsd.py`, `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py`, `example_mha_sink_bwd_bhsd.py`, `example_mha_sink_fwd_bhsd_wgmma_pipelined.py`, and `example_mha_sink_fwd_bhsd.py` by removing unnecessary local allocations for start and end indices.
    - Enhanced readability by directly calculating index bounds for pipelined loops, improving overall code clarity.
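
    A generic Python illustration of the pattern (bounds and names are placeholders, not the examples' actual formulas): the loop bounds become plain expressions instead of values staged through one-element locals.

    ```python
    def kv_block_range(bx: int, block_M: int, block_N: int) -> range:
        """Compute pipelined-loop bounds inline rather than via local buffers."""
        start = 0                                  # hypothetical lower bound
        end = -(-((bx + 1) * block_M) // block_N)  # ceildiv as the upper bound
        return range(start, end)
    ```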
    
    * lint fix
    
    * bugfix
    
    * Refactor reduce operation handling in CUDA and Python
    
    - Removed outdated shared memory reduction logic from `reduce.cc`.
    - Introduced fragment allocation and improved buffer handling in `reduce.py` to support shared and fragment scopes.
    - Updated CUDA header to define a wider accumulator type for better numerical accuracy.
    - Enhanced error handling for buffer scope validation in the reduction process.
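
    A small NumPy illustration of the wider-accumulator idea (not the CUDA header itself): half-precision inputs are accumulated in float32 and cast back at the end.

    ```python
    import numpy as np


    def reduce_sum_wide(x: np.ndarray) -> np.float16:
        """Sum float16 values in a float32 accumulator for better accuracy."""
        return np.float16(x.astype(np.float32).sum())
    ```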
    
    * Fix ReduceOpNode to correctly compute AbsMax by using absolute values of inputs
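
    A minimal scalar sketch of the corrected combine step: the reducer compares absolute values of the inputs rather than the raw values.

    ```python
    def absmax_combine(acc: float, x: float) -> float:
        """AbsMax reduction step: keep the larger magnitude seen so far."""
        return max(acc, abs(x))
    ```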
    
    * Enhance unit loop handling by refining annotation checks
    
    - Updated the condition for identifying effectively empty annotations in unit loops to include cases where only the `pragma_unroll_explicit` hint is present.
    - Introduced a new method, `IsEffectivelyEmptyAnnotation`, to encapsulate this logic, improving code clarity and maintainability.
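
    A hedged Python rendition of the check (the real helper lives in the C++ pass): an annotation map counts as effectively empty when it has no entries or only carries the `pragma_unroll_explicit` hint.

    ```python
    def is_effectively_empty_annotation(annotations: dict) -> bool:
        """Empty, or containing nothing but the pragma_unroll_explicit hint."""
        if not annotations:
            return True
        return set(annotations) == {"pragma_unroll_explicit"}
    ```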
    
    * clean code