    [Bugfix] Enhance smem copy selector for uncommon shape (#510) · dbe8689f
    Lei Wang authored
    * [Refactor] Enhance GEMM warp partitioning logic for improved performance and flexibility
    
    * Updated the warp partitioning logic in `Gemm::ComputeWarpPartition` to better handle various GEMM policies, including FullRow, FullCol, and Square.
    * Implemented checks to dynamically adjust warp allocation based on matrix dimensions, ensuring optimal performance.
    * Introduced a new `SelectCopy` template to streamline memory access patterns in CUDA templates, enhancing compatibility across different architectures.
    * Refactored the Python `GemmWarpPolicy` class to align with the updated C++ logic, improving clarity and maintainability in warp allocation strategies.
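The policy-driven partitioning described above can be sketched roughly as follows. This is a minimal illustration, not the code in `Gemm::ComputeWarpPartition`: the function name `ComputeWarpPartitionSketch`, the per-warp minimums `kMPerWarp` and `kNPerWarp`, and the scoring rule are all assumptions made for the example. It enumerates the factor pairs of `num_warps`, discards any that do not divide the tile evenly, and lets the policy pick among the rest.

```cpp
#include <cassert>
#include <cstdlib>
#include <utility>

enum class GemmWarpPolicy { kSquare, kFullRow, kFullCol };

// Assumed minimum extent one warp covers along M and N (illustrative values).
constexpr int kMPerWarp = 16;
constexpr int kNPerWarp = 8;

// Hypothetical helper: returns (m_warp, n_warp) with m_warp * n_warp == num_warps,
// or (0, 0) when no split divides the M x N tile evenly.
std::pair<int, int> ComputeWarpPartitionSketch(int M, int N, int num_warps,
                                               GemmWarpPolicy policy) {
  assert(num_warps > 0);
  std::pair<int, int> best{0, 0};
  long best_score = 0;
  for (int m = 1; m <= num_warps; ++m) {
    if (num_warps % m != 0) continue;  // only exact factor pairs
    const int n = num_warps / m;
    // Skip splits that would leave a warp with a partial tile.
    if (M % (m * kMPerWarp) != 0 || N % (n * kNPerWarp) != 0) continue;
    long score = 0;
    switch (policy) {
      case GemmWarpPolicy::kFullRow:  // prefer as many warps along M as fit
        score = m;
        break;
      case GemmWarpPolicy::kFullCol:  // prefer as many warps along N as fit
        score = n;
        break;
      case GemmWarpPolicy::kSquare:   // prefer per-warp tiles closest to square
        score = -std::labs(static_cast<long>(M / m) - static_cast<long>(N / n));
        break;
    }
    if (best.first == 0 || score > best_score) {
      best = {m, n};
      best_score = score;
    }
  }
  return best;
}
```

With these assumed minimums, a 128x128 tile and 4 warps gives (4, 1) for FullRow, (1, 4) for FullCol, and (2, 2) for Square. A 16x128 tile under FullRow falls back to (1, 4), because only one warp fits along M.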
    
    * [Refactor] Optimize matrix multiplication parameters and performance in quickstart example
    
    * Updated thread count in the kernel context from 256 to 128 to enhance performance.
    * Increased the matrix dimensions (M, N) to 1024 and the block sizes (block_M, block_N) to 128, improving computational efficiency.
    * Adjusted the pipeline stages in the GEMM loop from 0 to 3 for better parallel execution.
    * Cleaned up comments for clarity and corrected a typo in the memory copy comment.
    
    * [Refactor] Simplify Copy type selection in OperandTraits for improved clarity
    
    * Replaced the conditional Copy type definition with a new SelectCopy template in OperandTraits, enhancing readability and maintainability of the code.
    * This change streamlines the logic for selecting memory copy patterns based on matrix dimensions and warp configurations.
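One way a compile-time copy selector of this kind could look is sketched below. It is an assumption-laden illustration rather than the `SelectCopy` template from this commit: the name `SelectCopySketch`, the placeholder tag types (`LdMatrixX4`, `LdMatrixX2`, `LdMatrixX1`, `DefaultCopy`), and the modulo-16 thresholds are all hypothetical. The point is only that the per-warp extent picks the widest copy that fits and falls back to a generic copy for uncommon shapes.

```cpp
#include <type_traits>

// Placeholder tags standing in for real copy atoms (hypothetical names).
struct LdMatrixX4 {};   // widest vectorized shared-memory load
struct LdMatrixX2 {};
struct LdMatrixX1 {};
struct DefaultCopy {};  // generic fallback for shapes that fit no wide load

// Hypothetical selector: N is the tile extent, num_warp_n the warps along it.
template <int N, int num_warp_n>
struct SelectCopySketch {
  static_assert(num_warp_n > 0 && N % num_warp_n == 0,
                "tile extent must split evenly across warps");
  // Extent handled by a single warp, and how it aligns to a 16-wide load.
  static constexpr int per_warp = N / num_warp_n;
  static constexpr int remainder = per_warp % 16;
  using type = std::conditional_t<
      remainder == 0, LdMatrixX4,
      std::conditional_t<
          remainder == 8, LdMatrixX2,
          std::conditional_t<remainder == 4, LdMatrixX1, DefaultCopy>>>;
};
```

Under these assumed thresholds, `SelectCopySketch<128, 4>::type` is `LdMatrixX4`, while `SelectCopySketch<24, 4>::type` (6 elements per warp) falls back to `DefaultCopy`. Keeping the choice in one template means a traits class only names the selector instead of repeating nested conditionals.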
gemm.cc 12.6 KB