Lei Wang authored
* [Refactor] Enhance GEMM warp partitioning logic for improved performance and flexibility
  * Updated the warp partitioning logic in `Gemm::ComputeWarpPartition` to better handle the GEMM policies FullRow, FullCol, and Square.
  * Implemented checks that adjust warp allocation dynamically based on matrix dimensions.
  * Introduced a new `SelectCopy` template to streamline memory access patterns in the CUDA templates and improve compatibility across architectures.
  * Refactored the Python `GemmWarpPolicy` class to align with the updated C++ logic, making the warp allocation strategies clearer and easier to maintain.
* [Refactor] Optimize matrix multiplication parameters and performance in quickstart example
  * Reduced the thread count in the kernel context from 256 to 128 to improve performance.
  * Increased the matrix dimensions (M, N) to 1024 and the block sizes (block_M, block_N) to 128.
  * Changed the number of pipeline stages in the GEMM loop from 0 to 3 for better parallel execution.
  * Cleaned up comments for clarity and corrected a typo in the memory copy comment.
* [Refactor] Simplify Copy type selection in OperandTraits for improved clarity
  * Replaced the conditional Copy type definition in OperandTraits with the new `SelectCopy` template, improving readability and maintainability.
  * This streamlines the logic for selecting memory copy patterns based on matrix dimensions and warp configurations.
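
For readers unfamiliar with the three policies, the following is a minimal, illustrative sketch of how a warp partition might be chosen. It is not the code from this commit: the function name `compute_warp_partition`, the `Policy` enum, and the assumed minimum per-warp tile of 16 rows by 8 columns are all hypothetical, chosen only to show the idea of clamping the warp count along one dimension when the block is too small.

```python
from enum import Enum


class Policy(Enum):
    SQUARE = 0     # balance warps across M and N
    FULL_ROW = 1   # prefer placing warps along M
    FULL_COL = 2   # prefer placing warps along N


# Assumed minimum tile handled by one warp (hypothetical values).
M_PER_WARP = 16
N_PER_WARP = 8


def _largest_divisor_at_most(n, limit):
    """Return the largest divisor of n that does not exceed limit (at least 1)."""
    for d in range(max(1, min(n, limit)), 0, -1):
        if n % d == 0:
            return d
    return 1


def compute_warp_partition(block_m, block_n, num_warps, policy):
    """Return (m_warp, n_warp) with m_warp * n_warp == num_warps."""
    max_m = max(1, block_m // M_PER_WARP)  # most warps that fit along M
    max_n = max(1, block_n // N_PER_WARP)  # most warps that fit along N

    if policy is Policy.FULL_ROW:
        # Fill M first; warps that do not fit spill over to N.
        m_warp = _largest_divisor_at_most(num_warps, max_m)
        return m_warp, num_warps // m_warp

    if policy is Policy.FULL_COL:
        # Fill N first; warps that do not fit spill over to M.
        n_warp = _largest_divisor_at_most(num_warps, max_n)
        return num_warps // n_warp, n_warp

    # SQUARE: pick the factor pair whose per-warp tile is closest to square,
    # preferring pairs that respect the per-dimension limits.
    candidates = []
    for m in range(1, num_warps + 1):
        if num_warps % m:
            continue
        n = num_warps // m
        infeasible = m > max_m or n > max_n
        imbalance = abs(block_m / m - block_n / n)
        candidates.append((infeasible, imbalance, m, n))
    _, _, m_warp, n_warp = min(candidates)
    return m_warp, n_warp
```

With a 128 x 128 block and 4 warps, this sketch places all warps along M for FullRow, all along N for FullCol, and splits them 2 x 2 for Square; shrinking `block_m` to 32 under FullRow pushes two of the warps over to N.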
dbe8689f