1. 06 Mar, 2025 1 commit
    • [Carver] Enhance Carver Adaptation for MatMul Benchmarking (#153) · 3c53297b
      Lei Wang authored
      * [Refactor] Consolidate GemmWarpPolicy Enum and Add Utility Method
      
      - Move GemmWarpPolicy from copy.py and gemm.py to primitives/gemm/base.py
      - Implement from_warp_partition class method to determine warp policy
      - Add docstring with examples for policy determination
      - Remove duplicate GemmWarpPolicy class definitions
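
The consolidated enum and its `from_warp_partition` class method might look roughly like the sketch below. The commit message does not show the implementation, so the member names (`Square`, `FullRow`, `FullCol`), their integer values, and the decision rule are assumptions made for illustration only.

```python
from enum import IntEnum


class GemmWarpPolicy(IntEnum):
    """Hypothetical sketch of a warp partitioning policy for GEMM.

    Member names and values are assumptions, not taken from the commit.
    """

    Square = 0   # warps spread over both the M and N dimensions
    FullRow = 1  # all warps placed along the M (row) dimension
    FullCol = 2  # all warps placed along the N (column) dimension

    @classmethod
    def from_warp_partition(cls, m_warp: int, n_warp: int) -> "GemmWarpPolicy":
        """Infer a policy from the number of warps along M and N.

        Examples (under the assumed rule):
            from_warp_partition(4, 1) -> FullRow
            from_warp_partition(1, 4) -> FullCol
            from_warp_partition(2, 2) -> Square
        """
        if m_warp < 1 or n_warp < 1:
            raise ValueError("warp counts must be positive integers")
        if n_warp == 1 and m_warp > 1:
            return cls.FullRow
        if m_warp == 1 and n_warp > 1:
            return cls.FullCol
        return cls.Square


print(GemmWarpPolicy.from_warp_partition(4, 1).name)
print(GemmWarpPolicy.from_warp_partition(2, 2).name)
```

Keeping a single definition in one module means callers in `copy.py` and `gemm.py` import the same enum instead of comparing members of two unrelated classes.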
      
      * [Enhancement] Add TensorCore Intrinsic Matrix Multiplication Benchmarks
      
      - Implement two new matrix multiplication benchmark scripts:
        1. `benchmark_matmul_intrinsic.py`: Uses TensorCore intrinsics with advanced configuration
        2. `benchmark_matmul.py`: Provides a more generic matrix multiplication benchmark
      
      - Add support for roller-based configuration generation in both benchmarks
      - Enhance MMA macro generator to handle 2D and 4D output buffer layouts
      - Implement flexible autotuning configurations with multiple parameters
      - Support different data types and accumulation modes
      - Add command-line arguments for matrix dimensions and roller configuration
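
As a rough illustration of the command-line interface described above, a benchmark script could expose the matrix dimensions and a roller switch through `argparse`. The flag names (`--m`, `--n`, `--k`, `--with_roller`) and the default sizes are hypothetical; the commit message does not list them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a matmul benchmark (flag names are assumptions)."""
    parser = argparse.ArgumentParser(description="Matmul benchmark (sketch)")
    # Problem size: C[M, N] = A[M, K] @ B[K, N]
    parser.add_argument("--m", type=int, default=1024, help="rows of A and C")
    parser.add_argument("--n", type=int, default=1024, help="columns of B and C")
    parser.add_argument("--k", type=int, default=1024, help="reduction dimension")
    # When set, tuning configurations would come from roller hints
    # instead of a hand-written search space.
    parser.add_argument(
        "--with_roller",
        action="store_true",
        help="generate tuning configurations from roller hints",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--m", "4096", "--with_roller"])
total_flops = 2 * args.m * args.n * args.k  # one multiply and one add per term
print(args.m, args.n, args.k, args.with_roller, total_flops)
```

The FLOP count `2 * M * N * K` is the usual figure used to convert a measured latency into throughput for a dense matrix multiplication.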
      
      * Fix lint errors
      
      * Fix roller hints generation in get_roller_hints_from_func
      
      - Simplify roller hints generation logic
      - Ensure policy-based configuration is always emitted when a policy is available
      - Remove redundant None check for roller hints
      
      * Add shared memory for matrix multiplication in benchmark and quickstart examples
      
      - Modify benchmark_matmul.py and quickstart.py to include C_shared allocation
      - Change accumulation dtype from float16 to float in benchmark_matmul.py
      - Update matrix multiplication kernels to use shared memory for result storage
      - Enable CUDA kernel source printing in quickstart example
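
The commit does not say why the accumulation dtype was changed, but a common motivation is that half-precision accumulators lose increments once the running sum grows large. The sketch below emulates this in plain Python using `struct` to round to IEEE half and single precision; it assumes `float` denotes a 32-bit type, and the helper names are hypothetical and unrelated to the benchmark code.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(values, round_fn) -> float:
    """Sum values, rounding the accumulator after every addition."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc


# 4096 partial products that are all exactly 1.0 (a stand-in for one
# long reduction along the K dimension).
products = [1.0] * 4096

fp16_sum = accumulate(products, to_fp16)
fp32_sum = accumulate(products, to_fp32)

print(fp16_sum)  # 2048.0: above 2048 the fp16 spacing is 2, so adding 1 is lost
print(fp32_sum)  # 4096.0
```

With a half-precision accumulator the sum stalls at 2048, while the wider accumulator reaches the exact result, which is why matmul kernels typically keep fp16 inputs but accumulate in a wider type.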