    [Carver] Enhance Carver Adaptation for MatMul Benchmarking (#153) · 3c53297b
    Lei Wang authored
    * [Refactor] Consolidate GemmWarpPolicy Enum and Add Utility Method
    
    - Move GemmWarpPolicy from copy.py and gemm.py to primitives/gemm/base.py
    - Implement from_warp_partition class method to determine warp policy
    - Add docstring with examples for policy determination
    - Remove duplicate GemmWarpPolicy class definitions
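
A minimal sketch of how a `from_warp_partition` class method could map a warp partition to a policy. The member names (`Square`, `FullRow`, `FullCol`), their values, and the mapping rules are assumptions made for illustration. The commit only states that the method exists and determines the warp policy, so the real implementation in primitives/gemm/base.py may differ.

```python
from enum import IntEnum


class GemmWarpPolicy(IntEnum):
    """How warps are spread over the M (row) and N (column) axes of a block tile."""

    # Member names and values are assumptions for this sketch.
    Square = 0
    FullRow = 1
    FullCol = 2

    @classmethod
    def from_warp_partition(cls, m_warp: int, n_warp: int) -> "GemmWarpPolicy":
        """Pick a policy from the number of warps along M and along N."""
        if m_warp < 1 or n_warp < 1:
            raise ValueError("warp counts must be positive")
        if n_warp == 1 and m_warp > 1:
            # Every warp is stacked along the row (M) dimension.
            return cls.FullRow
        if m_warp == 1 and n_warp > 1:
            # Every warp is stacked along the column (N) dimension.
            return cls.FullCol
        # Mixed partitions (and the single-warp case) fall back to Square.
        return cls.Square
```

Under these assumptions, `GemmWarpPolicy.from_warp_partition(4, 1)` yields `FullRow`, `(1, 4)` yields `FullCol`, and `(2, 2)` yields `Square`.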
    
    * [Enhancement] Add TensorCore Intrinsic Matrix Multiplication Benchmarks
    
    - Implement two new matrix multiplication benchmark scripts:
      1. `benchmark_matmul_intrinsic.py`: Uses TensorCore intrinsics with advanced configuration
      2. `benchmark_matmul.py`: Provides a more generic matrix multiplication benchmark
    
    - Add support for roller-based configuration generation in both benchmarks
    - Enhance MMA macro generator to handle 2D and 4D output buffer layouts
    - Implement flexible autotuning configurations with multiple parameters
    - Support different data types and accumulation modes
    - Add command-line arguments for matrix dimensions and roller configuration
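
The command-line interface described above could be sketched with `argparse` as follows. The flag names (`--m`, `--n`, `--k`, `--with_roller`) and the default sizes are hypothetical placeholders, since the commit message does not list them. Check the benchmark scripts for the actual options.

```python
import argparse


def parse_args(argv=None):
    """Parse matrix dimensions and the roller switch (all names are illustrative)."""
    parser = argparse.ArgumentParser(description="MatMul benchmark (illustrative sketch)")
    parser.add_argument("--m", type=int, default=1024, help="Rows of A and C")
    parser.add_argument("--n", type=int, default=1024, help="Columns of B and C")
    parser.add_argument("--k", type=int, default=1024, help="Shared dimension of A and B")
    parser.add_argument(
        "--with_roller",
        action="store_true",
        help="Generate tuning configurations from roller hints",
    )
    return parser.parse_args(argv)


# Passing an explicit list keeps the example independent of sys.argv.
args = parse_args(["--m", "4096", "--n", "4096", "--k", "2048", "--with_roller"])
# A GEMM performs one multiply and one add per (m, n, k) triple.
total_flops = 2 * args.m * args.n * args.k
print(f"M={args.m} N={args.n} K={args.k} roller={args.with_roller} flops={total_flops}")
```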
    
    * lint fix
    
    * Fix roller hints generation in get_roller_hints_from_func
    
    - Simplify roller hints generation logic
    - Ensure policy-based configuration is always emitted when a policy is available
    - Remove redundant None check for roller hints
    
    * Add shared memory for matrix multiplication in benchmark and quickstart examples
    
    - Modify benchmark_matmul.py and quickstart.py to include C_shared allocation
    - Change accumulation dtype from float16 to float in benchmark_matmul.py
    - Update matrix multiplication kernels to use shared memory for result storage
    - Enable CUDA kernel source printing in quickstart example
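
The commit does not say why the accumulation dtype moved from float16 to float. One common motivation, assumed here, is that long reductions lose precision in a half-precision accumulator. The sketch below illustrates that effect with the standard library only. It is not the benchmark kernel, and the helper names are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(values, round_fn):
    """Sum values, rounding the running total after every addition."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc


# A dot product of two all-ones vectors with K = 4096 reduces to summing 4096 ones.
ones = [1.0] * 4096
fp16_sum = accumulate(ones, to_fp16)
fp32_sum = accumulate(ones, to_fp32)
print(fp16_sum, fp32_sum)  # prints: 2048.0 4096.0
```

Above 2048, adjacent float16 values are 2 apart, so adding 1 rounds back to 2048 and the running sum stops growing. The float32 accumulator represents every intermediate value exactly.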
benchmark_matmul.py 9.99 KB