-
Lei Wang authored
* [Refactor] Consolidate GemmWarpPolicy Enum and Add Utility Method - Move GemmWarpPolicy from copy.py and gemm.py to primitives/gemm/base.py - Implement from_warp_partition class method to determine warp policy - Add docstring with examples for policy determination - Remove duplicate GemmWarpPolicy class definitions * [Enhancement] Add TensorCore Intrinsic Matrix Multiplication Benchmarks - Implement two new matrix multiplication benchmark scripts: 1. `benchmark_matmul_intrinsic.py`: Uses TensorCore intrinsics with advanced configuration 2. `benchmark_matmul.py`: Provides a more generic matrix multiplication benchmark - Add support for roller-based configuration generation in both benchmarks - Enhance MMA macro generator to handle 2D and 4D output buffer layouts - Implement flexible autotuning configurations with multiple parameters - Support different data types and accumulation modes - Add command-line arguments for matrix dimensions and roller configuration * lint fix * Fix roller hints generation in get_roller_hints_from_func - Simplify roller hints generation logic - Ensure policy-based configuration is always emitted when a policy is available - Remove redundant None check for roller hints * Add shared memory for matrix multiplication in benchmark and quickstart examples - Modify benchmark_matmul.py and quickstart.py to include C_shared allocation - Change accumulation dtype from float16 to float in benchmark_matmul.py - Update matrix multiplication kernels to use shared memory for result storage - Enable CUDA kernel source printing in quickstart example
3c53297b