Optimize fp16 direct load GEMM instances (#1086)
This PR optimizes the fp16 instances of the direct load GEMM kernel introduced in #999 and #1052. The performance of the new instances was measured on a CDNA2 GPU and compared against the best non-direct-load GEMM instances across 76 different GEMM problems. On average, this change improves the performance of the tested problems by 47%. For cases known to be latency-bound, the speedup is around 126%.
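The description does not say how the average was computed. As a minimal sketch, assuming it is the arithmetic mean of per-problem gains relative to the best non-direct-load instance, the calculation could look like the following. The function names and the timings are hypothetical and serve only as illustration; they are not measurements from this PR.

```python
def improvement_percent(baseline_ms, new_ms):
    """Percentage performance gain of the new kernel over the baseline.

    Assumes performance is inversely proportional to execution time.
    """
    return (baseline_ms / new_ms - 1.0) * 100.0


def mean_improvement(pairs):
    """Arithmetic mean of per-problem gains (an assumed averaging method)."""
    gains = [improvement_percent(baseline, new) for baseline, new in pairs]
    return sum(gains) / len(gains)


# Hypothetical (baseline_ms, direct_load_ms) timings, not real measurements.
timings = [(2.0, 1.0), (3.0, 2.0), (1.0, 1.0)]
average_gain = mean_improvement(timings)
```

A geometric mean of the per-problem ratios would be an equally plausible choice and is less sensitive to a few large outliers.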