[Kernel] Vectorized FP8 quantize kernel (#5396)
Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to make better use of memory bandwidth. A microbenchmark shows that the improved kernel achieves a 1.0x-1.5x speedup, with the larger gains at large hidden sizes. In detail, we applied 3 optimizations:
- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve ILP.
- Use 4-element vectorized accesses to transfer data between HBM and SRAM.
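The three optimizations can be illustrated with a minimal CUDA sketch. This is not the kernel from this PR: the names `quantize_fp8_vec4`, `Fp8x4`, and `to_fp8` are hypothetical, the input is assumed to be `float` with a single per-tensor scale, and the pointers are assumed to be suitably aligned (as `cudaMalloc` allocations are). It relies on `__nv_cvt_float_to_fp8` from `cuda_fp8.h`, available in CUDA 11.8 and later.

```cuda
#include <cuda_fp8.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper type: four FP8 bytes packed so they can be
// written with a single 32-bit store.
struct __align__(4) Fp8x4 { __nv_fp8_storage_t x, y, z, w; };

// Multiply by the reciprocal scale, then convert to E4M3 with saturation.
__device__ __forceinline__ __nv_fp8_storage_t to_fp8(float v, float inv_scale) {
  return __nv_cvt_float_to_fp8(v * inv_scale, __NV_SATFINITE, __NV_E4M3);
}

// Assumes `in` is 16-byte aligned and `out` is 4-byte aligned.
__global__ void quantize_fp8_vec4(__nv_fp8_storage_t* __restrict__ out,
                                  const float* __restrict__ in,
                                  const float* __restrict__ scale,
                                  int64_t n) {
  // Inverted scale: one division per thread, multiplications afterwards.
  const float inv_scale = 1.0f / (*scale);
  const int64_t tid = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
  const int64_t stride = static_cast<int64_t>(gridDim.x) * blockDim.x;

  // Vectorized access: each iteration loads 4 floats and stores 4 FP8 bytes.
  const int64_t n_vec = n / 4;
  const float4* in4 = reinterpret_cast<const float4*>(in);
  Fp8x4* out4 = reinterpret_cast<Fp8x4*>(out);

  // Ask the compiler to unroll by 4 so independent loads can overlap (ILP).
#pragma unroll 4
  for (int64_t i = tid; i < n_vec; i += stride) {
    const float4 v = in4[i];
    Fp8x4 q;
    q.x = to_fp8(v.x, inv_scale);
    q.y = to_fp8(v.y, inv_scale);
    q.z = to_fp8(v.z, inv_scale);
    q.w = to_fp8(v.w, inv_scale);
    out4[i] = q;
  }

  // Scalar tail for the last n % 4 elements.
  for (int64_t i = n_vec * 4 + tid; i < n; i += stride) {
    out[i] = to_fp8(in[i], inv_scale);
  }
}

int main() {
  const int64_t n = 4099;  // not a multiple of 4, so the tail loop runs
  const float h_scale = 0.5f;
  std::vector<float> h_in(n);
  for (int64_t i = 0; i < n; ++i) h_in[i] = 0.01f * static_cast<float>(i - n / 2);

  float* d_in = nullptr;
  float* d_scale = nullptr;
  __nv_fp8_storage_t* d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_scale, sizeof(float));
  cudaMalloc(&d_out, n * sizeof(__nv_fp8_storage_t));
  cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_scale, &h_scale, sizeof(float), cudaMemcpyHostToDevice);

  quantize_fp8_vec4<<<8, 256>>>(d_out, d_in, d_scale, n);

  std::vector<__nv_fp8_storage_t> h_out(n);
  cudaMemcpy(h_out.data(), d_out, n * sizeof(__nv_fp8_storage_t),
             cudaMemcpyDeviceToHost);

  // Compare against a scalar host-side reference using the same conversion.
  const float inv = 1.0f / h_scale;
  int64_t mismatches = 0;
  for (int64_t i = 0; i < n; ++i) {
    if (h_out[i] != __nv_cvt_float_to_fp8(h_in[i] * inv, __NV_SATFINITE, __NV_E4M3)) {
      ++mismatches;
    }
  }
  std::printf("mismatches vs. scalar reference: %lld\n",
              static_cast<long long>(mismatches));

  cudaFree(d_in);
  cudaFree(d_scale);
  cudaFree(d_out);
  return mismatches == 0 ? 0 : 1;
}
```

The sketch can be built with something like `nvcc -o quant_sketch quant_sketch.cu` (file name hypothetical). A real kernel would typically also template on the input type (for example half or bfloat16) and handle per-row layouts, which this example leaves out.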