csrc/quantization/fp8/common.cu · e76466dde2bc9525d55165ceaa600d298c7bf773 · OpenDAS / vllm_cscc

[Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342

Cody Yu authored Jun 12, 2024

Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfers to make better use of memory bandwidth. A microbenchmark shows that the improved kernel achieves a 1.0x-1.5x speedup, with the largest gains when the hidden size is large.

In detail, we applied three optimizations:

- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve instruction-level parallelism (ILP).
- Use 4-element vectorized loads and stores to transfer data between HBM and SRAM.
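
The three optimizations can be combined roughly as in the sketch below. This is not the code from this PR: the names `scale_and_clamp`, `to_fp8`, and `quantize_fp8_vec4` are hypothetical, the input is assumed to be `float` (the real kernel may well be templated over the input type), and the buffers are assumed to be aligned for `float4` and `uchar4` accesses, which holds for pointers returned by `cudaMalloc`. The FP8 conversion uses `__nv_cvt_float_to_fp8` from `<cuda_fp8.h>` (CUDA 11.8 or later).

```cuda
#include <cuda_fp8.h>
#include <cmath>
#include <cstdint>

// Hypothetical helper: multiply by the precomputed reciprocal of the scale
// (instead of dividing per element) and clamp to the finite FP8 E4M3 range.
__host__ __device__ inline float scale_and_clamp(float val, float inverted_scale) {
  const float fp8_max = 448.0f;  // largest finite E4M3 value
  const float x = val * inverted_scale;
  return fmaxf(-fp8_max, fminf(x, fp8_max));
}

// Hypothetical helper: scaled float -> raw FP8 E4M3 byte.
__host__ __device__ inline __nv_fp8_storage_t to_fp8(float val, float inverted_scale) {
  return __nv_cvt_float_to_fp8(scale_and_clamp(val, inverted_scale),
                               __NV_SATFINITE, __NV_E4M3);
}

// Sketch of a vectorized quantize kernel (assumes `in` is 16-byte aligned
// and `out` is 4-byte aligned).
__global__ void quantize_fp8_vec4(__nv_fp8_storage_t* __restrict__ out,
                                  const float* __restrict__ in,
                                  const float* __restrict__ scale,
                                  int64_t num_elems) {
  const int64_t tid = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  const int64_t stride = static_cast<int64_t>(gridDim.x) * blockDim.x;

  // Optimization 1: one division per thread, multiplications per element.
  const float inverted_scale = 1.0f / (*scale);

  // Optimization 3: move 4 elements per memory transaction.
  const float4* in4 = reinterpret_cast<const float4*>(in);
  uchar4* out4 = reinterpret_cast<uchar4*>(out);
  const int64_t num_vec = num_elems / 4;

  // Optimization 2: ask the compiler to unroll the grid-stride loop by 4.
#pragma unroll 4
  for (int64_t i = tid; i < num_vec; i += stride) {
    const float4 v = in4[i];
    uchar4 q;
    q.x = to_fp8(v.x, inverted_scale);
    q.y = to_fp8(v.y, inverted_scale);
    q.z = to_fp8(v.z, inverted_scale);
    q.w = to_fp8(v.w, inverted_scale);
    out4[i] = q;
  }

  // Scalar tail for the last (num_elems % 4) elements.
  for (int64_t i = num_vec * 4 + tid; i < num_elems; i += stride) {
    out[i] = to_fp8(in[i], inverted_scale);
  }
}
```

A launch such as `quantize_fp8_vec4<<<grid, block>>>(out, in, scale, n)` would size `grid` from `n / 4` rather than `n`, since each thread handles four elements per iteration. The scalar tail loop keeps the kernel correct when `n` is not a multiple of 4.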

common.cu 5.69 KB
