    [Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342
    Cody Yu authored
    Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfers to better utilize memory bandwidth. A microbenchmark shows that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large).
    
    In detail, we applied three optimizations:
    
    - Use an inverted scale so that most divisions become multiplications.
    - Unroll the loop 4 times to improve instruction-level parallelism (ILP).
    - Use 4-element vectorized loads and stores to transfer data between HBM and SRAM.
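
The three optimizations above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the kernel name `fp8_quant_vec4_sketch`, its signature, the `float` input type, and the use of the `__nv_fp8_e4m3` / `__nv_fp8x4_e4m3` types from `cuda_fp8.h` are all assumptions made for the example. It also assumes a precomputed per-tensor scale and buffers aligned as returned by `cudaMalloc`.

```cuda
#include <cuda_fp8.h>  // __nv_fp8_e4m3, __nv_fp8x4_e4m3 (CUDA 11.8 or newer)
#include <cstdint>

// Hypothetical kernel for illustration only; not the kernel in common.cu.
// Assumes `in` is 16-byte aligned and `out` is 4-byte aligned, which holds
// for pointers returned directly by cudaMalloc.
__global__ void fp8_quant_vec4_sketch(__nv_fp8_e4m3* __restrict__ out,
                                      const float* __restrict__ in,
                                      const float* __restrict__ scale,
                                      int64_t n) {
  // (1) Inverted scale: one division per thread, then only multiplications.
  const float inv_scale = 1.0f / (*scale);

  const int64_t tid =
      static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  const int64_t stride = static_cast<int64_t>(gridDim.x) * blockDim.x;

  // (3) Vectorized access: view the buffers as packs of 4 elements so each
  // iteration issues one 16-byte load and one 4-byte store.
  const float4* in4 = reinterpret_cast<const float4*>(in);
  __nv_fp8x4_e4m3* out4 = reinterpret_cast<__nv_fp8x4_e4m3*>(out);
  const int64_t n4 = n / 4;  // number of complete 4-element packs

  // (2) Each iteration handles 4 independent elements, i.e. the scalar loop
  // unrolled by 4, which gives the hardware more work to overlap.
  for (int64_t i = tid; i < n4; i += stride) {
    float4 v = in4[i];
    v.x *= inv_scale;
    v.y *= inv_scale;
    v.z *= inv_scale;
    v.w *= inv_scale;
    out4[i] = __nv_fp8x4_e4m3(v);  // float4 -> 4 packed E4M3 values
  }

  // Scalar tail for the last n % 4 elements.
  for (int64_t i = n4 * 4 + tid; i < n; i += stride) {
    out[i] = __nv_fp8_e4m3(in[i] * inv_scale);
  }
}
```

A launch such as `fp8_quant_vec4_sketch<<<num_blocks, 256>>>(out, in, scale, n)` would work for any grid size, since both loops are grid-stride loops. The wider accesses matter most when rows are long, which is consistent with the larger speedup reported for large hidden sizes.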
common.cu 5.69 KB