1. 03 Jul, 2024 1 commit
  2. 29 Jun, 2024 1 commit
  3. 28 Jun, 2024 2 commits
  4. 26 Jun, 2024 2 commits
  5. 23 Jun, 2024 1 commit
  6. 21 Jun, 2024 2 commits
  7. 20 Jun, 2024 4 commits
  8. 18 Jun, 2024 3 commits
  9. 14 Jun, 2024 2 commits
  10. 13 Jun, 2024 2 commits
  11. 12 Jun, 2024 1 commit
      [Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342
      Cody Yu authored
      Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfers to better utilize memory bandwidth. A microbenchmark shows that the improved kernel achieves a 1.0x-1.5x speedup, with the largest gains when the hidden size is large.
      
      In detail, we applied three optimizations:
      
      - Use the inverted scale so that most divisions become multiplications.
      - Unroll the loop 4 times to improve instruction-level parallelism (ILP).
      - Use 4-element vectorized loads and stores to transfer data between HBM and SRAM.
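
The three optimizations above can be illustrated with a minimal CPU-side C++ sketch. This is not the PR's CUDA kernel: the names `quantize_sketch`, `scale_and_clamp`, and `kFp8E4m3Max` are hypothetical, the E4M3 target format (finite maximum 448) is an assumption, and the final conversion to an 8-bit FP8 encoding is omitted. It only shows the loop structure: one division to obtain the inverted scale, groups of four elements handled per iteration, and a scalar tail for the remainder.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: the target format is FP8 E4M3, whose largest finite value is 448.
constexpr float kFp8E4m3Max = 448.0f;

// Hypothetical helper: scale one value and clamp it to the FP8 range.
// A real kernel would also convert the result to an 8-bit encoding.
inline float scale_and_clamp(float x, float inv_scale) {
    return std::max(-kFp8E4m3Max, std::min(x * inv_scale, kFp8E4m3Max));
}

std::vector<float> quantize_sketch(const std::vector<float>& in, float scale) {
    assert(scale > 0.0f);
    // Optimization 1: divide once, then multiply per element.
    const float inv_scale = 1.0f / scale;

    std::vector<float> out(in.size());
    const std::size_t n4 = in.size() / 4 * 4;  // largest multiple of 4

    // Optimizations 2 and 3: read four elements together (a stand-in for a
    // 4-wide vector load) and process them as four independent operations.
    for (std::size_t i = 0; i < n4; i += 4) {
        const float a = in[i], b = in[i + 1], c = in[i + 2], d = in[i + 3];
        out[i]     = scale_and_clamp(a, inv_scale);
        out[i + 1] = scale_and_clamp(b, inv_scale);
        out[i + 2] = scale_and_clamp(c, inv_scale);
        out[i + 3] = scale_and_clamp(d, inv_scale);
    }

    // Scalar tail for sizes that are not a multiple of 4.
    for (std::size_t i = n4; i < in.size(); ++i) {
        out[i] = scale_and_clamp(in[i], inv_scale);
    }
    return out;
}
```

On a GPU, the grouped read would typically be a single 16-byte load per thread, which is where the memory-bandwidth gain comes from; the CPU version only mirrors the control flow.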
  12. 09 Jun, 2024 1 commit
  13. 07 Jun, 2024 2 commits
  14. 05 Jun, 2024 1 commit
  15. 03 Jun, 2024 2 commits
  16. 02 Jun, 2024 1 commit
  17. 01 Jun, 2024 3 commits
  18. 31 May, 2024 3 commits
  19. 25 May, 2024 1 commit
  20. 23 May, 2024 2 commits
  21. 22 May, 2024 3 commits