1. 08 Oct, 2025 1 commit
  2. 01 Oct, 2025 1 commit
  3. 30 Sep, 2025 1 commit
  4. 23 Sep, 2025 1 commit
  5. 17 Sep, 2025 1 commit
  6. 13 Sep, 2025 1 commit
  7. 05 Aug, 2025 1 commit
  8. 30 Jul, 2025 1 commit
  9. 26 Jul, 2025 1 commit
  10. 22 Jul, 2025 2 commits
  11. 16 Jun, 2025 1 commit
  12. 12 Jun, 2025 1 commit
  13. 03 Jun, 2025 1 commit
  14. 14 May, 2025 1 commit
  15. 07 May, 2025 1 commit
  16. 31 Mar, 2025 2 commits
  17. 15 Mar, 2025 1 commit
  18. 14 Mar, 2025 1 commit
  19. 11 Mar, 2025 1 commit
  20. 27 Feb, 2025 1 commit
  21. 25 Feb, 2025 1 commit
  22. 20 Feb, 2025 1 commit
  23. 13 Dec, 2024 1 commit
  24. 08 Nov, 2024 1 commit
  25. 16 Oct, 2024 1 commit
  26. 04 Oct, 2024 1 commit
  27. 22 Aug, 2024 1 commit
  28. 16 Aug, 2024 1 commit
  29. 05 Aug, 2024 1 commit
  30. 30 Jul, 2024 1 commit
  31. 26 Jul, 2024 1 commit
  32. 22 Jul, 2024 1 commit
  33. 21 Jul, 2024 1 commit
  34. 20 Jul, 2024 1 commit
  35. 18 Jul, 2024 1 commit
  36. 03 Jul, 2024 1 commit
  37. 12 Jun, 2024 1 commit
      [Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342
      Cody Yu authored
      Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmarks show that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large).
      
      In detail, we applied 3 optimizations:
      
      - Use an inverted scale so that most divisions become multiplications.
      - Unroll the loop 4 times to improve ILP.
      - Use 4-element vectorized loads and stores to transfer data between HBM and SRAM.
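
The three optimizations above can be illustrated with a minimal CPU-side sketch. This is not the kernel from the PR: `Vec4`, `scale_and_clamp`, and `scaled_for_fp8` are hypothetical names, the pack type only stands in for a GPU vector type such as `float4`, and the final cast to an 8-bit FP8 encoding is left out. The clamp bound assumes the E4M3 format, whose largest finite value is 448.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Largest finite magnitude of FP8 E4M3, used as the clamp bound (assumed format).
constexpr float kFp8E4m3Max = 448.0f;

// Hypothetical 4-wide pack; on a GPU a type such as float4 plays this role.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Scale one value with the precomputed inverse and clamp it to the FP8 range.
inline float scale_and_clamp(float v, float inv_scale) {
    return std::max(-kFp8E4m3Max, std::min(v * inv_scale, kFp8E4m3Max));
}

// Returns the scaled and clamped values; the FP8 cast itself is omitted.
std::vector<float> scaled_for_fp8(const std::vector<float>& in, float scale) {
    const float inv_scale = 1.0f / scale;  // one division, then only multiplications
    std::vector<float> out(in.size());
    const std::size_t n_vec = in.size() / 4;

    for (std::size_t i = 0; i < n_vec; ++i) {
        // Gather four consecutive inputs into one pack.
        Vec4 v{in[4 * i], in[4 * i + 1], in[4 * i + 2], in[4 * i + 3]};

        // Four independent operations per iteration (the unrolled body).
        v.x = scale_and_clamp(v.x, inv_scale);
        v.y = scale_and_clamp(v.y, inv_scale);
        v.z = scale_and_clamp(v.z, inv_scale);
        v.w = scale_and_clamp(v.w, inv_scale);

        // Write the pack back.
        out[4 * i] = v.x;
        out[4 * i + 1] = v.y;
        out[4 * i + 2] = v.z;
        out[4 * i + 3] = v.w;
    }

    // Scalar tail for lengths that are not a multiple of 4.
    for (std::size_t i = 4 * n_vec; i < in.size(); ++i) {
        out[i] = scale_and_clamp(in[i], inv_scale);
    }
    return out;
}
```

Multiplying by a precomputed reciprocal is not always bit-identical to dividing by the scale, so a kernel written this way may differ from a division-based one in the last bit of the scaled value.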
  38. 09 Jun, 2024 1 commit