1. 05 Jun, 2024 2 commits
  2. 04 Jun, 2024 1 commit
  3. 03 Jun, 2024 1 commit
  4. 02 Jun, 2024 1 commit
  5. 01 Jun, 2024 2 commits
  6. 31 May, 2024 2 commits
  7. 30 May, 2024 1 commit
  8. 28 May, 2024 1 commit
  9. 27 May, 2024 1 commit
  10. 23 May, 2024 3 commits
  11. 22 May, 2024 1 commit
  12. 19 May, 2024 1 commit
  13. 18 May, 2024 1 commit
  14. 17 May, 2024 1 commit
  15. 16 May, 2024 3 commits
  16. 15 May, 2024 1 commit
    • [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) · 65bf2ac1
      SangBin Cho authored
      
      This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class, which can be sliced when running the attention backend.
      
      It also refactors subquery_start_loc, which was not refactored in the previous PR.
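      
      The idea of one metadata object that is sliced per phase can be sketched as follows. This is a minimal illustration only: the names BatchAttnMetadata, prefill_slice, decode_slice, and prepare_batch are hypothetical and are not taken from the PR, and the layout (prefill sequences first, then decode sequences) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BatchAttnMetadata:
    """Hypothetical combined metadata for one batch; not vLLM's actual class."""
    num_prefills: int        # number of prefill sequences in the batch
    num_prefill_tokens: int  # total prompt tokens across all prefills
    num_decode_tokens: int   # one new token per decode sequence
    seq_lens: List[int]      # assumed layout: prefills first, then decodes

    def prefill_slice(self) -> Optional["BatchAttnMetadata"]:
        """Return metadata covering only the prefill sequences, or None."""
        if self.num_prefills == 0:
            return None
        return BatchAttnMetadata(
            num_prefills=self.num_prefills,
            num_prefill_tokens=self.num_prefill_tokens,
            num_decode_tokens=0,
            seq_lens=self.seq_lens[: self.num_prefills],
        )

    def decode_slice(self) -> Optional["BatchAttnMetadata"]:
        """Return metadata covering only the decode sequences, or None."""
        if self.num_decode_tokens == 0:
            return None
        return BatchAttnMetadata(
            num_prefills=0,
            num_prefill_tokens=0,
            num_decode_tokens=self.num_decode_tokens,
            seq_lens=self.seq_lens[self.num_prefills:],
        )


def prepare_batch(prompts: List[List[int]],
                  decode_seq_lens: List[int]) -> BatchAttnMetadata:
    """Single entry point (hypothetical) replacing separate prefill/decode prep."""
    prompt_lens = [len(p) for p in prompts]
    return BatchAttnMetadata(
        num_prefills=len(prompts),
        num_prefill_tokens=sum(prompt_lens),
        num_decode_tokens=len(decode_seq_lens),
        seq_lens=prompt_lens + list(decode_seq_lens),
    )
```

      In this sketch an attention backend would call prefill_slice() and decode_slice() on the one object and skip whichever phase returns None.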
  17. 13 May, 2024 1 commit
  18. 11 May, 2024 1 commit
  19. 09 May, 2024 2 commits
  20. 08 May, 2024 3 commits
  21. 03 May, 2024 2 commits
  22. 02 May, 2024 1 commit
  23. 01 May, 2024 3 commits
    • [Misc] Fix expert_ids shape in MoE (#4517) · 826b82a2
      Woosuk Kwon authored
    • [Kernel] Update fused_moe tuning script for FP8 (#4457) · 24bb4fe4
      Philipp Moritz authored
      This PR updates the tuning script for the fused_moe kernel to support FP8 and adds configurations for TP4. For small batch sizes, I removed num_warps and num_stages from the configuration, since that improved performance and brought the benchmarks back on par with the previous numbers in that regime, making this a strict improvement over the status quo.
      
      All numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1 with 1000 input and 50 output tokens.
      
      Before this PR (with static activation scaling):
      
      qps = 1: 9.8 ms ITL, 0.49s e2e latency
      qps = 2: 9.7 ms ITL, 0.49s e2e latency 
      qps = 4: 10.1 ms ITL, 0.52s e2e latency
      qps = 6: 11.9 ms ITL, 0.59s e2e latency
      qps = 8: 14.0 ms ITL, 0.70s e2e latency
      qps = 10: 15.7 ms ITL, 0.79s e2e latency
      
      After this PR (with static activation scaling):
      
      qps = 1: 9.8 ms ITL, 0.49s e2e latency
      qps = 2: 9.7 ms ITL, 0.49s e2e latency
      qps = 4: 10.2 ms ITL, 0.53s e2e latency
      qps = 6: 11.9 ms ITL, 0.59s e2e latency
      qps = 8: 11.9 ms ITL, 0.59s e2e latency
      qps = 10: 12.1 ms ITL, 0.61s e2e latency
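      
      To make the before/after comparison easier to read, the relative change in ITL can be computed from the figures listed above. This is a small sketch for illustration; the helper name itl_reduction_pct is made up here and the script is not part of the PR.

```python
# Rows are (qps, ITL in ms, e2e latency in s), copied from the two lists above.
before = [(1, 9.8, 0.49), (2, 9.7, 0.49), (4, 10.1, 0.52),
          (6, 11.9, 0.59), (8, 14.0, 0.70), (10, 15.7, 0.79)]
after = [(1, 9.8, 0.49), (2, 9.7, 0.49), (4, 10.2, 0.53),
         (6, 11.9, 0.59), (8, 11.9, 0.59), (10, 12.1, 0.61)]


def itl_reduction_pct(before_rows, after_rows):
    """Map each qps value to the percentage reduction in ITL (positive = faster)."""
    reductions = {}
    for (qps_b, itl_b, _), (qps_a, itl_a, _) in zip(before_rows, after_rows):
        if qps_b != qps_a:
            raise ValueError("rows must be listed in the same qps order")
        reductions[qps_b] = 100.0 * (itl_b - itl_a) / itl_b
    return reductions


reductions = itl_reduction_pct(before, after)
for qps, pct in reductions.items():
    print(f"qps = {qps}: ITL reduction {pct:+.1f}%")
```

      On these figures the low-load rows are essentially unchanged, while ITL drops by roughly 15% at qps = 8 and roughly 23% at qps = 10.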
    • Jee Li's avatar
      d6f4bd7c
  24. 30 Apr, 2024 3 commits
  25. 29 Apr, 2024 1 commit