1. 19 May, 2024 1 commit
  2. 18 May, 2024 1 commit
  3. 17 May, 2024 2 commits
  4. 16 May, 2024 4 commits
  5. 15 May, 2024 1 commit
    • [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) · 65bf2ac1
      SangBin Cho authored
      
      This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class, which can be sliced when running the attention backend.
      
      It also refactors subquery_start_loc, which was not refactored in the previous PR.
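
The idea of one metadata class that is sliced per phase can be sketched as follows. This is a minimal illustration, not the PR's implementation: the class name `CombinedAttnMetadata`, its fields, and the assumption that prefill tokens are laid out before decode tokens are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CombinedAttnMetadata:
    """Hypothetical stand-in for a single prefill + decode metadata class.

    Assumption: tokens are ordered with all prefill tokens first,
    followed by all decode tokens.
    """
    num_prefill_tokens: int
    num_decode_tokens: int
    slot_mapping: List[int]  # one entry per token in the batch

    def prefill_slice(self) -> "CombinedAttnMetadata":
        # Keep only the leading prefill part of the per-token data.
        n = self.num_prefill_tokens
        return CombinedAttnMetadata(n, 0, self.slot_mapping[:n])

    def decode_slice(self) -> "CombinedAttnMetadata":
        # Keep only the trailing decode part of the per-token data.
        n = self.num_prefill_tokens
        return CombinedAttnMetadata(0, self.num_decode_tokens, self.slot_mapping[n:])


meta = CombinedAttnMetadata(
    num_prefill_tokens=3,
    num_decode_tokens=2,
    slot_mapping=[10, 11, 12, 40, 41],
)
prefill_meta = meta.prefill_slice()
decode_meta = meta.decode_slice()
```

With this layout, a backend can hand `prefill_meta` to its prefill kernel and `decode_meta` to its decode kernel while the caller builds only one object per batch.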
  6. 13 May, 2024 4 commits
  7. 12 May, 2024 1 commit
  8. 11 May, 2024 1 commit
  9. 10 May, 2024 1 commit
    • [Core] Fix circular reference which leaked llm instance in local dev env (#4737) · 6a0f6172
      SangBin Cho authored
      Storing an exception frame is extremely prone to circular references because the frame holds references to the objects in scope.
      
      When tensorizer is not installed, the llm instance is leaked because the error frame holds references to various modules, which creates a circular reference.
      
      I also found spec decoding has a circular reference issue, and I solved it using weakref.proxy.
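
Both effects can be reproduced in a few lines. The sketch below is not the code from this PR: `capture_error`, `Worker`, `Engine`, and `freed_by_refcount` are hypothetical names, and the behaviour shown assumes CPython's reference counting.

```python
import gc
import weakref


def capture_error(obj):
    """Return a caught exception; its traceback keeps this frame, and thus obj, alive."""
    try:
        raise ImportError("optional dependency missing")
    except ImportError as err:
        return err


class Worker:
    """Hypothetical helper that needs a handle back to its owner."""

    def __init__(self, owner, use_proxy):
        # A strong back-reference forms an owner <-> worker cycle;
        # weakref.proxy gives access without owning the object.
        self.owner = weakref.proxy(owner) if use_proxy else owner


class Engine:
    """Hypothetical stand-in for a heavyweight object such as an LLM instance."""

    def __init__(self, use_proxy):
        self.worker = Worker(self, use_proxy)


def freed_by_refcount(use_proxy):
    """Return True if an Engine is freed as soon as its last name is deleted."""
    was_enabled = gc.isenabled()
    gc.disable()  # rule out the cycle collector; only refcounting can free objects
    try:
        engine = Engine(use_proxy)
        probe = weakref.ref(engine)
        del engine
        return probe() is None
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # clean up the cycle left by the strong-reference case
```

A stored exception reaches the captured object through `err.__traceback__.tb_frame.f_locals`, so keeping the exception on an object that is itself visible from that frame closes a cycle. `freed_by_refcount(False)` shows the strong back-reference delaying cleanup until the garbage collector runs, while `freed_by_refcount(True)` shows the proxy variant being released immediately.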
  10. 09 May, 2024 2 commits
  11. 08 May, 2024 3 commits
  12. 04 May, 2024 1 commit
    • [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527) · 2a052011
      Michael Goin authored
      
      Follow-on to #4332 that enables FP8 checkpoint loading for Mixtral; it supersedes #4436.
      
      This PR enables the following checkpoint loading features for Mixtral:
      
      - Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
      - Supports static or dynamic activation quantization with static weight quantization (all per tensor)
      - Supports different scales for each expert weight
      - Supports FP8 in the QKV layer
      
      Notes:
      
      - The expert gate/router always runs at half or full precision for now.
      - If the separate Q, K, and V weights have different weight scales, they are re-quantized using layer.weight_scale.max() so that a single GEMM can be used for performance.
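
The re-quantization to a shared scale can be illustrated with plain Python floats. This is a rough sketch under stated assumptions, not the PR's code: `requantize_to_shared_scale` is a hypothetical helper, the e4m3 range of ±448 is assumed as the FP8 format, and rounding to representable FP8 values is deliberately skipped.

```python
FP8_E4M3_MAX = 448.0  # assumed format: largest finite e4m3 value


def requantize_to_shared_scale(shards, scales):
    """Re-express per-shard quantized values with one shared per-tensor scale.

    shards: list of lists of quantized values (held as floats here)
    scales: one per-tensor scale for each shard (e.g. Q, K, V)
    """
    max_scale = max(scales)
    merged = []
    for values, scale in zip(shards, scales):
        for q in values:
            real = q * scale           # dequantize with the shard's own scale
            new_q = real / max_scale   # requantize with the shared scale
            # Clamp to the assumed FP8 range; a real kernel would also round.
            merged.append(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, new_q)))
    return merged, max_scale


# Two shards with different scales collapse into one tensor with one scale.
merged, shared_scale = requantize_to_shared_scale(
    shards=[[100.0, -200.0], [50.0]],
    scales=[0.5, 1.0],
)
```

Because the shared scale is the maximum, every re-quantized magnitude is no larger than the original one, so values that were in range stay in range; the cost is reduced resolution for shards whose own scale was smaller.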
  13. 03 May, 2024 2 commits
  14. 02 May, 2024 2 commits
  15. 01 May, 2024 5 commits
    • [Misc] Fix expert_ids shape in MoE (#4517) · 826b82a2
      Woosuk Kwon authored
    • [Misc] Remove Mixtral device="cuda" declarations (#4543) · c9d852d6
      Philipp Moritz authored
      Remove the device="cuda" declarations in Mixtral, as promised in #4343.
    • [Kernel] Update fused_moe tuning script for FP8 (#4457) · 24bb4fe4
      Philipp Moritz authored
      This PR updates the tuning script for the fused_moe kernel to support FP8 and adds configurations for TP4. For small batch sizes, I removed num_warps and num_stages from the configuration, since that improved performance and brought the benchmarks on par with the previous numbers in that regime. This makes the change a strict improvement over the status quo.
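
As an illustration of what such a configuration might look like, the sketch below keys tuned entries by batch size and omits num_warps and num_stages for the small-batch entry. Everything here is an assumption for illustration: the `pick_config` helper, the nearest-batch-size lookup, the block-size key names, and all values are hypothetical and not taken from the PR.

```python
def pick_config(configs, batch_size):
    """Pick the tuned entry whose batch-size key is closest to batch_size."""
    nearest = min(configs, key=lambda m: abs(m - batch_size))
    return configs[nearest]


# Hypothetical tuned entries, keyed by batch size (values are made up).
tuned = {
    # Small batch: num_warps / num_stages omitted, leaving them to defaults.
    1: {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 1},
    # Larger batch: launch parameters still specified explicitly.
    64: {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 8,
         "num_warps": 4, "num_stages": 4},
}

small = pick_config(tuned, 2)    # closest key is 1
large = pick_config(tuned, 48)   # closest key is 64
```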
      
      All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens.
      
      Before this PR (with static activation scaling):
      
      qps = 1: 9.8 ms ITL, 0.49s e2e latency
      qps = 2: 9.7 ms ITL, 0.49s e2e latency 
      qps = 4: 10.1 ms ITL, 0.52s e2e latency
      qps = 6: 11.9 ms ITL, 0.59s e2e latency
      qps = 8: 14.0 ms ITL, 0.70s e2e latency
      qps = 10: 15.7 ms ITL, 0.79s e2e latency
      
      After this PR (with static activation scaling):
      
      qps = 1: 9.8 ms ITL, 0.49s e2e latency
      qps = 2: 9.7 ms ITL, 0.49s e2e latency
      qps = 4: 10.2 ms ITL, 0.53s e2e latency
      qps = 6: 11.9 ms ITL, 0.59s e2e latency
      qps = 8: 11.9 ms ITL, 0.59s e2e latency
      qps = 10: 12.1 ms ITL, 0.61s e2e latency
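
To make the comparison easier to read, the ITL numbers listed above can be turned into a relative change per load level. The snippet only restates the reported figures; the helper name `itl_reduction_percent` is an illustrative choice.

```python
# ITL in milliseconds, copied from the lists above (static activation scaling).
itl_before = {1: 9.8, 2: 9.7, 4: 10.1, 6: 11.9, 8: 14.0, 10: 15.7}
itl_after = {1: 9.8, 2: 9.7, 4: 10.2, 6: 11.9, 8: 11.9, 10: 12.1}


def itl_reduction_percent(before, after):
    """Percentage reduction in ITL per qps level (negative means slower)."""
    return {
        qps: round(100.0 * (before[qps] - after[qps]) / before[qps], 1)
        for qps in before
    }


reduction = itl_reduction_percent(itl_before, itl_after)
for qps, pct in reduction.items():
    print(f"qps = {qps}: {pct:+.1f}% ITL reduction")
```

On these figures the gain is concentrated at higher load (qps = 8 and qps = 10), while low-load latency is essentially unchanged apart from a 0.1 ms difference at qps = 4.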
    • Jee Li's avatar
      d6f4bd7c
    • Robert Caulk's avatar
  16. 30 Apr, 2024 4 commits
  17. 29 Apr, 2024 2 commits
  18. 27 Apr, 2024 3 commits