1. 25 May, 2024 1 commit
  2. 24 May, 2024 1 commit
  3. 23 May, 2024 3 commits
  4. 22 May, 2024 3 commits
  5. 21 May, 2024 2 commits
  6. 20 May, 2024 3 commits
  7. 19 May, 2024 2 commits
  8. 18 May, 2024 1 commit
  9. 17 May, 2024 2 commits
  10. 16 May, 2024 4 commits
  11. 15 May, 2024 1 commit
    • [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) · 65bf2ac1
      SangBin Cho authored
      
      This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class, which can be sliced when running the attention backend.
      
      It also refactors subquery_start_loc, which was not refactored in the previous PR.
      65bf2ac1
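
A minimal sketch of the idea described in this commit message: one prepare step that builds a single flat batch, plus one metadata object that can be sliced into its prefill and decode parts. The names `BatchMetadata` and `prepare_batch`, and the choice to place prefill tokens before decode tokens, are assumptions made for illustration; they are not the classes or layout used in the PR.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BatchMetadata:
    """Hypothetical combined metadata for a batch holding prefill and decode tokens."""
    num_prefill_tokens: int
    num_decode_tokens: int

    def prefill_slice(self) -> slice:
        # Assumption: prefill tokens occupy the front of the flat batch.
        return slice(0, self.num_prefill_tokens)

    def decode_slice(self) -> slice:
        # Decode tokens follow directly after the prefill tokens.
        start = self.num_prefill_tokens
        return slice(start, start + self.num_decode_tokens)


def prepare_batch(prompts: List[List[int]],
                  decode_tokens: List[int]) -> Tuple[List[int], BatchMetadata]:
    """Flatten prompt tokens and single decode tokens into one batch."""
    flat = [tok for prompt in prompts for tok in prompt]
    num_prefill = len(flat)
    flat.extend(decode_tokens)
    return flat, BatchMetadata(num_prefill, len(decode_tokens))


tokens, meta = prepare_batch([[1, 2, 3], [4, 5]], [9, 8])
prefill_part = tokens[meta.prefill_slice()]
decode_part = tokens[meta.decode_slice()]
```

With a layout like this, a backend can run its prefill path on `prefill_part` and its decode path on `decode_part` while the caller only ever builds one batch.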
  12. 13 May, 2024 4 commits
  13. 12 May, 2024 1 commit
  14. 11 May, 2024 1 commit
  15. 10 May, 2024 1 commit
    • [Core] Fix circular reference which leaked llm instance in local dev env (#4737) · 6a0f6172
      SangBin Cho authored
      Storing an exception frame is extremely prone to circular references because the frame holds references to objects.
      
      When tensorizer is not installed, the llm instance leaks because the error frame holds references to various modules, which creates a circular reference.
      
      I also found that spec decoding has a circular reference issue, and I solved it using weakref.proxy.
      6a0f6172
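
The leak pattern described in this commit message can be reproduced in a few lines. This is a sketch under stated assumptions: `Engine`, `Owner`, `Worker`, and the helper functions are hypothetical stand-ins, not the code touched by the PR, and the check relies on CPython's reference counting. A stored exception keeps its traceback, the traceback keeps the frame, and the frame keeps its local variables, so an object that stores the exception and is also a local in that frame ends up in a cycle.

```python
import gc
import weakref


class Engine:
    """Hypothetical stand-in for a large object such as an llm instance."""


def remember_error_object(engine):
    try:
        raise ImportError("optional dependency is not installed")
    except ImportError as e:
        # Cycle: engine -> e -> e.__traceback__ -> frame -> local 'engine'.
        engine.import_error = e


def remember_error_message(engine):
    try:
        raise ImportError("optional dependency is not installed")
    except ImportError as e:
        # Keeping only the text drops the traceback and its frame.
        engine.import_error_msg = str(e)


class Worker:
    def __init__(self, owner):
        # weakref.proxy gives a back-reference without owning the parent.
        self.owner = weakref.proxy(owner)


class Owner:
    def __init__(self):
        self.worker = Worker(self)


def freed_by_refcount_alone(make):
    """True if the object from make() is freed once its last name is dropped."""
    was_enabled = gc.isenabled()
    gc.disable()  # rule out the cycle collector during the check
    try:
        obj = make()
        ref = weakref.ref(obj)
        del obj
        return ref() is None
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # clean up whatever the check leaked


def make_leaky():
    engine = Engine()
    remember_error_object(engine)
    return engine


def make_fixed():
    engine = Engine()
    remember_error_message(engine)
    return engine


leaky_freed = freed_by_refcount_alone(make_leaky)
fixed_freed = freed_by_refcount_alone(make_fixed)
proxy_freed = freed_by_refcount_alone(Owner)
```

Objects caught in such a cycle are still reclaimed eventually by the cycle collector, but until then they hold on to everything they reference, which matters when the object owns large amounts of memory.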
  16. 09 May, 2024 2 commits
  17. 08 May, 2024 3 commits
  18. 04 May, 2024 1 commit
    • [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527) · 2a052011
      Michael Goin authored
      
      Follow-up to #4332 that enables FP8 checkpoint loading for Mixtral. Supersedes #4436.
      
      This PR enables the following checkpoint loading features for Mixtral:
      
      - Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
      - Supports static or dynamic activation quantization with static weight quantization (all per tensor)
      - Supports different scales for each expert weight
      - Supports FP8 in the QKV layer
      
      Notes:
      
      - The Expert Gate/Router always runs at half / full precision for now.
      - If the Q, K, and V weights are stored separately with different weight scales, they are re-quantized using layer.weight_scale.max() so that a single GEMM can be used for performance.
      2a052011
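
The re-quantization note in this commit message comes down to simple per-tensor arithmetic: dequantize each shard with its own scale, then quantize it again with the largest scale. The sketch below shows only that arithmetic on plain Python floats. The function name and the list-based layout are assumptions for illustration, and the rounding and cast to an actual FP8 type that real code would perform are left out.

```python
import math
from typing import List, Tuple


def requantize_with_max_scale(shards: List[List[float]],
                              scales: List[float]) -> Tuple[List[List[float]], float]:
    """Re-express per-shard quantized values under one shared scale.

    shards[i] holds quantized values whose real value is q * scales[i].
    """
    max_scale = max(scales)
    requantized = [
        # Dequantize with the shard's own scale, then divide by the shared one.
        [q * scale / max_scale for q in shard]
        for shard, scale in zip(shards, scales)
    ]
    return requantized, max_scale


# Hypothetical Q, K, V shards, each with its own per-tensor scale.
qkv_shards = [[448.0, -224.0], [100.0], [-50.0, 25.0]]
qkv_scales = [0.01, 0.02, 0.005]
new_shards, shared_scale = requantize_with_max_scale(qkv_shards, qkv_scales)
```

Because every shard scale is at most the shared scale, the re-quantized magnitudes never exceed the original ones, so they stay inside the representable range. The cost is some lost precision for shards whose original scale was much smaller than the maximum.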
  19. 03 May, 2024 2 commits
  20. 02 May, 2024 2 commits