Commits · e254497b66dcd87038969b0ad34d34425edfc5fe · OpenDAS / vllm_cscc

11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
10 May, 2024 7 commits
- [Core][Test] fix function name typo in custom allreduce (#4750) · 4e121310
  youkaichao authored May 10, 2024
  
  4e121310
- [CI] Nits for bad initialization of SeqGroup in testing (#4748) · fcc2994b
  Robert Shaw authored May 10, 2024
  
  fcc2994b
- [Speculative decoding] CUDA graph support (#4295) · 2e7796f2
  heeju-kim2 authored May 11, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
  2e7796f2
- [Core] Fix circular reference which leaked llm instance in local dev env (#4737) · 6a0f6172
  SangBin Cho authored May 10, 2024
```
Storing an exception frame is extremely prone to circular references because the frame holds references to objects.

When tensorizer is not installed, the llm instance leaks because the error frame holds references to various modules, which creates a circular reference.

I also found spec decoding has a circular reference issue, and I solved it using weakref.proxy.
```
  6a0f6172
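A minimal sketch of the two patterns the 6a0f6172 message describes, not the code from the commit itself. All class and function names here (`StoresException`, `StoresMessage`, `Owner`, `Child`, `freed_by_refcount_alone`) are hypothetical, and the check relies on CPython's reference counting with the cycle collector switched off.

```python
import gc
import weakref


class StoresException:
    """Hypothetical loader that keeps the caught exception object."""

    def __init__(self):
        try:
            raise ImportError("optional dependency missing")
        except ImportError as exc:
            # Cycle: self.error -> exc.__traceback__ -> this frame -> local `self`.
            self.error = exc


class StoresMessage:
    """Same loader, but it keeps only the message text, so no frame is retained."""

    def __init__(self):
        try:
            raise ImportError("optional dependency missing")
        except ImportError as exc:
            self.error = str(exc)


class Child:
    """Hypothetical helper that refers back to its owner through a proxy."""

    def __init__(self, owner):
        # weakref.proxy does not keep the owner alive, so there is no strong cycle.
        self.owner = weakref.proxy(owner)


class Owner:
    """Hypothetical owner object, standing in for a large engine instance."""

    def __init__(self):
        self.child = Child(self)


def freed_by_refcount_alone(factory):
    """Return True if the object dies on `del` while the cycle collector is off."""
    gc.disable()
    try:
        obj = factory()
        probe = weakref.ref(obj)
        del obj
        return probe() is None
    finally:
        gc.enable()
        gc.collect()  # clean up any cycle left behind by the experiment


results = {cls.__name__: freed_by_refcount_alone(cls)
           for cls in (StoresException, StoresMessage, Owner)}
```

On CPython, `results` should report that only `StoresException` survives the `del`: the stored exception keeps its traceback, the traceback keeps the frame, and the frame keeps `self`.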
- [Misc] Keep only one implementation of the create_dummy_prompt function. (#4716) · e965d461
  Allen.Dou authored May 10, 2024
  
  e965d461
- [Core][Distributed] refactor pynccl (#4591) · 208b71bc
  youkaichao authored May 09, 2024
```
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591)
```
  208b71bc
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) · c8331017
  Cody Yu authored May 09, 2024
  
  c8331017
09 May, 2024 3 commits
- [Misc] Set block size at initialization & Fix test_model_runner (#4705) · 0ee535b2
  Woosuk Kwon authored May 09, 2024
  
  0ee535b2
- [Misc] Remove unnecessary ModelRunner imports (#4703) · 190bc838
  Woosuk Kwon authored May 09, 2024
  
  190bc838
- [Frontend] Move async logic outside of constructor (#4674) · f12b20de
  Cyrus Leung authored May 09, 2024
  
  f12b20de
08 May, 2024 6 commits
- [Dynamic Spec Decoding] Auto-disable by the running queue size (#4592) · f942efb5
  Cody Yu authored May 08, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
  f942efb5
- [CI/Test] fix swap test for multi gpu (#4689) · 230c4b38
  youkaichao authored May 08, 2024
  
  230c4b38
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) · 20cfcdec
  youkaichao authored May 08, 2024
  
  20cfcdec
- [Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (#4573) · 0f9a6e3d
  DefTruth authored May 09, 2024
  
  0f9a6e3d
- [CI] Make mistral tests pass (#4596) · f6a59309
  SangBin Cho authored May 09, 2024
  
  f6a59309
- [Core][Distributed] support cpu&device in broadcast tensor dict (#4660) · cc466a32
  youkaichao authored May 07, 2024
```
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660)
```
  cc466a32
07 May, 2024 3 commits
- [Bug fix][Core] fixup ngram not setup correctly (#4551) · 8344f774
  leiwen83 authored May 08, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  8344f774
- [Core][Optimization] change copy-on-write from dict[int, list] to list (#4648) · 469f85c7
  youkaichao authored May 07, 2024
  
  469f85c7
- [Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
  youkaichao authored May 06, 2024
  
  63575bc2
04 May, 2024 3 commits
- [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) · 43029870
  DearPlanet authored May 05, 2024
  
  43029870
- [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with... · 2a052011
  Michael Goin authored May 04, 2024

  [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

  Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436.

  This PR enables the following checkpoint loading features for Mixtral:

  - Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
  - Supports static or dynamic activation quantization with static weight quantization (all per tensor)
  - Supports different scales for each expert weight
  - Supports Fp8 in QKV layer

  Notes:

  - The Expert Gate/Router always runs at half / full precision for now.
  - If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.

  2a052011
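The re-quantization note in 2a052011 can be illustrated with a small pure-Python sketch. This is an assumption about the arithmetic, not the kernel code from the PR: the helper names (`dequantize`, `quantize`, `requantize_with_max_scale`) and the sample values are hypothetical, and rounding to the fp8 grid is deliberately left out so that only the scale handling is visible.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def dequantize(q_values, scale):
    """Per-tensor dequantization: real value = stored value * scale."""
    return [q * scale for q in q_values]


def quantize(values, scale):
    """Per-tensor quantization without fp8 rounding: divide by scale and clamp."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


def requantize_with_max_scale(shards, scales):
    """Re-express every shard with the largest scale so one scale serves all."""
    shared = max(scales)
    new_shards = [quantize(dequantize(shard, scale), shared)
                  for shard, scale in zip(shards, scales)]
    return new_shards, shared


# Hypothetical Q, K and V shards, each stored with its own per-tensor scale.
q_shards = [[100.0, -200.0], [50.0, 25.0], [400.0, -8.0]]
scales = [0.01, 0.02, 0.04]

requantized, shared_scale = requantize_with_max_scale(q_shards, scales)
```

Because the shared scale is the maximum, every re-scaled value shrinks or stays the same, so nothing exceeds the representable range. A real fp8 cast would still lose some precision for the shards whose original scale was smaller.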
- [Misc][Refactor] Introduce ExecuteModelData (#4540) · bc8ad684
  Cody Yu authored May 03, 2024
  
  bc8ad684
03 May, 2024 5 commits
- [Speculative decoding] Support target-model logprobs (#4378) · ab502751
  Cade Daniel authored May 03, 2024
  
  ab502751
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
  43c413ec
- Fix/async chat serving (#2727) · f8e7adda
  Sebastian Schoennenbeck authored May 03, 2024
  
  f8e7adda
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
- [Core][Distributed] enable allreduce for multiple tp groups (#4566) · 344a5d0c
  youkaichao authored May 02, 2024
  
  344a5d0c
02 May, 2024 7 commits
- [Core] Ignore infeasible swap requests. (#4557) · 0f8a9140
  SangBin Cho authored May 03, 2024
  
  0f8a9140
- [kernel] fix sliding window in prefix prefill Triton kernel (#4405) · 32881f3f
  Michał Moskal authored May 02, 2024
```
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
```
  32881f3f
- [Kernel] Support running GPTQ 8-bit models in Marlin (#4533) · 7038e8b8
  alexm-nm authored May 02, 2024
  
  7038e8b8
- [Core][Distributed] enable multiple tp group (#4512) · 2a85f930
  youkaichao authored May 01, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
  2a85f930
- [CI]Add regression tests to ensure the async engine generates metrics (#4524) · 5e401bce
  Ronen Schaffer authored May 02, 2024
  
  5e401bce
- [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not... · 0d62fe58
  SangBin Cho authored May 02, 2024
```
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451)
```
  0d62fe58
- [MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273) · b8afa8b9
  Danny Guinther authored May 01, 2024
  
  b8afa8b9
01 May, 2024 5 commits
- [Bugfix] Add validation for seed (#4529) · c47ba4aa
  sasha0552 authored May 01, 2024
  
  c47ba4aa
- [Core] Add `multiproc_worker_utils` for multiprocessing-based workers (#4357) · a657bfc4
  Nick Hill authored May 01, 2024
  
  a657bfc4
- [Core] Enable prefix caching with block manager v2 enabled (#4142) · 24750f4c
  leiwen83 authored May 02, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
```
  24750f4c
- [Speculative decoding] Add ngram prompt lookup decoding (#4237) · b38e42fb
  leiwen83 authored May 02, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
  b38e42fb
- [Test] Add ignore_eos test (#4519) · 6f1df804
  SangBin Cho authored May 01, 2024
  
  6f1df804