Commits · f12b20deccbc6c8bb5cdeac053d75178341c66c1 · OpenDAS / vllm_cscc

09 May, 2024 2 commits
- [Frontend] Move async logic outside of constructor (#4674) · f12b20de
  Cyrus Leung authored May 09, 2024
  
  f12b20de
- [Frontend] add tok/s speed metric to llm class when using tqdm (#4400) · 16bc0a09
  Mahmoud Ashraf authored May 09, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  16bc0a09
08 May, 2024 10 commits
- [Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs (#4672) · 8b9241be
  Cade Daniel authored May 08, 2024
  
  8b9241be
- [Dynamic Spec Decoding] Auto-disable by the running queue size (#4592) · f942efb5
  Cody Yu authored May 08, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
  f942efb5
- [Misc] Use vllm-flash-attn instead of flash-attn (#4686) · 89579a20
  Woosuk Kwon authored May 08, 2024
  
  89579a20
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) · 20cfcdec
  youkaichao authored May 08, 2024
  
  20cfcdec
- [Core] Faster startup for LoRA enabled models (#4634) · ad932a22
  Antoni Baum authored May 08, 2024
  
  ad932a22
- [Misc] Add `get_name` method to attention backends (#4685) · 5510cf0e
  Woosuk Kwon authored May 08, 2024
  
  5510cf0e
- [Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (#4573) · 0f9a6e3d
  DefTruth authored May 09, 2024
  
  0f9a6e3d
- [CI] Make mistral tests pass (#4596) · f6a59309
  SangBin Cho authored May 09, 2024
  
  f6a59309
- [Core] Optimize sampler get_logprobs (#4594) · d7740ea4
  SangBin Cho authored May 09, 2024
  
  d7740ea4
- [Core][Distributed] support cpu&device in broadcast tensor dict (#4660) · cc466a32
  youkaichao authored May 07, 2024
```
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660)
```
  cc466a32
07 May, 2024 4 commits
- [Bug fix][Core] fixup ngram not setup correctly (#4551) · 8344f774
  leiwen83 authored May 08, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  8344f774
- [Core][Optimization] change copy-on-write from dict[int, list] to list (#4648) · 469f85c7
  youkaichao authored May 07, 2024
  
  469f85c7
- [Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora (#4609) · 10760da8
  Austin Veselka authored May 07, 2024
  
  10760da8
- [Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
  youkaichao authored May 06, 2024
  
  63575bc2
06 May, 2024 1 commit
- [Bugfix] Fix `asyncio.Task` not being subscriptable (#4623) · 323f27b9
  Cyrus Leung authored May 07, 2024
  
  323f27b9
05 May, 2024 2 commits
- Disable cuda version check in vllm-openai image (#4530) · 0650e593
  zhaoyang-star authored May 06, 2024
  
  0650e593
- bump version to v0.4.2 (#4600) · 8d8357c8
  Simon Mo authored May 04, 2024
  
  8d8357c8
04 May, 2024 4 commits

- [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) · 43029870
  DearPlanet authored May 05, 2024
  
  43029870
- [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with... · 2a052011
  Michael Goin authored May 04, 2024

  [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

  Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436.

  This PR enables the following checkpoint loading features for Mixtral:

  - Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
  - Supports static or dynamic activation quantization with static weight quantization (all per tensor)
  - Supports different scales for each expert weight
  - Supports Fp8 in QKV layer

  Notes:

  - The Expert Gate/Router always runs at half / full precision for now.
  - If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.

  2a052011
- [Doc] Chunked Prefill Documentation (#4580) · 36fb68f9
  SangBin Cho authored May 04, 2024
  
  36fb68f9
- [Misc][Refactor] Introduce ExecuteModelData (#4540) · bc8ad684
  Cody Yu authored May 03, 2024
  
  bc8ad684
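
The note in commit 2a052011 about re-quantizing separate QKV weights to `layer.weight_scale.max()` can be illustrated with a rough sketch. This is not the vLLM implementation: the function name `requantize_to_shared_scale`, the list-based data layout, and the sample numbers are all hypothetical. Plain Python floats stand in for FP8 values, so the rounding onto the real FP8 grid is left out. The only idea shown is that each shard is dequantized with its own scale and then re-expressed under the largest scale, so that the fused weight carries a single per-tensor scale.

```python
# Largest finite magnitude of the float8 e4m3fn format; used here only as a clamp.
FP8_E4M3_MAX = 448.0


def requantize_to_shared_scale(shards, scales):
    """Re-express per-shard quantized weights under one shared scale.

    shards: list of lists of quantized values (floats standing in for FP8)
    scales: per-shard dequantization scales (real = quantized * scale)
    Returns the concatenated requantized values and the shared scale.
    """
    shared = max(scales)  # hypothetical analogue of layer.weight_scale.max()
    fused = []
    for values, scale in zip(shards, scales):
        for q in values:
            real = q * scale          # dequantize with the shard's own scale
            requant = real / shared   # requantize with the shared scale
            # Clamp to the representable range; with shared >= scale this
            # never triggers for inputs that were already in range.
            requant = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, requant))
            fused.append(requant)
    return fused, shared


# Hypothetical Q, K, V shards, each quantized with a different per-tensor scale.
q_shard, k_shard, v_shard = [100.0, -200.0], [50.0, 400.0], [-448.0, 10.0]
scales = [0.01, 0.02, 0.005]
fused, shared_scale = requantize_to_shared_scale(
    [q_shard, k_shard, v_shard], scales
)
```

Because the shared scale is the maximum, requantized magnitudes can only shrink, so no value overflows the FP8 range. The cost is some lost resolution for shards whose original scale was smaller.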

03 May, 2024 10 commits
- [Misc] add installation time env vars (#4574) · 344bf7cd
  youkaichao authored May 03, 2024
  
  344bf7cd
- [Speculative decoding] Support target-model logprobs (#4378) · ab502751
  Cade Daniel authored May 03, 2024
  
  ab502751
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
  43c413ec
- Fix/async chat serving (#2727) · f8e7adda
  Sebastian Schoennenbeck authored May 03, 2024
  
  f8e7adda
- [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None (#4586) · 7e65477e
  Michael Goin authored May 03, 2024
  
  7e65477e
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
- [Doc] add env vars to the doc (#4572) · 2d7bce9c
  youkaichao authored May 02, 2024
  
  2d7bce9c
- [Misc] remove chunk detected debug logs (#4571) · ce3f1eed
  DefTruth authored May 03, 2024
  
  ce3f1eed
- [BugFix] Prevent the task of `_force_log` from being garbage collected (#4567) · 808632d3
  Yang, Bo authored May 02, 2024
  
  808632d3
- [Core][Distributed] enable allreduce for multiple tp groups (#4566) · 344a5d0c
  youkaichao authored May 02, 2024
  
  344a5d0c
02 May, 2024 7 commits
- [Core] Ignore infeasible swap requests. (#4557) · 0f8a9140
  SangBin Cho authored May 03, 2024
  
  0f8a9140
- [kernel] fix sliding window in prefix prefill Triton kernel (#4405) · 32881f3f
  Michał Moskal authored May 02, 2024
```
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
```
  32881f3f
- [Misc] centralize all usage of environment variables (#4548) · 5b8a7c1c
  youkaichao authored May 02, 2024
  
  5b8a7c1c
- [Kernel] Support running GPTQ 8-bit models in Marlin (#4533) · 7038e8b8
  alexm-nm authored May 02, 2024
  
  7038e8b8
- [Core][Distributed] enable multiple tp group (#4512) · 2a85f930
  youkaichao authored May 01, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
  2a85f930
- [mypy][6/N] Fix all the core subdirectory typing (#4450) · cf8cac8c
  SangBin Cho authored May 02, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
  cf8cac8c
- [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not... · 0d62fe58
  SangBin Cho authored May 02, 2024
```
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451)
```
  0d62fe58