Commits · 0ee535b2945d042cbb1fc6e63fd3fddd94d491f2 · OpenDAS / vllm_cscc

09 May, 2024 5 commits
- [Misc] Set block size at initialization & Fix test_model_runner (#4705) · 0ee535b2
  Woosuk Kwon authored May 09, 2024
  
- [Misc] Remove unnecessary ModelRunner imports (#4703) · 190bc838
  Woosuk Kwon authored May 09, 2024
  
- [Frontend] Move async logic outside of constructor (#4674) · f12b20de
  Cyrus Leung authored May 09, 2024
  
- [Frontend] add tok/s speed metric to llm class when using tqdm (#4400) · 16bc0a09
  Mahmoud Ashraf authored May 09, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
- [Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (#4626) · e288df06
  alexm-nm authored May 08, 2024
  
08 May, 2024 11 commits
- [Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs (#4672) · 8b9241be
  Cade Daniel authored May 08, 2024
  
- [Dynamic Spec Decoding] Auto-disable by the running queue size (#4592) · f942efb5
  Cody Yu authored May 08, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
- [Misc] Use vllm-flash-attn instead of flash-attn (#4686) · 89579a20
  Woosuk Kwon authored May 08, 2024
  
- [CI/Test] fix swap test for multi gpu (#4689) · 230c4b38
  youkaichao authored May 08, 2024
  
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) · 20cfcdec
  youkaichao authored May 08, 2024
  
- [Core] Faster startup for LoRA enabled models (#4634) · ad932a22
  Antoni Baum authored May 08, 2024
  
- [Misc] Add `get_name` method to attention backends (#4685) · 5510cf0e
  Woosuk Kwon authored May 08, 2024
  
- [Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (#4573) · 0f9a6e3d
  DefTruth authored May 09, 2024
  
- [CI] Make mistral tests pass (#4596) · f6a59309
  SangBin Cho authored May 09, 2024
  
- [Core] Optimize sampler get_logprobs (#4594) · d7740ea4
  SangBin Cho authored May 09, 2024
  
- [Core][Distributed] support cpu&device in broadcast tensor dict (#4660) · cc466a32
  youkaichao authored May 07, 2024
```
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660)
```
07 May, 2024 6 commits

[Bug fix][Core] fixup ngram not setup correctly (#4551) · 8344f774

leiwen83 authored May 08, 2024


Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>


[Core][Optimization] change copy-on-write from dict[int, list] to list (#4648) · 469f85c7
youkaichao authored May 07, 2024

[Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora (#4609) · 10760da8
Austin Veselka authored May 07, 2024

[Build/CI] Fixing 'docker run' to re-enable AMD CI tests. (#4642) · 478aed58
Alexei-V-Ivanov-AMD authored May 07, 2024

[Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
youkaichao authored May 06, 2024


[Kernel] Make static FP8 scaling more robust (#4570) · a98187cf

Philipp Moritz authored May 06, 2024

Previously, FP8 static scaling worked only if the scales overestimated the maxima of all activation tensors during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint

https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale

(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|

With the fix in this PR, where the scaled activations are clamped to the range [`-std::numeric_limits<c10::Float8_e4m3fn>::max()`, `std::numeric_limits<c10::Float8_e4m3fn>::max()`] to make sure there are no NaNs, the performance is

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|

This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.
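
The clamping idea described in this commit message can be illustrated with a small plain-Python sketch. It is not the kernel code from the PR: the function name, the sample activations, and the scale of 0.125 are all made up for illustration. The only fixed fact is that 448 is the largest finite value of the FP8 E4M3 ("fn") format, which has no infinities, so a scaled value beyond that range has no finite representation.

```python
# Largest finite value representable in FP8 E4M3 (the "fn" variant has no infinities).
FP8_E4M3FN_MAX = 448.0


def scale_activations(values, scale, clamp=True):
    """Divide activations by a static scale, optionally clamping to the FP8 range.

    Hypothetical helper for illustration only; real kernels operate on tensors
    and also round to the FP8 grid, which is omitted here.
    """
    scaled = [v / scale for v in values]
    if clamp:
        scaled = [max(-FP8_E4M3FN_MAX, min(FP8_E4M3FN_MAX, s)) for s in scaled]
    return scaled


# Assumed calibration: the scale was chosen so that |x| <= 56.0 maps into range
# (56.0 / 448.0 == 0.125).
calibrated_scale = 0.125
# The last two activations exceed the calibrated maximum.
activations = [1.0, -30.0, 60.0, -100.0]

unclamped = scale_activations(activations, calibrated_scale, clamp=False)
clamped = scale_activations(activations, calibrated_scale, clamp=True)

print("unclamped:", unclamped)  # contains values outside [-448, 448]
print("clamped:  ", clamped)    # every value lies inside [-448, 448]
```

With clamping, an underestimated scale costs some precision on outliers instead of producing values that cannot be represented at all.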


06 May, 2024 4 commits
- Update lm-format-enforcer to 0.10.1 (#4631) · bd99d226
  Noam Gat authored May 07, 2024
  
- [CI] Add retry for agent lost (#4633) · 19cb4716
  Cade Daniel authored May 06, 2024
  
- [CI] use ccache actions properly in release workflow (#4629) · e186d37c
  Simon Mo authored May 06, 2024
  
- [Bugfix] Fix `asyncio.Task` not being subscriptable (#4623) · 323f27b9
  Cyrus Leung authored May 07, 2024
  
05 May, 2024 3 commits
- Disable cuda version check in vllm-openai image (#4530) · 0650e593
  zhaoyang-star authored May 06, 2024
  
- [CI] Reduce wheel size by not shipping debug symbols (#4602) · c7f2cf2b
  Simon Mo authored May 04, 2024
  
- bump version to v0.4.2 (#4600) · 8d8357c8
  Simon Mo authored May 04, 2024
  
04 May, 2024 5 commits

[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) · 43029870
DearPlanet authored May 05, 2024

[CI] check size of the wheels (#4319) · 021b1a2a
Simon Mo authored May 04, 2024


[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527) · 2a052011

Michael Goin authored May 04, 2024

Follow-on to #4332 to enable FP8 checkpoint loading for Mixtral; supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

- Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Supports static or dynamic activation quantization with static weight quantization (all per tensor)
- Supports different scales for each expert weight
- Supports Fp8 in QKV layer

Notes:

- The Expert Gate/Router always runs at half / full precision for now.
- If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using `layer.weight_scale.max()` so we can have a single gemm for performance.
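
The re-quantization note above can be sketched in plain Python. This is an illustrative approximation, not the code from the PR: the function name and the sample values are hypothetical, weights are shown as lists of already-quantized numbers, and rounding to the FP8 grid is left out. The idea shown is that each shard is dequantized with its own scale and then re-expressed against the largest scale, so that one scale covers the fused QKV weight.

```python
def requantize_to_shared_scale(quantized_shards, shard_scales):
    """Re-express each shard's quantized values against the largest shard scale.

    Hypothetical helper: real code would work on tensors and round the
    result back to FP8, which this sketch skips.
    """
    shared_scale = max(shard_scales)
    requantized = []
    for q_values, scale in zip(quantized_shards, shard_scales):
        # Dequantize with the shard's own scale, then quantize with the shared one.
        requantized.append([q * scale / shared_scale for q in q_values])
    return requantized, shared_scale


# Made-up quantized Q, K, V shards with one per-tensor scale each.
qkv_shards = [[100.0, -448.0], [448.0, 16.0], [-32.0, 8.0]]
qkv_scales = [0.5, 0.25, 1.0]

requantized, shared_scale = requantize_to_shared_scale(qkv_shards, qkv_scales)
print("shared scale:", shared_scale)
print("requantized: ", requantized)
```

Because the shared scale is the maximum of the per-shard scales, the re-expressed values can only shrink in magnitude, so they stay within the range the original quantized values occupied.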


[Doc] Chunked Prefill Documentation (#4580) · 36fb68f9
SangBin Cho authored May 04, 2024

[Misc][Refactor] Introduce ExecuteModelData (#4540) · bc8ad684
Cody Yu authored May 03, 2024


03 May, 2024 6 commits
- [Misc] add installation time env vars (#4574) · 344bf7cd
  youkaichao authored May 03, 2024
  
- [Speculative decoding] Support target-model logprobs (#4378) · ab502751
  Cade Daniel authored May 03, 2024
  
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
- Fix/async chat serving (#2727) · f8e7adda
  Sebastian Schoennenbeck authored May 03, 2024
  
- [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None (#4586) · 7e65477e
  Michael Goin authored May 03, 2024
  
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  