Commits · 0650e5935b0f6af35fb2acf71769982c47b804d7 · OpenDAS / vllm_cscc

05 May, 2024 3 commits
- Disable cuda version check in vllm-openai image (#4530) · 0650e593
  zhaoyang-star authored May 06, 2024
  
- [CI] Reduce wheel size by not shipping debug symbols (#4602) · c7f2cf2b
  Simon Mo authored May 04, 2024
  
- bump version to v0.4.2 (#4600) · 8d8357c8
  Simon Mo authored May 04, 2024
  
04 May, 2024 5 commits

[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) · 43029870
DearPlanet authored May 05, 2024

[CI] check size of the wheels (#4319) · 021b1a2a
Simon Mo authored May 04, 2024


[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with... · 2a052011

Michael Goin authored May 04, 2024

[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

Follow-up to #4332 that enables FP8 checkpoint loading for Mixtral; supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

- Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Supports static or dynamic activation quantization with static weight quantization (all per tensor)
- Supports different scales for each expert weight
- Supports FP8 in the QKV layer

Notes:

- The Expert Gate/Router always runs at half / full precision for now.
- If the QKV layer has separate Q, K, and V weights with different weight scales, they are re-quantized using layer.weight_scale.max() so that a single GEMM can be used for performance.
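
The re-quantization note can be illustrated with a minimal sketch. This is not the code from the PR: the function names and sample weights are hypothetical, and a plain round-and-clip integer grid stands in for real FP8 rounding (which is non-uniform). It only shows the idea of dequantizing each shard with its own per-tensor scale and re-quantizing everything with the largest scale, so the merged weight carries one scale.

```python
FP8_MAX = 448.0  # largest finite FP8 E4M3 value; used here only as a clipping bound


def quantize(values, scale):
    """Toy per-tensor quantization (assumption: uniform grid, not true FP8)."""
    return [max(-FP8_MAX, min(FP8_MAX, round(v / scale))) for v in values]


def dequantize(qvalues, scale):
    """Map quantized values back to real values with a per-tensor scale."""
    return [q * scale for q in qvalues]


def requantize_to_shared_scale(shards, scales):
    """Re-express every quantized shard using the maximum of the scales."""
    shared = max(scales)
    merged = []
    for qvalues, scale in zip(shards, scales):
        merged.extend(quantize(dequantize(qvalues, scale), shared))
    return merged, shared


# Hypothetical Q, K, V weight shards, each with its own per-tensor scale.
q_w, k_w, v_w = [0.5, -1.0], [2.0, 4.0], [-0.3, 0.7]
scales = [0.01, 0.02, 0.005]
shards = [quantize(w, s) for w, s in zip((q_w, k_w, v_w), scales)]

merged, shared_scale = requantize_to_shared_scale(shards, scales)
restored = dequantize(merged, shared_scale)
```

Using the maximum scale keeps the shard with the widest range from clipping; the trade-off is that shards quantized with a smaller scale lose some resolution after the merge.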


[Doc] Chunked Prefill Documentation (#4580) · 36fb68f9
SangBin Cho authored May 04, 2024

[Misc][Refactor] Introduce ExecuteModelData (#4540) · bc8ad684
Cody Yu authored May 03, 2024


03 May, 2024 10 commits
- [Misc] add installation time env vars (#4574) · 344bf7cd
  youkaichao authored May 03, 2024
  
- [Speculative decoding] Support target-model logprobs (#4378) · ab502751
  Cade Daniel authored May 03, 2024
  
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
- Fix/async chat serving (#2727) · f8e7adda
  Sebastian Schoennenbeck authored May 03, 2024
  
- [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None (#4586) · 7e65477e
  Michael Goin authored May 03, 2024
  
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
- [Doc] add env vars to the doc (#4572) · 2d7bce9c
  youkaichao authored May 02, 2024
  
- [Misc] remove chunk detected debug logs (#4571) · ce3f1eed
  DefTruth authored May 03, 2024
  
- [BugFix] Prevent the task of `_force_log` from being garbage collected (#4567) · 808632d3
  Yang, Bo authored May 02, 2024
  
- [Core][Distributed] enable allreduce for multiple tp groups (#4566) · 344a5d0c
  youkaichao authored May 02, 2024
  
02 May, 2024 13 commits
- [Core] Ignore infeasible swap requests. (#4557) · 0f8a9140
  SangBin Cho authored May 03, 2024
  
- [CI/Build] AMD CI pipeline with extended set of tests. (#4267) · 9b5c9f94
  Alexei-V-Ivanov-AMD authored May 02, 2024
```
Co-authored-by: simon-mo <simon.mo@hey.com>
```
- [kernel] fix sliding window in prefix prefill Triton kernel (#4405) · 32881f3f
  Michał Moskal authored May 02, 2024
```
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
```
- [Misc] centralize all usage of environment variables (#4548) · 5b8a7c1c
  youkaichao authored May 02, 2024
  
- [BugFix] Include target-device specific requirements.txt in sdist (#4559) · 1ff0c73a
  Mark McLoughlin authored May 02, 2024
  
- [Misc] Exclude the `tests` directory from being packaged (#4552) · 5ad60b0c
  Hu Dong authored May 03, 2024
  
- [mypy][7/N] Cover all directories (#4555) · fb087af5
  SangBin Cho authored May 03, 2024
  
- [Kernel] Support running GPTQ 8-bit models in Marlin (#4533) · 7038e8b8
  alexm-nm authored May 02, 2024
  
- [Core][Distributed] enable multiple tp group (#4512) · 2a85f930
  youkaichao authored May 01, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
- [mypy][6/N] Fix all the core subdirectory typing (#4450) · cf8cac8c
  SangBin Cho authored May 02, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
- [CI]Add regression tests to ensure the async engine generates metrics (#4524) · 5e401bce
  Ronen Schaffer authored May 02, 2024
  
- [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not... · 0d62fe58
  SangBin Cho authored May 02, 2024
```
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451)
```
- [MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273) · b8afa8b9
  Danny Guinther authored May 01, 2024
  
01 May, 2024 9 commits

[Misc] Fix expert_ids shape in MoE (#4517) · 826b82a2
Woosuk Kwon authored May 01, 2024

[Misc] Remove Mixtral device="cuda" declarations (#4543) · c9d852d6
Philipp Moritz authored May 01, 2024
```
Remove the device="cuda" declarations in mixtral as promised in #4343
```
[Core][Distributed] fix pynccl del error (#4508) · 6ef09b08
youkaichao authored May 01, 2024

[Bugfix][Core] Fix and refactor logging stats (#4336) · 3a922c1e
Roy authored May 02, 2024

[Bugfix] Add validation for seed (#4529) · c47ba4aa
sasha0552 authored May 01, 2024


[Kernel] Update fused_moe tuning script for FP8 (#4457) · 24bb4fe4

Philipp Moritz authored May 01, 2024

This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. For small batch sizes, I removed num_warps and num_stages from the configuration, since that improved performance and brought the benchmarks back on par with the earlier numbers in that regime. This makes sure the change is a strict improvement over the status quo.

All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens.

Before this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.1 ms ITL, 0.52s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 14.0 ms ITL, 0.70s e2e latency
qps = 10: 15.7 ms ITL, 0.79s e2e latency

After this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.2 ms ITL, 0.53s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 11.9 ms ITL, 0.59s e2e latency
qps = 10: 12.1 ms ITL, 0.61s e2e latency
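
To make the before/after comparison easier to read, the short sketch below computes the relative change in inter-token latency (ITL) per QPS level. The numbers are copied from the two lists above; the dictionary layout and the helper name `relative_change` are illustrative assumptions, not part of the PR.

```python
# (ITL in ms, end-to-end latency in s) keyed by QPS, taken from the lists above.
before = {1: (9.8, 0.49), 2: (9.7, 0.49), 4: (10.1, 0.52),
          6: (11.9, 0.59), 8: (14.0, 0.70), 10: (15.7, 0.79)}
after = {1: (9.8, 0.49), 2: (9.7, 0.49), 4: (10.2, 0.53),
         6: (11.9, 0.59), 8: (11.9, 0.59), 10: (12.1, 0.61)}


def relative_change(old, new):
    """Percentage change from old to new; negative means lower latency."""
    return (new - old) / old * 100.0


# ITL is the first element of each tuple.
itl_change = {qps: relative_change(before[qps][0], after[qps][0]) for qps in before}

for qps, pct in itl_change.items():
    print(f"qps = {qps}: ITL change {pct:+.1f}%")
```

On these figures the change is concentrated at the higher load levels: ITL is roughly 15% lower at qps = 8 and roughly 23% lower at qps = 10, while the lower QPS levels are essentially unchanged.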


[Core] Add `multiproc_worker_utils` for multiprocessing-based workers (#4357) · a657bfc4
Nick Hill authored May 01, 2024

[Core] Enable prefix caching with block manager v2 enabled (#4142) · 24750f4c
leiwen83 authored May 02, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
```
[Speculative decoding] Add ngram prompt lookup decoding (#4237) · b38e42fb
leiwen83 authored May 02, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```