Commits · fb087af52e3834d98250a455355a3ef329663168 · OpenDAS / vllm_cscc

02 May, 2024 7 commits
- [mypy][7/N] Cover all directories (#4555) · fb087af5
  SangBin Cho authored May 03, 2024
- [Kernel] Support running GPTQ 8-bit models in Marlin (#4533) · 7038e8b8
  alexm-nm authored May 02, 2024
- [Core][Distributed] enable multiple tp group (#4512) · 2a85f930
  youkaichao authored May 01, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
- [mypy][6/N] Fix all the core subdirectory typing (#4450) · cf8cac8c
  SangBin Cho authored May 02, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
- [CI]Add regression tests to ensure the async engine generates metrics (#4524) · 5e401bce
  Ronen Schaffer authored May 02, 2024
- [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451) · 0d62fe58
  SangBin Cho authored May 02, 2024
- [MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273) · b8afa8b9
  Danny Guinther authored May 01, 2024
01 May, 2024 21 commits
- [Misc] Fix expert_ids shape in MoE (#4517) · 826b82a2
  Woosuk Kwon authored May 01, 2024
- [Misc] Remove Mixtral device="cuda" declarations (#4543) · c9d852d6
  Philipp Moritz authored May 01, 2024
```
Remove the device="cuda" declarations in mixtral as promised in #4343
```
- [Core][Distributed] fix pynccl del error (#4508) · 6ef09b08
  youkaichao authored May 01, 2024
- [Bugfix][Core] Fix and refactor logging stats (#4336) · 3a922c1e
  Roy authored May 02, 2024
- [Bugfix] Add validation for seed (#4529) · c47ba4aa
  sasha0552 authored May 01, 2024
- [Kernel] Update fused_moe tuning script for FP8 (#4457) · 24bb4fe4
  Philipp Moritz authored May 01, 2024
```
This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. Note that in the configuration I removed num_warps and num_stages for small batch sizes: that improved performance and brought the benchmarks on par with the previous numbers in that regime, so this PR is a strict improvement over the status quo.

All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens.

Before this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.1 ms ITL, 0.52s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 14.0 ms ITL, 0.70s e2e latency
qps = 10: 15.7 ms ITL, 0.79s e2e latency

After this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.2 ms ITL, 0.53s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 11.9 ms ITL, 0.59s e2e latency
qps = 10: 12.1 ms ITL, 0.61s e2e latency
```
- [Core] Add `multiproc_worker_utils` for multiprocessing-based workers (#4357) · a657bfc4
  Nick Hill authored May 01, 2024
- [Core] Enable prefix caching with block manager v2 enabled (#4142) · 24750f4c
  leiwen83 authored May 02, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
```
- [Speculative decoding] Add ngram prompt lookup decoding (#4237) · b38e42fb
  leiwen83 authored May 02, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
- [CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation (#4534) · 8b798eec
  Travis Johnson authored May 01, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
```
- [Bugfix] Use random seed if seed is -1 (#4531) · 69909126
  sasha0552 authored May 01, 2024
- [Doc] update(example model): for OpenAI compatible serving (#4503) · e491c7e0
  Frαnçois authored May 01, 2024
- [Bugfix] Fix 307 Redirect for `/metrics` (#4523) · 4dc8026d
  Robert Shaw authored May 01, 2024
- [Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. (#4173) · a88bb9b0
  AnyISalIn authored May 02, 2024
```
Signed-off-by: AnyISalIn <anyisalin@gmail.com>
```
- [Test] Add ignore_eos test (#4519) · 6f1df804
  SangBin Cho authored May 01, 2024
- [Misc]Add customized information for models (#4132) · d6f4bd7c
  Jee Li authored May 01, 2024
- Allow user to define whitespace pattern for outlines (#4305) · c3845d82
  Robert Caulk authored May 01, 2024
- [Misc] fix typo in block manager (#4453) · a822eb34
  Pastel！ authored May 01, 2024
- [Misc][Typo] type annotation fix (#4495) · f458112e
  harrywu authored May 01, 2024
- [Core] Centralize GPU Worker construction (#4419) · 2e240c69
  Nick Hill authored Apr 30, 2024
- Unable to find Punica extension issue during source code installation (#4494) · ee37328d
  fuchen.ljl authored May 01, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
30 Apr, 2024 10 commits
- fix_tokenizer_snapshot_download_bug (#4493) · 6ad58f42
  fuchen.ljl authored May 01, 2024
- [Bugfix][Minor] Make ignore_eos effective (#4468) · dd1a50a8
  Li, Jiang authored May 01, 2024
- [Frontend] [Core] Tensorizer: support dynamic `num_readers`, update version (#4467) · 715c2d85
  Alpay Ariyak authored Apr 30, 2024
- [Frontend] Support complex message content for chat completions endpoint (#3467) · a4941404
  Florian Greinacher authored May 01, 2024
```
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
```
- [Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332) · 111815d4
  Robert Shaw authored Apr 30, 2024
```
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
- [Doc] add visualization for multi-stage dockerfile (#4456) · b31a1fb6
  Prashant Gupta authored Apr 30, 2024
```
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
- [BugFix] fix num_lookahead_slots missing in async executor (#4165) · 4bb53e2d
  leiwen83 authored May 01, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
- [Core]Refactor gptq_marlin ops (#4466) · 26f2fb51
  Kunshang Ji authored Apr 30, 2024
- [Bugfix][Kernel] Fix compute_type for MoE kernel (#4463) · fa322078
  Woosuk Kwon authored Apr 29, 2024
- [Misc] Upgrade to `torch==2.3.0` (#4454) · d627a3d8
  Michael Goin authored Apr 29, 2024
29 Apr, 2024 2 commits
- [Core][Distributed] use cpu group to broadcast metadata in cpu (#4444) · f4f921b7
  youkaichao authored Apr 29, 2024
- [CI] hotfix: soft fail neuron test (#4458) · ac5ccf01
  Simon Mo authored Apr 29, 2024