Commits · d7740ea4dcee4ab75d7d6eef723f33cae957b288 · OpenDAS / vllm_cscc

08 May, 2024 1 commit
- [Core][Distributed] support cpu&device in broadcast tensor dict (#4660) · cc466a32
  youkaichao authored May 07, 2024
```
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660)
```
  cc466a32
07 May, 2024 3 commits
- [Bug fix][Core] fixup ngram not setup correctly (#4551) · 8344f774
  leiwen83 authored May 08, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  8344f774
- [Core][Optimization] change copy-on-write from dict[int, list] to list (#4648) · 469f85c7
  youkaichao authored May 07, 2024
  
  469f85c7
- [Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
  youkaichao authored May 06, 2024
  
  63575bc2
04 May, 2024 3 commits

- [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) · 43029870
  DearPlanet authored May 05, 2024
  
  43029870

- [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with... · 2a052011
  Michael Goin authored May 04, 2024

  [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

  Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436.

  This PR enables the following checkpoint loading features for Mixtral:

  - Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
  - Supports static or dynamic activation quantization with static weight quantization (all per tensor)
  - Supports different scales for each expert weight
  - Supports Fp8 in QKV layer

  Notes:

  - The Expert Gate/Router always runs at half / full precision for now.
  - If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.

  2a052011
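
The re-quantization note in the commit message above can be illustrated with a small sketch. This is not vLLM's code: the function name `requantize_to_shared_scale`, the sample values, and the use of plain Python floats are all assumptions made for illustration, and the actual cast to FP8 (with its rounding) is deliberately left out so that only the scale arithmetic is shown.

```python
from typing import List, Tuple


def requantize_to_shared_scale(
    shards: List[List[float]], scales: List[float]
) -> Tuple[List[List[float]], float]:
    """Re-express per-shard quantized values against one shared scale.

    Each shard holds values q such that real_value = q * scale.
    Assumption: the FP8 cast/rounding step is omitted; only the scale
    arithmetic is illustrated here.
    """
    # The shared scale is the largest per-shard scale, mirroring the
    # weight_scale.max() mentioned in the commit message.
    shared_scale = max(scales)
    requantized = []
    for shard, scale in zip(shards, scales):
        # Dequantize with the shard's own scale, then divide by the shared one.
        requantized.append([q * scale / shared_scale for q in shard])
    return requantized, shared_scale


# Hypothetical quantized Q, K, V shards with different per-tensor scales.
q_shard, k_shard, v_shard = [100.0, -50.0], [200.0, 10.0], [-30.0, 60.0]
scales = [0.5, 0.25, 1.0]

merged, shared = requantize_to_shared_scale([q_shard, k_shard, v_shard], scales)
print(shared)  # 1.0
print(merged)  # [[50.0, -25.0], [50.0, 2.5], [-30.0, 60.0]]
```

Because every shard is divided by the largest scale, no re-expressed value can grow in magnitude, so the merged weights stay within the range the original shards occupied and can share one scale in a single matrix multiply.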

- [Misc][Refactor] Introduce ExecuteModelData (#4540) · bc8ad684
  Cody Yu authored May 03, 2024
  
  bc8ad684

03 May, 2024 5 commits
- [Speculative decoding] Support target-model logprobs (#4378) · ab502751
  Cade Daniel authored May 03, 2024
  
  ab502751
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
  43c413ec
- Fix/async chat serving (#2727) · f8e7adda
  Sebastian Schoennenbeck authored May 03, 2024
  
  f8e7adda
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
- [Core][Distributed] enable allreduce for multiple tp groups (#4566) · 344a5d0c
  youkaichao authored May 02, 2024
  
  344a5d0c
02 May, 2024 7 commits
- [Core] Ignore infeasible swap requests. (#4557) · 0f8a9140
  SangBin Cho authored May 03, 2024
  
  0f8a9140
- [kernel] fix sliding window in prefix prefill Triton kernel (#4405) · 32881f3f
  Michał Moskal authored May 02, 2024
```
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
```
  32881f3f
- [Kernel] Support running GPTQ 8-bit models in Marlin (#4533) · 7038e8b8
  alexm-nm authored May 02, 2024
  
  7038e8b8
- [Core][Distributed] enable multiple tp group (#4512) · 2a85f930
  youkaichao authored May 01, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
  2a85f930
- [CI]Add regression tests to ensure the async engine generates metrics (#4524) · 5e401bce
  Ronen Schaffer authored May 02, 2024
  
  5e401bce
- [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not... · 0d62fe58
  SangBin Cho authored May 02, 2024
```
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451)
```
  0d62fe58
- [MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273) · b8afa8b9
  Danny Guinther authored May 01, 2024
  
  b8afa8b9
01 May, 2024 7 commits
- [Bugfix] Add validation for seed (#4529) · c47ba4aa
  sasha0552 authored May 01, 2024
  
  c47ba4aa
- [Core] Add `multiproc_worker_utils` for multiprocessing-based workers (#4357) · a657bfc4
  Nick Hill authored May 01, 2024
  
  a657bfc4
- [Core] Enable prefix caching with block manager v2 enabled (#4142) · 24750f4c
  leiwen83 authored May 02, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
```
  24750f4c
- [Speculative decoding] Add ngram prompt lookup decoding (#4237) · b38e42fb
  leiwen83 authored May 02, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
  b38e42fb
- [Test] Add ignore_eos test (#4519) · 6f1df804
  SangBin Cho authored May 01, 2024
  
  6f1df804
- [Misc]Add customized information for models (#4132) · d6f4bd7c
  Jee Li authored May 01, 2024
  
  d6f4bd7c
- Allow user to define whitespace pattern for outlines (#4305) · c3845d82
  Robert Caulk authored May 01, 2024
  
  c3845d82
30 Apr, 2024 3 commits

- [Frontend] Support complex message content for chat completions endpoint (#3467) · a4941404
  Florian Greinacher authored May 01, 2024
```
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
```
  a4941404

- [Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332) · 111815d4
  Robert Shaw authored Apr 30, 2024

  Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
  Co-authored-by: mgoin <michael@neuralmagic.com>
  Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
  Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

  111815d4

- [BugFix] fix num_lookahead_slots missing in async executor (#4165) · 4bb53e2d
  leiwen83 authored May 01, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
  4bb53e2d

29 Apr, 2024 2 commits
- [Core][Distributed] use cpu group to broadcast metadata in cpu (#4444) · f4f921b7
  youkaichao authored Apr 29, 2024
  
  f4f921b7
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922) · 73c8d677
  Robert Shaw authored Apr 29, 2024
```
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  73c8d677
27 Apr, 2024 6 commits
- [Core] Support offline use of local cache for models (#4374) · d6e520e1
  Prashant Gupta authored Apr 27, 2024
```
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Travis Johnson <tjohnson31415@gmail.com>
```
  d6e520e1
- [BugFix] Fix `min_tokens` when `eos_token_id` is None (#4389) · 81661da7
  Nick Hill authored Apr 27, 2024
```
Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com>
```
  81661da7
- [Bugfix] Abort requests when the connection to /v1/completions is interrupted (#4363) · dfea1731
  Ruoyu Qin authored Apr 28, 2024
  
  dfea1731
- [Bugfix][Core] Fix get decoding config from ray (#4335) · 7134303c
  Roy authored Apr 27, 2024
  
  7134303c
- [Kernel] Full Tensor Parallelism for LoRA Layers (#3524) · eefeb164
  Austin Veselka authored Apr 27, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  eefeb164
- [Frontend][Bugfix] Disallow extra fields in OpenAI API (#4355) · 8947bc3c
  Cyrus Leung authored Apr 27, 2024
  
  8947bc3c
26 Apr, 2024 3 commits
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
  a62aaf1d
- [Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309) · 603ad848
  SangBin Cho authored Apr 26, 2024
  
  603ad848
- [Bugfix] Fix parameter name in `get_tokenizer` (#4107) · a74dee9b
  Cyrus Leung authored Apr 26, 2024
  
  a74dee9b