Commits · 7342a7d7f87ea3f4e03ec0775093a0f1ce56e2a1 · OpenDAS / vllm_cscc

11 Oct, 2024 2 commits
- [Model] Support Mamba (#6484) · 7342a7d7
  Tyler Michael Smith authored Oct 11, 2024
- [misc] hide best_of from engine (#9261) · cbc2ef55
  youkaichao authored Oct 10, 2024
  ```
  Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
  ```
08 Oct, 2024 1 commit
- [Core][Frontend] Add Support for Inference Time mm_processor_kwargs (#9131) · a3691b6b
  Alex Brooks authored Oct 08, 2024
  ```
  Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
  ```
07 Oct, 2024 2 commits
- [misc] fix comment and variable name (#9139) · fa45513a
  youkaichao authored Oct 07, 2024
- [core] remove beam search from the core (#9105) · 18b296fd
  youkaichao authored Oct 06, 2024
02 Oct, 2024 1 commit
- [Core] Combined support for multi-step scheduling, chunked prefill & prefix caching (#8804) · 563649aa
  afeldman-nm authored Oct 02, 2024
  ```
  Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
  Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
  ```
27 Sep, 2024 1 commit
- [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378) · c2ec430a
  Varun Sundar Rabindranath authored Sep 27, 2024
  ```
  Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
  ```
25 Sep, 2024 2 commits
- [Misc] Fix minor typo in scheduler (#8765) · 8fae5ed7
  Woo-Yeon Lee authored Sep 25, 2024
- [Core] Adding Priority Scheduling (#5958) · 6da1ab6b
  Archit Patke authored Sep 24, 2024
08 Sep, 2024 1 commit
- [Bugfix] Fix async postprocessor in case of preemption (#8267) · 4ef41b84
  Alexander Matveev authored Sep 08, 2024
02 Sep, 2024 1 commit
- improve chunked prefill performance · 6e36f4fa
  wang.yuqi authored Sep 03, 2024
  ```
  [Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874)
  ```
29 Aug, 2024 1 commit
- [Core] Combine async postprocessor and multi-step (#7921) · 3f60f224
  Alexander Matveev authored Aug 29, 2024
28 Aug, 2024 3 commits
- [Performance] Enable chunked prefill and prefix caching together (#7753) · e3580537
  Cody Yu authored Aug 28, 2024
- [Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) (#7911) · f508e03e
  Alexander Matveev authored Aug 28, 2024
- [hardware][rocm] allow rocm to override default env var (#7926) · bc6e42a9
  youkaichao authored Aug 27, 2024
27 Aug, 2024 2 commits
- [mypy] Enable mypy type checking for `vllm/core` (#7229) · 9c71c97a
  Jonathan Berkhahn authored Aug 27, 2024
- [Core] Asynchronous Output Processor (#7049) · 2eedede8
  Megha Agarwal authored Aug 26, 2024
  ```
  Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
  ```
19 Aug, 2024 2 commits
- [MISC] Add prefix cache hit rate to metrics (#7606) · 3ac50b47
  Cody Yu authored Aug 19, 2024
- [Core] Optimize SPMD architecture with delta + serialization optimization (#7109) · ff7ec82c
  SangBin Cho authored Aug 18, 2024
16 Aug, 2024 1 commit
- [Core] Fix tracking of model forward time in case of PP>1 (#7440) · 93478b63
  Mahesh Keralapura authored Aug 16, 2024
  ```
  [Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)
  ```
14 Aug, 2024 1 commit
- [core] [3/N] multi-step args and sequence.py (#7452) · 2ecf7b17
  William Lin authored Aug 14, 2024
09 Aug, 2024 2 commits
- [Core] Add span metrics for model_forward, scheduler and sampler time (#7089) · 933790c2
  Mahesh Keralapura authored Aug 09, 2024
- [Performance] Optimize e2e overheads: Reduce python allocations (#7162) · e02ac556
  Alexander Matveev authored Aug 09, 2024
08 Aug, 2024 1 commit
- [Misc] Fix typos in scheduler.py (#7285) · 74670964
  Rui Qiao authored Aug 07, 2024
  ```
  Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
  ```
06 Aug, 2024 1 commit
- [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942) · fd95e026
  afeldman-nm authored Aug 06, 2024
  ```
  Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
  Co-authored-by: Nick Hill <nickhill@us.ibm.com>
  ```
01 Aug, 2024 1 commit
- [core][scheduler] simplify and improve scheduler (#6867) · c8a7e932
  youkaichao authored Jul 31, 2024
30 Jul, 2024 2 commits
- [core][misc] improve free_finished_seq_groups (#6865) · 6ca8031e
  youkaichao authored Jul 30, 2024
  ```
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
  ```
- [BugFix] Fix use of per-request seed with pipeline parallel (#6698) · 5cf9254a
  Nick Hill authored Jul 30, 2024
16 Jul, 2024 1 commit
- [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425) · 9ad32dac
  Mor Zusman authored Jul 16, 2024
  ```
  Co-authored-by: Mor Zusman <morz@ai21.com>
  ```
09 Jul, 2024 1 commit
- [CORE] Adding support for insertion of soft-tuned prompts (#4645) · 4d6ada94
  Swapnil Parekh authored Jul 09, 2024
  ```
  Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
  Co-authored-by: Joe G <joseph.granados@h2o.ai>
  Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
  ```
02 Jul, 2024 2 commits
- [Model] Jamba support (#4115) · 9d6a8daa
  Mor Zusman authored Jul 03, 2024
  ```
  Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
  Co-authored-by: Erez Schwartz <erezs@ai21.com>
  Co-authored-by: Mor Zusman <morz@ai21.com>
  Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
  Co-authored-by: Tomer Asida <tomera@ai21.com>
  Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
  Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
  ```
- [Core] Pipeline Parallel Support (#4412) · c5832d2a
  Murali Andoorveedu authored Jul 02, 2024
  ```
  Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
  ```
12 Jun, 2024 1 commit
- [Bugfix] Fix typo in scheduler.py (requeset -> request) (#5470) · 94a07bbd
  Michael Goin authored Jun 12, 2024
09 Jun, 2024 1 commit
- [Bugfix] Fix KeyError: 1 When Using LoRA adapters (#5164) · 45f92c00
  Bla_ckB authored Jun 10, 2024
07 Jun, 2024 1 commit
- Addition of lacked ignored_seq_groups in _schedule_chunked_prefill (#5296) · dc49fb89
  limingshu authored Jun 07, 2024
03 Jun, 2024 1 commit
- [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) · 10c38e3e
  Kaiyang Chen authored Jun 04, 2024
21 May, 2024 1 commit
- [Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897) · 65ae8c2c
  Antoni Baum authored May 20, 2024
18 May, 2024 1 commit
- [Lora] Support long context lora (#4787) · 2e9a2227
  SangBin Cho authored May 18, 2024
  ```
  Currently we need to call the rotary embedding kernel separately for each LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through.

  It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

  Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
  ```
13 May, 2024 1 commit
- [Scheduler] Warning upon preemption and Swapping (#4647) · e7c46b95
  SangBin Cho authored May 13, 2024
  ```
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
  ```
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024