Commits · ff7ec82c4dd6170ea8fedbd4d974c0a670e84c97 · OpenDAS / vllm_cscc

19 Aug, 2024 1 commit
- [Core] Optimize SPMD architecture with delta + serialization optimization (#7109) · ff7ec82c
  SangBin Cho authored Aug 18, 2024
  
  ff7ec82c
16 Aug, 2024 1 commit
- [Core] Fix tracking of model forward time in case of PP>1 (#7440) · 93478b63
  Mahesh Keralapura authored Aug 16, 2024
```
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)
```
  93478b63
14 Aug, 2024 1 commit
- [core] [3/N] multi-step args and sequence.py (#7452) · 2ecf7b17
  William Lin authored Aug 14, 2024
  
  2ecf7b17
09 Aug, 2024 2 commits
- [Core] Add span metrics for model_forward, scheduler and sampler time (#7089) · 933790c2
  Mahesh Keralapura authored Aug 09, 2024
  
  933790c2
- [Performance] Optimize e2e overheads: Reduce python allocations (#7162) · e02ac556
  Alexander Matveev authored Aug 09, 2024
  
  e02ac556
08 Aug, 2024 1 commit
- [Misc] Fix typos in scheduler.py (#7285) · 74670964
  Rui Qiao authored Aug 07, 2024
```
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
```
  74670964
06 Aug, 2024 1 commit
- [Core] Subclass ModelRunner to support cross-attention & encoder sequences... · fd95e026
  afeldman-nm authored Aug 06, 2024
  
  [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
  Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
  Co-authored-by: Nick Hill <nickhill@us.ibm.com>
  
  fd95e026
01 Aug, 2024 1 commit
- [core][scheduler] simplify and improve scheduler (#6867) · c8a7e932
  youkaichao authored Jul 31, 2024
  
  c8a7e932
30 Jul, 2024 2 commits
- [core][misc] improve free_finished_seq_groups (#6865) · 6ca8031e
  youkaichao authored Jul 30, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  6ca8031e
- [BugFix] Fix use of per-request seed with pipeline parallel (#6698) · 5cf9254a
  Nick Hill authored Jul 30, 2024
  
  5cf9254a
16 Jul, 2024 1 commit
- [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425) · 9ad32dac
  Mor Zusman authored Jul 16, 2024
```
Co-authored-by: Mor Zusman <morz@ai21.com>
```
  9ad32dac
09 Jul, 2024 1 commit
- [CORE] Adding support for insertion of soft-tuned prompts (#4645) · 4d6ada94
  Swapnil Parekh authored Jul 09, 2024
  
  Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
  Co-authored-by: Joe G <joseph.granados@h2o.ai>
  Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
  
  4d6ada94
02 Jul, 2024 2 commits
- [Model] Jamba support (#4115) · 9d6a8daa
  Mor Zusman authored Jul 03, 2024
  
  Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
  Co-authored-by: Erez Schwartz <erezs@ai21.com>
  Co-authored-by: Mor Zusman <morz@ai21.com>
  Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
  Co-authored-by: Tomer Asida <tomera@ai21.com>
  Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
  Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
  
  9d6a8daa
- [Core] Pipeline Parallel Support (#4412) · c5832d2a
  Murali Andoorveedu authored Jul 02, 2024
```
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
```
  c5832d2a
12 Jun, 2024 1 commit
- [Bugfix] Fix typo in scheduler.py (requeset -> request) (#5470) · 94a07bbd
  Michael Goin authored Jun 12, 2024
  
  94a07bbd
09 Jun, 2024 1 commit
- [Bugfix] Fix KeyError: 1 When Using LoRA adapters (#5164) · 45f92c00
  Bla_ckB authored Jun 10, 2024
  
  45f92c00
07 Jun, 2024 1 commit
- Addition of lacked ignored_seq_groups in _schedule_chunked_prefill (#5296) · dc49fb89
  limingshu authored Jun 07, 2024
  
  dc49fb89
03 Jun, 2024 1 commit
- [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) · 10c38e3e
  Kaiyang Chen authored Jun 04, 2024
  
  10c38e3e
21 May, 2024 1 commit
- [Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897) · 65ae8c2c
  Antoni Baum authored May 20, 2024
  
  65ae8c2c
18 May, 2024 1 commit
- [Lora] Support long context lora (#4787) · 2e9a2227
  SangBin Cho authored May 18, 2024
  
  Currently the rotary embedding kernel must be called once per LoRA, which makes it hard to serve multiple long-context LoRAs. This commit adds a batched rotary embedding kernel and pipes it through.
  
  It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.
  
  Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
  
  2e9a2227
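
The batched idea described in the commit message above can be sketched in plain Python. This is an illustration only, not the kernel added in #4787: every function name here is hypothetical, linear position scaling is assumed, and the rotation uses a half-split layout. Each scaling factor gets its own cos-sin cache, the caches are concatenated, and each token carries an offset that selects its cache, so a single pass can serve tokens from LoRAs with different scaling factors.

```python
import math


def build_cos_sin_cache(rotary_dim, max_len, base, scaling_factor):
    """One (cos, sin) row per position, assuming linear position scaling."""
    inv_freq = [1.0 / (base ** (2 * i / rotary_dim)) for i in range(rotary_dim // 2)]
    cache = []
    for pos in range(int(max_len * scaling_factor)):
        angles = [(pos / scaling_factor) * f for f in inv_freq]
        cache.append(([math.cos(a) for a in angles], [math.sin(a) for a in angles]))
    return cache


def build_batched_cache(rotary_dim, max_len, base, scaling_factors):
    """Concatenate the per-factor caches and record where each one starts."""
    cache, offsets = [], {}
    for factor in scaling_factors:
        offsets[factor] = len(cache)
        cache.extend(build_cos_sin_cache(rotary_dim, max_len, base, factor))
    return cache, offsets


def batched_rotary(vectors, positions, token_offsets, cache):
    """Rotate every token in one pass; position + offset selects the cache row."""
    out = []
    for vec, pos, off in zip(vectors, positions, token_offsets):
        cos, sin = cache[pos + off]
        half = len(vec) // 2
        x1, x2 = vec[:half], vec[half:]
        first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
        second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
        out.append(first + second)
    return out


# Two tokens at the same position, each belonging to a different
# (hypothetical) LoRA scaling factor, handled by a single call.
cache, offsets = build_batched_cache(
    rotary_dim=4, max_len=8, base=10000.0, scaling_factors=[1.0, 4.0]
)
vectors = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
positions = [4, 4]
token_offsets = [offsets[1.0], offsets[4.0]]
rotated = batched_rotary(vectors, positions, token_offsets, cache)
```

Because the offset is applied per token, the two tokens above read different cache rows even though they share a position, which is what removes the need for one call per LoRA.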

13 May, 2024 1 commit
- [Scheduler] Warning upon preemption and Swapping (#4647) · e7c46b95
  SangBin Cho authored May 13, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  e7c46b95
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
08 May, 2024 1 commit
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) · 20cfcdec
  youkaichao authored May 08, 2024
  
  20cfcdec
07 May, 2024 2 commits
- [Core][Optimization] change copy-on-write from dict[int, list] to list (#4648) · 469f85c7
  youkaichao authored May 07, 2024
  
  469f85c7
- [Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
  youkaichao authored May 06, 2024
  
  63575bc2
04 May, 2024 1 commit
- [Misc][Refactor] Introduce ExecuteModelData (#4540) · bc8ad684
  Cody Yu authored May 03, 2024
  
  bc8ad684
02 May, 2024 2 commits
- [Core] Ignore infeasible swap requests. (#4557) · 0f8a9140
  SangBin Cho authored May 03, 2024
  
  0f8a9140
- [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not... · 0d62fe58
  SangBin Cho authored May 02, 2024
```
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451)
```
  0d62fe58
28 Apr, 2024 1 commit
- Add more Prometheus metrics (#2764) · bf480c53
  Ronen Schaffer authored Apr 29, 2024
  
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
  Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
  
  bf480c53
26 Apr, 2024 2 commits
- [Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309) · 603ad848
  SangBin Cho authored Apr 26, 2024
  
  603ad848
- [CI] Disable non-lazy string operation on logging (#4326) · a88081bf
  SangBin Cho authored Apr 26, 2024
```
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
```
  a88081bf
23 Apr, 2024 1 commit
- [Core] Scheduling optimization 2 (#4280) · 050f285f
  SangBin Cho authored Apr 23, 2024
  
  050f285f
22 Apr, 2024 1 commit
- [Core] Scheduler perf fix (#4270) · ad8d696a
  SangBin Cho authored Apr 23, 2024
  
  ad8d696a
16 Apr, 2024 1 commit
- [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) · e95cd879
  Cade Daniel authored Apr 16, 2024
  
  e95cd879
12 Apr, 2024 2 commits
- [mypy] Add mypy type annotation part 1 (#4006) · 09473ee4
  SangBin Cho authored Apr 13, 2024
  
  09473ee4
- [Misc] Fix typo in scheduler.py (#4022) · d4ec9ffb
  Zhuohan Li authored Apr 12, 2024
  
  d4ec9ffb
11 Apr, 2024 1 commit
- [Core][5/N] Fully working chunked prefill e2e (#3884) · 67b4221a
  SangBin Cho authored Apr 11, 2024
  
  67b4221a
05 Apr, 2024 1 commit
- [Chunked Prefill][4/n] Chunked prefill scheduler. (#3853) · 18de8834
  SangBin Cho authored Apr 06, 2024
  
  18de8834
03 Apr, 2024 1 commit
- [3/N] Refactor scheduler for chunked prefill scheduling (#3550) · 3dcb3e8b
  SangBin Cho authored Apr 04, 2024
  
  3dcb3e8b
01 Apr, 2024 1 commit
- [Speculative decoding 4/9] Lookahead scheduling for speculative decoding (#3250) · 93deb0b3
  Cade Daniel authored Apr 01, 2024
  
  93deb0b3