Commits · ff7ec82c4dd6170ea8fedbd4d974c0a670e84c97 · OpenDAS / vllm_cscc

19 Aug, 2024 1 commit
- [Core] Optimize SPMD architecture with delta + serialization optimization (#7109) · ff7ec82c
  SangBin Cho authored Aug 18, 2024
  
  ff7ec82c
16 Aug, 2024 1 commit
- [Core] Fix tracking of model forward time in case of PP>1 (#7440) · 93478b63
  Mahesh Keralapura authored Aug 16, 2024
```
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)
```
  93478b63
14 Aug, 2024 1 commit
- [core] [3/N] multi-step args and sequence.py (#7452) · 2ecf7b17
  William Lin authored Aug 14, 2024
  
  2ecf7b17
09 Aug, 2024 4 commits
- [Core] Fix edge case in chunked prefill + block manager v2 (#7380) · baa24025
  Cade Daniel authored Aug 09, 2024
  
  baa24025
- [Core] Add span metrics for model_forward, scheduler and sampler time (#7089) · 933790c2
  Mahesh Keralapura authored Aug 09, 2024
  
  933790c2
- [Performance] e2e overheads reduction: Small followup diff (#7364) · fc7b8d1e
  Alexander Matveev authored Aug 09, 2024
  
  fc7b8d1e
- [Performance] Optimize e2e overheads: Reduce python allocations (#7162) · e02ac556
  Alexander Matveev authored Aug 09, 2024
  
  e02ac556
08 Aug, 2024 2 commits
- [Bugfix][fast] Fix the get_num_blocks_touched logic (#6849) · 782e53ab
  Zach Zheng authored Aug 08, 2024
  
  782e53ab
- [Misc] Fix typos in scheduler.py (#7285) · 74670964
  Rui Qiao authored Aug 07, 2024
```
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
```
  74670964
06 Aug, 2024 2 commits

- [Core] Subclass ModelRunner to support cross-attention & encoder sequences... · fd95e026
  afeldman-nm authored Aug 06, 2024
  [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
  Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
  Co-authored-by: Nick Hill <nickhill@us.ibm.com>
  fd95e026
- [Core] Optimize evictor-v2 performance (#7193) · 660470e5
  xiaobochen123 authored Aug 07, 2024
  
  660470e5
02 Aug, 2024 1 commit
- [Performance] Optimize `get_seqs` (#7051) · 6ce01f30
  Woosuk Kwon authored Aug 01, 2024
  
  6ce01f30
01 Aug, 2024 1 commit
- [core][scheduler] simplify and improve scheduler (#6867) · c8a7e932
  youkaichao authored Jul 31, 2024
  
  c8a7e932
30 Jul, 2024 2 commits
- [core][misc] improve free_finished_seq_groups (#6865) · 6ca8031e
  youkaichao authored Jul 30, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  6ca8031e
- [BugFix] Fix use of per-request seed with pipeline parallel (#6698) · 5cf9254a
  Nick Hill authored Jul 30, 2024
  
  5cf9254a
19 Jul, 2024 1 commit
- [Misc] Small perf improvements (#6520) · 9ed82e70
  Antoni Baum authored Jul 19, 2024
  
  9ed82e70
16 Jul, 2024 1 commit
- [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425) · 9ad32dac
  Mor Zusman authored Jul 16, 2024
```
Co-authored-by: Mor Zusman <morz@ai21.com>
```
  9ad32dac
09 Jul, 2024 1 commit

- [CORE] Adding support for insertion of soft-tuned prompts (#4645) · 4d6ada94
  Swapnil Parekh authored Jul 09, 2024
  Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
  Co-authored-by: Joe G <joseph.granados@h2o.ai>
  Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
  4d6ada94
02 Jul, 2024 3 commits

- [Model] Jamba support (#4115) · 9d6a8daa
  Mor Zusman authored Jul 03, 2024
  Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
  Co-authored-by: Erez Schwartz <erezs@ai21.com>
  Co-authored-by: Mor Zusman <morz@ai21.com>
  Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
  Co-authored-by: Tomer Asida <tomera@ai21.com>
  Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
  Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
  9d6a8daa
- [Core] Pipeline Parallel Support (#4412) · c5832d2a
  Murali Andoorveedu authored Jul 02, 2024
```
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
```
  c5832d2a
- [Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (#5602) · 3476ed08
  Alexander Matveev authored Jul 01, 2024
  
  3476ed08
27 Jun, 2024 1 commit
- [core][misc] remove logical block (#5882) · 64e8d2a7
  youkaichao authored Jun 27, 2024
  
  64e8d2a7
15 Jun, 2024 2 commits
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
  0e9164b4
- [Core][Bugfix]: fix prefix caching for blockv2 (#5364) · 1b8a0d71
  leiwen83 authored Jun 15, 2024
```
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
  1b8a0d71
12 Jun, 2024 1 commit
- [Bugfix] Fix typo in scheduler.py (requeset -> request) (#5470) · 94a07bbd
  Michael Goin authored Jun 12, 2024
  
  94a07bbd
09 Jun, 2024 1 commit
- [Bugfix] Fix KeyError: 1 When Using LoRA adapters (#5164) · 45f92c00
  Bla_ckB authored Jun 10, 2024
  
  45f92c00
07 Jun, 2024 1 commit
- Addition of lacked ignored_seq_groups in _schedule_chunked_prefill (#5296) · dc49fb89
  limingshu authored Jun 07, 2024
  
  dc49fb89
03 Jun, 2024 1 commit
- [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) · 10c38e3e
  Kaiyang Chen authored Jun 04, 2024
  
  10c38e3e
01 Jun, 2024 1 commit
- [Bugfix] Remove deprecated @abstractproperty (#5174) · 8279078e
  Zhuohan Li authored Jun 01, 2024
  
  8279078e
29 May, 2024 1 commit
- [Core] Cross-attention KV caching and memory-management (towards eventual... · 4238bc82
  afeldman-nm authored May 29, 2024
```
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837)
```
  4238bc82
28 May, 2024 1 commit
- [Core] Sliding window for block manager v2 (#4545) · d4f39859
  Michał Moskal authored May 27, 2024
```
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
```
  d4f39859
24 May, 2024 1 commit
- [Core][Bugfix]: fix prefix caching for blockv2 (#4764) · e64fde4b
  leiwen83 authored May 25, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
  e64fde4b
21 May, 2024 1 commit
- [Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897) · 65ae8c2c
  Antoni Baum authored May 20, 2024
  
  65ae8c2c
18 May, 2024 1 commit

- [Lora] Support long context lora (#4787) · 2e9a2227
  SangBin Cho authored May 18, 2024
  Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through.
  It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.
  Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
  2e9a2227
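
The idea in the commit message above (one rotary pass over a batch whose tokens use different scaling factors) can be illustrated with a minimal pure-Python sketch. This is not the kernel added in #4787: it assumes linear position scaling and a half-split rotation layout, and every name in it (`build_cos_sin_cache`, `build_batched_cache`, `batched_rotary`, `token_offsets`) is hypothetical. It only shows how per-factor cos-sin caches might be concatenated so that each token selects its rows through an offset.

```python
import math


def build_cos_sin_cache(rotary_dim, max_positions, scaling_factor=1.0, base=10000.0):
    """Return one (cos, sin) row per position, assuming linear position scaling."""
    inv_freq = [base ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)]
    cache = []
    for pos in range(max_positions):
        scaled = pos / scaling_factor  # assumption: linear scaling of positions
        angles = [scaled * f for f in inv_freq]
        cache.append(([math.cos(a) for a in angles], [math.sin(a) for a in angles]))
    return cache


def build_batched_cache(rotary_dim, original_max_positions, scaling_factors):
    """Concatenate one cache per scaling factor and record where each one starts."""
    cache, offsets = [], {}
    for factor in scaling_factors:
        offsets[factor] = len(cache)
        rows = int(original_max_positions * factor)
        cache.extend(build_cos_sin_cache(rotary_dim, rows, scaling_factor=factor))
    return cache, offsets


def apply_rotary(vec, cos, sin):
    """Rotate the (first half, second half) pairs of vec by the cached angles."""
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second


def batched_rotary(queries, positions, token_offsets, cache):
    """Single pass over the batch: each token indexes the shared cache at offset + position."""
    out = []
    for q, pos, off in zip(queries, positions, token_offsets):
        cos, sin = cache[off + pos]
        out.append(apply_rotary(q, cos, sin))
    return out


# Two adapters share one batch: scaling factor 1.0 and scaling factor 4.0.
cache, offsets = build_batched_cache(rotary_dim=4, original_max_positions=8,
                                     scaling_factors=[1.0, 4.0])
q = [1.0, 2.0, 3.0, 4.0]
rotated = batched_rotary([q, q], positions=[1, 4],
                         token_offsets=[offsets[1.0], offsets[4.0]], cache=cache)
```

Under these assumptions, position 4 at scaling factor 4.0 maps to the same angles as position 1 at factor 1.0, so both tokens in the usage lines receive the same rotation while being served from different regions of the one concatenated cache.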

13 May, 2024 1 commit
- [Scheduler] Warning upon preemption and Swapping (#4647) · e7c46b95
  SangBin Cho authored May 13, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  e7c46b95
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
08 May, 2024 1 commit
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) · 20cfcdec
  youkaichao authored May 08, 2024
  
  20cfcdec
07 May, 2024 2 commits
- [Core][Optimization] change copy-on-write from dict[int, list] to list (#4648) · 469f85c7
  youkaichao authored May 07, 2024
  
  469f85c7
- [Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
  youkaichao authored May 06, 2024
  
  63575bc2
04 May, 2024 1 commit
- [Misc][Refactor] Introduce ExecuteModelData (#4540) · bc8ad684
  Cody Yu authored May 03, 2024
  
  bc8ad684