Commits · 4d3a2c284ecc47748273664c9a6ef302ff3adcbe · OpenDAS / vllm_cscc

11 Nov, 2024 1 commit

[V1] `AsyncLLM` Implementation (#9826) · 6ace6fba

Robert Shaw authored Nov 11, 2024


Signed-off-by: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>


16 Oct, 2024 1 commit
- [CI/Build] mypy: Resolve some errors from checking vllm/engine (#9267) · 776dbd74
  Russell Bryant authored Oct 16, 2024
```
Signed-off-by: Russell Bryant <rbryant@redhat.com>
```
21 Aug, 2024 1 commit
- [mypy] Enable following imports for entrypoints (#7248) · baaedfdb
  Cyrus Leung authored Aug 21, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Fei <dfdfcai4@gmail.com>
```
29 May, 2024 1 commit
- [Bugfix] Remove the last EOS token unless explicitly specified (#5077) · dfba529b
  Junichi Sato authored May 29, 2024
  
18 May, 2024 1 commit

[Lora] Support long context lora (#4787) · 2e9a2227

SangBin Cho authored May 18, 2024

Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. This commit adds a batched rotary embedding kernel and pipes it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.
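
As a rough illustration of the idea only (not the kernel added by this commit), the sketch below concatenates one cos-sin cache per scaling factor into a single table and lets each token select its cache through a start offset, so a single pass covers tokens from different LoRAs. All function names are hypothetical, and linear position scaling with a NeoX-style half-split layout is assumed.

```python
import math

def build_cos_sin_cache(rot_dim, max_len, base, scaling_factor):
    """Row p holds (cos, sin) lists for position p / scaling_factor."""
    inv_freq = [base ** (-2 * i / rot_dim) for i in range(rot_dim // 2)]
    cache = []
    for p in range(max_len):
        t = p / scaling_factor  # assumption: linear position scaling
        cache.append(([math.cos(t * f) for f in inv_freq],
                      [math.sin(t * f) for f in inv_freq]))
    return cache

def build_batched_cache(rot_dim, base_max_len, base, scaling_factors):
    """Concatenate the per-factor caches; record where each one starts."""
    table, offsets = [], {}
    for s in scaling_factors:
        offsets[s] = len(table)
        table.extend(build_cos_sin_cache(rot_dim, int(base_max_len * s), base, s))
    return table, offsets

def rotate(vec, cos, sin):
    """Rotate the pairs (vec[i], vec[i + half]) by the cached angles."""
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    out1 = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    out2 = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return out1 + out2

def batched_rotary(vectors, positions, token_offsets, table):
    """One pass over all tokens; each token indexes its own cache region."""
    return [rotate(v, *table[off + pos])
            for v, pos, off in zip(vectors, positions, token_offsets)]

# Two tokens at the same position, served with different scaling factors.
table, offsets = build_batched_cache(
    rot_dim=4, base_max_len=8, base=10000.0, scaling_factors=[1.0, 4.0])
vectors = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
positions = [4, 4]
token_offsets = [offsets[1.0], offsets[4.0]]
rotated = batched_rotary(vectors, positions, token_offsets, table)
```

In this sketch the per-token offset is the only thing that distinguishes one scaling factor from another, which is what allows requests with different factors to share one call.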

Follow-up to https://github.com/vllm-project/vllm/pull/3095/files


16 Apr, 2024 1 commit
- [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) · e95cd879
  Cade Daniel authored Apr 16, 2024
  