"vllm/model_executor/models/step3p5_mtp.py" did not exist on "cdc1fa12eb1ba4795d24e97dcffa2018668a9267"
- 12 Jun, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 11 Jun, 2024 1 commit
-
-
Nick Hill authored
-
- 09 Jun, 2024 1 commit
-
-
youkaichao authored
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (#5074)
-
- 07 Jun, 2024 1 commit
-
-
Antoni Baum authored
-
- 04 Jun, 2024 1 commit
-
-
Toshiki Kataoka authored
-
- 03 Jun, 2024 1 commit
-
-
Cyrus Leung authored
-
- 30 May, 2024 1 commit
-
-
Hyunsung Lee authored
-
- 29 May, 2024 1 commit
-
-
youkaichao authored
-
- 28 May, 2024 2 commits
-
-
Robert Shaw authored
-
Michał Moskal authored
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
-
- 22 May, 2024 2 commits
- 18 May, 2024 1 commit
-
-
SangBin Cho authored
Currently the rotary embedding kernel has to be called once per LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
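The batching idea described above can be sketched in plain Python. This is a minimal illustration, not the kernel from the PR: it assumes linear position scaling and a NeoX-style split of the vector into two halves, and the names `build_cos_sin_cache`, `build_stacked_cache`, and `batched_rotary` are hypothetical. The caches for all scaling factors are stacked into one table, and each token carries an offset that selects its own cache, so a single pass can serve tokens from different LoRAs.

```python
import math


def build_cos_sin_cache(rot_dim, max_pos, base=10000.0, scaling_factor=1.0):
    """Return one (cos, sin) row per position for a single scaling factor.

    Assumption: linear scaling, i.e. the position is divided by the factor.
    """
    inv_freq = [1.0 / (base ** (2 * i / rot_dim)) for i in range(rot_dim // 2)]
    cache = []
    for pos in range(max_pos):
        angles = [(pos / scaling_factor) * f for f in inv_freq]
        cache.append(([math.cos(a) for a in angles],
                      [math.sin(a) for a in angles]))
    return cache


def build_stacked_cache(rot_dim, max_pos, scaling_factors):
    """Concatenate the per-factor caches and record where each one starts."""
    stacked, offsets = [], {}
    for factor in scaling_factors:
        offsets[factor] = len(stacked)
        stacked.extend(build_cos_sin_cache(
            rot_dim, int(max_pos * factor), scaling_factor=factor))
    return stacked, offsets


def batched_rotary(x, positions, token_offsets, stacked):
    """Rotate every token in one pass; the offset picks the token's cache."""
    out = []
    for vec, pos, off in zip(x, positions, token_offsets):
        cos, sin = stacked[pos + off]
        half = len(vec) // 2
        x1, x2 = vec[:half], vec[half:]
        first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
        second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
        out.append(first + second)
    return out


# Two tokens at the same position, served by LoRAs with different factors.
stacked, offsets = build_stacked_cache(4, 8, [1.0, 2.0])
x = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
out = batched_rotary(x, [3, 3], [offsets[1.0], offsets[2.0]], stacked)
```

In this sketch the only per-token state is the integer offset, which is why one call can replace a loop over LoRAs.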
-
- 16 May, 2024 4 commits
-
-
youkaichao authored
-
youkaichao authored
-
Cody Yu authored
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
-
Aurick Qiao authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 15 May, 2024 2 commits
-
-
SangBin Cho authored
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class and allows it to be sliced when running the attention backend. It also refactors subquery_start_loc, which was not refactored in the previous PR.
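As a rough illustration of a single metadata class that can be sliced into prefill and decode parts, consider the sketch below. It is not the class from the PR: `BatchMetadata`, its fields, and `split` are hypothetical names, and it assumes that prefill sequences are placed before decode sequences in the batch and that each decode sequence contributes exactly one token.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BatchMetadata:
    """Hypothetical combined metadata for a mixed prefill/decode batch."""

    query_lens: List[int]  # tokens per sequence, prefill sequences first
    num_prefills: int      # number of leading sequences that are prefills

    @property
    def num_prefill_tokens(self) -> int:
        return sum(self.query_lens[:self.num_prefills])

    @property
    def num_decode_tokens(self) -> int:
        # Assumption: every decode sequence contributes one token.
        return len(self.query_lens) - self.num_prefills

    def split(self, tokens: List[int]) -> Tuple[List[int], List[int]]:
        """Slice a flat token list into its prefill and decode parts."""
        n = self.num_prefill_tokens
        return tokens[:n], tokens[n:]


# Two prefills (3 and 2 tokens) followed by two decodes (1 token each).
meta = BatchMetadata(query_lens=[3, 2, 1, 1], num_prefills=2)
prefill_tokens, decode_tokens = meta.split(list(range(6 + 1)))
```

Keeping one object and slicing it on demand means the caller prepares the batch once, and the backend decides which part each kernel sees.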
-
SangBin Cho authored
LoRA tests 3 and 4 seem to fail with an illegal memory access after this commit: [2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241
This reverts commit 1356df53.
-
- 13 May, 2024 3 commits
-
-
Stephen Krider authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
-
Woosuk Kwon authored
-
youkaichao authored
-
- 11 May, 2024 1 commit
-
-
Chang Su authored
-
- 10 May, 2024 1 commit
-
-
youkaichao authored
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591)
-
- 09 May, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 08 May, 2024 3 commits
-
-
youkaichao authored
-
Antoni Baum authored
-
Woosuk Kwon authored
-
- 07 May, 2024 1 commit
-
-
youkaichao authored
-
- 04 May, 2024 1 commit
-
-
Cody Yu authored
-
- 03 May, 2024 3 commits
-
-
Lily Liu authored
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
-
SangBin Cho authored
-
youkaichao authored
-
- 26 Apr, 2024 2 commits
-
-
SangBin Cho authored
-
SangBin Cho authored
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
-
- 25 Apr, 2024 2 commits
-
-
Nick Hill authored
-
SangBin Cho authored
-
- 24 Apr, 2024 1 commit
-
-
youkaichao authored
-
- 23 Apr, 2024 2 commits