- 20 May, 2024 3 commits
-
-
Alexander Matveev authored
-
Cyrus Leung authored
-
Woosuk Kwon authored
-
- 19 May, 2024 2 commits
-
-
Alexander Matveev authored
-
Cyrus Leung authored
-
- 18 May, 2024 2 commits
-
-
SangBin Cho authored
Currently the rotary embedding kernel has to be called once per LoRA, which makes it hard to serve multiple long-context LoRAs. This adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches for different scaling factors. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
-
alexeykondrat authored
-
- 17 May, 2024 6 commits
-
-
Michael Goin authored
-
Antoni Baum authored
-
eigenLiu authored
-
Jinzhen Lin authored
-
Alexei-V-Ivanov-AMD authored
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)
-
bofeng huang authored
-
- 16 May, 2024 16 commits
-
-
Kante Yin authored
Signed-off-by: kerthcet <kerthcet@gmail.com>
-
Woosuk Kwon authored
-
Tyler Michael Smith authored
-
Silencio authored
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
-
youkaichao authored
-
youkaichao authored
-
Hongxia Yang authored
-
Simon Mo authored
-
Alexander Matveev authored
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
-
Pierre Dulac authored
-
Alex Wu authored
-
Alex Wu authored
-
Jinzhen Lin authored
-
alexm-nm authored
-
Cody Yu authored
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
-
Aurick Qiao authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 15 May, 2024 7 commits
-
-
Alex Wu authored
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
Cyrus Leung authored
-
Zhuohan Li authored
-
zifeitong authored
-
Cyrus Leung authored
-
SangBin Cho authored
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill and decode into a single class that can be sliced when running the attn backend, and it refactors subquery_start_loc, which was not refactored in the previous PR.
-
SangBin Cho authored
The LoRA 3 & 4 tests seem to fail with an illegal memory access after this commit: [2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered. Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241 This reverts commit 1356df53.
-
- 14 May, 2024 4 commits
-
-
Simon Mo authored
-
Nick Hill authored
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
-
Cyrus Leung authored
This PR fixes the CI failure introduced by #4798. The failure originates from having duplicate target names in reST, and is fixed by changing the ref targets to anonymous ones. For more information, see this discussion. I have also changed the format of the links to be more distinct from each other.
-
Kuntai Du authored
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies (#4696)
-