- 20 May, 2024 2 commits
-
-
Cyrus Leung authored
-
Woosuk Kwon authored
-
- 19 May, 2024 2 commits
-
-
Alexander Matveev authored
-
Cyrus Leung authored
-
- 18 May, 2024 2 commits
-
-
SangBin Cho authored
Currently, the rotary embedding kernel has to be called once per LoRA, which makes it hard to serve multiple long-context LoRAs. This adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
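A minimal sketch of the idea described above, not the kernel from this commit: the cos-sin caches for the different scaling factors are concatenated into a single table, and each token carries an offset into that table, so one batched call can handle tokens that belong to different LoRAs. All function names, the linear position scaling, and the NeoX-style half-split rotation are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

CosSin = Tuple[List[float], List[float]]


def build_cos_sin_cache(rotary_dim: int, max_positions: int,
                        scaling_factor: float,
                        base: float = 10000.0) -> List[CosSin]:
    """Build one cache; row p holds cos/sin for position p / scaling_factor
    (linear scaling is an assumption of this sketch)."""
    inv_freq = [base ** (-2.0 * i / rotary_dim) for i in range(rotary_dim // 2)]
    cache = []
    for pos in range(max_positions):
        t = pos / scaling_factor
        cache.append(([math.cos(t * f) for f in inv_freq],
                      [math.sin(t * f) for f in inv_freq]))
    return cache


def concat_caches(
        caches: Dict[float, List[CosSin]]) -> Tuple[List[CosSin], Dict[float, int]]:
    """Concatenate the per-factor caches and record where each one starts."""
    combined: List[CosSin] = []
    offsets: Dict[float, int] = {}
    for factor, cache in caches.items():
        offsets[factor] = len(combined)
        combined.extend(cache)
    return combined, offsets


def batched_rotary(vectors: List[List[float]], positions: List[int],
                   cache_offsets: List[int],
                   combined: List[CosSin]) -> List[List[float]]:
    """Rotate every token in one pass; each token selects its own cache
    through its offset into the combined table."""
    out = []
    for vec, pos, offset in zip(vectors, positions, cache_offsets):
        cos, sin = combined[offset + pos]
        half = len(cos)
        x1, x2 = vec[:half], vec[half:2 * half]
        first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
        second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
        out.append(first + second)
    return out


# Two hypothetical LoRAs: one with scaling factor 1.0, one with 2.0.
combined, offsets = concat_caches({
    1.0: build_cos_sin_cache(4, 8, 1.0),
    2.0: build_cos_sin_cache(4, 16, 2.0),
})
out = batched_rotary(
    [[1.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0]],
    positions=[0, 2],
    cache_offsets=[offsets[1.0], offsets[2.0]],
    combined=combined,
)
```

Keeping a single combined table means the per-token work is the same regardless of how many scaling factors are present in the batch; only the offset differs.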
-
alexeykondrat authored
-
- 17 May, 2024 4 commits
-
-
eigenLiu authored
-
Jinzhen Lin authored
-
Alexei-V-Ivanov-AMD authored
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)
-
bofeng huang authored
-
- 16 May, 2024 12 commits
-
-
Woosuk Kwon authored
-
Tyler Michael Smith authored
-
youkaichao authored
-
youkaichao authored
-
Hongxia Yang authored
-
Alexander Matveev authored
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
-
Pierre Dulac authored
-
Alex Wu authored
-
Jinzhen Lin authored
-
alexm-nm authored
-
Cody Yu authored
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
-
Aurick Qiao authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 15 May, 2024 5 commits
-
-
Alex Wu authored
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
Cyrus Leung authored
-
zifeitong authored
-
SangBin Cho authored
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill and decode into a single class and allows it to be sliced when running the attn backend. In addition, it refactors subquery_start_loc, which was not refactored in the previous PR.
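A rough sketch of what a combined prepare step with sliceable metadata could look like. This is not the code from the PR: `BatchMetadata`, `prepare_batch`, and all field names are hypothetical, and the prefill-tokens-first layout is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BatchMetadata:
    """Hypothetical single metadata class covering prefill and decode.
    Assumes prefill sequences come first in the flattened batch."""
    num_prefill_tokens: int
    num_decode_tokens: int
    num_prefills: int
    seq_lens: List[int]  # tokens contributed by each sequence, prefills first

    def prefill_part(self) -> "BatchMetadata":
        # View restricted to the prefill sequences.
        return BatchMetadata(self.num_prefill_tokens, 0, self.num_prefills,
                             self.seq_lens[:self.num_prefills])

    def decode_part(self) -> "BatchMetadata":
        # View restricted to the decode sequences (one token each).
        return BatchMetadata(0, self.num_decode_tokens, 0,
                             self.seq_lens[self.num_prefills:])


def prepare_batch(prompts: List[List[int]],
                  decode_tokens: List[int]) -> Tuple[List[int], BatchMetadata]:
    """One entry point instead of separate prompt and decode paths:
    flatten all tokens and describe the layout in a single object."""
    tokens = [t for prompt in prompts for t in prompt] + list(decode_tokens)
    seq_lens = [len(prompt) for prompt in prompts] + [1] * len(decode_tokens)
    meta = BatchMetadata(
        num_prefill_tokens=sum(len(prompt) for prompt in prompts),
        num_decode_tokens=len(decode_tokens),
        num_prefills=len(prompts),
        seq_lens=seq_lens,
    )
    return tokens, meta


# Two prompts being prefilled and two sequences being decoded.
tokens, meta = prepare_batch([[11, 12, 13], [21, 22]], [31, 41])
prefill_tokens = tokens[:meta.num_prefill_tokens]
decode_only = tokens[meta.num_prefill_tokens:]
```

An attention backend written against this shape could run its prefill path on `meta.prefill_part()` and its decode path on `meta.decode_part()` without the caller building two separate metadata objects.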
-
SangBin Cho authored
LoRA tests 3 and 4 seem to fail with an illegal memory access after this commit: [2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered. Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241 This reverts commit 1356df53.
-
- 14 May, 2024 2 commits
-
-
Nick Hill authored
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
-
Kuntai Du authored
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies (#4696)
-
- 13 May, 2024 9 commits
-
-
Philipp Moritz authored
-
Stephen Krider authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
-
Cody Yu authored
-
Sanger Steel authored
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208)
-
Woosuk Kwon authored
-
SangBin Cho authored
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
Cyrus Leung authored
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for the OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.) I have also moved the test utilities file (test_utils.py) into the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported with a relative import.
-
youkaichao authored
-
Swapnil Parekh authored
-
- 12 May, 2024 1 commit
-
-
Yikang Shen authored
-
- 11 May, 2024 1 commit
-
-
Chang Su authored
-