- 25 Jun, 2024 4 commits
-
-
Woosuk Kwon authored
-
Matt Wong authored
-
Woosuk Kwon authored
-
Jie Fu (傅杰) authored
-
- 21 Jun, 2024 2 commits
-
-
rohithkrn authored
-
Joshua Rosenkranz authored
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
-
- 17 Jun, 2024 1 commit
-
-
Kunshang Ji authored
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
-
- 15 Jun, 2024 1 commit
-
-
Cyrus Leung authored
-
- 13 Jun, 2024 1 commit
-
-
youkaichao authored
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)
-
- 12 Jun, 2024 3 commits
-
-
Isotr0py authored
-
Travis Johnson authored
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
-
Woosuk Kwon authored
-
- 11 Jun, 2024 1 commit
-
-
Nick Hill authored
-
- 09 Jun, 2024 1 commit
-
-
youkaichao authored
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (#5074)
-
- 07 Jun, 2024 1 commit
-
-
Antoni Baum authored
-
- 04 Jun, 2024 1 commit
-
-
Toshiki Kataoka authored
-
- 03 Jun, 2024 1 commit
-
-
Cyrus Leung authored
-
- 30 May, 2024 1 commit
-
-
Hyunsung Lee authored
-
- 29 May, 2024 1 commit
-
-
youkaichao authored
-
- 28 May, 2024 2 commits
-
-
Robert Shaw authored
-
Michał Moskal authored
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
-
- 22 May, 2024 2 commits
- 18 May, 2024 1 commit
-
-
SangBin Cho authored
Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. This adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches for different scaling factors. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
-
- 16 May, 2024 4 commits
-
-
youkaichao authored
-
youkaichao authored
-
Cody Yu authored
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
-
Aurick Qiao authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 15 May, 2024 2 commits
-
-
SangBin Cho authored
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill and decode into a single class and allows slicing it when running the attn backend. It also refactors subquery_start_loc, which was not refactored in the previous PR.
-
SangBin Cho authored
The LoRA 3 & 4 tests seem to fail with an illegal memory access after this commit: [2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered. Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241 This reverts commit 1356df53.
-
- 13 May, 2024 3 commits
-
-
Stephen Krider authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
-
Woosuk Kwon authored
-
youkaichao authored
-
- 11 May, 2024 1 commit
-
-
Chang Su authored
-
- 10 May, 2024 1 commit
-
-
youkaichao authored
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591)
-
- 09 May, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 08 May, 2024 3 commits
-
-
youkaichao authored
-
Antoni Baum authored
-
Woosuk Kwon authored
-
- 07 May, 2024 1 commit
-
-
youkaichao authored
-