- 11 Oct, 2024 2 commits
-
-
Tyler Michael Smith authored
-
youkaichao authored
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
-
- 08 Oct, 2024 1 commit
-
-
Alex Brooks authored
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
-
- 07 Oct, 2024 2 commits
-
-
youkaichao authored
-
youkaichao authored
-
- 02 Oct, 2024 1 commit
-
-
afeldman-nm authored
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
-
- 27 Sep, 2024 1 commit
-
-
Varun Sundar Rabindranath authored
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
- 25 Sep, 2024 2 commits
-
-
Woo-Yeon Lee authored
-
Archit Patke authored
-
- 08 Sep, 2024 1 commit
-
-
Alexander Matveev authored
-
- 02 Sep, 2024 1 commit
-
-
wang.yuqi authored
[Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874)
-
- 29 Aug, 2024 1 commit
-
-
Alexander Matveev authored
-
- 28 Aug, 2024 3 commits
-
-
Cody Yu authored
-
Alexander Matveev authored
-
youkaichao authored
-
- 27 Aug, 2024 2 commits
-
-
Jonathan Berkhahn authored
-
Megha Agarwal authored
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
-
- 19 Aug, 2024 2 commits
-
-
Cody Yu authored
-
SangBin Cho authored
-
- 16 Aug, 2024 1 commit
-
-
Mahesh Keralapura authored
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)
-
- 14 Aug, 2024 1 commit
-
-
William Lin authored
-
- 09 Aug, 2024 2 commits
-
-
Mahesh Keralapura authored
-
Alexander Matveev authored
-
- 08 Aug, 2024 1 commit
-
-
Rui Qiao authored
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
-
- 06 Aug, 2024 1 commit
-
-
afeldman-nm authored
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
-
- 01 Aug, 2024 1 commit
-
-
youkaichao authored
-
- 30 Jul, 2024 2 commits
-
-
youkaichao authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
Nick Hill authored
-
- 16 Jul, 2024 1 commit
-
-
Mor Zusman authored
Co-authored-by: Mor Zusman <morz@ai21.com>
-
- 09 Jul, 2024 1 commit
-
-
Swapnil Parekh authored
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
-
- 02 Jul, 2024 2 commits
-
-
Mor Zusman authored
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Erez Schwartz <erezs@ai21.com>
Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
-
Murali Andoorveedu authored
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
-
- 12 Jun, 2024 1 commit
-
-
Michael Goin authored
-
- 09 Jun, 2024 1 commit
-
-
Bla_ckB authored
-
- 07 Jun, 2024 1 commit
-
-
limingshu authored
-
- 03 Jun, 2024 1 commit
-
-
Kaiyang Chen authored
-
- 21 May, 2024 1 commit
-
-
Antoni Baum authored
-
- 18 May, 2024 1 commit
-
-
SangBin Cho authored
Currently, the rotary embedding kernel must be called once per LoRA, which makes it hard to serve multiple long-context LoRAs. This change adds a batched rotary embedding kernel and pipes it through, replacing the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
-
- 13 May, 2024 1 commit
-
-
SangBin Cho authored
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
- 11 May, 2024 1 commit
-
-
Chang Su authored
-