- 28 Jun, 2024 2 commits
-
-
Ilya Lavrenov authored
-
Cyrus Leung authored
Co-authored-by: ywang96 <ywang@roblox.com>
-
- 25 Jun, 2024 1 commit
-
-
Woo-Yeon Lee authored
[Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414)
-
- 20 Jun, 2024 1 commit
-
-
Michael Goin authored
-
- 19 Jun, 2024 1 commit
-
-
Michael Goin authored
-
- 18 Jun, 2024 1 commit
-
-
Ronen Schaffer authored
This PR adds basic support for OpenTelemetry distributed tracing. It includes changes to enable tracing functionality and improve monitoring capabilities. I've also added a Markdown guide with screenshots that shows users how to use this feature.
-
- 17 Jun, 2024 1 commit
-
-
Kunshang Ji authored
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
-
- 14 Jun, 2024 1 commit
-
-
Sanger Steel authored
-
- 12 Jun, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 11 Jun, 2024 3 commits
-
-
sasha0552 authored
-
Ali Panahi authored
-
maor-ps authored
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
-
- 05 Jun, 2024 1 commit
-
-
Michael Goin authored
-
- 03 Jun, 2024 2 commits
-
-
Kaiyang Chen authored
-
Cyrus Leung authored
-
- 01 Jun, 2024 1 commit
-
-
chenqianfzh authored
-
- 28 May, 2024 1 commit
-
-
Michał Moskal authored
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
-
- 27 May, 2024 1 commit
-
-
Zhuohan Li authored
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
- 22 May, 2024 2 commits
- 21 May, 2024 1 commit
-
-
Kante Yin authored
Signed-off-by: kerthcet <kerthcet@gmail.com>
-
- 18 May, 2024 1 commit
-
-
SangBin Cho authored
Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. This change adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
-
- 15 May, 2024 2 commits
-
-
zifeitong authored
-
SangBin Cho authored
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill and decode into a single class that can be sliced when running the attn backend, and refactors subquery_start_loc, which was not refactored in the previous PR.
-
- 14 May, 2024 1 commit
-
-
Nick Hill authored
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
-
- 13 May, 2024 1 commit
-
-
Sanger Steel authored
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208)
-
- 11 May, 2024 1 commit
-
-
Chang Su authored
-
- 09 May, 2024 1 commit
-
-
Cyrus Leung authored
-
- 08 May, 2024 1 commit
-
-
Cody Yu authored
Co-authored-by: Cade Daniel <edacih@gmail.com>
-
- 04 May, 2024 1 commit
-
-
DearPlanet authored
-
- 03 May, 2024 2 commits
-
-
Michael Goin authored
-
SangBin Cho authored
-
- 01 May, 2024 1 commit
-
-
leiwen83 authored
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
-
- 27 Apr, 2024 1 commit
-
-
Austin Veselka authored
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
-
- 23 Apr, 2024 1 commit
-
-
Cade Daniel authored
-
- 21 Apr, 2024 1 commit
-
-
GeauxEric authored
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
-
- 20 Apr, 2024 2 commits
-
-
Noam Gat authored
-
Harry Mellor authored
Co-authored-by: Harry Mellor <hmellor@oxts.com>
-
- 18 Apr, 2024 1 commit
-
-
Michael Goin authored
-
- 16 Apr, 2024 1 commit
-
-
Antoni Baum authored
-