- 06 Aug, 2024 1 commit
-
-
afeldman-nm authored
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
-
- 05 Aug, 2024 2 commits
-
-
Isotr0py authored
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
Cade Daniel authored
-
- 30 Jul, 2024 1 commit
-
-
fzyzcjy authored
-
- 27 Jul, 2024 1 commit
-
-
tomeras91 authored
-
- 23 Jul, 2024 3 commits
-
-
dongmao zhang authored
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
Simon Mo authored
-
Woosuk Kwon authored
-
- 22 Jul, 2024 1 commit
-
-
Cyrus Leung authored
Co-authored-by: Roger Wang <ywang@roblox.com>
-
- 21 Jul, 2024 1 commit
-
-
sroy745 authored
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485)
-
- 20 Jul, 2024 1 commit
-
-
Antoni Baum authored
-
- 18 Jul, 2024 1 commit
-
-
youkaichao authored
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
- 09 Jul, 2024 1 commit
-
-
Swapnil Parekh authored
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
-
- 03 Jul, 2024 1 commit
-
-
xwjiang2010 authored
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
-
- 02 Jul, 2024 1 commit
-
-
xwjiang2010 authored
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
-
- 01 Jul, 2024 1 commit
-
-
sroy745 authored
-
- 28 Jun, 2024 2 commits
-
-
Ilya Lavrenov authored
-
Cyrus Leung authored
Co-authored-by: ywang96 <ywang@roblox.com>
-
- 25 Jun, 2024 1 commit
-
-
Woo-Yeon Lee authored
[Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414)
-
- 20 Jun, 2024 1 commit
-
-
Michael Goin authored
-
- 19 Jun, 2024 1 commit
-
-
Michael Goin authored
-
- 18 Jun, 2024 1 commit
-
-
Ronen Schaffer authored
This PR adds basic support for OpenTelemetry distributed tracing. It includes changes that enable tracing and improve monitoring capabilities. I've also added a Markdown guide with screenshots showing users how to use this feature. You can find it here
-
- 17 Jun, 2024 1 commit
-
-
Kunshang Ji authored
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
-
- 14 Jun, 2024 1 commit
-
-
Sanger Steel authored
-
- 12 Jun, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 11 Jun, 2024 3 commits
-
-
sasha0552 authored
-
Ali Panahi authored
-
maor-ps authored
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
-
- 05 Jun, 2024 1 commit
-
-
Michael Goin authored
-
- 03 Jun, 2024 2 commits
-
-
Kaiyang Chen authored
-
Cyrus Leung authored
-
- 01 Jun, 2024 1 commit
-
-
chenqianfzh authored
-
- 28 May, 2024 1 commit
-
-
Michał Moskal authored
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
-
- 27 May, 2024 1 commit
-
-
Zhuohan Li authored
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
- 22 May, 2024 2 commits
- 21 May, 2024 1 commit
-
-
Kante Yin authored
Signed-off-by: kerthcet <kerthcet@gmail.com>
-
- 18 May, 2024 1 commit
-
-
SangBin Cho authored
Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
-
- 15 May, 2024 2 commits
-
-
zifeitong authored
-
SangBin Cho authored
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)
This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class and allows it to be sliced when running the attention backend. Finally, it refactors subquery_start_loc, which was not refactored in the previous PR.
-