- 02 Jun, 2024 2 commits
-
-
Cyrus Leung authored
-
Simon Mo authored
-
- 01 Jun, 2024 3 commits
-
-
chenqianfzh authored
-
Varun Sundar Rabindranath authored
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
Tyler Michael Smith authored
-
- 31 May, 2024 1 commit
-
-
SnowDist authored
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
-
- 30 May, 2024 1 commit
-
-
Breno Faria authored
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
-
- 29 May, 2024 6 commits
-
-
Cyrus Leung authored
-
Cyrus Leung authored
-
afeldman-nm authored
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837)
-
Cyrus Leung authored
-
youkaichao authored
-
Junichi Sato authored
-
- 28 May, 2024 2 commits
-
-
Cyrus Leung authored
Co-authored-by: Roger Wang <ywang@roblox.com>
-
Michał Moskal authored
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
-
- 27 May, 2024 1 commit
-
-
Zhuohan Li authored
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
- 25 May, 2024 2 commits
-
-
Lily Liu authored
-
Eric Xihui Lin authored
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
- 24 May, 2024 2 commits
-
-
leiwen83 authored
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
-
Robert Shaw authored
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
- 23 May, 2024 3 commits
-
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
Murali Andoorveedu authored
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
-
Alexander Matveev authored
-
- 22 May, 2024 6 commits
-
-
Cody Yu authored
-
raywanb authored
-
Cody Yu authored
The second PR for #4532. This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with the .kv_scale parameter).
-
Tyler Michael Smith authored
Pass the CUDA stream into the CUTLASS GEMMs to avoid future issues with CUDA graphs.
-
SangBin Cho authored
-
sasha0552 authored
-
- 21 May, 2024 1 commit
-
-
Isotr0py authored
-
- 20 May, 2024 2 commits
-
-
Alexei-V-Ivanov-AMD authored
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>
-
Woosuk Kwon authored
-
- 19 May, 2024 2 commits
-
-
Alexander Matveev authored
-
Cyrus Leung authored
-
- 18 May, 2024 1 commit
-
-
SangBin Cho authored
Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
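The idea in this commit message can be illustrated with a minimal pure-Python sketch. It is not the kernel from the PR: the function names (`build_cache`, `build_batched_cache`, `batched_rotary`), the linear position scaling, and the NeoX-style half-split rotation are all assumptions made for illustration. The point it shows is that caches for several scaling factors can be concatenated, so that a single pass rotates a mixed batch by giving each token its own row offset into the combined cache.

```python
import math

def build_cache(max_pos, rot_dim, scale, base=10000.0):
    """Hypothetical helper: (cos, sin) rows for one scaling factor (linear scaling assumed)."""
    half = rot_dim // 2
    inv_freq = [base ** (-2.0 * i / rot_dim) for i in range(half)]
    cache = []
    for pos in range(max_pos):
        angles = [(pos / scale) * f for f in inv_freq]
        cache.append(([math.cos(a) for a in angles], [math.sin(a) for a in angles]))
    return cache

def build_batched_cache(max_pos, rot_dim, scales):
    """Concatenate one cache per scaling factor; record the row offset of each."""
    cache, offsets = [], {}
    for s in scales:
        offsets[s] = len(cache)
        # A factor of s covers s times as many positions (assumption).
        cache.extend(build_cache(int(max_pos * s), rot_dim, s))
    return cache, offsets

def batched_rotary(vectors, positions, token_offsets, cache):
    """Rotate all tokens in one pass; token i reads row positions[i] + token_offsets[i]."""
    out = []
    for vec, pos, off in zip(vectors, positions, token_offsets):
        cos, sin = cache[pos + off]
        half = len(vec) // 2
        x1, x2 = vec[:half], vec[half:]
        first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
        second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
        out.append(first + second)
    return out

# Two tokens from requests with different scaling factors, rotated in one call.
cache, offsets = build_batched_cache(max_pos=8, rot_dim=4, scales=[1.0, 2.0])
vectors = [[1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0]]
out = batched_rotary(
    vectors,
    positions=[3, 6],
    token_offsets=[offsets[1.0], offsets[2.0]],
    cache=cache,
)
```

In this sketch, position 3 at factor 1.0 and position 6 at factor 2.0 map to the same rotation angle, so both tokens end up with the same rotated vector even though they read different rows of the combined cache.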
-
- 17 May, 2024 2 commits
-
-
Jinzhen Lin authored
-
Alexei-V-Ivanov-AMD authored
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)
-
- 16 May, 2024 3 commits
-
-
Tyler Michael Smith authored
-
Silencio authored
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
-
youkaichao authored
-