- 05 Jun, 2024 2 commits
-
-
Cody Yu authored
-
Woosuk Kwon authored
-
- 04 Jun, 2024 2 commits
-
-
Woosuk Kwon authored
-
Toshiki Kataoka authored
-
- 03 Jun, 2024 3 commits
-
-
Breno Faria authored
-
Tyler Michael Smith authored
-
Cyrus Leung authored
-
- 02 Jun, 2024 1 commit
-
-
Divakar Verma authored
This PR enables the fused topk_softmax kernel used in the MoE layer for HIP.
-
- 01 Jun, 2024 3 commits
-
-
chenqianfzh authored
-
Ye Cao authored
Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com>
-
Tyler Michael Smith authored
-
- 31 May, 2024 2 commits
-
-
Cody Yu authored
-
Robert Shaw authored
-
- 30 May, 2024 1 commit
-
-
Alexander Matveev authored
-
- 28 May, 2024 1 commit
-
-
Divakar Verma authored
This PR adds Triton kernel configs for the MoE kernel on MI300X.
-
- 27 May, 2024 3 commits
-
-
Isotr0py authored
-
sasha0552 authored
-
Zhuohan Li authored
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
- 25 May, 2024 1 commit
-
-
Eric Xihui Lin authored
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
- 24 May, 2024 1 commit
-
-
Robert Shaw authored
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
- 23 May, 2024 3 commits
-
-
Elisei Smirnov authored
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university>
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
Alexander Matveev authored
-
- 22 May, 2024 3 commits
-
-
Philipp Moritz authored
-
raywanb authored
-
Cody Yu authored
The second PR for #4532. This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with a .kv_scale parameter).
-
- 21 May, 2024 2 commits
- 20 May, 2024 3 commits
-
-
Aurick Qiao authored
-
Mor Zusman authored
Allow the dummy load format for FP8; torch.uniform_ doesn't support FP8 at the moment.
Co-authored-by: Mor Zusman <morz@ai21.com>
-
Cyrus Leung authored
-
- 19 May, 2024 2 commits
-
-
Alexander Matveev authored
-
Cyrus Leung authored
-
- 18 May, 2024 1 commit
-
-
SangBin Cho authored
Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
-
- 17 May, 2024 2 commits
-
-
eigenLiu authored
-
Jinzhen Lin authored
-
- 16 May, 2024 4 commits
-
-
Alexander Matveev authored
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
-
Jinzhen Lin authored
-
alexm-nm authored
-
Aurick Qiao authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-