- 03 Jun, 2024 3 commits
-
-
Breno Faria authored
-
Tyler Michael Smith authored
-
Cyrus Leung authored
-
- 02 Jun, 2024 1 commit
-
-
Divakar Verma authored
This PR enables the fused topk_softmax kernel used in the MoE layer for HIP.
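For context, the operation this kernel fuses can be sketched in plain Python: softmax over one token's router logits, then selection of the k most probable experts. This is a reference sketch only, not the HIP kernel; the function name `topk_softmax_reference` and the tie-breaking rule (lower expert index wins) are assumptions made for illustration.

```python
import math


def topk_softmax_reference(logits, k):
    """Return (weights, expert_indices) for the k largest softmax probabilities.

    Hypothetical reference helper; a fused kernel would compute the same
    values in a single pass on the GPU.
    """
    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank experts by probability (descending); ties go to the lower index.
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    top = order[:k]
    return [probs[i] for i in top], top


if __name__ == "__main__":
    weights, experts = topk_softmax_reference([1.0, 3.0, 2.0, 0.0], k=2)
    print(experts, weights)
```

The returned weights are the raw softmax probabilities of the selected experts; whether they are renormalized afterwards is left out of this sketch.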
-
- 01 Jun, 2024 3 commits
-
-
chenqianfzh authored
-
Ye Cao authored
Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com>
-
Tyler Michael Smith authored
-
- 31 May, 2024 2 commits
-
-
Cody Yu authored
-
Robert Shaw authored
-
- 30 May, 2024 1 commit
-
-
Alexander Matveev authored
-
- 28 May, 2024 1 commit
-
-
Divakar Verma authored
This PR adds Triton kernel configs for the MoE kernel for MI300X
-
- 27 May, 2024 3 commits
-
-
Isotr0py authored
-
sasha0552 authored
-
Zhuohan Li authored
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
- 25 May, 2024 1 commit
-
-
Eric Xihui Lin authored
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
- 24 May, 2024 1 commit
-
-
Robert Shaw authored
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
- 23 May, 2024 3 commits
-
-
Elisei Smirnov authored
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university>
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
Alexander Matveev authored
-
- 22 May, 2024 3 commits
-
-
Philipp Moritz authored
-
raywanb authored
-
Cody Yu authored
This is the second PR for #4532. It supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with the .kv_scale parameter).
-
- 21 May, 2024 2 commits
- 20 May, 2024 3 commits
-
-
Aurick Qiao authored
-
Mor Zusman authored
Allow the dummy load format for FP8, since torch.uniform_ doesn't support FP8 at the moment.
Co-authored-by: Mor Zusman <morz@ai21.com>
-
Cyrus Leung authored
-
- 19 May, 2024 2 commits
-
-
Alexander Matveev authored
-
Cyrus Leung authored
-
- 18 May, 2024 1 commit
-
-
SangBin Cho authored
Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
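The idea can be sketched in plain Python, assuming linear position scaling and a NeoX-style half-split rotation: one cos/sin table is built per scaling factor, and each token carries an index that selects its table, so a mixed batch is rotated in a single call. The names `build_cos_sin_cache`, `batched_rotary`, and `cache_ids` are hypothetical and do not mirror the kernel's actual interface.

```python
import math


def build_cos_sin_cache(max_pos, rot_dim, scaling_factor, base=10000.0):
    """Build cache[pos] = (cos_list, sin_list) for one linear scaling factor."""
    inv_freq = [base ** (-2.0 * i / rot_dim) for i in range(rot_dim // 2)]
    cache = []
    for pos in range(max_pos):
        t = pos / scaling_factor  # linear scaling compresses positions
        cache.append(([math.cos(t * f) for f in inv_freq],
                      [math.sin(t * f) for f in inv_freq]))
    return cache


def batched_rotary(vectors, positions, cache_ids, caches):
    """Rotate every vector in one pass, picking the cos/sin cache per token."""
    out = []
    for vec, pos, cid in zip(vectors, positions, cache_ids):
        cos, sin = caches[cid][pos]
        half = len(vec) // 2
        x1, x2 = vec[:half], vec[half:]
        # Half-split rotation: element i of x1 is paired with element i of x2.
        r1 = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
        r2 = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
        out.append(r1 + r2)
    return out


if __name__ == "__main__":
    # Two caches: unscaled, and scaled by 4 (e.g. for a long-context LoRA).
    caches = [build_cos_sin_cache(8, 4, 1.0), build_cos_sin_cache(8, 4, 4.0)]
    batch = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
    print(batched_rotary(batch, positions=[1, 4], cache_ids=[0, 1], caches=caches))
```

In this sketch, position 4 under scaling factor 4 rotates by the same angle as position 1 under scaling factor 1, which is why the two tokens in the usage example produce the same output.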
-
- 17 May, 2024 2 commits
-
-
eigenLiu authored
-
Jinzhen Lin authored
-
- 16 May, 2024 4 commits
-
-
Alexander Matveev authored
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
-
Jinzhen Lin authored
-
alexm-nm authored
-
Aurick Qiao authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 15 May, 2024 1 commit
-
-
SangBin Cho authored
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)
This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class and allows slicing it when running the attention backend. It also refactors subquery_start_loc, which was not refactored in the previous PR.
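A rough sketch of what a single, sliceable metadata class could look like follows. It assumes prefill tokens are laid out before decode tokens in the batch; the class name `BatchMetadata`, its fields, and the view methods are hypothetical and are not taken from the PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BatchMetadata:
    """Hypothetical combined metadata for one batch.

    Assumed layout: the first num_prefill_tokens entries belong to prefill
    requests, the remaining num_decode_tokens entries to decode requests.
    """
    num_prefill_tokens: int
    num_decode_tokens: int
    slot_mapping: List[int]

    def prefill_view(self) -> "BatchMetadata":
        # Slice out only the prefill portion of the batch.
        n = self.num_prefill_tokens
        return BatchMetadata(n, 0, self.slot_mapping[:n])

    def decode_view(self) -> "BatchMetadata":
        # Slice out only the decode portion of the batch.
        n = self.num_prefill_tokens
        return BatchMetadata(0, self.num_decode_tokens, self.slot_mapping[n:])


if __name__ == "__main__":
    meta = BatchMetadata(3, 2, [10, 11, 12, 40, 57])
    print(meta.prefill_view())
    print(meta.decode_view())
```

With one object per batch, an attention backend can request the prefill view and the decode view separately instead of receiving two independently prepared structures.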
-
- 13 May, 2024 3 commits
-
-
Philipp Moritz authored
-
Sanger Steel authored
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208)
-
Woosuk Kwon authored
-