- 20 Jul, 2024 1 commit
-
-
zhuwenwen authored
-
- 09 Jul, 2024 1 commit
-
-
huangwb authored
-
- 08 Jul, 2024 1 commit
-
-
zhuwenwen authored
-
- 06 Jul, 2024 1 commit
-
-
zhuwenwen authored
-
- 28 Jun, 2024 1 commit
-
-
zhuwenwen authored
-
- 11 Jun, 2024 1 commit
-
-
Nick Hill authored
-
- 10 Jun, 2024 1 commit
-
-
Dipika Sikka authored
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
- 09 Jun, 2024 1 commit
-
-
bnellnm authored
-
- 08 Jun, 2024 2 commits
-
-
Michael Goin authored
-
Cheng Li authored
Bug description: With torch 2.4.0.dev20240603+cu121, cutlass_fp8_supported returns False, and the (capability, version) pair before the comparison is (90, 11111111112). This PR fixes the support check for FP8 CUTLASS (cutlass_fp8_supported), which was introduced in https://github.com/vllm-project/vllm/pull/5183.
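The value 11111111112 is what Python produces when a version string such as "12.1" is indexed character by character before the arithmetic, because `"1" * 10` repeats the character instead of multiplying a number. The sketch below reproduces that effect and shows one way to parse the version numerically. The helper names are hypothetical and this is not presented as the code changed in the PR.

```python
def encode_version_buggy(cuda_version: str) -> str:
    # Indexing a string yields single characters, so "* 10" repeats
    # the first character ten times and "+" concatenates the second.
    return cuda_version[0] * 10 + cuda_version[1]


def encode_version(cuda_version: str) -> int:
    # Split on the dot and convert to integers before doing arithmetic.
    major, minor = cuda_version.split(".")[:2]
    return int(major) * 10 + int(minor)


if __name__ == "__main__":
    print(encode_version_buggy("12.1"))
    print(encode_version("12.1"))
```

With the numeric form, a comparison against a minimum supported version behaves as intended for a string like "12.1".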
-
- 07 Jun, 2024 3 commits
-
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
Tyler Michael Smith authored
Switch from torch._scaled_mm to vLLM's CUTLASS FP8 kernels when supported, as we are seeing a 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8. See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
-
Antoni Baum authored
-
- 06 Jun, 2024 1 commit
-
-
Philipp Moritz authored
-
- 05 Jun, 2024 4 commits
-
-
Woosuk Kwon authored
-
Philipp Moritz authored
-
Cody Yu authored
-
Woosuk Kwon authored
-
- 04 Jun, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 03 Jun, 2024 1 commit
-
-
Tyler Michael Smith authored
-
- 02 Jun, 2024 1 commit
-
-
Divakar Verma authored
This PR enables the fused topk_softmax kernel used in the MoE layer for HIP.
-
- 01 Jun, 2024 2 commits
-
-
chenqianfzh authored
-
Tyler Michael Smith authored
-
- 31 May, 2024 2 commits
-
-
Cody Yu authored
-
Robert Shaw authored
-
- 30 May, 2024 1 commit
-
-
Alexander Matveev authored
-
- 28 May, 2024 1 commit
-
-
Divakar Verma authored
This PR adds Triton kernel configs for the MoE kernel for MI300X.
-
- 27 May, 2024 1 commit
-
-
sasha0552 authored
-
- 25 May, 2024 1 commit
-
-
zhuwenwen authored
-
- 23 May, 2024 3 commits
-
-
Elisei Smirnov authored
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university>
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
Alexander Matveev authored
-
- 22 May, 2024 1 commit
-
-
Cody Yu authored
The 2nd PR for #4532. This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with the .kv_scale parameter).
-
- 19 May, 2024 1 commit
-
-
Alexander Matveev authored
-
- 18 May, 2024 1 commit
-
-
SangBin Cho authored
Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up of https://github.com/vllm-project/vllm/pull/3095/files
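As a rough illustration of "multiple cos-sin caches per scaling factor", the sketch below concatenates one cache per scaling factor and records an offset for each, so a single lookup can serve tokens that belong to different LoRAs. It assumes linear position scaling and uses hypothetical function names; it is a pure-Python sketch, not the kernel added by the PR.

```python
import math


def build_cos_sin_cache(max_pos, rot_dim, base=10000.0, scaling_factor=1.0):
    """Row p holds the cos values followed by the sin values for position p."""
    inv_freq = [1.0 / (base ** (2 * i / rot_dim)) for i in range(rot_dim // 2)]
    cache = []
    # Assumption: linear scaling extends the position range and divides
    # each position by the scaling factor.
    for p in range(int(max_pos * scaling_factor)):
        t = p / scaling_factor
        cache.append([math.cos(t * f) for f in inv_freq]
                     + [math.sin(t * f) for f in inv_freq])
    return cache


def build_batched_cache(max_pos, rot_dim, scaling_factors):
    """Concatenate one cache per scaling factor and remember where each starts."""
    rows, offsets = [], {}
    for s in scaling_factors:
        offsets[s] = len(rows)
        rows.extend(build_cos_sin_cache(max_pos, rot_dim, scaling_factor=s))
    return rows, offsets


def lookup(rows, offsets, positions, token_scaling_factors):
    """One pass over the batch: each token indexes into its own cache region."""
    return [rows[offsets[s] + p]
            for p, s in zip(positions, token_scaling_factors)]
```

Because the per-token offset selects the right cache region, tokens with different scaling factors can share one batched call instead of one call per LoRA.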
-
- 17 May, 2024 1 commit
-
-
Jinzhen Lin authored
-
- 16 May, 2024 3 commits
-
-
Alexander Matveev authored
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
-
Jinzhen Lin authored
-
alexm-nm authored
-
- 15 May, 2024 1 commit
-
-
SangBin Cho authored
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class and allows slicing it when running the attention backend. Finally, it refactors subquery_start_loc, which was not refactored in the previous PR.
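The idea of a single metadata class that can be sliced into prefill and decode parts might look like the following minimal sketch. The class, field, and function names here are hypothetical and chosen for illustration only; they are not taken from the PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AttnMetadataSketch:
    # Hypothetical fields: prefill tokens are stored first, then decode tokens.
    num_prefill_tokens: int
    num_decode_tokens: int
    slot_mapping: List[int]

    def prefill_slice(self) -> List[int]:
        """Portion of the batch that belongs to prefill requests."""
        return self.slot_mapping[:self.num_prefill_tokens]

    def decode_slice(self) -> List[int]:
        """Portion of the batch that belongs to decode requests."""
        return self.slot_mapping[self.num_prefill_tokens:]


def prepare_inputs(prefill_slots: List[int],
                   decode_slots: List[int]) -> AttnMetadataSketch:
    """Single entry point that builds one metadata object for both phases."""
    return AttnMetadataSketch(
        num_prefill_tokens=len(prefill_slots),
        num_decode_tokens=len(decode_slots),
        slot_mapping=list(prefill_slots) + list(decode_slots),
    )
```

An attention backend could then run its prefill path on `prefill_slice()` and its decode path on `decode_slice()` from the same object.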
-