- 20 Jul, 2024 5 commits
- 09 Jul, 2024 1 commit
-
-
huangwb authored
-
- 08 Jul, 2024 1 commit
-
-
zhuwenwen authored
-
- 06 Jul, 2024 1 commit
-
-
zhuwenwen authored
-
- 28 Jun, 2024 1 commit
-
-
zhuwenwen authored
-
- 11 Jun, 2024 1 commit
-
-
Nick Hill authored
-
- 10 Jun, 2024 1 commit
-
-
Dipika Sikka authored
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
- 09 Jun, 2024 1 commit
-
-
bnellnm authored
-
- 08 Jun, 2024 2 commits
-
-
Michael Goin authored
-
Cheng Li authored
Bug description: With torch 2.4.0.dev20240603+cu121, cutlass_fp8_supported returns False, and the (capability, version) pair before the comparison is (90, 11111111112). This PR fixes the FP8 CUTLASS support check (cutlass_fp8_supported), which was introduced in https://github.com/vllm-project/vllm/pull/5183.
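A version value such as 11111111112 can arise when a version string like "12.1" is indexed character by character and then used in arithmetic. The sketch below illustrates that failure mode and one way to avoid it. It is an assumption about the kind of bug involved, not the actual patch; both function names are hypothetical and nothing here calls torch.

```python
def buggy_version_number(cuda_version: str) -> str:
    # Indexing a string yields characters, so "* 10" repeats the first
    # character ten times and "+" concatenates instead of adding.
    return cuda_version[0] * 10 + cuda_version[1]


def parsed_version_number(cuda_version: str) -> int:
    # Split on "." and convert each part to int before doing arithmetic.
    major, minor = cuda_version.split(".")[:2]
    return int(major) * 10 + int(minor)


print(buggy_version_number("12.1"))   # 11111111112
print(parsed_version_number("12.1"))  # 121
```

With the parsed form, a threshold comparison on the version number behaves as intended, because it compares integers rather than a repeated-character string.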
-
- 07 Jun, 2024 3 commits
-
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
Tyler Michael Smith authored
Switch from torch._scaled_mm to vLLM's CUTLASS FP8 kernels when supported, as we are seeing a 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8. See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
-
Antoni Baum authored
-
- 06 Jun, 2024 1 commit
-
-
Philipp Moritz authored
-
- 05 Jun, 2024 4 commits
-
-
Woosuk Kwon authored
-
Philipp Moritz authored
-
Cody Yu authored
-
Woosuk Kwon authored
-
- 04 Jun, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 03 Jun, 2024 1 commit
-
-
Tyler Michael Smith authored
-
- 02 Jun, 2024 1 commit
-
-
Divakar Verma authored
This PR enables the fused topk_softmax kernel used in the MoE layer for HIP.
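For readers unfamiliar with the operation, here is an unfused pure-Python sketch of what a top-k softmax computes for one token's expert logits: a softmax over all experts followed by selection of the k largest weights. The function name and return layout are hypothetical, and the sketch assumes the selected weights are not renormalized; it says nothing about how the fused kernel is implemented.

```python
import math


def topk_softmax_reference(logits, k):
    # Numerically stable softmax over all expert logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k experts with the largest routing weights.
    expert_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Weights are returned as-is (assumption: no renormalization over the top k).
    return [probs[i] for i in expert_ids], expert_ids


weights, expert_ids = topk_softmax_reference([1.0, 3.0, 2.0, 0.0], k=2)
print(expert_ids)  # [1, 2]
```

A fused kernel would typically perform both steps in one pass over the logits to avoid materializing the full probability vector.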
-
- 01 Jun, 2024 2 commits
-
-
chenqianfzh authored
-
Tyler Michael Smith authored
-
- 31 May, 2024 2 commits
-
-
Cody Yu authored
-
Robert Shaw authored
-
- 30 May, 2024 1 commit
-
-
Alexander Matveev authored
-
- 28 May, 2024 1 commit
-
-
Divakar Verma authored
This PR adds Triton kernel configs for the MoE kernel on MI300X.
-
- 27 May, 2024 1 commit
-
-
sasha0552 authored
-
- 25 May, 2024 1 commit
-
-
zhuwenwen authored
-
- 23 May, 2024 3 commits
-
-
Elisei Smirnov authored
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university>
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
Alexander Matveev authored
-
- 22 May, 2024 1 commit
-
-
Cody Yu authored
The second PR for #4532. This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with a .kv_scale parameter).
-
- 19 May, 2024 1 commit
-
-
Alexander Matveev authored
-
- 18 May, 2024 1 commit
-
-
SangBin Cho authored
Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
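The idea of a single call that serves tokens with different scaling factors can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the kernel from the PR: the function names, the linear position scaling, the NeoX-style half-split rotation, and the layout of one concatenated cache indexed by a per-token offset are all hypothetical choices.

```python
import math


def build_cos_sin_cache(rot_dim, max_positions, scaling_factor, base=10000.0):
    # One cache per scaling factor; positions are divided by the factor
    # (assumption: linear scaling).
    inv_freq = [base ** (-2.0 * i / rot_dim) for i in range(rot_dim // 2)]
    cache = []
    for pos in range(max_positions):
        t = pos / scaling_factor
        cache.append(([math.cos(t * f) for f in inv_freq],
                      [math.sin(t * f) for f in inv_freq]))
    return cache


def batched_rotary_embedding(vectors, positions, cache_offsets, combined_cache):
    # Each token carries the offset of its own cache inside combined_cache,
    # so tokens with different scaling factors are handled in one pass.
    rotated = []
    for vec, pos, offset in zip(vectors, positions, cache_offsets):
        cos, sin = combined_cache[offset + pos]
        half = len(vec) // 2
        x1, x2 = vec[:half], vec[half:]
        first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
        second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
        rotated.append(first + second)
    return rotated


# Two caches (scaling factors 1 and 4) concatenated; the second starts at offset 8.
combined = build_cos_sin_cache(4, 8, 1.0) + build_cos_sin_cache(4, 8, 4.0)
query = [1.0, 2.0, 3.0, 4.0]
out = batched_rotary_embedding([query, query], [1, 4], [0, 8], combined)
```

In this sketch, position 4 under scaling factor 4 lands on the same angles as position 1 under scaling factor 1, so the two output rows match.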
-
- 17 May, 2024 1 commit
-
-
Jinzhen Lin authored
-