- 08 Jun, 2024 1 commit
-
-
Cheng Li authored
Bug description: With torch 2.4.0.dev20240603+cu121, cutlass_fp8_supported outputs False, and the (capability, version) pair before the comparison is (90, 11111111112). This PR fixes the support check for FP8 CUTLASS (cutlass_fp8_supported), which was introduced in https://github.com/vllm-project/vllm/pull/5183.
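A value like 11111111112 is what Python produces when a version string such as "12.1" is indexed and multiplied as a string rather than parsed as integers. The sketch below reproduces that symptom and shows one way to parse the version numerically. The function names and the minimum-version thresholds are illustrative assumptions, not the code from the PR.

```python
def buggy_version(cuda_version: str) -> str:
    # Indexing a string yields characters, so "1" * 10 repeats the character
    # ten times instead of multiplying the number 1 by 10.
    return cuda_version[0] * 10 + cuda_version[1]


def parse_version(cuda_version: str) -> int:
    # Split "major.minor" and combine the parts as integers, e.g. "12.1" -> 121.
    major, minor = cuda_version.split(".")[:2]
    return int(major) * 10 + int(minor)


def fp8_cutlass_supported(capability: tuple, cuda_version: str) -> bool:
    # Hypothetical support check. The thresholds below are assumptions
    # chosen for illustration only.
    cap = capability[0] * 10 + capability[1]
    version = parse_version(cuda_version)
    if cap >= 90:
        return version >= 120
    if cap >= 89:
        return version >= 124
    return False


print(buggy_version("12.1"))
print(parse_version("12.1"))
print(fp8_cutlass_supported((9, 0), "12.1"))
```

In a real check, the capability tuple would come from torch.cuda.get_device_capability() and the version string from torch.version.cuda.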
-
- 07 Jun, 2024 3 commits
-
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
Tyler Michael Smith authored
Switch from torch._scaled_mm to vLLM's CUTLASS FP8 kernels when supported, as we are seeing a 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8. See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
-
Antoni Baum authored
-
- 06 Jun, 2024 1 commit
-
-
Philipp Moritz authored
-
- 05 Jun, 2024 4 commits
-
-
Woosuk Kwon authored
-
Philipp Moritz authored
-
Cody Yu authored
-
Woosuk Kwon authored
-
- 04 Jun, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 03 Jun, 2024 1 commit
-
-
Tyler Michael Smith authored
-
- 02 Jun, 2024 1 commit
-
-
Divakar Verma authored
This PR enables the fused topk_softmax kernel used in the MoE layer for HIP.
-
- 01 Jun, 2024 2 commits
-
-
chenqianfzh authored
-
Tyler Michael Smith authored
-
- 31 May, 2024 2 commits
-
-
Cody Yu authored
-
Robert Shaw authored
-
- 30 May, 2024 1 commit
-
-
Alexander Matveev authored
-
- 28 May, 2024 1 commit
-
-
Divakar Verma authored
This PR adds Triton kernel configs for the MoE kernel on MI300X.
-
- 27 May, 2024 1 commit
-
-
sasha0552 authored
-
- 23 May, 2024 3 commits
-
-
Elisei Smirnov authored
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university>
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
Alexander Matveev authored
-
- 22 May, 2024 1 commit
-
-
Cody Yu authored
The second PR for #4532. This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (via the .kv_scale parameter).
-
- 19 May, 2024 1 commit
-
-
Alexander Matveev authored
-
- 18 May, 2024 1 commit
-
-
SangBin Cho authored
Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
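The idea of one rotary layer serving several cos-sin caches can be sketched in plain Python: concatenate one cache per scaling factor and give each token an offset into the combined table, so a single lookup pass handles a mixed batch. This is a minimal sketch assuming linear position scaling. All function names here are hypothetical, and the real kernel operates on GPU tensors.

```python
import math


def build_cos_sin_cache(max_pos, rot_dim, base=10000.0, scaling=1.0):
    # One row per position: [cos values..., sin values...].
    # Assumes linear scaling: positions are divided by the scaling factor
    # and the table is extended to max_pos * scaling rows.
    inv_freq = [1.0 / (base ** (2 * i / rot_dim)) for i in range(rot_dim // 2)]
    rows = []
    for pos in range(int(max_pos * scaling)):
        angles = [(pos / scaling) * f for f in inv_freq]
        rows.append([math.cos(a) for a in angles] + [math.sin(a) for a in angles])
    return rows


def build_batched_cache(max_pos, rot_dim, scaling_factors):
    # Concatenate the per-factor caches and record where each one starts.
    cache, offsets = [], {}
    for s in scaling_factors:
        offsets[s] = len(cache)
        cache.extend(build_cos_sin_cache(max_pos, rot_dim, scaling=s))
    return cache, offsets


def lookup(cache, offsets, positions, token_scalings):
    # A single pass over the batch: each token indexes the shared table
    # through the offset of its own scaling factor.
    return [cache[offsets[s] + p] for p, s in zip(positions, token_scalings)]


cache, offsets = build_batched_cache(max_pos=4, rot_dim=4, scaling_factors=[1.0, 2.0])
rows = lookup(cache, offsets, positions=[1, 2], token_scalings=[1.0, 2.0])
```

In this sketch, position 2 under scaling factor 2.0 maps to the same angles as position 1 under factor 1.0, which is the expected behavior of linear scaling.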
-
- 17 May, 2024 1 commit
-
-
Jinzhen Lin authored
-
- 16 May, 2024 3 commits
-
-
Alexander Matveev authored
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
-
Jinzhen Lin authored
-
alexm-nm authored
-
- 15 May, 2024 1 commit
-
-
SangBin Cho authored
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)
This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class and allows slicing it when running the attention backend. It also refactors subquery_start_loc, which was not refactored in the previous PR.
-
- 13 May, 2024 1 commit
-
-
Swapnil Parekh authored
-
- 11 May, 2024 1 commit
-
-
Chang Su authored
-
- 09 May, 2024 2 commits
-
-
Philipp Moritz authored
This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)). We noticed that CUBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16. So this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.
Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4). All FP8 measurements are for dynamic quantization:
qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
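The padding idea can be illustrated with a small pure-Python sketch: pad the input with zero rows until its first dimension exceeds 16, run the matrix multiply, then slice the result back to the original row count. The helper names and the choice of 17 as the minimum are assumptions for illustration. The actual PR applies the padding during FP8 quantization on GPU tensors.

```python
MIN_ROWS = 17  # assumption: the smallest row count that is "greater than 16"


def pad_rows(matrix, min_rows=MIN_ROWS):
    # Append zero rows until the matrix has at least min_rows rows.
    n_cols = len(matrix[0])
    missing = max(0, min_rows - len(matrix))
    return matrix + [[0.0] * n_cols for _ in range(missing)]


def matmul(a, b):
    # Plain triple-loop matrix multiply, standing in for the GEMM call.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def padded_matmul(a, b, min_rows=MIN_ROWS):
    # Zero rows in the input give zero rows in the output, so slicing off
    # the extra rows recovers exactly the unpadded result.
    original_rows = len(a)
    out = matmul(pad_rows(a, min_rows), b)
    return out[:original_rows]


a = [[1.0, 2.0]]
b = [[3.0], [4.0]]
result = padded_matmul(a, b)
```

Padding leaves the numerical result unchanged. The extra rows only change the problem shape that the GEMM library sees when it selects an algorithm.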
-
Hao Zhang authored
Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com>
Co-authored-by: Aurick Qiao <qiao@aurick.net>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
- 08 May, 2024 3 commits
-
-
Cody Yu authored
Co-authored-by: Cade Daniel <edacih@gmail.com>
-
SangBin Cho authored
-
SangBin Cho authored
-
- 03 May, 2024 2 commits
-
-
Cade Daniel authored
-
SangBin Cho authored
-
- 02 May, 2024 1 commit
-
-
alexm-nm authored
-