Commits · 614aa5120303ab09be78fb1db669da198cc43b02 · OpenDAS / vllm_cscc

01 Jul, 2024 1 commit
- [misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) · 614aa512
  youkaichao authored Jun 30, 2024
  
  614aa512
30 Jun, 2024 1 commit
- [ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (#5940) · af9ad46f
  Robert Shaw authored Jun 30, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
  af9ad46f
28 Jun, 2024 1 commit
- [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (#5921) · 2cd402e1
  Robert Shaw authored Jun 28, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
  2cd402e1
20 Jun, 2024 1 commit
- [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715) · 3f3b6b21
  Tyler Michael Smith authored Jun 20, 2024
  
  3f3b6b21
14 Jun, 2024 1 commit
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) · 703475f6
  Tyler Michael Smith authored Jun 14, 2024
  
  703475f6
13 Jun, 2024 2 commits

[Kernel] Disable CUTLASS kernels for fp8 (#5505) · e38042d4
Tyler Michael Smith authored Jun 13, 2024

e38042d4

[Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56

Tyler Michael Smith authored Jun 13, 2024


Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

85657b56

08 Jun, 2024 2 commits

[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353) · c09dade2
Michael Goin authored Jun 08, 2024

c09dade2

[Bug Fix] Fix the support check for FP8 CUTLASS (#5352) · e69ded7d

Cheng Li authored Jun 07, 2024

Bug description:
With torch 2.4.0.dev20240603+cu121,
cutlass_fp8_supported outputs False, and the (capability, version) before the comparison is (90, 11111111112)

This PR fixes the support check for FP8 CUTLASS ( cutlass_fp8_supported) which was introduced in https://github.com/vllm-project/vllm/pull/5183.

e69ded7d

07 Jun, 2024 1 commit

[Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183) · 8d75fe48

Tyler Michael Smith authored Jun 07, 2024

Switching from torch._scaled_mm to vLLM's cutlass fp8 kernels when supported as we are seeing 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8

see https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.

8d75fe48

05 Jun, 2024 1 commit
- [Model] Correct Mixtral FP8 checkpoint loading (#5231) · 5563a4de
  Cody Yu authored Jun 05, 2024
  
  5563a4de
22 May, 2024 1 commit

[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0

Cody Yu authored May 22, 2024

The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).

a3a73ab0

09 May, 2024 1 commit

[Kernel] [FP8] Improve FP8 linear layer performance (#4691) · 379da6dc

Philipp Moritz authored May 09, 2024

This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)).

We noticed that CUBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16. So this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.

Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4), all FP8 measurements are for dynamic quantization:

qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)

379da6dc

30 Apr, 2024 1 commit

[Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332) · 111815d4

Robert Shaw authored Apr 30, 2024


Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

111815d4

27 Apr, 2024 1 commit
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) · 12628d3c
  Philipp Moritz authored Apr 26, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  12628d3c
26 Apr, 2024 1 commit
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
  a62aaf1d
24 Apr, 2024 1 commit
- [BUG] fixed fp8 conflict with aqlm (#4307) · 79a268c4
  Robert Shaw authored Apr 23, 2024
```
Fixes fp8 iterface which broke in AQLM merge.
```
  79a268c4
20 Apr, 2024 2 commits

Updating lm-format-enforcer version and adding links to decoding libraries in docs (#4222) · cc74b2b2
Noam Gat authored Apr 20, 2024

cc74b2b2

[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3

Cody Yu authored Apr 19, 2024

Provide an initial support to FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.

Initial Results:
Currently tested Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s
I'll try to use larger models and try to find more performance bottleneck. Meanwhile, you're welcome to try this code.

a22cdea3