Commits · 2a602b055a180c982126e8a438d188325fdb01a5 · OpenDAS / vllm_cscc

14 Mar, 2025 1 commit
- forward fix PR 14245, restore build on ROCm 6.2 (#14709) · 2a602b05
  Jeff Daily authored Mar 13, 2025
```
Signed-off-by: Jeff Daily <jeff.daily@amd.com>
```
  2a602b05
11 Mar, 2025 1 commit
- dynamic distpatch of fp8 kernels (#14245) · a1c8f379
  Jeff Daily authored Mar 11, 2025
```
Signed-off-by: Jeff Daily <jeff.daily@amd.com>
```
  a1c8f379
27 Feb, 2025 1 commit
- [ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined (#13851) · a31614e3
  ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 authored Feb 27, 2025
```
Signed-off-by: Hollow Man <hollowman@opensuse.org>
```
  a31614e3
25 Feb, 2025 1 commit
- [ROCm][Quantization][Kernel] Using HIP FP8 header (#12593) · aabeb268
  Gregory Shtrasberg authored Feb 25, 2025
  
  aabeb268
20 Feb, 2025 1 commit
- [ROCm] MI300A compile targets deprecation (#13560) · 0023cd2b
  Gregory Shtrasberg authored Feb 20, 2025
  
  0023cd2b
13 Dec, 2024 1 commit

[torch.compile] Dynamic fp8 + rms_norm fusion (#10906) · 30870b4f

Luka Govedič authored Dec 12, 2024


Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

30870b4f

08 Nov, 2024 1 commit
- [torch.compile] Fuse RMSNorm with quant (#9138) · 4f93dfe9
  Luka Govedič authored Nov 08, 2024
```
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@126.com>
```
  4f93dfe9
16 Oct, 2024 1 commit
- [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel (#9425) · c3fab5f7
  Tyler Michael Smith authored Oct 16, 2024
  
  c3fab5f7
04 Oct, 2024 1 commit
- [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (#8845) · aeb37c2a
  Lucas Wilkinson authored Oct 03, 2024
  
  aeb37c2a
22 Aug, 2024 1 commit
- [Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce` (#7233) · 7937009a
  Luka Govedič authored Aug 21, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  7937009a
16 Aug, 2024 1 commit
- [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) · e837b624
  Charlie Fu authored Aug 16, 2024
  
  e837b624
05 Aug, 2024 1 commit
- [CI/Build] Suppress divide-by-zero and missing return statement warnings (#7001) · 6e4852ce
  Tyler Michael Smith authored Aug 05, 2024
  
  6e4852ce
30 Jul, 2024 1 commit
- [Kernel] Squash a few more warnings (#6914) · cbbc9044
  Tyler Michael Smith authored Jul 30, 2024
  
  cbbc9044
26 Jul, 2024 1 commit
- [Bugfix][Kernel] Promote another index to int64_t (#6838) · 50704f52
  Tyler Michael Smith authored Jul 26, 2024
  
  50704f52
22 Jul, 2024 1 commit
- [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649) · fea59c77
  Tyler Michael Smith authored Jul 22, 2024
  
  fea59c77
21 Jul, 2024 1 commit
- [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) · 396d92d5
  Alexander Matveev authored Jul 21, 2024
  
  396d92d5
20 Jul, 2024 1 commit
- [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593) · 2e265642
  Varun Sundar Rabindranath authored Jul 19, 2024
```
Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com>
```
  2e265642
18 Jul, 2024 1 commit
- [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511) · b5241e41
  Varun Sundar Rabindranath authored Jul 17, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  b5241e41
03 Jul, 2024 1 commit
- [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) · 47f0954a
  Michael Goin authored Jul 03, 2024
  
  47f0954a
12 Jun, 2024 1 commit

[Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342

Cody Yu authored Jun 12, 2024

Inspired by #5146, this PR improves FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmark shows that this improved kernel can achieve 1.0x-1.5x speedup (especially when hidden size is large).

In details, we applied 3 optimizations:

- Use inverted scale so that most divisions are changed to multiplications.
- Unroll the loop by 4 times to improve ILP.
- Use vectorized 4 to transfer data between HBM and SRAM.

5985e342

09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
  5467ac31
22 May, 2024 1 commit
- [CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722) · 5f6d10c1
  Michael Goin authored May 22, 2024
  
  5f6d10c1
10 May, 2024 1 commit
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) · c8331017
  Cody Yu authored May 09, 2024
  
  c8331017
07 May, 2024 1 commit

[Kernel] Make static FP8 scaling more robust (#4570) · a98187cf

Philipp Moritz authored May 06, 2024

Previously FP8 static scaling works if the scales are overestimating the maxima of all activation tensors during computation. However this will not always be the case even if the scales were calibrated very carefully. For example, with the activations in my checkpoint

https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale

(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|
With the fix in this PR where the scaled activations are clamped between [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|
This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.

a98187cf

27 Apr, 2024 1 commit
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) · 12628d3c
  Philipp Moritz authored Apr 26, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  12628d3c
24 Apr, 2024 1 commit

[Kernel] FP8 support for MoE kernel / Mixtral (#4244) · eace8bf0

Philipp Moritz authored Apr 23, 2024

This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**Performance**: For this PR, the focus is on making the code clean (while still trying to get reasonable performance), there is a bunch of optimizations that we will submit as a follow up PR that significantly improve the performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954). With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
this compares favorably with the fp16 results which are
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!

eace8bf0

03 Apr, 2024 1 commit

Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5

Adrian Abeyta authored Apr 03, 2024


Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

2ff767b5