Commits · f48954a4beb1c0507132d97554358ba00fd83d0a · OpenDAS / vllm_cscc

09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
  5467ac31
07 Jun, 2024 1 commit

[Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b

Dipika Sikka authored Jun 07, 2024


Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

ca3ea51b

05 Jun, 2024 1 commit
- [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size (#5157) · ccd4f129
  Tyler Michael Smith authored Jun 05, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  ccd4f129
03 Jun, 2024 1 commit
- [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) · cbb2f59c
  Tyler Michael Smith authored Jun 03, 2024
  
  cbb2f59c
01 Jun, 2024 3 commits
- [Kernel] Update Cutlass fp8 configs (#5144) · f081c3ce
  Varun Sundar Rabindranath authored Jun 01, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  f081c3ce
- [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) · 260d119e
  Tyler Michael Smith authored Jun 01, 2024
  
  260d119e
- [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels (#5168) · 1197e021
  Tyler Michael Smith authored May 31, 2024
  
  1197e021
31 May, 2024 2 commits

Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the... · e9d3aa04

Simon Mo authored May 31, 2024

Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" (#5149)

e9d3aa04

[Kernel] Marlin_24: Ensure the mma.sp instruction is using the... · 6d21fa1c

Alexander Matveev authored May 30, 2024

[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (#5136)

6d21fa1c

23 May, 2024 2 commits
- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  a1242324
- Marlin 24 prefill performance improvement (about 25% better on average) (#4983) · 60662532
  Alexander Matveev authored May 23, 2024
  
  60662532
22 May, 2024 2 commits
- [Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954) · 8674f988
  Tyler Michael Smith authored May 22, 2024
```
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
```
  8674f988
- [CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722) · 5f6d10c1
  Michael Goin authored May 22, 2024
  
  5f6d10c1
20 May, 2024 1 commit
- Remove marlin warning (#4918) · da5a0b53
  Alexander Matveev authored May 20, 2024
  
  da5a0b53
16 May, 2024 3 commits
- [Kernel] Add w8a8 CUTLASS kernels (#4749) · 2060e936
  Tyler Michael Smith authored May 16, 2024
  
  2060e936
- Add GPTQ Marlin 2:4 sparse structured support (#4790) · 6979ade3
  Alexander Matveev authored May 16, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
  6979ade3
- [Kernel] add bfloat16 support for gptq marlin kernel (#4788) · 99caa491
  Jinzhen Lin authored May 16, 2024
  
  99caa491
10 May, 2024 1 commit
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) · c8331017
  Cody Yu authored May 09, 2024
  
  c8331017
09 May, 2024 1 commit
- [Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (#4626) · e288df06
  alexm-nm authored May 08, 2024
  
  e288df06
07 May, 2024 1 commit

[Kernel] Make static FP8 scaling more robust (#4570) · a98187cf

Philipp Moritz authored May 06, 2024

Previously FP8 static scaling works if the scales are overestimating the maxima of all activation tensors during computation. However this will not always be the case even if the scales were calibrated very carefully. For example, with the activations in my checkpoint

https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale

(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|
With the fix in this PR where the scaled activations are clamped between [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|
This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.

a98187cf

02 May, 2024 1 commit
- [Kernel] Support running GPTQ 8-bit models in Marlin (#4533) · 7038e8b8
  alexm-nm authored May 02, 2024
  
  7038e8b8
29 Apr, 2024 1 commit
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922) · 73c8d677
  Robert Shaw authored Apr 29, 2024
```
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  73c8d677
27 Apr, 2024 1 commit
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) · 12628d3c
  Philipp Moritz authored Apr 26, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  12628d3c
24 Apr, 2024 2 commits

[Bugfix] Fix marlin kernel crash on H100 (#4218) · aae08249

alexm-nm authored Apr 24, 2024

This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187.
The reason for the crash was the inline PTX assembly that introduced the async_copy with streaming behavior. The solution is to use the more standard PTX for async_copy (without the fractional L2 policy for "evict_first"). There is no performance difference between standard async_copy PTX and the previous one.

aae08249

[Kernel] FP8 support for MoE kernel / Mixtral (#4244) · eace8bf0

Philipp Moritz authored Apr 23, 2024

This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**Performance**: For this PR, the focus is on making the code clean (while still trying to get reasonable performance), there is a bunch of optimizations that we will submit as a follow up PR that significantly improve the performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954). With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
this compares favorably with the fp16 results which are
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!

eace8bf0

23 Apr, 2024 1 commit
- AQLM CUDA support (#3287) · 2b7949c1
  James Fleming authored Apr 23, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  2b7949c1
11 Apr, 2024 1 commit
- [Core] Set `linear_weights` directly on the layer (#3977) · a10d3056
  Antoni Baum authored Apr 11, 2024
  
  a10d3056
03 Apr, 2024 1 commit

Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5

Adrian Abeyta authored Apr 03, 2024


Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

2ff767b5

23 Mar, 2024 1 commit
- use half atomicAdd of dtk24.04 · 3f1166ab
  zhuwenwen authored Mar 23, 2024
  
  3f1166ab
01 Mar, 2024 1 commit

Integrate Marlin Kernels for Int4 GPTQ inference (#2497) · c0c2335c

Robert Shaw authored Mar 01, 2024


Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>

c0c2335c

29 Feb, 2024 1 commit
- Add Support for 2/3/8-bit GPTQ Quantization Models (#2330) · 01a5d18a
  CHU Tianxiang authored Feb 29, 2024
  
  01a5d18a
12 Feb, 2024 1 commit
- Refactor 2 awq gemm kernels into m16nXk32 (#2723) · 56383649
  Rex authored Feb 12, 2024
```
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>
```
  56383649
01 Feb, 2024 2 commits
- Fix compile error when using rocm (#2648) · 923797fe
  zhaoyang-star authored Feb 02, 2024
  
  923797fe
- skip fp8 · 318e2b5a
  zhuwenwen authored Feb 01, 2024
  
  318e2b5a
29 Jan, 2024 1 commit

Support FP8-E5M2 KV Cache (#2279) · 9090bf02

zhaoyang-star authored Jan 29, 2024


Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>

9090bf02

27 Jan, 2024 1 commit
- AWQ: Up to 2.66x higher throughput (#2566) · beb89f68
  Casper authored Jan 27, 2024
  
  beb89f68
03 Jan, 2024 2 commits
- Enable CUDA graph for GPTQ & SqueezeLLM (#2318) · 6ef00b03
  Woosuk Kwon authored Jan 03, 2024
  
  6ef00b03
- [FIX] Support non-zero CUDA devices in custom kernels (#1959) · 77af974b
  Jee Li authored Jan 03, 2024
  
  77af974b
18 Dec, 2023 1 commit
- [ROCm] Fixes for GPTQ on ROCm (#2180) · 1b7c791d
  kliuae authored Dec 19, 2023
  
  1b7c791d
15 Dec, 2023 1 commit
- Add GPTQ support (#916) · 0fbfc4b8
  CHU Tianxiang authored Dec 15, 2023
  
  0fbfc4b8