Commits · b9e12416e56a1b2da86bbb7612b8789790a2b645 · OpenDAS / vllm_cscc

25 May, 2024 1 commit
- skip fp8 · f09d77ac
  zhuwenwen authored May 25, 2024
  
  f09d77ac
23 May, 2024 1 commit

[Kernel] Initial Activation Quantization Support (#4525) · a1242324

Dipika Sikka authored May 23, 2024


Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

a1242324

22 May, 2024 1 commit
- [CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722) · 5f6d10c1
  Michael Goin authored May 22, 2024
  
  5f6d10c1
16 May, 2024 2 commits
- [Kernel] Add w8a8 CUTLASS kernels (#4749) · 2060e936
  Tyler Michael Smith authored May 16, 2024
  
  2060e936
- Add GPTQ Marlin 2:4 sparse structured support (#4790) · 6979ade3
  Alexander Matveev authored May 16, 2024
  Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
  6979ade3
03 May, 2024 1 commit
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
  Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
  43c413ec
29 Apr, 2024 1 commit
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922) · 73c8d677
  Robert Shaw authored Apr 29, 2024
  Co-authored-by: alexm <alexm@neuralmagic.com>
  Co-authored-by: mgoin <michael@neuralmagic.com>
  73c8d677
27 Apr, 2024 1 commit
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) · 12628d3c
  Philipp Moritz authored Apr 26, 2024
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
  12628d3c
24 Apr, 2024 1 commit

[Kernel] FP8 support for MoE kernel / Mixtral (#4244) · eace8bf0

Philipp Moritz authored Apr 23, 2024

This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users need neither compute activation scales on a calibration dataset nor convert their model checkpoints. Specifying the `quantization="fp8"` argument is enough. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
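
The dynamic per-tensor scaling described above can be illustrated with a minimal pure-Python sketch. It is not the vLLM kernel: the helper names (`round_to_e4m3`, `quantize_dynamic_per_tensor`, `dequantize`) are hypothetical, and FP8 rounding is simulated in software, assuming the E4M3 format with a maximum finite value of 448. The point is that the scale is derived from the tensor's own maximum magnitude at run time, which is why no calibration dataset is needed.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the e4m3fn format


def round_to_e4m3(x: float) -> float:
    """Simulate rounding to the nearest FP8 E4M3 value (software model only)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)  # saturate instead of overflowing
    _, exp = math.frexp(mag)  # mag = m * 2**exp with 0.5 <= m < 1
    # 4 significant bits (1 implicit + 3 mantissa); subnormal spacing is 2**-9.
    quantum = 2.0 ** max(exp - 4, -9)
    return sign * min(round(mag / quantum) * quantum, FP8_E4M3_MAX)


def quantize_dynamic_per_tensor(values):
    """Return (quantized values, scale) using one scale for the whole tensor."""
    # The scale comes from the data itself, so nothing is calibrated offline.
    amax = max((abs(v) for v in values), default=0.0)
    scale = max(amax, 1e-12) / FP8_E4M3_MAX
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]


activations = [0.5, -2.0, 4.0, 0.013]
q, scale = quantize_dynamic_per_tensor(activations)
restored = dequantize(q, scale)
for original, approx in zip(activations, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

A static-scale variant would replace the `amax` computation with a precomputed constant; the dynamic form trades a reduction over the tensor at run time for not needing that constant.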

**Performance**: This PR focuses on keeping the code clean while still achieving reasonable performance. A number of optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954) will be submitted in a follow-up PR. With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the fp16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```
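
To make the comparison concrete, the per-group differences can be computed directly from the two tables above. This is only an arithmetic sketch over the reported accuracy values; the variable names are illustrative.

```python
# Accuracy values copied from the FP8 and fp16 tables above.
fp8 = {
    "mmlu": 0.7018,
    "humanities": 0.6472,
    "other": 0.7673,
    "social_sciences": 0.8099,
    "stem": 0.6131,
}
fp16 = {
    "mmlu": 0.7020,
    "humanities": 0.6425,
    "other": 0.7744,
    "social_sciences": 0.8131,
    "stem": 0.6108,
}

# Positive delta means FP8 scored higher than fp16 for that group.
deltas = {group: round(fp8[group] - fp16[group], 4) for group in fp8}
for group, delta in deltas.items():
    print(f"{group:16s} fp8 - fp16 = {delta:+.4f}")

max_gap = max(abs(d) for d in deltas.values())
print(f"largest absolute gap: {max_gap:.4f}")
```

On these numbers, every group is within one accuracy point of the fp16 result.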

Happy hacking!

eace8bf0

23 Apr, 2024 1 commit
- AQLM CUDA support (#3287) · 2b7949c1
  James Fleming authored Apr 23, 2024
  Co-authored-by: mgoin <michael@neuralmagic.com>
  2b7949c1
03 Apr, 2024 1 commit

Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5

Adrian Abeyta authored Apr 03, 2024


Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

2ff767b5

13 Mar, 2024 2 commits
- Add batched RoPE kernel (#3095) · 7e9bd08f
  Terry authored Mar 13, 2024
  
  7e9bd08f
- Add kernel for GeGLU with approximate GELU (#3337) · 602358f8
  Woosuk Kwon authored Mar 12, 2024
  
  602358f8
01 Mar, 2024 1 commit

Integrate Marlin Kernels for Int4 GPTQ inference (#2497) · c0c2335c

Robert Shaw authored Mar 01, 2024


Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>

c0c2335c

26 Feb, 2024 1 commit
- [Minor] Remove gather_cached_kv kernel (#3043) · d6e4a130
  Woosuk Kwon authored Feb 26, 2024
  
  d6e4a130
22 Feb, 2024 1 commit
- Optimize GeGLU layer in Gemma (#2975) · fd5dcc5c
  Woosuk Kwon authored Feb 21, 2024
  
  fd5dcc5c
06 Feb, 2024 1 commit
- Add fused top-K softmax kernel for MoE (#2769) · f0d4e145
  Woosuk Kwon authored Feb 05, 2024
  
  f0d4e145
30 Jan, 2024 2 commits
- Fused MOE for Mixtral (#2542) · ab406446
  Philipp Moritz authored Jan 29, 2024
  Co-authored-by: chen shen <scv119@gmail.com>
  ab406446
- DeepseekMoE support with Fused MoE kernel (#2453) · 5d60def0
  wangding zeng authored Jan 30, 2024
  Co-authored-by: roy <jasonailu87@gmail.com>
  5d60def0
29 Jan, 2024 1 commit

Support FP8-E5M2 KV Cache (#2279) · 9090bf02

zhaoyang-star authored Jan 29, 2024


Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>

9090bf02

27 Jan, 2024 2 commits
- Implement custom all reduce kernels (#2192) · 38017003
  Hanzhi Zhou authored Jan 28, 2024
  
  38017003
- AWQ: Up to 2.66x higher throughput (#2566) · beb89f68
  Casper authored Jan 27, 2024
  
  beb89f68
26 Jan, 2024 1 commit
- [ROCm] add support to ROCm 6.0 and MI300 (#2274) · 6b7de1a0
  Hongxia Yang authored Jan 26, 2024
  
  6b7de1a0
15 Dec, 2023 1 commit
- Add GPTQ support (#916) · 0fbfc4b8
  CHU Tianxiang authored Dec 15, 2023
  
  0fbfc4b8
08 Dec, 2023 1 commit

Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836) · 6ccc0bff

TJian authored Dec 08, 2023


Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>

6ccc0bff

24 Nov, 2023 1 commit
- [Build] Avoid building too many extensions (#1624) · e0c6f556
  Yanming W authored Nov 23, 2023
  
  e0c6f556