Commits · 1591c68fdea97a213d5564f687009c4fd1b44608 · OpenDAS / vllm_cscc

04 May, 2024 1 commit

[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with... · 2a052011

Michael Goin authored May 04, 2024

[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

Follow-up to #4332 that enables FP8 checkpoint loading for Mixtral; supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

- Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Supports static or dynamic activation quantization with static weight quantization (all per-tensor)
- Supports different scales for each expert weight
- Supports FP8 in the QKV layer

Notes:

- The expert gate/router always runs at half or full precision for now.
- If the QKV layer has different weight scales (for checkpoints with separate Q, K, and V weights), the weights are re-quantized using `layer.weight_scale.max()` so that a single GEMM can be used for performance.
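
The re-quantization note can be illustrated with a minimal sketch. It is not the code from this PR: plain Python floats stand in for FP8 tensors, FP8 rounding is not modeled, and every name below (`quantize`, `dequantize`, `requantize_with_max_scale`) is hypothetical. The only assumption carried over from the commit message is that the shards end up sharing the largest of their per-tensor scales.

```python
# Sketch only: plain floats stand in for FP8 tensors and FP8 rounding is
# not modeled. All function names are illustrative, not vLLM APIs.

FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def quantize(values, scale):
    """Per-tensor quantization: divide by the scale, clamp to the FP8 range."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


def dequantize(qvalues, scale):
    """Inverse of quantize (up to the rounding this sketch ignores)."""
    return [q * scale for q in qvalues]


def requantize_with_max_scale(shards, scales):
    """Re-express every shard with the single largest scale.

    shards: quantized shards (e.g. Q, K, V), each a list of floats.
    scales: one per-tensor scale per shard.
    Returns (requantized_shards, max_scale).
    """
    max_scale = max(scales)
    requantized = [
        quantize(dequantize(shard, scale), max_scale)
        for shard, scale in zip(shards, scales)
    ]
    return requantized, max_scale


# Three toy shards, each quantized with its own per-tensor scale.
q, k, v = [448.0, -224.0], [100.0, 50.0], [-448.0, 10.0]
scales = [0.01, 0.02, 0.005]
fused, fused_scale = requantize_with_max_scale([q, k, v], scales)
```

Choosing the maximum scale means every re-quantized value shrinks or stays the same in magnitude, so nothing is clamped. In real FP8, shards that originally had a smaller scale lose some resolution.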

2a052011

01 May, 2024 1 commit
- [Misc] Remove Mixtral device="cuda" declarations (#4543) · c9d852d6
  Philipp Moritz authored May 01, 2024
```
Remove the device="cuda" declarations in mixtral as promised in #4343
```
  c9d852d6
27 Apr, 2024 2 commits
- [BugFix] Resolved Issues For LinearMethod --> QuantConfig (#4418) · 4ea1f967
  Robert Shaw authored Apr 27, 2024
  
  4ea1f967
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) · 12628d3c
  Philipp Moritz authored Apr 26, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  12628d3c
26 Apr, 2024 2 commits
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
  a62aaf1d
- [CI] Disable non-lazy string operation on logging (#4326) · a88081bf
  SangBin Cho authored Apr 26, 2024
```
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
```
  a88081bf
25 Apr, 2024 2 commits
- [Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 (#4324) · fbf152d9
  Isotr0py authored Apr 26, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  fbf152d9
- [Model] Adds Phi-3 support (#4298) · 96e90fde
  Caio Mendes authored Apr 25, 2024
  
  96e90fde
24 Apr, 2024 1 commit

[Kernel] FP8 support for MoE kernel / Mixtral (#4244) · eace8bf0

Philipp Moritz authored Apr 23, 2024

This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users need neither compute activation scales on a calibration dataset nor convert their model checkpoints. Specifying the `quantization="fp8"` argument is enough. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
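
The idea behind dynamic per-tensor scaling can be sketched in a few lines. This is an illustration under stated assumptions, not the kernel from this PR: it assumes the scale is derived at runtime from the tensor's absolute maximum and the float8 e4m3fn maximum of 448, uses plain Python floats, and ignores FP8 rounding. The helper names are hypothetical.

```python
# Sketch only: shows why no calibration data is needed, since the scale is
# computed from the tensor itself. Not the actual vLLM implementation.

FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def dynamic_per_tensor_scale(values):
    """Derive one scale for the whole tensor from its absolute maximum."""
    amax = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero tensor, which would give a zero scale.
    return max(amax, 1e-12) / FP8_E4M3_MAX


def quantize_dynamic(values):
    """Scale the tensor so that its largest magnitude maps to the FP8 maximum."""
    scale = dynamic_per_tensor_scale(values)
    qvalues = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
    return qvalues, scale


activations = [0.5, -2.0, 1.25]
qvalues, scale = quantize_dynamic(activations)
restored = [qv * scale for qv in qvalues]
```

Static scaling would instead reuse a scale stored in the checkpoint, which is what requires calibration ahead of time.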

**Performance**: This PR focuses on keeping the code clean while still getting reasonable performance. A number of optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954) will be submitted in a follow-up PR. With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the fp16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```
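
To make the comparison concrete, the per-group accuracy differences can be computed directly from the two tables. The snippet below only restates the reported `acc` values; the variable names are illustrative.

```python
# Accuracy values copied from the fp8 and fp16 MMLU tables in this commit message.
fp8_acc = {
    "mmlu": 0.7018,
    "humanities": 0.6472,
    "other": 0.7673,
    "social_sciences": 0.8099,
    "stem": 0.6131,
}
fp16_acc = {
    "mmlu": 0.7020,
    "humanities": 0.6425,
    "other": 0.7744,
    "social_sciences": 0.8131,
    "stem": 0.6108,
}

# Positive delta means fp8 scored higher than fp16 for that group.
deltas = {group: round(fp8_acc[group] - fp16_acc[group], 4) for group in fp8_acc}
largest_gap = max(deltas, key=lambda group: abs(deltas[group]))

for group, delta in deltas.items():
    print(f"{group:16s} {delta:+.4f}")
```

On these numbers the overall MMLU score differs by 0.0002, and the largest per-group gap is in `other`.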

Happy hacking!

eace8bf0

16 Apr, 2024 1 commit
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024
  
  69e1d2fb
11 Apr, 2024 1 commit
- [Core][Model] torch.compile for layernorm in commandr (#3985) · caada5e5
  youkaichao authored Apr 10, 2024
```
[Core][Model] Use torch.compile to accelerate layernorm in commandr (#3985)
```
  caada5e5
10 Apr, 2024 1 commit

[Core][Refactor] move parallel_utils into vllm/distributed (#3950) · 63e7176f

youkaichao authored Apr 10, 2024

[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)

63e7176f

09 Apr, 2024 1 commit
- [Bugfix] Fix KeyError on loading GPT-NeoX (#3925) · e23a43ae
  Junichi Sato authored Apr 10, 2024
  
  e23a43ae
08 Apr, 2024 4 commits
- [BugFix][Model] Fix commandr RoPE max_position_embeddings (#3919) · d036198e
  Roy authored Apr 09, 2024
  
  d036198e
- [Bugfix] Enable Proper `attention_bias` Usage in Llama Model Configuration (#3767) · bc0c0192
  Kiran R authored Apr 09, 2024
```
Co-authored-by: roy <jasonailu87@gmail.com>
```
  bc0c0192
- [Bugfix] Added Command-R GPTQ support (#3849) · f46864d6
  egortolmachev authored Apr 08, 2024
```
Co-authored-by: Egor Tolmachev <t333ga@gmail.com>
```
  f46864d6
- [Model] add minicpm (#3893) · b4543c8f
  ywfang authored Apr 08, 2024
  
  b4543c8f
07 Apr, 2024 1 commit
- [Core] enable out-of-tree model register (#3871) · 95baec82
  youkaichao authored Apr 06, 2024
  
  95baec82
05 Apr, 2024 1 commit
- [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism (#3869) · 54951ac4
  Isotr0py authored Apr 06, 2024
  
  54951ac4
04 Apr, 2024 1 commit
- [Model] Cohere CommandR+ (#3829) · 9117f892
  Saurabh Dash authored Apr 05, 2024
  
  9117f892
03 Apr, 2024 1 commit

Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5

Adrian Abeyta authored Apr 03, 2024


Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

2ff767b5

28 Mar, 2024 3 commits
- [Model] Add support for Qwen2MoeModel (#3346) · d6ea427f
  wenyujin333 authored Mar 28, 2024
  
  d6ea427f
- [Model] Add support for xverse (#3610) · 098e1776
  hxer7963 authored Mar 28, 2024
```
Co-authored-by: willhe <hexin@xverse.cn>
Co-authored-by: root <root@localhost.localdomain>
```
  098e1776
- [Model] Fix and clean commandr (#3671) · 10e63222
  Roy authored Mar 28, 2024
  
  10e63222
27 Mar, 2024 4 commits
- Add support for Cohere's Command-R model (#3433) · 1182607e
  zeppombal authored Mar 27, 2024
```
Co-authored-by: José Maria Pombal <jose.pombal@unbabel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
```
  1182607e
- [Model] Add support for DBRX (#3660) · e24336b5
  Megha Agarwal authored Mar 27, 2024
  
  e24336b5
- [Bugfix] More faithful implementation of Gemma (#3653) · 82c540be
  Woosuk Kwon authored Mar 27, 2024
  
  82c540be
- [Misc] Minor fix in KVCache type (#3652) · e66b629c
  Woosuk Kwon authored Mar 26, 2024
  
  e66b629c
26 Mar, 2024 1 commit
- Enable more models to inference based on LoRA (#3382) · 8af890a8
  Jee Li authored Mar 26, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  8af890a8
25 Mar, 2024 4 commits
- [Feature] Add vision language model support. (#3042) · 64172a97
  xwjiang2010 authored Mar 25, 2024
  
  64172a97
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
  01bfb22b
- [Core] Refactor Attention Take 2 (#3462) · 925f3332
  Woosuk Kwon authored Mar 24, 2024
  
  925f3332
- [Model] Add starcoder2 awq support (#3569) · b0dfa91d
  少年 authored Mar 25, 2024
  
  b0dfa91d
24 Mar, 2024 3 commits
- [BugFix] 1D query fix for MoE models (#3597) · 41deac4a
  Nick Hill authored Mar 24, 2024
  
  41deac4a
- [BugFix] Fix Falcon tied embeddings (#3590) · af9e5349
  Woosuk Kwon authored Mar 24, 2024
```
Co-authored-by: 44670 <44670@users.noreply.github.com>
```
  af9e5349
- [Misc] Fix BLOOM copyright notice (#3591) · 3c5ab9b8
  Woosuk Kwon authored Mar 23, 2024
  
  3c5ab9b8
22 Mar, 2024 2 commits
- [Hardware][Neuron] Refactor neuron support (#3471) · e90fc21f
  Zhuohan Li authored Mar 21, 2024
  
  e90fc21f
- [Bugfix][Model] Fix Qwen2 (#3554) · ea5f14e6
  Roy authored Mar 22, 2024
  
  ea5f14e6
21 Mar, 2024 2 commits
- [BugFix] gemma loading after quantization or LoRA. (#3553) · b7050ca7
  Taemin Lee authored Mar 22, 2024
  
  b7050ca7
- [Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config (#3551) · c188ecb0
  Woosuk Kwon authored Mar 21, 2024
```
Co-authored-by: Roy <jasonailu87@gmail.com>
Co-authored-by: Roger Meier <r.meier@siemens.com>
```
  c188ecb0