Commits · dbe55885541fcf27c0bc229df50b3ea3c53e38ce · OpenDAS / vllm_cscc

19 Jul, 2024 1 commit
- [ Misc ] non-uniform quantization via `compressed-tensors` for `Llama` (#6515) · dbe55885
  Robert Shaw authored Jul 18, 2024
  
  dbe55885
18 Jul, 2024 1 commit
- [Model] Pipeline parallel support for Mixtral (#6516) · b5af8c22
  Cody Yu authored Jul 17, 2024
  
  b5af8c22
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
  
  978aed53
14 Jul, 2024 1 commit
- [ Misc ] Apply MoE Refactor to Deepseekv2 To Support Fp8 (#6417) · fb6af8bc
  Robert Shaw authored Jul 13, 2024
  
  fb6af8bc
10 Jul, 2024 1 commit
- [Bugfix] Support 2D input shape in MoE layer (#6287) · e72ae80b
  Woosuk Kwon authored Jul 10, 2024
  
  e72ae80b
02 Jul, 2024 3 commits
- [CORE] Quantized lm-head Framework (#4442) · ee93f4f9
  Qubitium-ModelCloud authored Jul 03, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
```
  ee93f4f9
- [ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970) · 7c008c51
  Robert Shaw authored Jul 02, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  7c008c51
- [Core] Pipeline Parallel Support (#4412) · c5832d2a
  Murali Andoorveedu authored Jul 02, 2024
```
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
```
  c5832d2a
27 Jun, 2024 2 commits
- [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896) · 98cf2ed6
  Cyrus Leung authored Jun 28, 2024
  
  98cf2ed6
- [Model] Add base class for LoRA-supported models (#5018) · 96354d6a
  Cyrus Leung authored Jun 27, 2024
  
  96354d6a
08 Jun, 2024 1 commit
- [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353) · c09dade2
  Michael Goin authored Jun 08, 2024
  
  c09dade2
05 Jun, 2024 1 commit
- [Model] Correct Mixtral FP8 checkpoint loading (#5231) · 5563a4de
  Cody Yu authored Jun 05, 2024
  
  5563a4de
31 May, 2024 1 commit
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) · e9899fb7
  Cody Yu authored May 31, 2024
  
  e9899fb7
27 May, 2024 1 commit

[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2

Zhuohan Li authored May 27, 2024


Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

1102bef2

22 May, 2024 1 commit

[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0

Cody Yu authored May 22, 2024

The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).

a3a73ab0

13 May, 2024 2 commits
- [Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793) · 33d3914b
  Philipp Moritz authored May 13, 2024
  
  33d3914b
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
  0fca3cdc
04 May, 2024 1 commit

[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with... · 2a052011

Michael Goin authored May 04, 2024

[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
Supports static or dynamic activation quantization with static weight quantization (all per tensor)
Supports different scales for each expert weight
Supports Fp8 in QKV layer
Notes:

The Expert Gate/Router always runs at half / full precision for now.
If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.

2a052011

01 May, 2024 1 commit
- [Misc] Remove Mixtral device="cuda" declarations (#4543) · c9d852d6
  Philipp Moritz authored May 01, 2024
```
Remove the device="cuda" declarations in mixtral as promised in #4343
```
  c9d852d6
27 Apr, 2024 1 commit
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) · 12628d3c
  Philipp Moritz authored Apr 26, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  12628d3c
26 Apr, 2024 1 commit
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
  a62aaf1d
24 Apr, 2024 1 commit

[Kernel] FP8 support for MoE kernel / Mixtral (#4244) · eace8bf0

Philipp Moritz authored Apr 23, 2024

This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**Performance**: For this PR, the focus is on making the code clean (while still trying to get reasonable performance), there is a bunch of optimizations that we will submit as a follow up PR that significantly improve the performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954). With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
this compares favorably with the fp16 results which are
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!

eace8bf0

16 Apr, 2024 1 commit
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024
  
  69e1d2fb
10 Apr, 2024 1 commit

[Core][Refactor] move parallel_utils into vllm/distributed (#3950) · 63e7176f

youkaichao authored Apr 10, 2024

[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)

63e7176f

25 Mar, 2024 2 commits
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
  01bfb22b
- [Core] Refactor Attention Take 2 (#3462) · 925f3332
  Woosuk Kwon authored Mar 24, 2024
  
  925f3332
24 Mar, 2024 1 commit
- [BugFix] 1D query fix for MoE models (#3597) · 41deac4a
  Nick Hill authored Mar 24, 2024
  
  41deac4a
20 Mar, 2024 1 commit
- Migrate `logits` computation and gather to `model_runner` (#3233) · f1c0fc39
  Roy authored Mar 21, 2024
  
  f1c0fc39
07 Mar, 2024 1 commit
- Separate attention backends (#3005) · 2daf23ab
  Woosuk Kwon authored Mar 07, 2024
  
  2daf23ab
15 Feb, 2024 1 commit

Align LoRA code between Mistral and Mixtral (fixes #2875) (#2880) · 31348dff

Philipp Moritz authored Feb 14, 2024



* Fix AttributeError: MixtralModel object has no attribute org_vocab_size.

* Make LoRA logic for Mistral and Mixtral the same

---------
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com>

31348dff

13 Feb, 2024 1 commit

Add LoRA support for Mixtral (#2831) · 2a543d6e

Terry authored Feb 13, 2024

* add mixtral lora support

* formatting

* fix incorrectly ported logic

* polish tests

* minor fixes and refactoring

* minor fixes

* formatting

* rename and remove redundant logic

* refactoring

* refactoring

* minor fix

* minor refactoring

* fix code smell

2a543d6e

06 Feb, 2024 1 commit
- Add fused top-K softmax kernel for MoE (#2769) · f0d4e145
  Woosuk Kwon authored Feb 05, 2024
  
  f0d4e145
31 Jan, 2024 1 commit
- Add unit test for Mixtral MoE layer (#2677) · d0d93b92
  Philipp Moritz authored Jan 31, 2024
  
  d0d93b92
30 Jan, 2024 1 commit
- Fused MOE for Mixtral (#2542) · ab406446
  Philipp Moritz authored Jan 29, 2024
```
Co-authored-by: chen shen <scv119@gmail.com>
```
  ab406446
03 Jan, 2024 1 commit
- Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) · fd4ea8ef
  Zhuohan Li authored Jan 04, 2024
  
  fd4ea8ef
20 Dec, 2023 1 commit
- [BugFix] Fix weight loading for Mixtral with TP (#2208) · ba4f8267
  Woosuk Kwon authored Dec 19, 2023
  
  ba4f8267
17 Dec, 2023 1 commit

Optimize model execution with CUDA graph (#1926) · 37ca5581

Woosuk Kwon authored Dec 16, 2023


Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

37ca5581

16 Dec, 2023 1 commit
- Simplify weight loading logic (#2133) · eed74a55
  Roy authored Dec 17, 2023
  
  eed74a55
15 Dec, 2023 1 commit
- Add GPTQ support (#916) · 0fbfc4b8
  CHU Tianxiang authored Dec 15, 2023
  
  0fbfc4b8
14 Dec, 2023 1 commit
- Optimize Mixtral with expert parallelism (#2090) · 21d93c14
  Antoni Baum authored Dec 13, 2023
  
  21d93c14