Commits · 371b1251dd7db7581377e40ea4e4626c3f83ef7a · OpenDAS / vllm_cscc


06 Jul, 2024 1 commit
- add fa pad · 371b1251
  zhuwenwen authored Jul 06, 2024
  
  371b1251
10 Jun, 2024 2 commits
- [Bugfix] Fix LLaVA-NeXT (#5380) · 2c0d9335
  Cyrus Leung authored Jun 10, 2024
  
  2c0d9335
- [Model] Initial support for LLaVA-NeXT (#4199) · 6b29d6fe
  Cyrus Leung authored Jun 10, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  6b29d6fe
08 Jun, 2024 1 commit
- [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353) · c09dade2
  Michael Goin authored Jun 08, 2024
  
  c09dade2
07 Jun, 2024 1 commit
- fix DbrxFusedNormAttention missing cache_config (#5340) · 767c727a
  Calvinn Ng authored Jun 08, 2024
```
Co-authored-by: team <calvinn.ng@ahrefs.com>
```
  767c727a
05 Jun, 2024 1 commit
- [Model] Correct Mixtral FP8 checkpoint loading (#5231) · 5563a4de
  Cody Yu authored Jun 05, 2024
  
  5563a4de
03 Jun, 2024 1 commit
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
  7a64d24a
01 Jun, 2024 1 commit
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
  b9c0605a
31 May, 2024 1 commit
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) · e9899fb7
  Cody Yu authored May 31, 2024
  
  e9899fb7
27 May, 2024 2 commits
- [Model] Add support for falcon-11B (#5069) · 890aa93d
  Isotr0py authored May 28, 2024
  
  890aa93d
- [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2
  Zhuohan Li authored May 27, 2024
```
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  1102bef2
25 May, 2024 1 commit

[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9

Eric Xihui Lin authored May 25, 2024


Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

8e192ff9

23 May, 2024 1 commit

[Kernel] Initial Activation Quantization Support (#4525) · a1242324

Dipika Sikka authored May 23, 2024


Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

a1242324

22 May, 2024 3 commits
- [Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig (#4991) · a36de682
  Philipp Moritz authored May 22, 2024
  
  a36de682
- [Model] LoRA gptbigcode implementation (#3949) · 97b03000
  raywanb authored May 23, 2024
  
  97b03000
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with the .kv_scale parameter).
```
  a3a73ab0
21 May, 2024 2 commits
- [Model] Add Phi-2 LoRA support (#4886) · f12c3b5b
  Isotr0py authored May 21, 2024
  
  f12c3b5b
- [Model] add rope_scaling support for qwen2 (#4930) · d130b573
  HUANG Fei authored May 21, 2024
  
  d130b573
20 May, 2024 1 commit
- [Model] LLaVA model refactor (#4910) · 6287537a
  Cyrus Leung authored May 20, 2024
  
  6287537a
19 May, 2024 1 commit
- [Bugfix][Model] Add base class for vision-language models (#4809) · f68470e8
  Cyrus Leung authored May 19, 2024
  
  f68470e8
18 May, 2024 1 commit

[Lora] Support long context lora (#4787) · 2e9a2227

SangBin Cho authored May 18, 2024

Currently, the rotary embedding kernel has to be called once per LoRA, which makes it hard to serve multiple long-context LoRAs together. This change adds a batched rotary embedding kernel and pipes it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
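
The idea of a rotary embedding layer that knows about several cos-sin caches can be sketched in plain Python. This is only an illustration and not the kernel from the PR: the function names are hypothetical, and linear position scaling (position divided by the scaling factor) is an assumption made for the example. Each scaling factor gets its own table, the tables are concatenated, and every token in a mixed batch looks up `offset + position` in a single pass.

```python
import math

def build_cos_sin_cache(rotary_dim, max_positions, base, scaling_factor):
    """Cos/sin table for one scaling factor (assumes linear position scaling)."""
    inv_freq = [base ** (-2.0 * i / rotary_dim) for i in range(rotary_dim // 2)]
    cache = []
    for pos in range(int(max_positions * scaling_factor)):
        angles = [(pos / scaling_factor) * f for f in inv_freq]
        cache.append(([math.cos(a) for a in angles],
                      [math.sin(a) for a in angles]))
    return cache

def build_batched_cache(rotary_dim, max_positions, base, scaling_factors):
    """Concatenate the per-factor tables and record where each one starts."""
    merged, offsets = [], {}
    for factor in scaling_factors:
        offsets[factor] = len(merged)
        merged.extend(build_cos_sin_cache(rotary_dim, max_positions, base, factor))
    return merged, offsets

def lookup(merged, offsets, positions, factors):
    """Single gather for a mixed batch: each token indexes offset + position."""
    return [merged[offsets[f] + p] for p, f in zip(positions, factors)]

# Two hypothetical adapters: one unscaled, one with a 4x longer context.
merged_cache, cache_offsets = build_batched_cache(4, 8, 10000.0, [1.0, 4.0])
rows = lookup(merged_cache, cache_offsets, positions=[3, 12], factors=[1.0, 4.0])
```

With this layout, requests that use different scaling factors can share one lookup instead of one call per adapter; the cost is a larger merged table.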

2e9a2227

17 May, 2024 1 commit
- Sync huggingface modifications of qwen Moe model (#4774) · 48d5985a
  eigenLiu authored May 18, 2024
  
  48d5985a
13 May, 2024 2 commits
- [Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793) · 33d3914b
  Philipp Moritz authored May 13, 2024
  
  33d3914b
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
  0fca3cdc
12 May, 2024 1 commit
- [Model] Add support for IBM Granite Code models (#4636) · 6eaccb73
  Yikang Shen authored May 12, 2024
  
  6eaccb73
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
09 May, 2024 1 commit

[Model] Snowflake arctic model implementation (#4652) · ebce310b

Hao Zhang authored May 09, 2024


Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com>
Co-authored-by: Aurick Qiao <qiao@aurick.net>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

ebce310b

04 May, 2024 1 commit

[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with... · 2a052011

Michael Goin authored May 04, 2024

[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

Follow-on to #4332 that enables FP8 checkpoint loading for Mixtral; it supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

- Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Supports static or dynamic activation quantization with static weight quantization (all per tensor)
- Supports different scales for each expert weight
- Supports Fp8 in QKV layer

Notes:

- The Expert Gate/Router always runs at half / full precision for now.
- If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.
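
The re-quantization step in the last note can be illustrated with a small sketch. It is not the code from the PR: the helper name and the sample values are hypothetical, plain Python floats stand in for FP8 tensors, and rounding to the FP8 grid is left out. Each shard is rescaled by `scale / max_scale`, so all shards share one scale while their dequantized values stay the same.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format

def requantize_to_shared_scale(q_shards, shard_scales):
    """Re-express per-shard quantized weights under one shared (max) scale."""
    shared_scale = max(shard_scales)
    requantized = []
    for q_values, scale in zip(q_shards, shard_scales):
        ratio = scale / shared_scale  # <= 1, so magnitudes can only shrink
        requantized.append([
            max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, q * ratio)) for q in q_values
        ])
    return requantized, shared_scale

# Hypothetical Q, K and V shards, each quantized with its own scale.
q_shards = [[448.0, -224.0], [100.0, 50.0], [-448.0, 8.0]]
shard_scales = [0.01, 0.02, 0.005]
new_q, shared = requantize_to_shared_scale(q_shards, shard_scales)

# Dequantized weights are unchanged: new_q * shared == q * scale.
for old, new, scale in zip(q_shards, new_q, shard_scales):
    for q_old, q_new in zip(old, new):
        assert abs(q_new * shared - q_old * scale) < 1e-9
```

Using the largest scale means no shard can overflow the FP8 range after rescaling; shards that originally had a smaller scale lose some resolution in exchange for the single fused GEMM.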

2a052011

01 May, 2024 1 commit
- [Misc] Remove Mixtral device="cuda" declarations (#4543) · c9d852d6
  Philipp Moritz authored May 01, 2024
```
Remove the device="cuda" declarations in mixtral as promised in #4343
```
  c9d852d6
27 Apr, 2024 2 commits
- [BugFix] Resolved Issues For LinearMethod --> QuantConfig (#4418) · 4ea1f967
  Robert Shaw authored Apr 27, 2024
  
  4ea1f967
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) · 12628d3c
  Philipp Moritz authored Apr 26, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  12628d3c
26 Apr, 2024 2 commits
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
  a62aaf1d
- [CI] Disable non-lazy string operation on logging (#4326) · a88081bf
  SangBin Cho authored Apr 26, 2024
```
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
```
  a88081bf
25 Apr, 2024 2 commits
- [Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 (#4324) · fbf152d9
  Isotr0py authored Apr 26, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  fbf152d9
- [Model] Adds Phi-3 support (#4298) · 96e90fde
  Caio Mendes authored Apr 25, 2024
  
  96e90fde
24 Apr, 2024 1 commit

[Kernel] FP8 support for MoE kernel / Mixtral (#4244) · eace8bf0

Philipp Moritz authored Apr 23, 2024

This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset or convert their model checkpoints. Specifying the `quantization="fp8"` argument is enough. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
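
For readers unfamiliar with dynamic per-tensor scaling, the following is a minimal sketch of the idea rather than the kernel in this PR. The function names are hypothetical, Python floats stand in for FP8 values, and rounding to the FP8 grid is omitted. The scale is derived from the tensor's own absolute maximum at runtime, which is why no calibration dataset is needed.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format

def dynamic_per_tensor_quantize(values):
    """Pick one scale from the tensor's absolute maximum, then rescale."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    quantized = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale))
                 for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]

activations = [0.5, -3.5, 1.75, 7.0]
quantized, scale = dynamic_per_tensor_quantize(activations)
restored = dequantize(quantized, scale)
```

The largest-magnitude element maps to the edge of the FP8 range, so the whole tensor uses the available dynamic range regardless of the input.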

**Performance**: This PR focuses on keeping the code clean while still getting reasonable performance. A number of optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954) will be submitted in a follow-up PR. With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the fp16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!

eace8bf0

16 Apr, 2024 1 commit
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024
  
  69e1d2fb
11 Apr, 2024 1 commit
- [Core][Model] torch.compile for layernorm in commandr (#3985) · caada5e5
  youkaichao authored Apr 10, 2024
```
[Core][Model] Use torch.compile to accelerate layernorm in commandr (#3985)
```
  caada5e5
10 Apr, 2024 1 commit

[Core][Refactor] move parallel_utils into vllm/distributed (#3950) · 63e7176f

youkaichao authored Apr 10, 2024

[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)

63e7176f

09 Apr, 2024 1 commit
- [Bugfix] Fix KeyError on loading GPT-NeoX (#3925) · e23a43ae
  Junichi Sato authored Apr 10, 2024
  
  e23a43ae