Commits · f4f8a9d892a357e341b90bc47a8d72ece62323d5 · OpenDAS / vllm_cscc

24 Jul, 2024 1 commit
- [Bugfix] Fix modelscope compatibility issue (#6730) · f4f8a9d8
  liuyhwangyh authored Jul 24, 2024
  
23 Jul, 2024 2 commits
- [bitsandbytes]: support read bnb pre-quantized model (#5753) · 87525fab
  dongmao zhang authored Jul 23, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
- support ignore patterns in model loader (#6673) · 3eda4ec7
  Simon Mo authored Jul 22, 2024
  
16 Jul, 2024 1 commit
- [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425) · 9ad32dac
  Mor Zusman authored Jul 16, 2024
```
Co-authored-by: Mor Zusman <morz@ai21.com>
```
15 Jul, 2024 1 commit
- [Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod (#6289) · ec9933f4
  Woosuk Kwon authored Jul 15, 2024
  
03 Jul, 2024 2 commits
- [vlm] Remove vision language config. (#6089) · d9e98f42
  xwjiang2010 authored Jul 03, 2024
```
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
- [hardware][misc] introduce platform abstraction (#6080) · 482045ee
  youkaichao authored Jul 02, 2024
  
02 Jul, 2024 1 commit

- [VLM] Remove `image_input_type` from VLM config (#5852) · 98d6682c
  xwjiang2010 authored Jul 02, 2024

  Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
  Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
  Co-authored-by: Roger Wang <ywang@roblox.com>

01 Jul, 2024 1 commit
- [misc][cuda] use nvml to avoid accidental cuda initialization (#6007) · 614aa512
  youkaichao authored Jun 30, 2024
  
27 Jun, 2024 1 commit
- [Model] Add base class for LoRA-supported models (#5018) · 96354d6a
  Cyrus Leung authored Jun 27, 2024
  
15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
12 Jun, 2024 2 commits
- [Frontend] [Core] Support for sharded tensorized models (#4990) · 51602eef
  Travis Johnson authored Jun 12, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
- [Hardware] Initial TPU integration (#5292) · 1a8bfd92
  Woosuk Kwon authored Jun 12, 2024
  
01 Jun, 2024 2 commits
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
- [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py (#5151) · c3540728
  Ye Cao authored Jun 02, 2024
```
Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com>
```
24 May, 2024 1 commit
- [Bugfix] Fix Mistral v0.3 Weight Loading (#5005) · 91977095
  Robert Shaw authored May 24, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
20 May, 2024 1 commit
- [Core] Sharded State Loader download from HF (#4889) · 1937e298
  Aurick Qiao authored May 20, 2024
  
19 May, 2024 1 commit
- [Bugfix][Model] Add base class for vision-language models (#4809) · f68470e8
  Cyrus Leung authored May 19, 2024
  
16 May, 2024 1 commit
- [Core] Implement sharded state loader (#4690) · 30e75439
  Aurick Qiao authored May 16, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
13 May, 2024 2 commits
- [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208) · 8bc68e19
  Sanger Steel authored May 13, 2024
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
02 May, 2024 1 commit
- [Misc] centralize all usage of environment variables (#4548) · 5b8a7c1c
  youkaichao authored May 02, 2024
  
27 Apr, 2024 1 commit

- [Core] Support offline use of local cache for models (#4374) · d6e520e1
  Prashant Gupta authored Apr 27, 2024

  Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
  Co-authored-by: Travis Johnson <tjohnson31415@gmail.com>

26 Apr, 2024 1 commit
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
24 Apr, 2024 1 commit

- [Kernel] FP8 support for MoE kernel / Mixtral (#4244) · eace8bf0
  Philipp Moritz authored Apr 23, 2024

This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208.

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users need neither compute activation scales on a calibration dataset nor convert their model checkpoints. Specifying the `quantization="fp8"` argument is enough. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**Performance**: This PR focuses on keeping the code clean while still getting reasonable performance. Several optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954) will be submitted in a follow-up PR. With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the fp16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!


22 Apr, 2024 1 commit
- [Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter (#4217) · 15436806
  alexm-nm authored Apr 22, 2024
  
20 Apr, 2024 1 commit

- [Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3
  Cody Yu authored Apr 19, 2024

Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with `--quantization fp8` or `-q fp8` when launching an engine.

Algorithm:
We still load the model checkpoint in FP16/BF16. After the weights are loaded, `Fp8LinearMethod` calculates a per-tensor scaling factor for the weights and quantizes them accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
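
The scaling step described above can be illustrated with a minimal pure-Python sketch. This is not the code from the PR: the helper names (`per_tensor_scale`, `quantize`, `dequantize`) are hypothetical, the target format is assumed to be FP8 E4M3 (largest finite value 448), and rounding to actual FP8 precision is omitted so the example runs without a GPU or any library.

```python
# Illustrative sketch only; helper names are hypothetical, not vLLM APIs.
FP8_E4M3_MAX = 448.0  # assumed target format: largest finite FP8 E4M3 value


def per_tensor_scale(values):
    """Return one scale for the whole tensor so max(|v|) maps to FP8_E4M3_MAX."""
    amax = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero (or empty) tensor so the scale is never zero.
    return max(amax, 1e-12) / FP8_E4M3_MAX


def quantize(values, scale):
    """Divide by the scale and clamp into the FP8 range (rounding omitted)."""
    return [min(max(v / scale, -FP8_E4M3_MAX), FP8_E4M3_MAX) for v in values]


def dequantize(qvalues, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in qvalues]


# Weights: the scale is computed once after loading and then stored.
weights = [0.5, -2.0, 1.25, 0.0]
w_scale = per_tensor_scale(weights)
w_q = quantize(weights, w_scale)

# Activations: the scale is recomputed on every forward pass (dynamic scaling).
activations = [3.0, -0.75, 6.0]
a_scale = per_tensor_scale(activations)
a_q = quantize(activations, a_scale)
```

Because a single scale covers the whole tensor, the largest-magnitude element lands at the edge of the representable range, and no calibration dataset is needed to choose it.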

Initial Results:
Currently tested with Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.


16 Apr, 2024 1 commit
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024
  