Commits · 3cd9b5bb2d4a0d5eed07186ae140f5dc8f839708 · OpenDAS / vllm_cscc

24 Apr, 2024 6 commits

[Core][Distributed] use existing torch.cuda.device (#4318) · 3cd9b5bb
youkaichao authored Apr 24, 2024
```
[Core][Distributed] use existing torch.cuda.device context manager (#4318)
```
3cd9b5bb
[Misc] Reduce supported Punica dtypes (#4304) · 468d761b
Woosuk Kwon authored Apr 23, 2024

468d761b
[CI][Build] change pynvml to nvidia-ml-py (#4302) · e4bf860a
youkaichao authored Apr 23, 2024

e4bf860a
[Core][Distributed] use cpu/gloo to initialize pynccl (#4248) · 91f50a6f
youkaichao authored Apr 23, 2024

91f50a6f
[BUG] fixed fp8 conflict with aqlm (#4307) · 79a268c4
Robert Shaw authored Apr 23, 2024
```
Fixes the fp8 interface, which broke in the AQLM merge.
```
79a268c4

[Kernel] FP8 support for MoE kernel / Mixtral (#4244) · eace8bf0

Philipp Moritz authored Apr 23, 2024

This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**Performance**: The focus of this PR is on keeping the code clean while still getting reasonable performance. A number of optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954) will be submitted in a follow-up PR. With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the fp16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```
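To make the comparison concrete, the per-group differences can be computed directly from the two tables above. This is only an illustrative sketch: the numbers are copied from the tables, and the dictionary names are chosen here for the example.

```python
# Accuracy values copied from the two MMLU tables above.
fp8_acc = {
    "mmlu": 0.7018,
    "humanities": 0.6472,
    "other": 0.7673,
    "social_sciences": 0.8099,
    "stem": 0.6131,
}
fp16_acc = {
    "mmlu": 0.7020,
    "humanities": 0.6425,
    "other": 0.7744,
    "social_sciences": 0.8131,
    "stem": 0.6108,
}

# Positive delta: fp8 scored higher than fp16 for that group.
deltas = {group: round(fp8_acc[group] - fp16_acc[group], 4) for group in fp8_acc}

# Group with the largest absolute difference between the two runs.
largest_gap = max(deltas, key=lambda group: abs(deltas[group]))

for group, delta in deltas.items():
    print(f"{group:16s} fp8 - fp16 = {delta:+.4f}")
print(f"largest gap: {largest_gap}")
```

On these figures, every group differs by less than one percentage point, and the fp8 run is slightly ahead on two of the four subgroups.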

Happy hacking!

eace8bf0

23 Apr, 2024 12 commits
- [Bugfix][Frontend] Raise exception when file-like chat template fails to be opened (#4292) · 1e8f4252
  Cyrus Leung authored Apr 24, 2024
  
  1e8f4252
- AQLM CUDA support (#3287) · 2b7949c1
  James Fleming authored Apr 23, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  2b7949c1
- [CI] Add ccache for wheel builds job (#4281) · 62b5166b
  Simon Mo authored Apr 23, 2024
  
  62b5166b
- [Core][Logging] Add last frame information for better debugging (#4278) · d86285a4
  youkaichao authored Apr 23, 2024
  
  d86285a4
- [Bugfix] Add init_cached_hf_modules to RayWorkerWrapper (#4286) · d87f39e9
  DefTruth authored Apr 24, 2024
  
  d87f39e9
- [Bugfix] Fixing max token error message for openai compatible server (#4016) · d3c8180a
  Jack Gordley authored Apr 23, 2024
  
  d3c8180a
- [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) · 62b8aebc
  Cade Daniel authored Apr 23, 2024
  
  62b8aebc
- [Core] Scheduling optimization 2 (#4280) · 050f285f
  SangBin Cho authored Apr 23, 2024
  
  050f285f
- [Core] Some simplification of WorkerWrapper changes (#4183) · 8f2ea22b
  Nick Hill authored Apr 23, 2024
  
  8f2ea22b
- [Mypy] Part 3 fix typing for nested directories for most of directory (#4161) · 0ae11f78
  SangBin Cho authored Apr 23, 2024
  
  0ae11f78
- Fix `autodoc` directives (#4272) · 34128a69
  Harry Mellor authored Apr 23, 2024
```
Co-authored-by: Harry Mellor <hmellor@oxts.com>
```
  34128a69
- [Core][Distributed] use absolute path for library file (#4271) · c1b4e415
  youkaichao authored Apr 22, 2024
  
  c1b4e415
22 Apr, 2024 9 commits
- [Doc] Update the SkyPilot doc with serving and Llama-3 (#4276) · ceaf4ed0
  Zhanghao Wu authored Apr 22, 2024
  
  ceaf4ed0
- [Core] Scheduler perf fix (#4270) · ad8d696a
  SangBin Cho authored Apr 23, 2024
  
  ad8d696a
- Add example scripts to documentation (#4225) · 3d925165
  Harry Mellor authored Apr 22, 2024
```
Co-authored-by: Harry Mellor <hmellor@oxts.com>
```
  3d925165
- [Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter (#4217) · 15436806
  alexm-nm authored Apr 22, 2024
  
  15436806
- [Frontend] Enable support for CPU backend in AsyncLLMEngine. (#3993) · 077f0a2e
  Tao He authored Apr 22, 2024
```
Signed-off-by: Tao He <sighingnow@gmail.com>
```
  077f0a2e
- [Bugfix] Fix type annotations in CPU model runner (#4256) · e73ed0f1
  Woosuk Kwon authored Apr 22, 2024
  
  e73ed0f1
- [Misc] Add vision language model support to CPU backend (#3968) · 296cdf8a
  Isotr0py authored Apr 22, 2024
  
  296cdf8a
- [Core][Distributed] fix _is_full_nvlink detection (#4233) · 747b1a71
  youkaichao authored Apr 21, 2024
  
  747b1a71
- [AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI... · 95e5b087
  Hongxia Yang authored Apr 22, 2024
```
[AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring (#4129)
```
  95e5b087
21 Apr, 2024 3 commits
- Make initialization of tokenizer and detokenizer optional (#3748) · a37d815b
  GeauxEric authored Apr 21, 2024
```
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  a37d815b
- [Doc]: Update the doc of adding new models (#4236) · 7f2593b1
  xiaoji authored Apr 22, 2024
  
  7f2593b1
- Don't show default value for flags in `EngineArgs` (#4223) · fe7d648f
  Harry Mellor authored Apr 21, 2024
```
Co-authored-by: Harry Mellor <hmellor@oxts.com>
```
  fe7d648f
20 Apr, 2024 6 commits

Updating lm-format-enforcer version and adding links to decoding libraries in docs (#4222) · cc74b2b2
Noam Gat authored Apr 20, 2024

cc74b2b2
[Frontend] multiple sampling params support (#3570) · 91528575
nunjunj authored Apr 20, 2024

91528575

[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3

Cody Yu authored Apr 19, 2024

Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with `--quantization fp8` or `-q fp8` when launching an engine.

Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
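The idea of per-tensor scaling can be sketched in plain Python. This is a rough illustration under stated assumptions, not the code in `Fp8LinearMethod`: it assumes the E4M3 flavour of FP8 (largest finite value 448), simulates FP8 rounding by keeping 4 significant bits, and ignores subnormals and NaN. All function names below are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (assumed format)


def round_to_e4m3(value: float) -> float:
    """Simulate FP8 E4M3 rounding: clamp to the representable range and
    keep 4 significant bits (1 implicit + 3 mantissa bits).
    Simplification: subnormals and NaN are not modelled."""
    if value == 0.0:
        return 0.0
    clamped = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value))
    # clamped == mantissa * 2**exponent with 0.5 <= |mantissa| < 1
    mantissa, exponent = math.frexp(clamped)
    return math.ldexp(round(mantissa * 16) / 16, exponent)


def per_tensor_scale(values: list[float]) -> float:
    """One scale for the whole tensor: map its largest magnitude to FP8 max."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values: list[float], scale: float) -> list[float]:
    """Divide by the scale, then round to the simulated FP8 grid."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(quantized: list[float], scale: float) -> list[float]:
    """Multiply by the scale to recover approximate original values."""
    return [q * scale for q in quantized]


# Weights: the scale is computed once after loading and then reused.
weights = [0.02, -1.5, 0.75, 3.0]
weight_scale = per_tensor_scale(weights)
weights_fp8 = quantize(weights, weight_scale)

# Activations: the scale is recomputed on every forward pass ("dynamic").
activations = [0.1, -0.4, 2.2]
activation_scale = per_tensor_scale(activations)
activations_fp8 = quantize(activations, activation_scale)

print(weights_fp8, dequantize(weights_fp8, weight_scale))
```

Because a single scale covers the whole tensor, one large outlier reduces the resolution available to every other element; that is the usual trade-off of per-tensor schemes against finer-grained ones.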

Initial Results:
Tested so far with Mistral-7B on 1xH100, with prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.

a22cdea3

Fix missing docs and out of sync `EngineArgs` (#4219) · 682789d4
Harry Mellor authored Apr 20, 2024
```
Co-authored-by: Harry Mellor <hmellor@oxts.com>
```
682789d4
[Bugfix] Add fix for JSON whitespace (#4189) · 138485a8
Ayush Rautwar authored Apr 19, 2024
```
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
```
138485a8
Pass `tokenizer_revision` when getting tokenizer in openai serving (#4214) · bc9df157
Chirag Jain authored Apr 20, 2024

bc9df157

19 Apr, 2024 4 commits
- [Misc] add nccl in collect env (#4211) · 15b86408
  youkaichao authored Apr 19, 2024
  
  15b86408
- [Bugfix][Core] Restore logging of stats in the async engine (#4150) · 7be4f562
  Ronen Schaffer authored Apr 19, 2024
  
  7be4f562
- [Misc] fix docstrings (#4191) · 8f20fc04
  Uranus authored Apr 19, 2024
```
Co-authored-by: Zhong Wang <wangzhong@infini-ai.com>
```
  8f20fc04
- Bump version of 0.4.1 (#4177) · 221d93ec
  Simon Mo authored Apr 19, 2024
  
  221d93ec