- 04 May, 2024 1 commit
-
-
Michael Goin authored
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

* Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
* Supports static or dynamic activation quantization with static weight quantization (all per tensor)
* Supports different scales for each expert weight
* Supports Fp8 in QKV layer

Notes:

* The Expert Gate/Router always runs at half / full precision for now.
* If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.
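The re-quantization note above can be illustrated with a minimal sketch. It is not the code from the PR: the helper names and sample values are hypothetical, plain Python lists stand in for tensors, and rounding to the FP8 grid is left out so that only the scaling and clamping logic is shown. The idea is that each separately quantized shard is dequantized with its own scale and then quantized again with the largest scale, so the fused weight needs only one per-tensor scale.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def dequantize(q_values, scale):
    """Map stored values back to real values: w = q * scale."""
    return [q * scale for q in q_values]


def quantize(values, scale):
    """Scale and clamp to the FP8 range (rounding to the FP8 grid is omitted)."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


def requantize_with_max_scale(shards, scales):
    """Re-express every shard with the largest per-tensor scale (hypothetical helper)."""
    max_scale = max(scales)
    fused = []
    for q_values, scale in zip(shards, scales):
        fused.extend(quantize(dequantize(q_values, scale), max_scale))
    return fused, max_scale


# Hypothetical Q, K and V shards, each stored with its own per-tensor scale.
shards = [[448.0, -224.0], [448.0, 112.0], [-448.0, 56.0]]
scales = [0.01, 0.02, 0.005]

fused, max_scale = requantize_with_max_scale(shards, scales)
print(fused, max_scale)
```

Choosing the maximum scale keeps every re-quantized value inside the representable range, at the cost of some resolution for the shards that originally had smaller scales.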
-
- 01 May, 2024 1 commit
-
-
Philipp Moritz authored
Remove the device="cuda" declarations in mixtral as promised in #4343
-
- 27 Apr, 2024 2 commits
-
-
Robert Shaw authored
-
Philipp Moritz authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 26 Apr, 2024 2 commits
-
-
Cody Yu authored
-
SangBin Cho authored
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
-
- 25 Apr, 2024 2 commits
-
-
Isotr0py authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
Caio Mendes authored
-
- 24 Apr, 2024 1 commit
-
-
Philipp Moritz authored
This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset, and they do not need to convert their model checkpoints either. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**Performance**: For this PR, the focus is on keeping the code clean while still getting reasonable performance. A number of optimizations that significantly improve performance will be submitted in a follow-up PR (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954).

With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">

**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```

This compares favorably with the fp16 results, which are:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!
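For readers unfamiliar with dynamic per-tensor scaling, here is a minimal sketch of the idea, not the kernel from this PR. It assumes the float8 e4m3fn format (largest finite value 448), uses plain Python lists in place of tensors, and omits rounding to the FP8 grid. The function names and sample activations are hypothetical. The scale is computed from the tensor itself at run time, which is why no calibration dataset is needed.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def dynamic_per_tensor_scale(activations):
    """Derive one scale for the whole tensor from its largest magnitude."""
    amax = max(abs(x) for x in activations)
    # Guard against an all-zero tensor so the scale is never zero.
    return max(amax, 1e-12) / FP8_E4M3_MAX


def quantize_dynamic(activations):
    """Scale and clamp to the FP8 range (rounding to the FP8 grid is omitted)."""
    scale = dynamic_per_tensor_scale(activations)
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale)) for x in activations]
    return q, scale


# Hypothetical activation values for a single tensor.
activations = [0.5, -3.5, 1.75, 7.0]
q, scale = quantize_dynamic(activations)

# Multiplying by the scale recovers the real values (up to the omitted rounding).
restored = [v * scale for v in q]
print(q, scale, restored)
```

With static activation quantization, the scale would instead be a fixed value stored in the checkpoint, so the pass over the tensor to find its maximum would be skipped.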
-
- 16 Apr, 2024 1 commit
-
-
Antoni Baum authored
-
- 11 Apr, 2024 1 commit
-
-
youkaichao authored
[Core][Model] Use torch.compile to accelerate layernorm in commandr (#3985)
-
- 10 Apr, 2024 1 commit
-
-
youkaichao authored
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
-
- 09 Apr, 2024 1 commit
-
-
Junichi Sato authored
-
- 08 Apr, 2024 4 commits
-
-
Roy authored
-
Kiran R authored
Co-authored-by: roy <jasonailu87@gmail.com>
-
egortolmachev authored
Co-authored-by: Egor Tolmachev <t333ga@gmail.com>
-
ywfang authored
-
- 07 Apr, 2024 1 commit
-
-
youkaichao authored
-
- 05 Apr, 2024 1 commit
-
-
Isotr0py authored
-
- 04 Apr, 2024 1 commit
-
-
Saurabh Dash authored
-
- 03 Apr, 2024 1 commit
-
-
Adrian Abeyta authored
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 28 Mar, 2024 3 commits
-
-
wenyujin333 authored
-
hxer7963 authored
Co-authored-by: willhe <hexin@xverse.cn>
Co-authored-by: root <root@localhost.localdomain>
-
Roy authored
-
- 27 Mar, 2024 4 commits
-
-
zeppombal authored
Co-authored-by: José Maria Pombal <jose.pombal@unbabel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
-
Megha Agarwal authored
-
Woosuk Kwon authored
-
Woosuk Kwon authored
-
- 26 Mar, 2024 1 commit
-
-
Jee Li authored
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
-
- 25 Mar, 2024 4 commits
-
-
xwjiang2010 authored
-
SangBin Cho authored
-
Woosuk Kwon authored
-
少年 authored
-
- 24 Mar, 2024 3 commits
-
-
Nick Hill authored
-
Woosuk Kwon authored
Co-authored-by: 44670 <44670@users.noreply.github.com>
-
Woosuk Kwon authored
-
- 22 Mar, 2024 2 commits
-
-
Zhuohan Li authored
-
Roy authored
-
- 21 Mar, 2024 2 commits
-
-
Taemin Lee authored
-
Woosuk Kwon authored
Co-authored-by: Roy <jasonailu87@gmail.com>
Co-authored-by: Roger Meier <r.meier@siemens.com>
-