- 03 May, 2024 1 commit
-
-
SangBin Cho authored
-
- 02 May, 2024 1 commit
-
-
alexm-nm authored
-
- 29 Apr, 2024 1 commit
-
-
Robert Shaw authored
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
-
- 27 Apr, 2024 2 commits
-
-
Austin Veselka authored
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
-
Philipp Moritz authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 24 Apr, 2024 3 commits
-
-
alexm-nm authored
This PR addresses the Marlin kernel H100 crash reported in neuralmagic#187. The crash was caused by the inline PTX assembly that introduced async_copy with streaming behavior. The fix is to use the more standard PTX for async_copy, without the fractional L2 policy for "evict_first". There is no performance difference between the standard async_copy PTX and the previous version.
-
Woosuk Kwon authored
-
Philipp Moritz authored
This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208.

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset, and they also do not need to convert their model checkpoints. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

**Performance**: For this PR, the focus is on keeping the code clean while still getting reasonable performance. There are a number of optimizations that we will submit in a follow-up PR that significantly improve the performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954). With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">

**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```

This compares favorably with the fp16 results, which are:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!
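For readers unfamiliar with dynamic per-tensor scaling, the idea can be sketched in plain Python. This is only an illustration, not the code from this PR: the function names are hypothetical, the e4m3 format (largest finite value 448) is an assumption, and the rounding helper merely simulates an 8-bit cast on ordinary floats. The scale is derived from the tensor's own absolute maximum at runtime, which is why no calibration dataset is needed.

```python
import math

# Assumption: the e4m3 FP8 format, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def round_to_e4m3(v: float) -> float:
    """Simulate rounding a value in [-448, 448] to the nearest e4m3 grid point."""
    if v == 0.0:
        return 0.0
    _, exp = math.frexp(abs(v))   # abs(v) = m * 2**exp with 0.5 <= m < 1
    exp = max(exp - 1, -6)        # unbiased exponent, clamped at the smallest normal exponent
    quantum = 2.0 ** (exp - 3)    # 3 mantissa bits give 8 steps per power of two
    return round(v / quantum) * quantum


def quantize_dynamic_per_tensor(values):
    """Hypothetical helper: derive one scale from the tensor itself, then quantize."""
    absmax = max(abs(v) for v in values)
    # Map the largest magnitude onto the largest representable FP8 value.
    scale = absmax / FP8_E4M3_MAX if absmax > 0 else 1.0
    quantized = [
        round_to_e4m3(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)))
        for v in values
    ]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate original values by multiplying the scale back in."""
    return [q * scale for q in quantized]


activations = [0.02, -1.5, 3.0, 12.0]
quantized, scale = quantize_dynamic_per_tensor(activations)
restored = dequantize(quantized, scale)
for original, approx in zip(activations, restored):
    print(f"{original:>8.4f} -> {approx:>8.4f}")
```

Because the scale is recomputed for every tensor, the largest element always lands on the edge of the representable range, at the cost of one extra reduction over the tensor per call.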
-
- 23 Apr, 2024 1 commit
-
-
James Fleming authored
Co-authored-by: mgoin <michael@neuralmagic.com>
-
- 17 Apr, 2024 1 commit
-
-
Shoichi Uchinami authored
-
- 13 Apr, 2024 1 commit
-
-
Jee Li authored
-
- 11 Apr, 2024 3 commits
-
-
Antoni Baum authored
-
Antoni Baum authored
-
fuchen.ljl authored
-
- 08 Apr, 2024 1 commit
-
-
Matt Wong authored
-
- 04 Apr, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 03 Apr, 2024 1 commit
-
-
Adrian Abeyta authored
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 02 Apr, 2024 1 commit
-
-
bigPYJ1151 authored
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
-
- 30 Mar, 2024 1 commit
-
-
mawong-amd authored
-
- 27 Mar, 2024 1 commit
-
-
Jee Li authored
-
- 26 Mar, 2024 1 commit
-
-
Jee Li authored
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
-
- 22 Mar, 2024 1 commit
-
-
Hanzhi Zhou authored
-
- 18 Mar, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 16 Mar, 2024 1 commit
-
-
Simon Mo authored
-
- 15 Mar, 2024 1 commit
-
-
akhoroshev authored
-
- 13 Mar, 2024 3 commits
-
-
Terry authored
-
Or Sharir authored
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350)
-
Woosuk Kwon authored
-
- 11 Mar, 2024 1 commit
-
-
kliuae authored
-
- 10 Mar, 2024 2 commits
-
-
Douglas Lehr authored
-
Terry authored
-
- 08 Mar, 2024 1 commit
-
-
whyiug authored
-
- 01 Mar, 2024 1 commit
-
-
Robert Shaw authored
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
-
- 29 Feb, 2024 1 commit
-
-
CHU Tianxiang authored
-
- 28 Feb, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 26 Feb, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 22 Feb, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 12 Feb, 2024 1 commit
-
-
Rex authored
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>
-
- 06 Feb, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 01 Feb, 2024 1 commit
-
-
zhaoyang-star authored
-