- 20 Apr, 2024 2 commits
-
-
Cody Yu authored
Provide initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm: The model checkpoint is still loaded in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes them accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is recalculated in every forward pass.

Initial Results: Tested so far with Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.
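The per-tensor scaling scheme described in this commit message can be illustrated with a small pure-Python sketch. It is not the code from the PR: the helper names (`round_to_e4m3`, `per_tensor_scale`, `quantize`, `dequantize`) are hypothetical, and it assumes the E4M3 FP8 variant (largest finite value 448), which the message does not specify. Weights would be scaled once after loading, while activations would get a fresh scale on each forward pass.

```python
import math

# Assumption: FP8 here means the E4M3 format, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on the E4M3 grid (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    # Saturate instead of overflowing.
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    # frexp gives |x| = m * 2**e with 0.5 <= m < 1, so floor(log2|x|) = e - 1.
    _, e = math.frexp(abs(x))
    exponent = max(e - 1, -6)       # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)    # spacing between neighbouring values
    # Python's round() rounds ties to even, matching round-to-nearest-even.
    return round(x / step) * step


def per_tensor_scale(values) -> float:
    """One scale for the whole tensor: map the largest magnitude to FP8 max."""
    amax = max((abs(v) for v in values), default=0.0)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale: float):
    """Divide by the scale, then snap each element to the FP8 grid."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(quantized, scale: float):
    """Recover approximate original values from FP8 values and the scale."""
    return [q * scale for q in quantized]


# Weights: the scale is computed once after loading and stored.
weights = [0.5, -1.0, 0.25, 2.0]
w_scale = per_tensor_scale(weights)
w_fp8 = quantize(weights, w_scale)

# Activations: the scale is recomputed for every forward pass.
activations = [0.1, -0.7, 0.3]
a_scale = per_tensor_scale(activations)
a_fp8 = quantize(activations, a_scale)
```

Recomputing the activation scale on every forward pass adds a reduction over the tensor each time, which may be one contributor to the FP8 run being slower than BF16 in the initial results; the commit message itself does not identify the cause.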
-
Ayush Rautwar authored
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
-
- 18 Apr, 2024 2 commits
-
-
Michael Goin authored
-
SangBin Cho authored
Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local>
-
- 16 Apr, 2024 2 commits
-
-
Antoni Baum authored
-
Noam Gat authored
Co-authored-by: Simon Mo <simon.mo@hey.com>
-
- 15 Apr, 2024 1 commit
-
-
Sanger Steel authored
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
-
- 14 Apr, 2024 1 commit
-
-
Sanger Steel authored
-
- 12 Apr, 2024 1 commit
-
-
Michael Feil authored
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
-
- 11 Apr, 2024 4 commits
-
-
Antoni Baum authored
-
Roger Wang authored
-
Kunshang Ji authored
-
youkaichao authored
[Core][Model] Use torch.compile to accelerate layernorm in commandr (#3985)
-
- 10 Apr, 2024 4 commits
-
-
youkaichao authored
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
-
Daniel E Marasco authored
-
Travis Johnson authored
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
-
胡译文 authored
-
- 09 Apr, 2024 2 commits
-
-
Junichi Sato authored
-
youkaichao authored
-
- 08 Apr, 2024 4 commits
-
-
Roy authored
-
Kiran R authored
Co-authored-by: roy <jasonailu87@gmail.com>
-
egortolmachev authored
Co-authored-by: Egor Tolmachev <t333ga@gmail.com>
-
ywfang authored
-
- 07 Apr, 2024 1 commit
-
-
youkaichao authored
-
- 05 Apr, 2024 1 commit
-
-
Isotr0py authored
-
- 04 Apr, 2024 4 commits
-
-
youkaichao authored
-
Saurabh Dash authored
-
youkaichao authored
-
Michael Feil authored
-
- 03 Apr, 2024 1 commit
-
-
Adrian Abeyta authored
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 29 Mar, 2024 6 commits
-
-
Nick Hill authored
-
Hongxia Yang authored
-
Roy authored
-
yhu422 authored
-
youkaichao authored
[Core][Test] move local_rank to the last arg with default value to keep api compatible (#3711)
-
youkaichao authored
-
- 28 Mar, 2024 4 commits
-
-
Simon Mo authored
-
Woosuk Kwon authored
-
Roy authored
-
Roger Wang authored
-