- 20 Apr, 2024 2 commits
-
-
Noam Gat authored
-
Cody Yu authored
Provide initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm: The model checkpoint is still loaded in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates a per-tensor scaling factor for the weights and quantizes them accordingly. The scaling factor is then stored for future use. The per-tensor scaling factor for activations, by contrast, is calculated in every forward pass.

Initial results: Currently tested Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.
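The per-tensor scaling described in this commit message can be illustrated with a minimal pure-Python sketch. This is not the Fp8LinearMethod code: the e4m3 format (largest finite value 448), the amax / 448 scale formula, and all function names below are assumptions made for illustration, and the FP8 cast is only simulated by rounding to four significant bits.

```python
import math

# Assumption: FP8 here means the e4m3 format, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Round x to a simulated e4m3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    _, exp = math.frexp(abs(x))           # abs(x) = m * 2**exp, 0.5 <= m < 1
    # 1 implicit + 3 stored mantissa bits; exponent floor models subnormals.
    quantum = 2.0 ** (max(exp, -5) - 4)
    rounded = round(x / quantum) * quantum
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, rounded))


def per_tensor_scale(values):
    """One scale for the whole tensor: map the largest magnitude to 448."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Divide by the scale, then round each element to the simulated grid."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(qvalues, scale):
    """Multiply by the stored scale to recover approximate values."""
    return [q * scale for q in qvalues]


# Weights: the scale is computed once after loading and then reused.
weights = [0.5, -1.0, 0.25, 2.0, 0.3]
w_scale = per_tensor_scale(weights)
w_q = quantize(weights, w_scale)

# Activations: the scale is recomputed for every forward pass.
activations = [0.1, -3.0, 0.75]
a_scale = per_tensor_scale(activations)
a_q = quantize(activations, a_scale)
```

Computing the weight scale once keeps the per-request cost low, while recomputing the activation scale on every call tracks the actual input range at the price of an extra reduction per forward pass.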
-
- 18 Apr, 2024 1 commit
-
-
Michael Goin authored
-
- 12 Apr, 2024 1 commit
-
-
Michael Feil authored
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
-
- 11 Apr, 2024 3 commits
-
-
Antoni Baum authored
-
Roger Wang authored
-
Kunshang Ji authored
-
- 10 Apr, 2024 3 commits
-
-
youkaichao authored
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
-
Travis Johnson authored
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
-
胡译文 authored
-
- 03 Apr, 2024 1 commit
-
-
Adrian Abeyta authored
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 29 Mar, 2024 1 commit
-
-
Roy authored
-
- 28 Mar, 2024 3 commits
-
-
Woosuk Kwon authored
-
Roger Wang authored
-
Woosuk Kwon authored
-
- 25 Mar, 2024 6 commits
-
-
Antoni Baum authored
-
Travis Johnson authored
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
-
Swapnil Parekh authored
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
-
SangBin Cho authored
-
Woosuk Kwon authored
-
Kunshang Ji authored
-
- 22 Mar, 2024 1 commit
-
-
Zhuohan Li authored
-
- 21 Mar, 2024 1 commit
-
-
SangBin Cho authored
-
- 20 Mar, 2024 3 commits
-
-
Roy authored
-
SangBin Cho authored
-
Antoni Baum authored
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
-
- 14 Mar, 2024 2 commits
-
-
Enrique Shockwave authored
-
youkaichao authored
[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389)
-
- 13 Mar, 2024 4 commits
-
-
Antoni Baum authored
-
Terry authored
-
Hui Liu authored
-
Woosuk Kwon authored
-
- 11 Mar, 2024 1 commit
-
-
Zhuohan Li authored
-
- 09 Mar, 2024 2 commits
-
-
Cade Daniel authored
-
Zhuohan Li authored
-
- 08 Mar, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 07 Mar, 2024 1 commit
-
-
Woosuk Kwon authored
-
- 05 Mar, 2024 1 commit
-
-
Nick Hill authored
-
- 04 Mar, 2024 1 commit
-
-
Antoni Baum authored
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
-
- 01 Mar, 2024 1 commit
-
-
Robert Shaw authored
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
-