- 05 Jun, 2024 1 commit
-
-
Cody Yu authored
-
- 03 Jun, 2024 1 commit
-
-
Tyler Michael Smith authored
-
- 01 Jun, 2024 2 commits
-
-
chenqianfzh authored
-
Tyler Michael Smith authored
-
- 31 May, 2024 1 commit
-
-
Robert Shaw authored
-
- 30 May, 2024 1 commit
-
-
Alexander Matveev authored
-
- 27 May, 2024 1 commit
-
-
sasha0552 authored
-
- 23 May, 2024 2 commits
-
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
Alexander Matveev authored
-
- 22 May, 2024 1 commit
-
-
Cody Yu authored
The 2nd PR for #4532. This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with a .kv_scale parameter).
-
- 19 May, 2024 1 commit
-
-
Alexander Matveev authored
-
- 16 May, 2024 3 commits
-
-
Alexander Matveev authored
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
-
Jinzhen Lin authored
-
alexm-nm authored
-
- 09 May, 2024 2 commits
-
-
Philipp Moritz authored
This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)). We noticed that cuBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16, so this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.

Benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4; all FP8 measurements are for dynamic quantization):
qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
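The enlargement step described above can be illustrated with a minimal sketch. It assumes the matrix is enlarged by appending zero rows until the first dimension exceeds 16 and that the extra output rows are discarded afterwards; the helper names `pad_rows` and `matmul` are hypothetical and the pure-Python matmul only stands in for the real GPU kernel. The actual PR may enlarge matrices differently.

```python
def pad_rows(matrix, min_rows=17):
    """Append zero rows until the matrix has at least min_rows rows.

    Returns the (possibly padded) matrix and the original row count.
    min_rows=17 is an assumption derived from "greater than 16".
    """
    n_rows = len(matrix)
    if n_rows >= min_rows:
        return matrix, n_rows
    n_cols = len(matrix[0]) if matrix else 0
    padding = [[0.0] * n_cols for _ in range(min_rows - n_rows)]
    return matrix + padding, n_rows


def matmul(a, b):
    """Naive matrix product, used here only as a stand-in for the GPU GEMM."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Two input rows are padded to 17 rows, multiplied, then trimmed back to 2.
x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
padded, n = pad_rows(x)
out = matmul(padded, w)[:n]
```

Because the padding rows are zero and are sliced off after the product, the result for the original rows is unchanged; only the shape seen by the kernel differs.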
-
Hao Zhang authored
Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com>
Co-authored-by: Aurick Qiao <qiao@aurick.net>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
- 02 May, 2024 1 commit
-
-
alexm-nm authored
-
- 30 Apr, 2024 2 commits
-
-
Robert Shaw authored
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
Kunshang Ji authored
-
- 29 Apr, 2024 2 commits
-
-
Robert Shaw authored
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
-
SangBin Cho authored
-
- 27 Apr, 2024 1 commit
-
-
Philipp Moritz authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 26 Apr, 2024 1 commit
-
-
Cody Yu authored
-
- 25 Apr, 2024 1 commit
-
-
Kunshang Ji authored
-
- 24 Apr, 2024 1 commit
-
-
Robert Shaw authored
Fixes the fp8 interface, which broke in the AQLM merge.
-
- 23 Apr, 2024 1 commit
-
-
James Fleming authored
Co-authored-by: mgoin <michael@neuralmagic.com>
-
- 20 Apr, 2024 2 commits
-
-
Noam Gat authored
-
Cody Yu authored
Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm: We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes the weights accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.

Initial results: Currently tested Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.
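As an illustration of the per-tensor scaling described above, here is a minimal pure-Python sketch. It assumes the E4M3 variant of FP8, whose largest finite value is 448 (the description does not say which FP8 format is used), and it emulates FP8 rounding in software. The function names are hypothetical and this is not the Fp8LinearMethod code itself.

```python
import math

FP8_E4M3_MAX = 448.0  # assumption: E4M3 format, largest finite value


def round_to_e4m3(x):
    """Round x to the nearest value on an emulated E4M3 grid (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))  # saturate instead of overflowing
    _, exp = math.frexp(x)  # |x| lies in [2**(exp-1), 2**exp)
    # Grid spacing: 4 significant bits for normal values, fixed 2**-9 below 2**-6.
    quantum = 2.0 ** (max(exp, -5) - 4)
    return round(x / quantum) * quantum


def quantize_per_tensor(weights):
    """Return (quantized values, scale) using one scale for the whole tensor."""
    amax = max((abs(w) for w in weights), default=0.0)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]


# The scale is computed once after loading and stored next to the weights.
q_weights, w_scale = quantize_per_tensor([0.5, -2.0, 1.0])
restored = dequantize(q_weights, w_scale)
```

For weights, the scale is computed once and reused. For activations, the same `quantize_per_tensor` step would run on every forward pass, which is what "dynamic" scaling refers to here.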
-
- 18 Apr, 2024 1 commit
-
-
Michael Goin authored
-
- 11 Apr, 2024 2 commits
-
-
Antoni Baum authored
-
Kunshang Ji authored
-
- 03 Apr, 2024 1 commit
-
-
Adrian Abeyta authored
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 25 Mar, 2024 1 commit
-
-
SangBin Cho authored
-
- 14 Mar, 2024 1 commit
-
-
Enrique Shockwave authored
-
- 11 Mar, 2024 1 commit
-
-
Zhuohan Li authored
-
- 01 Mar, 2024 1 commit
-
-
Robert Shaw authored
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
-
- 29 Feb, 2024 1 commit
-
-
CHU Tianxiang authored
-
- 12 Feb, 2024 1 commit
-
-
Rex authored
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>
-
- 01 Feb, 2024 1 commit
-
-
Kunshang Ji authored
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
-
- 27 Jan, 2024 1 commit
-
-
Casper authored
-