- 05 Aug, 2024 1 commit
-
-
Isotr0py authored
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
- 02 Aug, 2024 1 commit
-
-
Lucas Wilkinson authored
-
- 01 Aug, 2024 1 commit
-
-
Jee Jee Li authored
-
- 31 Jul, 2024 3 commits
-
-
HandH1998 authored
-
Cyrus Leung authored
-
Cyrus Leung authored
-
- 30 Jul, 2024 1 commit
-
-
Tyler Michael Smith authored
-
- 27 Jul, 2024 1 commit
-
-
Alexander Matveev authored
-
- 24 Jul, 2024 1 commit
-
-
Antoni Baum authored
-
- 21 Jul, 2024 1 commit
-
-
Alexander Matveev authored
-
- 20 Jul, 2024 2 commits
-
-
Robert Shaw authored
-
Varun Sundar Rabindranath authored
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
- 19 Jul, 2024 1 commit
-
-
Robert Shaw authored
-
- 18 Jul, 2024 1 commit
-
-
Varun Sundar Rabindranath authored
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
- 17 Jul, 2024 1 commit
-
-
Alexander Matveev authored
-
- 16 Jul, 2024 1 commit
-
-
Michael Goin authored
-
- 03 Jul, 2024 1 commit
-
-
Michael Goin authored
-
- 26 Jun, 2024 1 commit
-
-
Luka Govedič authored
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
-
- 20 Jun, 2024 2 commits
-
-
Tyler Michael Smith authored
-
Roger Wang authored
-
- 17 Jun, 2024 1 commit
-
-
Kunshang Ji authored
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
-
- 13 Jun, 2024 1 commit
-
-
Tyler Michael Smith authored
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
- 12 Jun, 2024 1 commit
-
-
youkaichao authored
-
- 09 Jun, 2024 1 commit
-
-
bnellnm authored
-
- 07 Jun, 2024 3 commits
-
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
Tyler Michael Smith authored
Switch from torch._scaled_mm to vLLM's CUTLASS FP8 kernels when supported, as we are seeing a 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8. See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
-
Jie Fu (傅杰) authored
-
- 03 Jun, 2024 1 commit
-
-
Tyler Michael Smith authored
-
- 25 May, 2024 1 commit
-
-
Eric Xihui Lin authored
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
- 23 May, 2024 1 commit
-
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
- 16 May, 2024 2 commits
-
-
Tyler Michael Smith authored
-
Alexander Matveev authored
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
-
- 10 May, 2024 2 commits
-
-
Kunshang Ji authored
-
Cody Yu authored
-
- 09 May, 2024 1 commit
-
-
Philipp Moritz authored
This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)). We noticed that cuBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16, so this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.

Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4). All FP8 measurements are for dynamic quantization:

qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
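
The idea of enlarging a matrix so that its first dimension exceeds 16 can be illustrated with a small sketch. The description above does not say how the enlargement is done, so the zero-padding below is an assumption, as are the names `MIN_ROWS`, `pad_rows`, and `padded_matmul`; a pure-Python matrix product stands in for the GPU GEMM. The padded rows only add zero rows to the product, which are sliced off afterwards.

```python
# Hypothetical illustration only; not the code from the PR.
MIN_ROWS = 17  # assumption: smallest row count strictly greater than 16


def pad_rows(x, min_rows=MIN_ROWS):
    """Append all-zero rows until x has at least min_rows rows.

    Returns the (possibly padded) matrix and the original row count.
    Assumes x is a non-empty list of equal-length rows.
    """
    rows = len(x)
    if rows >= min_rows:
        return x, rows
    width = len(x[0])
    padding = [[0.0] * width for _ in range(min_rows - rows)]
    return x + padding, rows


def matmul(a, b):
    """Plain list-of-lists matrix product, standing in for the GPU GEMM."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def padded_matmul(x, w):
    """Pad x along its first dimension, multiply, then drop the extra rows."""
    padded, rows = pad_rows(x)
    return matmul(padded, w)[:rows]
```

Because each output row depends only on the matching input row, slicing the result back to the original row count gives the same values as the unpadded product.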
-
- 03 May, 2024 2 commits
-
-
Lily Liu authored
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
-
SangBin Cho authored
-
- 02 May, 2024 1 commit
-
-
alexm-nm authored
-
- 30 Apr, 2024 1 commit
-
-
Kunshang Ji authored
-
- 27 Apr, 2024 1 commit
-
-
Philipp Moritz authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-