- 21 Jul, 2024 1 commit
-
-
Alexander Matveev authored
-
- 20 Jul, 2024 1 commit
-
-
Varun Sundar Rabindranath authored
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
- 18 Jul, 2024 1 commit
-
-
Varun Sundar Rabindranath authored
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
- 14 Jul, 2024 1 commit
-
-
Tyler Michael Smith authored
-
- 03 Jul, 2024 1 commit
-
-
Michael Goin authored
-
- 28 Jun, 2024 1 commit
-
-
Tyler Michael Smith authored
-
- 26 Jun, 2024 1 commit
-
-
Luka Govedič authored
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
-
- 23 Jun, 2024 1 commit
-
-
Varun Sundar Rabindranath authored
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
- 20 Jun, 2024 3 commits
-
-
Tyler Michael Smith authored
-
Varun Sundar Rabindranath authored
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
Varun Sundar Rabindranath authored
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
- 18 Jun, 2024 1 commit
-
-
Tyler Michael Smith authored
-
- 14 Jun, 2024 2 commits
-
-
Tyler Michael Smith authored
-
Tyler Michael Smith authored
-
- 13 Jun, 2024 1 commit
-
-
Tyler Michael Smith authored
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
- 12 Jun, 2024 1 commit
-
-
Cody Yu authored
Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. A microbenchmark shows that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large). In detail, we applied 3 optimizations:

- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve ILP.
- Use 4-element vectorized loads and stores to transfer data between HBM and SRAM.
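The first two optimizations above can be sketched on the CPU. This is only an illustration of the loop structure, not the kernel from the PR: the function names are hypothetical, the FP8 rounding and clamping steps are left out, and plain Python gains no speed from any of this. The 4-element slice stands in for a 4-wide vectorized load and store.

```python
def scale_by_division(values, scale):
    """Baseline: one division per element."""
    return [v / scale for v in values]


def scale_by_inverted_scale(values, scale):
    """One division in total, then a loop unrolled by 4 (hypothetical sketch)."""
    inv_scale = 1.0 / scale            # the only division, hoisted out of the loop
    n = len(values)
    out = [0.0] * n
    n_vec = n - n % 4                  # largest multiple of 4 that fits in n
    for i in range(0, n_vec, 4):
        # Stands in for one 4-wide load from memory.
        v0, v1, v2, v3 = values[i:i + 4]
        # Four independent multiplications, then one 4-wide store.
        out[i:i + 4] = (v0 * inv_scale, v1 * inv_scale,
                        v2 * inv_scale, v3 * inv_scale)
    for i in range(n_vec, n):          # scalar tail for the leftover elements
        out[i] = values[i] * inv_scale
    return out
```

Multiplying by the reciprocal can differ from a true division in the last bit when the scale is not a power of two, which is usually acceptable before rounding to 8 bits.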
-
- 09 Jun, 2024 1 commit
-
-
bnellnm authored
-
- 07 Jun, 2024 1 commit
-
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
- 05 Jun, 2024 1 commit
-
-
Tyler Michael Smith authored
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
- 03 Jun, 2024 1 commit
-
-
Tyler Michael Smith authored
-
- 01 Jun, 2024 3 commits
-
-
Varun Sundar Rabindranath authored
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
-
Tyler Michael Smith authored
-
Tyler Michael Smith authored
-
- 31 May, 2024 2 commits
-
-
Simon Mo authored
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" (#5149)
-
Alexander Matveev authored
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (#5136)
-
- 23 May, 2024 2 commits
-
-
Dipika Sikka authored
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
-
Alexander Matveev authored
-
- 22 May, 2024 2 commits
-
-
Tyler Michael Smith authored
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
-
Michael Goin authored
-
- 20 May, 2024 1 commit
-
-
Alexander Matveev authored
-
- 16 May, 2024 3 commits
-
-
Tyler Michael Smith authored
-
Alexander Matveev authored
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
-
Jinzhen Lin authored
-
- 10 May, 2024 1 commit
-
-
Cody Yu authored
-
- 09 May, 2024 1 commit
-
-
alexm-nm authored
-
- 07 May, 2024 1 commit
-
-
Philipp Moritz authored
Previously, FP8 static scaling worked only if the scales overestimated the maxima of all activation tensors during computation. However, this is not always the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale (which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:

| Groups            |Version|Filter|n-shot|Metric|Value |   |Stderr|
|-------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu               |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities      |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other           |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences |N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem            |N/A    |none  |     5|acc   |0.2125|±  |0.0073|

With the fix in this PR, where the scaled activations are clamped to [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is:

| Groups            |Version|Filter|n-shot|Metric|Value |   |Stderr|
|-------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu               |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities      |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other           |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences |N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem            |N/A    |none  |     5|acc   |0.6115|±  |0.0083|

This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.
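A minimal sketch of why the clamp matters, assuming that a scaled value outside the finite float8_e4m3fn range turns into NaN, as the description above implies. The function name and the NaN model are illustrative assumptions, not the PR's code; 448.0 is the largest finite float8_e4m3fn value, and the rounding to 8 bits is omitted.

```python
import math

FP8_E4M3FN_MAX = 448.0  # largest finite float8_e4m3fn value


def scale_static(values, scale, clamp):
    """Apply a fixed (static) scale; optionally clamp to the finite FP8 range.

    Hypothetical helper: out-of-range results are modelled as NaN.
    """
    out = []
    for v in values:
        x = v / scale
        if clamp:
            x = max(-FP8_E4M3FN_MAX, min(FP8_E4M3FN_MAX, x))
        # Model the overflow: nothing beyond the finite range is representable.
        out.append(x if abs(x) <= FP8_E4M3FN_MAX else math.nan)
    return out
```

With a scale of 1/128, calibration covered activations up to 3.5 in magnitude (3.5 * 128 = 448). An unseen activation of 10.0 overflows without the clamp and saturates to 448.0 with it:

```python
scale = 1.0 / 128.0
acts = [1.0, -3.5, 10.0]
unclamped = scale_static(acts, scale, clamp=False)  # last entry is NaN
clamped = scale_static(acts, scale, clamp=True)     # last entry is 448.0
```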
-
- 02 May, 2024 1 commit
-
-
alexm-nm authored
-
- 29 Apr, 2024 1 commit
-
-
Robert Shaw authored
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
-
- 27 Apr, 2024 1 commit
-
-
Philipp Moritz authored
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
-
- 24 Apr, 2024 1 commit
-
-
alexm-nm authored
This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187. The reason for the crash was the inline PTX assembly that introduced the async_copy with streaming behavior. The solution is to use the more standard PTX for async_copy (without the fractional L2 policy for "evict_first"). There is no performance difference between standard async_copy PTX and the previous one.
-