- 09 May, 2024 5 commits
-
-
Woosuk Kwon authored
-
Woosuk Kwon authored
-
Cyrus Leung authored
-
Mahmoud Ashraf authored
Co-authored-by:Michael Goin <michael@neuralmagic.com>
-
alexm-nm authored
-
- 08 May, 2024 11 commits
-
-
Cade Daniel authored
-
Cody Yu authored
Co-authored-by:Cade Daniel <edacih@gmail.com>
-
Woosuk Kwon authored
-
youkaichao authored
-
youkaichao authored
-
Antoni Baum authored
-
Woosuk Kwon authored
-
DefTruth authored
-
SangBin Cho authored
-
SangBin Cho authored
-
youkaichao authored
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660)
-
- 07 May, 2024 6 commits
-
-
leiwen83 authored
Co-authored-by:Lei Wen <wenlei03@qiyi.com>
Co-authored-by:Cade Daniel <edacih@gmail.com>
Co-authored-by:Cody Yu <hao.yu.cody@gmail.com>
-
youkaichao authored
-
Austin Veselka authored
-
Alexei-V-Ivanov-AMD authored
-
youkaichao authored
-
Philipp Moritz authored
Previously, FP8 static scaling worked only if the scales overestimated the maxima of all activation tensors during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale (which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:

| Groups           |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|

With the fix in this PR, where the scaled activations are clamped to `[-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()]` to make sure there are no NaNs, the performance is:

| Groups           |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|

This is not perfect yet, but it is getting very close to the FP16 / dynamic activation scale performance.
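The clamping idea in this commit message can be illustrated with a small pure-Python sketch. This is not the kernel code from the PR: the function name `quantize_static`, the example scale, and the example activations are hypothetical, rounding to the FP8 grid is omitted, and out-of-range values are simply modeled as NaN to mirror the failure the message describes. The only fixed fact used is that the largest finite `Float8_e4m3fn` value is 448.

```python
import math

# Largest finite value representable in the FP8 e4m3fn format.
FP8_E4M3_MAX = 448.0


def quantize_static(activations, scale, clamp=True):
    """Toy static quantization: divide by a fixed scale, optionally clamp.

    Assumption for illustration: a scaled value outside the FP8 range
    becomes NaN when no clamp is applied. Rounding is not modeled.
    """
    out = []
    for x in activations:
        q = x / scale
        if clamp:
            # Saturate to the representable range instead of overflowing.
            q = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, q))
        elif abs(q) > FP8_E4M3_MAX:
            q = math.nan
        out.append(q)
    return out


# Hypothetical calibrated scale: covers activations up to |x| <= 224.
scale = 0.5
# The last activation exceeds what calibration anticipated.
acts = [100.0, -200.0, 300.0]

unclamped = quantize_static(acts, scale, clamp=False)
clamped = quantize_static(acts, scale, clamp=True)
```

In this toy model the unclamped path produces a NaN for the outlier, while the clamped path saturates it at 448 and loses only the overshoot, which is consistent with the accuracy recovery reported in the tables.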
-
- 06 May, 2024 4 commits
-
-
Noam Gat authored
-
Cade Daniel authored
-
Simon Mo authored
-
Cyrus Leung authored
-
- 05 May, 2024 3 commits
-
-
zhaoyang-star authored
-
Simon Mo authored
-
Simon Mo authored
-
- 04 May, 2024 5 commits
-
-
DearPlanet authored
-
Simon Mo authored
-
Michael Goin authored
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

* Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
* Supports static or dynamic activation quantization with static weight quantization (all per tensor)
* Supports different scales for each expert weight
* Supports Fp8 in QKV layer

Notes:

* The Expert Gate/Router always runs at half / full precision for now.
* If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.
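The re-quantization note above (bringing separately scaled Q, K, and V weights onto one shared scale) can be sketched in plain Python. This is only an illustration under stated assumptions: `requantize_to_shared_scale` is a hypothetical helper, the weights are short lists rather than tensors, and rounding back to the FP8 grid is left out. The one detail taken from the commit message is that the shared scale is the maximum of the per-shard scales.

```python
# Largest finite value representable in the FP8 e4m3fn format.
FP8_E4M3_MAX = 448.0


def requantize_to_shared_scale(q_shards, scales):
    """Re-express per-shard quantized weights under one shared scale.

    q_shards: list of lists of quantized values, one list per shard
              (for example Q, K, V), each quantized with its own scale.
    scales:   per-shard scales, so real_value = quantized_value * scale.
    Returns the re-quantized shards and the shared (maximum) scale.
    Rounding to the FP8 grid is not modeled in this sketch.
    """
    shared = max(scales)
    out = []
    for q_vals, s in zip(q_shards, scales):
        # Dequantize with the shard's own scale, then quantize with the
        # shared one. Because s <= shared, magnitudes can only shrink,
        # so values stay within the FP8 range.
        out.append([(v * s) / shared for v in q_vals])
    return out, shared


# Hypothetical quantized Q and K shards with different scales.
q_shards = [[448.0, -224.0], [100.0]]
scales = [0.5, 1.0]

requantized, shared_scale = requantize_to_shared_scale(q_shards, scales)
```

With a single scale for all shards, the concatenated weight can be used in one matrix multiplication. The cost is that shards with a smaller original scale use less of the FP8 range afterwards.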
-
SangBin Cho authored
-
Cody Yu authored
-
- 03 May, 2024 6 commits
-
-
youkaichao authored
-
Cade Daniel authored
-
Lily Liu authored
Co-authored-by:LiuXiaoxuanPKU <llilyliupku@gmail.com>
-
Sebastian Schoennenbeck authored
-
Michael Goin authored
-
SangBin Cho authored
-