- 13 May, 2024 1 commit
-
-
Swapnil Parekh authored
-
- 12 May, 2024 1 commit
-
-
Yikang Shen authored
-
- 11 May, 2024 1 commit
-
-
Chang Su authored
-
- 10 May, 2024 5 commits
-
-
youkaichao authored
-
SangBin Cho authored
Storing an exception frame is extremely prone to circular references because the frame holds references to objects. When tensorizer is not installed, this leaks the LLM instance, because the error frame references various modules and so creates a reference cycle. I also found that spec decoding has a circular reference issue, which I solved using weakref.proxy.
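The weakref.proxy approach mentioned above can be illustrated with a minimal sketch. The `Engine` and `Worker` classes and the `freed_by_refcount_alone` helper are hypothetical stand-ins, not names from the codebase, and the behaviour shown assumes CPython's reference counting.

```python
import gc
import weakref


class Worker:
    """Hypothetical child object that needs to call back into its owner."""

    def __init__(self, engine):
        self.engine = engine


class Engine:
    """Hypothetical owner object, standing in for the LLM instance."""

    def __init__(self, use_proxy):
        # With a proxy, the worker does not keep the engine alive, so
        # engine -> worker -> engine is no longer a strong reference cycle.
        back_ref = weakref.proxy(self) if use_proxy else self
        self.worker = Worker(back_ref)


def freed_by_refcount_alone(use_proxy):
    """Report whether dropping the last name frees the engine immediately."""
    was_enabled = gc.isenabled()
    gc.disable()  # rule out the cycle collector; rely on refcounting only
    try:
        engine = Engine(use_proxy)
        tracker = weakref.ref(engine)
        del engine
        return tracker() is None
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # clean up the cycle left by the strong-reference case
```

With a strong back-reference the engine survives until the cycle collector runs; with the proxy it is released as soon as the last external reference goes away.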
-
Kunshang Ji authored
-
youkaichao authored
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591)
-
Cody Yu authored
-
- 09 May, 2024 6 commits
-
-
Philipp Moritz authored
This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)). We noticed that CUBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16. So this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.
Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4), all FP8 measurements are for dynamic quantization:
qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
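The idea of enlarging a matrix so that its first dimension exceeds 16 can be sketched in plain Python. This is an illustration only, not the PR's implementation: `pad_rows`, `matmul`, `padded_matmul`, and the default of 17 rows are assumptions chosen to satisfy "greater than 16", and nested lists stand in for GPU tensors.

```python
def pad_rows(matrix, min_rows=17):
    """Return a copy of `matrix` zero-padded to at least `min_rows` rows."""
    if not matrix:
        raise ValueError("matrix must have at least one row")
    num_cols = len(matrix[0])
    missing = max(0, min_rows - len(matrix))
    padding = [[0.0] * num_cols for _ in range(missing)]
    return [list(row) for row in matrix] + padding


def matmul(a, b):
    """Naive matrix product of two nested-list matrices."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def padded_matmul(a, b, min_rows=17):
    """Multiply with a padded left operand, then drop the padding rows."""
    out = matmul(pad_rows(a, min_rows), b)
    return out[: len(a)]
```

The padding rows are all zeros and are sliced off afterwards, so the result matches the unpadded product; only the shape seen by the GEMM routine changes.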
-
Hao Zhang authored
Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com>
Co-authored-by: Aurick Qiao <qiao@aurick.net>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
Michael Goin authored
-
Woosuk Kwon authored
-
Cyrus Leung authored
-
Mahmoud Ashraf authored
Co-authored-by: Michael Goin <michael@neuralmagic.com>
-
- 08 May, 2024 10 commits
-
-
Cade Daniel authored
-
Cody Yu authored
Co-authored-by: Cade Daniel <edacih@gmail.com>
-
Woosuk Kwon authored
-
youkaichao authored
-
Antoni Baum authored
-
Woosuk Kwon authored
-
DefTruth authored
-
SangBin Cho authored
-
SangBin Cho authored
-
youkaichao authored
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660)
-
- 07 May, 2024 4 commits
-
-
leiwen83 authored
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
youkaichao authored
-
Austin Veselka authored
-
youkaichao authored
-
- 06 May, 2024 1 commit
-
-
Cyrus Leung authored
-
- 05 May, 2024 2 commits
-
-
zhaoyang-star authored
-
Simon Mo authored
-
- 04 May, 2024 4 commits
-
-
DearPlanet authored
-
Michael Goin authored
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)
Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436.
This PR enables the following checkpoint loading features for Mixtral:
  - Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
  - Supports static or dynamic activation quantization with static weight quantization (all per tensor)
  - Supports different scales for each expert weight
  - Supports Fp8 in QKV layer
Notes:
  - The Expert Gate/Router always runs at half / full precision for now.
  - If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.
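The re-quantization note above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's code: `requantize_with_max_scale` is a hypothetical helper, plain floats stand in for FP8 values, and rounding onto the FP8 grid is omitted.

```python
def requantize_with_max_scale(quantized_shards, scales):
    """Re-express per-shard quantized weights under one shared scale.

    Each shard is first dequantized with its own scale (q * scale), then
    divided by the largest scale so that all shards share a single scale.
    Rounding to an actual FP8 grid is deliberately left out of this sketch.
    """
    max_scale = max(scales)
    merged = [
        [q * scale / max_scale for q in shard]
        for shard, scale in zip(quantized_shards, scales)
    ]
    return merged, max_scale
```

Using the largest scale means every re-quantized value is no larger in magnitude than the original quantized value, so nothing falls outside the range the original values already fit in. For example, `requantize_with_max_scale([[4.0, -8.0], [2.0]], [0.5, 1.0])` yields shards that dequantize to the same weights under the shared scale of 1.0.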
-
SangBin Cho authored
-
Cody Yu authored
-
- 03 May, 2024 5 commits
-
-
youkaichao authored
-
Cade Daniel authored
-
Lily Liu authored
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
-
Sebastian Schoennenbeck authored
-
Michael Goin authored
-