- 11 May, 2024 1 commit
-
-
Chang Su authored
-
- 10 May, 2024 7 commits
-
-
youkaichao authored
-
Robert Shaw authored
-
heeju-kim2 authored
Co-authored-by: Cade Daniel <edacih@gmail.com>
-
SangBin Cho authored
Storing an exception frame is extremely prone to circular references because the frame holds references to objects. When tensorizer is not installed, this leaks the llm instance, because the error frame references various modules and so creates a reference cycle. I also found that spec decoding has a circular reference issue, which I solved using weakref.proxy.
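The weakref.proxy approach mentioned above can be illustrated with a minimal sketch. The `Engine` and `Worker` classes and the `demo` helper are hypothetical names chosen for illustration, not the classes touched by this commit, and the immediate-release behavior shown assumes CPython's reference counting.

```python
import gc
import weakref


class Worker:
    """Hypothetical child object that needs to reach back to its owner."""

    def __init__(self, owner, use_proxy):
        # weakref.proxy forwards attribute access to the owner without
        # holding a strong reference, so no owner <-> worker cycle forms.
        self.owner = weakref.proxy(owner) if use_proxy else owner

    def describe(self):
        return f"worker of {self.owner.name}"


class Engine:
    """Hypothetical owner object that holds a strong reference to its worker."""

    def __init__(self, name, use_proxy):
        self.name = name
        self.worker = Worker(self, use_proxy)


def demo(use_proxy):
    """Return (description, freed): whether the engine is released by
    reference counting alone once its last external reference is dropped."""
    was_enabled = gc.isenabled()
    gc.disable()  # rule out the cycle collector; only refcounting may free
    try:
        engine = Engine("llm", use_proxy)
        description = engine.worker.describe()
        probe = weakref.ref(engine)  # observes the engine without owning it
        del engine
        freed = probe() is None
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # clean up the cycle left behind by the strong variant
    return description, freed
```

With `use_proxy=True` the engine should be released as soon as its last strong reference goes away; with `use_proxy=False` the cycle keeps it alive until the garbage collector runs. One trade-off of a proxy is that using it after the owner is gone raises `ReferenceError`, so the owner must outlive every use of the back-reference.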
-
Allen.Dou authored
-
youkaichao authored
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591)
-
Cody Yu authored
-
- 09 May, 2024 3 commits
-
-
Woosuk Kwon authored
-
Woosuk Kwon authored
-
Cyrus Leung authored
-
- 08 May, 2024 6 commits
-
-
Cody Yu authored
Co-authored-by: Cade Daniel <edacih@gmail.com>
-
youkaichao authored
-
youkaichao authored
-
DefTruth authored
-
SangBin Cho authored
-
youkaichao authored
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660)
-
- 07 May, 2024 3 commits
-
-
leiwen83 authored
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
youkaichao authored
-
youkaichao authored
-
- 04 May, 2024 3 commits
-
-
DearPlanet authored
-
Michael Goin authored
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:
  - Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
  - Supports static or dynamic activation quantization with static weight quantization (all per tensor)
  - Supports different scales for each expert weight
  - Supports Fp8 in QKV layer

Notes:
  - The Expert Gate/Router always runs at half / full precision for now.
  - If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.
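The re-quantization note above can be sketched in plain Python. This is only an illustration under stated assumptions: `requantize_to_shared_scale` is a hypothetical helper, not a function from this PR, each shard is modeled as a list of already-quantized values with one per-tensor scale, and FP8 rounding and clamping are deliberately omitted.

```python
def requantize_to_shared_scale(shards):
    """Re-express per-shard quantized weights under one shared scale.

    shards: list of (quantized_values, scale) pairs, e.g. one pair each
    for separate Q, K and V weights. Hypothetical layout for illustration.
    """
    # Pick the largest scale, mirroring the weight_scale.max() idea.
    shared_scale = max(scale for _, scale in shards)
    requantized = []
    for values, scale in shards:
        # Dequantize with the shard's own scale, then quantize again with
        # the shared one. A real FP8 path would also round and clamp here.
        requantized.append([v * scale / shared_scale for v in values])
    return requantized, shared_scale


# Two shards with different scales collapse to a single scale of 1.0.
shards = [([2.0, -4.0], 0.5), ([1.0, 3.0], 1.0)]
requantized, shared_scale = requantize_to_shared_scale(shards)
```

Because the shared scale is the maximum, every re-quantized value is no larger in magnitude than the original, so it stays within the representable range, and the fused weights can then be multiplied in one gemm with a single scale.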
-
Cody Yu authored
-
- 03 May, 2024 5 commits
-
-
Cade Daniel authored
-
Lily Liu authored
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
-
Sebastian Schoennenbeck authored
-
SangBin Cho authored
-
youkaichao authored
-
- 02 May, 2024 7 commits
-
-
SangBin Cho authored
-
Michał Moskal authored
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
-
alexm-nm authored
-
youkaichao authored
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
-
Ronen Schaffer authored
-
SangBin Cho authored
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451)
-
Danny Guinther authored
-
- 01 May, 2024 5 commits
-
-
sasha0552 authored
-
Nick Hill authored
-
leiwen83 authored
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
-
leiwen83 authored
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
-
SangBin Cho authored
-