- 03 May, 2024 9 commits
-
Cade Daniel authored
-
Lily Liu authored
Co-authored-by:LiuXiaoxuanPKU <llilyliupku@gmail.com>
-
Sebastian Schoennenbeck authored
-
Michael Goin authored
-
SangBin Cho authored
-
youkaichao authored
-
DefTruth authored
-
Yang, Bo authored
-
youkaichao authored
-
- 02 May, 2024 13 commits
-
SangBin Cho authored
-
Alexei-V-Ivanov-AMD authored
Co-authored-by:simon-mo <simon.mo@hey.com>
-
Michał Moskal authored
Co-authored-by:SangBin Cho <rkooo567@gmail.com>
-
youkaichao authored
-
Mark McLoughlin authored
-
Hu Dong authored
-
SangBin Cho authored
-
alexm-nm authored
-
youkaichao authored
Co-authored-by:Zhuohan Li <zhuohan123@gmail.com>
-
SangBin Cho authored
Co-authored-by:Cade Daniel <edacih@gmail.com>
-
Ronen Schaffer authored
-
SangBin Cho authored
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451)
-
Danny Guinther authored
-
- 01 May, 2024 18 commits
-
Woosuk Kwon authored
-
Philipp Moritz authored
Remove the device="cuda" declarations in mixtral as promised in #4343
-
youkaichao authored
-
Roy authored
-
sasha0552 authored
-
Philipp Moritz authored
This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. Note that for the configuration I removed num_warps and num_stages for small batch sizes since that improved performance and brought the benchmarks on par with the numbers before in that regime to make sure this is a strict improvement over the status quo.

All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens.

Before this PR (with static activation scaling):
qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.1 ms ITL, 0.52s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 14.0 ms ITL, 0.70s e2e latency
qps = 10: 15.7 ms ITL, 0.79s e2e latency

After this PR (with static activation scaling):
qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.2 ms ITL, 0.53s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 11.9 ms ITL, 0.59s e2e latency
qps = 10: 12.1 ms ITL, 0.61s e2e latency
-
Nick Hill authored
-
leiwen83 authored
Co-authored-by:Lei Wen <wenlei03@qiyi.com>
Co-authored-by:Sage Moore <sagemoore@utexas.edu>
-
leiwen83 authored
Co-authored-by:Lei Wen <wenlei03@qiyi.com>
-
Travis Johnson authored
Signed-off-by:Travis Johnson <tsjohnso@us.ibm.com>
-
sasha0552 authored
-
Frαnçois authored
-
Robert Shaw authored
-
AnyISalIn authored
[Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. (#4173)
Signed-off-by:AnyISalIn <anyisalin@gmail.com>
-
SangBin Cho authored
-
Jee Li authored
-
Robert Caulk authored
-
Pastel! authored
-