- 02 May, 2024 7 commits
-
-
SangBin Cho authored
-
alexm-nm authored
-
youkaichao authored
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
-
SangBin Cho authored
Co-authored-by: Cade Daniel <edacih@gmail.com>
-
Ronen Schaffer authored
-
SangBin Cho authored
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451)
-
Danny Guinther authored
-
- 01 May, 2024 21 commits
-
-
Woosuk Kwon authored
-
Philipp Moritz authored
Remove the device="cuda" declarations in mixtral as promised in #4343
-
youkaichao authored
-
Roy authored
-
sasha0552 authored
-
Philipp Moritz authored
This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. Note that in the configuration I removed num_warps and num_stages for small batch sizes, since that improved performance and brought the benchmarks in that regime on par with the earlier numbers, which makes this a strict improvement over the status quo.

All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens.

Before this PR (with static activation scaling):
qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.1 ms ITL, 0.52s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 14.0 ms ITL, 0.70s e2e latency
qps = 10: 15.7 ms ITL, 0.79s e2e latency

After this PR (with static activation scaling):
qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.2 ms ITL, 0.53s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 11.9 ms ITL, 0.59s e2e latency
qps = 10: 12.1 ms ITL, 0.61s e2e latency
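The latency figures quoted in the commit message above can be turned into relative changes with a few lines of arithmetic. The sketch below is illustrative only: the dictionary names and the `relative_change` helper are made up for this example and are not part of the PR or of vLLM, and ITL is read here as inter-token latency in milliseconds.

```python
# ITL values in ms per qps level, copied from the commit message above.
# The variable names are hypothetical and used only for this illustration.
itl_before = {1: 9.8, 2: 9.7, 4: 10.1, 6: 11.9, 8: 14.0, 10: 15.7}
itl_after = {1: 9.8, 2: 9.7, 4: 10.2, 6: 11.9, 8: 11.9, 10: 12.1}


def relative_change(before, after):
    """Return the percent change from before to after (negative means lower latency)."""
    return (after - before) / before * 100.0


# Percent change in ITL for every qps level listed in the message.
changes = {
    qps: relative_change(itl_before[qps], itl_after[qps])
    for qps in itl_before
}

for qps, pct in changes.items():
    print(f"qps = {qps}: {pct:+.1f}% ITL")
```

On these numbers the change is within about 1% for qps up to 6, while ITL drops by roughly 15% at qps = 8 and roughly 23% at qps = 10.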
-
Nick Hill authored
-
leiwen83 authored
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
-
leiwen83 authored
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
-
Travis Johnson authored
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
-
sasha0552 authored
-
Frαnçois authored
-
Robert Shaw authored
-
AnyISalIn authored
[Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. (#4173)
Signed-off-by: AnyISalIn <anyisalin@gmail.com>
-
SangBin Cho authored
-
Jee Li authored
-
Robert Caulk authored
-
Pastel! authored
-
harrywu authored
-
Nick Hill authored
-
fuchen.ljl authored
Co-authored-by: Simon Mo <simon.mo@hey.com>
-
- 30 Apr, 2024 10 commits
-
-
fuchen.ljl authored
-
Li, Jiang authored
-
Alpay Ariyak authored
-
Florian Greinacher authored
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
-
Robert Shaw authored
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
-
Prashant Gupta authored
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
-
leiwen83 authored
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
-
Kunshang Ji authored
-
Woosuk Kwon authored
-
Michael Goin authored
-
- 29 Apr, 2024 2 commits
-
-
youkaichao authored
-
Simon Mo authored
-