Commits · e9899fb7a4d9e032198d26ef84f1dd2cfd9621aa · OpenDAS / vllm_cscc

31 May, 2024 1 commit
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) · e9899fb7
  Cody Yu authored May 31, 2024
  
01 May, 2024 1 commit

- [Kernel] Update fused_moe tuning script for FP8 (#4457) · 24bb4fe4
  Philipp Moritz authored May 01, 2024

This PR updates the tuning script for the fused_moe kernel to support FP8 and adds configurations for TP4. For small batch sizes, I removed num_warps and num_stages from the configuration: doing so improved performance and brought the benchmarks in that regime back on par with the earlier numbers, so that this PR is a strict improvement over the status quo.
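To illustrate what dropping num_warps and num_stages for small batch sizes could look like, here is a minimal sketch. It assumes the tuned configurations are stored as a mapping from batch size to kernel parameters and that the entry nearest to the actual batch size is used. The tile parameter names, the values, and the helpers `select_config` and `launch_hints` are hypothetical and are not taken from this PR; only num_warps and num_stages are named in the description above.

```python
import json

# Hypothetical tuned configs keyed by batch size. All values are illustrative.
# The small-batch entry omits num_warps/num_stages, so the kernel launcher
# would fall back to its defaults for those two settings.
EXAMPLE_CONFIGS = json.loads("""
{
  "1":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 1},
  "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,  "GROUP_SIZE_M": 8,
         "num_warps": 4, "num_stages": 4}
}
""")


def select_config(configs, batch_size):
    """Return the config whose batch-size key is closest to batch_size."""
    by_size = {int(key): value for key, value in configs.items()}
    nearest = min(by_size, key=lambda size: abs(size - batch_size))
    return by_size[nearest]


def launch_hints(config):
    """Extract only the optional launch hints that the config actually sets."""
    return {name: config[name] for name in ("num_warps", "num_stages") if name in config}


small = select_config(EXAMPLE_CONFIGS, 2)    # nearest key is 1
large = select_config(EXAMPLE_CONFIGS, 100)  # nearest key is 64
print(launch_hints(small))
print(launch_hints(large))
```

With this layout, a small batch passes no explicit launch hints, while a large batch still carries its tuned num_warps and num_stages.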

All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens.

Before this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.1 ms ITL, 0.52s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 14.0 ms ITL, 0.70s e2e latency
qps = 10: 15.7 ms ITL, 0.79s e2e latency

After this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.2 ms ITL, 0.53s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 11.9 ms ITL, 0.59s e2e latency
qps = 10: 12.1 ms ITL, 0.61s e2e latency
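
The before and after numbers can be compared directly. The sketch below copies the ITL values from the two lists above and prints the relative change per qps level; the helper name `relative_change` is only illustrative.

```python
# (qps, ITL in ms before this PR, ITL in ms after this PR), copied from the lists above.
ITL_MS = [
    (1, 9.8, 9.8),
    (2, 9.7, 9.7),
    (4, 10.1, 10.2),
    (6, 11.9, 11.9),
    (8, 14.0, 11.9),
    (10, 15.7, 12.1),
]


def relative_change(before, after):
    """Relative change in percent; negative values mean lower latency."""
    return (after - before) / before * 100.0


for qps, before, after in ITL_MS:
    print(f"qps = {qps}: {relative_change(before, after):+.1f}% ITL")
```

The gains are concentrated at higher load: ITL drops by about 15% at qps = 8 and about 23% at qps = 10, while the values at qps = 1 to 6 are unchanged or within 0.1 ms.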


25 Mar, 2024 1 commit
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
14 Mar, 2024 1 commit
- [Kernel] change benchmark script so that result can be directly used; tune moe... · 8fe83865
  youkaichao authored Mar 14, 2024
```
[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389)
```
26 Feb, 2024 1 commit
- Optimize Triton MoE Kernel (#2979) · cfc15a10
  Philipp Moritz authored Feb 26, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```