## Tuning Triton MoE Kernels

This directory contains tuning and benchmarking tools for MoE (Mixture of Experts) kernels.

### Tuning Tool

- `tuning_fused_moe_triton.py`: A tool for tuning the `fused_moe_triton` kernel. Adapted from [vllm's benchmark_moe.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py), with added support for various model architectures.

Example usage:
```bash
# Tune Mixtral-8x7B with default settings
python benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tune

# Tune Qwen2-57B with FP8 and TP=4
python benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py \
    --model Qwen/Qwen2-57B-A14B-Instruct \
    --tp-size 4 \
    --dtype fp8_w8a8 \
    --tune

# Tune Qwen3-235B-A22B-FP8 with TP=4
python benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py \
    --model Qwen/Qwen3-235B-A22B-FP8 \
    --tp-size 4 \
    --dtype fp8_w8a8 \
    --tune

# Tune DeepSeek-V3 with FP8 and TP=8
python benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py \
    --model deepseek-ai/DeepSeek-V3-0324 \
    --tp-size 8 \
    --dtype fp8_w8a8 \
    --tune

# Tune DeepSeek-R1 with channel-wise INT8 and TP=16
python benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py \
    --model meituan/DeepSeek-R1-Channel-INT8 \
    --tp-size 16 \
    --dtype int8_w8a8 \
    --tune
```

After tuning, a configuration file (e.g., `E=64,N=640,device_name=NVIDIA_GeForce_RTX_4090,dtype=fp8_w8a8.json`) will be generated in the current directory. You can move this file to the `sglang/srt/layers/fused_moe_triton/configs/triton_version` directory so that `sglang` can use it.
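The install step above can be sketched as follows. Note this is illustrative only: the filename is an example tuning output, and `triton_version` is a placeholder for your actual Triton version directory.

```shell
# Illustrative only: the filename below is an example output of --tune,
# and "triton_version" stands in for your Triton version directory.
config="E=64,N=640,device_name=NVIDIA_GeForce_RTX_4090,dtype=fp8_w8a8.json"
dest="sglang/srt/layers/fused_moe_triton/configs/triton_version"

touch "$config"   # stand-in for the file generated by the tuning run
mkdir -p "$dest"
mv "$config" "$dest/"
```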

### Performance Comparison Tool

- `benchmark_vllm_vs_sglang_fused_moe_triton.py`: A tool for comparing the performance of fused MoE kernels between the vllm and sglang implementations. Supports various model architectures and data types.

Example usage:
```bash
# Compare with default settings (Mixtral model)
python benchmark/kernels/fused_moe_triton/benchmark_vllm_vs_sglang_fused_moe_triton.py

# Compare with FP8 mode for Qwen2-57B
python benchmark/kernels/fused_moe_triton/benchmark_vllm_vs_sglang_fused_moe_triton.py \
    --model Qwen/Qwen2-57B-A14B-Instruct \
    --use-fp8-w8a8

# Compare with custom TP size
python benchmark/kernels/fused_moe_triton/benchmark_vllm_vs_sglang_fused_moe_triton.py \
    --model deepseek-ai/DeepSeek-V3-0324 \
    --tp-size 8

```

The benchmark results will be saved as plots and data files in the specified output directory (default: `./configs/benchmark_ops/vllm_sglang_fused_moe/`).

- `benchmark_torch_compile_fused_moe.py`: A tool for benchmarking the fused MoE kernel compiled with `torch.compile` against the original fused MoE kernel.

Usage is the same as for `benchmark_vllm_vs_sglang_fused_moe_triton.py`. Note that `torch.compile` does not support the `fp8_w8a8` and `int8_w8a8` fused MoE kernels.
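
For example (the flags below are assumed to mirror the comparison tool above; verify against the script's `--help`):

```bash
# Benchmark torch.compile vs. the original kernel (non-quantized dtypes only)
python benchmark/kernels/fused_moe_triton/benchmark_torch_compile_fused_moe.py \
    --model Qwen/Qwen2-57B-A14B-Instruct \
    --tp-size 4
```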