# How to reproduce the benchmark results of SGLang

## Prerequisite

### Install the latest SGLang

```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip setuptools wheel
pip install -e "python[all]"

pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
```
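
You can optionally sanity-check the installation before proceeding (a minimal check, assuming both packages import cleanly and SGLang exposes a `__version__` attribute):

```bash
# Optional: verify that sglang and flashinfer are importable
python -c "import sglang; print(sglang.__version__)"
python -c "import flashinfer; print('flashinfer OK')"
```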

### Set up ulimit and HF_TOKEN

```bash
ulimit -n 65535
# Replace hf_token with a valid token that has access to the Llama 3 models.
export HF_TOKEN=hf_token
```
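
Optionally, verify the token before launching the server (uses the `huggingface-cli` tool that ships with `huggingface_hub`, a dependency of SGLang):

```bash
# Should print your Hugging Face account name if the token is valid
huggingface-cli whoami
```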

### Launch the server

```bash
# Meta-Llama-3.1-8B-Instruct
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache

# Meta-Llama-3.1-70B-Instruct
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-70B-Instruct --disable-radix-cache --tp 8

# Meta-Llama-3-70B-Instruct-FP8
python -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3-70B-Instruct-FP8 --disable-radix-cache --tp 8
```
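
Before starting a benchmark, you can confirm the server is ready (a minimal check, assuming the default port 30000 and the `/get_model_info` endpoint exposed by `sglang.launch_server`):

```bash
# Returns the loaded model path once the server is up
curl -s http://127.0.0.1:30000/get_model_info
```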

## Benchmark

### Hardware Requirements

- 8B models: Single NVIDIA A100 80GB GPU
- 70B models: 8 x NVIDIA A100 80GB GPUs with Tensor Parallelism (TP) 8
- 70B FP8 models: 8 x NVIDIA H100 GPUs with Tensor Parallelism (TP) 8

Please ensure you have the appropriate hardware before running the benchmarks.
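
A quick way to check what the current node provides:

```bash
# List GPU models and total memory on this node
nvidia-smi --query-gpu=name,memory.total --format=csv
```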

#### Offline benchmark

```bash
# Random dataset, Input [512, 1024], Output [512, 1024], num prompts 3k
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5 --output-file sglang_offline_benchmark.jsonl

# Random dataset, Input [2048, 4096], Output [512, 1024], num prompts 3k
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 1024 --random-range-ratio 0.5 --output-file sglang_offline_benchmark.jsonl

# Random dataset, Input [512, 1024], Output [256, 512], num prompts 3k
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 512 --random-range-ratio 0.5 --output-file sglang_offline_benchmark.jsonl

# Random dataset, Input [2048, 4096], Output [256, 512], num prompts 3k
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 512 --random-range-ratio 0.5 --output-file sglang_offline_benchmark.jsonl

# ShareGPT dataset, num prompts 3k
python3 -m sglang.bench_serving --backend sglang --num-prompts 3000 --output-file sglang_offline_benchmark.jsonl

# get output token throughput
cat sglang_offline_benchmark.jsonl | cut -d':' -f12 | cut -d',' -f1
```
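
The `cut` pipeline above relies on the position of fields in the JSONL records. If `jq` is available, extracting by key name is more robust (this assumes the records carry an `output_throughput` field; check one record with `head -n1` if your `bench_serving` version differs):

```bash
# Same metric, extracted by key instead of field position
jq '.output_throughput' sglang_offline_benchmark.jsonl
```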

#### Online benchmark

```bash
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 1, num prompts 300
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 300 --request-rate 1 --output-file sglang_online_benchmark.jsonl

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 2, num prompts 600
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 600 --request-rate 2 --output-file sglang_online_benchmark.jsonl

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 4, num prompts 1200
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 1200 --request-rate 4 --output-file sglang_online_benchmark.jsonl

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 8, num prompts 2400
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 2400 --request-rate 8 --output-file sglang_online_benchmark.jsonl

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 16, num prompts 3200
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 3200 --request-rate 16 --output-file sglang_online_benchmark.jsonl

# get median e2e latency
cat sglang_online_benchmark.jsonl | cut -d':' -f9 | cut -d',' -f1
```
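
The five online runs differ only in request rate and prompt count, so they can equivalently be driven by a loop (a small convenience sketch; requires bash 4+ for the associative array):

```bash
# Request rate -> number of prompts, matching the commands above
declare -A prompts=( [1]=300 [2]=600 [4]=1200 [8]=2400 [16]=3200 )
for rate in 1 2 4 8 16; do
  python3 -m sglang.bench_serving --backend sglang --dataset-name random \
    --random-input 4096 --random-output 1024 --random-range-ratio 0.125 \
    --num-prompts "${prompts[$rate]}" --request-rate "$rate" \
    --output-file sglang_online_benchmark.jsonl
done
```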

## Other

We tried vLLM 0.5.3.post1, but it often crashes under high load, and our partial benchmarks showed similar or worse performance compared to vLLM 0.5.2, so we use the older version, vLLM 0.5.2.

To set up TensorRT LLM, refer to https://github.com/sgl-project/tensorrt-demo. Specifically, we used a batch size of 512, a max input length of 8192, and a max number of tokens of 8192. The instance count for preprocessing and postprocessing in Triton Server is 16.

```bash
# vLLM
pip install vllm==0.5.2

# meta-llama/Meta-Llama-3-8B-Instruct
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --disable-log-requests

# meta-llama/Meta-Llama-3-70B-Instruct
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --disable-log-requests --tensor-parallel-size 8

# neuralmagic/Meta-Llama-3-70B-Instruct-FP8
python -m vllm.entrypoints.openai.api_server --model neuralmagic/Meta-Llama-3-70B-Instruct-FP8 --disable-log-requests --tensor-parallel-size 8
```
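
As with SGLang, you can confirm the vLLM server is ready before benchmarking (assuming the default port 8000 of the OpenAI-compatible server):

```bash
# Lists the served model once the server is up
curl -s http://127.0.0.1:8000/v1/models
```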

Download the standalone `bench_serving.py` script used for the vLLM and TensorRT LLM runs below:

```bash
wget https://raw.githubusercontent.com/sgl-project/sglang/main/python/sglang/bench_serving.py
```

```bash
# vLLM Offline

# Random dataset, Input [512, 1024], Output [512, 1024], num prompts 3k
python3 bench_serving.py --backend vllm --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5 --output-file vllm_offline_benchmark.jsonl

# Random dataset, Input [2048, 4096], Output [512, 1024], num prompts 3k
python3 bench_serving.py --backend vllm --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 1024 --random-range-ratio 0.5 --output-file vllm_offline_benchmark.jsonl

# Random dataset, Input [512, 1024], Output [256, 512], num prompts 3k
python3 bench_serving.py --backend vllm --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 512 --random-range-ratio 0.5 --output-file vllm_offline_benchmark.jsonl

# Random dataset, Input [2048, 4096], Output [256, 512], num prompts 3k
python3 bench_serving.py --backend vllm --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 512 --random-range-ratio 0.5 --output-file vllm_offline_benchmark.jsonl

# ShareGPT dataset, num prompts 3k
python3 bench_serving.py --backend vllm --num-prompts 3000 --output-file vllm_offline_benchmark.jsonl

# get output token throughput
cat vllm_offline_benchmark.jsonl | cut -d':' -f12 | cut -d',' -f1
```

```bash
# vLLM Online

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 1, num prompts 300
python3 bench_serving.py --backend vllm --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 300 --request-rate 1 --output-file vllm_online_benchmark.jsonl

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 2, num prompts 600
python3 bench_serving.py --backend vllm --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 600 --request-rate 2 --output-file vllm_online_benchmark.jsonl

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 4, num prompts 1200
python3 bench_serving.py --backend vllm --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 1200 --request-rate 4 --output-file vllm_online_benchmark.jsonl

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 8, num prompts 2400
python3 bench_serving.py --backend vllm --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 2400 --request-rate 8 --output-file vllm_online_benchmark.jsonl

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 16, num prompts 3200
python3 bench_serving.py --backend vllm --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 3200 --request-rate 16 --output-file vllm_online_benchmark.jsonl

# get median e2e latency
cat vllm_online_benchmark.jsonl | cut -d':' -f9 | cut -d',' -f1
```

```bash
# TensorRT LLM Offline 8B

# Random dataset, Input [512, 1024], Output [512, 1024], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5 --output-file trt_offline_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct

# Random dataset, Input [2048, 4096], Output [512, 1024], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 1024 --random-range-ratio 0.5 --output-file trt_offline_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct

# Random dataset, Input [512, 1024], Output [256, 512], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 512 --random-range-ratio 0.5 --output-file trt_offline_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct

# Random dataset, Input [2048, 4096], Output [256, 512], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 512 --random-range-ratio 0.5 --output-file trt_offline_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct

# ShareGPT dataset, num prompts 3k
python3 bench_serving.py --backend trt --num-prompts 3000 --output-file trt_offline_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct

# get output token throughput
cat trt_offline_benchmark_8b.jsonl | cut -d':' -f12 | cut -d',' -f1
```

```bash
# TensorRT LLM Online 8B

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 1, num prompts 300
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 300 --request-rate 1 --output-file trt_online_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 2, num prompts 600
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 600 --request-rate 2 --output-file trt_online_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 4, num prompts 1200
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 1200 --request-rate 4 --output-file trt_online_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 8, num prompts 2400
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 2400 --request-rate 8 --output-file trt_online_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 16, num prompts 3200
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 3200 --request-rate 16 --output-file trt_online_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct

# get median e2e latency
cat trt_online_benchmark_8b.jsonl | cut -d':' -f9 | cut -d',' -f1
```

```bash
# TensorRT LLM Offline 70B

# Random dataset, Input [512, 1024], Output [512, 1024], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5 --output-file trt_offline_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct

# Random dataset, Input [2048, 4096], Output [512, 1024], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 1024 --random-range-ratio 0.5 --output-file trt_offline_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct

# Random dataset, Input [512, 1024], Output [256, 512], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 512 --random-range-ratio 0.5 --output-file trt_offline_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct

# Random dataset, Input [2048, 4096], Output [256, 512], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 512 --random-range-ratio 0.5 --output-file trt_offline_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct

# ShareGPT dataset, num prompts 3k
python3 bench_serving.py --backend trt --num-prompts 3000 --output-file trt_offline_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct

# get output token throughput
cat trt_offline_benchmark_70b.jsonl | cut -d':' -f12 | cut -d',' -f1
```

```bash
# TensorRT LLM Online 70B

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 1, num prompts 300
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 300 --request-rate 1 --output-file trt_online_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 2, num prompts 600
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 600 --request-rate 2 --output-file trt_online_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 4, num prompts 1200
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 1200 --request-rate 4 --output-file trt_online_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 8, num prompts 2400
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 2400 --request-rate 8 --output-file trt_online_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct

# Random dataset, Input [512, 4096], Output [128, 1024], request rate 16, num prompts 3200
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 3200 --request-rate 16 --output-file trt_online_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct

# get median e2e latency
cat trt_online_benchmark_70b.jsonl | cut -d':' -f9 | cut -d',' -f1
```
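
Once all runs have finished, the per-file metrics can be collected in one pass, reusing the same field positions as the extraction commands above:

```bash
# Print the median e2e latency column from every online benchmark file
for f in *_online_benchmark*.jsonl; do
  echo "== $f =="
  cut -d':' -f9 "$f" | cut -d',' -f1
done
```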