"vllm/model_executor/models/seed_oss.py" did not exist on "60f76243344d2d3deca5e5ecdade547acc7fed50"
README.md 9.21 KB
Newer Older
Woosuk Kwon's avatar
Woosuk Kwon committed
1
# Benchmarking vLLM
2

3
4
5
This README guides you through running benchmark tests with the extensive
datasets supported on vLLM. It’s a living document, updated as new features and datasets
become available.
6

7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
## Dataset Overview

<table style="width:100%; border-collapse: collapse;">
  <thead>
    <tr>
      <th style="width:15%; text-align: left;">Dataset</th>
      <th style="width:10%; text-align: center;">Online</th>
      <th style="width:10%; text-align: center;">Offline</th>
      <th style="width:65%; text-align: left;">Data Path</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>ShareGPT</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json</code></td>
    </tr>
    <tr>
      <td><strong>BurstGPT</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv</code></td>
    </tr>
    <tr>
      <td><strong>Sonnet</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td>Local file: <code>benchmarks/sonnet.txt</code></td>
    </tr>
    <tr>
      <td><strong>Random</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>synthetic</code></td>
    </tr>
    <tr>
44
45
46
47
48
49
50
51
52
53
      <td><strong>HuggingFace-VisionArena</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>lmarena-ai/VisionArena-Chat</code></td>
    </tr>
    <tr>
      <td><strong>HuggingFace-InstructCoder</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>likaixin/InstructCoder</code></td>
54
55
56
57
58
59
    </tr>
      <tr>
      <td><strong>HuggingFace-AIMO</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>AI-MO/aimo-validation-aime</code> , <code>AI-MO/NuminaMath-1.5</code>, <code>AI-MO/NuminaMath-CoT</code></td>
60
61
    </tr>
    <tr>
62
      <td><strong>HuggingFace-Other</strong></td>
63
      <td style="text-align: center;"></td>
64
      <td style="text-align: center;"></td>
65
      <td><code>lmms-lab/LLaVA-OneVision-Data</code>, <code>Aeala/ShareGPT_Vicuna_unfiltered</code></td>
66
67
68
    </tr>
  </tbody>
</table>
69
70
71

✅: supported

72
🟡: Partial support
73

74
🚧: to be supported
75

76
**Note**: HuggingFace dataset's `dataset-name` should be set to `hf`
77
78
79
80
81

---
## Example - Online Benchmark

First start serving your model
82

83
```bash
84
vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests
85
```
86

87
88
89
90
91
Then run the benchmarking script

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
92
93
94
95
96
97
98
python3 vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 10
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
```

If successful, you will see the following output

```
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  5.78      
Total input tokens:                      1369      
Total generated tokens:                  2212      
Request throughput (req/s):              1.73      
Output token throughput (tok/s):         382.89    
Total Token throughput (tok/s):          619.85    
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54     
Median TTFT (ms):                        73.88     
P99 TTFT (ms):                           79.49     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91      
Median TPOT (ms):                        7.96      
P99 TPOT (ms):                           8.03      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74      
Median ITL (ms):                         7.70      
P99 ITL (ms):                            8.39      
==================================================
```
126

127
### VisionArena Benchmark for Vision Language Models
128

129
```bash
130
131
# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
132
```
133

134
```bash
135
python3 vllm/benchmarks/benchmark_serving.py \
136
137
138
139
140
141
142
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 1000
143
```
144

145
### InstructCoder Benchmark with Speculative Decoding
146

147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
``` bash
VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-model "[ngram]" \
    --ngram_prompt_lookup_min 2 \
    --ngram-prompt-lookup-max 5 \
    --num_speculative_tokens 5
```

``` bash
python3 benchmarks/benchmark_serving.py \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --num-prompts 2048
```

### Other HuggingFaceDataset Examples
164
165
166
167
168
169
170
171
172

```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
```

**`lmms-lab/LLaVA-OneVision-Data`**

```bash
python3 vllm/benchmarks/benchmark_serving.py \
173
174
175
176
177
178
179
180
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmms-lab/LLaVA-OneVision-Data \
  --hf-split train \
  --hf-subset "chart2text(cauldron)" \
  --num-prompts 10
181
182
183
184
185
186
```

**`Aeala/ShareGPT_Vicuna_unfiltered`**

```bash
python3 vllm/benchmarks/benchmark_serving.py \
187
188
189
190
191
192
193
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
  --hf-split train \
  --num-prompts 10
194
195
```

196
197
198
199
200
201
202
203
204
205
206
**`AI-MO/aimo-validation-aime`**

``` bash
python3 vllm/benchmarks/benchmark_serving.py \
    --model Qwen/QwQ-32B \
    --dataset-name hf \
    --dataset-path AI-MO/aimo-validation-aime \
    --num-prompts 10 \
    --seed 42
```

207
208
---
## Example - Offline Throughput Benchmark
209
210

```bash
211
python3 vllm/benchmarks/benchmark_throughput.py \
212
213
214
215
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset-name sonnet \
  --dataset-path vllm/benchmarks/sonnet.txt \
  --num-prompts 10
216
```
217
218
219
220

If successful, you will see the following output

```
221
222
223
224
225
226
227
228
229
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens:  5014
Total num output tokens:  1500
```

### VisionArena Benchmark for Vision Language Models

``` bash
python3 vllm/benchmarks/benchmark_throughput.py \
230
231
232
233
234
235
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 1000 \
  --hf-split train
236
237
238
239
240
241
242
243
```

The `num prompt tokens` now includes image token counts

```
Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens:  14527
Total num output tokens:  1280
244
```
245

246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
### InstructCoder Benchmark with Speculative Decoding

``` bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \
python3 vllm/benchmarks/benchmark_throughput.py \
    --dataset-name=hf \
    --dataset-path=likaixin/InstructCoder \
    --model=meta-llama/Meta-Llama-3-8B-Instruct \
    --input-len=1000 \
    --output-len=100 \
    --num-prompts=2048 \
    --async-engine \
    --speculative-model="[ngram]" \
    --ngram_prompt_lookup_min=2 \
    --ngram-prompt-lookup-max=5 \
    --num_speculative_tokens=5
```

```
Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
Total num prompt tokens:  261136
Total num output tokens:  204800
```

### Other HuggingFaceDataset Examples

**`lmms-lab/LLaVA-OneVision-Data`**

```bash
python3 vllm/benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path lmms-lab/LLaVA-OneVision-Data \
  --hf-split train \
  --hf-subset "chart2text(cauldron)" \
  --num-prompts 10
```

**`Aeala/ShareGPT_Vicuna_unfiltered`**

```bash
python3 vllm/benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
  --hf-split train \
  --num-prompts 10
```

298
299
300
301
302
303
304
305
306
307
308
309
**`AI-MO/aimo-validation-aime`**

```bash
python3 benchmarks/benchmark_throughput.py \
  --model Qwen/QwQ-32B \
  --backend vllm \
  --dataset-name hf \
  --dataset-path AI-MO/aimo-validation-aime \
  --hf-split train \
  --num-prompts 10
```

310
311
312
### Benchmark with LoRA Adapters

``` bash
313
314
315
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_throughput.py \
316
317
318
319
320
321
322
323
324
  --model meta-llama/Llama-2-7b-hf \
  --backend vllm \
  --dataset_path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --dataset_name sharegpt \
  --num-prompts 10 \
  --max-loras 2 \
  --max-lora-rank 8 \
  --enable-lora \
  --lora-path yard1/llama-2-7b-sql-lora-test
325
  ```