# Benchmarking vLLM

This README guides you through running benchmark tests with the extensive
datasets supported on vLLM. It’s a living document, updated as new features and datasets
become available.

## Dataset Overview

<table style="width:100%; border-collapse: collapse;">
  <thead>
    <tr>
      <th style="width:15%; text-align: left;">Dataset</th>
      <th style="width:10%; text-align: center;">Online</th>
      <th style="width:10%; text-align: center;">Offline</th>
      <th style="width:65%; text-align: left;">Data Path</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>ShareGPT</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json</code></td>
    </tr>
    <tr>
      <td><strong>ShareGPT4V (Image)</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td>
        <code>wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json</code>
        <br>
        <div>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:</div>
        <code>wget http://images.cocodataset.org/zips/train2017.zip</code>
      </td>
    </tr>
    <tr>
      <td><strong>BurstGPT</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv</code></td>
    </tr>
    <tr>
      <td><strong>Sonnet</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td>Local file: <code>benchmarks/sonnet.txt</code></td>
    </tr>
    <tr>
      <td><strong>Random</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>synthetic</code></td>
    </tr>
    <tr>
      <td><strong>HuggingFace-VisionArena</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>lmarena-ai/VisionArena-Chat</code></td>
    </tr>
    <tr>
      <td><strong>HuggingFace-InstructCoder</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>likaixin/InstructCoder</code></td>
    </tr>
    <tr>
      <td><strong>HuggingFace-AIMO</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>AI-MO/aimo-validation-aime</code>, <code>AI-MO/NuminaMath-1.5</code>, <code>AI-MO/NuminaMath-CoT</code></td>
    </tr>
    <tr>
      <td><strong>HuggingFace-Other</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>lmms-lab/LLaVA-OneVision-Data</code>, <code>Aeala/ShareGPT_Vicuna_unfiltered</code></td>
    </tr>
    <tr>
      <td><strong>Custom</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td>Local file: <code>data.jsonl</code></td>
    </tr>
  </tbody>
</table>

✅: Supported

🟡: Partially supported

🚧: To be supported

**Note**: For HuggingFace datasets, set `--dataset-name` to `hf`.

## 🚀 Example - Online Benchmark

<details>
<summary>Show more</summary>

<br/>

First, start serving your model:

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B
```

Then run the benchmarking script:

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 10
```

If successful, you will see the following output:

```text
============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  5.78
Total input tokens:                      1369
Total generated tokens:                  2212
Request throughput (req/s):              1.73
Output token throughput (tok/s):         382.89
Total Token throughput (tok/s):          619.85
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54
Median TTFT (ms):                        73.88
P99 TTFT (ms):                           79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91
Median TPOT (ms):                        7.96
P99 TPOT (ms):                           8.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74
Median ITL (ms):                         7.70
P99 ITL (ms):                            8.39
==================================================
```
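
The latency rows in this report can be related to per-request token timestamps. The sketch below is a minimal illustration, assuming each request records when it was sent and when each streamed output token arrived. The `RequestTrace` class and its field names are hypothetical and are not part of vLLM's API. It follows the labels in the report: TTFT is the time to the first token, TPOT spreads the remaining time over the tokens after the first, and ITL is the gap between consecutive tokens.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Hypothetical per-request record: send time and token arrival times (seconds)."""
    sent_at: float
    token_times: list[float]


def ttft_ms(trace: RequestTrace) -> float:
    """Time to first token, in milliseconds."""
    return (trace.token_times[0] - trace.sent_at) * 1000.0


def tpot_ms(trace: RequestTrace) -> float:
    """Mean time per output token, excluding the first token, in milliseconds."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    decode_time = trace.token_times[-1] - trace.token_times[0]
    return decode_time / (n - 1) * 1000.0


def itl_ms(trace: RequestTrace) -> list[float]:
    """Inter-token latencies: gaps between consecutive tokens, in milliseconds."""
    t = trace.token_times
    return [(b - a) * 1000.0 for a, b in zip(t, t[1:])]


# Made-up timestamps for a single request with four output tokens.
trace = RequestTrace(sent_at=0.0, token_times=[0.070, 0.078, 0.086, 0.094])
print(f"TTFT: {ttft_ms(trace):.2f} ms")
print(f"TPOT: {tpot_ms(trace):.2f} ms")
print(f"Mean ITL: {mean(itl_ms(trace)):.2f} ms")
```

The timestamps here are invented for illustration; the report above aggregates such per-request values into mean, median, and P99 figures.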

### Custom Dataset

If the dataset you want to benchmark is not yet supported in vLLM, you can still benchmark it using `CustomDataset`. Your data must be in `.jsonl` format, and each entry must have a `prompt` field, e.g., `data.jsonl`:

```json
{"prompt": "What is the capital of India?"}
{"prompt": "What is the capital of Iran?"}
{"prompt": "What is the capital of China?"}
```
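
A file in this format can be produced with a few lines of Python. This is a minimal sketch: the `write_prompts` helper and the output filename are illustrative, and only the `prompt` field comes from the format described above.

```python
import json
from pathlib import Path


def write_prompts(prompts: list[str], path: str) -> int:
    """Write one JSON object per line, each with a "prompt" field."""
    with Path(path).open("w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"prompt": prompt}, ensure_ascii=False) + "\n")
    return len(prompts)


count = write_prompts(
    [
        "What is the capital of India?",
        "What is the capital of Iran?",
        "What is the capital of China?",
    ],
    "data.jsonl",
)
print(f"wrote {count} prompts")
```

The resulting path is what you would pass to `--dataset-path` together with `--dataset-name custom`.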

```bash
# start server (the port must match --port in the benchmark command)
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --port 9001
```

```bash
# run benchmarking script
vllm bench serve --port 9001 --save-result --save-detailed \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --endpoint /v1/completions \
  --dataset-name custom \
  --dataset-path <path-to-your-data-jsonl> \
  --custom-skip-chat-template \
  --num-prompts 80 \
  --max-concurrency 1 \
  --temperature=0.3 \
  --top-p=0.75 \
  --result-dir "./log/"
```

If your data already has the chat template applied, pass `--custom-skip-chat-template` to skip applying it again.

### VisionArena Benchmark for Vision Language Models

```bash
# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct
```

```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 1000
```

### InstructCoder Benchmark with Speculative Decoding

```bash
VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-config $'{"method": "ngram",
    "num_speculative_tokens": 5, "prompt_lookup_max": 5,
    "prompt_lookup_min": 2}'
```
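
The `ngram` method drafts tokens by looking for an earlier occurrence of the most recent tokens in the context and proposing the tokens that followed it. The sketch below is a simplified illustration of that idea, not vLLM's implementation. The `propose_ngram` function is hypothetical, and its arguments only mirror the names used in the `--speculative-config` above.

```python
def propose_ngram(
    tokens: list[int],
    num_speculative_tokens: int = 5,
    prompt_lookup_max: int = 5,
    prompt_lookup_min: int = 2,
) -> list[int]:
    """Return draft tokens copied from an earlier match of the current suffix.

    Tries the longest suffix first (prompt_lookup_max) and falls back to
    shorter ones down to prompt_lookup_min. Returns [] when nothing matches.
    """
    for n in range(prompt_lookup_max, prompt_lookup_min - 1, -1):
        if len(tokens) <= n:
            continue
        suffix = tokens[-n:]
        # Scan earlier positions, most recent first, skipping the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                follow = tokens[start + n:start + n + num_speculative_tokens]
                if follow:
                    return follow
    return []


# The last three tokens (7, 8, 9) also appear at the start of the context.
context = [7, 8, 9, 10, 11, 12, 3, 4, 7, 8, 9]
print(propose_ngram(context))  # prints [10, 11, 12, 3, 4]
```

Drafts like these are cheap to produce, which is why this method suits workloads such as code editing, where the output often repeats spans of the input.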

```bash
vllm bench serve \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --num-prompts 2048
```

### Other HuggingFaceDataset Examples

```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct
```

`lmms-lab/LLaVA-OneVision-Data`:

```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmms-lab/LLaVA-OneVision-Data \
  --hf-split train \
  --hf-subset "chart2text(cauldron)" \
  --num-prompts 10
```

`Aeala/ShareGPT_Vicuna_unfiltered`:

```bash
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
  --hf-split train \
  --num-prompts 10
```

`AI-MO/aimo-validation-aime`:

```bash
vllm bench serve \
    --model Qwen/QwQ-32B \
    --dataset-name hf \
    --dataset-path AI-MO/aimo-validation-aime \
    --num-prompts 10 \
    --seed 42
```

`philschmid/mt-bench`:

```bash
vllm bench serve \
    --model Qwen/QwQ-32B \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --num-prompts 80
```

### Running With Sampling Parameters

When using OpenAI-compatible backends such as `vllm`, optional sampling
parameters can be specified. Example client command:

```bash
vllm bench serve \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --top-k 10 \
  --top-p 0.9 \
  --temperature 0.5 \
  --num-prompts 10
```

### Running With Ramp-Up Request Rate

The benchmark tool also supports ramping up the request rate over the
duration of the benchmark run. This can be useful for stress testing the
server or finding the maximum throughput that it can handle, given some latency budget.

Two ramp-up strategies are supported:

- `linear`: Increases the request rate linearly from a start value to an end value.
- `exponential`: Increases the request rate exponentially.

The following arguments can be used to control the ramp-up:

- `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`).
- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
- `--ramp-up-end-rps`: The request rate at the end of the benchmark.
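
As an illustration of how the two strategies shape the request rate, the sketch below interpolates between the start and end rates as a function of benchmark progress (0.0 at the start, 1.0 at the end). The exact formulas are an assumption made for this example and may not match vLLM's implementation; `ramp_rate` is a hypothetical helper.

```python
def ramp_rate(strategy: str, start_rps: float, end_rps: float, progress: float) -> float:
    """Request rate at a given progress in [0, 1] (assumed interpolation formulas)."""
    progress = min(max(progress, 0.0), 1.0)
    if strategy == "linear":
        # Straight line from start_rps to end_rps.
        return start_rps + (end_rps - start_rps) * progress
    if strategy == "exponential":
        # Geometric growth from start_rps to end_rps.
        if start_rps <= 0:
            raise ValueError("exponential ramp-up needs a positive start rate")
        return start_rps * (end_rps / start_rps) ** progress
    raise ValueError(f"unknown strategy: {strategy}")


for p in (0.0, 0.5, 1.0):
    lin = ramp_rate("linear", 1.0, 16.0, p)
    exp = ramp_rate("exponential", 1.0, 16.0, p)
    print(f"progress={p:.1f}  linear={lin:.2f} rps  exponential={exp:.2f} rps")
```

Under these assumptions, halfway through a ramp from 1 to 16 requests per second the linear strategy is at 8.5 while the exponential one is at 4, so the exponential ramp spends more of the run at low load before climbing steeply.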

</details>

## 📈 Example - Offline Throughput Benchmark

<details>
<summary>Show more</summary>

<br/>

```bash
vllm bench throughput \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset-name sonnet \
  --dataset-path vllm/benchmarks/sonnet.txt \
  --num-prompts 10
```

If successful, you will see the following output:

```text
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens:  5014
Total num output tokens:  1500
```
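
The three throughput figures share a single elapsed time, so they can be cross-checked against the token totals. The sketch below is only a sanity check on the sample output above. The `throughput` helper is hypothetical, and the elapsed time is inferred from the printed numbers rather than reported by the tool.

```python
def throughput(num_requests: int, prompt_tokens: int, output_tokens: int, elapsed_s: float):
    """Return (requests/s, total tokens/s, output tokens/s) for one run."""
    return (
        num_requests / elapsed_s,
        (prompt_tokens + output_tokens) / elapsed_s,
        output_tokens / elapsed_s,
    )


# Elapsed time inferred from the sample: 1500 output tokens at 1072.15 tokens/s.
elapsed_s = 1500 / 1072.15
req_s, total_s, out_s = throughput(10, 5014, 1500, elapsed_s)
print(f"{req_s:.2f} requests/s, {total_s:.2f} total tokens/s, {out_s:.2f} output tokens/s")
```

The recomputed values agree with the sample output to within rounding.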

### VisionArena Benchmark for Vision Language Models

```bash
vllm bench throughput \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 1000 \
  --hf-split train
```

The `num prompt tokens` value now includes image token counts:

```text
Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens:  14527
Total num output tokens:  1280
```

### InstructCoder Benchmark with Speculative Decoding

```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \
vllm bench throughput \
    --dataset-name=hf \
    --dataset-path=likaixin/InstructCoder \
    --model=meta-llama/Meta-Llama-3-8B-Instruct \
    --input-len=1000 \
    --output-len=100 \
    --num-prompts=2048 \
    --async-engine \
    --speculative-config $'{"method": "ngram",
    "num_speculative_tokens": 5, "prompt_lookup_max": 5,
    "prompt_lookup_min": 2}'
```

```text
Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
Total num prompt tokens:  261136
Total num output tokens:  204800
```

### Other HuggingFaceDataset Examples

`lmms-lab/LLaVA-OneVision-Data`:

```bash
vllm bench throughput \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path lmms-lab/LLaVA-OneVision-Data \
  --hf-split train \
  --hf-subset "chart2text(cauldron)" \
  --num-prompts 10
```

`Aeala/ShareGPT_Vicuna_unfiltered`:

```bash
vllm bench throughput \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
  --hf-split train \
  --num-prompts 10
```

`AI-MO/aimo-validation-aime`:

```bash
vllm bench throughput \
  --model Qwen/QwQ-32B \
  --backend vllm \
  --dataset-name hf \
  --dataset-path AI-MO/aimo-validation-aime \
  --hf-split train \
  --num-prompts 10
```

Benchmark with LoRA adapters:

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench throughput \
  --model meta-llama/Llama-2-7b-hf \
  --backend vllm \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --dataset-name sharegpt \
  --num-prompts 10 \
  --max-loras 2 \
  --max-lora-rank 8 \
  --enable-lora \
  --lora-path yard1/llama-2-7b-sql-lora-test
```

</details>

## 🛠️ Example - Structured Output Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the performance of structured output generation (JSON, grammar, regex).

### Server Setup

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B
```

### JSON Schema Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset json \
  --structured-output-ratio 1.0 \
  --request-rate 10 \
  --num-prompts 1000
```

### Grammar-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset grammar \
  --structure-type grammar \
  --request-rate 10 \
  --num-prompts 1000
```

### Regex-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset regex \
  --request-rate 10 \
  --num-prompts 1000
```

### Choice-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset choice \
  --request-rate 10 \
  --num-prompts 1000
```

### XGrammar Benchmark Dataset

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset xgrammar_bench \
  --request-rate 10 \
  --num-prompts 1000
```

</details>

## 📚 Example - Long Document QA Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the performance of long document question-answering with prefix caching.

### Basic Long Document QA Test

```bash
python3 benchmarks/benchmark_long_document_qa_throughput.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --enable-prefix-caching \
  --num-documents 16 \
  --document-length 2000 \
  --output-len 50 \
  --repeat-count 5
```

### Different Repeat Modes

```bash
# Random mode (default) - shuffle prompts randomly
python3 benchmarks/benchmark_long_document_qa_throughput.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --enable-prefix-caching \
  --num-documents 8 \
  --document-length 3000 \
  --repeat-count 3 \
  --repeat-mode random

# Tile mode - repeat entire prompt list in sequence
python3 benchmarks/benchmark_long_document_qa_throughput.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --enable-prefix-caching \
  --num-documents 8 \
  --document-length 3000 \
  --repeat-count 3 \
  --repeat-mode tile

# Interleave mode - repeat each prompt consecutively
python3 benchmarks/benchmark_long_document_qa_throughput.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --enable-prefix-caching \
  --num-documents 8 \
  --document-length 3000 \
  --repeat-count 3 \
  --repeat-mode interleave
```
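
The three repeat modes differ only in how the repeated prompt list is ordered. The sketch below illustrates the orderings described in the comments above; `repeat_prompts` is a hypothetical helper, not the benchmark script's own code, and the `seed` argument is an assumption added to make the shuffle reproducible.

```python
import random


def repeat_prompts(prompts: list[str], repeat_count: int, mode: str, seed: int = 0) -> list[str]:
    """Repeat each prompt repeat_count times, ordered according to mode."""
    if mode == "tile":
        # Whole list repeated in sequence: A B C A B C ...
        return prompts * repeat_count
    if mode == "interleave":
        # Each prompt repeated consecutively: A A B B C C ...
        return [p for p in prompts for _ in range(repeat_count)]
    if mode == "random":
        # Same items as tile, in shuffled order.
        repeated = prompts * repeat_count
        random.Random(seed).shuffle(repeated)
        return repeated
    raise ValueError(f"unknown repeat mode: {mode}")


docs = ["A", "B", "C"]
print(repeat_prompts(docs, 2, "tile"))        # ['A', 'B', 'C', 'A', 'B', 'C']
print(repeat_prompts(docs, 2, "interleave"))  # ['A', 'A', 'B', 'B', 'C', 'C']
print(repeat_prompts(docs, 2, "random"))      # same items, shuffled order
```

With prefix caching enabled, the ordering determines how soon a document is requested again, which may affect how often its cached prefix is still available.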

</details>

## 🗂️ Example - Prefix Caching Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the efficiency of automatic prefix caching.

### Fixed Prompt with Prefix Caching

```bash
python3 benchmarks/benchmark_prefix_caching.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --enable-prefix-caching \
  --num-prompts 1 \
  --repeat-count 100 \
  --input-length-range 128:256
```

### ShareGPT Dataset with Prefix Caching

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

python3 benchmarks/benchmark_prefix_caching.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --dataset-path /path/ShareGPT_V3_unfiltered_cleaned_split.json \
  --enable-prefix-caching \
  --num-prompts 20 \
  --repeat-count 5 \
  --input-length-range 128:256
```

</details>

## ⚡ Example - Request Prioritization Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the performance of request prioritization in vLLM.

### Basic Prioritization Test

```bash
python3 benchmarks/benchmark_prioritization.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --input-len 128 \
  --output-len 64 \
  --num-prompts 100 \
  --scheduling-policy priority
```

### Multiple Sequences per Prompt

```bash
python3 benchmarks/benchmark_prioritization.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --input-len 128 \
  --output-len 64 \
  --num-prompts 100 \
  --scheduling-policy priority \
  --n 2
```

</details>

## 👁️ Example - Multi-Modal Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the performance of multi-modal requests in vLLM.

### Images (ShareGPT4V)

Start vLLM:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --dtype bfloat16 \
  --limit-mm-per-prompt '{"image": 1}' \
  --allowed-local-media-path /path/to/sharegpt4v/images
```

Send requests with images:

```bash
python benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /path/to/ShareGPT4V/sharegpt4v_instruct_gpt4-vision_cap100k.json \
  --num-prompts 100 \
  --save-result \
  --result-dir ~/vllm_benchmark_results \
  --save-detailed \
  --endpoint /v1/chat/completions
```

</details>