# Benchmarking vLLM

This README walks through running vLLM benchmarks on the datasets it supports.
It’s a living document, updated as new features and datasets become available.

## Dataset Overview

<table style="width:100%; border-collapse: collapse;">
  <thead>
    <tr>
      <th style="width:15%; text-align: left;">Dataset</th>
      <th style="width:10%; text-align: center;">Online</th>
      <th style="width:10%; text-align: center;">Offline</th>
      <th style="width:65%; text-align: left;">Data Path</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>ShareGPT</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json</code></td>
    </tr>
    <tr>
      <td><strong>ShareGPT4V (Image)</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td>
        <code>wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json</code>
        <br>
        <div>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:</div>
        <code>wget http://images.cocodataset.org/zips/train2017.zip</code>
      </td>
    </tr>
    <tr>
      <td><strong>ShareGPT4Video (Video)</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td>
        <code>git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video</code>
      </td>
    </tr>
    <tr>
      <td><strong>BurstGPT</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv</code></td>
    </tr>
    <tr>
      <td><strong>Sonnet (deprecated)</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td>Local file: <code>benchmarks/sonnet.txt</code></td>
    </tr>
    <tr>
      <td><strong>Random</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>synthetic</code></td>
    </tr>
    <tr>
      <td><strong>Prefix Repetition</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>synthetic</code></td>
    </tr>
    <tr>
      <td><strong>HuggingFace-VisionArena</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>lmarena-ai/VisionArena-Chat</code></td>
    </tr>
    <tr>
      <td><strong>HuggingFace-InstructCoder</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>likaixin/InstructCoder</code></td>
    </tr>
    <tr>
      <td><strong>HuggingFace-AIMO</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>AI-MO/aimo-validation-aime</code>, <code>AI-MO/NuminaMath-1.5</code>, <code>AI-MO/NuminaMath-CoT</code></td>
    </tr>
    <tr>
      <td><strong>HuggingFace-Other</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td><code>lmms-lab/LLaVA-OneVision-Data</code>, <code>Aeala/ShareGPT_Vicuna_unfiltered</code></td>
    </tr>
    <tr>
      <td><strong>Custom</strong></td>
      <td style="text-align: center;"></td>
      <td style="text-align: center;"></td>
      <td>Local file: <code>data.jsonl</code></td>
    </tr>
  </tbody>
</table>

✅: supported

🟡: partial support

🚧: to be supported

**Note**: For HuggingFace datasets, set `--dataset-name` to `hf`.

## 🚀 Example - Online Benchmark

<details>
<summary>Show more</summary>

<br/>

First, start serving your model:

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B
```

Then run the benchmarking script:

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 10
```

If successful, you will see the following output:

```text
============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  5.78
Total input tokens:                      1369
Total generated tokens:                  2212
Request throughput (req/s):              1.73
Output token throughput (tok/s):         382.89
Total Token throughput (tok/s):          619.85
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54
Median TTFT (ms):                        73.88
P99 TTFT (ms):                           79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91
Median TPOT (ms):                        7.96
P99 TPOT (ms):                           8.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74
Median ITL (ms):                         7.70
P99 ITL (ms):                            8.39
==================================================
```
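As a sanity check, the throughput rows in the sample report above can be approximately reproduced from the token counts and the duration. The sketch below assumes each throughput figure is simply a count divided by the benchmark duration; `throughput_summary` is a hypothetical helper, not part of vLLM. Small differences from the printed values are expected because the displayed duration is rounded.

```python
def throughput_summary(num_requests, input_tokens, output_tokens, duration_s):
    """Recompute throughput figures from raw counts (illustrative, not a vLLM API)."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return {
        "request_throughput": num_requests / duration_s,
        "output_token_throughput": output_tokens / duration_s,
        "total_token_throughput": (input_tokens + output_tokens) / duration_s,
    }


# Values copied from the sample report above.
summary = throughput_summary(
    num_requests=10, input_tokens=1369, output_tokens=2212, duration_s=5.78
)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```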

### Custom Dataset

If the dataset you want to benchmark is not yet supported in vLLM, you can still benchmark on it with `CustomDataset`. The data must be in `.jsonl` format, with a `"prompt"` field in every entry. For example, `data.jsonl`:

```json
{"prompt": "What is the capital of India?"}
{"prompt": "What is the capital of Iran?"}
{"prompt": "What is the capital of China?"}
```
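A file in this shape can be produced with a few lines of Python. The following is a minimal sketch, assuming only the one-object-per-line layout with a `"prompt"` field described above; the helper names `write_prompts_jsonl` and `read_prompts_jsonl` are hypothetical and not part of vLLM.

```python
import json
from pathlib import Path


def write_prompts_jsonl(prompts, path):
    """Write one JSON object per line, each with a "prompt" field."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"prompt": prompt}, ensure_ascii=False) + "\n")
    return path


def read_prompts_jsonl(path):
    """Read the file back and check that every entry has a "prompt" field."""
    prompts = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            if "prompt" not in entry:
                raise ValueError(f"line {line_no}: missing 'prompt' field")
            prompts.append(entry["prompt"])
    return prompts


if __name__ == "__main__":
    out = write_prompts_jsonl(
        ["What is the capital of India?", "What is the capital of Iran?"],
        "data.jsonl",
    )
    print(f"wrote {len(read_prompts_jsonl(out))} prompts to {out}")
```

Reading the file back before benchmarking is a cheap way to catch malformed lines early.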

```bash
# start server
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
```

```bash
# run benchmarking script
# --port must match the port the server above is listening on
vllm bench serve --port 9001 --save-result --save-detailed \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --endpoint /v1/completions \
  --dataset-name custom \
  --dataset-path <path-to-your-data-jsonl> \
  --custom-skip-chat-template \
  --num-prompts 80 \
  --max-concurrency 1 \
  --temperature=0.3 \
  --top-p=0.75 \
  --result-dir "./log/"
```

If your data already has the chat template applied, pass `--custom-skip-chat-template` to skip applying it again.

### VisionArena Benchmark for Vision Language Models

```bash
# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct
```

```bash
vllm bench serve \
  --backend openai-chat \
  --endpoint-type openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 1000
```

### InstructCoder Benchmark with Speculative Decoding

```bash
VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-config $'{"method": "ngram",
    "num_speculative_tokens": 5, "prompt_lookup_max": 5,
    "prompt_lookup_min": 2}'
```
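The inline JSON passed to `--speculative-config` is easy to mis-quote in a shell. One option, sketched here as a suggestion rather than a documented workflow, is to build the string in Python and paste the quoted result into the command. `build_speculative_config` is a hypothetical helper, and it uses only the keys that appear in the command above.

```python
import json
import shlex


def build_speculative_config(num_speculative_tokens=5, lookup_max=5, lookup_min=2):
    """Return the n-gram speculative config as a JSON string."""
    if lookup_min > lookup_max:
        raise ValueError("lookup_min must not exceed lookup_max")
    config = {
        "method": "ngram",
        "num_speculative_tokens": num_speculative_tokens,
        "prompt_lookup_max": lookup_max,
        "prompt_lookup_min": lookup_min,
    }
    return json.dumps(config)


config_json = build_speculative_config()
# shlex.quote makes the JSON safe to paste as a single shell argument.
print("--speculative-config " + shlex.quote(config_json))
```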

```bash
vllm bench serve \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --num-prompts 2048
```

### Other HuggingFaceDataset Examples

```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct
```

`lmms-lab/LLaVA-OneVision-Data`:

```bash
vllm bench serve \
  --backend openai-chat \
  --endpoint-type openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmms-lab/LLaVA-OneVision-Data \
  --hf-split train \
  --hf-subset "chart2text(cauldron)" \
  --num-prompts 10
```

`Aeala/ShareGPT_Vicuna_unfiltered`:

```bash
vllm bench serve \
  --backend openai-chat \
  --endpoint-type openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
  --hf-split train \
  --num-prompts 10
```

`AI-MO/aimo-validation-aime`:

```bash
vllm bench serve \
    --model Qwen/QwQ-32B \
    --dataset-name hf \
    --dataset-path AI-MO/aimo-validation-aime \
    --num-prompts 10 \
    --seed 42
```

`philschmid/mt-bench`:

```bash
vllm bench serve \
    --model Qwen/QwQ-32B \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --num-prompts 80
```

### Running With Sampling Parameters

When using OpenAI-compatible backends such as `vllm`, optional sampling
parameters can be specified. Example client command:

```bash
vllm bench serve \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --top-k 10 \
  --top-p 0.9 \
  --temperature 0.5 \
  --num-prompts 10
```

### Running With Ramp-Up Request Rate

The benchmark tool also supports ramping up the request rate over the
duration of the benchmark run. This can be useful for stress testing the
server or finding the maximum throughput that it can handle, given some latency budget.

Two ramp-up strategies are supported:

- `linear`: Increases the request rate linearly from a start value to an end value.
- `exponential`: Increases the request rate exponentially.

The following arguments can be used to control the ramp-up:

- `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`).
- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
- `--ramp-up-end-rps`: The request rate at the end of the benchmark.
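To make the two strategies concrete, here is a minimal sketch of how a request rate could be interpolated between the start and end values. It assumes the rate is a function of progress through the run (0.0 at the start, 1.0 at the end) and that "exponential" means geometric interpolation. The benchmark tool's actual schedule may differ, and `ramp_rate` is a hypothetical name.

```python
def ramp_rate(strategy, start_rps, end_rps, progress):
    """Request rate at a given progress in [0, 1] (illustrative, not vLLM's code)."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    if strategy == "linear":
        # Straight-line interpolation between the two rates.
        return start_rps + (end_rps - start_rps) * progress
    if strategy == "exponential":
        if start_rps <= 0 or end_rps <= 0:
            raise ValueError("exponential ramp needs positive rates")
        # Geometric interpolation: the rate is multiplied by a constant
        # factor for each equal step of progress.
        return start_rps * (end_rps / start_rps) ** progress
    raise ValueError(f"unknown strategy: {strategy!r}")


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    lin = ramp_rate("linear", 1.0, 16.0, p)
    exp = ramp_rate("exponential", 1.0, 16.0, p)
    print(f"progress={p:.2f}  linear={lin:.2f} rps  exponential={exp:.2f} rps")
```

Under these assumptions and with the same endpoints, the exponential ramp spends more of the run at low rates and climbs steeply near the end.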

</details>

## 📈 Example - Offline Throughput Benchmark

<details>
<summary>Show more</summary>

<br/>

```bash
vllm bench throughput \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset-name sonnet \
  --dataset-path vllm/benchmarks/sonnet.txt \
  --num-prompts 10
```

If successful, you will see the following output:

```text
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens:  5014
Total num output tokens:  1500
```
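When comparing several runs, it can help to pull the three rates out of the summary line programmatically. The sketch below assumes the `Throughput:` line keeps exactly the layout shown above; `parse_throughput` is a hypothetical helper, not something vLLM provides.

```python
import re

THROUGHPUT_RE = re.compile(
    r"Throughput:\s*(?P<requests>[\d.]+) requests/s,\s*"
    r"(?P<total>[\d.]+) total tokens/s,\s*"
    r"(?P<output>[\d.]+) output tokens/s"
)


def parse_throughput(line):
    """Extract the three rates from a 'Throughput:' summary line."""
    match = THROUGHPUT_RE.search(line)
    if match is None:
        raise ValueError("not a throughput summary line")
    return {key: float(value) for key, value in match.groupdict().items()}


sample = "Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s"
rates = parse_throughput(sample)
# Elapsed time implied by the output-token rate and the 1500 output tokens above.
elapsed_s = 1500 / rates["output"]
print(rates, f"elapsed ~ {elapsed_s:.2f} s")
```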

### VisionArena Benchmark for Vision Language Models

```bash
vllm bench throughput \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 1000 \
  --hf-split train
```

The reported `num prompt tokens` now includes image token counts:

```text
Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens:  14527
Total num output tokens:  1280
```

### InstructCoder Benchmark with Speculative Decoding

```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \
vllm bench throughput \
    --dataset-name=hf \
    --dataset-path=likaixin/InstructCoder \
    --model=meta-llama/Meta-Llama-3-8B-Instruct \
    --input-len=1000 \
    --output-len=100 \
    --num-prompts=2048 \
    --async-engine \
    --speculative-config $'{"method": "ngram",
    "num_speculative_tokens": 5, "prompt_lookup_max": 5,
    "prompt_lookup_min": 2}'
```

```text
Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
Total num prompt tokens:  261136
Total num output tokens:  204800
```

### Other HuggingFaceDataset Examples

`lmms-lab/LLaVA-OneVision-Data`:

```bash
vllm bench throughput \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path lmms-lab/LLaVA-OneVision-Data \
  --hf-split train \
  --hf-subset "chart2text(cauldron)" \
  --num-prompts 10
```

`Aeala/ShareGPT_Vicuna_unfiltered`:

```bash
vllm bench throughput \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
  --hf-split train \
  --num-prompts 10
```

`AI-MO/aimo-validation-aime`:

```bash
vllm bench throughput \
  --model Qwen/QwQ-32B \
  --backend vllm \
  --dataset-name hf \
  --dataset-path AI-MO/aimo-validation-aime \
  --hf-split train \
  --num-prompts 10
```

Benchmark with LoRA adapters:

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench throughput \
  --model meta-llama/Llama-2-7b-hf \
  --backend vllm \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --dataset-name sharegpt \
  --num-prompts 10 \
  --max-loras 2 \
  --max-lora-rank 8 \
  --enable-lora \
  --lora-path yard1/llama-2-7b-sql-lora-test
```

</details>

## 🛠️ Example - Structured Output Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the performance of structured output generation (JSON, grammar, regex).

### Server Setup

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B
```

### JSON Schema Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset json \
  --structured-output-ratio 1.0 \
  --request-rate 10 \
  --num-prompts 1000
```

### Grammar-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset grammar \
  --structure-type grammar \
  --request-rate 10 \
  --num-prompts 1000
```

### Regex-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset regex \
  --request-rate 10 \
  --num-prompts 1000
```

### Choice-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset choice \
  --request-rate 10 \
  --num-prompts 1000
```

### XGrammar Benchmark Dataset

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset xgrammar_bench \
  --request-rate 10 \
  --num-prompts 1000
```

</details>

## 📚 Example - Long Document QA Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the performance of long document question-answering with prefix caching.

### Basic Long Document QA Test

```bash
python3 benchmarks/benchmark_long_document_qa_throughput.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --enable-prefix-caching \
  --num-documents 16 \
  --document-length 2000 \
  --output-len 50 \
  --repeat-count 5
```

### Different Repeat Modes

```bash
# Random mode (default) - shuffle prompts randomly
python3 benchmarks/benchmark_long_document_qa_throughput.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --enable-prefix-caching \
  --num-documents 8 \
  --document-length 3000 \
  --repeat-count 3 \
  --repeat-mode random

# Tile mode - repeat entire prompt list in sequence
python3 benchmarks/benchmark_long_document_qa_throughput.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --enable-prefix-caching \
  --num-documents 8 \
  --document-length 3000 \
  --repeat-count 3 \
  --repeat-mode tile

# Interleave mode - repeat each prompt consecutively
python3 benchmarks/benchmark_long_document_qa_throughput.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --enable-prefix-caching \
  --num-documents 8 \
  --document-length 3000 \
  --repeat-count 3 \
  --repeat-mode interleave
```
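The three orderings described in the comments above can be illustrated with plain lists. This is a minimal sketch based only on those one-line descriptions, not the benchmark script's own code; `repeat_prompts` and its `seed` parameter are hypothetical.

```python
import random


def repeat_prompts(prompts, repeat_count, mode="random", seed=0):
    """Repeat a prompt list in one of three orders (sketch, not the script's code)."""
    if mode == "tile":
        # Whole list repeated back to back: A B C A B C ...
        return list(prompts) * repeat_count
    if mode == "interleave":
        # Each prompt repeated consecutively: A A B B C C ...
        return [p for p in prompts for _ in range(repeat_count)]
    if mode == "random":
        # Same multiset as tile, in shuffled order.
        repeated = list(prompts) * repeat_count
        random.Random(seed).shuffle(repeated)
        return repeated
    raise ValueError(f"unknown repeat mode: {mode!r}")


docs = ["A", "B", "C"]
print(repeat_prompts(docs, 2, "tile"))
print(repeat_prompts(docs, 2, "interleave"))
print(repeat_prompts(docs, 2, "random"))
```

The ordering matters for prefix caching: with interleave a document is reused immediately, while with tile it is revisited only after every other document has been processed, so cached entries may already have been evicted.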

</details>

## 🗂️ Example - Prefix Caching Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the efficiency of automatic prefix caching.

### Fixed Prompt with Prefix Caching

```bash
python3 benchmarks/benchmark_prefix_caching.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --enable-prefix-caching \
  --num-prompts 1 \
  --repeat-count 100 \
  --input-length-range 128:256
```

### ShareGPT Dataset with Prefix Caching

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

python3 benchmarks/benchmark_prefix_caching.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --dataset-path /path/ShareGPT_V3_unfiltered_cleaned_split.json \
  --enable-prefix-caching \
  --num-prompts 20 \
  --repeat-count 5 \
  --input-length-range 128:256
```

### Prefix Repetition Dataset

```bash
vllm bench serve \
  --backend openai \
  --model meta-llama/Llama-2-7b-chat-hf \
  --dataset-name prefix_repetition \
  --num-prompts 100 \
  --prefix-repetition-prefix-len 512 \
  --prefix-repetition-suffix-len 128 \
  --prefix-repetition-num-prefixes 5 \
  --prefix-repetition-output-len 128
```
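To picture the workload these flags describe, the sketch below builds synthetic token-id prompts that share a small pool of prefixes and have unique random suffixes. It is an illustration under stated assumptions, not the dataset's implementation: the round-robin prefix assignment, the `vocab_size` and `seed` parameters, and the name `make_prefix_repetition_prompts` are all hypothetical.

```python
import random


def make_prefix_repetition_prompts(
    num_prompts, prefix_len, suffix_len, num_prefixes, vocab_size=32000, seed=0
):
    """Build token-id prompts that share a small pool of prefixes (illustrative only)."""
    rng = random.Random(seed)
    prefixes = [
        [rng.randrange(vocab_size) for _ in range(prefix_len)]
        for _ in range(num_prefixes)
    ]
    prompts = []
    for i in range(num_prompts):
        prefix = prefixes[i % num_prefixes]  # cycle through the shared prefixes
        suffix = [rng.randrange(vocab_size) for _ in range(suffix_len)]
        prompts.append(prefix + suffix)
    return prompts


# Same sizes as the command above.
prompts = make_prefix_repetition_prompts(
    num_prompts=100, prefix_len=512, suffix_len=128, num_prefixes=5
)
distinct_prefixes = {tuple(p[:512]) for p in prompts}
print(len(prompts), len(prompts[0]), len(distinct_prefixes))
```

With these sizes, each prompt is 640 tokens long and 512 of them can in principle be served from the prefix cache after the first request for that prefix.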

</details>

## ⚡ Example - Request Prioritization Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the performance of request prioritization in vLLM.

### Basic Prioritization Test

```bash
python3 benchmarks/benchmark_prioritization.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --input-len 128 \
  --output-len 64 \
  --num-prompts 100 \
  --scheduling-policy priority
```

### Multiple Sequences per Prompt

```bash
python3 benchmarks/benchmark_prioritization.py \
  --model meta-llama/Llama-2-7b-chat-hf \
  --input-len 128 \
  --output-len 64 \
  --num-prompts 100 \
  --scheduling-policy priority \
  --n 2
```

</details>

## 👁️ Example - Multi-Modal Benchmark

<details>
<summary>Show more</summary>

<br/>

Benchmark the performance of multi-modal requests in vLLM.

### Images (ShareGPT4V)

Start vLLM:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --dtype bfloat16 \
  --limit-mm-per-prompt '{"image": 1}' \
  --allowed-local-media-path /path/to/sharegpt4v/images
```

Send requests with images:

```bash
python benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /path/to/ShareGPT4V/sharegpt4v_instruct_gpt4-vision_cap100k.json \
  --num-prompts 100 \
  --save-result \
  --result-dir ~/vllm_benchmark_results \
  --save-detailed \
  --endpoint /v1/chat/completions
```

### Videos (ShareGPT4Video)

Start vLLM:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --dtype bfloat16 \
  --limit-mm-per-prompt '{"video": 1}' \
  --allowed-local-media-path /path/to/sharegpt4video/videos
```

Send requests with videos:

```bash
python benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /path/to/ShareGPT4Video/llava_v1_5_mix665k_with_video_chatgpt72k_share4video28k.json \
  --num-prompts 100 \
  --save-result \
  --result-dir ~/vllm_benchmark_results \
  --save-detailed \
  --endpoint /v1/chat/completions
```

</details>