# Parameter Sweeps

`vllm bench sweep` is a suite of commands designed to run benchmarks across multiple configurations and compare them by visualizing the results.

## Online Benchmark

### Basic

`vllm bench sweep serve` starts `vllm serve` and iteratively runs `vllm bench serve` for each server configuration.

!!! tip
    If you only need to run benchmarks for a single server configuration, consider using [GuideLLM](https://github.com/vllm-project/guidellm), an established performance benchmarking framework with live progress updates and automatic report generation. It is also more flexible than `vllm bench serve` in terms of dataset loading, request formatting, and workload patterns.

Follow these steps to run the script:

1. Construct the base `vllm serve` command, and pass it to the `--serve-cmd` option.
2. Construct the base `vllm bench serve` command, and pass it to the `--bench-cmd` option.
3. (Optional) If you would like to vary the settings of `vllm serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--serve-params`.

    - Example: Tuning `--max-num-seqs` and `--max-num-batched-tokens`:

    ```json
    [
        {
            "max_num_seqs": 32,
            "max_num_batched_tokens": 1024
        },
        {
            "max_num_seqs": 64,
            "max_num_batched_tokens": 1024
        },
        {
            "max_num_seqs": 64,
            "max_num_batched_tokens": 2048
        },
        {
            "max_num_seqs": 128,
            "max_num_batched_tokens": 2048
        },
        {
            "max_num_seqs": 128,
            "max_num_batched_tokens": 4096
        },
        {
            "max_num_seqs": 256,
            "max_num_batched_tokens": 4096
        }
    ]
    ```

4. (Optional) If you would like to vary the settings of `vllm bench serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--bench-params`.

    - Example: Using different input/output lengths for random dataset:

    ```json
    [
        {
            "_benchmark_name": "scenario_A",
            "random_input_len": 128,
            "random_output_len": 32
        },
        {
            "_benchmark_name": "scenario_B",
            "random_input_len": 256,
            "random_output_len": 64
        },
        {
            "_benchmark_name": "scenario_C",
            "random_input_len": 512,
            "random_output_len": 128
        }
    ]
    ```

5. Set `--output-dir` and optionally `--experiment-name` to control where to save the results.
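
The parameter files from steps 3 and 4 are plain JSON lists of objects, so they can also be generated with a short script instead of being written by hand. The sketch below builds a full grid over two `vllm serve` options; `make_grid` is a hypothetical helper, and the value ranges are illustrative only.

```python
import itertools
import json


def make_grid(**axes):
    """Return one dict per combination of the given parameter values."""
    names = list(axes)
    return [
        dict(zip(names, values))
        for values in itertools.product(*(axes[name] for name in names))
    ]


# Illustrative values only; pick ranges that make sense for your hardware.
serve_params = make_grid(
    max_num_seqs=[32, 64, 128],
    max_num_batched_tokens=[1024, 2048],
)

# Print the JSON to stdout; redirect it to a file such as
# benchmarks/serve_hparams.json and pass that path to --serve-params.
print(json.dumps(serve_params, indent=4))
```

A full grid grows quickly, so it may be worth pruning combinations that are not meaningful before running the sweep.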

Example command:

```bash
vllm bench sweep serve \
    --serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
    --bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
    --serve-params benchmarks/serve_hparams.json \
    --bench-params benchmarks/bench_hparams.json \
    --output-dir benchmarks/results \
    --experiment-name demo
```

By default, each parameter combination is benchmarked 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.

!!! important
    If both `--serve-params` and `--bench-params` are passed, the script will iterate over the Cartesian product of the two.
    You can use `--dry-run` to preview the commands to be run.

    We only start the server once for each `--serve-params`, and keep it running for multiple `--bench-params`.
    Between each benchmark run, we call all `/reset_*_cache` endpoints to get a clean slate for the next run.
    In case you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
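
As a rough mental model of the iteration order described above, the sketch below enumerates the commands a sweep would cover. It is not the actual implementation: `to_cli_args` and `preview` are hypothetical helpers, the mapping from JSON keys to CLI flags (underscores to dashes) is inferred from the examples on this page, skipping keys that start with `_` is an assumption, and so is the placement of repeated runs in the innermost loop.

```python
import itertools
import shlex


def to_cli_args(params):
    """Turn {"max_num_seqs": 32} into ["--max-num-seqs", "32"].

    Assumption: keys starting with "_" (such as "_benchmark_name") are
    metadata rather than CLI options, so they are skipped here.
    """
    args = []
    for key, value in params.items():
        if key.startswith("_"):
            continue
        args += ["--" + key.replace("_", "-"), str(value)]
    return args


def preview(serve_cmd, bench_cmd, serve_params, bench_params, num_runs=3):
    """Yield (serve command, bench command, run index) tuples."""
    for serve in serve_params:
        # One server start per serve-side combination...
        serve_line = shlex.join(shlex.split(serve_cmd) + to_cli_args(serve))
        # ...reused for every bench-side combination and every repeat.
        for bench, run in itertools.product(bench_params, range(num_runs)):
            bench_line = shlex.join(shlex.split(bench_cmd) + to_cli_args(bench))
            yield serve_line, bench_line, run


serve_params = [{"max_num_seqs": 32}, {"max_num_seqs": 64}]
bench_params = [{"_benchmark_name": "scenario_A", "random_input_len": 128}]
plan = list(
    preview(
        "vllm serve meta-llama/Llama-2-7b-chat-hf",
        "vllm bench serve --model meta-llama/Llama-2-7b-chat-hf",
        serve_params,
        bench_params,
    )
)
for serve_line, bench_line, run in plan:
    print(f"[run {run}] {serve_line}  ->  {bench_line}")
```

For the authoritative list of commands, rely on `--dry-run` rather than on this sketch.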

!!! note
    You should set `_benchmark_name` to provide a human-readable name for parameter combinations involving many variables.
    This becomes mandatory if the file name would otherwise exceed the maximum path length allowed by the filesystem.

!!! tip
    You can use the `--resume` option to continue the parameter sweep if an unexpected error occurs, e.g., timeout when connecting to HF Hub.

### Workload Explorer

`vllm bench sweep serve_workload` is a variant of `vllm bench sweep serve` that explores different workload levels in order to find the tradeoff between latency and throughput. The results can also be [visualized](#visualization) to determine the feasible SLAs.

The workload can be expressed in terms of request rate or concurrency (choose using `--workload-var`).

Example command:

```bash
vllm bench sweep serve_workload \
    --serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
    --bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100' \
    --workload-var max_concurrency \
    --serve-params benchmarks/serve_hparams.json \
    --bench-params benchmarks/bench_hparams.json \
    --num-runs 1 \
    --output-dir benchmarks/results \
    --experiment-name demo
```

The algorithm for exploring different workload levels can be summarized as follows:

1. Run the benchmark by sending requests one at a time (serial inference, lowest workload). This results in the lowest possible latency and throughput.
2. Run the benchmark by sending all requests at once (batch inference, highest workload). This results in the highest possible latency and throughput.
3. Estimate the value of `workload_var` corresponding to Step 2.
4. Spend the remaining iterations on intermediate values of `workload_var`, spaced uniformly between the two extremes.
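
The last step can be pictured with the sketch below. It is only an illustration under stated assumptions: `plan_workload_levels` is a hypothetical helper, "uniformly" is read as linear spacing, and the iteration count is assumed to include the two boundary runs from steps 1 and 2. The real command may round or space the values differently.

```python
def plan_workload_levels(serial_level, batch_level, workload_iters):
    """Return the workload values to benchmark, from lowest to highest.

    serial_level: workload value for the serial run (step 1).
    batch_level: estimated workload value for the batch run (step 3).
    workload_iters: total number of runs. Assumption: this count includes
        the two boundary runs, so only the rest are intermediate values.
    """
    remaining = max(workload_iters - 2, 0)
    # Linear spacing strictly between the two boundary values.
    step = (batch_level - serial_level) / (remaining + 1)
    middle = [serial_level + step * (i + 1) for i in range(remaining)]
    return [serial_level, *middle, batch_level]


# Example: concurrency 1 for the serial run and an estimated 101 for the
# batch run, with six runs in total.
print(plan_workload_levels(1, 101, 6))
```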

You can override the number of iterations in the algorithm by setting `--workload-iters`.

!!! tip
    This is our equivalent of [GuideLLM's `--profile sweep`](https://github.com/vllm-project/guidellm/blob/v0.5.3/src/guidellm/benchmark/profiles.py#L575).

    In general, `--workload-var max_concurrency` produces more reliable results because it directly controls the workload imposed on the vLLM engine.
    Nevertheless, we default to `--workload-var request_rate` to stay consistent with GuideLLM's behavior.

## Startup Benchmark

`vllm bench sweep startup` runs `vllm bench startup` across parameter combinations to compare cold/warm startup time for different engine settings.

Follow these steps to run the script:

1. (Optional) Construct the base `vllm bench startup` command, and pass it to `--startup-cmd` (default: `vllm bench startup`).
2. (Optional) Reuse a `--serve-params` JSON from `vllm bench sweep serve` to vary engine settings. Only parameters supported by `vllm bench startup` are applied.
3. (Optional) Create a `--startup-params` JSON to vary startup-specific options like iteration counts.
4. Determine where you want to save the results, and pass that to `--output-dir`.

Example `--serve-params`:

```json
[
    {
        "_benchmark_name": "tp1",
        "model": "Qwen/Qwen3-0.6B",
        "tensor_parallel_size": 1,
        "gpu_memory_utilization": 0.9
    },
    {
        "_benchmark_name": "tp2",
        "model": "Qwen/Qwen3-0.6B",
        "tensor_parallel_size": 2,
        "gpu_memory_utilization": 0.9
    }
]
```

Example `--startup-params`:

```json
[
    {
        "_benchmark_name": "qwen3-0.6",
        "num_iters_cold": 2,
        "num_iters_warmup": 1,
        "num_iters_warm": 2
    }
]
```

Example command:

```bash
vllm bench sweep startup \
    --startup-cmd 'vllm bench startup --model Qwen/Qwen3-0.6B' \
    --serve-params benchmarks/serve_hparams.json \
    --startup-params benchmarks/startup_hparams.json \
    --output-dir benchmarks/results \
    --experiment-name demo
```

!!! important
    By default, unsupported parameters in `--serve-params` or `--startup-params` are ignored with a warning.
    Use `--strict-params` to fail fast on unknown keys.
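
The difference between the default behavior and `--strict-params` can be sketched as follows. This is an illustration, not the tool's code: `filter_params` and the `SUPPORTED` set are hypothetical, and treating keys that start with `_` (such as `_benchmark_name`) as always allowed is an assumption.

```python
import warnings

# Hypothetical set of options understood by the startup benchmark;
# the real list depends on your vLLM version.
SUPPORTED = {"model", "tensor_parallel_size", "gpu_memory_utilization"}


def filter_params(params, supported, strict=False):
    """Drop unsupported keys, warning by default or failing when strict."""
    unknown = [
        key for key in params
        if not key.startswith("_") and key not in supported
    ]
    if unknown and strict:
        # Strict mode: fail fast instead of silently changing the sweep.
        raise ValueError(f"Unsupported parameters: {unknown}")
    for key in unknown:
        warnings.warn(f"Ignoring unsupported parameter: {key}")
    return {key: value for key, value in params.items() if key not in unknown}


combo = {"_benchmark_name": "tp1", "model": "Qwen/Qwen3-0.6B", "max_num_seqs": 32}
print(filter_params(combo, SUPPORTED))
```

Failing fast is usually preferable in automated runs, where a silently ignored key could make two supposedly different configurations identical.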

## Visualization

### Basic

`vllm bench sweep plot` can be used to plot performance curves from parameter sweep results.

Control the variables to plot via `--var-x` and `--var-y`, optionally applying `--filter-by` and `--bin-by` to the values. The plot is organized according to `--fig-by`, `--row-by`, `--col-by`, and `--curve-by`.

Example commands for visualizing [Workload Explorer](#workload-explorer) results:

```bash
EXPERIMENT_DIR=${1:-"benchmarks/results/demo"}

# Latency increases as the workload increases
vllm bench sweep plot $EXPERIMENT_DIR \
    --var-x max_concurrency \
    --var-y median_ttft_ms \
    --col-by _benchmark_name \
    --curve-by max_num_seqs,max_num_batched_tokens \
    --fig-name latency_curve

# Throughput saturates as workload increases
vllm bench sweep plot $EXPERIMENT_DIR \
    --var-x max_concurrency \
    --var-y total_token_throughput \
    --col-by _benchmark_name \
    --curve-by max_num_seqs,max_num_batched_tokens \
    --fig-name throughput_curve

# Tradeoff between latency and throughput
vllm bench sweep plot $EXPERIMENT_DIR \
    --var-x total_token_throughput \
    --var-y median_ttft_ms \
    --col-by _benchmark_name \
    --curve-by max_num_seqs,max_num_batched_tokens \
    --fig-name latency_throughput
```

!!! tip
    You can use `--dry-run` to preview the figures to be plotted.

### Pareto chart

`vllm bench sweep plot_pareto` helps pick configurations that balance per-user and per-GPU throughput.

Higher concurrency or batch size can raise GPU efficiency (per-GPU throughput) but adds per-user latency, while lower concurrency improves the per-user rate but underutilizes GPUs. The Pareto frontier shows the best achievable pairs across your runs.

- x-axis: tokens/s/user = `output_throughput` ÷ concurrency (`--user-count-var`, default `max_concurrency`, fallback `max_concurrent_requests`).
- y-axis: tokens/s/GPU = `output_throughput` ÷ GPU count (`--gpu-count-var` if set; otherwise `gpu_count` is TP × PP × DP).
- Output: a single figure at `OUTPUT_DIR/pareto/PARETO.png`.
- Labels: `--label-by` selects the configuration values shown for each data point (default: `max_concurrency,gpu_count`).
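
To make the two axes and the frontier concrete, the sketch below computes both ratios for a handful of runs and keeps the points that no other point beats on both axes. The run values are made up for illustration, `pareto_frontier` is a hypothetical helper, and the plotting command may select or break ties differently.

```python
def pareto_frontier(points):
    """Return the (x, y) points not dominated by any other point.

    Higher is better on both axes. Points are returned in order of
    decreasing x, which means increasing y along the frontier.
    """
    frontier = []
    # Visit points from best x to worst x; a point survives only if its
    # y is better than every point that already beats or ties it on x.
    for point in sorted(points, key=lambda p: (-p[0], -p[1])):
        if not frontier or point[1] > frontier[-1][1]:
            frontier.append(point)
    return frontier


# Made-up results: output_throughput is in tokens/s.
runs = [
    {"max_concurrency": 1, "gpu_count": 1, "output_throughput": 50.0},
    {"max_concurrency": 8, "gpu_count": 1, "output_throughput": 320.0},
    {"max_concurrency": 32, "gpu_count": 1, "output_throughput": 640.0},
    {"max_concurrency": 32, "gpu_count": 2, "output_throughput": 640.0},
]

points = [
    (
        run["output_throughput"] / run["max_concurrency"],  # tokens/s/user
        run["output_throughput"] / run["gpu_count"],         # tokens/s/GPU
    )
    for run in runs
]
frontier = pareto_frontier(points)
print(frontier)
```

In this toy data the two-GPU run falls off the frontier because the eight-way single-GPU run matches its per-GPU throughput while serving each user faster.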

Example:

```bash
EXPERIMENT_DIR=${1:-"benchmarks/results/demo"}

vllm bench sweep plot_pareto $EXPERIMENT_DIR \
  --label-by max_concurrency,tensor_parallel_size,pipeline_parallel_size
```

!!! tip
    You can use `--dry-run` to preview the figures to be plotted.