You can use the `--resume` option to continue the parameter sweep if one of the runs failed.
#### SLA Mode
### SLA Auto-Tuner
By passing SLA constraints via `--sla-params`, you can run this script in SLA mode, causing it to adjust either the request rate or concurrency (choose using `--sla-variable`) in order to satisfy the SLA constraints.
[`vllm/benchmarks/sweep/serve_sla.py`](../../vllm/benchmarks/sweep/serve_sla.py) is a wrapper over [`vllm/benchmarks/sweep/serve.py`](../../vllm/benchmarks/sweep/serve.py) that tunes either the request rate or concurrency (choose using `--sla-variable`) in order to satisfy the SLA constraints given by `--sla-params`.
For example, to ensure E2E latency within different target values for 99% of requests:
...
...
@@ -1044,7 +1042,7 @@ For example, to ensure E2E latency within different target values for 99% of req
@@ -1066,6 +1064,24 @@ The algorithm for adjusting the SLA variable is as follows:
For a given combination of `--serve-params` and `--bench-params`, we share the benchmark results across `--sla-params` to avoid rerunning benchmarks with the same SLA variable value.
### Visualizer
[`vllm/benchmarks/sweep/plot.py`](../../vllm/benchmarks/sweep/plot.py) can be used to plot performance curves from parameter sweep results.
You can use `--dry-run` to preview the figures to be plotted.
## Performance Benchmarks
The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.