@@ -11,7 +11,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc
...
@@ -11,7 +11,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc
## Performance benchmark quick overview
## Performance benchmark quick overview
**Benchmarking Coverage**: latency, throughput and fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!), with different models.
**Benchmarking Coverage**: latency, throughput and fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) and Intel® Xeon® Processors, with different models.
**Benchmarking Duration**: about 1hr.
**Benchmarking Duration**: about 1hr.
...
@@ -31,13 +31,27 @@ Performance benchmark will be triggered when:
...
@@ -31,13 +31,27 @@ Performance benchmark will be triggered when:
- A PR being merged into vllm.
- A PR being merged into vllm.
- Every commit for those PRs with `perf-benchmarks` label AND `ready` label.
- Every commit for those PRs with `perf-benchmarks` label AND `ready` label.
-`ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
-`SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
-`LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
-`THROUGHPUT_JSON`: JSON file to use for the throughout tests. Default value is empty string (use default file).
-`REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
-`REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
Nightly benchmark will be triggered when:
Nightly benchmark will be triggered when:
- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.
- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.
## Performance benchmark details
## Performance benchmark details
See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
> NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.
### Latency test
### Latency test
Here is an example of one test inside `latency-tests.json`:
Here is an example of one test inside `latency-tests.json`:
...
@@ -119,6 +133,30 @@ If you do not see the table, please wait till the benchmark finish running.
...
@@ -119,6 +133,30 @@ If you do not see the table, please wait till the benchmark finish running.
The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.
The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.
The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.
The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.
The `compare-json-results.py` helps to compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
When run, benchmark script generates results under `benchmark/results` folder, along with the `benchmark_results.md` and `benchmark_results.json`.
`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratio e.g. for Output Tput, Median TTFT and Median TPOT.
Here is an example using the script to compare result_a and result_b without detail test name.
See [nightly-descriptions.md](nightly-descriptions.md) for the detailed description on test workload, models and docker containers of benchmarking other llm engines.
See [nightly-descriptions.md](nightly-descriptions.md) for the detailed description on test workload, models and docker containers of benchmarking other llm engines.
- Output length: the corresponding output length of these 200 prompts.
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
-**Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
-**Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).