Commit 3fb4b5fa authored by zhuwenwen

Merge tag 'v0.18.0' into v0.18.0-ori

parents bcf25339 89138b21
This source diff could not be displayed because it is too large. You can view the blob instead.
......@@ -4,6 +4,11 @@ This section guides you through running benchmark tests with the extensive datas
It's a living document, updated as new features and datasets become available.
!!! tip
The benchmarks described on this page are mainly for evaluating specific vLLM features as well as regression testing.
For benchmarking production vLLM servers, we recommend [GuideLLM](https://github.com/vllm-project/guidellm), an established performance benchmarking framework with live progress updates and automatic report generation. It is also more flexible than `vllm bench serve` in terms of dataset loading, request formatting, and workload patterns.
## Dataset Overview
<style>
......@@ -13,14 +18,14 @@ th {
</style>
| Dataset | Online | Offline | Data Path |
|---------|--------|---------|-----------|
| ------- | ------ | ------- | --------- |
| ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
| ShareGPT4V (Image) | ✅ | ✅ | `wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:<br>`wget http://images.cocodataset.org/zips/train2017.zip` |
| ShareGPT4Video (Video) | ✅ | ✅ | `git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video` |
| BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
| Sonnet (deprecated) | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
| Random | ✅ | ✅ | `synthetic` |
| RandomMultiModal (Image/Video) | 🟡 | 🚧 | `synthetic` |
| RandomMultiModal (Image/Video) | | | `synthetic` |
| RandomForReranking | ✅ | ✅ | `synthetic` |
| Prefix Repetition | ✅ | ✅ | `synthetic` |
| HuggingFace-VisionArena | ✅ | ✅ | `lmarena-ai/VisionArena-Chat` |
......@@ -30,6 +35,7 @@ th {
| HuggingFace-Other | ✅ | ✅ | `lmms-lab/LLaVA-OneVision-Data`, `Aeala/ShareGPT_Vicuna_unfiltered` |
| HuggingFace-MTBench | ✅ | ✅ | `philschmid/mt-bench` |
| HuggingFace-Blazedit | ✅ | ✅ | `vdaita/edit_5k_char`, `vdaita/edit_10k_char` |
| HuggingFace-ASR | ✅ | ✅ | `openslr/librispeech_asr`, `facebook/voxpopuli`, `LIUM/tedlium`, `edinburghcstr/ami`, `speechcolab/gigaspeech`, `kensho/spgispeech` |
| Spec Bench | ✅ | ✅ | `wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl` |
| Custom | ✅ | ✅ | Local file: `data.jsonl` |
| Custom MM | ✅ | ✅ | Local file: `mm_data.jsonl` |
......@@ -299,6 +305,22 @@ vllm bench serve \
--blazedit-max-distance 0.99
```
`openslr/librispeech_asr`, `facebook/voxpopuli`, `LIUM/tedlium`, `edinburghcstr/ami`, `speechcolab/gigaspeech`, `kensho/spgispeech`
```bash
vllm bench serve \
--model openai/whisper-large-v3-turbo \
--backend openai-audio \
--dataset-name hf \
--dataset-path facebook/voxpopuli --hf-subset en --hf-split test --no-stream --trust-remote-code \
--num-prompts 99999999 \
--no-oversample \
--endpoint /v1/audio/transcriptions \
--ready-check-timeout-sec 600 \
--save-result \
--max-concurrency 512
```
#### Running With Sampling Parameters
When using OpenAI-compatible backends such as `vllm`, optional sampling
......@@ -361,14 +383,14 @@ The `--burstiness` parameter mathematically controls request arrival patterns us
Load Pattern Recommendations by Use Case:
| Use Case | Burstiness | Request Rate | Max Concurrency | Description |
| --- | --- | --- | --- | --- |
| Use Case | Burstiness | Request Rate | Max Concurrency | Description |
| --- | --- | --- | --- | --- |
| Maximum Throughput | N/A | Infinite | Limited | **Most common**: Simulates load balancer/gateway limits with unlimited user demand |
| Realistic Testing | 1.0 | Moderate (5-20) | Infinite | Natural Poisson traffic patterns for baseline performance |
| Stress Testing | 0.1-0.5 | High (20-100) | Infinite | Challenging burst patterns to test resilience |
| Latency Profiling | 2.0-5.0 | Low (1-10) | Infinite | Uniform load for consistent timing analysis |
| Capacity Planning | 1.0 | Variable | Limited | Test resource limits with realistic constraints |
| SLA Validation | 1.0 | Target rate | SLA limit | Production-like constraints for compliance testing |
| Realistic Testing | 1.0 | Moderate (5-20) | Infinite | Natural Poisson traffic patterns for baseline performance |
| Stress Testing | 0.1-0.5 | High (20-100) | Infinite | Challenging burst patterns to test resilience |
| Latency Profiling | 2.0-5.0 | Low (1-10) | Infinite | Uniform load for consistent timing analysis |
| Capacity Planning | 1.0 | Variable | Limited | Test resource limits with realistic constraints |
| SLA Validation | 1.0 | Target rate | SLA limit | Production-like constraints for compliance testing |
These load patterns help evaluate different aspects of your vLLM deployment, from basic performance characteristics to resilience under challenging traffic conditions.
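The burstiness values in the table can be illustrated with a small simulation. This is a minimal sketch, assuming inter-arrival gaps are drawn from a Gamma distribution whose shape equals the burstiness value (so 1.0 reduces to Poisson arrivals); `sample_intervals` and `coefficient_of_variation` are illustrative names, not part of vLLM.

```python
import random
import statistics


def sample_intervals(request_rate, burstiness, n, seed=0):
    """Draw n inter-arrival gaps (seconds) with mean 1 / request_rate."""
    rng = random.Random(seed)
    # Gamma(shape=k, scale=theta) has mean k * theta, so this scale keeps the
    # average rate fixed while the shape changes how bursty the arrivals are.
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]


def coefficient_of_variation(gaps):
    """Standard deviation divided by mean; larger means burstier traffic."""
    return statistics.pstdev(gaps) / statistics.fmean(gaps)


for burstiness in (0.5, 1.0, 4.0):
    gaps = sample_intervals(request_rate=10.0, burstiness=burstiness, n=20000)
    print(burstiness, statistics.fmean(gaps), coefficient_of_variation(gaps))
```

Under this assumption, the mean gap stays near `1 / request_rate` for every burstiness value, while the spread of the gaps shrinks as burstiness grows, which matches the "burst" versus "uniform" descriptions in the table.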
......@@ -523,6 +545,24 @@ vllm bench throughput \
--lora-path yard1/llama-2-7b-sql-lora-test
```
#### Synthetic Random Multimodal (random-mm)
Generate synthetic multimodal inputs for offline throughput testing without external datasets.
Use `--backend vllm-chat` so that image tokens are counted correctly.
```bash
vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name random-mm \
--num-prompts 100 \
--random-input-len 300 \
--random-output-len 40 \
--random-mm-base-items-per-request 2 \
--random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
--random-mm-bucket-config '{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}'
```
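To make the `--random-mm-bucket-config` value easier to read, here is a minimal sketch of how such a string could be interpreted. It assumes each key is a `(height, width, num_frames)` tuple (with `num_frames` of 1 meaning an image) and each value is a sampling weight; the helper names are hypothetical and this is not vLLM's own parser.

```python
import ast
import random


def parse_bucket_config(text):
    """Parse a dict literal with tuple keys and normalize the weights."""
    raw = ast.literal_eval(text)
    total = sum(raw.values())
    return {bucket: weight / total for bucket, weight in raw.items()}


def sample_buckets(config, n, seed=0):
    """Draw n buckets according to their normalized weights."""
    rng = random.Random(seed)
    buckets = list(config)
    weights = [config[bucket] for bucket in buckets]
    return rng.choices(buckets, weights=weights, k=n)


config = parse_bucket_config("{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}")
draws = sample_buckets(config, n=10)
print(draws)
```

With the weights above, roughly 70% of generated items would fall in the 256x256 bucket and 30% in the 720x1280 bucket.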
</details>
### 🛠️ Structured Output Benchmark
......@@ -824,8 +864,8 @@ Generate synthetic image inputs alongside random text prompts to stress-test vis
Notes:
- Works only with online benchmark via the OpenAI backend (`--backend openai-chat`) and endpoint `/v1/chat/completions`.
- Video sampling is not yet implemented.
- For online benchmarks, use `--backend openai-chat` with endpoint `/v1/chat/completions`.
- For offline benchmarks, use `--backend vllm-chat` (see [Offline Throughput Benchmark](#-offline-throughput-benchmark) for an example).
Start the server (example):
......@@ -891,6 +931,74 @@ This should be seen as an edge case, and if this behavior can be avoided by sett
</details>
### 🔬 Multimodal Processor Benchmark
Benchmark per-stage latency of the multimodal (MM) input processor pipeline, including the encoder forward pass. This is useful for profiling preprocessing bottlenecks in vision-language models.
<details class="admonition abstract" markdown="1">
<summary>Show more</summary>
The benchmark measures the following stages for each request:
| Stage | Description |
| ----- | ----------- |
| `get_mm_hashes_secs` | Time spent hashing multimodal inputs |
| `get_cache_missing_items_secs` | Time spent looking up the processor cache |
| `apply_hf_processor_secs` | Time spent in the HuggingFace processor |
| `merge_mm_kwargs_secs` | Time spent merging multimodal kwargs |
| `apply_prompt_updates_secs` | Time spent updating prompt tokens |
| `preprocessor_total_secs` | Total preprocessing time |
| `encoder_forward_secs` | Time spent in the encoder model forward pass |
| `num_encoder_calls` | Number of encoder invocations per request |
The benchmark also reports end-to-end latency (TTFT + decode time) per
request. Use `--metric-percentiles` to select which percentiles to report
(default: p99) and `--output-json` to save results.
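As a rough illustration of what a percentile summary over one stage means, the sketch below computes percentiles with linear interpolation. The timings are made-up values, and the interpolation method is an assumption; the benchmark may compute percentiles differently.

```python
def percentile(values, p):
    """Linear-interpolation percentile of a non-empty list, p in [0, 100]."""
    xs = sorted(values)
    rank = (len(xs) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(xs) - 1)
    return xs[lower] + (xs[upper] - xs[lower]) * (rank - lower)


# Hypothetical per-request timings in seconds, not real benchmark output.
encoder_forward_secs = [0.021, 0.019, 0.025, 0.020, 0.090]
summary = {f"p{p}": percentile(encoder_forward_secs, p) for p in (50, 90, 95, 99)}
print(summary)
```

A single slow request (0.090 s here) barely moves the median but dominates the high percentiles, which is why reporting several percentiles is more informative than p99 alone.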
#### Basic Example with Synthetic Data (random-mm)
```bash
vllm bench mm-processor \
--model Qwen/Qwen2-VL-7B-Instruct \
--dataset-name random-mm \
--num-prompts 50 \
--random-input-len 300 \
--random-output-len 40 \
--random-mm-base-items-per-request 2 \
--random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
--random-mm-bucket-config '{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}'
```
#### Using a HuggingFace Dataset
```bash
vllm bench mm-processor \
--model Qwen/Qwen2-VL-7B-Instruct \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--hf-split train \
--num-prompts 100
```
#### Warmup, Custom Percentiles, and JSON Output
```bash
vllm bench mm-processor \
--model Qwen/Qwen2-VL-7B-Instruct \
--dataset-name random-mm \
--num-prompts 200 \
--num-warmups 5 \
--random-input-len 300 \
--random-output-len 40 \
--random-mm-base-items-per-request 1 \
--metric-percentiles 50,90,95,99 \
--output-json results.json
```
See [`vllm bench mm-processor`](../cli/bench/mm_processor.md) for the full argument reference.
</details>
### Embedding Benchmark
Benchmark the performance of embedding requests in vLLM.
......
......@@ -39,6 +39,12 @@ When run, benchmark script generates results under **benchmark/results** folder,
- `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file).
- `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
- `PROMPTS_PER_CONCURRENCY`: Multiplier to compute `num_prompts` for serving tests (`num_prompts = max_concurrency × value`). Overrides JSON `num_prompts`. Default is NULL.
- `ENABLE_ADAPTIVE_CONCURRENCY`: set the value to '1' to enable adaptive SLA-based concurrency search after the static serving max_concurrency sweep. Default value is 0.
- `SLA_TTFT_MS`: default TTFT SLA threshold in milliseconds for adaptive concurrency search. Default value is 3000.
- `SLA_TPOT_MS`: default TPOT SLA threshold in milliseconds for adaptive concurrency search. Default value is 100.
- `ADAPTIVE_MAX_PROBES`: maximum number of extra adaptive search probes. Default value is 8.
- `ADAPTIVE_MAX_CONCURRENCY`: maximum allowed concurrency during adaptive search. Default value is 1024.
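The `PROMPTS_PER_CONCURRENCY` rule in the list above can be sketched as follows. `resolve_num_prompts` is a hypothetical helper written for illustration; it only restates the documented formula and the documented fallback to the JSON value when the variable is unset.

```python
def resolve_num_prompts(max_concurrency, json_num_prompts, prompts_per_concurrency=None):
    """Return the number of prompts to send for one serving test."""
    if prompts_per_concurrency is None:
        # Variable unset: keep the value from the JSON test definition.
        return json_num_prompts
    # Variable set: it overrides the JSON value.
    return max_concurrency * prompts_per_concurrency


print(resolve_num_prompts(32, 200))      # JSON value is used
print(resolve_num_prompts(32, 200, 10))  # 32 x 10 prompts
```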
### Visualization
......@@ -60,12 +66,12 @@ Here is an example using the script to compare result_a and result_b with max co
***Output Tput (tok/s) — Model : [ meta-llama/Llama-3.1-8B-Instruct ] , Dataset Name : [ random ] , Input Len : [ 2048.0 ] , Output Len : [ 2048.0 ]***
| | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|----|------|-----|-----------|----------|----------|
| 0 | 12 | inf | 24.98 | 186.03 | 7.45 |
| 1 | 16 | inf| 25.49 | 246.92 | 9.69 |
| 2 | 24 | inf| 27.74 | 293.34 | 10.57 |
| 3 | 32 | inf| 28.61 |306.69 | 10.72 |
| | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
| | -------------------- | --- | -------------------------------- | -------------------------------- | ---------- |
| 0 | 12 | inf | 24.98 | 186.03 | 7.45 |
| 1 | 16 | inf | 25.49 | 246.92 | 9.69 |
| 2 | 24 | inf | 27.74 | 293.34 | 10.57 |
| 3 | 32 | inf | 28.61 | 306.69 | 10.72 |
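The `perf_ratio` column appears to be the `results_b` value divided by the `results_a` value, rounded to two decimals. This reading is inferred from the numbers in the table rather than from the script itself; the sketch below simply checks it against the four rows.

```python
# (max concurrency, results_a tok/s, results_b tok/s) copied from the table.
rows = [
    (12, 24.98, 186.03),
    (16, 25.49, 246.92),
    (24, 27.74, 293.34),
    (32, 28.61, 306.69),
]


def perf_ratio(baseline, candidate):
    """Ratio of candidate throughput to baseline throughput."""
    return round(candidate / baseline, 2)


ratios = [perf_ratio(a, b) for _, a, b in rows]
print(ratios)
```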
***compare-json-results.py – Command-Line Parameters***
......
# Parameter Sweeps
`vllm bench sweep` is a suite of commands designed to run benchmarks across multiple configurations and compare them by visualizing the results.
## Online Benchmark
### Basic
`vllm bench sweep serve` automatically starts `vllm serve` and runs `vllm bench serve` to evaluate vLLM over multiple configurations.
`vllm bench sweep serve` starts `vllm serve` and iteratively runs `vllm bench serve` for each server configuration.
!!! tip
If you only need to run benchmarks for a single server configuration, consider using [GuideLLM](https://github.com/vllm-project/guidellm), an established performance benchmarking framework with live progress updates and automatic report generation. It is also more flexible than `vllm bench serve` in terms of dataset loading, request formatting, and workload patterns.
Follow these steps to run the script:
......@@ -50,21 +55,24 @@ Follow these steps to run the script:
```json
[
{
"_benchmark_name": "scenario_A",
"random_input_len": 128,
"random_output_len": 32
},
{
"_benchmark_name": "scenario_B",
"random_input_len": 256,
"random_output_len": 64
},
{
"_benchmark_name": "scenario_C",
"random_input_len": 512,
"random_output_len": 128
}
]
```
5. Determine where you want to save the results, and pass that to `--output-dir`.
5. Set `--output-dir` and optionally `--experiment-name` to control where to save the results.
Example command:
......@@ -74,9 +82,12 @@ vllm bench sweep serve \
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
--serve-params benchmarks/serve_hparams.json \
--bench-params benchmarks/bench_hparams.json \
-o benchmarks/results
--output-dir benchmarks/results \
--experiment-name demo
```
By default, each parameter combination is benchmarked 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.
!!! important
If both `--serve-params` and `--bench-params` are passed, the script will iterate over the Cartesian product between them.
You can use `--dry-run` to preview the commands to be run.
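To get a feel for how many runs a sweep produces, the Cartesian product can be sketched with `itertools.product`. The `serve_params` values here are hypothetical (the contents of `serve_hparams.json` are not shown), and the three scenarios are assumed to be the benchmark parameters from the JSON example.

```python
import itertools

# Hypothetical server settings, for illustration only.
serve_params = [{"max_num_seqs": 32}, {"max_num_seqs": 64}]

# Scenarios taken from the JSON example.
bench_params = [
    {"_benchmark_name": "scenario_A", "random_input_len": 128, "random_output_len": 32},
    {"_benchmark_name": "scenario_B", "random_input_len": 256, "random_output_len": 64},
    {"_benchmark_name": "scenario_C", "random_input_len": 512, "random_output_len": 128},
]

num_runs = 3  # documented default for --num-runs

combinations = list(itertools.product(serve_params, bench_params))
total_runs = len(combinations) * num_runs

for serve, bench in combinations:
    print(serve, bench["_benchmark_name"])
print("benchmark runs:", total_runs)
```

Because the count grows multiplicatively, a `--dry-run` preview is worth doing before launching a sweep with many parameter files.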
......@@ -86,60 +97,48 @@ vllm bench sweep serve \
In case you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
!!! note
By default, each parameter combination is run 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.
You should set `_benchmark_name` to provide a human-readable name for parameter combinations involving many variables.
This becomes mandatory if the file name would otherwise exceed the maximum path length allowed by the filesystem.
!!! tip
You can use the `--resume` option to continue the parameter sweep if one of the runs failed.
### SLA auto-tuner
You can use the `--resume` option to continue the parameter sweep if an unexpected error occurs, e.g., a timeout when connecting to HF Hub.
`vllm bench sweep serve_sla` is a wrapper over `vllm bench sweep serve` that tunes either the request rate or concurrency (choose using `--sla-variable`) in order to satisfy the SLA constraints given by `--sla-params`.
### Workload Explorer
For example, to ensure E2E latency within different target values for 99% of requests:
`vllm bench sweep serve_workload` is a variant of `vllm bench sweep serve` that explores different workload levels in order to find the tradeoff between latency and throughput. The results can also be [visualized](#visualization) to determine the feasible SLAs.
```json
[
{
"p99_e2el_ms": "<=200"
},
{
"p99_e2el_ms": "<=500"
},
{
"p99_e2el_ms": "<=1000"
},
{
"p99_e2el_ms": "<=2000"
}
]
```
The workload can be expressed in terms of request rate or concurrency (choose using `--workload-var`).
Example command:
```bash
vllm bench sweep serve_sla \
vllm bench sweep serve_workload \
--serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100' \
--workload-var max_concurrency \
--serve-params benchmarks/serve_hparams.json \
--bench-params benchmarks/bench_hparams.json \
--sla-params benchmarks/sla_hparams.json \
--sla-variable max_concurrency \
-o benchmarks/results
--num-runs 1 \
--output-dir benchmarks/results \
--experiment-name demo
```
The algorithm for adjusting the SLA variable is as follows:
The algorithm for exploring different workload levels can be summarized as follows:
1. Run the benchmark once with maximum possible QPS, and once with minimum possible QPS. For each run, calculate the distance of the SLA metrics from their targets, resulting in data points of QPS vs SLA distance.
2. Perform spline interpolation between the data points to estimate the QPS that results in zero SLA distance.
3. Run the benchmark with the estimated QPS and add the resulting data point to the history.
4. Repeat Steps 2 and 3 until the maximum QPS that passes SLA and the minimum QPS that fails SLA in the history are close enough to each other.
1. Run the benchmark by sending requests one at a time (serial inference, lowest workload). This results in the lowest possible latency and throughput.
2. Run the benchmark by sending all requests at once (batch inference, highest workload). This results in the highest possible latency and throughput.
3. Estimate the value of `workload_var` corresponding to Step 2.
4. Run the benchmark over intermediate values of `workload_var` uniformly using the remaining iterations.
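The four steps can be sketched as a planning function. This is an illustration under stated assumptions, not the actual implementation: it assumes the serial and batch runs count toward the iteration budget, that intermediate values are spaced linearly, and that the value estimated in Step 3 is already known. `plan_workload_levels` is a hypothetical name.

```python
def plan_workload_levels(serial_value, batch_value, iters):
    """Return workload values in the order they would be benchmarked."""
    if iters < 2:
        raise ValueError("need at least the serial and batch runs")
    # Steps 1 and 2: the two extremes come first.
    levels = [serial_value, batch_value]
    # Step 4: spread the remaining iterations evenly between the extremes.
    remaining = iters - 2
    step = (batch_value - serial_value) / (remaining + 1)
    levels += [serial_value + step * (i + 1) for i in range(remaining)]
    return levels


# Example: serial inference at 1, batch estimate at 101, six iterations total.
print(plan_workload_levels(1, 101, 6))
```

For `max_concurrency` the intermediate values would need rounding to integers; for `request_rate` fractional values are fine.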
!!! important
SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`.
You can override the number of iterations in the algorithm by setting `--workload-iters`.
For a given combination of `--serve-params` and `--bench-params`, we share the benchmark results across `--sla-params` to avoid rerunning benchmarks with the same SLA variable value.
!!! tip
This is our equivalent of [GuideLLM's `--profile sweep`](https://github.com/vllm-project/guidellm/blob/v0.5.3/src/guidellm/benchmark/profiles.py#L575).
### Startup
In general, `--workload-var max_concurrency` produces more reliable results because it directly controls the workload imposed on the vLLM engine.
Nevertheless, we default to `--workload-var request_rate` to match GuideLLM's behavior.
## Startup Benchmark
`vllm bench sweep startup` runs `vllm bench startup` across parameter combinations to compare cold/warm startup time for different engine settings.
......@@ -189,7 +188,8 @@ vllm bench sweep startup \
--startup-cmd 'vllm bench startup --model Qwen/Qwen3-0.6B' \
--serve-params benchmarks/serve_hparams.json \
--startup-params benchmarks/startup_hparams.json \
-o benchmarks/results
--output-dir benchmarks/results \
--experiment-name demo
```
!!! important
......@@ -202,15 +202,36 @@ vllm bench sweep startup \
`vllm bench sweep plot` can be used to plot performance curves from parameter sweep results.
Example command:
Control the variables to plot via `--var-x` and `--var-y`, optionally applying `--filter-by` and `--bin-by` to the values. The plot is organized according to `--fig-by`, `--row-by`, `--col-by`, and `--curve-by`.
Example commands for visualizing [Workload Explorer](#workload-explorer) results:
```bash
vllm bench sweep plot benchmarks/results/<timestamp> \
EXPERIMENT_DIR=${1:-"benchmarks/results/demo"}
# Latency increases as the workload increases
vllm bench sweep plot $EXPERIMENT_DIR \
--var-x max_concurrency \
--var-y median_ttft_ms \
--col-by _benchmark_name \
--curve-by max_num_seqs,max_num_batched_tokens \
--fig-name latency_curve
# Throughput saturates as workload increases
vllm bench sweep plot $EXPERIMENT_DIR \
--var-x max_concurrency \
--row-by random_input_len \
--col-by random_output_len \
--curve-by api_server_count,max_num_batched_tokens \
--filter-by 'max_concurrency<=1024'
--var-y total_token_throughput \
--col-by _benchmark_name \
--curve-by max_num_seqs,max_num_batched_tokens \
--fig-name throughput_curve
# Tradeoff between latency and throughput
vllm bench sweep plot $EXPERIMENT_DIR \
--var-x total_token_throughput \
--var-y median_ttft_ms \
--col-by _benchmark_name \
--curve-by max_num_seqs,max_num_batched_tokens \
--fig-name latency_throughput
```
!!! tip
......@@ -230,6 +251,11 @@ Higher concurrency or batch size can raise GPU efficiency (per-GPU), but can add
Example:
```bash
vllm bench sweep plot_pareto benchmarks/results/<timestamp> \
EXPERIMENT_DIR=${1:-"benchmarks/results/demo"}
vllm bench sweep plot_pareto $EXPERIMENT_DIR \
--label-by max_concurrency,tensor_parallel_size,pipeline_parallel_size
```
!!! tip
You can use `--dry-run` to preview the figures to be plotted.
# vllm bench mm-processor
## Overview
`vllm bench mm-processor` profiles the multimodal input processor pipeline of
vision-language models. It measures per-stage latency from the HuggingFace
processor through to the encoder forward pass, helping you identify
preprocessing bottlenecks and understand how different image resolutions or
item counts affect end-to-end request time.
The benchmark supports two data sources: synthetic random multimodal inputs
(`random-mm`) and HuggingFace datasets (`hf`). Warmup requests are run before
measurement to ensure stable results.
## Quick Start
```bash
vllm bench mm-processor \
--model Qwen/Qwen2-VL-7B-Instruct \
--dataset-name random-mm \
--num-prompts 50 \
--random-input-len 300 \
--random-output-len 40 \
--random-mm-base-items-per-request 2 \
--random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
--random-mm-bucket-config '{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}'
```
## Measured Stages
| Stage | Description |
| ----- | ----------- |
| `get_mm_hashes_secs` | Time spent hashing multimodal inputs |
| `get_cache_missing_items_secs` | Time spent looking up the processor cache |
| `apply_hf_processor_secs` | Time spent in the HuggingFace processor |
| `merge_mm_kwargs_secs` | Time spent merging multimodal kwargs |
| `apply_prompt_updates_secs` | Time spent updating prompt tokens |
| `preprocessor_total_secs` | Total preprocessing time |
| `encoder_forward_secs` | Time spent in the encoder model forward pass |
| `num_encoder_calls` | Number of encoder invocations per request |
The benchmark also reports end-to-end latency (TTFT + decode time) per
request. Use `--metric-percentiles` to select which percentiles to report
(default: p99) and `--output-json` to save results.
For more examples (HF datasets, warmup, JSON output), see
[Benchmarking CLI — Multimodal Processor Benchmark](../../benchmarking/cli.md#multimodal-processor-benchmark).
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
......
# vllm bench sweep serve_sla
# vllm bench sweep serve_workload
## JSON CLI Arguments
......@@ -6,4 +6,4 @@
## Arguments
--8<-- "docs/generated/argparse/bench_sweep_serve_sla.inc.md"
--8<-- "docs/generated/argparse/bench_sweep_serve_workload.inc.md"
<!-- markdownlint-disable MD041 -->
When passing JSON CLI arguments, the following sets of arguments are equivalent:
- `--json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'`
......@@ -6,4 +7,4 @@ When passing JSON CLI arguments, the following sets of arguments are equivalent:
Additionally, list elements can be passed individually using `+`:
- `--json-arg '{"key4": ["value3", "value4", "value5"]}'`
- `--json-arg.key4+ value3 --json-arg.key4+='value4,value5'`
\ No newline at end of file
- `--json-arg.key4+ value3 --json-arg.key4+='value4,value5'`
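The equivalence between the JSON form and the dotted form can be illustrated with a small expansion routine. This is a simplified sketch of the idea, not vLLM's argument parser: `expand_dotted_args` is a hypothetical helper that treats `.` as nesting and a trailing `+` as "append to a list", splitting comma-separated values.

```python
def expand_dotted_args(pairs):
    """Build a nested dict from (dotted_key, value) pairs."""
    result = {}
    for key, value in pairs:
        append = key.endswith("+")
        parts = key.rstrip("+").split(".")
        node = result
        # Walk (and create) the intermediate dictionaries.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        leaf = parts[-1]
        if append:
            node.setdefault(leaf, []).extend(value.split(","))
        else:
            node[leaf] = value
    return result


nested = expand_dotted_args([("key1", "value1"), ("key2.key3", "value2")])
listed = expand_dotted_args([("key4+", "value3"), ("key4+", "value4,value5")])
print(nested)
print(listed)
```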
......@@ -15,7 +15,7 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
```
!!! warning
To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.accelerator.set_device_index][])
before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
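A minimal sketch of the recommended pattern: set `CUDA_VISIBLE_DEVICES` in the process environment before anything initializes CUDA, then construct the engine. `select_gpus` is a hypothetical helper, and the vLLM lines are left commented so the sketch runs without a GPU.

```python
import os


def select_gpus(device_ids):
    """Restrict the GPUs visible to this process; call before CUDA is initialized."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)


select_gpus([0, 1])

# Only after the environment is set, import and construct the engine:
# from vllm import LLM
# llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
print(os.environ["CUDA_VISIBLE_DEVICES"])
```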
......
......@@ -5,6 +5,17 @@ This guide covers optimization strategies and performance tuning for vLLM V1.
!!! tip
Running out of memory? Consult [this guide](./conserving_memory.md) on how to conserve memory.
## Optimization Levels
vLLM provides 4 optimization levels (`-O0`, `-O1`, `-O2`, `-O3`) that allow users to trade off startup time for performance:
- `-O0`: No optimizations. Fastest startup time, but lowest performance.
- `-O1`: Fast optimization. Simple compilation and fast fusions, and PIECEWISE cudagraphs.
- `-O2`: Default optimization. Additional compilation ranges, additional fusions, FULL_AND_PIECEWISE cudagraphs.
- `-O3`: Aggressive optimization. Currently equal to `-O2`, but may include additional time-consuming or experimental optimizations in the future.
For more information, see the [optimization level documentation](../design/optimization_levels.md).
## Preemption
Due to the autoregressive nature of transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
......@@ -278,7 +289,7 @@ llm = LLM(
Based on the configuration, the contents of the multi-modal caches on `P0` and `P1` are as follows:
| mm_processor_cache_type | Cache Type | `P0` Cache | `P1` Engine Cache | `P1` Worker Cache | Max. Memory |
|-------------------|-------------|------------|------------|-------------|-------------|
| ----------------- | ----------- | ---------- | ---------- | ----------- | ----------- |
| lru | Processor Caching | K + V | N/A | N/A | `mm_processor_cache_gb * data_parallel_size` |
| lru | Key-Replicated Caching | K | K + V | N/A | `mm_processor_cache_gb * api_server_count` |
| shm | Shared Memory Caching | K | N/A | V | `mm_processor_cache_gb * api_server_count` |
......
......@@ -49,7 +49,13 @@ If you are developing vLLM's Python and CUDA/C++ code, install PyTorch first:
uv pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu129
```
then install vLLM using:
Then install the necessary build dependencies from `requirements/build.txt`, skipping `torch` as it was installed in the previous step:
```bash
grep -v '^torch==' requirements/build.txt | uv pip install -r -
```
Finally, install vLLM using:
```bash
uv pip install -e . --no-build-isolation
......@@ -69,7 +75,7 @@ For an optimized workflow when iterating on C++/CUDA kernels, see the [Increment
vLLM uses `pre-commit` to lint and format the codebase. See <https://pre-commit.com/#usage> if `pre-commit` is new to you. Setting up `pre-commit` is as easy as:
```bash
uv pip install pre-commit
uv pip install 'pre-commit>=4.5.1'
pre-commit install
```
......@@ -88,7 +94,6 @@ vLLM's `pre-commit` hooks will now run automatically every time you commit.
Some `pre-commit` hooks only run in CI. If you need to, you can run them locally with:
```bash
pre-commit run --hook-stage manual markdownlint
pre-commit run --hook-stage manual mypy-3.10
```
......@@ -182,6 +187,30 @@ Using `-s` with `git commit` will automatically add this header.
- **VSCode**: Open the [Settings editor](https://code.visualstudio.com/docs/configure/settings)
and enable the `Git: Always Sign Off` (`git.alwaysSignOff`) field.
### AI Assisted Contributions
Before making an AI assisted contribution, you must:
1. **Be involved**: Do not submit "pure agent" PRs. The human submitter is responsible for reviewing all changed lines, validating behavior end-to-end, and running relevant tests.
2. **Ensure significance**: Avoid one-off "busywork" PRs (single typo, isolated style cleanup, one mutable default fix, etc.). Bundle mechanical cleanups into a clear, systematic scope.
When AI tools provide non-trivial assistance in generating or modifying code, you must:
1. **Review thoroughly**: You remain responsible for all code you submit. Review and understand AI-generated code with the same care as code you write manually.
2. **Disclose in PR**: Always mention when a pull request includes AI-generated code. Add a note in the PR description.
3. **Mark commits**: Add attribution using commit trailers such as `Co-authored-by:` (other projects use `Assisted-by:` or `Generated-by:`). For example:
```text
Your commit message here
Co-authored-by: GitHub Copilot
Co-authored-by: Claude
Co-authored-by: gemini-code-assist
Signed-off-by: Your Name <your.email@example.com>
```
AI-assisted code must meet all quality standards: proper testing, documentation, adherence to style guides, and thorough review. Attribution helps reviewers evaluate contributions in context and maintains legal clarity for the project.
### PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed
......
......@@ -66,12 +66,12 @@ This complicates the process as we cannot use the out-of-the-box
- Important indexes at the moment include:
| Platform | `--extra-index-url` |
|----------|-----------------|
| CUDA 12.8| [https://download.pytorch.org/whl/cu128](https://download.pytorch.org/whl/cu128)|
| CPU | [https://download.pytorch.org/whl/cpu](https://download.pytorch.org/whl/cpu)|
| -------- | ------------------- |
| CUDA 12.8 | [https://download.pytorch.org/whl/cu128](https://download.pytorch.org/whl/cu128) |
| CPU | [https://download.pytorch.org/whl/cpu](https://download.pytorch.org/whl/cpu) |
| ROCm 6.2 | [https://download.pytorch.org/whl/rocm6.2.4](https://download.pytorch.org/whl/rocm6.2.4) |
| ROCm 6.3 | [https://download.pytorch.org/whl/rocm6.3](https://download.pytorch.org/whl/rocm6.3) |
| XPU | [https://download.pytorch.org/whl/xpu](https://download.pytorch.org/whl/xpu) |
| XPU | [https://download.pytorch.org/whl/xpu](https://download.pytorch.org/whl/xpu) |
- Update the below files to match the CUDA version from step 1. This makes sure that the release vLLM wheel is tested on CI.
- `.buildkite/release-pipeline.yaml`
......
......@@ -66,7 +66,7 @@ stages will be removed.
Assume a feature is deprecated in `v0.9.0`.
| Release | Status |
|---------------|-------------------------------------------------------------------------------------------------|
| ------------- | ----------------------------------------------------------------------------------------------- |
| `v0.9.0` | Feature is deprecated with clear removal version listed. |
| `v0.10.0` | Feature is now off by default, throws an error when used, and can be re-enabled for legacy use. |
| `v0.11.0` | Feature is removed. |
......
......@@ -248,21 +248,22 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
self,
seq_len: int,
mm_counts: Mapping[str, int],
mm_options: Mapping[str, BaseDummyOptions] | None = None,
mm_options: Mapping[str, BaseDummyOptions],
) -> MultiModalDataDict:
num_images = mm_counts.get("image", 0)
target_width, target_height = \
self.info.get_image_size_with_most_features()
image_overrides = mm_options.get("image") if mm_options else None
image_overrides = mm_options.get("image")
return {
"image":
self._get_dummy_images(width=target_width,
height=target_height,
num_images=num_images,
overrides=image_overrides)
"image": self._get_dummy_images(
width=target_width,
height=target_height,
num_images=num_images,
overrides=image_overrides,
)
}
```
......@@ -434,17 +435,16 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
self,
seq_len: int,
mm_counts: Mapping[str, int],
mm_options: Optional[Mapping[str, BaseDummyOptions]] = None,
mm_options: Mapping[str, BaseDummyOptions],
) -> MultiModalDataDict:
target_width, target_height = \
self.info.get_image_size_with_most_features()
num_images = mm_counts.get("image", 0)
image_overrides = mm_options.get("image") if mm_options else None
image_overrides = mm_options.get("image")
return {
"image":
self._get_dummy_images(
"image": self._get_dummy_images(
width=target_width,
height=target_height,
num_images=num_images,
......
......@@ -5,8 +5,12 @@
## Profile with PyTorch Profiler
We support tracing vLLM workers using the `torch.profiler` module. You can enable the torch profiler by setting `--profiler-config`
when launching the server, and setting the entries `profiler` to `'torch'` and `torch_profiler_dir` to the directory where you want to save the traces. Additionally, you can control the profiling content by specifying the following additional arguments in the config:
We support tracing vLLM workers using different profilers. You can enable profiling by setting the `--profiler-config` flag when launching the server.
!!! note
The `--profiler-config` flag is available in vLLM v0.13.0 and later. If you are using an earlier version, please upgrade to use this feature.
To use the `torch.profiler` module, set the `profiler` entry to `'torch'` and `torch_profiler_dir` to the directory where you want to save the traces. Additionally, you can control the profiling content by specifying the following additional arguments in the config:
- `torch_profiler_record_shapes` to enable recording Tensor Shapes, off by default
- `torch_profiler_with_memory` to record memory, off by default
......
......@@ -49,7 +49,7 @@ chart **including persistent volumes** and deletes the release.
The following table describes configurable parameters of the chart in `values.yaml`:
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| --- | ---- | ------- | ----------- |
| autoscaling | object | {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80} | Autoscaling configuration |
| autoscaling.enabled | bool | false | Enable autoscaling |
| autoscaling.maxReplicas | int | 100 | Maximum replicas |
......
# RunPod
vLLM can be deployed on [RunPod](https://www.runpod.io/), a cloud GPU platform that provides on-demand and serverless GPU instances for AI inference workloads.
## Prerequisites
- A RunPod account with GPU pod access
- A GPU pod running a CUDA-compatible template (e.g., `runpod/pytorch`)
## Starting the Server
SSH into your RunPod pod and launch the vLLM OpenAI-compatible server:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model <model-name> \
    --host 0.0.0.0 \
    --port 8000
```
!!! note
Use `--host 0.0.0.0` to bind to all interfaces so the server is reachable from outside the container.
## Exposing Port 8000
RunPod exposes HTTP services through its proxy. To make port 8000 accessible:
1. In the RunPod dashboard, navigate to your pod settings.
2. Add `8000` to the list of exposed HTTP ports.
3. After the pod restarts, RunPod provides a public URL in the format:
```text
https://<pod-id>-8000.proxy.runpod.net
```
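The URL pattern above can also be assembled in a script. This is a minimal sketch, not part of RunPod's or vLLM's tooling: the helper name `runpod_proxy_url` is hypothetical and only follows the format shown above.

```python
def runpod_proxy_url(pod_id: str, port: int = 8000) -> str:
    """Build the public proxy URL for an exposed HTTP port on a pod.

    Hypothetical helper: it only mirrors the documented pattern
    https://<pod-id>-<port>.proxy.runpod.net and performs no network calls.
    """
    if not pod_id:
        raise ValueError("pod_id must be a non-empty string")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"https://{pod_id}-{port}.proxy.runpod.net"
```

Pass the same port number that you exposed in the dashboard and gave to `--port`, since a mismatch between the two is one of the causes of proxy errors.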
## Troubleshooting 502 Bad Gateway
A `502 Bad Gateway` error from the RunPod proxy typically means the server is not yet listening. Common causes:
- **Model still loading** — Large models take time to download and load into GPU memory. Check the pod logs for progress.
- **Wrong host binding** — Ensure you passed `--host 0.0.0.0`. Binding to `127.0.0.1` (the default) makes the server unreachable from the proxy.
- **Port mismatch** — Verify the `--port` value matches the port exposed in the RunPod dashboard.
- **Out of GPU memory** — The model may be too large for the allocated GPU. Check logs for CUDA OOM errors and consider using a larger instance or adding `--tensor-parallel-size` for multi-GPU pods.
## Verifying the Deployment
Once the server is running, test it with a curl request:
!!! console "Command"
```bash
curl https://<pod-id>-8000.proxy.runpod.net/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "<model-name>",
        "messages": [
            {"role": "user", "content": "Hello, how are you?"}
        ],
        "max_tokens": 50
    }'
```
!!! console "Response"
```json
{
    "id": "chat-abc123",
    "object": "chat.completion",
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "I'm doing well, thank you for asking! How can I help you today?"
            },
            "index": 0,
            "finish_reason": "stop"
        }
    ]
}
```
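The same request can be sent from Python using only the standard library. The sketch below is illustrative: the function names `build_chat_request` and `first_reply` are hypothetical, and the response handling assumes the body has the shape shown in the example response above.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 50
) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body: dict) -> str:
    """Extract the assistant text from a parsed chat completion response."""
    return response_body["choices"][0]["message"]["content"]


# Usage (placeholders must be replaced with your own pod ID and model name):
#   req = build_chat_request("https://<pod-id>-8000.proxy.runpod.net",
#                            "<model-name>", "Hello, how are you?")
#   with urllib.request.urlopen(req, timeout=60) as resp:
#       print(first_reply(json.load(resp)))
```

Keeping request construction separate from the network call makes it possible to inspect the URL and payload before anything is sent.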
You can also check the server health endpoint:
```bash
curl https://<pod-id>-8000.proxy.runpod.net/health
```
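Because the proxy returns `502 Bad Gateway` while the model is still loading, it can be convenient to poll the health endpoint until it answers with `200`. The following is a rough sketch under stated assumptions: the names `probe_health` and `wait_until_healthy`, the retry count, and the delay are all hypothetical choices, not vLLM or RunPod defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(base_url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of GET <base_url>/health, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers (for example 502 from the proxy) arrive here.
        return err.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return 0


def wait_until_healthy(
    base_url: str,
    probe: Callable[[str], int] = probe_health,
    attempts: int = 60,
    delay_s: float = 10.0,
) -> bool:
    """Poll the probe until it returns 200 or the attempts are exhausted."""
    for attempt in range(attempts):
        if probe(base_url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # Assumed back-off; tune for your model size.
    return False
```

The `probe` parameter is injectable so the retry loop can be exercised without a live server. If the function returns `False`, check the pod logs for the causes listed under the troubleshooting section.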
# AIBrix
[AIBrix](https://github.com/vllm-project/aibrix) is a cloud-native control plane that integrates with vLLM to simplify Kubernetes deployment, scaling, routing, and LoRA adapter management for large language model inference.
For installation and usage instructions, please refer to the [AIBrix documentation](https://aibrix.readthedocs.io/).
# NVIDIA Dynamo
[NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo) is an open-source framework for distributed LLM inference that can run vLLM on Kubernetes with flexible serving architectures (e.g. aggregated/disaggregated, optional router/planner).
For Kubernetes deployment instructions and examples (including vLLM), see the [Deploying Dynamo on Kubernetes](https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/README.md) guide.
Background reading: InfoQ news coverage — [NVIDIA Dynamo simplifies Kubernetes deployment for LLM inference](https://www.infoq.com/news/2025/12/nvidia-dynamo-kubernetes/).
......@@ -5,6 +5,7 @@
Please see the Installation Guides for environment-specific instructions:
- [Any Kubernetes Cluster](https://www.kubeai.org/installation/any/)
- [AKS](https://www.kubeai.org/installation/aks/)
- [EKS](https://www.kubeai.org/installation/eks/)
- [GKE](https://www.kubeai.org/installation/gke/)
......