Merge tag 'v0.12.0' into v0.12.0-dev

41199996 · zhuwenwen · 31021d81 · 4fd9d6a8 · 41199996 · 41199996
Commit 41199996 authored Dec 13, 2025 by zhuwenwen
20 changed files
--- a/docs/assets/deployment/hf-inference-endpoints-choose-infra.png
+++ b/docs/assets/deployment/hf-inference-endpoints-choose-infra.png
--- a/docs/assets/deployment/hf-inference-endpoints-click-deploy-button.png
+++ b/docs/assets/deployment/hf-inference-endpoints-click-deploy-button.png
--- a/docs/assets/deployment/hf-inference-endpoints-configure-container.png
+++ b/docs/assets/deployment/hf-inference-endpoints-configure-container.png
--- a/docs/assets/deployment/hf-inference-endpoints-create-endpoint.png
+++ b/docs/assets/deployment/hf-inference-endpoints-create-endpoint.png
--- a/docs/assets/deployment/hf-inference-endpoints-locate-deploy-button.png
+++ b/docs/assets/deployment/hf-inference-endpoints-locate-deploy-button.png
--- a/docs/assets/deployment/hf-inference-endpoints-new-endpoint.png
+++ b/docs/assets/deployment/hf-inference-endpoints-new-endpoint.png
--- a/docs/assets/deployment/hf-inference-endpoints-select-hardware.png
+++ b/docs/assets/deployment/hf-inference-endpoints-select-hardware.png
--- a/docs/assets/deployment/hf-inference-endpoints-select-model.png
+++ b/docs/assets/deployment/hf-inference-endpoints-select-model.png
--- a/docs/assets/design/cuda_graphs/current_design.png
+++ b/docs/assets/design/cuda_graphs/current_design.png
--- a/docs/assets/design/cuda_graphs/executor_runtime.png
+++ b/docs/assets/design/cuda_graphs/executor_runtime.png
--- a/docs/assets/design/cuda_graphs/previous_design.png
+++ b/docs/assets/design/cuda_graphs/previous_design.png
--- a/docs/assets/design/cuda_graphs/wrapper_flow.png
+++ b/docs/assets/design/cuda_graphs/wrapper_flow.png
--- a/docs/assets/design/debug_vllm_compile/design_diagram.png
+++ b/docs/assets/design/debug_vllm_compile/design_diagram.png
--- a/docs/assets/design/debug_vllm_compile/dynamic_shapes.png
+++ b/docs/assets/design/debug_vllm_compile/dynamic_shapes.png
--- a/docs/assets/design/debug_vllm_compile/tlparse_inductor.png
+++ b/docs/assets/design/debug_vllm_compile/tlparse_inductor.png
--- a/docs/assets/features/disagg_encoder/disagg_encoder_flow.png
+++ b/docs/assets/features/disagg_encoder/disagg_encoder_flow.png
--- a/docs/benchmarking/README.md
+++ b/docs/benchmarking/README.md
+# Benchmark Suites
+
+vLLM provides comprehensive benchmarking tools for performance testing and evaluation:
+
+- **[Benchmark CLI](./cli.md)**: `vllm bench` CLI tools and specialized benchmark scripts for interactive performance testing.
+- **[Parameter Sweeps](./sweeps.md)**: Automate `vllm bench` runs for multiple configurations, useful for [optimization and tuning](../configuration/optimization.md).
+- **[Performance Dashboard](./dashboard.md)**: Automated CI that publishes benchmarks on each commit.
--- a/docs/contributing/benchmarks.md
+++ b/docs/contributing/benchmarks.md
----
-toc_depth: 4
----
+# Benchmark CLI

-# Benchmark Suites
+This section guides you through running benchmark tests with the extensive datasets supported on vLLM.

-vLLM provides comprehensive benchmarking tools for performance testing and evaluation:
+It's a living document, updated as new features and datasets become available.

-- **[Benchmark CLI]**: `vllm bench` CLI tools and specialized benchmark scripts for interactive performance testing
-- **[Performance benchmarks][performance-benchmarks]**: Automated CI benchmarks for development
-- **[Nightly benchmarks][nightly-benchmarks]**: Comparative benchmarks against alternatives
-
-[Benchmark CLI]: #benchmark-cli
-
-## Benchmark CLI
-
-This section guides you through running benchmark tests with the extensive
-datasets supported on vLLM. It's a living document, updated as new features and datasets
-become available.
-
-### Dataset Overview
+## Dataset Overview

 <style>
 th {
@@ -29,12 +15,13 @@ th {
 | Dataset | Online | Offline | Data Path |
 |---------|--------|---------|-----------|
 | ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
-| ShareGPT4V (Image) | ✅ | ✅ | `wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:<br>`wget http://images.cocodataset.org/zips/train2017.zip` |
+| ShareGPT4V (Image) | ✅ | ✅ | `wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:<br>`wget http://images.cocodataset.org/zips/train2017.zip` |
 | ShareGPT4Video (Video) | ✅ | ✅ | `git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video` |
 | BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
 | Sonnet (deprecated) | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
 | Random | ✅ | ✅ | `synthetic` |
 | RandomMultiModal (Image/Video) | 🟡 | 🚧 | `synthetic` |
+| RandomForReranking | ✅ | ✅ | `synthetic` |
 | Prefix Repetition | ✅ | ✅ | `synthetic` |
 | HuggingFace-VisionArena | ✅ | ✅ | `lmarena-ai/VisionArena-Chat` |
 | HuggingFace-MMVU | ✅ | ✅ | `yale-nlp/MMVU` |
@@ -60,20 +47,20 @@ Legend:
    --dataset-path /datasets/VisionArena-Chat/ --hf-name lmarena-ai/VisionArena-Chat
    ```

-### Examples
+## Examples

-#### 🚀 Online Benchmark
+### 🚀 Online Benchmark

 <details class="admonition abstract" markdown="1">
 <summary>Show more</summary>

-First start serving your model
+First start serving your model:

 ```bash
 vllm serve NousResearch/Hermes-3-Llama-3.1-8B
 ```

-Then run the benchmarking script
+Then run the benchmarking script:

 ```bash
 # download dataset
@@ -87,7 +74,7 @@ vllm bench serve \
  --num-prompts 10
 ```

-If successful, you will see the following output
+If successful, you will see the following output:

 ```text
 ============ Serving Benchmark Result ============
@@ -113,7 +100,7 @@ P99 ITL (ms):                            8.39
 ==================================================
 ```

-##### Custom Dataset
+#### Custom Dataset

 If the dataset you want to benchmark is not yet supported in vLLM, you can still benchmark it using `CustomDataset`. Your data must be in `.jsonl` format with a "prompt" field in each entry, e.g., data.jsonl

@@ -125,7 +112,7 @@ If the dataset you want to benchmark is not supported yet in vLLM, even then you

 ```bash
 # start server
-VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
+vllm serve meta-llama/Llama-3.1-8B-Instruct
 ```

 ```bash
@@ -146,7 +133,7 @@ vllm bench serve --port 9001 --save-result --save-detailed \

 You can skip applying chat template if your data already has it by using `--custom-skip-chat-template`.
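As a minimal sketch of the `.jsonl` layout described above (one JSON object per line, each with a "prompt" field), a data file could be produced as follows. The helper name `write_prompts_jsonl` and the sample prompts are hypothetical and only serve as illustration; they are not part of vLLM.

```python
import json
import tempfile
from pathlib import Path


def write_prompts_jsonl(path, prompts):
    """Write one JSON object per line, each with a single "prompt" field."""
    with Path(path).open("w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"prompt": prompt}, ensure_ascii=False) + "\n")
    return len(prompts)


# Hypothetical sample prompts, written to a temporary directory.
out_path = Path(tempfile.mkdtemp()) / "data.jsonl"
count = write_prompts_jsonl(
    out_path,
    ["What is the capital of France?", "Summarize the plot of Hamlet."],
)
print(f"wrote {count} prompts to {out_path}")
```

The resulting file path is what you would pass to the benchmark as the dataset path.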

-##### VisionArena Benchmark for Vision Language Models
+#### VisionArena Benchmark for Vision Language Models

 ```bash
 # need a model with vision capability here
@@ -164,10 +151,10 @@ vllm bench serve \
  --num-prompts 1000
 ```

-##### InstructCoder Benchmark with Speculative Decoding
+#### InstructCoder Benchmark with Speculative Decoding

 ``` bash
-VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
+vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-config $'{"method": "ngram",
    "num_speculative_tokens": 5, "prompt_lookup_max": 5,
    "prompt_lookup_min": 2}'
@@ -181,10 +168,10 @@ vllm bench serve \
    --num-prompts 2048
 ```

-##### Spec Bench Benchmark with Speculative Decoding
+#### Spec Bench Benchmark with Speculative Decoding

 ``` bash
-VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
+vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-config $'{"method": "ngram",
    "num_speculative_tokens": 5, "prompt_lookup_max": 5,
    "prompt_lookup_min": 2}'
@@ -218,7 +205,7 @@ vllm bench serve \
    --spec-bench-category "summarization"
 ```

-##### Other HuggingFaceDataset Examples
+#### Other HuggingFaceDataset Examples

 ```bash
 vllm serve Qwen/Qwen2-VL-7B-Instruct
@@ -284,7 +271,7 @@ vllm bench serve \
    --blazedit-max-distance 0.99
 ```

-##### Running With Sampling Parameters
+#### Running With Sampling Parameters

 When using OpenAI-compatible backends such as `vllm`, optional sampling
 parameters can be specified. Example client command:
@@ -302,7 +289,7 @@ vllm bench serve \
  --num-prompts 10
 ```

-##### Running With Ramp-Up Request Rate
+#### Running With Ramp-Up Request Rate

 The benchmark tool also supports ramping up the request rate over the
 duration of the benchmark run. This can be useful for stress testing the
@@ -319,9 +306,76 @@ The following arguments can be used to control the ramp-up:
 - `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
 - `--ramp-up-end-rps`: The request rate at the end of the benchmark.

+#### Load Pattern Configuration
+
+vLLM's benchmark serving script provides sophisticated load pattern simulation capabilities through three key parameters that control request generation and concurrency behavior:
+
+##### Load Pattern Control Parameters
+
+- `--request-rate`: Controls the target request generation rate (requests per second). Set to `inf` for maximum throughput testing or finite values for controlled load simulation.
+- `--burstiness`: Controls traffic variability using a Gamma distribution (range: > 0). Lower values create bursty traffic, higher values create uniform traffic.
+- `--max-concurrency`: Limits concurrent outstanding requests. If this argument is not provided, concurrency is unlimited. Set a value to simulate backpressure.
+
+These parameters work together to create realistic load patterns with carefully chosen defaults. The `--request-rate` parameter defaults to `inf` (infinite), which sends all requests immediately for maximum throughput testing. When set to finite values, it uses either a Poisson process (default `--burstiness=1.0`) or Gamma distribution for realistic request timing. The `--burstiness` parameter only takes effect when `--request-rate` is not infinite - a value of 1.0 creates natural Poisson traffic, while lower values (0.1-0.5) create bursty patterns and higher values (2.0-5.0) create uniform spacing. The `--max-concurrency` parameter defaults to `None` (unlimited) but can be set to simulate real-world constraints where a load balancer or API gateway limits concurrent connections. When combined, these parameters allow you to simulate everything from unrestricted stress testing (`--request-rate=inf`) to production-like scenarios with realistic arrival patterns and resource constraints.
+
+The `--burstiness` parameter mathematically controls request arrival patterns using a Gamma distribution where:
+
+- Shape parameter: `burstiness` value
+- Coefficient of Variation (CV): $\frac{1}{\sqrt{\text{burstiness}}}$
+- Traffic characteristics:
+    - `burstiness = 0.1`: Highly bursty traffic (CV ≈ 3.16) - stress testing
+    - `burstiness = 1.0`: Natural Poisson traffic (CV = 1.0) - realistic simulation  
+    - `burstiness = 5.0`: Uniform traffic (CV ≈ 0.45) - controlled load testing
+
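To make the relationship between `burstiness` and traffic variability concrete, here is a minimal sketch using only the Python standard library. It assumes inter-arrival gaps are drawn from a Gamma distribution with shape `burstiness` and scale `1 / (request_rate * burstiness)`, so that the mean gap stays at `1 / request_rate`; the scale choice and the helper names are assumptions for illustration, not a copy of the benchmark script.

```python
import math
import random


def coefficient_of_variation(burstiness: float) -> float:
    """CV of a Gamma distribution with shape `burstiness`: 1 / sqrt(shape)."""
    if burstiness <= 0:
        raise ValueError("burstiness must be > 0")
    return 1.0 / math.sqrt(burstiness)


def sample_intervals(request_rate: float, burstiness: float, n: int, seed: int = 0):
    """Draw n inter-arrival gaps (seconds) whose mean is 1 / request_rate."""
    rng = random.Random(seed)
    # shape * scale = 1 / request_rate (assumed parameterization)
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]


for b in (0.1, 1.0, 5.0):
    gaps = sample_intervals(request_rate=10.0, burstiness=b, n=20000)
    mean_gap = sum(gaps) / len(gaps)
    print(f"burstiness={b}: CV={coefficient_of_variation(b):.2f}, mean gap={mean_gap:.3f}s")
```

Under this assumption the mean request rate is the same for every `burstiness` value; only the spread of the gaps changes.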
+![Load Pattern Examples](../assets/contributing/load-pattern-examples.png)
+
+*Figure: Load pattern examples for each use case. Top row: Request arrival timelines showing cumulative requests over time. Bottom row: Inter-arrival time distributions showing traffic variability patterns. Each column represents a different use case with its specific parameter settings and resulting traffic characteristics.*
+
+Load Pattern Recommendations by Use Case:
+
+| Use Case           | Burstiness   | Request Rate    | Max Concurrency | Description                                               |
+| ---                | ---          | ---             | ---             | ---                                                       |
+| Maximum Throughput | N/A          | Infinite        | Limited         | **Most common**: Simulates load balancer/gateway limits with unlimited user demand |
+| Realistic Testing  | 1.0          | Moderate (5-20) | Infinite        | Natural Poisson traffic patterns for baseline performance |
+| Stress Testing     | 0.1-0.5      | High (20-100)   | Infinite        | Challenging burst patterns to test resilience             |
+| Latency Profiling  | 2.0-5.0      | Low (1-10)      | Infinite        | Uniform load for consistent timing analysis               |
+| Capacity Planning  | 1.0          | Variable        | Limited         | Test resource limits with realistic constraints           |
+| SLA Validation     | 1.0          | Target rate     | SLA limit       | Production-like constraints for compliance testing        |
+
+These load patterns help evaluate different aspects of your vLLM deployment, from basic performance characteristics to resilience under challenging traffic conditions.
+
+The **Maximum Throughput** pattern (`--request-rate=inf --max-concurrency=<limit>`) is the most commonly used configuration for production benchmarking. This simulates real-world deployment architectures where:
+
+- Users send requests as fast as they can (infinite rate)
+- A load balancer or API gateway controls the maximum concurrent connections
+- The system operates at its concurrency limit, revealing true throughput capacity
+- `--burstiness` has no effect since request timing is not controlled when rate is infinite
+
+This pattern helps determine optimal concurrency settings for your production load balancer configuration.
+
+To effectively configure load patterns, especially for **Capacity Planning** and **SLA Validation** use cases, you need to understand your system's resource limits. During startup, vLLM reports KV cache configuration that directly impacts your load testing parameters:
+
+```text
+GPU KV cache size: 15,728,640 tokens
+Maximum concurrency for 8,192 tokens per request: 1920
+```
+
+Where:
+
+- GPU KV cache size: Total tokens that can be cached across all concurrent requests
+- Maximum concurrency: Theoretical maximum concurrent requests for the given `max_model_len`
+- Calculation: `max_concurrency = kv_cache_size / max_model_len`
+
+Using KV cache metrics for load pattern configuration:
+
+- For Capacity Planning: Set `--max-concurrency` to 80-90% of the reported maximum to test realistic resource constraints
+- For SLA Validation: Use the reported maximum as your SLA limit to ensure compliance testing matches production capacity
+- For Realistic Testing: Monitor memory usage when approaching theoretical limits to understand sustainable request rates
+- Request rate guidance: Use the KV cache size to estimate sustainable request rates for your specific workload and sequence lengths
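The calculation and the 80-90% guidance above can be sketched as a short worked example. The function names are hypothetical; the input values come from the sample startup log shown earlier.

```python
def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> int:
    """Theoretical number of concurrent requests that fit in the KV cache."""
    return kv_cache_tokens // max_model_len


def capacity_planning_range(limit: int, low_pct: int = 80, high_pct: int = 90):
    """Suggested --max-concurrency range as a percentage of the reported maximum."""
    return limit * low_pct // 100, limit * high_pct // 100


limit = max_concurrency(kv_cache_tokens=15_728_640, max_model_len=8_192)
print(limit)                           # 1920
print(capacity_planning_range(limit))  # (1536, 1728)
```

For the sample log, a Capacity Planning run would therefore use a `--max-concurrency` value between 1536 and 1728.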
+
 </details>

-#### 📈 Offline Throughput Benchmark
+### 📈 Offline Throughput Benchmark

 <details class="admonition abstract" markdown="1">
 <summary>Show more</summary>
@@ -342,7 +396,7 @@ Total num prompt tokens:  5014
 Total num output tokens:  1500
 ```

-##### VisionArena Benchmark for Vision Language Models
+#### VisionArena Benchmark for Vision Language Models

 ```bash
 vllm bench throughput \
@@ -362,11 +416,10 @@ Total num prompt tokens:  14527
 Total num output tokens:  1280
 ```

-##### InstructCoder Benchmark with Speculative Decoding
+#### InstructCoder Benchmark with Speculative Decoding

 ``` bash
 VLLM_WORKER_MULTIPROC_METHOD=spawn \
-VLLM_USE_V1=1 \
 vllm bench throughput \
    --dataset-name=hf \
    --dataset-path=likaixin/InstructCoder \
@@ -386,7 +439,7 @@ Total num prompt tokens:  261136
 Total num output tokens:  204800
 ```

-##### Other HuggingFaceDataset Examples
+#### Other HuggingFaceDataset Examples

 `lmms-lab/LLaVA-OneVision-Data`:

@@ -444,20 +497,20 @@ vllm bench throughput \

 </details>

-#### 🛠️ Structured Output Benchmark
+### 🛠️ Structured Output Benchmark

 <details class="admonition abstract" markdown="1">
 <summary>Show more</summary>

 Benchmark the performance of structured output generation (JSON, grammar, regex).

-##### Server Setup
+#### Server Setup

 ```bash
 vllm serve NousResearch/Hermes-3-Llama-3.1-8B
 ```

-##### JSON Schema Benchmark
+#### JSON Schema Benchmark

 ```bash
 python3 benchmarks/benchmark_serving_structured_output.py \
@@ -469,7 +522,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
  --num-prompts 1000
 ```

-##### Grammar-based Generation Benchmark
+#### Grammar-based Generation Benchmark

 ```bash
 python3 benchmarks/benchmark_serving_structured_output.py \
@@ -481,7 +534,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
  --num-prompts 1000
 ```

-##### Regex-based Generation Benchmark
+#### Regex-based Generation Benchmark

 ```bash
 python3 benchmarks/benchmark_serving_structured_output.py \
@@ -492,7 +545,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
  --num-prompts 1000
 ```

-##### Choice-based Generation Benchmark
+#### Choice-based Generation Benchmark

 ```bash
 python3 benchmarks/benchmark_serving_structured_output.py \
@@ -503,7 +556,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
  --num-prompts 1000
 ```

-##### XGrammar Benchmark Dataset
+#### XGrammar Benchmark Dataset

 ```bash
 python3 benchmarks/benchmark_serving_structured_output.py \
@@ -516,14 +569,14 @@ python3 benchmarks/benchmark_serving_structured_output.py \

 </details>

-#### 📚 Long Document QA Benchmark
+### 📚 Long Document QA Benchmark

 <details class="admonition abstract" markdown="1">
 <summary>Show more</summary>

 Benchmark the performance of long document question-answering with prefix caching.

-##### Basic Long Document QA Test
+#### Basic Long Document QA Test

 ```bash
 python3 benchmarks/benchmark_long_document_qa_throughput.py \
@@ -535,7 +588,7 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \
  --repeat-count 5
 ```

-##### Different Repeat Modes
+#### Different Repeat Modes

 ```bash
 # Random mode (default) - shuffle prompts randomly
@@ -568,14 +621,14 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \

 </details>

-#### 🗂️ Prefix Caching Benchmark
+### 🗂️ Prefix Caching Benchmark

 <details class="admonition abstract" markdown="1">
 <summary>Show more</summary>

 Benchmark the efficiency of automatic prefix caching.

-##### Fixed Prompt with Prefix Caching
+#### Fixed Prompt with Prefix Caching

 ```bash
 python3 benchmarks/benchmark_prefix_caching.py \
@@ -586,7 +639,7 @@ python3 benchmarks/benchmark_prefix_caching.py \
  --input-length-range 128:256
 ```

-##### ShareGPT Dataset with Prefix Caching
+#### ShareGPT Dataset with Prefix Caching

 ```bash
 # download dataset
@@ -617,14 +670,14 @@ vllm bench serve \

 </details>

-#### ⚡ Request Prioritization Benchmark
+### ⚡ Request Prioritization Benchmark

 <details class="admonition abstract" markdown="1">
 <summary>Show more</summary>

 Benchmark the performance of request prioritization in vLLM.

-##### Basic Prioritization Test
+#### Basic Prioritization Test

 ```bash
 python3 benchmarks/benchmark_prioritization.py \
@@ -635,7 +688,7 @@ python3 benchmarks/benchmark_prioritization.py \
  --scheduling-policy priority
 ```

-##### Multiple Sequences per Prompt
+#### Multiple Sequences per Prompt

 ```bash
 python3 benchmarks/benchmark_prioritization.py \
@@ -649,20 +702,19 @@ python3 benchmarks/benchmark_prioritization.py \

 </details>

-#### 👁️ Multi-Modal Benchmark
+### 👁️ Multi-Modal Benchmark

 <details class="admonition abstract" markdown="1">
 <summary>Show more</summary>

 Benchmark the performance of multi-modal requests in vLLM.

-##### Images (ShareGPT4V)
+#### Images (ShareGPT4V)

 Start vLLM:

 ```bash
-python -m vllm.entrypoints.openai.api_server \
-  --model Qwen/Qwen2.5-VL-7B-Instruct \
+vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --dtype bfloat16 \
  --limit-mm-per-prompt '{"image": 1}' \
  --allowed-local-media-path /path/to/sharegpt4v/images
@@ -683,13 +735,12 @@ vllm bench serve \
  --endpoint /v1/chat/completions
 ```

-##### Videos (ShareGPT4Video)
+#### Videos (ShareGPT4Video)

 Start vLLM:

 ```bash
-python -m vllm.entrypoints.openai.api_server \
-  --model Qwen/Qwen2.5-VL-7B-Instruct \
+vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --dtype bfloat16 \
  --limit-mm-per-prompt '{"video": 1}' \
  --allowed-local-media-path /path/to/sharegpt4video/videos
@@ -710,13 +761,13 @@ vllm bench serve \
  --endpoint /v1/chat/completions
 ```

-##### Synthetic Random Images (random-mm)
+#### Synthetic Random Images (random-mm)

 Generate synthetic image inputs alongside random text prompts to stress-test vision models without external datasets.

 Notes:

-- Works only with online benchmark via the OpenAI  backend (`--backend openai-chat`) and endpoint `/v1/chat/completions`.
+- Works only with online benchmark via the OpenAI backend (`--backend openai-chat`) and endpoint `/v1/chat/completions`.
 - Video sampling is not yet implemented.

 Start the server (example):
@@ -783,52 +834,145 @@ This should be seen as an edge case, and if this behavior can be avoided by sett

 </details>

-[](){ #performance-benchmarks }
+### Embedding Benchmark
+
+Benchmark the performance of embedding requests in vLLM.
+
+<details class="admonition abstract" markdown="1">
+<summary>Show more</summary>
+
+#### Text Embeddings
+
+Unlike generative models which use Completions API or Chat Completions API,
+you should set `--backend openai-embeddings` and `--endpoint /v1/embeddings` to use the Embeddings API.
+
+You can use any text dataset to benchmark the model, such as ShareGPT.
+
+Start the server:
+
+```bash
+vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
+```
+
+Run the benchmark:
+
+```bash
+# download dataset
+# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+vllm bench serve \
+  --model jinaai/jina-embeddings-v3 \
+  --backend openai-embeddings \
+  --endpoint /v1/embeddings \
+  --dataset-name sharegpt \
+  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
+```

-## Performance Benchmarks
+#### Multi-modal Embeddings

-The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.
+Unlike generative models which use Completions API or Chat Completions API,
+you should set `--endpoint /v1/embeddings` to use the Embeddings API. The backend to use depends on the model:

-### Manually Trigger the benchmark
+- CLIP: `--backend openai-embeddings-clip`
+- VLM2Vec: `--backend openai-embeddings-vlm2vec`

-Use [vllm-ci-test-repo images](https://gallery.ecr.aws/q9t5s3a7/vllm-ci-test-repo) with vLLM benchmark suite.
-For CPU environment, please use the image with "-cpu" postfix.
+For other models, please add your own implementation inside [vllm/benchmarks/lib/endpoint_request_func.py](../../vllm/benchmarks/lib/endpoint_request_func.py) to match the expected instruction format.

-Here is an example for docker run command for CPU.
+You can use any text or multi-modal dataset to benchmark the model, as long as the model supports it.
+For example, you can use ShareGPT and VisionArena to benchmark vision-language embeddings.
+
+Serve and benchmark CLIP:

 ```bash
-docker run -it --entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface  -e HF_TOKEN=''  --shm-size=16g --name vllm-cpu-ci  public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:1da94e673c257373280026f75ceb4effac80e892-cpu
+# Run this in another process
+vllm serve openai/clip-vit-base-patch32
+
+# Run these one by one after the server is up
+# download dataset
+# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+vllm bench serve \
+  --model openai/clip-vit-base-patch32 \
+  --backend openai-embeddings-clip \
+  --endpoint /v1/embeddings \
+  --dataset-name sharegpt \
+  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
+
+vllm bench serve \
+  --model openai/clip-vit-base-patch32 \
+  --backend openai-embeddings-clip \
+  --endpoint /v1/embeddings \
+  --dataset-name hf \
+  --dataset-path lmarena-ai/VisionArena-Chat
 ```

-Then, run below command inside the docker instance.
+Serve and benchmark VLM2Vec:

 ```bash
-bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
+# Run this in another process
+vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
+  --trust-remote-code \
+  --chat-template examples/template_vlm2vec_phi3v.jinja
+
+# Run these one by one after the server is up
+# download dataset
+# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+vllm bench serve \
+  --model TIGER-Lab/VLM2Vec-Full \
+  --backend openai-embeddings-vlm2vec \
+  --endpoint /v1/embeddings \
+  --dataset-name sharegpt \
+  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
+
+vllm bench serve \
+  --model TIGER-Lab/VLM2Vec-Full \
+  --backend openai-embeddings-vlm2vec \
+  --endpoint /v1/embeddings \
+  --dataset-name hf \
+  --dataset-path lmarena-ai/VisionArena-Chat
 ```

-When run, benchmark script generates results under **benchmark/results** folder, along with the benchmark_results.md and benchmark_results.json.
+</details>

-#### Runtime environment variables
+### Reranker Benchmark

-- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
-- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
-- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
-- `THROUGHPUT_JSON`: JSON file to use for the throughout tests. Default value is empty string (use default file).
-- `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
-- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
+Benchmark the performance of rerank requests in vLLM.

-For more results visualization, check the [visualizing the results](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md#visualizing-the-results).
+<details class="admonition abstract" markdown="1">
+<summary>Show more</summary>

-The latest performance results are hosted on the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
+Unlike generative models which use Completions API or Chat Completions API,
+you should set `--backend vllm-rerank` and `--endpoint /v1/rerank` to use the Reranker API.

-More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
+For reranking, the only supported dataset is `--dataset-name random-rerank`.

-[](){ #nightly-benchmarks }
+Start the server:

-## Nightly Benchmarks
+```bash
+vllm serve BAAI/bge-reranker-v2-m3
+```

-These compare vLLM's performance against alternatives (`tgi`, `trt-llm`, and `lmdeploy`) when there are major updates of vLLM (e.g., bumping up to a new version). They are primarily intended for consumers to evaluate when to choose vLLM over other options and are triggered on every commit with both the `perf-benchmarks` and `nightly-benchmarks` labels.
+Run the benchmark:

-The latest nightly benchmark results are shared in major release blog posts such as [vLLM v0.6.0](https://blog.vllm.ai/2024/09/05/perf-update.html).
+```bash
+vllm bench serve \
+  --model BAAI/bge-reranker-v2-m3 \
+  --backend vllm-rerank \
+  --endpoint /v1/rerank \
+  --dataset-name random-rerank \
+  --tokenizer BAAI/bge-reranker-v2-m3 \
+  --random-input-len 512 \
+  --num-prompts 10 \
+  --random-batch-size 5
+```
+
+For reranker models, this will create `num_prompts / random_batch_size` requests with
+`random_batch_size` "documents" where each one has close to `random_input_len` tokens.
+In the example above, this results in 2 rerank requests with 5 "documents" each where
+each document has close to 512 tokens.

-More information on the nightly benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/nightly-descriptions.md).
+Note that `/v1/rerank` is also supported by embedding models. If you are running
+an embedding model, also set `--no_reranker`. In this case the server treats the query
+as an individual prompt, so the benchmark sends `random_batch_size - 1` documents
+to account for the extra prompt (the query). The token accounting used to report
+throughput numbers is adjusted accordingly.
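The request arithmetic described above can be illustrated with a small sketch. The function `rerank_request_plan` is hypothetical, and the use of floor division when `num_prompts` is not a multiple of `random_batch_size` is an assumption; the text only covers the evenly divisible case.

```python
def rerank_request_plan(num_prompts: int, random_batch_size: int, is_reranker: bool = True):
    """Return (number of rerank requests, documents per request)."""
    # Assumption: any remainder is dropped when the division is not exact.
    num_requests = num_prompts // random_batch_size
    # With an embedding model (--no_reranker), the query counts as one prompt,
    # so one fewer document is sent per request.
    docs_per_request = random_batch_size if is_reranker else random_batch_size - 1
    return num_requests, docs_per_request


print(rerank_request_plan(10, 5))                     # (2, 5)
print(rerank_request_plan(10, 5, is_reranker=False))  # (2, 4)
```

The first call matches the example command: 2 rerank requests with 5 documents each.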
+
+</details>
--- a/docs/benchmarking/dashboard.md
+++ b/docs/benchmarking/dashboard.md
+# Performance Dashboard
+
+The performance dashboard is used to confirm whether new changes improve/degrade performance under various workloads.
+It is updated by triggering benchmark runs on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.
+
+The results are automatically published to the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
+
+## Manually Trigger the benchmark
+
+Use [vllm-ci-test-repo images](https://gallery.ecr.aws/q9t5s3a7/vllm-ci-test-repo) with vLLM benchmark suite.
+For a CPU environment, use the image with the "-cpu" suffix.
+
+Here is an example `docker run` command for CPU:
+
+```bash
+docker run -it --entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface  -e HF_TOKEN=''  --shm-size=16g --name vllm-cpu-ci  public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:1da94e673c257373280026f75ceb4effac80e892-cpu
+```
+
+Then run the following command inside the Docker container:
+
+```bash
+bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
+```
+
+When run, the benchmark script writes its results to the **benchmark/results** folder, along with `benchmark_results.md` and `benchmark_results.json`.
+
+### Runtime environment variables
+
+- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
+- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
+- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
+- `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file).
+- `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
+- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
+
+For more on visualizing the results, see [visualizing the results](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md#visualizing-the-results).
+
+More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](../../.buildkite/performance-benchmarks/performance-benchmarks-descriptions.md).
+
+## Continuous Benchmarking
+
+The continuous benchmarking provides automated performance monitoring for vLLM across different models and GPU devices. This helps track vLLM's performance characteristics over time and identify any performance regressions or improvements.
+
+### How It Works
+
+The continuous benchmarking is triggered via a [GitHub workflow CI](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) in the PyTorch infrastructure repository, which runs automatically every 4 hours. The workflow executes three types of performance tests:
+
+- **Serving tests**: Measure request handling and API performance
+- **Throughput tests**: Evaluate token generation rates
+- **Latency tests**: Assess response time characteristics
+
+### Benchmark Configuration
+
+The benchmarking currently runs on a predefined set of models configured in the [vllm-benchmarks directory](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks). To add new models for benchmarking:
+
+1. Navigate to the appropriate GPU directory in the benchmarks configuration
+2. Add your model specifications to the corresponding configuration files
+3. The new models will be included in the next scheduled benchmark run
--- a/docs/benchmarking/sweeps.md
+++ b/docs/benchmarking/sweeps.md
+# Parameter Sweeps
+
+## Online Benchmark
+
+### Basic
+
+`vllm bench sweep serve` automatically starts `vllm serve` and runs `vllm bench serve` to evaluate vLLM over multiple configurations.
+
+Follow these steps to run the script:
+
+1. Construct the base command to `vllm serve`, and pass it to the `--serve-cmd` option.
+2. Construct the base command to `vllm bench serve`, and pass it to the `--bench-cmd` option.
+3. (Optional) If you would like to vary the settings of `vllm serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--serve-params`.
+
+    - Example: Tuning `--max-num-seqs` and `--max-num-batched-tokens`:
+
+    ```json
+    [
+        {
+            "max_num_seqs": 32,
+            "max_num_batched_tokens": 1024
+        },
+        {
+            "max_num_seqs": 64,
+            "max_num_batched_tokens": 1024
+        },
+        {
+            "max_num_seqs": 64,
+            "max_num_batched_tokens": 2048
+        },
+        {
+            "max_num_seqs": 128,
+            "max_num_batched_tokens": 2048
+        },
+        {
+            "max_num_seqs": 128,
+            "max_num_batched_tokens": 4096
+        },
+        {
+            "max_num_seqs": 256,
+            "max_num_batched_tokens": 4096
+        }
+    ]
+    ```
+
+4. (Optional) If you would like to vary the settings of `vllm bench serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--bench-params`.
+
+    - Example: Using different input/output lengths for the random dataset:
+
+    ```json
+    [
+        {
+            "random_input_len": 128,
+            "random_output_len": 32
+        },
+        {
+            "random_input_len": 256,
+            "random_output_len": 64
+        },
+        {
+            "random_input_len": 512,
+            "random_output_len": 128
+        }
+    ]
+    ```
+
+5. Determine where you want to save the results, and pass that to `--output-dir`.
+
+Example command:
+
+```bash
+vllm bench sweep serve \
+    --serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
+    --bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
+    --serve-params benchmarks/serve_hparams.json \
+    --bench-params benchmarks/bench_hparams.json \
+    -o benchmarks/results
+```
+
+!!! important
+    If both `--serve-params` and `--bench-params` are passed, the script will iterate over the Cartesian product between them.
+    You can use `--dry-run` to preview the commands to be run.
+
+    We only start the server once for each `--serve-params`, and keep it running for multiple `--bench-params`.
+    Between each benchmark run, we call the `/reset_prefix_cache` and `/reset_mm_cache` endpoints to get a clean slate for the next run.
+    In case you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
+
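The iteration order described above can be sketched in a few lines of Python. This is only an illustration of the Cartesian product, not the script's actual implementation; `plan_runs` and the sample parameter lists are hypothetical names introduced here.

```python
# Hypothetical stand-ins for the parsed --serve-params and --bench-params files.
serve_params = [
    {"max_num_seqs": 32, "max_num_batched_tokens": 1024},
    {"max_num_seqs": 64, "max_num_batched_tokens": 2048},
]
bench_params = [
    {"random_input_len": 128, "random_output_len": 32},
    {"random_input_len": 256, "random_output_len": 64},
    {"random_input_len": 512, "random_output_len": 128},
]


def plan_runs(serve_params, bench_params):
    """List (server_index, serve, bench) tuples in a plausible visiting order."""
    plan = []
    # Outer loop: one server launch per serve configuration.
    for server_index, serve in enumerate(serve_params):
        # Inner loop: the same server is reused for every bench configuration.
        for bench in bench_params:
            plan.append((server_index, serve, bench))
    return plan


plan = plan_runs(serve_params, bench_params)
for server_index, serve, bench in plan:
    print(server_index, serve, bench)
```

With 2 serve configurations and 3 bench configurations, the plan contains 6 entries, and each server index appears in one contiguous group.
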
+!!! note
+    By default, each parameter combination is run 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.
+
+!!! tip
+    You can use the `--resume` option to continue the parameter sweep if one of the runs failed.
+
+### SLA auto-tuner
+
+`vllm bench sweep serve_sla` is a wrapper over `vllm bench sweep serve` that tunes either the request rate or concurrency (choose using `--sla-variable`) in order to satisfy the SLA constraints given by `--sla-params`.
+
+For example, to ensure E2E latency within different target values for 99% of requests:
+
+```json
+[
+    {
+        "p99_e2el_ms": "<=200"
+    },
+    {
+        "p99_e2el_ms": "<=500"
+    },
+    {
+        "p99_e2el_ms": "<=1000"
+    },
+    {
+        "p99_e2el_ms": "<=2000"
+    }
+]
+```
+
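As a rough illustration of how a constraint string such as `"<=200"` could be checked against measured metrics, consider the sketch below. The helper names (`parse_constraint`, `satisfies`) and the set of accepted operators are assumptions made for this example; only `<=` appears in the SLA file above, and the real parser may differ.

```python
import operator
import re

# Assumed operator set for illustration; the source example only uses "<=".
_OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt, ">": operator.gt}


def parse_constraint(expr):
    """Split a string like '<=200' into a comparison function and a target."""
    match = re.fullmatch(r"\s*(<=|>=|<|>)\s*([0-9]+(?:\.[0-9]+)?)\s*", expr)
    if match is None:
        raise ValueError(f"Unsupported constraint: {expr!r}")
    op, target = match.groups()
    return _OPS[op], float(target)


def satisfies(metrics, sla):
    """Return True if every metric named in `sla` meets its constraint."""
    for name, expr in sla.items():
        compare, target = parse_constraint(expr)
        if not compare(metrics[name], target):
            return False
    return True


# Hypothetical measured value for one benchmark run.
print(satisfies({"p99_e2el_ms": 180.0}, {"p99_e2el_ms": "<=200"}))
```
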
+Example command:
+
+```bash
+vllm bench sweep serve_sla \
+    --serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
+    --bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
+    --serve-params benchmarks/serve_hparams.json \
+    --bench-params benchmarks/bench_hparams.json \
+    --sla-params benchmarks/sla_hparams.json \
+    --sla-variable max_concurrency \
+    -o benchmarks/results
+```
+
+The algorithm for adjusting the SLA variable is as follows:
+
+1. Run the benchmark with infinite QPS, and use the corresponding metrics to determine the initial value of the variable.
+    - For example, the initial request rate is set to the concurrency under infinite QPS.
+2. If the SLA is still satisfied, keep doubling the value until the SLA is no longer satisfied. This gives a relatively narrow window that contains the point where the SLA is barely satisfied.
+3. Apply binary search over the window to find the maximum value that still satisfies the SLA.
+
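The doubling and binary search steps above can be sketched as follows. This is a simplified model under stated assumptions, not the tuner's actual code: `tune_sla_variable` is a hypothetical name, the variable is treated as an integer (as for `max_concurrency`), the SLA check is assumed to be monotonic, and the behavior when the initial value already violates the SLA (searching downward toward 0) is a guess, since the steps above do not cover that case.

```python
def tune_sla_variable(satisfies_sla, initial_value, max_value=1 << 20):
    """Return the largest integer value for which satisfies_sla(value) is True.

    Assumes satisfies_sla is True up to some threshold and False beyond it.
    Returns 0 if no positive value satisfies the SLA.
    """
    lo = 0  # largest value known to satisfy the SLA (0 means none yet)
    hi = max(1, initial_value)  # current candidate

    # Step 2: keep doubling while the SLA is still satisfied.
    while satisfies_sla(hi):
        lo = hi
        if hi >= max_value:
            return lo  # safety cap so the loop always terminates
        hi = min(hi * 2, max_value)

    # Step 3: binary search inside the window (lo, hi).
    # Invariant: lo satisfies the SLA (or is 0), hi does not.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if satisfies_sla(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Toy stand-in for running a benchmark: the SLA holds up to a concurrency of 37.
print(tune_sla_variable(lambda value: value <= 37, initial_value=4))
```

In a real sweep each call to `satisfies_sla` would correspond to a full benchmark run, which is why narrowing the window by doubling before the binary search keeps the number of runs small.
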
+!!! important
+    SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`.
+
+    For a given combination of `--serve-params` and `--bench-params`, we share the benchmark results across `--sla-params` to avoid rerunning benchmarks with the same SLA variable value.
+
+## Visualization
+
+### Basic
+
+`vllm bench sweep plot` can be used to plot performance curves from parameter sweep results.
+
+Example command:
+
+```bash
+vllm bench sweep plot benchmarks/results/<timestamp> \
+    --var-x max_concurrency \
+    --row-by random_input_len \
+    --col-by random_output_len \
+    --curve-by api_server_count,max_num_batched_tokens \
+    --filter-by 'max_concurrency<=1024'
+```
+
+!!! tip
+    You can use `--dry-run` to preview the figures to be plotted.
+
+### Pareto chart
+
+`vllm bench sweep plot_pareto` helps pick configurations that balance per-user and per-GPU throughput.
+
+Higher concurrency or batch size can raise per-GPU efficiency but can add per-user latency; lower concurrency improves the per-user rate but underutilizes GPUs. The Pareto frontier shows the best achievable pairs across your runs.
+
+- x-axis: tokens/s/user = `output_throughput` ÷ concurrency (`--user-count-var`, default `max_concurrency`, fallback `max_concurrent_requests`).
+- y-axis: tokens/s/GPU = `output_throughput` ÷ GPU count (`--gpu-count-var` if set; otherwise `gpu_count` is TP×PP×DP).
+- Output: a single figure at `OUTPUT_DIR/pareto/PARETO.png`.
+- Labels: use `--label-by` to show the configuration used for each data point (default: `max_concurrency,gpu_count`).
+
+Example:
+
+```bash
+vllm bench sweep plot_pareto benchmarks/results/<timestamp> \
+  --label-by max_concurrency,tensor_parallel_size,pipeline_parallel_size
+```
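
To make the two axes and the frontier concrete, here is a minimal sketch that computes tokens/s/user and tokens/s/GPU for a few made-up runs and keeps the non-dominated ones. It is not the implementation behind `plot_pareto`; the function names, the `data_parallel_size` key, and all numbers are assumptions for illustration.

```python
def to_point(run):
    """Convert one run record into the two Pareto axes."""
    gpu_count = (
        run.get("tensor_parallel_size", 1)
        * run.get("pipeline_parallel_size", 1)
        * run.get("data_parallel_size", 1)  # assumed key name for DP
    )
    return {
        "run": run,
        "x": run["output_throughput"] / run["max_concurrency"],  # tokens/s/user
        "y": run["output_throughput"] / gpu_count,  # tokens/s/GPU
    }


def pareto_frontier(points):
    """Keep points that no other point beats on both axes (higher is better)."""
    frontier = []
    best_y = float("-inf")
    # Visit points from highest x to lowest; a point survives only if its y
    # exceeds every y seen so far, i.e. nothing with a larger x dominates it.
    for point in sorted(points, key=lambda p: (-p["x"], -p["y"])):
        if point["y"] > best_y:
            frontier.append(point)
            best_y = point["y"]
    return frontier


# Hypothetical sweep results.
runs = [
    {"max_concurrency": 8, "output_throughput": 800.0, "tensor_parallel_size": 1},
    {"max_concurrency": 32, "output_throughput": 2400.0, "tensor_parallel_size": 1},
    {"max_concurrency": 64, "output_throughput": 3200.0, "tensor_parallel_size": 2},
    {"max_concurrency": 128, "output_throughput": 4000.0, "tensor_parallel_size": 1},
]

frontier = pareto_frontier([to_point(run) for run in runs])
for point in frontier:
    print(point["run"]["max_concurrency"], point["x"], point["y"])
```

In this made-up data the run with `max_concurrency` 64 drops out, because the run with `max_concurrency` 32 is better on both axes.
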