[Docs]Add documentation for bench serve visualization arguments (#40539)

Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Unverified Commit ff2c2bd8 authored Apr 24, 2026 by Sophie du Couédic Committed by GitHub Apr 23, 2026
5 changed files
--- a/docs/assets/contributing/vllm_bench_serve_dataset_stats.png
+++ b/docs/assets/contributing/vllm_bench_serve_dataset_stats.png
--- a/docs/assets/contributing/vllm_bench_serve_timeline.html
+++ b/docs/assets/contributing/vllm_bench_serve_timeline.html
--- a/docs/benchmarking/cli.md
+++ b/docs/benchmarking/cli.md
@@ -108,6 +108,38 @@ P99 ITL (ms):                            8.39
 ==================================================
 ```

+#### Results Visualization
+
+The `--plot-timeline` and `--plot-dataset-stats` flags generate, respectively, a request completion timeline and statistics on the dataset's prompt and output token counts, which can be useful for debugging or deeper analysis.
+
+```bash
+vllm bench serve \
+    --backend vllm \
+    --model meta-llama/Llama-3.1-8B-Instruct \
+    --endpoint /v1/completions \
+    --dataset-name sharegpt \
+    --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
+    --num-prompts 100 \
+    --plot-timeline \
+    --timeline-itl-thresholds 2,5 \
+    --plot-dataset-stats \
+    --save-result
+```
+
+##### Interactive Timeline
+
+The generated timeline is an interactive HTML file that can be opened in most browsers. To customize the ITL color thresholds, use the `--timeline-itl-thresholds` flag with two comma-separated values in milliseconds (default: `25,50`).
+
+Example output:
+
+<iframe src="../../assets/contributing/vllm_bench_serve_timeline.html" width="100%" height="600" frameborder="0"></iframe>
+
+##### Dataset Statistics
+
+The generated figure shows the distributions of input prompt and output token counts.
+
+Example output: ![Dataset Statistics](../assets/contributing/vllm_bench_serve_dataset_stats.png)
+
 #### Custom Dataset

 If the dataset you want to benchmark is not supported yet in vLLM, even then you can benchmark on it using `CustomDataset`. Your data needs to be in `.jsonl` format and needs to have "prompt" field per entry, e.g., data.jsonl
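
Review note on the `--timeline-itl-thresholds` flag documented in the Results Visualization section above: the help text describes three color groups (green, orange, red) split by two thresholds. A minimal sketch of that mapping follows. The function name `classify_itl` is hypothetical and not part of vLLM, and the handling of values exactly equal to a threshold is an assumption, since the diff does not show how the plotting code treats boundaries.

```python
def classify_itl(itl_ms: float, thresholds_ms: tuple[float, float] = (25.0, 50.0)) -> str:
    """Map one inter-token latency (ms) to a color group.

    Hypothetical helper for illustration only. Boundary handling is assumed:
    a value equal to either threshold is treated as "between thresholds".
    """
    low, high = thresholds_ms
    if itl_ms < low:
        return "green"   # below the first threshold
    if itl_ms <= high:
        return "orange"  # between the two thresholds (inclusive, by assumption)
    return "red"         # above the second threshold


# With the thresholds from the docs example (--timeline-itl-thresholds 2,5):
colors = [classify_itl(v, (2.0, 5.0)) for v in (1.2, 3.0, 8.4)]
print(colors)
```

With thresholds `2,5`, an ITL of 8.4 ms falls in the red group, while the same value would be green under the default `25,50`.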

--- a/pyproject.toml
+++ b/pyproject.toml
@@ -123,7 +123,8 @@ extend-exclude = ["tests/models/fixtures/*", "tests/prompts/*", "tests/tokenizer
    "benchmarks/sonnet.txt", "tests/lora/data/*", "build/*",
    "examples/pooling/token_embed/*", "tests/models/language/pooling/*",
    "vllm/third_party/*", "vllm/entrypoints/serve/instrumentator/static/*", "tests/entrypoints/openai/speech_to_text/test_transcription_validation.py",
-    "docs/governance/process.md", "tests/v1/engine/test_fast_incdec_prefix_err.py", ".git/*"]
+    "docs/governance/process.md", "docs/assets/contributing/vllm_bench_serve_timeline.html", 
+    "tests/v1/engine/test_fast_incdec_prefix_err.py", ".git/*"]
 ignore-hidden = false

 [tool.typos.default]

--- a/vllm/benchmarks/serve.py
+++ b/vllm/benchmarks/serve.py
@@ -1611,14 +1611,12 @@ def add_cli_args(parser: argparse.ArgumentParser):
    )
    parser.add_argument(
        "--timeline-itl-thresholds",
-        type=float,
-        nargs=2,
-        default=[25.0, 50.0],
-        metavar=("THRESHOLD1", "THRESHOLD2"),
+        type=str,
+        default="25,50",
        help="ITL thresholds in milliseconds for timeline plot coloring. "
-        "Specify two values to categorize inter-token latencies into three groups: "
-        "below first threshold (green), between thresholds (orange), "
-        "and above second threshold (red). Default: 25 50 (milliseconds).",
+        "Specify two comma-separated values to categorize inter-token "
+        "latencies into three groups: below first threshold (green), "
+        "between thresholds (orange), and above second threshold (red).",
    )
    parser.add_argument(
        "--plot-dataset-stats",
@@ -1637,6 +1635,19 @@ async def main_async(args: argparse.Namespace) -> dict[str, Any]:
    random.seed(args.seed)
    np.random.seed(args.seed)

+    # Validate timeline ITL thresholds
+    if args.plot_timeline:
+        try:
+            itl_thresholds = [
+                float(t.strip()) for t in args.timeline_itl_thresholds.split(",")
+            ]
+            if len(itl_thresholds) != 2:
+                raise ValueError(
+                    f"Expected 2 ITL threshold values, got {len(itl_thresholds)}"
+                )
+        except ValueError as e:
+            raise ValueError(f"Invalid --timeline-itl-thresholds format: {e}") from e
+
    # Validate ramp-up arguments
    if args.ramp_up_strategy is not None:
        if args.request_rate != float("inf"):
@@ -1906,7 +1917,9 @@ async def main_async(args: argparse.Namespace) -> dict[str, Any]:

                timeline_path = Path(file_name).with_suffix(".timeline.html")
                # Convert thresholds from milliseconds to seconds
-                itl_thresholds_sec = [t / 1000.0 for t in args.timeline_itl_thresholds]
+                itl_thresholds_sec = [
+                    float(t) / 1000.0 for t in args.timeline_itl_thresholds.split(",")
+                ]
                generate_timeline_plot(
                    per_request_data, timeline_path, itl_thresholds=itl_thresholds_sec
                )
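
Review note: the diff parses `args.timeline_itl_thresholds` twice, once for validation near the top of `main_async` and once again when converting to seconds for `generate_timeline_plot`. One possible way to keep the two in sync is a single helper that validates and converts in one place. The sketch below is not part of this commit; the name `parse_itl_thresholds` is hypothetical, and the error wording only approximates the messages in the diff.

```python
def parse_itl_thresholds(raw: str) -> list[float]:
    """Parse "A,B" given in milliseconds and return the two values in seconds.

    Hypothetical helper, not vLLM API. Raises ValueError if the string does
    not contain exactly two comma-separated numbers.
    """
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 2:
        raise ValueError(
            "Invalid --timeline-itl-thresholds format: "
            f"expected 2 ITL threshold values, got {len(parts)}"
        )
    try:
        values_ms = [float(p) for p in parts]
    except ValueError as e:
        raise ValueError(f"Invalid --timeline-itl-thresholds format: {e}") from e
    # Convert thresholds from milliseconds to seconds
    return [v / 1000.0 for v in values_ms]


print(parse_itl_thresholds("25,50"))
```

Calling the helper once and reusing its result would let the validation step and the plotting step share the same parsed values.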