refactor: remove benchmark shim, use AIPerf directly (#7074)

Signed-off-by: Ben Hamm <ben.hamm@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
Co-authored-by: Saravana Periyasamy <saperiyasamy@nvidia.com>

Commit 419e936a (unverified), authored Mar 09, 2026 by Ben Hamm, committed by GitHub Mar 10, 2026
10 changed files
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -15,7 +15,7 @@
 # Benchmarks
-This directory contains benchmarking scripts and tools for performance evaluation of Dynamo deployments. The benchmarking framework is a wrapper around aiperf that makes it easy to benchmark DynamoGraphDeployments or other deployments with exposed endpoints.
+This directory contains benchmarking tools and scripts for Dynamo deployments. Benchmarking uses [AIPerf](https://github.com/ai-dynamo/aiperf) directly — a comprehensive tool for measuring generative AI inference performance.
 ## Quick Start
@@ -26,49 +26,37 @@ First, deploy your DynamoGraphDeployment using the [deployment documentation](..
 # Port-forward your deployment to http://localhost:8000
 kubectl port-forward -n <namespace> svc/<frontend-service-name> 8000:8000 > /dev/null 2>&1 &
-# Run benchmark
-python3 -m benchmarks.utils.benchmark \
-    --benchmark-name my-benchmark \
-    --endpoint-url http://localhost:8000 \
-    --model "<your-model>"
-# Generate plots
-python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results
-# Or plot only specific benchmark experiments
-python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-name my-benchmark
-```
-## Features
-Benchmark any HTTP endpoints! The benchmarking framework supports:
-**Flexible Configuration:**
-- User-defined benchmark names using `--benchmark-name` flag
-- Support for single endpoint benchmarking with `--endpoint-url` flag
-- Customizable concurrency levels (configurable via CONCURRENCIES env var), sequence lengths, and models
-- Automated performance plot generation with custom benchmark names
-**Supported Backends:**
-- DynamoGraphDeployments with port-forwarded endpoints
-- External HTTP endpoints (for comparison with non-Dynamo backends or platforms)
-## Installation
-This is already included as part of the Dynamo container images. To install locally or standalone:
-```bash
-pip install -e .
+# Run a single benchmark
+aiperf profile \
+    --model <your-model> \
+    --url http://localhost:8000 \
+    --endpoint-type chat \
+    --streaming \
+    --concurrency 10 \
+    --request-count 100
+# Run a concurrency sweep for Pareto analysis
+for c in 1 2 5 10 50 100; do
+    aiperf profile \
+        --model <your-model> \
+        --url http://localhost:8000 \
+        --endpoint-type chat \
+        --streaming \
+        --concurrency $c \
+        --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \
+        --artifact-dir "artifacts/my-benchmark/c$c"
+done
+# Generate comparison plots
+aiperf plot artifacts/my-benchmark
 ```
-## Data Generation Tools
-This directory also includes lightweight tools for:
-- Analyzing prefix-structured data (`datagen analyze`)
-- Synthesizing structured data customizable for testing purposes (`datagen synthesize`)
-Detailed information is provided in the `prefix_data_generator` directory.
+## Directory Contents
+- **`incluster/`** — Kubernetes Job manifest for running benchmarks inside the cluster
+- **`router/`** — KV Router benchmarking scripts (prefix ratio, trace replay, agent, priority queue)
+- **`prefix_data_generator/`** — Tools for analyzing and synthesizing prefix-structured data
 ## Comprehensive Guide
-For detailed documentation, configuration options, and advanced usage, see the [complete benchmarking guide](../docs/benchmarks/benchmarking.md).
+For detailed documentation including server-side benchmarking, Pareto analysis, and advanced AIPerf features, see the [complete benchmarking guide](../docs/benchmarks/benchmarking.md).
--- a/benchmarks/incluster/benchmark_job.yaml
+++ b/benchmarks/incluster/benchmark_job.yaml
@@ -37,22 +37,29 @@ spec:
              secretKeyRef:
                name: hf-token-secret
                key: HF_TOKEN
-        command: ["python3", "-m", "benchmarks.utils.benchmark"]
+        command: ["/bin/bash", "-c"]
        args:
-          - --model
-          - "Qwen/Qwen3-0.6B"
-          - --isl
-          - "2000"
-          - --std
-          - "10"
-          - --osl
-          - "256"
-          - --output-dir
-          - /data/results
-          - --benchmark-name
-          - "qwen3-0p6b-vllm-agg"
-          - --endpoint-url
-          - "vllm-agg-frontend:8000"
+          - |
+            set -euo pipefail
+            MODEL="Qwen/Qwen3-0.6B"
+            URL="http://vllm-agg-frontend:8000"
+            OUTPUT_DIR="/data/results/qwen3-0p6b-vllm-agg"
+            for c in 1 2 5 10 50 100; do
+                echo "=== Concurrency $c ==="
+                aiperf profile \
+                    --model "$MODEL" \
+                    --url "$URL" \
+                    --endpoint-type chat \
+                    --streaming \
+                    --concurrency $c \
+                    --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \
+                    --synthetic-input-tokens-mean 2000 \
+                    --output-tokens-mean 256 \
+                    --artifact-dir "$OUTPUT_DIR/c$c" \
+                    --ui none
+            done
+            echo "=== Benchmark complete ==="
        volumeMounts:
          - name: data-volume
            mountPath: /data
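
Reviewer note: the Job's shell expression `$(( c * 3 > 10 ? c * 3 : 10 ))` appears to carry over the removed shim's auto rule, `max(concurrency * REQUEST_COUNT_SCALE_FACTOR, 10)` with a scale factor of 3. The sketch below tabulates what that rule yields for the six levels in the loop. The function and constant names are illustrative and are not part of the repository.

```python
from typing import Dict, Sequence

# Concurrency levels taken from the Job's for-loop.
SWEEP_LEVELS: Sequence[int] = (1, 2, 5, 10, 50, 100)


def request_count(concurrency: int, scale: int = 3, floor: int = 10) -> int:
    """Python equivalent of the shell ternary: at least `floor` requests,
    otherwise `scale` requests per concurrency slot."""
    return max(concurrency * scale, floor)


def sweep_plan(levels: Sequence[int] = SWEEP_LEVELS) -> Dict[int, int]:
    """Map each concurrency level to the request count the loop would pass."""
    return {c: request_count(c) for c in levels}


if __name__ == "__main__":
    plan = sweep_plan()
    for c, n in plan.items():
        print(f"concurrency={c:>3}  request_count={n}")
    print("total requests:", sum(plan.values()))
```

Under these assumptions the two lowest levels are floored at 10 requests each, and the whole sweep issues 515 requests.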

--- a/benchmarks/utils/__init__.py
+++ b/benchmarks/utils/__init__.py
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# Package marker for benchmarks utilities
--- a/benchmarks/utils/aiperf.py
+++ b/benchmarks/utils/aiperf.py
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-import os
-import subprocess
-from pathlib import Path
-from typing import List
-# Default concurrency levels - can be overridden with CONCURRENCIES environment variable
-DEFAULT_CONCURRENCIES: List[int] = [1, 2, 5, 10, 50, 100, 250]
-# Default request count per concurrency level - can be overridden with REQUEST_COUNT env var
-# When set to 0 or unset, defaults to max(concurrency * REQUEST_COUNT_SCALE_FACTOR, 10)
-# to ensure the concurrency level is fully utilized and each slot runs enough requests
-# for stable measurements
-DEFAULT_REQUEST_COUNT: int = 0
-REQUEST_COUNT_SCALE_FACTOR: int = 3
-def get_concurrency_levels() -> List[int]:
-    """Get concurrency levels from environment variable or use defaults"""
-    concurrencies_env = os.getenv("CONCURRENCIES")
-    if concurrencies_env:
-        try:
-            # Parse comma-separated values
-            concurrencies = [int(x.strip()) for x in concurrencies_env.split(",")]
-            # Validate all are positive integers
-            for c in concurrencies:
-                if c <= 0:
-                    raise ValueError(f"Concurrency level must be positive, got: {c}")
-            return sorted(concurrencies)
-        except ValueError as e:
-            print(f"WARNING: Invalid CONCURRENCIES environment variable: {e}")
-            print(f"Using default concurrency levels: {DEFAULT_CONCURRENCIES}")
-            return DEFAULT_CONCURRENCIES
-    return DEFAULT_CONCURRENCIES
-def get_request_count() -> int:
-    """Get request count from environment variable or use default.
-    Returns 0 to indicate 'auto' mode (will be computed per concurrency level).
-    """
-    request_count_env = os.getenv("REQUEST_COUNT")
-    if request_count_env:
-        try:
-            count = int(request_count_env.strip())
-            if count < 0:
-                raise ValueError(f"Request count must be non-negative, got: {count}")
-            return count
-        except ValueError as e:
-            print(f"WARNING: Invalid REQUEST_COUNT environment variable: {e}")
-            return DEFAULT_REQUEST_COUNT
-    return DEFAULT_REQUEST_COUNT
-CONCURRENCIES: List[int] = get_concurrency_levels()
-def run_aiperf(
-    service_url: str,
-    model_name: str,
-    isl: int,
-    osl: int,
-    stddev: int,
-    concurrency: int,
-    output_dir: Path,
-    request_count: int = 0,
-) -> None:
-    output_dir.mkdir(parents=True, exist_ok=True)
-    # Auto-compute request count: need enough requests to fully utilize concurrency
-    # and run each slot at least REQUEST_COUNT_SCALE_FACTOR times for stable measurements
-    if request_count <= 0:
-        request_count = max(concurrency * REQUEST_COUNT_SCALE_FACTOR, 10)
-    elif request_count < concurrency:
-        print(
-            f"WARNING: request_count ({request_count}) < concurrency ({concurrency}). "
-            f"Actual in-flight concurrency will be capped at {request_count}.",
-            flush=True,
-        )
-    cmd = [
-        "aiperf",
-        "profile",
-        "-m",
-        model_name,
-        "--endpoint-type",
-        "chat",
-        "--streaming",
-        "-u",
-        service_url,
-        "--synthetic-input-tokens-mean",
-        str(isl),
-        "--synthetic-input-tokens-stddev",
-        str(stddev),
-        "--concurrency",
-        str(concurrency),
-        "--request-count",
-        str(request_count),
-        "--output-tokens-mean",
-        str(osl),
-        "--extra-inputs",
-        f"max_tokens:{osl}",
-        "--extra-inputs",
-        f"min_tokens:{osl}",
-        "--extra-inputs",
-        "ignore_eos:true",
-        "--tokenizer",
-        model_name,
-        "--artifact-dir",
-        str(output_dir),
-    ]
-    print(
-        f"Running aiperf with isl {isl}, osl {osl}, concurrency {concurrency}, request_count {request_count}",
-        flush=True,
-    )
-    aip_process = subprocess.Popen(
-        cmd,
-        cwd=str(output_dir),
-        stdout=subprocess.PIPE,
-        stderr=subprocess.PIPE,
-        text=True,
-    )
-    stdout, stderr = aip_process.communicate()
-    if aip_process.returncode == 0:
-        print("Aiperf profiling completed successfully", flush=True)
-        if stdout:
-            print(stdout)
-    else:
-        print(f"Aiperf failed with error code: {aip_process.returncode}")
-        if stderr:
-            print(f"stderr: {stderr}")
-        raise subprocess.CalledProcessError(
-            aip_process.returncode, cmd, output=stdout, stderr=stderr
-        )
-def run_concurrency_sweep(
-    service_url: str, model_name: str, isl: int, osl: int, stddev: int, output_dir: Path
-) -> None:
-    concurrency_levels = get_concurrency_levels()
-    request_count = get_request_count()
-    print(
-        f"Running concurrency sweep for {model_name} with ISL {isl} and OSL {osl} and standard deviation {stddev}",
-        flush=True,
-    )
-    print(f"Concurrency levels: {concurrency_levels}", flush=True)
-    print(
-        f"Request count: {request_count if request_count > 0 else f'auto (max(concurrency*{REQUEST_COUNT_SCALE_FACTOR}, 10))'}",
-        flush=True,
-    )
-    for c in concurrency_levels:
-        print(f"Starting concurrency level {c}", flush=True)
-        run_aiperf(
-            service_url,
-            model_name,
-            isl,
-            osl,
-            stddev,
-            c,
-            output_dir / f"c{c}",
-            request_count=request_count,
-        )
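
Reviewer note: the removed `run_aiperf()` passed several flags that the new in-cluster loop does not. A small set comparison makes the difference explicit. Both flag sets are copied from this diff. Writing the shim's short options `-m` and `-u` as `--model` and `--url` is an assumption about AIPerf's aliases, and `flag_delta` is a hypothetical helper.

```python
from typing import List, Set, Tuple

# Flags used by the removed run_aiperf() command. The short options -m and -u
# are written as --model and --url; treating them as aliases is an assumption.
SHIM_FLAGS: Set[str] = {
    "--model", "--endpoint-type", "--streaming", "--url",
    "--synthetic-input-tokens-mean", "--synthetic-input-tokens-stddev",
    "--concurrency", "--request-count", "--output-tokens-mean",
    "--extra-inputs", "--tokenizer", "--artifact-dir",
}

# Flags used by the new loop in benchmark_job.yaml.
JOB_FLAGS: Set[str] = {
    "--model", "--url", "--endpoint-type", "--streaming",
    "--concurrency", "--request-count", "--synthetic-input-tokens-mean",
    "--output-tokens-mean", "--artifact-dir", "--ui",
}


def flag_delta(old: Set[str], new: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (flags only in old, flags only in new), each sorted."""
    return sorted(old - new), sorted(new - old)


if __name__ == "__main__":
    dropped, added = flag_delta(SHIM_FLAGS, JOB_FLAGS)
    print("no longer passed:", dropped)
    print("newly passed:", added)
```

The shim used `--extra-inputs` to pin `max_tokens` and `min_tokens` to the output length and to set `ignore_eos:true`. Without those inputs, output lengths in the new Job may vary more than before, so results might not be directly comparable with earlier runs.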
--- a/benchmarks/utils/benchmark.py
+++ b/benchmarks/utils/benchmark.py
-#!/usr/bin/env python3
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-import argparse
-import re
-import sys
-from urllib.parse import urlsplit
-from benchmarks.utils.workflow import has_http_scheme, run_benchmark_workflow
-from deploy.utils.kubernetes import is_running_in_cluster
-def validate_endpoint(endpoint: str) -> None:
-    """Validate that endpoint is HTTP endpoint or internal service URL when running in cluster"""
-    v = endpoint.strip()
-    if is_running_in_cluster():
-        # Allow HTTP(S) or internal service URLs like host[:port][/path]
-        if has_http_scheme(v):
-            pass
-        else:
-            parts = urlsplit(f"//{v}")
-            host_ok = bool(parts.hostname)
-            port_ok = parts.port is None or (1 <= parts.port <= 65535)
-            if not (host_ok and port_ok):
-                raise ValueError(
-                    f"Endpoint must be HTTP(S) or internal service URL. Got: {endpoint}"
-                )
-    else:
-        if not has_http_scheme(v):
-            raise ValueError(f"Endpoint must be HTTP endpoint. Got: {endpoint}")
-def validate_benchmark_name(name: str) -> None:
-    """Validate benchmark name"""
-    if not name.strip():
-        raise ValueError("Benchmark name cannot be empty")
-    name = name.strip()
-    # Validate name characters
-    if not re.match(r"^[a-zA-Z0-9_-]+$", name):
-        raise ValueError(f"Invalid benchmark name: {name}")
-    # Validate reserved names
-    if name.lower() == "plots":
-        raise ValueError("Benchmark name 'plots' is reserved")
-def main() -> int:
-    parser = argparse.ArgumentParser(description="Benchmark Orchestrator")
-    parser.add_argument(
-        "--benchmark-name",
-        required=True,
-        help="Name/label for this benchmark (used in plots and results)",
-    )
-    parser.add_argument(
-        "--endpoint-url",
-        required=True,
-        help="Endpoint to benchmark: HTTP(S) URL (e.g., http://localhost:8000) or in-cluster service URL host[:port]",
-    )
-    parser.add_argument("--isl", type=int, default=2000, help="Input sequence length")
-    parser.add_argument(
-        "--std",
-        type=int,
-        default=10,
-        help="Input sequence standard deviation",
-    )
-    parser.add_argument("--osl", type=int, default=256, help="Output sequence length")
-    parser.add_argument(
-        "--model",
-        default="Qwen/Qwen3-0.6B",
-        help="Model name (must match the model deployed at the endpoint)",
-    )
-    parser.add_argument(
-        "--output-dir", type=str, default="benchmarks/results", help="Output directory"
-    )
-    args = parser.parse_args()
-    # Validate inputs
-    try:
-        validate_benchmark_name(args.benchmark_name)
-        validate_endpoint(args.endpoint_url)
-    except ValueError as e:
-        print(f"ERROR: {e}")
-        return 1
-    # Run the benchmark workflow with the parsed inputs
-    run_benchmark_workflow(
-        inputs={args.benchmark_name: args.endpoint_url},
-        isl=args.isl,
-        std=args.std,
-        osl=args.osl,
-        model=args.model,
-        output_dir=args.output_dir,
-    )
-    return 0
-if __name__ == "__main__":
-    sys.exit(main())
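
Reviewer note: the removed `validate_endpoint` accepted scheme-less in-cluster endpoints such as `vllm-agg-frontend:8000`, and the shim's `normalize_service_url` later prefixed them with `http://`. With the shim gone, callers pass a full URL to `--url`, as the updated Job manifest does. For scripts that still receive scheme-less hosts, a standalone normalizer might look like the sketch below. `ensure_http_url` is a hypothetical helper, not code from the repository.

```python
from urllib.parse import urlsplit


def ensure_http_url(endpoint: str) -> str:
    """Return the endpoint with an http:// scheme if it has none.

    Raises ValueError when the host is missing or the port is invalid.
    """
    e = endpoint.strip()
    if e.lower().startswith(("http://", "https://")):
        return e
    # Prefix with // so urlsplit treats the input as host[:port][/path].
    parts = urlsplit(f"//{e}")
    if not parts.hostname:
        raise ValueError(f"Endpoint has no host: {endpoint!r}")
    _ = parts.port  # accessing .port raises ValueError for an invalid port
    return f"http://{e}"


if __name__ == "__main__":
    print(ensure_http_url("vllm-agg-frontend:8000"))
    print(ensure_http_url("http://localhost:8000"))
```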
--- a/benchmarks/utils/plot.py
+++ b/benchmarks/utils/plot.py
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-import json
-import re
-from pathlib import Path
-from typing import Dict, List, Optional, Tuple
-import matplotlib.pyplot as plt
-def parse_benchmark_results(result_dir: Path) -> List[Tuple[int, Dict]]:
-    """
-    Parse benchmark results from a deployment directory.
-    Args:
-        result_dir: Path to the result directory
-    Returns:
-        List of (concurrency_level, metrics_dict) tuples sorted by concurrency
-    """
-    results = []
-    # Find all concurrency directories (e.g., c1, c2, c5, c10, c50, c100, c250)
-    for concurrency_dir in result_dir.iterdir():
-        if not concurrency_dir.is_dir() or not concurrency_dir.name.startswith("c"):
-            continue
-        # Extract concurrency level from directory name
-        match = re.match(r"c(\d+)", concurrency_dir.name)
-        if not match:
-            continue
-        concurrency = int(match.group(1))
-        # Find the aiperf JSON file
-        aiperf_json = None
-        for json_file in concurrency_dir.rglob("profile_export_aiperf.json"):
-            aiperf_json = json_file
-            break
-        if aiperf_json and aiperf_json.exists():
-            try:
-                with open(aiperf_json, "r") as f:
-                    metrics = json.load(f)
-                results.append((concurrency, metrics))
-                print(f"Loaded metrics for concurrency {concurrency}")
-            except Exception as e:
-                print(f"Error loading {aiperf_json}: {e}")
-        else:
-            print(f"Warning: No aiperf JSON found for {concurrency_dir}")
-    # Sort by concurrency level
-    results.sort(key=lambda x: x[0])
-    return results
-def extract_metric_series(
-    results: List[Tuple[int, Dict]], metric_path: str, stat: str = "avg"
-) -> Tuple[List[int], List[float]]:
-    """
-    Extract a time series of a specific metric across concurrency levels.
-    Args:
-        results: List of (concurrency, metrics) tuples
-        metric_path: Dot-separated path to the metric (e.g., 'inter_token_latency')
-        stat: Statistic to extract ('avg', 'p50', 'p90', etc.)
-    Returns:
-        Tuple of (concurrency_levels, metric_values)
-    """
-    concurrencies = []
-    values = []
-    path_keys = metric_path.split(".")
-    for concurrency, metrics in results:
-        try:
-            node = metrics
-            for k in path_keys:
-                node = node[k]
-            value = node[stat]
-            concurrencies.append(concurrency)
-            values.append(float(value))
-        except (KeyError, TypeError):
-            print(
-                f"Warning: {metric_path}.{stat} not found for concurrency {concurrency}"
-            )
-            continue
-    return concurrencies, values
-def create_plot(
-    title: str,
-    xlabel: str,
-    ylabel: str,
-    data_series: List[Tuple[str, List[int], List[float]]],
-    output_path: Path,
-    log_scale_x: bool = False,
-    log_scale_y: bool = False,
-) -> None:
-    """
-    Create a line plot with multiple series.
-    Args:
-        title: Plot title
-        xlabel: X-axis label
-        ylabel: Y-axis label
-        data_series: List of (label, x_values, y_values) tuples
-        output_path: Path to save the plot
-        log_scale_x: Whether to use log scale for X axis
-        log_scale_y: Whether to use log scale for Y axis
-    """
-    plt.figure(figsize=(10, 6))
-    colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b"]
-    for i, (label, x_vals, y_vals) in enumerate(data_series):
-        if x_vals and y_vals:  # Only plot if we have data
-            plt.plot(
-                x_vals,
-                y_vals,
-                marker="o",
-                linewidth=2,
-                markersize=6,
-                color=colors[i % len(colors)],
-                label=label,
-            )
-    plt.title(title, fontsize=14, fontweight="bold")
-    plt.xlabel(xlabel, fontsize=12)
-    plt.ylabel(ylabel, fontsize=12)
-    plt.grid(True, alpha=0.3)
-    if log_scale_x:
-        plt.xscale("log")
-    if log_scale_y:
-        plt.yscale("log")
-    plt.legend()
-    plt.tight_layout()
-    plt.savefig(output_path, dpi=300, bbox_inches="tight")
-    plt.close()
-    print(f"Saved plot: {output_path}")
-def create_efficiency_plot(
-    deployment_results: Dict, plots_dir: Path, output_tokens: int = 200
-) -> None:
-    """
-    Create an efficiency plot showing tok/s/gpu vs tok/s/user with concurrency as labeled points.
-    Args:
-        deployment_results: Dict of deployment_type -> results
-        plots_dir: Directory to save plots
-        output_tokens: Average output tokens per request (default 200)
-    """
-    plt.figure(figsize=(12, 8))
-    # Support for up to 12 deployments in the plots
-    colors = [
-        "#1f77b4",
-        "#ff7f0e",
-        "#2ca02c",
-        "#d62728",
-        "#9467bd",
-        "#8c564b",
-        "#e377c2",
-        "#7f7f7f",
-        "#bcbd22",
-        "#17becf",
-        "#aec7e8",
-        "#ffbb78",
-    ]
-    markers = ["o", "s", "^", "D", "v", "<", ">", "p", "*", "h", "H", "+"]
-    for deployment_type, results in deployment_results.items():
-        tok_s_per_user = []
-        tok_s_per_gpu = []
-        concurrency_levels = []
-        for concurrency, metrics in results:
-            try:
-                # Get request throughput (requests/sec)
-                request_throughput = metrics["request_throughput"]["avg"]
-                # Calculate total tokens per second
-                total_tok_s = request_throughput * output_tokens
-                # Guard against zero concurrency and parameterize GPU count
-                if concurrency <= 0:
-                    continue
-                num_gpus = metrics.get("cluster", {}).get("num_gpus", 1)
-                tok_s_user = total_tok_s / concurrency
-                tok_s_gpu = total_tok_s / max(1, num_gpus)
-                tok_s_per_user.append(tok_s_user)
-                tok_s_per_gpu.append(tok_s_gpu)
-                concurrency_levels.append(concurrency)
-            except KeyError as e:
-                print(
-                    f"Warning: Missing metric for {deployment_type} concurrency {concurrency}: {e}"
-                )
-                continue
-        if tok_s_per_user and tok_s_per_gpu:
-            # Plot points
-            color_idx = list(deployment_results.keys()).index(deployment_type)
-            color = colors[color_idx % len(colors)]
-            marker = markers[color_idx % len(markers)]
-            plt.scatter(
-                tok_s_per_user,
-                tok_s_per_gpu,
-                c=color,
-                marker=marker,
-                s=120,
-                alpha=0.8,
-                label=deployment_type.title(),
-                edgecolors="black",
-                linewidth=1.5,
-            )
-            # Add concurrency labels
-            for i, (x, y, c) in enumerate(
-                zip(tok_s_per_user, tok_s_per_gpu, concurrency_levels)
-            ):
-                plt.annotate(
-                    f"{c}",
-                    (x, y),
-                    xytext=(8, 8),
-                    textcoords="offset points",
-                    fontsize=10,
-                    fontweight="bold",
-                    ha="left",
-                )
-    plt.title("GPU Efficiency vs User Experience", fontsize=14, fontweight="bold")
-    plt.xlabel("Tokens/sec per User", fontsize=12)
-    plt.ylabel("Tokens/sec per GPU", fontsize=12)
-    plt.grid(True, alpha=0.3)
-    # Add a note about what the numbers represent
-    plt.figtext(
-        0.02,
-        0.02,
-        "Note: Numbers on dots indicate concurrency level",
-        fontsize=10,
-        style="italic",
-        alpha=0.7,
-    )
-    plt.legend()
-    plt.tight_layout()
-    output_path = plots_dir / "efficiency_tok_s_gpu_vs_user.png"
-    plt.savefig(output_path, dpi=300, bbox_inches="tight")
-    plt.close()
-    print(f"Saved efficiency plot: {output_path}")
-def generate_plots(
-    base_output_dir: Path, output_dir: Path, benchmark_names: Optional[List[str]] = None
-) -> None:
-    """
-    Generate performance plots from benchmark results.
-    Args:
-        base_output_dir: Base directory containing benchmark results
-        output_dir: Directory to save plots
-        benchmark_names: Optional list of specific benchmark names to plot. If None, plots all subdirectories.
-    """
-    print(f"Generating plots from results in {base_output_dir}")
-    if not base_output_dir.exists():
-        print(f"Results directory does not exist: {base_output_dir}")
-        return
-    # Create plots directory
-    output_dir.mkdir(parents=True, exist_ok=True)
-    # Parse results for each deployment type
-    deployment_results = {}
-    # Find all subdirectories that contain benchmark results
-    names_set = set(benchmark_names) if benchmark_names is not None else None
-    for item in base_output_dir.iterdir():
-        if item.is_dir() and item.name != "plots":
-            deployment_type = item.name
-            # If benchmark_names is specified, only process those directories
-            if names_set is not None and deployment_type not in names_set:
-                print(f"Skipping {deployment_type} (not in specified benchmark names)")
-                continue
-            results = parse_benchmark_results(item)
-            if results:
-                deployment_results[deployment_type] = results
-                print(f"Found {len(results)} concurrency levels for {deployment_type}")
-            else:
-                print(f"No valid results found for {deployment_type}")
-    if not deployment_results:
-        if benchmark_names:
-            available = sorted(
-                [
-                    p.name
-                    for p in base_output_dir.iterdir()
-                    if p.is_dir() and p.name != "plots"
-                ]
-            )
-            missing = sorted([n for n in benchmark_names if n not in available])
-            print(f"No benchmark results found for specified names: {benchmark_names}")
-            if missing:
-                print(f"Missing (not found under {base_output_dir}): {missing}")
-            print(f"Available experiments: {available}")
-        else:
-            print("No benchmark results found to plot!")
-    # 1. P50 Inter-token Latency vs Concurrency
-    p50_data = []
-    for deployment_type, results in deployment_results.items():
-        concurrencies, latencies = extract_metric_series(
-            results, "inter_token_latency", "p50"
-        )
-        if concurrencies:
-            p50_data.append((deployment_type.title(), concurrencies, latencies))
-    create_plot(
-        title="P50 Inter-Token Latency vs Concurrency",
-        xlabel="Concurrency Level",
-        ylabel="P50 Inter-Token Latency (ms)",
-        data_series=p50_data,
-        output_path=output_dir / "p50_inter_token_latency_vs_concurrency.png",
-        log_scale_x=True,
-    )
-    # 2. Average Inter-token Latency vs Concurrency
-    avg_latency_data = []
-    for deployment_type, results in deployment_results.items():
-        concurrencies, latencies = extract_metric_series(
-            results, "inter_token_latency", "avg"
-        )
-        if concurrencies:
-            avg_latency_data.append((deployment_type.title(), concurrencies, latencies))
-    create_plot(
-        title="Average Inter-Token Latency vs Concurrency",
-        xlabel="Concurrency Level",
-        ylabel="Average Inter-Token Latency (ms)",
-        data_series=avg_latency_data,
-        output_path=output_dir / "avg_inter_token_latency_vs_concurrency.png",
-        log_scale_x=True,
-    )
-    # 3. Request Throughput vs Concurrency
-    throughput_data = []
-    for deployment_type, results in deployment_results.items():
-        concurrencies, throughputs = extract_metric_series(
-            results, "request_throughput", "avg"
-        )
-        if concurrencies:
-            throughput_data.append(
-                (deployment_type.title(), concurrencies, throughputs)
-            )
-    create_plot(
-        title="Request Throughput vs Concurrency",
-        xlabel="Concurrency Level",
-        ylabel="Request Throughput (req/s)",
-        data_series=throughput_data,
-        output_path=output_dir / "request_throughput_vs_concurrency.png",
-        log_scale_x=True,
-    )
-    # 4. Average Time to First Token vs Concurrency
-    ttft_data = []
-    for deployment_type, results in deployment_results.items():
-        concurrencies, ttfts = extract_metric_series(
-            results, "time_to_first_token", "avg"
-        )
-        if concurrencies:
-            ttft_data.append((deployment_type.title(), concurrencies, ttfts))
-    create_plot(
-        title="Average Time to First Token vs Concurrency",
-        xlabel="Concurrency Level",
-        ylabel="Average Time to First Token (ms)",
-        data_series=ttft_data,
-        output_path=output_dir / "avg_time_to_first_token_vs_concurrency.png",
-        log_scale_x=True,
-    )
-    # 5. Efficiency plot: tok/s/gpu vs tok/s/user
-    create_efficiency_plot(deployment_results, output_dir)
-    # Generate summary
-    summary_lines = [
-        "Benchmark Results Summary",
-        "=" * 30,
-        "",
-        f"Results directory: {base_output_dir}",
-        f"Plots generated: {output_dir}",
-        "",
-        "Deployment Types Found:",
-    ]
-    for deployment_type, results in deployment_results.items():
-        concurrency_levels = [r[0] for r in results]
-        summary_lines.append(
-            f"  {deployment_type}: {len(results)} concurrency levels ({min(concurrency_levels)}-{max(concurrency_levels)})"
-        )
-    summary_lines.extend(
-        [
-            "",
-            "Generated Plots:",
-            "  - p50_inter_token_latency_vs_concurrency.png",
-            "  - avg_inter_token_latency_vs_concurrency.png",
-            "  - request_throughput_vs_concurrency.png",
-            "  - avg_time_to_first_token_vs_concurrency.png",
-            "  - efficiency_tok_s_gpu_vs_user.png",
-        ]
-    )
-    summary_path = output_dir / "SUMMARY.txt"
-    summary_path.write_text("\n".join(summary_lines))
-    print(f"Generated summary: {summary_path}")
-    print(f"All plots saved to: {output_dir}")
-if __name__ == "__main__":
-    import argparse
-    parser = argparse.ArgumentParser(
-        description="Generate performance plots from benchmark results"
-    )
-    parser.add_argument(
-        "--data-dir", required=True, help="Directory containing benchmark results"
-    )
-    parser.add_argument(
-        "--output-dir", help="Output directory for plots (defaults to data-dir/plots)"
-    )
-    parser.add_argument(
-        "--benchmark-name",
-        action="append",
-        help="Specific benchmark experiment name to plot (can be specified multiple times). If not specified, plots all subdirectories.",
-    )
-    args = parser.parse_args()
-    data_dir = Path(args.data_dir)
-    benchmark_names = args.benchmark_name if args.benchmark_name else None
-    if args.output_dir:
-        # If output dir specified, use it as base and call generate_plots
-        output_dir = Path(args.output_dir)
-        output_dir.mkdir(parents=True, exist_ok=True)
-        generate_plots(data_dir, output_dir, benchmark_names)
-    else:
-        # Use data_dir as base output dir
-        generate_plots(data_dir, data_dir / "plots", benchmark_names)
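
Reviewer note: the removed efficiency plot derived its two axes from request throughput rather than from a measured token rate. For anyone reproducing the old plots from AIPerf artifacts, the arithmetic reduces to the sketch below. The function name is illustrative. The default of 200 output tokens is taken from the removed `create_efficiency_plot`, while the sweep itself requested an output length of 256.

```python
from typing import Tuple


def efficiency_point(
    request_throughput: float,
    concurrency: int,
    output_tokens: int = 200,
    num_gpus: int = 1,
) -> Tuple[float, float]:
    """Return (tokens/sec per user, tokens/sec per GPU) for one sweep level."""
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    # Total token rate is estimated as requests/sec times tokens per request.
    total_tok_s = request_throughput * output_tokens
    return total_tok_s / concurrency, total_tok_s / max(1, num_gpus)


if __name__ == "__main__":
    per_user, per_gpu = efficiency_point(2.0, concurrency=10, num_gpus=4)
    print(f"tok/s/user={per_user}, tok/s/gpu={per_gpu}")
```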
--- a/benchmarks/utils/workflow.py
+++ b/benchmarks/utils/workflow.py
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-from pathlib import Path
-from typing import Dict, List
-from benchmarks.utils.aiperf import run_concurrency_sweep
-from deploy.utils.kubernetes import is_running_in_cluster
-def has_http_scheme(url: str) -> bool:
-    """Check if URL has HTTP or HTTPS scheme."""
-    return url.lower().startswith(("http://", "https://"))
-def normalize_service_url(endpoint: str) -> str:
-    e = endpoint.strip()
-    if has_http_scheme(e):
-        return e
-    if is_running_in_cluster():
-        return f"http://{e}"
-    return e  # Outside cluster, validation will have ensured scheme is present
-def print_concurrency_start(
-    label: str, model: str, isl: int, osl: int, std: int
-) -> None:
-    """Print concurrency sweep start messages"""
-    print(f"⚙️  Starting {label} concurrency sweep!", flush=True)
-    print(
-        "⏱️  This may take several minutes - running through multiple concurrency levels...",
-        flush=True,
-    )
-    print(f"🎯 Model: {model} | ISL: {isl} | OSL: {osl} | StdDev: {std}")
-def run_endpoint_benchmark(
-    label: str,
-    endpoint: str,
-    model: str,
-    isl: int,
-    osl: int,
-    std: int,
-    output_dir: Path,
-) -> None:
-    """Run benchmark for an existing endpoint with custom label"""
-    # Normalize endpoint to a usable URL (handles in-cluster scheme-less inputs)
-    service_url = normalize_service_url(endpoint)
-    print(f"🚀 Starting benchmark of endpoint '{label}': {service_url}")
-    print(f"📁 Results will be saved to: {output_dir / label}")
-    print_concurrency_start(label, model, isl, osl, std)
-    # Create output directory
-    (output_dir / label).mkdir(parents=True, exist_ok=True)
-    run_concurrency_sweep(
-        service_url=service_url,
-        model_name=model,
-        isl=isl,
-        osl=osl,
-        stddev=std,
-        output_dir=output_dir / label,
-    )
-    print("✅ Endpoint benchmark completed successfully!")
-def print_final_summary(output_dir: Path, labels: List[str]) -> None:
-    """Print final benchmark summary"""
-    print("🎉 Benchmark workflow completed successfully!")
-    print(f"📁 All results available at: {output_dir}")
-    if labels:
-        print(f"🚀 Benchmarked: {', '.join(labels)}")
-def run_benchmark_workflow(
-    inputs: Dict[str, str],
-    isl: int = 2000,
-    std: int = 10,
-    osl: int = 256,
-    model: str = "Qwen/Qwen3-0.6B",
-    output_dir: str = "benchmarks/results",
-) -> None:
-    """Main benchmark workflow orchestrator for HTTP endpoints (and in-cluster internal service URLs)"""
-    output_dir_path = Path(output_dir)
-    output_dir_path.mkdir(parents=True, exist_ok=True)
-    # Run endpoint benchmarks
-    benchmarked_labels = []
-    for label, endpoint in inputs.items():
-        run_endpoint_benchmark(label, endpoint, model, isl, osl, std, output_dir_path)
-        benchmarked_labels.append(label)
-    # Generate final summary
-    print_final_summary(output_dir_path, benchmarked_labels)
--- a/deploy/utils/requirements.txt
+++ b/deploy/utils/requirements.txt
@@ -4,8 +4,6 @@
 # Kubernetes and async dependencies
 aiofiles>=0.8.0
-# Benchmarking dependencies for Dynamo
-genai-perf==0.0.15
 httpx>=0.24.0
 kubernetes-asyncio>=24.0.0

--- a/docs/assets/img/aiperf-pareto-frontier.png
+++ b/docs/assets/img/aiperf-pareto-frontier.png
--- a/docs/benchmarks/benchmarking.md
+++ b/docs/benchmarks/benchmarking.md
@@ -5,13 +5,15 @@ title: Dynamo Benchmarking
 subtitle: Benchmark and compare performance across Dynamo deployment configurations
 ---
-This benchmarking framework lets you compare performance across any combination of:
+This guide shows how to benchmark Dynamo deployments using [AIPerf](https://github.com/ai-dynamo/aiperf), a comprehensive tool for measuring generative AI inference performance. AIPerf provides detailed metrics, real-time dashboards, and automatic visualization — you call it directly against your endpoints.
+You can benchmark any combination of:
 - **DynamoGraphDeployments**
- **External HTTP endpoints** (existing services deployed following standard documentation from vLLM, llm-d, AIBrix, etc.)
+- **External HTTP endpoints** (vLLM, llm-d, AIBrix, etc.)
 ## Choosing Your Benchmarking Approach
-Dynamo provides two benchmarking approaches to suit different use cases: **client-side** and **server-side**. Client-side refers to running benchmarks on your local machine and connecting to Kubernetes deployments via port-forwarding, while server-side refers to running benchmarks directly within the Kubernetes cluster using internal service URLs. Which method to use depends on your use case.
+**Client-side** runs benchmarks on your local machine via port-forwarding. **Server-side** runs benchmarks directly within the Kubernetes cluster using internal service URLs.
 **TLDR:**
 Need high performance/load testing? Server-side.
@@ -32,7 +34,6 @@ Just quick testing/comparison? Client-side.
 - You want optimal network performance (no port-forwarding overhead)
 - You're running automated CI/CD pipelines
 - You need isolated execution environments
- You're doing resource-intensive benchmarking
 - You want persistent result storage in the cluster
 → **[Go to Server-Side Benchmarking (In-Cluster)](#server-side-benchmarking-in-cluster)**
@@ -49,18 +50,20 @@ Just quick testing/comparison? Client-side.
 | **Results** | Local filesystem | Persistent volumes |
 | **Best for** | Light load | High load |
-## What This Tool Does
+## AIPerf Overview
+[AIPerf](https://github.com/ai-dynamo/aiperf) is a standalone benchmarking tool available on [PyPI](https://pypi.org/project/aiperf/). It is pre-installed in Dynamo container images. Key features:
-The framework is a Python-based wrapper around `aiperf` that:
+- Measures latency, throughput, TTFT, inter-token latency, and more
- Benchmarks any HTTP endpoints
+- Multiple load modes: concurrency, request-rate, trace replay
- Runs concurrency sweeps across configurable load levels
+- Automatic visualization with `aiperf plot` (Pareto curves, time series, GPU telemetry)
- Generates comparison plots with your custom labels
+- Interactive dashboard mode for real-time exploration
- Works with any HuggingFace-compatible model on NVIDIA GPUs (H200, H100, A100, etc.)
+- Arrival patterns (Poisson, constant, gamma) for realistic traffic simulation
- Provides direct Python script execution for maximum flexibility
+- Warmup phases, gradual ramping, and multi-URL load balancing
-**Default sequence lengths**: Input: 2000 tokens, Output: 256 tokens (configurable with `--isl` and `--osl`)
+**Important**: The `--model` parameter must match the model deployed at the endpoint.
-**Important**: The `--model` parameter configures AIPerf for benchmarking and provides logging context. The default `--model` value in the benchmarking script is `Qwen/Qwen3-0.6B`, but it must match the model deployed at the endpoint(s).
+For full documentation, see the [AIPerf docs](https://github.com/ai-dynamo/aiperf/tree/main/docs).
 ---
@@ -70,314 +73,261 @@ Client-side benchmarking runs on your local machine and connects to Kubernetes d
 ## Prerequisites
-1. **Dynamo container environment** - You must be running inside a Dynamo container with the benchmarking tools pre-installed.
+1. **Dynamo container environment** - Run inside a Dynamo container, which has AIPerf pre-installed, or install AIPerf locally:
+   ```bash
+   pip install aiperf
+   ```
 2. **HTTP endpoints** - Ensure you have HTTP endpoints available for benchmarking. These can be:
   - DynamoGraphDeployments exposed via HTTP endpoints
   - External services (vLLM, llm-d, AIBrix, etc.)
-   - Any HTTP endpoint serving HuggingFace-compatible models
+   - Any HTTP endpoint serving OpenAI-compatible models
-3. **Benchmark dependencies** - Since benchmarks run locally, you need to install the required Python dependencies. Install them using:
-   ```bash
-   pip install -r deploy/utils/requirements.txt
-   ```
 ## User Workflow
-Follow these steps to benchmark Dynamo deployments using client-side benchmarking:
+### Step 1: Set Up Cluster and Deploy
-### Step 1: Establish Kubernetes Cluster and Install Dynamo
+Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform following the [installation guide](../kubernetes/installation-guide.md). Then deploy your DynamoGraphDeployments using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends).
-Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform. First follow the [installation guide](../kubernetes/installation-guide.md) to install Dynamo Kubernetes Platform, then use [deploy/utils/README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md) to set up benchmarking resources.
-### Step 2: Deploy DynamoGraphDeployments
+### Step 2: Port-Forward and Run a Single Benchmark
-Deploy your DynamoGraphDeployments separately using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Each deployment should have a frontend service exposed.
+> **Wait for model readiness.** Before benchmarking, ensure your deployment has fully loaded the model. Check pod logs or hit the health endpoint (`curl http://localhost:8000/health`) — it should return `200 OK` before you proceed.
-### Step 3: Port-Forward and Benchmark Deployment A
 ```bash
-# Port-forward the frontend service for deployment A
+# Port-forward the frontend service
 kubectl port-forward -n <namespace> svc/<frontend-service-name> 8000:8000 > /dev/null 2>&1 &
-# Note: remember to stop the port-forward process after benchmarking.
-# Benchmark deployment A using Python scripts
-python3 -m benchmarks.utils.benchmark \
-   --benchmark-name deployment-a \
-   --endpoint-url http://localhost:8000 \
-   --model "your-model-name" \
-   --output-dir ./benchmarks/results
-```
-### Step 4: [If Comparative] Teardown Deployment A and Establish Deployment B
-If comparing multiple deployments, teardown deployment A and deploy deployment B with a different configuration.
-### Step 5: [If Comparative] Port-Forward and Benchmark Deployment B
+# Run a single benchmark
-```bash
+aiperf profile \
-# Port-forward the frontend service for deployment B
+    --model <your-model-name> \
-kubectl port-forward -n <namespace> svc/<frontend-service-name> 8001:8000 > /dev/null 2>&1 &
+    --url http://localhost:8000 \
+    --endpoint-type chat \
-# Benchmark deployment B using Python scripts
+    --streaming \
-python3 -m benchmarks.utils.benchmark \
+    --concurrency 10 \
-   --benchmark-name deployment-b \
+    --request-count 100 \
-   --endpoint-url http://localhost:8001 \
+    --synthetic-input-tokens-mean 2000 \
-   --model "your-model-name" \
+    --output-tokens-mean 256
-   --output-dir ./benchmarks/results
 ```
-### Step 6: Generate Summary and Visualization
+This produces results in `artifacts/` and prints a summary table to the console:
-```bash
-# Generate plots and summary using Python plotting script
-python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results
-# Or plot only specific benchmark experiments
+```text
-python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-name experiment-a --benchmark-name experiment-b
+                                NVIDIA AIPerf | LLM Metrics
+┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
+┃              Metric ┃     avg ┃     min ┃     max ┃     p99 ┃     p90 ┃     p50 ┃     std ┃
+┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
+│ Time to First Token │  234.56 │  189.23 │  298.45 │  289.34 │  267.12 │  231.12 │   28.45 │
+│                (ms) │         │         │         │         │         │         │         │
+│     Request Latency │ 1234.56 │  987.34 │ 1567.89 │ 1534.23 │ 1456.78 │ 1223.45 │  156.78 │
+│                (ms) │         │         │         │         │         │         │         │
+│ Inter Token Latency │   15.67 │   12.34 │   19.45 │   19.01 │   18.23 │   15.45 │    1.89 │
+│                (ms) │         │         │         │         │         │         │         │
+│  Request Throughput │   31.45 │     N/A │     N/A │     N/A │     N/A │     N/A │     N/A │
+│      (requests/sec) │         │         │         │         │         │         │         │
+└─────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
 ```
-## Use Cases
+*Actual numbers will vary based on model size, hardware, batch size, and network conditions. Client-side benchmarks include port-forwarding overhead — use [server-side benchmarking](#server-side-benchmarking-in-cluster) for accurate performance measurement.*
-The benchmarking framework supports various comparative analysis scenarios:
- **Compare multiple DynamoGraphDeployments of a single backend** (e.g., aggregated vs disaggregated configurations)
+To stop the port-forward when done: `kill %1` (or `kill <PID>`).
- **Compare different backends** (e.g., SGLang vs TensorRT-LLM vs vLLM)
- **Compare Dynamo vs other platforms** (e.g., Dynamo vs llm-d vs AIBrix)
- **Compare different models** (e.g., Llama-3-8B vs Llama-3-70B vs Qwen-3-0.6B)
- **Compare different hardware configurations** (e.g., H100 vs A100 vs H200)
- **Compare different parallelization strategies** (e.g., different GPU counts or memory configurations)
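The readiness check described in Step 2 can also be scripted. The following is a minimal sketch, not part of AIPerf or Dynamo: the helper names `http_status` and `wait_until_ready` and the retry defaults are hypothetical. It assumes only what the guide states, namely that the `/health` endpoint returns `200` once the model is loaded.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 503 while loading).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_until_ready(
    url: str,
    probe: Callable[[str], int] = http_status,
    attempts: int = 30,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until `probe` reports 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Calling `wait_until_ready("http://localhost:8000/health")` before `aiperf profile` blocks until the port-forwarded endpoint answers with `200`, or returns `False` after the attempts are exhausted. The `probe` and `sleep` parameters are injectable so the loop can be exercised without a live deployment.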
-## Configuration and Usage
+### Step 3: Concurrency Sweep for Pareto Analysis
-### Command Line Options
+To understand how your deployment behaves across load levels, run a concurrency sweep. Each concurrency level `c` sends `max(c*3, 10)` requests, so every concurrency slot handles enough requests for stable measurements:
 ```bash
-python3 -m benchmarks.utils.benchmark --benchmark-name <name> --endpoint-url <endpoint_url> [OPTIONS]
+MODEL="<your-model-name>"
+URL="http://localhost:8000"
-REQUIRED:
-  --benchmark-name NAME           Name/label for this benchmark (used in plots and results)
+for c in 1 2 5 10 50 100; do
-  --endpoint-url URL              HTTP endpoint URL to benchmark (e.g., http://localhost:8000)
+    aiperf profile \
+        --model "$MODEL" \
-OPTIONS:
+        --url "$URL" \
-  -h, --help                    Show help message and examples
+        --endpoint-type chat \
-  -m, --model MODEL             Model name for AIPerf configuration and logging (default: Qwen/Qwen3-0.6B)
+        --streaming \
-                                NOTE: This must match the model deployed at the endpoint
+        --concurrency $c \
-  -i, --isl LENGTH              Input sequence length (default: 2000)
+        --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \
-  -s, --std STDDEV              Input sequence standard deviation (default: 10)
+        --synthetic-input-tokens-mean 2000 \
-  -o, --osl LENGTH              Output sequence length (default: 256)
+        --output-tokens-mean 256 \
-  -d, --output-dir DIR          Output directory (default: ./benchmarks/results)
+        --artifact-dir "artifacts/deployment-a/c$c"
-  --verbose                     Enable verbose output
+done
 ```
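For readers who prefer to drive the sweep from Python, here is a hedged sketch of the same plan. The helpers `request_count` and `profile_command` are hypothetical names; the flags are the ones used in the shell loop above, and the defaults of 2000 input and 256 output tokens mirror that loop.

```python
from typing import List


def request_count(concurrency: int) -> int:
    """Requests per level: three per concurrency slot, with a floor of 10."""
    return max(concurrency * 3, 10)


def profile_command(model: str, url: str, concurrency: int, artifact_root: str) -> List[str]:
    """Build the argv for one `aiperf profile` run at the given concurrency."""
    return [
        "aiperf", "profile",
        "--model", model,
        "--url", url,
        "--endpoint-type", "chat",
        "--streaming",
        "--concurrency", str(concurrency),
        "--request-count", str(request_count(concurrency)),
        "--synthetic-input-tokens-mean", "2000",
        "--output-tokens-mean", "256",
        "--artifact-dir", f"{artifact_root}/c{concurrency}",
    ]


# Same levels as the shell loop.
LEVELS = [1, 2, 5, 10, 50, 100]
```

For the levels above, `request_count` yields 10, 10, 15, 30, 150, and 300. Each command list can be passed to `subprocess.run(cmd, check=True)` to execute the runs one after another.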
-### Important Notes
+**Note**: Adjust concurrency levels to match your deployment's capacity. Very high concurrency on a small deployment (e.g., c250 on a single GPU) can cause server errors. Start with lower values and increase until you find the saturation point.
- **Benchmark Name**: The benchmark name becomes the label in plots and results
+### Step 4: [If Comparative] Benchmark a Second Deployment
- **Name Restrictions**: Names can only contain letters, numbers, hyphens, and underscores. The name `plots` is reserved.
- **Port-Forwarding**: You must have an exposed endpoint before benchmarking
- **Model Parameter**: The `--model` parameter configures AIPerf for testing and logging, and must match the model deployed at the endpoint
- **Sequential Benchmarking**: For comparative benchmarks, deploy and benchmark each configuration separately
-### What Happens During Benchmarking
+Tear down deployment A and deploy deployment B with a different configuration. Kill the previous port-forward (`kill %1`), then start a new one and repeat the sweep, reusing `$MODEL` and `$URL` from Step 3:
-The Python benchmarking module:
+```bash
-1. **Connects** to your port-forwarded endpoint
+kubectl port-forward -n <namespace> svc/<frontend-service-b> 8000:8000 > /dev/null 2>&1 &
-2. **Benchmarks** using AIPerf at various concurrency levels (default: 1, 2, 5, 10, 50, 100, 250)
-3. **Measures** key metrics: latency, throughput, time-to-first-token
+for c in 1 2 5 10 50 100; do
-4. **Saves** results to an output directory organized by benchmark name
+    aiperf profile \
+        --model "$MODEL" \
-The Python plotting module:
+        --url "$URL" \
-1. **Generates** comparison plots using your benchmark name in `<OUTPUT_DIR>/plots/`
+        --endpoint-type chat \
-2. **Creates** summary statistics and visualizations
+        --streaming \
+        --concurrency $c \
-### Plotting Options
+        --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \
+        --synthetic-input-tokens-mean 2000 \
+        --output-tokens-mean 256 \
+        --artifact-dir "artifacts/deployment-b/c$c"
+done
+```
-The plotting script supports several options for customizing which experiments to visualize:
+### Step 5: Generate Visualizations
 ```bash
-# Plot all benchmark experiments in the data directory
+# Compare all runs — auto-detects multi-run directories
-python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results
+aiperf plot artifacts/deployment-a artifacts/deployment-b
-# Plot only specific benchmark experiments
+# Or compare all subdirectories under a parent
-python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-name experiment-a --benchmark-name experiment-b
+aiperf plot artifacts/
-# Specify custom output directory for plots
+# Launch interactive dashboard for exploration
-python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --output-dir ./custom-plots
+aiperf plot artifacts/ --dashboard
 ```
-**Available Options:**
+AIPerf automatically generates plots based on available data:
- `--data-dir`: Directory containing benchmark results (required)
+- **TTFT vs Throughput** — find the sweet spot between responsiveness and capacity (always generated for multi-run comparisons)
- `--benchmark-name`: Specific benchmark experiment name to plot (can be specified multiple times). Names must match subdirectory names under the data dir.
+- **Pareto Curves** — throughput per GPU vs latency and interactivity (only generated when GPU telemetry data is available — add `--gpu-telemetry` during profiling if DCGM is running)
- `--output-dir`: Custom output directory for plots (defaults to data-dir/plots)
+- **Time series** — per-request TTFT, ITL, and latency over time (generated for single-run analysis)
-**Note**: If `--benchmark-name` is not specified, the script will plot all subdirectories found in the data directory.
-### Using Your Own Models and Configuration
-The benchmarking framework supports any HuggingFace-compatible LLM model. Specify your model in the benchmark script's `--model` parameter. It must match the model name of the deployment. You can override the default sequence lengths (2000/256 tokens) with `--isl` and `--osl` flags if needed for your specific workload.
-The benchmarking framework is built around Python modules that provide direct control over the benchmark workflow. The Python benchmarking module connects to your existing endpoints, runs the benchmarks, and can generate plots. Deployment is user-managed and out of scope for this tool.
+Here is an example Pareto frontier from a concurrency sweep of Qwen3-0.6B on 8x H200 with vLLM, showing the tradeoff between user experience (tokens/sec per user) and resource efficiency (tokens/sec per GPU):
-### Comparison Limitations
+![AIPerf Pareto Frontier](../assets/img/aiperf-pareto-frontier.png)
-The plotting system supports up to 12 different benchmarks in a single comparison.
+See the [AIPerf Visualization Guide](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/plot.md) for full details on plot customization, experiment classification, and themes.
-### Concurrency Configuration
+## Use Cases
-You can customize the concurrency levels using the CONCURRENCIES environment variable:
+- **Compare DynamoGraphDeployments** (e.g., aggregated vs disaggregated configurations)
+- **Compare different backends** (e.g., SGLang vs TensorRT-LLM vs vLLM)
+- **Compare Dynamo vs other platforms** (e.g., Dynamo vs llm-d vs AIBrix)
+- **Compare different models** (e.g., Llama-3-8B vs Llama-3-70B vs Qwen3-0.6B)
+- **Compare different hardware configurations** (e.g., H100 vs A100 vs H200)
+- **Compare different parallelization strategies** (e.g., different GPU counts or memory configurations)
-```bash
+## AIPerf Quick Reference
-# Custom concurrency levels
-CONCURRENCIES="1,5,20,50" python3 -m benchmarks.utils.benchmark \
-    --benchmark-name my-test \
-    --endpoint-url http://localhost:8000
-# Or set permanently
-export CONCURRENCIES="1,2,5,10,25,50,100"
-python3 -m benchmarks.utils.benchmark \
-    --benchmark-name test \
-    --endpoint-url http://localhost:8000
-```
-### Request Count Configuration
+### Commonly Used Options
-The number of requests sent per concurrency level is auto-computed as `max(concurrency * 3, 10)` by default. This ensures each concurrency slot runs enough requests for stable measurements. You can override this with the `REQUEST_COUNT` environment variable:
+```text
+aiperf profile [OPTIONS]
-```bash
+REQUIRED:
-# Fixed request count for all concurrency levels
+  --model MODEL               Model name (must match the deployed model)
-REQUEST_COUNT=500 python3 -m benchmarks.utils.benchmark \
+  --url URL                   Endpoint URL (e.g., http://localhost:8000)
-    --benchmark-name my-test \
-    --endpoint-url http://localhost:8000
+COMMON OPTIONS:
+  --endpoint-type TYPE        Endpoint type: chat, completions, embeddings (default: chat)
-# Combined with custom concurrency levels
+  --streaming                 Enable streaming responses
-CONCURRENCIES="1,10,50,200" REQUEST_COUNT=1000 python3 -m benchmarks.utils.benchmark \
+  --concurrency N             Number of concurrent requests
-    --benchmark-name high-load-test \
+  --request-rate N            Target requests per second (alternative to --concurrency)
-    --endpoint-url http://localhost:8000
+  --request-count N           Total number of requests to send
+  --benchmark-duration N      Run for N seconds instead of a fixed request count
+  --synthetic-input-tokens-mean N   Average input sequence length in tokens
+  --output-tokens-mean N      Average output sequence length in tokens
+  --artifact-dir DIR          Output directory for results (default: artifacts/)
+  --warmup-request-count N    Warmup requests before measurement
+  --ui TYPE                   UI mode: dashboard, simple, none (default: dashboard)
 ```
-**Important**: The request count must be greater than or equal to the concurrency level. If the request count is too low, the actual in-flight concurrency will be capped at the request count, leading to inaccurate results at higher concurrency levels.
+For the complete CLI reference, see `aiperf profile --help` or the [CLI docs](https://github.com/ai-dynamo/aiperf/blob/main/docs/cli-options.md).
-## Understanding Your Results
-After benchmarking completes, check `./benchmarks/results/` (or your custom output directory):
-### Plot Labels and Organization
-The plotting script uses the `--benchmark-name` as the experiment name in all generated plots. For example:
- `--benchmark-name aggregated` → plots will show "aggregated" as the label
- `--benchmark-name vllm-disagg` → plots will show "vllm-disagg" as the label
-This allows you to easily identify and compare different configurations in the visualization plots.
+### Output Sequence Length
-### Summary and Plots
+To enforce a specific output length, pass `max_tokens`, `min_tokens`, and `ignore_eos` via `--extra-inputs`:
-```text
+```bash
-benchmarks/results/plots
+aiperf profile \
-├── SUMMARY.txt                                     # Quick overview of all results
+    --model <model> \
-├── p50_inter_token_latency_vs_concurrency.png      # Token generation speed
+    --url http://localhost:8000 \
-├── avg_time_to_first_token_vs_concurrency.png      # Response time
+    --endpoint-type chat \
-├── request_throughput_vs_concurrency.png           # Requests per second
+    --streaming \
-├── efficiency_tok_s_gpu_vs_user.png                # GPU efficiency
+    --concurrency 10 \
-└── avg_inter_token_latency_vs_concurrency.png      # Average latency
+    --output-tokens-mean 256 \
+    --extra-inputs max_tokens:256 \
+    --extra-inputs min_tokens:256 \
+    --extra-inputs ignore_eos:true
 ```
-### Data Files
+### Understanding Results
-Raw data is organized by deployment/benchmark type and concurrency level:
+Each `aiperf profile` run produces an artifact directory containing:
+- **`profile_export_aiperf.json`** — Structured metrics (latency, throughput, TTFT, ITL, etc.)
+- **`profile_export.jsonl`** — Per-request raw data
+- **`profile_export_aiperf.csv`** — CSV format metrics
-**For Any Benchmarking (uses your custom benchmark name):**
+Results are organized by the `--artifact-dir` you specify. For concurrency sweeps, a common pattern is:
-```text
-results/                         # Client-side: ./benchmarks/results/ or custom dir
-├── plots/                       # Server-side: /data/results/
-│   ├── SUMMARY.txt              # Performance visualization plots
-│   ├── p50_inter_token_latency_vs_concurrency.png
-│   ├── avg_inter_token_latency_vs_concurrency.png
-│   ├── request_throughput_vs_concurrency.png
-│   ├── efficiency_tok_s_gpu_vs_user.png
-│   └── avg_time_to_first_token_vs_concurrency.png
-├── <your-benchmark-name>/       # Results for your benchmark (uses your custom name)
-│   ├── c1/                      # Concurrency level 1
-│   │   └── profile_export_aiperf.json
-│   ├── c2/                      # Concurrency level 2
-│   ├── c5/                      # Concurrency level 5
-│   └── ...                      # Other concurrency levels (10, 50, 100, 250)
-└── <your-benchmark-name-N>/     # Results for additional benchmarking runs
-    └── c*/                      # Same structure as above
-```
-**Example with actual benchmark names:**
 ```text
-results/
+artifacts/
-├── plots/
+├── deployment-a/
-├── experiment-a/                  # --benchmark-name experiment-a
+│   ├── c1/
-├── experiment-b/                  # --benchmark-name experiment-b
+│   │   ├── profile_export_aiperf.json
-└── experiment-c/                  # --benchmark-name experiment-c
+│   │   └── profile_export.jsonl
+│   ├── c10/
+│   ├── c50/
+│   └── c100/
+├── deployment-b/
+│   ├── c1/
+│   ├── c10/
+│   ├── c50/
+│   └── c100/
+└── plots/                    # Generated by aiperf plot
+    ├── ttft_vs_throughput.png
+    ├── pareto_curve_throughput_per_gpu_vs_latency.png      # If GPU telemetry available
+    └── pareto_curve_throughput_per_gpu_vs_interactivity.png # If GPU telemetry available
 ```
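With this layout, the per-level metrics files can be gathered for custom analysis. The sketch below is an illustration only: `collect_runs` is a hypothetical helper, and it assumes the `<deployment>/c<N>/profile_export_aiperf.json` pattern shown above. It makes no assumption about the JSON schema and simply loads each file as-is.

```python
import json
from pathlib import Path
from typing import Any, Dict


def collect_runs(artifact_root: str) -> Dict[str, Dict[int, Any]]:
    """Map deployment name -> concurrency level -> parsed metrics JSON."""
    runs: Dict[str, Dict[int, Any]] = {}
    pattern = "*/c*/profile_export_aiperf.json"
    for path in sorted(Path(artifact_root).glob(pattern)):
        level = path.parent.name[1:]  # "c10" -> "10"
        if not level.isdigit():
            continue  # skip directories that merely start with "c"
        deployment = path.parent.parent.name
        with path.open() as fh:
            runs.setdefault(deployment, {})[int(level)] = json.load(fh)
    return runs
```

`collect_runs("artifacts")` returns a nested dictionary keyed by deployment and concurrency. Inspect one loaded entry to see which metric fields your AIPerf version emits before building on them.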
-Each concurrency directory contains:
- **`profile_export_aiperf.json`** - Structured metrics from AIPerf
- **`profile_export_aiperf.csv`** - CSV format metrics from AIPerf
- **`profile_export.json`** - Raw AIPerf results
- **`inputs.json`** - Generated test inputs
 ---
 # Server-Side Benchmarking (In-Cluster)
-Server-side benchmarking runs directly within the Kubernetes cluster, eliminating the need for port forwarding and providing better resource utilization.
+Server-side benchmarking runs directly within the Kubernetes cluster, eliminating port-forwarding overhead and enabling high-load testing.
-## What Server-Side Benchmarking Does
-The server-side benchmarking solution:
- Runs benchmarks directly within the Kubernetes cluster using internal service URLs
- Uses Kubernetes service DNS for direct communication (no port forwarding required)
- Leverages the existing benchmarking infrastructure (`benchmarks.utils.benchmark`)
- Stores results persistently using `dynamo-pvc`
- Provides isolated execution environment with configurable resources
- Handles high load/speed requirements without timeout issues
- **Note**: Each benchmark job runs within a single Kubernetes namespace, but can benchmark services across multiple namespaces using the full DNS format `svc_name.namespace.svc.cluster.local`
 ## Prerequisites
 1. **Kubernetes cluster** with NVIDIA GPUs and Dynamo namespace setup (see [Dynamo Kubernetes Platform docs](../kubernetes/README.md))
-2. **Storage** PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md))
+2. **Storage**: PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md))
-3. **Docker image** containing the Dynamo benchmarking tools
+3. **Docker image** containing AIPerf (Dynamo runtime images include it)
 ## Quick Start
 ### Step 1: Deploy Your DynamoGraphDeployment
-Deploy your DynamoGraphDeployment using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Ensure it has a frontend service exposed.
+Deploy using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Ensure it has a frontend service exposed and the model is fully loaded before running benchmarks — check pod logs or verify the health endpoint returns `200 OK`.
+### Step 2: Configure and Run Benchmark Job
-### Step 2: Deploy and Run Benchmark Job
+First, edit `benchmarks/incluster/benchmark_job.yaml` to match your deployment:
-**Note**: The server-side benchmarking job requires a Docker image containing the Dynamo benchmarking tools. Before the 0.5.1 release, you must build your own Docker image using the [container build instructions](https://github.com/ai-dynamo/dynamo/blob/main/container/README.md), push it to your container registry, then update the `image` field in `benchmarks/incluster/benchmark_job.yaml` to use your built image tag.
+- **Model name**: Update the `MODEL` variable
+- **Service URL**: Update the `URL` variable (use `<svc_name>.<namespace>.svc.cluster.local:port` for cross-namespace access)
+- **Concurrency levels**: Adjust the `for c in ...` loop
+- **Docker image**: Update the `image` field if needed
+Then deploy:
 ```bash
 export NAMESPACE=benchmarking
-# Deploy the benchmark job with default settings
+# Deploy the benchmark job
 kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE
-# Monitor the job, wait for it to complete
+# Monitor the job
 kubectl logs -f job/dynamo-benchmark -n $NAMESPACE
 ```
-#### Customize the job configuration
-To customize the benchmark parameters, edit the `benchmarks/incluster/benchmark_job.yaml` file and modify:
- **Model name**: Change `"Qwen/Qwen3-0.6B"` in the args section
- **Benchmark name**: Change `"qwen3-0p6b-vllm-agg"` to your desired benchmark name
- **Service URL**: Change `"vllm-agg-frontend:8000"` so the service URL matches your deployed service
- **Docker image**: Change the image field if needed
-Then deploy:
-```bash
-kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE
-```
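The service URL format mentioned in the configuration list can be sketched as a small helper. `service_url` is a hypothetical name, and the `http` scheme default is an assumption based on the `--url` examples in this guide.

```python
from typing import Optional


def service_url(service: str, port: int, namespace: Optional[str] = None, scheme: str = "http") -> str:
    """Build an in-cluster URL; pass `namespace` for cross-namespace access."""
    if namespace is None:
        host = service  # the short name resolves within the same namespace
    else:
        host = f"{service}.{namespace}.svc.cluster.local"
    return f"{scheme}://{host}:{port}"
```

For example, `service_url("vllm-agg-frontend", 8000, namespace="production")` produces the fully qualified form to use as the `URL` value when the benchmark job and the target service live in different namespaces.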
 ### Step 3: Retrieve Results
 ```bash
-# Create access pod (skip this step if access pod is already running)
+# Create access pod (skip if already running)
 kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
 kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
 # Download the results
-kubectl cp $NAMESPACE/pvc-access-pod:/data/results/<benchmark-name> ./benchmarks/results/<benchmark-name>
+kubectl cp $NAMESPACE/pvc-access-pod:/data/results ./results
 # Cleanup
 kubectl delete pod pvc-access-pod -n $NAMESPACE
@@ -385,156 +335,55 @@ kubectl delete pod pvc-access-pod -n $NAMESPACE
 ### Step 4: Generate Plots
 ```bash
-# Generate performance plots from the downloaded results
+aiperf plot ./results
-python3 -m benchmarks.utils.plot \
-  --data-dir ./benchmarks/results
 ```
-This will create visualization plots. For more details on interpreting these plots, see the [Summary and Plots](#summary-and-plots) section above.
 ## Cross-Namespace Service Access
-Server-side benchmarking can benchmark services across multiple namespaces from a single job using Kubernetes DNS. When referencing services in other namespaces, use the full DNS format:
+When referencing services in other namespaces, use the full Kubernetes DNS name (`<service-name>.<namespace>.svc.cluster.local:<port>`):
 ```bash
-# Access service in same namespace
+# Same namespace
-SERVICE_URL=vllm-agg-frontend:8000
+--url http://vllm-agg-frontend:8000
-# Access service in different namespace
-SERVICE_URL=vllm-agg-frontend.production.svc.cluster.local:8000
-```
-**DNS Format**: `<service-name>.<namespace>.svc.cluster.local:port`
-This allows you to:
- Benchmark multiple services across different namespaces in a single job
- Compare services running in different environments (dev, staging, production)
- Test cross-namespace integrations without port-forwarding
- Run comprehensive cross-namespace performance comparisons
-## Configuration
-The benchmark job is configured directly in the YAML file.
-### Default Configuration
- **Model**: `Qwen/Qwen3-0.6B`
- **Benchmark Name**: `qwen3-0p6b-vllm-agg`
- **Service**: `vllm-agg-frontend:8000`
- **Docker Image**: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag`
-### Customizing the Job
-To customize the benchmark, edit `benchmarks/incluster/benchmark_job.yaml`:
-1. **Change the model**: Update the `--model` argument
-2. **Change the benchmark name**: Update the `--benchmark-name` argument
-3. **Change the service URL**: Update the `--endpoint-url` argument (use `<svc_name>.<namespace>.svc.cluster.local:port` for cross-namespace access)
-4. **Change Docker image**: Update the image field if needed
-### Example: Multi-Namespace Benchmarking
-To benchmark services across multiple namespaces, you would need to run separate benchmark jobs for each service since the format supports one benchmark per job. However, the results are stored in the same PVC and may be accessed together.
+# Different namespace
+--url http://vllm-agg-frontend.production.svc.cluster.local:8000
-```yaml
-# Job 1: Production service
-args:
-  - --model
-  - "Qwen/Qwen3-0.6B"
-  - --benchmark-name
-  - "prod-vllm"
-  - --endpoint-url
-  - "vllm-agg-frontend.production.svc.cluster.local:8000"
-  - --output-dir
-  - /data/results
-# Job 2: Staging service
-args:
-  - --model
-  - "Qwen/Qwen3-0.6B"
-  - --benchmark-name
-  - "staging-vllm"
-  - --endpoint-url
-  - "vllm-agg-frontend.staging.svc.cluster.local:8000"
-  - --output-dir
-  - /data/results
-```
-## Understanding Your Results
-Results are stored in `/data/results` and follow the same structure as client-side benchmarking:
-```text
-/data/results/
-└── <benchmark-name>/                # Results for your benchmark name
-    ├── c1/                          # Concurrency level 1
-    │   └── profile_export_aiperf.json
-    ├── c2/                          # Concurrency level 2
-    └── ...                          # Other concurrency levels
 ```
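The two URL forms above differ only in whether the namespace is spelled out. A small helper can build either one. This is an illustrative sketch: `svc_url` is a hypothetical function name, and the default port of 8000 is assumed from the examples in this document.

```bash
# Build the --url value for a service.
# With a namespace, emit the fully qualified cluster DNS name;
# without one, emit the short same-namespace form.
svc_url() {
  local svc="$1" ns="${2:-}" port="${3:-8000}"
  if [ -n "$ns" ]; then
    echo "http://${svc}.${ns}.svc.cluster.local:${port}"
  else
    echo "http://${svc}:${port}"
  fi
}

svc_url vllm-agg-frontend              # same namespace
svc_url vllm-agg-frontend production   # different namespace
```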
 ## Monitoring and Debugging
-### Check Job Status
 ```bash
+# Check job status
 kubectl describe job dynamo-benchmark -n $NAMESPACE
-```
-### View Logs
+# Follow logs
-```bash
-# Follow logs in real-time
 kubectl logs -f job/dynamo-benchmark -n $NAMESPACE
-```
-### Debug Failed Jobs
-```bash
 # Check pod status
 kubectl get pods -n $NAMESPACE -l job-name=dynamo-benchmark
-# Describe failed pod
+# Debug failed pod
 kubectl describe pod <pod-name> -n $NAMESPACE
 ```
-## Troubleshooting
+### Troubleshooting
-### Common Issues
 1. **Service not found**: Ensure your DynamoGraphDeployment frontend service is running
-3. **PVC access**: Check that `dynamo-pvc` is properly configured and accessible
+2. **PVC access**: Check that `dynamo-pvc` is properly configured and accessible
-4. **Image pull issues**: Ensure the Docker image is accessible from the cluster
+3. **Image pull issues**: Ensure the Docker image is accessible from the cluster
-5. **Resource constraints**: Adjust resource limits if the job is being evicted
+4. **Resource constraints**: Adjust resource limits if the job is being evicted
-### Debug Commands
 ```bash
 # Check PVC status
 kubectl get pvc dynamo-pvc -n $NAMESPACE
-# Check service endpoints
+# Verify service exists and has endpoints
 kubectl get svc -n $NAMESPACE
+kubectl get endpoints <service-name> -n $NAMESPACE
-# Verify your service exists and has endpoints
-SVC_NAME="${SERVICE_URL%%:*}"
-kubectl get svc "$SVC_NAME" -n "$NAMESPACE"
-kubectl get endpoints "$SVC_NAME" -n "$NAMESPACE"
 ```
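A service can exist and still have no ready pods behind it, which from the benchmark's side looks much like a missing service. The sketch below turns the endpoint check into a pass/fail test that can run before the job is deployed. The function name `has_endpoints` is hypothetical; the `jsonpath` expression reads the ready addresses from the Endpoints object.

```bash
# Succeed only if the service has at least one ready endpoint address.
has_endpoints() {
  local svc="$1" ns="$2" ips
  ips="$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')"
  [ -n "$ips" ]
}

# Usage:
#   has_endpoints vllm-agg-frontend "$NAMESPACE" || echo "no ready endpoints" >&2
```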
 ---
-## Customize Benchmarking Behavior
-The built-in Python workflow connects to endpoints, benchmarks with aiperf, and generates plots. If you want to modify the behavior:
-1. **Extend the workflow**: Modify `benchmarks/utils/workflow.py` to add custom deployment types or metrics collection
-2. **Generate different plots**: Modify `benchmarks/utils/plot.py` to generate a different set of plots for whatever you wish to visualize.
-3. **Direct module usage**: Use individual Python modules (`benchmarks.utils.benchmark`, `benchmarks.utils.plot`) for granular control over each step of the benchmarking process.
-The Python benchmarking module provides a complete end-to-end benchmarking experience with full control over the workflow.
---
 ## Testing with Mocker Backend
 For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) that simulates LLM inference without requiring actual GPU resources. This is useful for:
@@ -547,3 +396,22 @@ For development and testing purposes, Dynamo provides a [mocker backend](https:/
 The mocker backend mimics the API and behavior of real backends (SGLang, TensorRT-LLM, vLLM) but generates mock responses instead of running actual inference.
 See the [mocker directory](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) for usage examples and configuration options.
+---
+## Advanced AIPerf Features
+AIPerf has many capabilities beyond basic profiling. The following are particularly useful for Dynamo benchmarking:
+| Feature | Description | Docs |
+|---------|-------------|------|
+| Trace Replay | Replay production traces for deterministic benchmarking | [Trace Replay](https://github.com/ai-dynamo/aiperf/blob/main/docs/benchmark-modes/trace-replay.md) |
+| Arrival Patterns | Poisson, constant, gamma traffic distributions | [Arrival Patterns](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/arrival-patterns.md) |
+| Gradual Ramping | Smooth ramp-up of concurrency and request rate | [Ramping](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/ramping.md) |
+| Warmup Phase | Eliminate cold-start effects from measurements | [Warmup](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/warmup.md) |
+| Multi-URL Load Balancing | Distribute requests across multiple endpoints | [Multi-URL](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/multi-url-load-balancing.md) |
+| GPU Telemetry | Collect DCGM metrics during benchmarking | [GPU Telemetry](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/gpu-telemetry.md) |
+| Goodput Analysis | SLO-based throughput measurement | [Goodput](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/goodput.md) |
+| Timeslice Analysis | Per-timeslice performance breakdown | [Timeslices](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/timeslices.md) |
+| Multi-Turn Conversations | Benchmark multi-turn chat workloads | [Multi-Turn](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/multi-turn.md) |
+| Experiment Classification | Baseline vs treatment semantic colors in plots | [Plotting](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/plot.md) |