Unverified commit 419e936a, authored by Ben Hamm, committed by GitHub

refactor: remove benchmark shim, use AIPerf directly (#7074)


Signed-off-by: Ben Hamm <ben.hamm@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
Co-authored-by: Saravana Periyasamy <saperiyasamy@nvidia.com>
parent 50818575
@@ -15,7 +15,7 @@
 # Benchmarks
-This directory contains benchmarking scripts and tools for performance evaluation of Dynamo deployments. The benchmarking framework is a wrapper around aiperf that makes it easy to benchmark DynamoGraphDeployments or other deployments with exposed endpoints.
+This directory contains benchmarking tools and scripts for Dynamo deployments. Benchmarking uses [AIPerf](https://github.com/ai-dynamo/aiperf) directly — a comprehensive tool for measuring generative AI inference performance.
 ## Quick Start
@@ -26,49 +26,37 @@ First, deploy your DynamoGraphDeployment using the [deployment documentation](..
 # Port-forward your deployment to http://localhost:8000
 kubectl port-forward -n <namespace> svc/<frontend-service-name> 8000:8000 > /dev/null 2>&1 &
-# Run benchmark
-python3 -m benchmarks.utils.benchmark \
-    --benchmark-name my-benchmark \
-    --endpoint-url http://localhost:8000 \
-    --model "<your-model>"
-# Generate plots
-python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results
-# Or plot only specific benchmark experiments
-python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-name my-benchmark
-```
-## Features
-Benchmark any HTTP endpoints! The benchmarking framework supports:
-**Flexible Configuration:**
-- User-defined benchmark names using `--benchmark-name` flag
-- Support for single endpoint benchmarking with `--endpoint-url` flag
-- Customizable concurrency levels (configurable via CONCURRENCIES env var), sequence lengths, and models
-- Automated performance plot generation with custom benchmark names
-**Supported Backends:**
-- DynamoGraphDeployments with port-forwarded endpoints
-- External HTTP endpoints (for comparison with non-Dynamo backends or platforms)
-## Installation
-This is already included as part of the Dynamo container images. To install locally or standalone:
-```bash
-pip install -e .
+# Run a single benchmark
+aiperf profile \
+    --model <your-model> \
+    --url http://localhost:8000 \
+    --endpoint-type chat \
+    --streaming \
+    --concurrency 10 \
+    --request-count 100
+# Run a concurrency sweep for Pareto analysis
+for c in 1 2 5 10 50 100; do
+    aiperf profile \
+        --model <your-model> \
+        --url http://localhost:8000 \
+        --endpoint-type chat \
+        --streaming \
+        --concurrency $c \
+        --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \
+        --artifact-dir "artifacts/my-benchmark/c$c"
+done
+# Generate comparison plots
+aiperf plot artifacts/my-benchmark
 ```
-## Data Generation Tools
-This directory also includes lightweight tools for:
-- Analyzing prefix-structured data (`datagen analyze`)
-- Synthesizing structured data customizable for testing purposes (`datagen synthesize`)
-Detailed information is provided in the `prefix_data_generator` directory.
+## Directory Contents
+- **`incluster/`** — Kubernetes Job manifest for running benchmarks inside the cluster
+- **`router/`** — KV Router benchmarking scripts (prefix ratio, trace replay, agent, priority queue)
+- **`prefix_data_generator/`** — Tools for analyzing and synthesizing prefix-structured data
 ## Comprehensive Guide
-For detailed documentation, configuration options, and advanced usage, see the [complete benchmarking guide](../docs/benchmarks/benchmarking.md).
+For detailed documentation including server-side benchmarking, Pareto analysis, and advanced AIPerf features, see the [complete benchmarking guide](../docs/benchmarks/benchmarking.md).
@@ -37,22 +37,29 @@ spec:
                 secretKeyRef:
                   name: hf-token-secret
                   key: HF_TOKEN
-          command: ["python3", "-m", "benchmarks.utils.benchmark"]
+          command: ["/bin/bash", "-c"]
           args:
-            - --model
-            - "Qwen/Qwen3-0.6B"
-            - --isl
-            - "2000"
-            - --std
-            - "10"
-            - --osl
-            - "256"
-            - --output-dir
-            - /data/results
-            - --benchmark-name
-            - "qwen3-0p6b-vllm-agg"
-            - --endpoint-url
-            - "vllm-agg-frontend:8000"
+            - |
+              set -euo pipefail
+              MODEL="Qwen/Qwen3-0.6B"
+              URL="http://vllm-agg-frontend:8000"
+              OUTPUT_DIR="/data/results/qwen3-0p6b-vllm-agg"
+
+              for c in 1 2 5 10 50 100; do
+                echo "=== Concurrency $c ==="
+                aiperf profile \
+                  --model "$MODEL" \
+                  --url "$URL" \
+                  --endpoint-type chat \
+                  --streaming \
+                  --concurrency $c \
+                  --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \
+                  --synthetic-input-tokens-mean 2000 \
+                  --output-tokens-mean 256 \
+                  --artifact-dir "$OUTPUT_DIR/c$c" \
+                  --ui none
+              done
+              echo "=== Benchmark complete ==="
           volumeMounts:
             - name: data-volume
               mountPath: /data
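For readers who would rather script the sweep from Python than from bash, the loop in the new Job manifest can be sketched roughly as follows. This is only an illustration: `build_sweep_commands` is a hypothetical helper that is not part of the commit, it only builds argument lists and executes nothing, and it uses just the `aiperf profile` flags that appear in the manifest above.

```python
from typing import List, Sequence


def build_sweep_commands(
    model: str,
    url: str,
    output_dir: str,
    levels: Sequence[int] = (1, 2, 5, 10, 50, 100),
) -> List[List[str]]:
    """Return one `aiperf profile` argv per concurrency level (nothing is run)."""
    commands = []
    for c in levels:
        # Same rule as the shell arithmetic $(( c * 3 > 10 ? c * 3 : 10 ))
        request_count = max(c * 3, 10)
        commands.append([
            "aiperf", "profile",
            "--model", model,
            "--url", url,
            "--endpoint-type", "chat",
            "--streaming",
            "--concurrency", str(c),
            "--request-count", str(request_count),
            "--synthetic-input-tokens-mean", "2000",
            "--output-tokens-mean", "256",
            "--artifact-dir", f"{output_dir}/c{c}",
            "--ui", "none",
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_sweep_commands(
        "Qwen/Qwen3-0.6B",
        "http://vllm-agg-frontend:8000",
        "/data/results/qwen3-0p6b-vllm-agg",
    ):
        print(" ".join(cmd))
```

Each returned list could be handed to `subprocess.run(cmd, check=True)` if actual execution is wanted; that step is left out here so the sketch stays side-effect free.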
......
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Package marker for benchmarks utilities
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import os
import subprocess
from pathlib import Path
from typing import List

# Default concurrency levels - can be overridden with CONCURRENCIES environment variable
DEFAULT_CONCURRENCIES: List[int] = [1, 2, 5, 10, 50, 100, 250]

# Default request count per concurrency level - can be overridden with REQUEST_COUNT env var
# When set to 0 or unset, defaults to max(concurrency * REQUEST_COUNT_SCALE_FACTOR, 10)
# to ensure the concurrency level is fully utilized and each slot runs enough requests
# for stable measurements
DEFAULT_REQUEST_COUNT: int = 0
REQUEST_COUNT_SCALE_FACTOR: int = 3


def get_concurrency_levels() -> List[int]:
    """Get concurrency levels from environment variable or use defaults"""
    concurrencies_env = os.getenv("CONCURRENCIES")
    if concurrencies_env:
        try:
            # Parse comma-separated values
            concurrencies = [int(x.strip()) for x in concurrencies_env.split(",")]
            # Validate all are positive integers
            for c in concurrencies:
                if c <= 0:
                    raise ValueError(f"Concurrency level must be positive, got: {c}")
            return sorted(concurrencies)
        except ValueError as e:
            print(f"WARNING: Invalid CONCURRENCIES environment variable: {e}")
            print(f"Using default concurrency levels: {DEFAULT_CONCURRENCIES}")
            return DEFAULT_CONCURRENCIES
    return DEFAULT_CONCURRENCIES


def get_request_count() -> int:
    """Get request count from environment variable or use default.

    Returns 0 to indicate 'auto' mode (will be computed per concurrency level).
    """
    request_count_env = os.getenv("REQUEST_COUNT")
    if request_count_env:
        try:
            count = int(request_count_env.strip())
            if count < 0:
                raise ValueError(f"Request count must be non-negative, got: {count}")
            return count
        except ValueError as e:
            print(f"WARNING: Invalid REQUEST_COUNT environment variable: {e}")
            return DEFAULT_REQUEST_COUNT
    return DEFAULT_REQUEST_COUNT


CONCURRENCIES: List[int] = get_concurrency_levels()


def run_aiperf(
    service_url: str,
    model_name: str,
    isl: int,
    osl: int,
    stddev: int,
    concurrency: int,
    output_dir: Path,
    request_count: int = 0,
) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    # Auto-compute request count: need enough requests to fully utilize concurrency
    # and run each slot at least REQUEST_COUNT_SCALE_FACTOR times for stable measurements
    if request_count <= 0:
        request_count = max(concurrency * REQUEST_COUNT_SCALE_FACTOR, 10)
    elif request_count < concurrency:
        print(
            f"WARNING: request_count ({request_count}) < concurrency ({concurrency}). "
            f"Actual in-flight concurrency will be capped at {request_count}.",
            flush=True,
        )
    cmd = [
        "aiperf",
        "profile",
        "-m",
        model_name,
        "--endpoint-type",
        "chat",
        "--streaming",
        "-u",
        service_url,
        "--synthetic-input-tokens-mean",
        str(isl),
        "--synthetic-input-tokens-stddev",
        str(stddev),
        "--concurrency",
        str(concurrency),
        "--request-count",
        str(request_count),
        "--output-tokens-mean",
        str(osl),
        "--extra-inputs",
        f"max_tokens:{osl}",
        "--extra-inputs",
        f"min_tokens:{osl}",
        "--extra-inputs",
        "ignore_eos:true",
        "--tokenizer",
        model_name,
        "--artifact-dir",
        str(output_dir),
    ]
    print(
        f"Running aiperf with isl {isl}, osl {osl}, concurrency {concurrency}, request_count {request_count}",
        flush=True,
    )
    aip_process = subprocess.Popen(
        cmd,
        cwd=str(output_dir),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    stdout, stderr = aip_process.communicate()
    if aip_process.returncode == 0:
        print("Aiperf profiling completed successfully", flush=True)
        if stdout:
            print(stdout)
    else:
        print(f"Aiperf failed with error code: {aip_process.returncode}")
        if stderr:
            print(f"stderr: {stderr}")
        raise subprocess.CalledProcessError(
            aip_process.returncode, cmd, output=stdout, stderr=stderr
        )
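
The request-count rule in `run_aiperf` is the piece the new shell one-liners have to reproduce, so a worked example may help. The sketch below isolates that rule; the function names `effective_request_count` and `in_flight_cap` are hypothetical and exist only for illustration, while the scale factor of 3 and the floor of 10 come from the removed module.

```python
REQUEST_COUNT_SCALE_FACTOR = 3  # value taken from the removed aiperf.py shim


def effective_request_count(concurrency: int, request_count: int = 0) -> int:
    """Mirror the auto rule: 0 (or negative) means max(concurrency * 3, 10)."""
    if request_count <= 0:
        return max(concurrency * REQUEST_COUNT_SCALE_FACTOR, 10)
    return request_count


def in_flight_cap(concurrency: int, request_count: int = 0) -> int:
    """A run cannot keep more requests in flight than it sends in total."""
    return min(concurrency, effective_request_count(concurrency, request_count))


levels = [1, 2, 5, 10, 50, 100, 250]
auto_counts = {c: effective_request_count(c) for c in levels}
print(auto_counts)
# {1: 10, 2: 10, 5: 15, 10: 30, 50: 150, 100: 300, 250: 750}
print(in_flight_cap(50, 20))  # 20
```

The floor of 10 only matters for concurrency 1 and 2; from concurrency 5 upward the count is simply three requests per slot.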


def run_concurrency_sweep(
    service_url: str, model_name: str, isl: int, osl: int, stddev: int, output_dir: Path
) -> None:
    concurrency_levels = get_concurrency_levels()
    request_count = get_request_count()
    print(
        f"Running concurrency sweep for {model_name} with ISL {isl} and OSL {osl} and standard deviation {stddev}",
        flush=True,
    )
    print(f"Concurrency levels: {concurrency_levels}", flush=True)
    print(
        f"Request count: {request_count if request_count > 0 else f'auto (max(concurrency*{REQUEST_COUNT_SCALE_FACTOR}, 10))'}",
        flush=True,
    )
    for c in concurrency_levels:
        print(f"Starting concurrency level {c}", flush=True)
        run_aiperf(
            service_url,
            model_name,
            isl,
            osl,
            stddev,
            c,
            output_dir / f"c{c}",
            request_count=request_count,
        )
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import argparse
import re
import sys
from urllib.parse import urlsplit
from benchmarks.utils.workflow import has_http_scheme, run_benchmark_workflow
from deploy.utils.kubernetes import is_running_in_cluster


def validate_endpoint(endpoint: str) -> None:
    """Validate that endpoint is HTTP endpoint or internal service URL when running in cluster"""
    v = endpoint.strip()
    if is_running_in_cluster():
        # Allow HTTP(S) or internal service URLs like host[:port][/path]
        if has_http_scheme(v):
            pass
        else:
            parts = urlsplit(f"//{v}")
            host_ok = bool(parts.hostname)
            port_ok = parts.port is None or (1 <= parts.port <= 65535)
            if not (host_ok and port_ok):
                raise ValueError(
                    f"Endpoint must be HTTP(S) or internal service URL. Got: {endpoint}"
                )
    else:
        if not has_http_scheme(v):
            raise ValueError(f"Endpoint must be HTTP endpoint. Got: {endpoint}")


def validate_benchmark_name(name: str) -> None:
    """Validate benchmark name"""
    if not name.strip():
        raise ValueError("Benchmark name cannot be empty")
    name = name.strip()
    # Validate name characters
    if not re.match(r"^[a-zA-Z0-9_-]+$", name):
        raise ValueError(f"Invalid benchmark name: {name}")
    # Validate reserved names
    if name.lower() == "plots":
        raise ValueError("Benchmark name 'plots' is reserved")
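
The in-cluster branch of `validate_endpoint` relies on a small `urlsplit` trick: prefixing `//` makes the standard library treat a scheme-less string as a network location, so `hostname` and `port` become available. A minimal sketch of that behaviour follows; `split_host_port` is a hypothetical helper, not something from the removed file.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def split_host_port(endpoint: str) -> Tuple[Optional[str], Optional[int]]:
    """Parse a scheme-less endpoint of the form host[:port][/path]."""
    parts = urlsplit(f"//{endpoint.strip()}")
    # Reading .port raises ValueError when the port is not a valid number
    # in the range 0-65535, so callers get a ValueError either way.
    return parts.hostname, parts.port


print(split_host_port("vllm-agg-frontend:8000"))  # ('vllm-agg-frontend', 8000)
print(split_host_port("vllm-agg-frontend"))       # ('vllm-agg-frontend', None)

try:
    split_host_port("vllm-agg-frontend:99999")
except ValueError as exc:
    print(f"rejected: {exc}")
```

Because an out-of-range port already raises inside `urlsplit`'s result object, the explicit `1 <= parts.port <= 65535` check in the removed code mainly serves to reject port 0.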


def main() -> int:
    parser = argparse.ArgumentParser(description="Benchmark Orchestrator")
    parser.add_argument(
        "--benchmark-name",
        required=True,
        help="Name/label for this benchmark (used in plots and results)",
    )
    parser.add_argument(
        "--endpoint-url",
        required=True,
        help="Endpoint to benchmark: HTTP(S) URL (e.g., http://localhost:8000) or in-cluster service URL host[:port]",
    )
    parser.add_argument("--isl", type=int, default=2000, help="Input sequence length")
    parser.add_argument(
        "--std",
        type=int,
        default=10,
        help="Input sequence standard deviation",
    )
    parser.add_argument("--osl", type=int, default=256, help="Output sequence length")
    parser.add_argument(
        "--model",
        default="Qwen/Qwen3-0.6B",
        help="Model name (must match the model deployed at the endpoint)",
    )
    parser.add_argument(
        "--output-dir", type=str, default="benchmarks/results", help="Output directory"
    )
    args = parser.parse_args()
    # Validate inputs
    try:
        validate_benchmark_name(args.benchmark_name)
        validate_endpoint(args.endpoint_url)
    except ValueError as e:
        print(f"ERROR: {e}")
        return 1
    # Run the benchmark workflow with the parsed inputs
    run_benchmark_workflow(
        inputs={args.benchmark_name: args.endpoint_url},
        isl=args.isl,
        std=args.std,
        osl=args.osl,
        model=args.model,
        output_dir=args.output_dir,
    )
    return 0


if __name__ == "__main__":
    sys.exit(main())
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import json
import re
from pathlib import Path
from typing import Dict, List, Optional, Tuple
import matplotlib.pyplot as plt


def parse_benchmark_results(result_dir: Path) -> List[Tuple[int, Dict]]:
    """
    Parse benchmark results from a deployment directory.

    Args:
        result_dir: Path to the result directory

    Returns:
        List of (concurrency_level, metrics_dict) tuples sorted by concurrency
    """
    results = []
    # Find all concurrency directories (e.g., c1, c2, c5, c10, c50, c100, c250)
    for concurrency_dir in result_dir.iterdir():
        if not concurrency_dir.is_dir() or not concurrency_dir.name.startswith("c"):
            continue
        # Extract concurrency level from directory name
        match = re.match(r"c(\d+)", concurrency_dir.name)
        if not match:
            continue
        concurrency = int(match.group(1))
        # Find the aiperf JSON file
        aiperf_json = None
        for json_file in concurrency_dir.rglob("profile_export_aiperf.json"):
            aiperf_json = json_file
            break
        if aiperf_json and aiperf_json.exists():
            try:
                with open(aiperf_json, "r") as f:
                    metrics = json.load(f)
                results.append((concurrency, metrics))
                print(f"Loaded metrics for concurrency {concurrency}")
            except Exception as e:
                print(f"Error loading {aiperf_json}: {e}")
        else:
            print(f"Warning: No aiperf JSON found for {concurrency_dir}")
    # Sort by concurrency level
    results.sort(key=lambda x: x[0])
    return results


def extract_metric_series(
    results: List[Tuple[int, Dict]], metric_path: str, stat: str = "avg"
) -> Tuple[List[int], List[float]]:
    """
    Extract a time series of a specific metric across concurrency levels.

    Args:
        results: List of (concurrency, metrics) tuples
        metric_path: Dot-separated path to the metric (e.g., 'inter_token_latency')
        stat: Statistic to extract ('avg', 'p50', 'p90', etc.)

    Returns:
        Tuple of (concurrency_levels, metric_values)
    """
    concurrencies = []
    values = []
    path_keys = metric_path.split(".")
    for concurrency, metrics in results:
        try:
            node = metrics
            for k in path_keys:
                node = node[k]
            value = node[stat]
            concurrencies.append(concurrency)
            values.append(float(value))
        except (KeyError, TypeError):
            print(
                f"Warning: {metric_path}.{stat} not found for concurrency {concurrency}"
            )
            continue
    return concurrencies, values
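
The dot-separated lookup in `extract_metric_series` is easy to misread, so here is a small standalone sketch of the same walk. The `lookup` helper and the `sample` dictionary are hypothetical: the numbers are invented, and the nesting is only an assumption about how the exported JSON is shaped, inferred from the keys the removed plotting code reads.

```python
from typing import Any, Dict


def lookup(metrics: Dict[str, Any], metric_path: str, stat: str = "avg") -> float:
    """Walk a dot-separated path, then read one statistic from the final node."""
    node: Any = metrics
    for key in metric_path.split("."):
        node = node[key]
    return float(node[stat])


# Invented values, for illustration only
sample = {
    "inter_token_latency": {"avg": 12.5, "p50": 11.0},
    "request_throughput": {"avg": 3.2},
    "nested": {"time_to_first_token": {"avg": 85}},
}

print(lookup(sample, "inter_token_latency", "p50"))  # 11.0
print(lookup(sample, "nested.time_to_first_token"))  # 85.0
```

A missing key raises `KeyError` here; the removed function caught `KeyError` and `TypeError` instead and skipped that concurrency level with a warning.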
def create_plot(
title: str,
xlabel: str,
ylabel: str,
data_series: List[Tuple[str, List[int], List[float]]],
output_path: Path,
log_scale_x: bool = False,
log_scale_y: bool = False,
) -> None:
"""
Create a line plot with multiple series.
Args:
title: Plot title
xlabel: X-axis label
ylabel: Y-axis label
data_series: List of (label, x_values, y_values) tuples
output_path: Path to save the plot
log_scale_x: Whether to use log scale for X axis
log_scale_y: Whether to use log scale for Y axis
"""
plt.figure(figsize=(10, 6))
colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b"]
for i, (label, x_vals, y_vals) in enumerate(data_series):
if x_vals and y_vals: # Only plot if we have data
plt.plot(
x_vals,
y_vals,
marker="o",
linewidth=2,
markersize=6,
color=colors[i % len(colors)],
label=label,
)
plt.title(title, fontsize=14, fontweight="bold")
plt.xlabel(xlabel, fontsize=12)
plt.ylabel(ylabel, fontsize=12)
plt.grid(True, alpha=0.3)
if log_scale_x:
plt.xscale("log")
if log_scale_y:
plt.yscale("log")
plt.legend()
plt.tight_layout()
plt.savefig(output_path, dpi=300, bbox_inches="tight")
plt.close()
print(f"Saved plot: {output_path}")
def create_efficiency_plot(
deployment_results: Dict, plots_dir: Path, output_tokens: int = 200
) -> None:
"""
Create an efficiency plot showing tok/s/gpu vs tok/s/user with concurrency as labeled points.
Args:
deployment_results: Dict of deployment_type -> results
plots_dir: Directory to save plots
output_tokens: Average output tokens per request (default 200)
"""
plt.figure(figsize=(12, 8))
# Support for up to 12 deployments in the plots
colors = [
"#1f77b4",
"#ff7f0e",
"#2ca02c",
"#d62728",
"#9467bd",
"#8c564b",
"#e377c2",
"#7f7f7f",
"#bcbd22",
"#17becf",
"#aec7e8",
"#ffbb78",
]
markers = ["o", "s", "^", "D", "v", "<", ">", "p", "*", "h", "H", "+"]
for deployment_type, results in deployment_results.items():
tok_s_per_user = []
tok_s_per_gpu = []
concurrency_levels = []
for concurrency, metrics in results:
try:
# Get request throughput (requests/sec)
request_throughput = metrics["request_throughput"]["avg"]
# Calculate total tokens per second
total_tok_s = request_throughput * output_tokens
# Guard against zero concurrency and parameterize GPU count
if concurrency <= 0:
continue
num_gpus = metrics.get("cluster", {}).get("num_gpus", 1)
tok_s_user = total_tok_s / concurrency
tok_s_gpu = total_tok_s / max(1, num_gpus)
tok_s_per_user.append(tok_s_user)
tok_s_per_gpu.append(tok_s_gpu)
concurrency_levels.append(concurrency)
except KeyError as e:
print(
f"Warning: Missing metric for {deployment_type} concurrency {concurrency}: {e}"
)
continue
if tok_s_per_user and tok_s_per_gpu:
# Plot points
color_idx = list(deployment_results.keys()).index(deployment_type)
color = colors[color_idx % len(colors)]
marker = markers[color_idx % len(markers)]
plt.scatter(
tok_s_per_user,
tok_s_per_gpu,
c=color,
marker=marker,
s=120,
alpha=0.8,
label=deployment_type.title(),
edgecolors="black",
linewidth=1.5,
)
# Add concurrency labels
for i, (x, y, c) in enumerate(
zip(tok_s_per_user, tok_s_per_gpu, concurrency_levels)
):
plt.annotate(
f"{c}",
(x, y),
xytext=(8, 8),
textcoords="offset points",
fontsize=10,
fontweight="bold",
ha="left",
)
plt.title("GPU Efficiency vs User Experience", fontsize=14, fontweight="bold")
plt.xlabel("Tokens/sec per User", fontsize=12)
plt.ylabel("Tokens/sec per GPU", fontsize=12)
plt.grid(True, alpha=0.3)
# Add a note about what the numbers represent
plt.figtext(
0.02,
0.02,
"Note: Numbers on dots indicate concurrency level",
fontsize=10,
style="italic",
alpha=0.7,
)
plt.legend()
plt.tight_layout()
output_path = plots_dir / "efficiency_tok_s_gpu_vs_user.png"
plt.savefig(output_path, dpi=300, bbox_inches="tight")
plt.close()
print(f"Saved efficiency plot: {output_path}")
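
The efficiency plot derives both axes from request throughput rather than from measured token counts, which is worth spelling out with numbers. The sketch below repeats that arithmetic in isolation; `efficiency_point` is a hypothetical name and the inputs are invented.

```python
from typing import Tuple


def efficiency_point(
    request_throughput: float,
    concurrency: int,
    output_tokens: int = 200,
    num_gpus: int = 1,
) -> Tuple[float, float]:
    """Return (tokens/sec per user, tokens/sec per GPU) as the removed plot computed them."""
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    # Estimated output token rate: requests/sec times an assumed tokens per request
    total_tok_s = request_throughput * output_tokens
    return total_tok_s / concurrency, total_tok_s / max(1, num_gpus)


# Invented inputs: 4 req/s at concurrency 10, 256 output tokens, 2 GPUs
per_user, per_gpu = efficiency_point(4.0, 10, output_tokens=256, num_gpus=2)
print(per_user, per_gpu)  # 102.4 512.0
```

Note that the removed function defaulted `output_tokens` to 200 while the benchmark CLI defaulted `--osl` to 256, and `generate_plots` called it without overriding the default, so the plotted token rates were an approximation.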
def generate_plots(
base_output_dir: Path, output_dir: Path, benchmark_names: Optional[List[str]] = None
) -> None:
"""
Generate performance plots from benchmark results.
Args:
base_output_dir: Base directory containing benchmark results
output_dir: Directory to save plots
benchmark_names: Optional list of specific benchmark names to plot. If None, plots all subdirectories.
"""
print(f"Generating plots from results in {base_output_dir}")
if not base_output_dir.exists():
print(f"Results directory does not exist: {base_output_dir}")
return
# Create plots directory
output_dir.mkdir(parents=True, exist_ok=True)
# Parse results for each deployment type
deployment_results = {}
# Find all subdirectories that contain benchmark results
names_set = set(benchmark_names) if benchmark_names is not None else None
for item in base_output_dir.iterdir():
if item.is_dir() and item.name != "plots":
deployment_type = item.name
# If benchmark_names is specified, only process those directories
if names_set is not None and deployment_type not in names_set:
print(f"Skipping {deployment_type} (not in specified benchmark names)")
continue
results = parse_benchmark_results(item)
if results:
deployment_results[deployment_type] = results
print(f"Found {len(results)} concurrency levels for {deployment_type}")
else:
print(f"No valid results found for {deployment_type}")
if not deployment_results:
if benchmark_names:
available = sorted(
[
p.name
for p in base_output_dir.iterdir()
if p.is_dir() and p.name != "plots"
]
)
missing = sorted([n for n in benchmark_names if n not in available])
print(f"No benchmark results found for specified names: {benchmark_names}")
if missing:
print(f"Missing (not found under {base_output_dir}): {missing}")
print(f"Available experiments: {available}")
else:
print("No benchmark results found to plot!")
# 1. P50 Inter-token Latency vs Concurrency
p50_data = []
for deployment_type, results in deployment_results.items():
concurrencies, latencies = extract_metric_series(
results, "inter_token_latency", "p50"
)
if concurrencies:
p50_data.append((deployment_type.title(), concurrencies, latencies))
create_plot(
title="P50 Inter-Token Latency vs Concurrency",
xlabel="Concurrency Level",
ylabel="P50 Inter-Token Latency (ms)",
data_series=p50_data,
output_path=output_dir / "p50_inter_token_latency_vs_concurrency.png",
log_scale_x=True,
)
# 2. Average Inter-token Latency vs Concurrency
avg_latency_data = []
for deployment_type, results in deployment_results.items():
concurrencies, latencies = extract_metric_series(
results, "inter_token_latency", "avg"
)
if concurrencies:
avg_latency_data.append((deployment_type.title(), concurrencies, latencies))
create_plot(
title="Average Inter-Token Latency vs Concurrency",
xlabel="Concurrency Level",
ylabel="Average Inter-Token Latency (ms)",
data_series=avg_latency_data,
output_path=output_dir / "avg_inter_token_latency_vs_concurrency.png",
log_scale_x=True,
)
# 3. Request Throughput vs Concurrency
throughput_data = []
for deployment_type, results in deployment_results.items():
concurrencies, throughputs = extract_metric_series(
results, "request_throughput", "avg"
)
if concurrencies:
throughput_data.append(
(deployment_type.title(), concurrencies, throughputs)
)
create_plot(
title="Request Throughput vs Concurrency",
xlabel="Concurrency Level",
ylabel="Request Throughput (req/s)",
data_series=throughput_data,
output_path=output_dir / "request_throughput_vs_concurrency.png",
log_scale_x=True,
)
# 4. Average Time to First Token vs Concurrency
ttft_data = []
for deployment_type, results in deployment_results.items():
concurrencies, ttfts = extract_metric_series(
results, "time_to_first_token", "avg"
)
if concurrencies:
ttft_data.append((deployment_type.title(), concurrencies, ttfts))
create_plot(
title="Average Time to First Token vs Concurrency",
xlabel="Concurrency Level",
ylabel="Average Time to First Token (ms)",
data_series=ttft_data,
output_path=output_dir / "avg_time_to_first_token_vs_concurrency.png",
log_scale_x=True,
)
# 5. Efficiency plot: tok/s/gpu vs tok/s/user
create_efficiency_plot(deployment_results, output_dir)
# Generate summary
summary_lines = [
"Benchmark Results Summary",
"=" * 30,
"",
f"Results directory: {base_output_dir}",
f"Plots generated: {output_dir}",
"",
"Deployment Types Found:",
]
for deployment_type, results in deployment_results.items():
concurrency_levels = [r[0] for r in results]
summary_lines.append(
f" {deployment_type}: {len(results)} concurrency levels ({min(concurrency_levels)}-{max(concurrency_levels)})"
)
summary_lines.extend(
[
"",
"Generated Plots:",
" - p50_inter_token_latency_vs_concurrency.png",
" - avg_inter_token_latency_vs_concurrency.png",
" - request_throughput_vs_concurrency.png",
" - avg_time_to_first_token_vs_concurrency.png",
" - efficiency_tok_s_gpu_vs_user.png",
]
)
summary_path = output_dir / "SUMMARY.txt"
summary_path.write_text("\n".join(summary_lines))
print(f"Generated summary: {summary_path}")
print(f"All plots saved to: {output_dir}")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(
description="Generate performance plots from benchmark results"
)
parser.add_argument(
"--data-dir", required=True, help="Directory containing benchmark results"
)
parser.add_argument(
"--output-dir", help="Output directory for plots (defaults to data-dir/plots)"
)
parser.add_argument(
"--benchmark-name",
action="append",
help="Specific benchmark experiment name to plot (can be specified multiple times). If not specified, plots all subdirectories.",
)
args = parser.parse_args()
data_dir = Path(args.data_dir)
benchmark_names = args.benchmark_name if args.benchmark_name else None
if args.output_dir:
# If output dir specified, use it as base and call generate_plots
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
generate_plots(data_dir, output_dir, benchmark_names)
else:
# Use data_dir as base output dir
generate_plots(data_dir, data_dir / "plots", benchmark_names)
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
from pathlib import Path
from typing import Dict, List
from benchmarks.utils.aiperf import run_concurrency_sweep
from deploy.utils.kubernetes import is_running_in_cluster


def has_http_scheme(url: str) -> bool:
    """Check if URL has HTTP or HTTPS scheme."""
    return url.lower().startswith(("http://", "https://"))


def normalize_service_url(endpoint: str) -> str:
    e = endpoint.strip()
    if has_http_scheme(e):
        return e
    if is_running_in_cluster():
        return f"http://{e}"
    return e  # Outside cluster, validation will have ensured scheme is present


def print_concurrency_start(
    label: str, model: str, isl: int, osl: int, std: int
) -> None:
    """Print concurrency sweep start messages"""
    print(f"⚙️ Starting {label} concurrency sweep!", flush=True)
    print(
        "⏱️ This may take several minutes - running through multiple concurrency levels...",
        flush=True,
    )
    print(f"🎯 Model: {model} | ISL: {isl} | OSL: {osl} | StdDev: {std}")


def run_endpoint_benchmark(
    label: str,
    endpoint: str,
    model: str,
    isl: int,
    osl: int,
    std: int,
    output_dir: Path,
) -> None:
    """Run benchmark for an existing endpoint with custom label"""
    # Normalize endpoint to a usable URL (handles in-cluster scheme-less inputs)
    service_url = normalize_service_url(endpoint)
    print(f"🚀 Starting benchmark of endpoint '{label}': {service_url}")
    print(f"📁 Results will be saved to: {output_dir / label}")
    print_concurrency_start(label, model, isl, osl, std)
    # Create output directory
    (output_dir / label).mkdir(parents=True, exist_ok=True)
    run_concurrency_sweep(
        service_url=service_url,
        model_name=model,
        isl=isl,
        osl=osl,
        stddev=std,
        output_dir=output_dir / label,
    )
    print("✅ Endpoint benchmark completed successfully!")


def print_final_summary(output_dir: Path, labels: List[str]) -> None:
    """Print final benchmark summary"""
    print("🎉 Benchmark workflow completed successfully!")
    print(f"📁 All results available at: {output_dir}")
    if labels:
        print(f"🚀 Benchmarked: {', '.join(labels)}")


def run_benchmark_workflow(
    inputs: Dict[str, str],
    isl: int = 2000,
    std: int = 10,
    osl: int = 256,
    model: str = "Qwen/Qwen3-0.6B",
    output_dir: str = "benchmarks/results",
) -> None:
    """Main benchmark workflow orchestrator for HTTP endpoints (and in-cluster internal service URLs)"""
    output_dir_path = Path(output_dir)
    output_dir_path.mkdir(parents=True, exist_ok=True)
    # Run endpoint benchmarks
    benchmarked_labels = []
    for label, endpoint in inputs.items():
        run_endpoint_benchmark(label, endpoint, model, isl, osl, std, output_dir_path)
        benchmarked_labels.append(label)
    # Generate final summary
    print_final_summary(output_dir_path, benchmarked_labels)
@@ -4,8 +4,6 @@
 # Kubernetes and async dependencies
 aiofiles>=0.8.0
-# Benchmarking dependencies for Dynamo
-genai-perf==0.0.15
 httpx>=0.24.0
 kubernetes-asyncio>=24.0.0
......
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
...@@ -5,13 +5,15 @@ title: Dynamo Benchmarking ...@@ -5,13 +5,15 @@ title: Dynamo Benchmarking
subtitle: Benchmark and compare performance across Dynamo deployment configurations subtitle: Benchmark and compare performance across Dynamo deployment configurations
--- ---
This benchmarking framework lets you compare performance across any combination of: This guide shows how to benchmark Dynamo deployments using [AIPerf](https://github.com/ai-dynamo/aiperf), a comprehensive tool for measuring generative AI inference performance. AIPerf provides detailed metrics, real-time dashboards, and automatic visualization — you call it directly against your endpoints.
You can benchmark any combination of:
- **DynamoGraphDeployments** - **DynamoGraphDeployments**
- **External HTTP endpoints** (existing services deployed following standard documentation from vLLM, llm-d, AIBrix, etc.) - **External HTTP endpoints** (vLLM, llm-d, AIBrix, etc.)
## Choosing Your Benchmarking Approach ## Choosing Your Benchmarking Approach
Dynamo provides two benchmarking approaches to suit different use cases: **client-side** and **server-side**. Client-side refers to running benchmarks on your local machine and connecting to Kubernetes deployments via port-forwarding, while server-side refers to running benchmarks directly within the Kubernetes cluster using internal service URLs. Which method to use depends on your use case. **Client-side** runs benchmarks on your local machine via port-forwarding. **Server-side** runs benchmarks directly within the Kubernetes cluster using internal service URLs.
**TLDR:** **TLDR:**
Need high performance/load testing? Server-side. Need high performance/load testing? Server-side.
...@@ -32,7 +34,6 @@ Just quick testing/comparison? Client-side. ...@@ -32,7 +34,6 @@ Just quick testing/comparison? Client-side.
- You want optimal network performance (no port-forwarding overhead) - You want optimal network performance (no port-forwarding overhead)
- You're running automated CI/CD pipelines - You're running automated CI/CD pipelines
- You need isolated execution environments - You need isolated execution environments
- You're doing resource-intensive benchmarking
- You want persistent result storage in the cluster - You want persistent result storage in the cluster
**[Go to Server-Side Benchmarking (In-Cluster)](#server-side-benchmarking-in-cluster)** **[Go to Server-Side Benchmarking (In-Cluster)](#server-side-benchmarking-in-cluster)**
...@@ -49,18 +50,20 @@ Just quick testing/comparison? Client-side. ...@@ -49,18 +50,20 @@ Just quick testing/comparison? Client-side.
| **Results** | Local filesystem | Persistent volumes | | **Results** | Local filesystem | Persistent volumes |
| **Best for** | Light load | High load | | **Best for** | Light load | High load |
## What This Tool Does ## AIPerf Overview
[AIPerf](https://github.com/ai-dynamo/aiperf) is a standalone benchmarking tool available on [PyPI](https://pypi.org/project/aiperf/). It is pre-installed in Dynamo container images. Key features:
The framework is a Python-based wrapper around `aiperf` that: - Measures latency, throughput, TTFT, inter-token latency, and more
- Benchmarks any HTTP endpoints - Multiple load modes: concurrency, request-rate, trace replay
- Runs concurrency sweeps across configurable load levels - Automatic visualization with `aiperf plot` (Pareto curves, time series, GPU telemetry)
- Generates comparison plots with your custom labels - Interactive dashboard mode for real-time exploration
- Works with any HuggingFace-compatible model on NVIDIA GPUs (H200, H100, A100, etc.) - Arrival patterns (Poisson, constant, gamma) for realistic traffic simulation
- Provides direct Python script execution for maximum flexibility - Warmup phases, gradual ramping, and multi-URL load balancing
**Default sequence lengths**: Input: 2000 tokens, Output: 256 tokens (configurable with `--isl` and `--osl`) **Important**: The `--model` parameter must match the model deployed at the endpoint.
**Important**: The `--model` parameter configures AIPerf for benchmarking and provides logging context. The default `--model` value in the benchmarking script is `Qwen/Qwen3-0.6B`, but it must match the model deployed at the endpoint(s). For full documentation, see the [AIPerf docs](https://github.com/ai-dynamo/aiperf/tree/main/docs).
---
@@ -70,314 +73,261 @@ Client-side benchmarking runs on your local machine and connects to Kubernetes d
## Prerequisites
1. **Dynamo container environment** - You must be running inside a Dynamo container with the benchmarking tools pre-installed. 1. **Dynamo container environment** - You must be running inside a Dynamo container with AIPerf pre-installed, or install it locally:
```bash
pip install aiperf
```
2. **HTTP endpoints** - Ensure you have HTTP endpoints available for benchmarking. These can be:
- DynamoGraphDeployments exposed via HTTP endpoints
- External services (vLLM, llm-d, AIBrix, etc.)
- Any HTTP endpoint serving HuggingFace-compatible models - Any HTTP endpoint serving OpenAI-compatible models
3. **Benchmark dependencies** - Since benchmarks run locally, you need to install the required Python dependencies. Install them using:
```bash
pip install -r deploy/utils/requirements.txt
```
## User Workflow
Follow these steps to benchmark Dynamo deployments using client-side benchmarking: ### Step 1: Set Up Cluster and Deploy
### Step 1: Establish Kubernetes Cluster and Install Dynamo Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform following the [installation guide](../kubernetes/installation-guide.md). Then deploy your DynamoGraphDeployments using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends).
Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform. First follow the [installation guide](../kubernetes/installation-guide.md) to install Dynamo Kubernetes Platform, then use [deploy/utils/README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md) to set up benchmarking resources.
### Step 2: Deploy DynamoGraphDeployments ### Step 2: Port-Forward and Run a Single Benchmark
Deploy your DynamoGraphDeployments separately using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Each deployment should have a frontend service exposed.
> **Wait for model readiness.** Before benchmarking, ensure your deployment has fully loaded the model. Check pod logs or hit the health endpoint (`curl http://localhost:8000/health`) — it should return `200 OK` before you proceed.
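
The readiness check above can be scripted so that a benchmark never starts against a half-loaded model. The following is a minimal sketch, assuming the `/health` endpoint mentioned above is reachable at the port-forwarded URL; `probe_status` and `wait_for_ready` are hypothetical helper names, not part of Dynamo or AIPerf.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of Dynamo or AIPerf).

# Print the HTTP status code for a URL; curl prints 000 if the connection fails.
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Poll <base-url>/health until it returns 200 or the attempts run out.
wait_for_ready() {
  local base_url="$1" attempts="${2:-30}" delay="${3:-2}" i code
  for ((i = 1; i <= attempts; i++)); do
    code="$(probe_status "$base_url/health" || true)"
    if [ "$code" = "200" ]; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not ready after $attempts attempt(s)" >&2
  return 1
}

# Usage (assumes the port-forward to localhost:8000 is already active):
#   wait_for_ready http://localhost:8000 && echo "safe to start benchmarking"
```

The attempt count and delay defaults (30 and 2 seconds) are arbitrary; large models may need a longer window.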
### Step 3: Port-Forward and Benchmark Deployment A
```bash ```bash
# Port-forward the frontend service for deployment A # Port-forward the frontend service
kubectl port-forward -n <namespace> svc/<frontend-service-name> 8000:8000 > /dev/null 2>&1 &
# Note: remember to stop the port-forward process after benchmarking.
# Benchmark deployment A using Python scripts
python3 -m benchmarks.utils.benchmark \
--benchmark-name deployment-a \
--endpoint-url http://localhost:8000 \
--model "your-model-name" \
--output-dir ./benchmarks/results
```
### Step 4: [If Comparative] Teardown Deployment A and Establish Deployment B
If comparing multiple deployments, tear down deployment A and deploy deployment B with a different configuration.
### Step 5: [If Comparative] Port-Forward and Benchmark Deployment B # Run a single benchmark
```bash aiperf profile \
# Port-forward the frontend service for deployment B --model <your-model-name> \
kubectl port-forward -n <namespace> svc/<frontend-service-name> 8001:8000 > /dev/null 2>&1 & --url http://localhost:8000 \
--endpoint-type chat \
# Benchmark deployment B using Python scripts --streaming \
python3 -m benchmarks.utils.benchmark \ --concurrency 10 \
--benchmark-name deployment-b \ --request-count 100 \
--endpoint-url http://localhost:8001 \ --synthetic-input-tokens-mean 2000 \
--model "your-model-name" \ --output-tokens-mean 256
--output-dir ./benchmarks/results
``` ```
### Step 6: Generate Summary and Visualization This produces results in `artifacts/` and prints a summary table to the console:
```bash
# Generate plots and summary using Python plotting script
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results
# Or plot only specific benchmark experiments ```text
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-name experiment-a --benchmark-name experiment-b NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Time to First Token │ 234.56 │ 189.23 │ 298.45 │ 289.34 │ 267.12 │ 231.12 │ 28.45 │
│ (ms) │ │ │ │ │ │ │ │
│ Request Latency │ 1234.56 │ 987.34 │ 1567.89 │ 1534.23 │ 1456.78 │ 1223.45 │ 156.78 │
│ (ms) │ │ │ │ │ │ │ │
│ Inter Token Latency │ 15.67 │ 12.34 │ 19.45 │ 19.01 │ 18.23 │ 15.45 │ 1.89 │
│ (ms) │ │ │ │ │ │ │ │
│ Request Throughput │ 31.45 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ (requests/sec) │ │ │ │ │ │ │ │
└─────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
``` ```
## Use Cases *Actual numbers will vary based on model size, hardware, batch size, and network conditions. Client-side benchmarks include port-forwarding overhead — use [server-side benchmarking](#server-side-benchmarking-in-cluster) for accurate performance measurement.*
The benchmarking framework supports various comparative analysis scenarios:
- **Compare multiple DynamoGraphDeployments of a single backend** (e.g., aggregated vs disaggregated configurations) To stop the port-forward when done: `kill %1` (or `kill <PID>`).
- **Compare different backends** (e.g., SGLang vs TensorRT-LLM vs vLLM)
- **Compare Dynamo vs other platforms** (e.g., Dynamo vs llm-d vs AIBrix)
- **Compare different models** (e.g., Llama-3-8B vs Llama-3-70B vs Qwen3-0.6B)
- **Compare different hardware configurations** (e.g., H100 vs A100 vs H200)
- **Compare different parallelization strategies** (e.g., different GPU counts or memory configurations)
## Configuration and Usage ### Step 3: Concurrency Sweep for Pareto Analysis
### Command Line Options To understand how your deployment behaves across load levels, run a concurrency sweep. Each concurrency level sends enough requests for stable measurements (`max(c*3, 10)`):
```bash
python3 -m benchmarks.utils.benchmark --benchmark-name <name> --endpoint-url <endpoint_url> [OPTIONS] MODEL="<your-model-name>"
URL="http://localhost:8000"
REQUIRED:
--benchmark-name NAME Name/label for this benchmark (used in plots and results) for c in 1 2 5 10 50 100; do
--endpoint-url URL HTTP endpoint URL to benchmark (e.g., http://localhost:8000) aiperf profile \
--model "$MODEL" \
OPTIONS: --url "$URL" \
-h, --help Show help message and examples --endpoint-type chat \
-m, --model MODEL Model name for AIPerf configuration and logging (default: Qwen/Qwen3-0.6B) --streaming \
NOTE: This must match the model deployed at the endpoint --concurrency $c \
-i, --isl LENGTH Input sequence length (default: 2000) --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \
-s, --std STDDEV Input sequence standard deviation (default: 10) --synthetic-input-tokens-mean 2000 \
-o, --osl LENGTH Output sequence length (default: 256) --output-tokens-mean 256 \
-d, --output-dir DIR Output directory (default: ./benchmarks/results) --artifact-dir "artifacts/deployment-a/c$c"
--verbose Enable verbose output done
```
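To make the `max(c*3, 10)` rule concrete, the sketch below computes the request count for each concurrency level used in the sweep loop above. The function names `request_count_for` and `print_sweep_plan` are hypothetical; only the arithmetic comes from the text.

```bash
#!/usr/bin/env bash
# Hypothetical helper: requests to send at a given concurrency level,
# following the max(c*3, 10) rule described above.
request_count_for() {
  local c="$1"
  echo $(( c * 3 > 10 ? c * 3 : 10 ))
}

# Print the request count for each concurrency level passed as an argument.
print_sweep_plan() {
  local c
  for c in "$@"; do
    printf 'concurrency=%s request_count=%s\n' "$c" "$(request_count_for "$c")"
  done
}

print_sweep_plan 1 2 5 10 50 100
```

Levels 1 and 2 fall back to the floor of 10 requests, while concurrency 100 yields 300 requests.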
### Important Notes **Note**: Adjust concurrency levels to match your deployment's capacity. Very high concurrency on a small deployment (e.g., c250 on a single GPU) will cause server errors. Start with lower values and increase until you find the saturation point.
- **Benchmark Name**: The benchmark name becomes the label in plots and results ### Step 4: [If Comparative] Benchmark a Second Deployment
- **Name Restrictions**: Names can only contain letters, numbers, hyphens, and underscores. The name `plots` is reserved.
- **Port-Forwarding**: You must have an exposed endpoint before benchmarking
- **Model Parameter**: The `--model` parameter configures AIPerf for testing and logging, and must match the model deployed at the endpoint
- **Sequential Benchmarking**: For comparative benchmarks, deploy and benchmark each configuration separately
### What Happens During Benchmarking Tear down deployment A and deploy deployment B with a different configuration. Kill the previous port-forward (`kill %1`), then repeat:
The Python benchmarking module: ```bash
1. **Connects** to your port-forwarded endpoint kubectl port-forward -n <namespace> svc/<frontend-service-b> 8000:8000 > /dev/null 2>&1 &
2. **Benchmarks** using AIPerf at various concurrency levels (default: 1, 2, 5, 10, 50, 100, 250)
3. **Measures** key metrics: latency, throughput, time-to-first-token for c in 1 2 5 10 50 100; do
4. **Saves** results to an output directory organized by benchmark name aiperf profile \
--model "$MODEL" \
The Python plotting module: --url "$URL" \
1. **Generates** comparison plots using your benchmark name in `<OUTPUT_DIR>/plots/` --endpoint-type chat \
2. **Creates** summary statistics and visualizations --streaming \
--concurrency $c \
### Plotting Options --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \
--synthetic-input-tokens-mean 2000 \
--output-tokens-mean 256 \
--artifact-dir "artifacts/deployment-b/c$c"
done
```
The plotting script supports several options for customizing which experiments to visualize: ### Step 5: Generate Visualizations
```bash ```bash
# Plot all benchmark experiments in the data directory # Compare all runs — auto-detects multi-run directories
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results aiperf plot artifacts/deployment-a artifacts/deployment-b
# Plot only specific benchmark experiments # Or compare all subdirectories under a parent
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-name experiment-a --benchmark-name experiment-b aiperf plot artifacts/
# Specify custom output directory for plots # Launch interactive dashboard for exploration
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --output-dir ./custom-plots aiperf plot artifacts/ --dashboard
``` ```
**Available Options:** AIPerf automatically generates plots based on available data:
- `--data-dir`: Directory containing benchmark results (required) - **TTFT vs Throughput** — find the sweet spot between responsiveness and capacity (always generated for multi-run comparisons)
- `--benchmark-name`: Specific benchmark experiment name to plot (can be specified multiple times). Names must match subdirectory names under the data dir. - **Pareto Curves** — throughput per GPU vs latency and interactivity (only generated when GPU telemetry data is available — add `--gpu-telemetry` during profiling if DCGM is running)
- `--output-dir`: Custom output directory for plots (defaults to data-dir/plots) - **Time series** — per-request TTFT, ITL, and latency over time (generated for single-run analysis)
**Note**: If `--benchmark-name` is not specified, the script will plot all subdirectories found in the data directory.
### Using Your Own Models and Configuration
The benchmarking framework supports any HuggingFace-compatible LLM model. Specify your model in the benchmark script's `--model` parameter. It must match the model name of the deployment. You can override the default sequence lengths (2000/256 tokens) with `--isl` and `--osl` flags if needed for your specific workload.
The benchmarking framework is built around Python modules that provide direct control over the benchmark workflow. The Python benchmarking module connects to your existing endpoints, runs the benchmarks, and can generate plots. Deployment is user-managed and out of scope for this tool. Here is an example Pareto frontier from a concurrency sweep of Qwen3-0.6B on 8x H200 with vLLM, showing the tradeoff between user experience (tokens/sec per user) and resource efficiency (tokens/sec per GPU):
### Comparison Limitations ![AIPerf Pareto Frontier](../assets/img/aiperf-pareto-frontier.png)
The plotting system supports up to 12 different benchmarks in a single comparison. See the [AIPerf Visualization Guide](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/plot.md) for full details on plot customization, experiment classification, and themes.
### Concurrency Configuration ## Use Cases
You can customize the concurrency levels using the CONCURRENCIES environment variable: - **Compare DynamoGraphDeployments** (e.g., aggregated vs disaggregated configurations)
- **Compare different backends** (e.g., SGLang vs TensorRT-LLM vs vLLM)
- **Compare Dynamo vs other platforms** (e.g., Dynamo vs llm-d vs AIBrix)
- **Compare different models** (e.g., Llama-3-8B vs Llama-3-70B vs Qwen3-0.6B)
- **Compare different hardware configurations** (e.g., H100 vs A100 vs H200)
- **Compare different parallelization strategies** (e.g., different GPU counts or memory configurations)
```bash ## AIPerf Quick Reference
# Custom concurrency levels
CONCURRENCIES="1,5,20,50" python3 -m benchmarks.utils.benchmark \
--benchmark-name my-test \
--endpoint-url http://localhost:8000
# Or set permanently
export CONCURRENCIES="1,2,5,10,25,50,100"
python3 -m benchmarks.utils.benchmark \
--benchmark-name test \
--endpoint-url http://localhost:8000
```
### Request Count Configuration ### Commonly Used Options
The number of requests sent per concurrency level is auto-computed as `max(concurrency * 3, 10)` by default. This ensures each concurrency slot runs enough requests for stable measurements. You can override this with the `REQUEST_COUNT` environment variable: ```text
aiperf profile [OPTIONS]
```bash REQUIRED:
# Fixed request count for all concurrency levels --model MODEL Model name (must match the deployed model)
REQUEST_COUNT=500 python3 -m benchmarks.utils.benchmark \ --url URL Endpoint URL (e.g., http://localhost:8000)
--benchmark-name my-test \
--endpoint-url http://localhost:8000 COMMON OPTIONS:
--endpoint-type TYPE Endpoint type: chat, completions, embeddings (default: chat)
# Combined with custom concurrency levels --streaming Enable streaming responses
CONCURRENCIES="1,10,50,200" REQUEST_COUNT=1000 python3 -m benchmarks.utils.benchmark \ --concurrency N Number of concurrent requests
--benchmark-name high-load-test \ --request-rate N Target requests per second (alternative to --concurrency)
--endpoint-url http://localhost:8000 --request-count N Total number of requests to send
--benchmark-duration N Run for N seconds instead of a fixed request count
--synthetic-input-tokens-mean N Average input sequence length in tokens
--output-tokens-mean N Average output sequence length in tokens
--artifact-dir DIR Output directory for results (default: artifacts/)
--warmup-request-count N Warmup requests before measurement
--ui TYPE UI mode: dashboard, simple, none (default: dashboard)
``` ```
**Important**: The request count must be greater than or equal to the concurrency level. If the request count is too low, the actual in-flight concurrency will be capped at the request count, leading to inaccurate results at higher concurrency levels. For the complete CLI reference, see `aiperf profile --help` or the [CLI docs](https://github.com/ai-dynamo/aiperf/blob/main/docs/cli-options.md).
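The capping effect described above is a simple minimum: a run cannot keep more requests in flight than it sends in total. The sketch below shows one way to guard a sweep against that; `effective_concurrency` and `check_request_count` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Hypothetical helper: the in-flight concurrency a run can actually reach,
# i.e. min(concurrency, request_count).
effective_concurrency() {
  local concurrency="$1" request_count="$2"
  if [ "$request_count" -lt "$concurrency" ]; then
    echo "$request_count"
  else
    echo "$concurrency"
  fi
}

# Warn and return non-zero when the request count would cap concurrency.
check_request_count() {
  local concurrency="$1" request_count="$2"
  if [ "$request_count" -lt "$concurrency" ]; then
    echo "warning: request count $request_count < concurrency $concurrency;" \
         "effective concurrency is $(effective_concurrency "$concurrency" "$request_count")" >&2
    return 1
  fi
}

# Usage: check_request_count 200 50 || echo "raise the request count first"
```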
## Understanding Your Results
After benchmarking completes, check `./benchmarks/results/` (or your custom output directory):
### Plot Labels and Organization
The plotting script uses the `--benchmark-name` as the experiment name in all generated plots. For example:
- `--benchmark-name aggregated` → plots will show "aggregated" as the label
- `--benchmark-name vllm-disagg` → plots will show "vllm-disagg" as the label
This allows you to easily identify and compare different configurations in the visualization plots. ### Output Sequence Length
### Summary and Plots To enforce a specific output length, pass `max_tokens`, `min_tokens`, and `ignore_eos` via `--extra-inputs`:
```text ```bash
benchmarks/results/plots aiperf profile \
├── SUMMARY.txt # Quick overview of all results --model <model> \
├── p50_inter_token_latency_vs_concurrency.png # Token generation speed --url http://localhost:8000 \
├── avg_time_to_first_token_vs_concurrency.png # Response time --endpoint-type chat \
├── request_throughput_vs_concurrency.png # Requests per second --streaming \
├── efficiency_tok_s_gpu_vs_user.png # GPU efficiency --concurrency 10 \
└── avg_inter_token_latency_vs_concurrency.png # Average latency --output-tokens-mean 256 \
--extra-inputs max_tokens:256 \
--extra-inputs min_tokens:256 \
--extra-inputs ignore_eos:true
```
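When the same fixed output length is reused across a sweep, the repeated flags can be generated once. This is a small sketch that only rearranges the flags shown in the command above; `fixed_output_length_args` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print, one per line, the flags that pin the output
# length to N tokens (the same flags as in the command above).
fixed_output_length_args() {
  local n="$1"
  printf '%s\n' \
    --output-tokens-mean "$n" \
    --extra-inputs "max_tokens:$n" \
    --extra-inputs "min_tokens:$n" \
    --extra-inputs "ignore_eos:true"
}

# Usage (bash 4+ for mapfile):
#   mapfile -t osl_args < <(fixed_output_length_args 256)
#   aiperf profile --model <model> --url http://localhost:8000 "${osl_args[@]}"
```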
### Data Files ### Understanding Results
Raw data is organized by deployment/benchmark type and concurrency level: Each `aiperf profile` run produces an artifact directory containing:
- **`profile_export_aiperf.json`** — Structured metrics (latency, throughput, TTFT, ITL, etc.)
- **`profile_export.jsonl`** — Per-request raw data
- **`profile_export_aiperf.csv`** — CSV format metrics
**For Any Benchmarking (uses your custom benchmark name):** Results are organized by the `--artifact-dir` you specify. For concurrency sweeps, a common pattern is:
```text
results/ # Client-side: ./benchmarks/results/ or custom dir
├── plots/ # Server-side: /data/results/
│ ├── SUMMARY.txt # Performance visualization plots
│ ├── p50_inter_token_latency_vs_concurrency.png
│ ├── avg_inter_token_latency_vs_concurrency.png
│ ├── request_throughput_vs_concurrency.png
│ ├── efficiency_tok_s_gpu_vs_user.png
│ └── avg_time_to_first_token_vs_concurrency.png
├── <your-benchmark-name>/ # Results for your benchmark (uses your custom name)
│ ├── c1/ # Concurrency level 1
│ │ └── profile_export_aiperf.json
│ ├── c2/ # Concurrency level 2
│ ├── c5/ # Concurrency level 5
│ └── ... # Other concurrency levels (10, 50, 100, 250)
└── <your-benchmark-name-N>/ # Results for additional benchmarking runs
└── c*/ # Same structure as above
```
**Example with actual benchmark names:**
```text
results/ artifacts/
├── plots/ ├── deployment-a/
├── experiment-a/ # --benchmark-name experiment-a │ ├── c1/
├── experiment-b/ # --benchmark-name experiment-b │ │ ├── profile_export_aiperf.json
└── experiment-c/ # --benchmark-name experiment-c │ │ └── profile_export.jsonl
│ ├── c10/
│ ├── c50/
│ └── c100/
├── deployment-b/
│ ├── c1/
│ ├── c10/
│ ├── c50/
│ └── c100/
└── plots/ # Generated by aiperf plot
├── ttft_vs_throughput.png
├── pareto_curve_throughput_per_gpu_vs_latency.png # If GPU telemetry available
└── pareto_curve_throughput_per_gpu_vs_interactivity.png # If GPU telemetry available
```
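Before plotting, it can help to confirm that every concurrency level in a sweep actually produced its metrics file. The sketch below assumes the `c<N>/` directory naming and the `profile_export_aiperf.json` file name shown in the layout above; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the concurrency levels found under one
# deployment's artifact directory (c1/, c10/, ...), in numeric order.
list_sweep_levels() {
  local dir="$1" d
  for d in "$dir"/c[0-9]*/; do
    [ -d "$d" ] || continue
    basename "$d"
  done | sed 's/^c//' | sort -n
}

# Print the levels that are missing the structured metrics file.
find_incomplete_levels() {
  local dir="$1" level
  for level in $(list_sweep_levels "$dir"); do
    [ -f "$dir/c$level/profile_export_aiperf.json" ] || echo "c$level"
  done
}

# Usage: find_incomplete_levels artifacts/deployment-a
```

An empty result from `find_incomplete_levels` means every level directory contains a metrics file.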
Each concurrency directory contains:
- **`profile_export_aiperf.json`** - Structured metrics from AIPerf
- **`profile_export_aiperf.csv`** - CSV format metrics from AIPerf
- **`profile_export.json`** - Raw AIPerf results
- **`inputs.json`** - Generated test inputs
---
# Server-Side Benchmarking (In-Cluster)
Server-side benchmarking runs directly within the Kubernetes cluster, eliminating the need for port forwarding and providing better resource utilization. Server-side benchmarking runs directly within the Kubernetes cluster, eliminating port-forwarding overhead and enabling high-load testing.
## What Server-Side Benchmarking Does
The server-side benchmarking solution:
- Runs benchmarks directly within the Kubernetes cluster using internal service URLs
- Uses Kubernetes service DNS for direct communication (no port forwarding required)
- Leverages the existing benchmarking infrastructure (`benchmarks.utils.benchmark`)
- Stores results persistently using `dynamo-pvc`
- Provides isolated execution environment with configurable resources
- Handles high load/speed requirements without timeout issues
- **Note**: Each benchmark job runs within a single Kubernetes namespace, but can benchmark services across multiple namespaces using the full DNS format `svc_name.namespace.svc.cluster.local`
## Prerequisites
1. **Kubernetes cluster** with NVIDIA GPUs and Dynamo namespace setup (see [Dynamo Kubernetes Platform docs](../kubernetes/README.md))
2. **Storage** PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md)) 2. **Storage**: PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md))
3. **Docker image** containing the Dynamo benchmarking tools 3. **Docker image** containing AIPerf (Dynamo runtime images include it)
## Quick Start
### Step 1: Deploy Your DynamoGraphDeployment
Deploy your DynamoGraphDeployment using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Ensure it has a frontend service exposed. Deploy using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Ensure it has a frontend service exposed and the model is fully loaded before running benchmarks — check pod logs or verify the health endpoint returns `200 OK`.
### Step 2: Configure and Run Benchmark Job
### Step 2: Deploy and Run Benchmark Job First, edit `benchmarks/incluster/benchmark_job.yaml` to match your deployment:
**Note**: The server-side benchmarking job requires a Docker image containing the Dynamo benchmarking tools. Before the 0.5.1 release, you must build your own Docker image using the [container build instructions](https://github.com/ai-dynamo/dynamo/blob/main/container/README.md), push it to your container registry, then update the `image` field in `benchmarks/incluster/benchmark_job.yaml` to use your built image tag. - **Model name**: Update the `MODEL` variable
- **Service URL**: Update the `URL` variable (use `<svc_name>.<namespace>.svc.cluster.local:port` for cross-namespace access)
- **Concurrency levels**: Adjust the `for c in ...` loop
- **Docker image**: Update the `image` field if needed
Then deploy:
```bash
export NAMESPACE=benchmarking
# Deploy the benchmark job with default settings # Deploy the benchmark job
kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE
# Monitor the job, wait for it to complete # Monitor the job
kubectl logs -f job/dynamo-benchmark -n $NAMESPACE
```
#### Customize the job configuration
To customize the benchmark parameters, edit the `benchmarks/incluster/benchmark_job.yaml` file and modify:
- **Model name**: Change `"Qwen/Qwen3-0.6B"` in the args section
- **Benchmark name**: Change `"qwen3-0p6b-vllm-agg"` to your desired benchmark name
- **Service URL**: Change `"vllm-agg-frontend:8000"` so the service URL matches your deployed service
- **Docker image**: Change the image field if needed
Then deploy:
```bash
kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE
```
### Step 3: Retrieve Results
```bash
# Create access pod (skip this step if access pod is already running) # Create access pod (skip if already running)
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
# Download the results
kubectl cp $NAMESPACE/pvc-access-pod:/data/results/<benchmark-name> ./benchmarks/results/<benchmark-name> kubectl cp $NAMESPACE/pvc-access-pod:/data/results ./results
# Cleanup
kubectl delete pod pvc-access-pod -n $NAMESPACE
@@ -385,156 +335,55 @@ kubectl delete pod pvc-access-pod -n $NAMESPACE
### Step 4: Generate Plots
```bash
# Generate performance plots from the downloaded results aiperf plot ./results
python3 -m benchmarks.utils.plot \
--data-dir ./benchmarks/results
``` ```
This will create visualization plots. For more details on interpreting these plots, see the [Summary and Plots](#summary-and-plots) section above.
## Cross-Namespace Service Access
Server-side benchmarking can benchmark services across multiple namespaces from a single job using Kubernetes DNS. When referencing services in other namespaces, use the full DNS format: When referencing services in other namespaces, use full Kubernetes DNS:
```bash ```bash
# Access service in same namespace # Same namespace
SERVICE_URL=vllm-agg-frontend:8000 --url http://vllm-agg-frontend:8000
# Access service in different namespace
SERVICE_URL=vllm-agg-frontend.production.svc.cluster.local:8000
```
**DNS Format**: `<service-name>.<namespace>.svc.cluster.local:port`
This allows you to:
- Benchmark multiple services across different namespaces in a single job
- Compare services running in different environments (dev, staging, production)
- Test cross-namespace integrations without port-forwarding
- Run comprehensive cross-namespace performance comparisons
## Configuration
The benchmark job is configured directly in the YAML file.
### Default Configuration
- **Model**: `Qwen/Qwen3-0.6B`
- **Benchmark Name**: `qwen3-0p6b-vllm-agg`
- **Service**: `vllm-agg-frontend:8000`
- **Docker Image**: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag`
### Customizing the Job
To customize the benchmark, edit `benchmarks/incluster/benchmark_job.yaml`:
1. **Change the model**: Update the `--model` argument
2. **Change the benchmark name**: Update the `--benchmark-name` argument
3. **Change the service URL**: Update the `--endpoint-url` argument (use `<svc_name>.<namespace>.svc.cluster.local:port` for cross-namespace access)
4. **Change Docker image**: Update the image field if needed
### Example: Multi-Namespace Benchmarking
To benchmark services across multiple namespaces, you would need to run separate benchmark jobs for each service since the format supports one benchmark per job. However, the results are stored in the same PVC and may be accessed together. # Different namespace
--url http://vllm-agg-frontend.production.svc.cluster.local:8000
```yaml
# Job 1: Production service
args:
- --model
- "Qwen/Qwen3-0.6B"
- --benchmark-name
- "prod-vllm"
- --endpoint-url
- "vllm-agg-frontend.production.svc.cluster.local:8000"
- --output-dir
- /data/results
# Job 2: Staging service
args:
- --model
- "Qwen/Qwen3-0.6B"
- --benchmark-name
- "staging-vllm"
- --endpoint-url
- "vllm-agg-frontend.staging.svc.cluster.local:8000"
- --output-dir
- /data/results
```
## Understanding Your Results
Results are stored in `/data/results` and follow the same structure as client-side benchmarking:
```text
/data/results/
└── <benchmark-name>/ # Results for your benchmark name
├── c1/ # Concurrency level 1
│ └── profile_export_aiperf.json
├── c2/ # Concurrency level 2
└── ... # Other concurrency levels
```
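The same-namespace and cross-namespace address forms described in the Cross-Namespace Service Access section differ only in whether the namespace suffix is present. The sketch below builds either form; `service_url` is a hypothetical helper, and it assumes the default `cluster.local` cluster domain and plain HTTP.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build an endpoint URL for a Kubernetes service.
# With a namespace it uses the full cluster DNS name; without one it uses
# the short name, which resolves only inside the service's own namespace.
# Assumes the default cluster domain (cluster.local).
service_url() {
  local service="$1" port="$2" namespace="${3:-}"
  if [ -n "$namespace" ]; then
    echo "http://${service}.${namespace}.svc.cluster.local:${port}"
  else
    echo "http://${service}:${port}"
  fi
}

service_url vllm-agg-frontend 8000
service_url vllm-agg-frontend 8000 production
```

The two calls print the same-namespace and `production` namespace URLs used in the examples above.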
## Monitoring and Debugging
### Check Job Status
```bash ```bash
# Check job status
kubectl describe job dynamo-benchmark -n $NAMESPACE
```
### View Logs # Follow logs
```bash
# Follow logs in real-time
kubectl logs -f job/dynamo-benchmark -n $NAMESPACE
```
### Debug Failed Jobs
```bash
# Check pod status
kubectl get pods -n $NAMESPACE -l job-name=dynamo-benchmark
# Describe failed pod # Debug failed pod
kubectl describe pod <pod-name> -n $NAMESPACE
```
## Troubleshooting ### Troubleshooting
### Common Issues
1. **Service not found**: Ensure your DynamoGraphDeployment frontend service is running
3. **PVC access**: Check that `dynamo-pvc` is properly configured and accessible 2. **PVC access**: Check that `dynamo-pvc` is properly configured and accessible
4. **Image pull issues**: Ensure the Docker image is accessible from the cluster 3. **Image pull issues**: Ensure the Docker image is accessible from the cluster
5. **Resource constraints**: Adjust resource limits if the job is being evicted 4. **Resource constraints**: Adjust resource limits if the job is being evicted
### Debug Commands
```bash
# Check PVC status
kubectl get pvc dynamo-pvc -n $NAMESPACE
# Check service endpoints # Verify service exists and has endpoints
kubectl get svc -n $NAMESPACE
kubectl get endpoints <service-name> -n $NAMESPACE
# Verify your service exists and has endpoints
SVC_NAME="${SERVICE_URL%%:*}"
kubectl get svc "$SVC_NAME" -n "$NAMESPACE"
kubectl get endpoints "$SVC_NAME" -n "$NAMESPACE"
```
---
## Customize Benchmarking Behavior
The built-in Python workflow connects to endpoints, benchmarks with aiperf, and generates plots. If you want to modify the behavior:
1. **Extend the workflow**: Modify `benchmarks/utils/workflow.py` to add custom deployment types or metrics collection
2. **Generate different plots**: Modify `benchmarks/utils/plot.py` to generate a different set of plots for whatever you wish to visualize.
3. **Direct module usage**: Use individual Python modules (`benchmarks.utils.benchmark`, `benchmarks.utils.plot`) for granular control over each step of the benchmarking process.
The Python benchmarking module provides a complete end-to-end benchmarking experience with full control over the workflow.
---
## Testing with Mocker Backend
For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) that simulates LLM inference without requiring actual GPU resources. This is useful for:
@@ -547,3 +396,22 @@ For development and testing purposes, Dynamo provides a [mocker backend](https:/
The mocker backend mimics the API and behavior of real backends (SGLang, TensorRT-LLM, vLLM) but generates mock responses instead of running actual inference.
See the [mocker directory](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) for usage examples and configuration options.
---
## Advanced AIPerf Features
AIPerf has many capabilities beyond basic profiling. The following are particularly useful for Dynamo benchmarking:
| Feature | Description | Docs |
|---------|-------------|------|
| Trace Replay | Replay production traces for deterministic benchmarking | [Trace Replay](https://github.com/ai-dynamo/aiperf/blob/main/docs/benchmark-modes/trace-replay.md) |
| Arrival Patterns | Poisson, constant, gamma traffic distributions | [Arrival Patterns](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/arrival-patterns.md) |
| Gradual Ramping | Smooth ramp-up of concurrency and request rate | [Ramping](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/ramping.md) |
| Warmup Phase | Eliminate cold-start effects from measurements | [Warmup](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/warmup.md) |
| Multi-URL Load Balancing | Distribute requests across multiple endpoints | [Multi-URL](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/multi-url-load-balancing.md) |
| GPU Telemetry | Collect DCGM metrics during benchmarking | [GPU Telemetry](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/gpu-telemetry.md) |
| Goodput Analysis | SLO-based throughput measurement | [Goodput](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/goodput.md) |
| Timeslice Analysis | Per-timeslice performance breakdown | [Timeslices](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/timeslices.md) |
| Multi-Turn Conversations | Benchmark multi-turn chat workloads | [Multi-Turn](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/multi-turn.md) |
| Experiment Classification | Baseline vs treatment semantic colors in plots | [Plotting](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/plot.md) |