Commit 699996e4 authored by hhzhang16, committed by GitHub

feat: add benchmarking guide (#2620)


Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
parent 3c4adde5
......@@ -151,6 +151,13 @@ Rerun with `curl -N` and change `stream` in the request to `true` to get the res
- Check out [Backends](components/backends) to deploy various workflow configurations (e.g. SGLang with router, vLLM with disaggregated serving, etc.)
- Run some [Examples](examples) to learn about building components in Dynamo and exploring various integrations.
### Benchmarking Dynamo
Dynamo provides comprehensive benchmarking tools to evaluate and optimize your deployments:
* **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies (aggregated vs. disaggregated vs. vanilla vLLM) using GenAI-Perf
* **[Pre-Deployment Profiling](docs/benchmarks/pre_deployment_profiling.md)** – Optimize configurations before deployment to meet SLA requirements
# Engines
Dynamo is designed to be inference engine agnostic. To use any engine with Dynamo, NATS and etcd need to be installed, along with a Dynamo frontend (`python -m dynamo.frontend [--interactive]`).
......
......@@ -15,19 +15,72 @@
# Benchmarks
This directory contains benchmarking scripts and tools for performance evaluation.
This directory contains benchmarking scripts and tools for performance evaluation of Dynamo deployments. The benchmarking framework is a wrapper around genai-perf that makes it easy to benchmark DynamoGraphDeployments and compare them with external endpoints.
## Quick Start
### Benchmark an Existing Endpoint
```bash
./benchmark.sh --namespace my-namespace --input my-endpoint=http://your-endpoint:8000
```
### Benchmark Dynamo Deployments
```bash
# Benchmark disaggregated vLLM with custom label
./benchmark.sh --namespace my-namespace --input vllm-disagg=components/backends/vllm/deploy/disagg.yaml
# Benchmark TensorRT-LLM disaggregated deployment
./benchmark.sh --namespace my-namespace --input trtllm-disagg=components/backends/trtllm/deploy/disagg.yaml
# Compare multiple Dynamo deployments
./benchmark.sh --namespace my-namespace \
--input agg=components/backends/vllm/deploy/agg.yaml \
--input disagg=components/backends/vllm/deploy/disagg.yaml
# Compare Dynamo vs external endpoint
./benchmark.sh --namespace my-namespace \
--input dynamo=components/backends/vllm/deploy/disagg.yaml \
--input external=http://localhost:8000
```
**Note**:
- The sample manifests may reference private registry images. Update the `image:` fields to use accessible images from [Dynamo NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts) or your own registry before running.
- Only DynamoGraphDeployment manifests are supported for automatic deployment. To benchmark non-Dynamo backends (vLLM, TensorRT-LLM, SGLang, etc.), deploy them manually using their Kubernetes guides and use the endpoint option.
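
How an `--input` value is routed to one of the two modes can be sketched in a few lines of Python. This is a minimal illustration, assuming the same rule the wrapper script's `validate_config` applies (a value starting with `http://` or `https://` is treated as an endpoint, anything else as a manifest path). `classify_input` is a hypothetical helper name used only here; it is not part of the benchmarking package.

```python
import re


def classify_input(value: str) -> str:
    """Return "endpoint" for http(s) URLs and "manifest" for everything else.

    Hypothetical helper for illustration only. It mirrors the URL check in
    benchmark.sh but does not verify that a manifest file actually exists.
    """
    if re.match(r"^https?://", value):
        return "endpoint"
    return "manifest"


if __name__ == "__main__":
    samples = (
        "http://localhost:8000",
        "components/backends/vllm/deploy/disagg.yaml",
    )
    for value in samples:
        print(f"{value} -> {classify_input(value)}")
```

Anything classified as a manifest is expected to be a DynamoGraphDeployment file; other backends need to be exposed as an HTTP endpoint first.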
## Features
The benchmarking framework supports:
**Two Benchmarking Modes:**
- **Endpoint Benchmarking**: Test existing HTTP endpoints without deployment overhead
- **Deployment Benchmarking**: Deploy, test, and clean up DynamoGraphDeployments automatically
**Flexible Configuration:**
- User-defined labels for each input using `--input label=value` format
- Support for multiple inputs to enable comparisons
- Customizable concurrency levels (via the `CONCURRENCIES` environment variable), sequence lengths, and models
- Automated performance plot generation with custom labels
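
As a rough sketch of how the `CONCURRENCIES` override behaves, the snippet below parses a comma-separated list and falls back to the defaults when the value is missing or invalid. It is modeled on `get_concurrency_levels` in `genai_perf.py` from this change; `parse_concurrencies` and `FALLBACK_LEVELS` are hypothetical names introduced for illustration, and the real function also prints a warning before falling back.

```python
import os
from typing import List, Optional

# Default sweep levels, copied from DEFAULT_CONCURRENCIES in genai_perf.py.
FALLBACK_LEVELS: List[int] = [1, 2, 5, 10, 50, 100, 250]


def parse_concurrencies(raw: Optional[str]) -> List[int]:
    """Parse a comma-separated list of positive integers.

    Returns the sorted levels, or a copy of FALLBACK_LEVELS when the value
    is unset, empty, non-numeric, or contains a non-positive number.
    """
    if not raw:
        return list(FALLBACK_LEVELS)
    try:
        levels = [int(part.strip()) for part in raw.split(",")]
    except ValueError:
        return list(FALLBACK_LEVELS)
    if any(level <= 0 for level in levels):
        return list(FALLBACK_LEVELS)
    return sorted(levels)


if __name__ == "__main__":
    # Reads the same environment variable the benchmark uses.
    print(parse_concurrencies(os.getenv("CONCURRENCIES")))
```

Under these assumptions, exporting `CONCURRENCIES="10, 1, 5"` before running the benchmark would sweep levels 1, 5, and 10, in that order.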
**Supported Backends:**
- DynamoGraphDeployments
- External HTTP endpoints (for comparison with non-Dynamo backends)
## Installation
This is already included as part of the dynamo vllm image. To install locally or standalone, run:
This is already included as part of the Dynamo container images. To install locally or standalone:
```bash
pip install -e .
```
Currently, this will install lightweight tools for:
## Data Generation Tools
This directory also includes lightweight tools for:
- Analyzing prefix-structured data (`datagen analyze`)
- Synthesizing structured data customizable for testing purposes (`datagen synthesize`)
Detailed information are provided in the `prefix_data_generator` directory.
The benchmarking scripts for the core dynamo components are to come soon (e.g. routing, disagg, Planner).
\ No newline at end of file
Detailed information is provided in the `prefix_data_generator` directory.
## Comprehensive Guide
For detailed documentation, configuration options, and advanced usage, see the [complete benchmarking guide](../docs/benchmarks/benchmarking.md).
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -euo pipefail
# Script directory
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DYNAMO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
# Configuration - all set via command line arguments
NAMESPACE=""
MODEL="deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
ISL=2000
STD=10
OSL=256
OUTPUT_DIR="./benchmarks/results"
# Input configurations stored as associative arrays
declare -A INPUT_LABELS
declare -A INPUT_VALUES
# Flags
VERBOSE=false
show_help() {
cat << EOF
Dynamo Benchmark Runner
This script is a wrapper around genai-perf that benchmarks Dynamo LLM deployments and
plots the results in an easy-to-use way. It supports comparing multiple DynamoGraphDeployments
or endpoints with custom labels defined by you.
The client runs locally and connects to your deployments/endpoints for benchmarking.
USAGE:
$0 --namespace NAMESPACE --input <label>=<manifest_or_endpoint> [--input <label>=<manifest_or_endpoint>]... [OPTIONS]
REQUIRED:
-n, --namespace NAMESPACE Kubernetes namespace
--input <label>=<manifest_path_or_endpoint> Benchmark input with custom label
- <label>: becomes the name/label in plots
- <manifest_path_or_endpoint>: either a DynamoGraphDeployment manifest or HTTP endpoint URL
Can be specified multiple times for comparisons
OPTIONS:
-h, --help Show this help message
-m, --model MODEL Model name for GenAI-Perf configuration and logging (default: deepseek-ai/DeepSeek-R1-Distill-Llama-8B)
NOTE: This must match the model configured in your deployment manifests and the model deployed in any endpoints.
-i, --isl LENGTH Input sequence length (default: $ISL)
-s, --std STDDEV Input sequence standard deviation (default: $STD)
-o, --osl LENGTH Output sequence length (default: $OSL)
-d, --output-dir DIR Output directory (default: $OUTPUT_DIR)
--verbose Enable verbose output
EXAMPLES:
# Compare aggregated vs disaggregated Dynamo deployments
$0 --namespace \$NAMESPACE \\
--input agg=components/backends/vllm/deploy/agg.yaml \\
--input disagg=components/backends/vllm/deploy/disagg.yaml
# Compare Dynamo deployment vs external endpoint
$0 --namespace \$NAMESPACE \\
--input dynamo=components/backends/vllm/deploy/disagg.yaml \\
--input external=http://localhost:8000
# Compare three different configurations
$0 --namespace \$NAMESPACE \\
--input dynamo-agg=components/backends/vllm/deploy/agg.yaml \\
--input dynamo-disagg=components/backends/vllm/deploy/disagg.yaml \\
--input external-vllm=http://localhost:8000
# Benchmark a single Dynamo deployment
$0 --namespace \$NAMESPACE \\
--input my-setup=components/backends/vllm/deploy/disagg.yaml
# Benchmark single external endpoint
$0 --namespace \$NAMESPACE \\
--input production=http://localhost:8000
DEPLOYMENT TYPES:
- DynamoGraphDeployment: Supports various Dynamo deployment configurations including:
* Aggregated deployments (prefill and decode together)
* Disaggregated deployments (prefill and decode separate)
* Router deployments
* Planner deployments
* And other Dynamo configurations
- External Endpoints: For comparing against non-Dynamo backends
NOTE:
- Only DynamoGraphDeployment manifests are supported for automatic deployment.
- To benchmark non-Dynamo backends (vLLM, TensorRT-LLM, SGLang, etc.), deploy them
manually following their Kubernetes deployment guides, expose a port (e.g. via port-forward),
and use the endpoint option.
- For Dynamo deployment setup, setup_k8s_namespace.sh provides fully encapsulated
deployment setup including namespace creation, CRDs, and operator installation.
- The --model flag configures GenAI-Perf and should match what's configured in your deployment manifests and endpoints.
- Only one model can be benchmarked at a time across all inputs.
EOF
}
parse_input() {
local input_arg="$1"
# Basic format validation: must contain exactly one '=' character
if [[ ! "$input_arg" =~ ^[^=]+=[^=]+$ ]]; then
echo "ERROR: Invalid input format. Expected: <label>=<manifest_path_or_endpoint>" >&2
echo "Got: $input_arg" >&2
echo "Format must be: key=value with exactly one '=' character" >&2
exit 1
fi
# Split on the first '=' character
local label="${input_arg%%=*}"
local value="${input_arg#*=}"
# Basic validation - detailed validation will be done in Python
if [[ -z "$label" ]]; then
echo "ERROR: Label cannot be empty in input: $input_arg" >&2
exit 1
fi
if [[ -z "$value" ]]; then
echo "ERROR: Value cannot be empty in input: $input_arg" >&2
exit 1
fi
# Check for duplicate labels
if [[ -n "${INPUT_LABELS[$label]:-}" ]]; then
echo "ERROR: Duplicate label '$label' found. Each label must be unique." >&2
exit 1
fi
# Store the input
INPUT_LABELS["$label"]=1
INPUT_VALUES["$label"]="$value"
echo "Added input: $label -> $value"
}
parse_args() {
while [[ $# -gt 0 ]]; do
case $1 in
-h|--help)
show_help
exit 0
;;
-n|--namespace)
NAMESPACE="$2"
shift 2
;;
-m|--model)
MODEL="$2"
shift 2
;;
-i|--isl)
ISL="$2"
shift 2
;;
-s|--std)
STD="$2"
shift 2
;;
-o|--osl)
OSL="$2"
shift 2
;;
-d|--output-dir)
OUTPUT_DIR="$2"
shift 2
;;
--input)
parse_input "$2"
shift 2
;;
--verbose)
VERBOSE=true
shift
;;
*)
echo "Unknown option: $1" >&2
echo "Use --help for usage information." >&2
exit 1
;;
esac
done
}
validate_config() {
local errors=()
if [[ -z "$NAMESPACE" ]]; then
errors+=("--namespace is required")
fi
# Check that at least one input is specified
if [[ ${#INPUT_LABELS[@]} -eq 0 ]]; then
errors+=("At least one --input must be specified")
fi
if [[ ${#errors[@]} -gt 0 ]]; then
echo "ERROR: Missing required arguments:" >&2
for error in "${errors[@]}"; do
echo " $error" >&2
done
echo "Use --help for usage information." >&2
exit 1
fi
# Validate that specified files exist and endpoints are valid URLs
for label in "${!INPUT_VALUES[@]}"; do
local value="${INPUT_VALUES[$label]}"
# Check if it's a URL (starts with http:// or https://)
if [[ "$value" =~ ^https?:// ]]; then
echo "Input '$label': endpoint $value"
else
# It should be a file path - validate it exists
if [[ ! -f "$value" ]]; then
echo "ERROR: Manifest file not found for input '$label': $value" >&2
exit 1
fi
echo "Input '$label': manifest $value"
fi
done
if [[ ! "$ISL" =~ ^[0-9]+$ ]] || [[ "$ISL" -le 0 ]]; then
echo "ERROR: ISL must be a positive integer, got: $ISL" >&2
exit 1
fi
if [[ ! "$OSL" =~ ^[0-9]+$ ]] || [[ "$OSL" -le 0 ]]; then
echo "ERROR: OSL must be a positive integer, got: $OSL" >&2
exit 1
fi
if [[ ! "$STD" =~ ^[0-9]+$ ]] || [[ "$STD" -lt 0 ]]; then
echo "ERROR: STD must be a non-negative integer, got: $STD" >&2
exit 1
fi
}
print_config() {
echo "=== Benchmark Configuration ==="
echo "Namespace: $NAMESPACE"
echo "Model: $MODEL"
echo "Input Sequence Length: $ISL tokens"
echo "Output Sequence Length: $OSL tokens"
echo "Sequence Std Dev: $STD tokens"
echo "Output Directory: $OUTPUT_DIR"
echo ""
echo "Benchmark Inputs:"
for label in "${!INPUT_VALUES[@]}"; do
local value="${INPUT_VALUES[$label]}"
if [[ "$value" =~ ^https?:// ]]; then
echo " $label: endpoint $value"
else
echo " $label: manifest $value"
fi
done
echo "==============================="
echo
}
clear_output_directory() {
if [[ -d "$OUTPUT_DIR" ]]; then
echo "🧹 Clearing existing output directory: $OUTPUT_DIR"
rm -rf "$OUTPUT_DIR"
fi
mkdir -p "$OUTPUT_DIR"
echo "✅ Output directory prepared: $OUTPUT_DIR"
}
run_benchmark() {
echo "🚀 Starting benchmark workflow..."
# Clear and recreate output directory
clear_output_directory
# Change to dynamo root directory
cd "$DYNAMO_ROOT"
local cmd=(
python3 -u -m benchmarks.utils.benchmark
--namespace "$NAMESPACE"
--model "$MODEL"
--isl "$ISL"
--std "$STD"
--osl "$OSL"
--output-dir "$OUTPUT_DIR"
)
# Add all input arguments
for label in "${!INPUT_VALUES[@]}"; do
local value="${INPUT_VALUES[$label]}"
cmd+=(--input "$label=$value")
done
if [[ "$VERBOSE" == "true" ]]; then
echo "Executing: ${cmd[*]}"
fi
if ! "${cmd[@]}"; then
echo "❌ Benchmark failed!" >&2
exit 1
fi
echo "✅ Benchmark completed successfully!"
}
generate_plots() {
echo "📊 Generating performance plots..."
cd "$DYNAMO_ROOT"
local plot_cmd=(
python3 -m benchmarks.utils.plot
--data-dir "$OUTPUT_DIR"
)
if [[ "$VERBOSE" == "true" ]]; then
echo "Executing: ${plot_cmd[*]}"
fi
if ! "${plot_cmd[@]}"; then
echo "⚠️ Plot generation failed, but benchmark data is still available" >&2
return 1
fi
echo "✅ Plots generated successfully!"
echo "📁 Results available at: $OUTPUT_DIR"
echo "📈 Plots available at: $OUTPUT_DIR/plots"
}
main() {
trap cleanup EXIT
parse_args "$@"
validate_config
print_config
if [[ "$VERBOSE" == "true" ]]; then
export DYNAMO_VERBOSE=true
fi
local start_time
start_time=$(date +%s)
run_benchmark
generate_plots
local end_time
end_time=$(date +%s)
local duration
duration=$((end_time - start_time))
echo
echo "🎉 All done!"
echo "⏱️ Total time: ${duration}s"
echo "📁 Results: $OUTPUT_DIR"
echo "📊 Plots: $OUTPUT_DIR/plots"
}
cleanup() {
if [[ $? -ne 0 ]]; then
echo "❌ Script failed. Check logs above for details." >&2
fi
}
# Only run main if script is executed directly (not sourced)
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
trap 'cleanup $?' EXIT
main "$@"
fi
......@@ -8,7 +8,7 @@ metadata:
spec:
template:
spec:
serviceAccountName: profile-sla-sa
serviceAccountName: dynamo-sa
containers:
- name: profile-sla
image: ${DOCKER_IMAGE}
......@@ -26,10 +26,10 @@ spec:
value: nats://${NAMESPACE}-nats:4222
- name: ETCD_ENDPOINTS
value: ${NAMESPACE}-etcd:2379
command: ["python", "/workspace/benchmarks/profiler/profile_sla.py"]
command: ["python", "-m", "benchmarks.profiler.profile_sla"]
args:
- --config
- ${DGD_CONFIG_FILE}
- /workspace/configs/disagg.yaml
- --output-dir
- /workspace/profiling_results
- --namespace
......@@ -51,9 +51,14 @@ spec:
volumeMounts:
- name: output-volume
mountPath: /workspace/profiling_results
- name: configs
mountPath: /workspace/configs
restartPolicy: Never
volumes:
- name: output-volume
persistentVolumeClaim:
claimName: profiling-pvc
claimName: dynamo-pvc
- name: configs
persistentVolumeClaim:
claimName: dynamo-pvc
backoffLimit: 0
......@@ -23,10 +23,6 @@ import numpy as np
import yaml
from utils.config import CONFIG_MODIFIERS, WORKER_COMPONENT_NAMES
from utils.defaults import DECODE_NUM_REQUESTS_RANGE
from utils.dynamo_deployment import (
DynamoDeploymentClient,
cleanup_remaining_deployments,
)
from utils.genai_perf import benchmark_decode, benchmark_prefill
from utils.plot import plot_decode_performance, plot_prefill_performance
from utils.profile_cache import (
......@@ -38,6 +34,11 @@ from utils.profile_cache import (
from utils.profile_decode import profile_decode
from utils.profile_prefill import profile_prefill
from deploy.utils.dynamo_deployment import (
DynamoDeploymentClient,
cleanup_remaining_deployments,
)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
console_handler = logging.StreamHandler()
......@@ -170,10 +171,10 @@ async def run_profile(args):
prefill_ttft.append(ttft)
prefill_thpt_per_gpu.append(args.isl / ttft / tp_size * 1000)
print("Cleaning up deployment...")
logger.info("Cleaning up deployment...")
await client.delete_deployment()
deployment_clients.remove(client)
print("Deployment deleted")
logger.info("Deployment deleted")
# Plot the results as a 2D scatter plot
if prefill_tp_size and prefill_ttft and prefill_thpt_per_gpu:
......@@ -270,7 +271,7 @@ async def run_profile(args):
)
max_concurrency = max_kv_tokens // (args.isl + args.osl)
sweep_num_request = [
num for num in DECODE_NUM_REQUESTS_RANGE if num < max_concurrency
num for num in DECODE_NUM_REQUESTS_RANGE if num <= max_concurrency
]
logger.info(
f"Sweeping num_request range based on maximum number of kv tokens: {sweep_num_request}"
......@@ -303,10 +304,10 @@ async def run_profile(args):
decode_concurrency.append(num_request)
decode_kv_cache_size.append(max_kv_tokens)
print("Cleaning up deployment...")
logger.info("Cleaning up deployment...")
await client.delete_deployment()
deployment_clients.remove(client)
print("Deployment deleted")
logger.info("Deployment deleted")
# Store partial results for plotting later
decode_results.append(
......@@ -318,6 +319,11 @@ async def run_profile(args):
plot_decode_performance(decode_results, args.itl, args.output_dir)
logger.info("Analyzing results and generating recommendations...")
# Safety guards: no results → exit early with a clear message
if not (prefill_tp_size and prefill_ttft and prefill_thpt_per_gpu):
logger.error("No prefill results produced; skipping recommendations.")
return
# select best tp size for prefill
if min(prefill_ttft) > args.ttft:
logger.info(
......@@ -349,6 +355,15 @@ async def run_profile(args):
)
# select best tp size for decode
if not (
decode_tp_size
and decode_itl
and decode_thpt_per_gpu
and decode_concurrency
and decode_kv_cache_size
):
logger.error("No decode results produced; skipping recommendations.")
return
if min(decode_itl) > args.itl:
logger.info(
"No TP size satisfies the ITL requirement, please try a smaller model or a more powerful GPU SKU"
......@@ -367,7 +382,7 @@ async def run_profile(args):
# calculate kv cache utilization for the selected TP and concurrency
selected_decode_kv_cache_utilization = (
decode_concurrency[selected_decode_idx]
* (args.isl + args.osl / 2)
* (args.isl + (args.osl / 2))
/ decode_kv_cache_size[selected_decode_idx]
)
# set a +- 20% range for the kv cache utilization
......@@ -433,10 +448,10 @@ async def run_profile(args):
args.prefill_interpolation_granularity,
)
print("Cleaning up deployment...")
logger.info("Cleaning up deployment...")
await client.delete_deployment()
deployment_clients.remove(client)
print("Deployment deleted")
logger.info("Deployment deleted")
# interpolate ITL - Active_KV_Cache - Decode_Context_Length with best decode TP
best_decode_tp = decode_tp_size[selected_decode_idx]
......@@ -490,10 +505,10 @@ async def run_profile(args):
args.decode_interpolation_granularity,
)
print("Cleaning up deployment...")
logger.info("Cleaning up deployment...")
await client.delete_deployment()
deployment_clients.remove(client)
print("Deployment deleted")
logger.info("Deployment deleted")
except Exception as e:
logger.error(f"Profile job failed with error: {e}")
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
import signal
import subprocess
import time
import pynvml
import requests
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)
formatter = logging.Formatter(
"%(asctime)s - %(name)s - %(levelname)s - %(message)s", "%Y-%m-%d %H:%M:%S"
)
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)
def get_dynamo_serve_cmd(config_file_path):
config_file_path = os.path.abspath(config_file_path)
return [
"dynamo",
"serve",
"graphs.agg:Frontend",
"-f",
config_file_path,
]
def get_available_gpu_count():
try:
pynvml.nvmlInit()
gpu_count = pynvml.nvmlDeviceGetCount()
if gpu_count > 0:
logger.info(f"Detected {gpu_count} GPUs in the system:")
for i in range(gpu_count):
handle = pynvml.nvmlDeviceGetHandleByIndex(i)
name = pynvml.nvmlDeviceGetName(handle)
memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
total_memory_mb = memory.total / (1024 * 1024)
free_memory_mb = memory.free / (1024 * 1024)
logger.info(
f" GPU {i}: {name}, Total Memory: {total_memory_mb:.2f} MB, Free Memory: {free_memory_mb:.2f} MB"
)
else:
logger.warning("No GPUs detected with pynvml.")
pynvml.nvmlShutdown()
return gpu_count
except ImportError:
logger.error(
"pynvml module not found. Please install it with 'pip install pynvml'"
)
return 0
except pynvml.NVMLError as e:
logger.error(f"NVML Error: {e}")
return 0
except Exception as e:
logger.error(f"Error detecting GPUs: {e}")
return 0
def shutdown_deployment(dynamo_process):
os.killpg(os.getpgid(dynamo_process.pid), signal.SIGINT)
dynamo_process.communicate()
try:
current_pid = os.getpid()
ps_cmd = ["ps", "-ef"]
ps_output = subprocess.check_output(ps_cmd, text=True)
for line in ps_output.splitlines():
if "python" in line.lower():
parts = line.split()
if len(parts) >= 2:
try:
pid = int(parts[1])
if pid != current_pid: # Exclude current process
os.kill(pid, signal.SIGKILL)
except ValueError:
continue
except Exception as e:
logger.error(f"Error killing Python processes: {e}")
time.sleep(5)
def wait_for_server_ready(model_name: str, port: int, timeout: int = 300):
logger.info("Waiting for the server to be ready...")
endpoint_url = f"http://localhost:{port}/v1/chat/completions"
start_time = time.time()
server_ready = False
while time.time() - start_time < timeout:
try:
# Send a simple request to check if the server is up
response = requests.post(
endpoint_url,
json={
"model": model_name,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 1,
},
timeout=5,
)
if response.status_code != 200:
logger.info(
f"Server returned status code {response.status_code}, waiting..."
)
time.sleep(5)
continue
logger.info(f"Server is ready after {time.time() - start_time:.2f} seconds")
server_ready = True
break
except (requests.RequestException, ConnectionError) as e:
logger.info(f"Server not ready yet: {e}")
time.sleep(5)
return server_ready
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Package marker for benchmarks utilities
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import argparse
import asyncio
import sys
from typing import Tuple
from benchmarks.utils.workflow import categorize_inputs, run_benchmark_workflow
def parse_input(input_str: str) -> Tuple[str, str]:
"""Parse input string in format key=value with additional validation"""
if "=" not in input_str:
raise ValueError(
f"Invalid input format. Expected: <label>=<manifest_path_or_endpoint>, got: {input_str}"
)
parts = input_str.split("=", 1) # Split on first '=' only
if len(parts) != 2:
raise ValueError(
f"Invalid input format. Expected: <label>=<manifest_path_or_endpoint>, got: {input_str}"
)
label, value = parts
if not label.strip():
raise ValueError("Label cannot be empty")
if not value.strip():
raise ValueError("Value cannot be empty")
label = label.strip()
value = value.strip()
# Validate label characters
import re
if not re.match(r"^[a-zA-Z0-9_-]+$", label):
raise ValueError(
f"Label must contain only letters, numbers, hyphens, and underscores. Invalid label: {label}"
)
return label, value
def main() -> int:
parser = argparse.ArgumentParser(description="Benchmark Orchestrator")
parser.add_argument(
"--input",
action="append",
dest="inputs",
help="Input in format <label>=<manifest_path_or_endpoint>. Can be specified multiple times for comparisons.",
)
parser.add_argument("--namespace", required=True, help="Kubernetes namespace")
parser.add_argument("--isl", type=int, default=200, help="Input sequence length")
parser.add_argument(
"--std",
type=int,
default=10,
help="Input sequence standard deviation",
)
parser.add_argument("--osl", type=int, default=200, help="Output sequence length")
parser.add_argument(
"--model",
default="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
help="Model name",
)
parser.add_argument(
"--output-dir", type=str, default="benchmarks/results", help="Output directory"
)
args = parser.parse_args()
# Validate inputs
if not args.inputs:
print("ERROR: At least one --input must be specified")
return 1
# Parse inputs
try:
parsed_inputs = {}
for input_str in args.inputs:
label, value = parse_input(input_str)
if label in parsed_inputs:
print(
f"ERROR: Duplicate label '{label}' found. Each label must be unique."
)
return 1
parsed_inputs[label] = value
# Check for plotting limitations
if len(parsed_inputs) > 12:
print(
f"WARNING: You provided {len(parsed_inputs)} inputs, but the plotting system supports up to 12 inputs."
)
print(
"Consider running separate benchmark sessions or grouping related comparisons together."
)
print(
"Continuing with benchmark, but some inputs may not appear in plots..."
)
print()
endpoints, manifests = categorize_inputs(parsed_inputs)
except (ValueError, FileNotFoundError) as e:
print(f"ERROR: {e}")
return 1
# Run the benchmark workflow with the parsed inputs
asyncio.run(
run_benchmark_workflow(
namespace=args.namespace,
inputs=parsed_inputs,
isl=args.isl,
std=args.std,
osl=args.osl,
model=args.model,
output_dir=args.output_dir,
)
)
return 0
if __name__ == "__main__":
sys.exit(main())
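
One detail worth checking when passing endpoint URLs is how `=` inside the value is handled. The Python `parse_input` above splits on the first `=` only, while the format check in `benchmark.sh` (`^[^=]+=[^=]+$`) rejects any argument with more than one `=`. The sketch below reproduces both rules side by side; the helper names and the query-string URL are hypothetical and serve only to illustrate the difference.

```python
import re
from typing import Tuple

# Same pattern as the format check in benchmark.sh's parse_input.
SHELL_FORMAT = re.compile(r"^[^=]+=[^=]+$")


def split_label_value(arg: str) -> Tuple[str, str]:
    """Split on the first '=' only, as the Python parse_input does.

    Raises ValueError if the argument contains no '=' at all.
    """
    label, value = arg.split("=", 1)
    return label.strip(), value.strip()


def accepted_by_shell_wrapper(arg: str) -> bool:
    """Return True if the argument passes the wrapper's one-'=' check."""
    return SHELL_FORMAT.match(arg) is not None


if __name__ == "__main__":
    # Hypothetical endpoint with a query string, for illustration only.
    arg = "external=http://localhost:8000/v1?api-version=2"
    print(split_label_value(arg))
    print(accepted_by_shell_wrapper(arg))
```

With these rules, an endpoint whose URL carries a query string would parse cleanly in Python but be stopped by the shell wrapper before Python ever sees it.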
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import os
import subprocess
from pathlib import Path
from typing import List
# Default concurrency levels - can be overridden with CONCURRENCIES environment variable
DEFAULT_CONCURRENCIES: List[int] = [1, 2, 5, 10, 50, 100, 250]
def get_concurrency_levels() -> List[int]:
"""Get concurrency levels from environment variable or use defaults"""
concurrencies_env = os.getenv("CONCURRENCIES")
if concurrencies_env:
try:
# Parse comma-separated values
concurrencies = [int(x.strip()) for x in concurrencies_env.split(",")]
# Validate all are positive integers
for c in concurrencies:
if c <= 0:
raise ValueError(f"Concurrency level must be positive, got: {c}")
return sorted(concurrencies)
except ValueError as e:
print(f"WARNING: Invalid CONCURRENCIES environment variable: {e}")
print(f"Using default concurrency levels: {DEFAULT_CONCURRENCIES}")
return DEFAULT_CONCURRENCIES
return DEFAULT_CONCURRENCIES
CONCURRENCIES: List[int] = get_concurrency_levels()
def run_genai_perf(
service_url: str,
model_name: str,
isl: int,
osl: int,
stddev: int,
concurrency: int,
output_dir: Path,
) -> None:
output_dir.mkdir(parents=True, exist_ok=True)
cmd = [
"genai-perf",
"profile",
"-m",
model_name,
"--endpoint-type",
"chat",
"--streaming",
"-u",
service_url,
"--synthetic-input-tokens-mean",
str(isl),
"--synthetic-input-tokens-stddev",
str(stddev),
"--concurrency",
str(concurrency),
"--output-tokens-mean",
str(osl),
"--extra-inputs",
f"max_tokens:{osl}",
"--extra-inputs",
f"min_tokens:{osl}",
"--extra-inputs",
"ignore_eos:true",
"--tokenizer",
model_name,
"--artifact-dir",
str(output_dir),
"--",
"-vv",
"--max-threads=300",
]
print(
f"Running genai-perf with isl {isl}, osl {osl}, concurrency {concurrency}",
flush=True,
)
gap_process = subprocess.Popen(
cmd,
cwd=str(output_dir),
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
)
stdout, stderr = gap_process.communicate()
if gap_process.returncode == 0:
print("Genai-perf profiling completed successfully", flush=True)
if stdout:
print(stdout)
else:
print(f"Genai-perf failed with error code: {gap_process.returncode}")
if stderr:
print(f"stderr: {stderr}")
raise subprocess.CalledProcessError(
gap_process.returncode, cmd, output=stdout, stderr=stderr
)
def run_concurrency_sweep(
service_url: str, model_name: str, isl: int, osl: int, stddev: int, output_dir: Path
) -> None:
concurrency_levels = get_concurrency_levels()
print(
f"Running concurrency sweep for {model_name} with ISL {isl} and OSL {osl} and standard deviation {stddev}",
flush=True,
)
print(f"Concurrency levels: {concurrency_levels}", flush=True)
for c in concurrency_levels:
print(f"Starting concurrency level {c}", flush=True)
run_genai_perf(
service_url, model_name, isl, osl, stddev, c, output_dir / f"c{c}"
)
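
The sweep above writes each concurrency level into a subdirectory named `c<level>`, and the plotting code recovers the level from that name with the pattern `c(\d+)`. A minimal sketch of that round trip follows; `concurrency_dir` and `level_from_dir` are hypothetical helper names, and `results/agg` is an assumed example path.

```python
import re
from pathlib import Path
from typing import Optional


def concurrency_dir(output_dir: Path, level: int) -> Path:
    """Build the per-level artifact directory, e.g. <output_dir>/c50."""
    return output_dir / f"c{level}"


def level_from_dir(path: Path) -> Optional[int]:
    """Recover the concurrency level from a directory name like 'c50'.

    Returns None for names that do not start with 'c' followed by digits.
    """
    match = re.match(r"c(\d+)", path.name)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    run_dir = concurrency_dir(Path("results/agg"), 50)
    print(run_dir.name, level_from_dir(run_dir))
    print(level_from_dir(Path("results/agg/plots")))
```

Keeping the naming and the parsing pattern in step matters: a directory that does not match the pattern is silently skipped when results are loaded.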
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import json
import re
from pathlib import Path
from typing import Dict, List, Tuple
import matplotlib.pyplot as plt
def parse_benchmark_results(result_dir: Path) -> List[Tuple[int, Dict]]:
"""
Parse benchmark results from a deployment directory.
Args:
result_dir: Path to the result directory
Returns:
List of (concurrency_level, metrics_dict) tuples sorted by concurrency
"""
results = []
# Find all concurrency directories (e.g., c1, c2, c5, c10, c50, c100, c250)
for concurrency_dir in result_dir.iterdir():
if not concurrency_dir.is_dir() or not concurrency_dir.name.startswith("c"):
continue
# Extract concurrency level from directory name
match = re.match(r"c(\d+)", concurrency_dir.name)
if not match:
continue
concurrency = int(match.group(1))
# Find the genai-perf JSON file
genai_perf_json = None
for json_file in concurrency_dir.rglob("profile_export_genai_perf.json"):
genai_perf_json = json_file
break
if genai_perf_json and genai_perf_json.exists():
try:
with open(genai_perf_json, "r") as f:
metrics = json.load(f)
results.append((concurrency, metrics))
print(f"Loaded metrics for concurrency {concurrency}")
except Exception as e:
print(f"Error loading {genai_perf_json}: {e}")
else:
print(f"Warning: No genai-perf JSON found for {concurrency_dir}")
# Sort by concurrency level
results.sort(key=lambda x: x[0])
return results
def extract_metric_series(
results: List[Tuple[int, Dict]], metric_path: str, stat: str = "avg"
) -> Tuple[List[int], List[float]]:
"""
Extract a series of values for a specific metric across concurrency levels.
Args:
results: List of (concurrency, metrics) tuples
metric_path: Dot-separated path to the metric (e.g., 'inter_token_latency')
stat: Statistic to extract ('avg', 'p50', 'p90', etc.)
Returns:
Tuple of (concurrency_levels, metric_values)
"""
concurrencies = []
values = []
path_keys = metric_path.split(".")
for concurrency, metrics in results:
try:
node = metrics
for k in path_keys:
node = node[k]
value = node[stat]
concurrencies.append(concurrency)
values.append(float(value))
except (KeyError, TypeError):
print(
f"Warning: {metric_path}.{stat} not found for concurrency {concurrency}"
)
continue
return concurrencies, values
def create_plot(
    title: str,
    xlabel: str,
    ylabel: str,
    data_series: List[Tuple[str, List[int], List[float]]],
    output_path: Path,
    log_scale_x: bool = False,
    log_scale_y: bool = False,
) -> None:
    """
    Create a line plot with multiple series.
    Args:
        title: Plot title
        xlabel: X-axis label
        ylabel: Y-axis label
        data_series: List of (label, x_values, y_values) tuples
        output_path: Path to save the plot
        log_scale_x: Whether to use log scale for X axis
        log_scale_y: Whether to use log scale for Y axis
    """
    plt.figure(figsize=(10, 6))
    colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b"]
    for i, (label, x_vals, y_vals) in enumerate(data_series):
        if x_vals and y_vals:  # Only plot if we have data
            plt.plot(
                x_vals,
                y_vals,
                marker="o",
                linewidth=2,
                markersize=6,
                color=colors[i % len(colors)],
                label=label,
            )
    plt.title(title, fontsize=14, fontweight="bold")
    plt.xlabel(xlabel, fontsize=12)
    plt.ylabel(ylabel, fontsize=12)
    plt.grid(True, alpha=0.3)
    if log_scale_x:
        plt.xscale("log")
    if log_scale_y:
        plt.yscale("log")
    plt.legend()
    plt.tight_layout()
    plt.savefig(output_path, dpi=300, bbox_inches="tight")
    plt.close()
    print(f"Saved plot: {output_path}")
def create_efficiency_plot(
    deployment_results: Dict, plots_dir: Path, output_tokens: int = 200
) -> None:
    """
    Create an efficiency plot showing tok/s/gpu vs tok/s/user with concurrency as labeled points.
    Args:
        deployment_results: Dict of deployment_type -> results
        plots_dir: Directory to save plots
        output_tokens: Average output tokens per request (default 200)
    """
    plt.figure(figsize=(12, 8))
    # Support for up to 12 deployments in the plots
    colors = [
        "#1f77b4",
        "#ff7f0e",
        "#2ca02c",
        "#d62728",
        "#9467bd",
        "#8c564b",
        "#e377c2",
        "#7f7f7f",
        "#bcbd22",
        "#17becf",
        "#aec7e8",
        "#ffbb78",
    ]
    markers = ["o", "s", "^", "D", "v", "<", ">", "p", "*", "h", "H", "+"]
    for idx, (deployment_type, results) in enumerate(deployment_results.items()):
        tok_s_per_user = []
        tok_s_per_gpu = []
        concurrency_levels = []
        for concurrency, metrics in results:
            try:
                # Get request throughput (requests/sec)
                request_throughput = metrics["request_throughput"]["avg"]
                # Calculate total tokens per second
                total_tok_s = request_throughput * output_tokens
                # Guard against zero concurrency
                if concurrency <= 0:
                    continue
                # GPU count falls back to 1 when metrics has no cluster.num_gpus entry
                num_gpus = metrics.get("cluster", {}).get("num_gpus", 1)
                tok_s_user = total_tok_s / concurrency
                tok_s_gpu = total_tok_s / max(1, num_gpus)
                tok_s_per_user.append(tok_s_user)
                tok_s_per_gpu.append(tok_s_gpu)
                concurrency_levels.append(concurrency)
            except KeyError as e:
                print(
                    f"Warning: Missing metric for {deployment_type} concurrency {concurrency}: {e}"
                )
                continue
        if tok_s_per_user and tok_s_per_gpu:
            # Plot points, cycling through colors and markers by deployment index
            color = colors[idx % len(colors)]
            marker = markers[idx % len(markers)]
            plt.scatter(
                tok_s_per_user,
                tok_s_per_gpu,
                c=color,
                marker=marker,
                s=120,
                alpha=0.8,
                label=deployment_type.title(),
                edgecolors="black",
                linewidth=1.5,
            )
            # Add concurrency labels
            for x, y, c in zip(tok_s_per_user, tok_s_per_gpu, concurrency_levels):
                plt.annotate(
                    f"{c}",
                    (x, y),
                    xytext=(8, 8),
                    textcoords="offset points",
                    fontsize=10,
                    fontweight="bold",
                    ha="left",
                )
    plt.title("GPU Efficiency vs User Experience", fontsize=14, fontweight="bold")
    plt.xlabel("Tokens/sec per User", fontsize=12)
    plt.ylabel("Tokens/sec per GPU", fontsize=12)
    plt.grid(True, alpha=0.3)
    # Add a note about what the numbers represent
    plt.figtext(
        0.02,
        0.02,
        "Note: Numbers on dots indicate concurrency level",
        fontsize=10,
        style="italic",
        alpha=0.7,
    )
    plt.legend()
    plt.tight_layout()
    output_path = plots_dir / "efficiency_tok_s_gpu_vs_user.png"
    plt.savefig(output_path, dpi=300, bbox_inches="tight")
    plt.close()
    print(f"Saved efficiency plot: {output_path}")
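The arithmetic behind each plotted point can be checked by hand. The sketch below isolates it in a hypothetical helper, `efficiency_point` (not part of the module), and assumes, as the function above does, that every request produces exactly `output_tokens` tokens, which is only an approximation of real traffic.

```python
from typing import Tuple


def efficiency_point(
    request_throughput: float, output_tokens: int, concurrency: int, num_gpus: int = 1
) -> Tuple[float, float]:
    """Return (tokens/sec per user, tokens/sec per GPU) for one concurrency level."""
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    total_tok_s = request_throughput * output_tokens
    return total_tok_s / concurrency, total_tok_s / max(1, num_gpus)


# 4 req/s * 200 tokens = 800 tok/s in total.
# Shared by 16 concurrent users: 800 / 16 = 50 tok/s per user.
# Spread over 2 GPUs:            800 / 2  = 400 tok/s per GPU.
per_user, per_gpu = efficiency_point(4.0, 200, 16, num_gpus=2)
print(per_user, per_gpu)  # 50.0 400.0
```

Raising concurrency usually moves a point up and to the left: more total throughput per GPU, fewer tokens per second for each individual user.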
def generate_plots(base_output_dir: Path, output_dir: Path) -> None:
    """
    Generate performance plots from benchmark results.
    Args:
        base_output_dir: Base directory containing benchmark results
        output_dir: Directory to save plots
    """
    print(f"Generating plots from results in {base_output_dir}")
    # Create plots directory, including any missing parent directories
    output_dir.mkdir(parents=True, exist_ok=True)
    # Parse results for each deployment type
    deployment_results = {}
    # Find all subdirectories that contain benchmark results
    for item in base_output_dir.iterdir():
        if item.is_dir() and item.name != "plots":
            deployment_type = item.name
            results = parse_benchmark_results(item)
            if results:
                deployment_results[deployment_type] = results
                print(f"Found {len(results)} concurrency levels for {deployment_type}")
            else:
                print(f"No valid results found for {deployment_type}")
    if not deployment_results:
        print("No benchmark results found to plot!")
        return
    # 1. P50 Inter-token Latency vs Concurrency
    p50_data = []
    for deployment_type, results in deployment_results.items():
        concurrencies, latencies = extract_metric_series(
            results, "inter_token_latency", "p50"
        )
        if concurrencies:
            p50_data.append((deployment_type.title(), concurrencies, latencies))
    create_plot(
        title="P50 Inter-Token Latency vs Concurrency",
        xlabel="Concurrency Level",
        ylabel="P50 Inter-Token Latency (ms)",
        data_series=p50_data,
        output_path=output_dir / "p50_inter_token_latency_vs_concurrency.png",
        log_scale_x=True,
    )
    # 2. Average Inter-token Latency vs Concurrency
    avg_latency_data = []
    for deployment_type, results in deployment_results.items():
        concurrencies, latencies = extract_metric_series(
            results, "inter_token_latency", "avg"
        )
        if concurrencies:
            avg_latency_data.append((deployment_type.title(), concurrencies, latencies))
    create_plot(
        title="Average Inter-Token Latency vs Concurrency",
        xlabel="Concurrency Level",
        ylabel="Average Inter-Token Latency (ms)",
        data_series=avg_latency_data,
        output_path=output_dir / "avg_inter_token_latency_vs_concurrency.png",
        log_scale_x=True,
    )
    # 3. Request Throughput vs Concurrency
    throughput_data = []
    for deployment_type, results in deployment_results.items():
        concurrencies, throughputs = extract_metric_series(
            results, "request_throughput", "avg"
        )
        if concurrencies:
            throughput_data.append(
                (deployment_type.title(), concurrencies, throughputs)
            )
    create_plot(
        title="Request Throughput vs Concurrency",
        xlabel="Concurrency Level",
        ylabel="Request Throughput (req/s)",
        data_series=throughput_data,
        output_path=output_dir / "request_throughput_vs_concurrency.png",
        log_scale_x=True,
    )
    # 4. Average Time to First Token vs Concurrency
    ttft_data = []
    for deployment_type, results in deployment_results.items():
        concurrencies, ttfts = extract_metric_series(
            results, "time_to_first_token", "avg"
        )
        if concurrencies:
            ttft_data.append((deployment_type.title(), concurrencies, ttfts))
    create_plot(
        title="Average Time to First Token vs Concurrency",
        xlabel="Concurrency Level",
        ylabel="Average Time to First Token (ms)",
        data_series=ttft_data,
        output_path=output_dir / "avg_time_to_first_token_vs_concurrency.png",
        log_scale_x=True,
    )
    # 5. Efficiency plot: tok/s/gpu vs tok/s/user
    create_efficiency_plot(deployment_results, output_dir)
    # Generate summary
    summary_lines = [
        "Benchmark Results Summary",
        "=" * 30,
        "",
        f"Results directory: {base_output_dir}",
        f"Plots generated: {output_dir}",
        "",
        "Deployment Types Found:",
    ]
    for deployment_type, results in deployment_results.items():
        concurrency_levels = [r[0] for r in results]
        summary_lines.append(
            f" {deployment_type}: {len(results)} concurrency levels ({min(concurrency_levels)}-{max(concurrency_levels)})"
        )
    summary_lines.extend(
        [
            "",
            "Generated Plots:",
            " - p50_inter_token_latency_vs_concurrency.png",
            " - avg_inter_token_latency_vs_concurrency.png",
            " - request_throughput_vs_concurrency.png",
            " - avg_time_to_first_token_vs_concurrency.png",
            " - efficiency_tok_s_gpu_vs_user.png",
        ]
    )
    # The summary is written next to the plots, inside output_dir
    summary_path = output_dir / "SUMMARY.txt"
    summary_path.write_text("\n".join(summary_lines))
    print(f"Generated summary: {summary_path}")
    print(f"All plots saved to: {output_dir}")
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description="Generate performance plots from benchmark results"
    )
    parser.add_argument(
        "--data-dir", required=True, help="Directory containing benchmark results"
    )
    parser.add_argument(
        "--output-dir", help="Output directory for plots (defaults to data-dir/plots)"
    )
    args = parser.parse_args()
    data_dir = Path(args.data_dir)
    if args.output_dir:
        # If output dir specified, create it and write the plots there
        output_dir = Path(args.output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        generate_plots(data_dir, output_dir)
    else:
        # Otherwise write the plots to data_dir/plots
        generate_plots(data_dir, data_dir / "plots")
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List, Tuple
from benchmarks.utils.genai import run_concurrency_sweep
from benchmarks.utils.plot import generate_plots
from deploy.utils.dynamo_deployment import DynamoDeploymentClient
@dataclass
class DeploymentConfig:
    """Configuration for a single deployment type"""

    name: str  # Human-readable name (e.g., "aggregated")
    manifest_path: str  # Path to deployment manifest
    output_subdir: str  # Subdirectory name for results (e.g., "agg")
    client_factory: Callable  # Function to create the client
    deploy_func: Callable  # Function to deploy the client
def create_dynamo_client(
    namespace: str, deployment_name: str
) -> DynamoDeploymentClient:
    """Factory function for DynamoDeploymentClient"""
    return DynamoDeploymentClient(namespace=namespace, deployment_name=deployment_name)
async def deploy_dynamo_client(
    client: DynamoDeploymentClient, manifest_path: str
) -> None:
    """Deploy a DynamoDeploymentClient"""
    await client.create_deployment(manifest_path)
    await client.wait_for_deployment_ready(timeout=1800)
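async def teardown(client) -> None:
    """Clean up deployment and stop port forwarding"""
    # Cleanup stays best-effort, but failures are reported instead of silently dropped
    try:
        if hasattr(client, "stop_port_forward"):
            client.stop_port_forward()
    except Exception as e:
        print(f"Warning: failed to stop port forwarding: {e}")
    # Attempt the deletion even if stopping the port forward failed
    try:
        await client.delete_deployment()
    except Exception as e:
        print(f"Warning: failed to delete deployment: {e}")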
def print_deployment_start(config: DeploymentConfig, output_dir: str) -> None:
    """Print deployment start messages"""
    print(f"🚀 Starting {config.name} deployment benchmark...")
    print(f"📄 Manifest: {config.manifest_path}")
    print(f"📁 Results will be saved to: {Path(output_dir) / config.output_subdir}")
def print_concurrency_start(
    deployment_name: str, model: str, isl: int, osl: int, std: int
) -> None:
    """Print concurrency sweep start messages"""
    print(f"⚙️ Starting {deployment_name} concurrency sweep!", flush=True)
    print(
        "⏱️ This may take several minutes - running through multiple concurrency levels...",
        flush=True,
    )
    print(f"🎯 Model: {model} | ISL: {isl} | OSL: {osl} | StdDev: {std}")
def print_deployment_complete(config: DeploymentConfig) -> None:
    """Print deployment completion message"""
    print(f"✅ {config.name.title()} deployment benchmark completed successfully!")
def print_deployment_skip(deployment_type: str) -> None:
    """Print deployment skip message"""
    print(f"⏭️ Skipping {deployment_type} deployment (not specified)")
async def run_single_deployment_benchmark(
    config: DeploymentConfig,
    namespace: str,
    output_dir: str,
    model: str,
    isl: int,
    osl: int,
    std: int,
) -> None:
    """Run benchmark for a single deployment type"""
    print_deployment_start(config, output_dir)
    results_dir = Path(output_dir) / config.output_subdir
    # Create the client
    client = config.client_factory(namespace, config.output_subdir)
    try:
        # Deploy inside the try block so a failed or timed-out rollout is still torn down
        await config.deploy_func(client, config.manifest_path)
        print_concurrency_start(config.name, model, isl, osl, std)
        # Run concurrency sweep
        results_dir.mkdir(parents=True, exist_ok=True)
        run_concurrency_sweep(
            service_url=client.port_forward_frontend(quiet=True),
            model_name=model,
            isl=isl,
            osl=osl,
            stddev=std,
            output_dir=results_dir,
        )
    finally:
        await teardown(client)
    print_deployment_complete(config)
async def run_endpoint_benchmark(
    label: str,
    endpoint: str,
    model: str,
    isl: int,
    osl: int,
    std: int,
    output_dir: str,
) -> None:
    """Run benchmark for an existing endpoint with custom label"""
    print(f"🚀 Starting benchmark of endpoint '{label}': {endpoint}")
    print(f"📁 Results will be saved to: {Path(output_dir) / label}")
    print_concurrency_start(f"endpoint ({label})", model, isl, osl, std)
    run_concurrency_sweep(
        service_url=endpoint,
        model_name=model,
        isl=isl,
        osl=osl,
        stddev=std,
        output_dir=Path(output_dir) / label,
    )
    print("✅ Endpoint benchmark completed successfully!")
def print_final_summary(output_dir: str, deployed_types: List[str]) -> None:
    """Print final benchmark summary"""
    plots_dir = Path(output_dir) / "plots"
    print("📊 Generating performance plots...")
    generate_plots(base_output_dir=Path(output_dir), output_dir=plots_dir)
    print(f"📈 Plots saved to: {plots_dir}")
    # generate_plots writes SUMMARY.txt into the plots directory, not output_dir
    print(f"📋 Summary saved to: {plots_dir / 'SUMMARY.txt'}")
    print()
    print("🎉 Benchmark workflow completed successfully!")
    print(f"📁 All results available at: {output_dir}")
    if deployed_types:
        print(f"🚀 Benchmarked deployments: {', '.join(deployed_types)}")
    print(f"📊 View plots at: {plots_dir}")
def categorize_inputs(inputs: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Categorize inputs into endpoints and manifests"""
    endpoints = {}
    manifests = {}
    for label, value in inputs.items():
        # Validate reserved labels
        if label.lower() == "plots":
            raise ValueError(
                "Label 'plots' is reserved and cannot be used. Please choose a different label."
            )
        if value.startswith(("http://", "https://")):
            endpoints[label] = value
        else:
            # It should be a file path - validate it exists
            if not Path(value).is_file():
                raise FileNotFoundError(
                    f"Manifest file not found for input '{label}': {value}"
                )
            manifests[label] = value
    return endpoints, manifests
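The `inputs` dictionary maps a label to either a URL or a manifest path, matching the `--input label=value` form used on the `benchmark.sh` command line. How the script actually parses that flag is not shown here, so the following is only a sketch with hypothetical helpers (`parse_input`, `is_endpoint`) illustrating the split and the URL test.

```python
from typing import Tuple


def parse_input(arg: str) -> Tuple[str, str]:
    """Split 'label=value' on the first '=' so URLs containing '=' stay intact."""
    label, sep, value = arg.partition("=")
    if not sep or not label or not value:
        raise ValueError(f"Expected label=value, got: {arg!r}")
    return label, value


def is_endpoint(value: str) -> bool:
    """Apply the same URL test that categorize_inputs uses."""
    return value.startswith(("http://", "https://"))


label, value = parse_input("my-endpoint=http://your-endpoint:8000")
print(label, is_endpoint(value))  # my-endpoint True

label, value = parse_input("vllm-disagg=components/backends/vllm/deploy/disagg.yaml")
print(label, is_endpoint(value))  # vllm-disagg False
```

Anything that is not an HTTP(S) URL is treated as a manifest path, which is why `categorize_inputs` then checks that the file exists.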
def validate_dynamo_manifest(manifest_path: str) -> None:
    """Validate that the manifest is a DynamoGraphDeployment"""
    try:
        with open(manifest_path, "r") as f:
            content = f.read()
    except FileNotFoundError:
        raise FileNotFoundError(f"Manifest file not found: {manifest_path}")
    except (OSError, UnicodeDecodeError) as e:
        raise ValueError(f"Error reading manifest {manifest_path}: {e}") from e
    # Check for DynamoGraphDeployment outside the try block so this ValueError
    # is not caught and re-wrapped as a read error
    if "kind: DynamoGraphDeployment" not in content:
        raise ValueError(
            f"Manifest {manifest_path} is not a DynamoGraphDeployment. Only DynamoGraphDeployments are supported for deployment benchmarking."
        )
async def run_benchmark_workflow(
    namespace: str,
    inputs: Dict[str, str],
    isl: int = 200,
    std: int = 10,
    osl: int = 200,
    model: str = "nvidia/Llama-3.1-8B-Instruct-FP8",
    output_dir: str = "benchmarks/results",
) -> None:
    """Main benchmark workflow orchestrator with dynamic inputs"""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # Categorize inputs into endpoints and manifests
    endpoints, manifests = categorize_inputs(inputs)
    # Run endpoint benchmarks
    for label, endpoint in endpoints.items():
        await run_endpoint_benchmark(label, endpoint, model, isl, osl, std, output_dir)
    # Create deployment configurations for manifests
    deployment_configs = []
    for label, manifest_path in manifests.items():
        # Validate that it's a DynamoGraphDeployment
        validate_dynamo_manifest(manifest_path)
        config = DeploymentConfig(
            name=label,
            manifest_path=manifest_path,
            output_subdir=label,
            client_factory=create_dynamo_client,
            deploy_func=deploy_dynamo_client,
        )
        deployment_configs.append(config)
    # Run benchmarks for each deployment type
    deployed_labels = list(endpoints.keys())
    for config in deployment_configs:
        await run_single_deployment_benchmark(
            config=config,
            namespace=namespace,
            output_dir=output_dir,
            model=model,
            isl=isl,
            osl=osl,
            std=std,
        )
        deployed_labels.append(config.name)
    # Generate final summary
    print_final_summary(output_dir, deployed_labels)
......@@ -47,7 +47,7 @@ spec:
failureThreshold: 10
pvc:
create: false
name: profiling-pvc # Must be pre-created before deployment and SLA profiler must have been run
name: dynamo-pvc # Must be pre-created before deployment and SLA profiler must have been run
mountPoint: /workspace/profiling_results
extraPodSpec:
mainContainer:
......
......@@ -99,7 +99,7 @@ We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/
### Pre-Deployment Profiling (SLA Planner Only)
If using the SLA Planner deployment (`disagg_planner.yaml`), follow the [pre-deployment profiling guide](../../../../docs/architecture/pre_deployment_profiling.md) to run pre-deployment profiling. The results will be saved to the `profiling-pvc` PVC and queried by the SLA Planner.
If using the SLA Planner deployment (`disagg_planner.yaml`), follow the [pre-deployment profiling guide](../../../../docs/benchmarks/pre_deployment_profiling.md) to run pre-deployment profiling. The results will be saved to the `dynamo-pvc` PVC and queried by the SLA Planner.
## Usage
......
......@@ -13,7 +13,7 @@ spec:
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
VllmDecodeWorker:
envFromSecret: hf-token-secret
dynamoNamespace: vllm-agg
......@@ -24,7 +24,7 @@ spec:
gpu: "1"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
workingDir: /workspace/components/backends/vllm
command:
- /bin/sh
......
......@@ -13,7 +13,7 @@ spec:
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
envs:
- name: DYN_ROUTER_MODE
value: kv
......@@ -27,7 +27,7 @@ spec:
gpu: "1"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
workingDir: /workspace/components/backends/vllm
command:
- /bin/sh
......
......@@ -13,7 +13,7 @@ spec:
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
VllmDecodeWorker:
dynamoNamespace: vllm-disagg
envFromSecret: hf-token-secret
......@@ -24,7 +24,7 @@ spec:
gpu: "1"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
workingDir: /workspace/components/backends/vllm
command:
- /bin/sh
......@@ -41,7 +41,7 @@ spec:
gpu: "1"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
workingDir: /workspace/components/backends/vllm
command:
- /bin/sh
......
......@@ -20,7 +20,7 @@ spec:
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0814-02
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
Planner:
dynamoNamespace: vllm-disagg-planner
envFromSecret: hf-token-secret
......@@ -47,11 +47,11 @@ spec:
failureThreshold: 10
pvc:
create: false
name: profiling-pvc # Must be pre-created before deployment and SLA profiler must have been run
name: dynamo-pvc # Must be pre-created before deployment and SLA profiler must have been run
mountPoint: /workspace/profiling_results
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0814-02
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
workingDir: /workspace/components/planner/src/dynamo/planner
ports:
- name: metrics
......@@ -95,7 +95,7 @@ spec:
failureThreshold: 10
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0814-02
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
workingDir: /workspace/components/backends/vllm
command:
- /bin/sh
......@@ -118,7 +118,7 @@ spec:
port: 9090
periodSeconds: 10
failureThreshold: 60
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0814-02
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
workingDir: /workspace/components/backends/vllm
command:
- /bin/sh
......@@ -141,7 +141,7 @@ spec:
port: 9090
periodSeconds: 10
failureThreshold: 60
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0814-02
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
workingDir: /workspace/components/backends/vllm
command:
- /bin/sh
......
......@@ -13,7 +13,7 @@ spec:
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
envs:
- name: DYN_ROUTER_MODE
value: kv
......@@ -27,7 +27,7 @@ spec:
gpu: "1"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
workingDir: /workspace/components/backends/vllm
command:
- /bin/sh
......@@ -44,7 +44,7 @@ spec:
gpu: "1"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
workingDir: /workspace/components/backends/vllm
command:
- /bin/sh
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Package marker for deploy utilities
# Kubernetes utilities for Dynamo
This directory contains small utilities and manifests used by benchmarking and profiling flows.
## Contents
- `setup_k8s_namespace.sh` - **fully encapsulated deployment setup** that runs once per Kubernetes namespace. It creates the namespace (if missing), applies common manifests, installs CRDs, and deploys the Dynamo operator. If both `DOCKER_SERVER` and `IMAGE_TAG` are provided, it installs your custom operator image; otherwise it installs the default published image. If your registry is private, provide `DOCKER_USERNAME`/`DOCKER_PASSWORD` or respond to the prompt to create an image pull secret.
- `manifests/`
- `serviceaccount.yaml` — ServiceAccount `dynamo-sa`
- `role.yaml` — Role `dynamo-role`
- `rolebinding.yaml` — RoleBinding `dynamo-binding`
- `pvc.yaml` — PVC `dynamo-pvc`
- `pvc-access-pod.yaml` — short‑lived pod for copying profiler results from the PVC
- `kubernetes.py` — helper used by tooling to apply/read resources (e.g., access pod for PVC downloads).
## Quick start
### Kubernetes Setup (one-time per namespace)
Use the helper script to prepare a Kubernetes namespace with the common manifests and install the operator. This provides a **fully encapsulated deployment setup**.
This script creates a Kubernetes namespace with the given name if it does not yet exist. It then applies common manifests (serviceaccount, role, rolebinding, pvc), installs CRDs, creates secrets, and deploys the Dynamo Cloud Operator to your namespace.
If your namespace is already set up, you can skip this step.
```bash
export HF_TOKEN=<HF_TOKEN>
export DOCKER_SERVER=<YOUR_DOCKER_SERVER>
NAMESPACE=benchmarking HF_TOKEN=$HF_TOKEN DOCKER_SERVER=$DOCKER_SERVER deploy/utils/setup_k8s_namespace.sh
# If you want to build and push a new Docker image for the Dynamo Cloud Operator, include an IMAGE_TAG
# NAMESPACE=benchmarking HF_TOKEN=$HF_TOKEN DOCKER_SERVER=$DOCKER_SERVER IMAGE_TAG=latest deploy/utils/setup_k8s_namespace.sh
```
This script applies the following manifests:
- `deploy/utils/manifests/serviceaccount.yaml` - ServiceAccount `dynamo-sa`
- `deploy/utils/manifests/role.yaml` - Role `dynamo-role`
- `deploy/utils/manifests/rolebinding.yaml` - RoleBinding `dynamo-binding`
- `deploy/utils/manifests/pvc.yaml` - PVC `dynamo-pvc`
If `DOCKER_SERVER` and `IMAGE_TAG` are not both provided, the script deploys the operator using the default published image `nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.4.0`.
To build/push and use a new image instead, pass both `DOCKER_SERVER` and `IMAGE_TAG`.
This script also installs the Dynamo CRDs if not present.
If the registry is private, either pass credentials or respond to the prompt:
```bash
NAMESPACE=benchmarking \
DOCKER_SERVER=my-registry.example.com \
IMAGE_TAG=latest \
DOCKER_USERNAME='$oauthtoken' \
DOCKER_PASSWORD=<token> \
deploy/utils/setup_k8s_namespace.sh
```
If `DOCKER_SERVER`/`IMAGE_TAG` are omitted, the script installs the default operator image `nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.4.0`.
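The image selection rule described above can be summarized in a few lines. This is a sketch of the decision only, with a hypothetical helper name; it does not reproduce what `setup_k8s_namespace.sh` does internally, and the name of a custom image is deliberately not modeled.

```python
from typing import Mapping

# Default published image, as stated in this README
DEFAULT_OPERATOR_IMAGE = "nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.4.0"


def uses_custom_operator_image(env: Mapping[str, str]) -> bool:
    """True only when both DOCKER_SERVER and IMAGE_TAG are set and non-empty."""
    return bool(env.get("DOCKER_SERVER")) and bool(env.get("IMAGE_TAG"))


# Only one of the two variables is set, so the default image applies
env = {"DOCKER_SERVER": "my-registry.example.com"}
if uses_custom_operator_image(env):
    print("build and push a custom operator image")
else:
    print(f"install {DEFAULT_OPERATOR_IMAGE}")
```

Passing `os.environ` instead of the literal dictionary applies the same check to the current shell environment.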
After running the setup script, verify the installation by checking the pods:
```bash
kubectl get pods -n $NAMESPACE
```
The output should look something like:
```
NAME READY STATUS RESTARTS AGE
dynamo-platform-dynamo-operator-controller-manager-xxxxx 2/2 Running 0 5m
dynamo-platform-etcd-0 1/1 Running 0 5m
dynamo-platform-nats-0 2/2 Running 0 5m
dynamo-platform-nats-box-xxxxx 1/1 Running 0 5m
```
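If you prefer to check this programmatically, the table printed by `kubectl get pods` can be parsed as plain text. The sketch below uses a hypothetical helper, `not_ready_pods`, and assumes the default column layout shown above (NAME, READY, STATUS first). It flags any pod that is not `Running` with all containers ready, so a pod in `Completed` state would also be reported.

```python
from typing import List


def not_ready_pods(kubectl_output: str) -> List[str]:
    """Return names of pods that are not Running with every container ready."""
    bad = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        name, ready, status = fields[:3]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            bad.append(name)
    return bad


sample = """\
NAME READY STATUS RESTARTS AGE
dynamo-platform-dynamo-operator-controller-manager-xxxxx 2/2 Running 0 5m
dynamo-platform-etcd-0 1/1 Running 0 5m
dynamo-platform-nats-0 2/2 Running 0 5m
dynamo-platform-nats-box-xxxxx 1/1 Running 0 5m
"""

print(not_ready_pods(sample))  # []
```

In practice the input string would come from running `kubectl get pods -n $NAMESPACE` and capturing its standard output.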
### PVC Manipulation Scripts
These scripts interact with the Persistent Volume Claim (PVC) that stores configuration files and benchmark/profiling results. They're essential for the Dynamo benchmarking and profiling workflows.
#### Why These Scripts Are Needed
1. **For Pre-Deployment Profiling**: The profiling job needs access to your Dynamo deployment configurations (DGD manifests) to test different parallelization strategies
2. **For Retrieving Results**: Both benchmarking and profiling jobs write their results to the PVC, which you need to download for analysis
#### Script Usage
**Inject deployment configurations for profiling:**
```bash
# The profiling job reads your DGD config from the PVC
python3 deploy/utils/inject_manifest.py \
--namespace $NAMESPACE \
--src ./my-disagg.yaml \
--dest /configs/disagg.yaml
```
**Download benchmark/profiling results:**
```bash
# After benchmarking or profiling completes, download results
python3 deploy/utils/download_pvc_results.py \
--namespace $NAMESPACE \
--output-dir ./pvc_files \
--folder /results \
--no-config # optional: skip *.yaml/*.yml in the download
```
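Once the files are on disk, they can be inspected with ordinary Python. The sketch below assumes the downloaded tree contains genai-perf exports named `profile_export_genai_perf.json`, the file name the plotting code searches for. Whether your PVC download includes them depends on what the job wrote, and `load_genai_perf_exports` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Dict


def load_genai_perf_exports(root: Path) -> Dict[str, dict]:
    """Map each export's path (relative to root) to its parsed JSON content."""
    exports: Dict[str, dict] = {}
    if not root.is_dir():
        return exports
    for path in sorted(root.rglob("profile_export_genai_perf.json")):
        with open(path, "r") as f:
            exports[str(path.relative_to(root))] = json.load(f)
    return exports


# ./pvc_files is the --output-dir used in the download command above
for rel_path, metrics in load_genai_perf_exports(Path("./pvc_files")).items():
    print(rel_path, sorted(metrics.keys()))
```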
#### Next Steps
For complete benchmarking workflows:
- **Benchmarking Guide**: See [docs/benchmarks/benchmarking.md](../../docs/benchmarks/benchmarking.md) for comparing DynamoGraphDeployments and external endpoints
- **Pre-Deployment Profiling**: See [docs/benchmarks/pre_deployment_profiling.md](../../docs/benchmarks/pre_deployment_profiling.md) for optimizing configurations before deployment
## Notes
- Benchmarking scripts (`benchmarks/benchmark.sh`, `benchmarks/deploy_benchmark.sh`) call this setup automatically when present.
- Profiling job manifest remains in `benchmarks/profiler/deploy/profile_sla_job.yaml` and now relies on the common ServiceAccount/PVC here.