feat: add benchmarking guide (#2620)

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>

Commit 699996e4, authored Aug 29, 2025 by hhzhang16; committed by GitHub Aug 29, 2025
20 changed files
--- a/README.md
+++ b/README.md
@@ -151,6 +151,13 @@ Rerun with `curl -N` and change `stream` in the request to `true` to get the res
 - Check out [Backends](components/backends) to deploy various workflow configurations (e.g. SGLang with router, vLLM with disaggregated serving, etc.)
 - Run some [Examples](examples) to learn about building components in Dynamo and exploring various integrations.

+### Benchmarking Dynamo
+
+Dynamo provides comprehensive benchmarking tools to evaluate and optimize your deployments:
+
+* **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies (aggregated vs. disaggregated vs. vanilla vLLM) using GenAI-Perf
+* **[Pre-Deployment Profiling](docs/benchmarks/pre_deployment_profiling.md)** – Optimize configurations before deployment to meet SLA requirements
+
 # Engines

 Dynamo is designed to be inference engine agnostic. To use any engine with Dynamo, NATS and etcd need to be installed, along with a Dynamo frontend (`python -m dynamo.frontend [--interactive]`).

--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -15,19 +15,72 @@

 # Benchmarks

-This directory contains benchmarking scripts and tools for performance evaluation.
+This directory contains benchmarking scripts and tools for performance evaluation of Dynamo deployments. The benchmarking framework is a wrapper around genai-perf that makes it easy to benchmark DynamoGraphDeployments and compare them with external endpoints.
+
+## Quick Start
+
+### Benchmark an Existing Endpoint
+```bash
+./benchmark.sh --namespace my-namespace --input my-endpoint=http://your-endpoint:8000
+```
+
+### Benchmark Dynamo Deployments
+```bash
+# Benchmark disaggregated vLLM with custom label
+./benchmark.sh --namespace my-namespace --input vllm-disagg=components/backends/vllm/deploy/disagg.yaml
+
+# Benchmark TensorRT-LLM disaggregated deployment
+./benchmark.sh --namespace my-namespace --input trtllm-disagg=components/backends/trtllm/deploy/disagg.yaml
+
+# Compare multiple Dynamo deployments
+./benchmark.sh --namespace my-namespace \
+  --input agg=components/backends/vllm/deploy/agg.yaml \
+  --input disagg=components/backends/vllm/deploy/disagg.yaml
+
+# Compare Dynamo vs external endpoint
+./benchmark.sh --namespace my-namespace \
+  --input dynamo=components/backends/vllm/deploy/disagg.yaml \
+  --input external=http://localhost:8000
+```
+
+**Note**:
+- The sample manifests may reference private registry images. Update the `image:` fields to use accessible images from [Dynamo NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts) or your own registry before running.
+- Only DynamoGraphDeployment manifests are supported for automatic deployment. To benchmark non-Dynamo backends (vLLM, TensorRT-LLM, SGLang, etc.), deploy them manually using their Kubernetes guides and use the endpoint option.
+
+## Features
+
+The benchmarking framework supports:
+
+**Two Benchmarking Modes:**
+- **Endpoint Benchmarking**: Test existing HTTP endpoints without deployment overhead
+- **Deployment Benchmarking**: Deploy, test, and cleanup DynamoGraphDeployments automatically
+
+**Flexible Configuration:**
+- User-defined labels for each input using `--input label=value` format
+- Support for multiple inputs to enable comparisons
+- Customizable concurrency levels (configurable via CONCURRENCIES env var), sequence lengths, and models
+- Automated performance plot generation with custom labels
+
+**Supported Backends:**
+- DynamoGraphDeployments
+- External HTTP endpoints (for comparison with non-Dynamo backends)

 ## Installation

-This is already included as part of the dynamo vllm image. To install locally or standalone, run:
+This is already included as part of the Dynamo container images. To install locally or standalone:

 ```bash
 pip install -e .
 ```

-Currently, this will install lightweight tools for:
+## Data Generation Tools
+
+This directory also includes lightweight tools for:
 - Analyzing prefix-structured data (`datagen analyze`)
 - Synthesizing structured data customizable for testing purposes (`datagen synthesize`)
-Detailed information are provided in the `prefix_data_generator` directory.

-The benchmarking scripts for the core dynamo components are to come soon (e.g. routing, disagg, Planner).
\ No newline at end of file
+Detailed information is provided in the `prefix_data_generator` directory.
+
+## Comprehensive Guide
+
+For detailed documentation, configuration options, and advanced usage, see the [complete benchmarking guide](../docs/benchmarks/benchmarking.md).
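The `CONCURRENCIES` override listed under "Flexible Configuration" takes a comma-separated list of positive integers. The following is a minimal Python sketch of that parsing, modelled on `get_concurrency_levels` in `benchmarks/utils/genai.py` from this commit. The name `parse_concurrencies` and its `default` parameter are hypothetical and exist only for illustration.

```python
from typing import List

# Defaults copied from benchmarks/utils/genai.py in this commit.
DEFAULT_CONCURRENCIES: List[int] = [1, 2, 5, 10, 50, 100, 250]


def parse_concurrencies(raw: str, default: List[int] = DEFAULT_CONCURRENCIES) -> List[int]:
    """Hypothetical helper: parse a CONCURRENCIES-style string such as "1,5,20"."""
    try:
        levels = [int(part.strip()) for part in raw.split(",")]
        if any(level <= 0 for level in levels):
            raise ValueError("concurrency levels must be positive")
    except ValueError:
        # Mirror the fallback in genai.py: an invalid value yields the defaults.
        return list(default)
    return sorted(levels)


print(parse_concurrencies("20, 1, 5"))  # [1, 5, 20]
print(parse_concurrencies("1,,5"))      # empty entry is invalid, so the defaults are returned
```

Because `benchmark.sh` launches the Python module as a child process, exporting the variable before the call (for example `CONCURRENCIES=1,5,20 ./benchmark.sh ...`) should be enough for it to take effect.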
--- a/benchmarks/benchmark.sh
+++ b/benchmarks/benchmark.sh
+#!/bin/bash
+
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+set -euo pipefail
+
+# Script directory
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+DYNAMO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+
+# Configuration - all set via command line arguments
+NAMESPACE=""
+MODEL="deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
+ISL=2000
+STD=10
+OSL=256
+OUTPUT_DIR="./benchmarks/results"
+
+# Input configurations stored as associative arrays
+declare -A INPUT_LABELS
+declare -A INPUT_VALUES
+
+# Flags
+VERBOSE=false
+
+show_help() {
+    cat << EOF
+Dynamo Benchmark Runner
+
+This script is a wrapper around genai-perf that benchmarks Dynamo LLM deployments and
+plots the results in an easy-to-use way. It supports comparing multiple DynamoGraphDeployments
+or endpoints with custom labels defined by you.
+
+The client runs locally and connects to your deployments/endpoints for benchmarking.
+
+USAGE:
+    $0 --namespace NAMESPACE --input <label>=<manifest_or_endpoint> [--input <label>=<manifest_or_endpoint>]... [OPTIONS]
+
+REQUIRED:
+    -n, --namespace NAMESPACE           Kubernetes namespace
+    --input <label>=<manifest_path_or_endpoint>  Benchmark input with custom label
+                                          - <label>: becomes the name/label in plots
+                                          - <manifest_path_or_endpoint>: either a DynamoGraphDeployment manifest or HTTP endpoint URL
+                                          Can be specified multiple times for comparisons
+
+OPTIONS:
+    -h, --help                    Show this help message
+    -m, --model MODEL             Model name for GenAI-Perf configuration and logging (default: deepseek-ai/DeepSeek-R1-Distill-Llama-8B)
+                                  NOTE: This must match the model configured in your deployment manifests and the model deployed in any endpoints.
+    -i, --isl LENGTH              Input sequence length (default: $ISL)
+    -s, --std STDDEV              Input sequence standard deviation (default: $STD)
+    -o, --osl LENGTH              Output sequence length (default: $OSL)
+    -d, --output-dir DIR          Output directory (default: $OUTPUT_DIR)
+    --verbose                     Enable verbose output
+
+EXAMPLES:
+    # Compare aggregated vs disaggregated Dynamo deployments
+    $0 --namespace \$NAMESPACE \\
+       --input agg=components/backends/vllm/deploy/agg.yaml \\
+       --input disagg=components/backends/vllm/deploy/disagg.yaml
+
+    # Compare Dynamo deployment vs external endpoint
+    $0 --namespace \$NAMESPACE \\
+       --input dynamo=components/backends/vllm/deploy/disagg.yaml \\
+       --input external=http://localhost:8000
+
+    # Compare three different configurations
+    $0 --namespace \$NAMESPACE \\
+       --input dynamo-agg=components/backends/vllm/deploy/agg.yaml \\
+       --input dynamo-disagg=components/backends/vllm/deploy/disagg.yaml \\
+       --input external-vllm=http://localhost:8000
+
+    # Benchmark a single Dynamo deployment
+    $0 --namespace \$NAMESPACE \\
+       --input my-setup=components/backends/vllm/deploy/disagg.yaml
+
+    # Benchmark single external endpoint
+    $0 --namespace \$NAMESPACE \\
+       --input production=http://localhost:8000
+
+DEPLOYMENT TYPES:
+    - DynamoGraphDeployment: Supports various Dynamo deployment configurations including:
+      * Aggregated deployments (prefill and decode together)
+      * Disaggregated deployments (prefill and decode separate)
+      * Router deployments
+      * Planner deployments
+      * And other Dynamo configurations
+    - External Endpoints: For comparing against non-Dynamo backends
+
+NOTE:
+    - Only DynamoGraphDeployment manifests are supported for automatic deployment.
+    - To benchmark non-Dynamo backends (vLLM, TensorRT-LLM, SGLang, etc.), deploy them
+      manually following their Kubernetes deployment guides, expose a port (e.g. via port-forward),
+      and use the endpoint option.
+    - For Dynamo deployment setup, setup_k8s_namespace.sh provides fully encapsulated
+      deployment setup including namespace creation, CRDs, and operator installation.
+    - The --model flag configures GenAI-Perf and should match what's configured in your deployment manifests and endpoints.
+    - Only one model can be benchmarked at a time across all inputs.
+
+EOF
+}
+
+parse_input() {
+    local input_arg="$1"
+
+    # Basic format validation: must contain exactly one '=' character
+    if [[ ! "$input_arg" =~ ^[^=]+=[^=]+$ ]]; then
+        echo "ERROR: Invalid input format. Expected: <label>=<manifest_path_or_endpoint>" >&2
+        echo "Got: $input_arg" >&2
+        echo "Format must be: key=value with exactly one '=' character" >&2
+        exit 1
+    fi
+
+    # Split on the first '=' character
+    local label="${input_arg%%=*}"
+    local value="${input_arg#*=}"
+
+    # Basic validation - detailed validation will be done in Python
+    if [[ -z "$label" ]]; then
+        echo "ERROR: Label cannot be empty in input: $input_arg" >&2
+        exit 1
+    fi
+
+    if [[ -z "$value" ]]; then
+        echo "ERROR: Value cannot be empty in input: $input_arg" >&2
+        exit 1
+    fi
+
+    # Check for duplicate labels
+    if [[ -n "${INPUT_LABELS[$label]:-}" ]]; then
+        echo "ERROR: Duplicate label '$label' found. Each label must be unique." >&2
+        exit 1
+    fi
+
+    # Store the input
+    INPUT_LABELS["$label"]=1
+    INPUT_VALUES["$label"]="$value"
+
+    echo "Added input: $label -> $value"
+}
+
+parse_args() {
+    while [[ $# -gt 0 ]]; do
+        case $1 in
+            -h|--help)
+                show_help
+                exit 0
+                ;;
+            -n|--namespace)
+                NAMESPACE="$2"
+                shift 2
+                ;;
+            -m|--model)
+                MODEL="$2"
+                shift 2
+                ;;
+            -i|--isl)
+                ISL="$2"
+                shift 2
+                ;;
+            -s|--std)
+                STD="$2"
+                shift 2
+                ;;
+            -o|--osl)
+                OSL="$2"
+                shift 2
+                ;;
+            -d|--output-dir)
+                OUTPUT_DIR="$2"
+                shift 2
+                ;;
+            --input)
+                parse_input "$2"
+                shift 2
+                ;;
+            --verbose)
+                VERBOSE=true
+                shift
+                ;;
+            *)
+                echo "Unknown option: $1" >&2
+                echo "Use --help for usage information." >&2
+                exit 1
+                ;;
+        esac
+    done
+}
+
+validate_config() {
+    local errors=()
+
+    if [[ -z "$NAMESPACE" ]]; then
+        errors+=("--namespace is required")
+    fi
+
+    # Check that at least one input is specified
+    if [[ ${#INPUT_LABELS[@]} -eq 0 ]]; then
+        errors+=("At least one --input must be specified")
+    fi
+
+    if [[ ${#errors[@]} -gt 0 ]]; then
+        echo "ERROR: Missing required arguments:" >&2
+        for error in "${errors[@]}"; do
+            echo "  $error" >&2
+        done
+        echo "Use --help for usage information." >&2
+        exit 1
+    fi
+
+    # Validate that specified files exist and endpoints are valid URLs
+    for label in "${!INPUT_VALUES[@]}"; do
+        local value="${INPUT_VALUES[$label]}"
+
+        # Check if it's a URL (starts with http:// or https://)
+        if [[ "$value" =~ ^https?:// ]]; then
+            echo "Input '$label': endpoint $value"
+        else
+            # It should be a file path - validate it exists
+            if [[ ! -f "$value" ]]; then
+                echo "ERROR: Manifest file not found for input '$label': $value" >&2
+                exit 1
+            fi
+            echo "Input '$label': manifest $value"
+        fi
+    done
+
+    if [[ ! "$ISL" =~ ^[0-9]+$ ]] || [[ "$ISL" -le 0 ]]; then
+        echo "ERROR: ISL must be a positive integer, got: $ISL" >&2
+        exit 1
+    fi
+
+    if [[ ! "$OSL" =~ ^[0-9]+$ ]] || [[ "$OSL" -le 0 ]]; then
+        echo "ERROR: OSL must be a positive integer, got: $OSL" >&2
+        exit 1
+    fi
+
+    if [[ ! "$STD" =~ ^[0-9]+$ ]] || [[ "$STD" -lt 0 ]]; then
+        echo "ERROR: STD must be a non-negative integer, got: $STD" >&2
+        exit 1
+    fi
+}
+
+print_config() {
+    echo "=== Benchmark Configuration ==="
+    echo "Namespace:              $NAMESPACE"
+    echo "Model:                  $MODEL"
+    echo "Input Sequence Length:  $ISL tokens"
+    echo "Output Sequence Length: $OSL tokens"
+    echo "Sequence Std Dev:       $STD tokens"
+    echo "Output Directory:       $OUTPUT_DIR"
+    echo ""
+    echo "Benchmark Inputs:"
+
+    for label in "${!INPUT_VALUES[@]}"; do
+        local value="${INPUT_VALUES[$label]}"
+        if [[ "$value" =~ ^https?:// ]]; then
+            echo "  $label: endpoint $value"
+        else
+            echo "  $label: manifest $value"
+        fi
+    done
+
+    echo "==============================="
+    echo
+}
+
+clear_output_directory() {
+    if [[ -d "$OUTPUT_DIR" ]]; then
+        echo "🧹 Clearing existing output directory: $OUTPUT_DIR"
+        rm -rf "$OUTPUT_DIR"
+    fi
+    mkdir -p "$OUTPUT_DIR"
+    echo "✅ Output directory prepared: $OUTPUT_DIR"
+}
+
+run_benchmark() {
+    echo "🚀 Starting benchmark workflow..."
+
+    # Clear and recreate output directory
+    clear_output_directory
+
+    # Change to dynamo root directory
+    cd "$DYNAMO_ROOT"
+
+    local cmd=(
+        python3 -u -m benchmarks.utils.benchmark
+        --namespace "$NAMESPACE"
+        --model "$MODEL"
+        --isl "$ISL"
+        --std "$STD"
+        --osl "$OSL"
+        --output-dir "$OUTPUT_DIR"
+    )
+
+    # Add all input arguments
+    for label in "${!INPUT_VALUES[@]}"; do
+        local value="${INPUT_VALUES[$label]}"
+        cmd+=(--input "$label=$value")
+    done
+
+    if [[ "$VERBOSE" == "true" ]]; then
+        echo "Executing: ${cmd[*]}"
+    fi
+
+    if ! "${cmd[@]}"; then
+        echo "❌ Benchmark failed!" >&2
+        exit 1
+    fi
+
+    echo "✅ Benchmark completed successfully!"
+}
+
+generate_plots() {
+    echo "📊 Generating performance plots..."
+
+    cd "$DYNAMO_ROOT"
+
+    local plot_cmd=(
+        python3 -m benchmarks.utils.plot
+        --data-dir "$OUTPUT_DIR"
+    )
+
+    if [[ "$VERBOSE" == "true" ]]; then
+        echo "Executing: ${plot_cmd[*]}"
+    fi
+
+    if ! "${plot_cmd[@]}"; then
+        echo "⚠️  Plot generation failed, but benchmark data is still available" >&2
+        return 1
+    fi
+
+    echo "✅ Plots generated successfully!"
+    echo "📁 Results available at: $OUTPUT_DIR"
+    echo "📈 Plots available at: $OUTPUT_DIR/plots"
+}
+
+main() {
+    trap cleanup EXIT
+
+    parse_args "$@"
+    validate_config
+    print_config
+    if [[ "$VERBOSE" == "true" ]]; then
+        export DYNAMO_VERBOSE=true
+    fi
+
+    local start_time
+    start_time=$(date +%s)
+
+    run_benchmark
+    generate_plots
+
+    local end_time
+    end_time=$(date +%s)
+    local duration
+    duration=$((end_time - start_time))
+
+    echo
+    echo "🎉 All done!"
+    echo "⏱️  Total time: ${duration}s"
+    echo "📁 Results: $OUTPUT_DIR"
+    echo "📊 Plots: $OUTPUT_DIR/plots"
+}
+
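+cleanup() {
+    # Use the exit code passed by the trap when present; otherwise fall back to $?
+    local exit_code="${1:-$?}"
+    if [[ "$exit_code" -ne 0 ]]; then
+        echo "❌ Script failed. Check logs above for details." >&2
+    fi
+}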
+
+# Only run main if script is executed directly (not sourced)
+if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
+    trap 'cleanup $?' EXIT
+    main "$@"
+fi
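`validate_config` decides between the two input kinds with a single prefix test: anything starting with `http://` or `https://` is an endpoint, and everything else must be an existing manifest file. Here is a rough Python sketch of the same rule. `classify_input` is a hypothetical name; the actual Python-side classification lives in `categorize_inputs` from `benchmarks.utils.workflow`, which is not shown in this part of the diff.

```python
import re
import tempfile
from pathlib import Path

# Same prefix test as the `^https?://` check in benchmark.sh.
_URL_PREFIX = re.compile(r"^https?://")


def classify_input(value: str) -> str:
    """Hypothetical helper: return "endpoint" or "manifest" for an --input value."""
    if _URL_PREFIX.match(value):
        return "endpoint"
    # Non-URL values are treated as manifest paths and must exist as regular files.
    if not Path(value).is_file():
        raise FileNotFoundError(f"Manifest file not found: {value}")
    return "manifest"


print(classify_input("http://localhost:8000"))  # endpoint

# A temporary file stands in for a DynamoGraphDeployment manifest.
with tempfile.NamedTemporaryFile(suffix=".yaml") as manifest:
    print(classify_input(manifest.name))  # manifest
```

Note that a URL with any other scheme, or a bare `host:port`, would fall through to the file check and be reported as a missing manifest.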
--- a/benchmarks/profiler/deploy/profile_sla_job.yaml
+++ b/benchmarks/profiler/deploy/profile_sla_job.yaml
@@ -8,7 +8,7 @@ metadata:
 spec:
  template:
    spec:
-      serviceAccountName: profile-sla-sa
+      serviceAccountName: dynamo-sa
      containers:
      - name: profile-sla
        image: ${DOCKER_IMAGE}
@@ -26,10 +26,10 @@ spec:
            value: nats://${NAMESPACE}-nats:4222
          - name: ETCD_ENDPOINTS
            value: ${NAMESPACE}-etcd:2379
-        command: ["python", "/workspace/benchmarks/profiler/profile_sla.py"]
+        command: ["python", "-m", "benchmarks.profiler.profile_sla"]
        args:
          - --config
-          - ${DGD_CONFIG_FILE}
+          - /workspace/configs/disagg.yaml
          - --output-dir
          - /workspace/profiling_results
          - --namespace
@@ -51,9 +51,14 @@ spec:
        volumeMounts:
          - name: output-volume
            mountPath: /workspace/profiling_results
+          - name: configs
+            mountPath: /workspace/configs
      restartPolicy: Never
      volumes:
        - name: output-volume
          persistentVolumeClaim:
-            claimName: profiling-pvc
+            claimName: dynamo-pvc
+        - name: configs
+          persistentVolumeClaim:
+            claimName: dynamo-pvc
  backoffLimit: 0
--- a/benchmarks/profiler/profile_sla.py
+++ b/benchmarks/profiler/profile_sla.py
@@ -23,10 +23,6 @@ import numpy as np
 import yaml
 from utils.config import CONFIG_MODIFIERS, WORKER_COMPONENT_NAMES
 from utils.defaults import DECODE_NUM_REQUESTS_RANGE
-from utils.dynamo_deployment import (
-    DynamoDeploymentClient,
-    cleanup_remaining_deployments,
-)
 from utils.genai_perf import benchmark_decode, benchmark_prefill
 from utils.plot import plot_decode_performance, plot_prefill_performance
 from utils.profile_cache import (
@@ -38,6 +34,11 @@ from utils.profile_cache import (
 from utils.profile_decode import profile_decode
 from utils.profile_prefill import profile_prefill

+from deploy.utils.dynamo_deployment import (
+    DynamoDeploymentClient,
+    cleanup_remaining_deployments,
+)
+
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.INFO)
 console_handler = logging.StreamHandler()
@@ -170,10 +171,10 @@ async def run_profile(args):
                prefill_ttft.append(ttft)
                prefill_thpt_per_gpu.append(args.isl / ttft / tp_size * 1000)

-            print("Cleaning up deployment...")
+            logger.info("Cleaning up deployment...")
            await client.delete_deployment()
            deployment_clients.remove(client)
-            print("Deployment deleted")
+            logger.info("Deployment deleted")

        # Plot the results as a 2D scatter plot
        if prefill_tp_size and prefill_ttft and prefill_thpt_per_gpu:
@@ -270,7 +271,7 @@ async def run_profile(args):
            )
            max_concurrency = max_kv_tokens // (args.isl + args.osl)
            sweep_num_request = [
-                num for num in DECODE_NUM_REQUESTS_RANGE if num < max_concurrency
+                num for num in DECODE_NUM_REQUESTS_RANGE if num <= max_concurrency
            ]
            logger.info(
                f"Sweeping num_request range based on maximum number of kv tokens: {sweep_num_request}"
@@ -303,10 +304,10 @@ async def run_profile(args):
                    decode_concurrency.append(num_request)
                    decode_kv_cache_size.append(max_kv_tokens)

-            print("Cleaning up deployment...")
+            logger.info("Cleaning up deployment...")
            await client.delete_deployment()
            deployment_clients.remove(client)
-            print("Deployment deleted")
+            logger.info("Deployment deleted")

            # Store partial results for plotting later
            decode_results.append(
@@ -318,6 +319,11 @@ async def run_profile(args):
            plot_decode_performance(decode_results, args.itl, args.output_dir)

        logger.info("Analyzing results and generate recommendations...")
+        # Safety guards: no results → exit early with a clear message
+        if not (prefill_tp_size and prefill_ttft and prefill_thpt_per_gpu):
+            logger.error("No prefill results produced; skipping recommendations.")
+            return
+
        # select best tp size for prefill
        if min(prefill_ttft) > args.ttft:
            logger.info(
@@ -349,6 +355,15 @@ async def run_profile(args):
        )

        # select best tp size for decode
+        if not (
+            decode_tp_size
+            and decode_itl
+            and decode_thpt_per_gpu
+            and decode_concurrency
+            and decode_kv_cache_size
+        ):
+            logger.error("No decode results produced; skipping recommendations.")
+            return
        if min(decode_itl) > args.itl:
            logger.info(
                "No TP size satisfies the ITL requirement, please try a smaller model or a more powerful GPU SKU"
@@ -367,7 +382,7 @@ async def run_profile(args):
        # calculate kv cache utlization for the selected TP and concurrency
        selected_decode_kv_cache_utilization = (
            decode_concurrency[selected_decode_idx]
-            * (args.isl + args.osl / 2)
+            * (args.isl + (args.osl / 2))
            / decode_kv_cache_size[selected_decode_idx]
        )
        # set a +- 20% range for the kv cache utilization
@@ -433,10 +448,10 @@ async def run_profile(args):
            args.prefill_interpolation_granularity,
        )

-        print("Cleaning up deployment...")
+        logger.info("Cleaning up deployment...")
        await client.delete_deployment()
        deployment_clients.remove(client)
-        print("Deployment deleted")
+        logger.info("Deployment deleted")

        # interpolate ITL - Active_KV_Cache - Decode_Context_Length with best decode TP
        best_decode_tp = decode_tp_size[selected_decode_idx]
@@ -490,10 +505,10 @@ async def run_profile(args):
            args.decode_interpolation_granularity,
        )

-        print("Cleaning up deployment...")
+        logger.info("Cleaning up deployment...")
        await client.delete_deployment()
        deployment_clients.remove(client)
-        print("Deployment deleted")
+        logger.info("Deployment deleted")

    except Exception as e:
        logger.error(f"Profile job failed with error: {e}")
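Two of the hunks above change arithmetic that is easier to follow with numbers: the decode sweep now includes the boundary value (`<=` instead of `<`), and the KV cache utilization formula counts the full input length plus half the output length per request. The sketch below works through both. The candidate list, token budget and sequence lengths are made-up values, and the function names are hypothetical; the real range comes from `DECODE_NUM_REQUESTS_RANGE`, whose contents are not shown in this diff.

```python
from typing import List


def sweep_num_requests(candidates: List[int], max_kv_tokens: int, isl: int, osl: int) -> List[int]:
    """Keep candidate concurrencies that fit in the KV cache (boundary included)."""
    max_concurrency = max_kv_tokens // (isl + osl)
    return [num for num in candidates if num <= max_concurrency]


def kv_cache_utilization(concurrency: int, isl: int, osl: int, kv_cache_size: int) -> float:
    """Same expression as in run_profile: concurrency * (isl + osl / 2) / kv_cache_size."""
    return concurrency * (isl + (osl / 2)) / kv_cache_size


# Assumed values for illustration only.
candidates = [1, 5, 10, 25, 50]
max_kv_tokens = 30_000
isl, osl = 2000, 1000

# max_concurrency = 30000 // 3000 = 10, so 10 is kept with `<=` (it was dropped with `<`).
print(sweep_num_requests(candidates, max_kv_tokens, isl, osl))  # [1, 5, 10]

# 10 * (2000 + 500) / 30000
print(round(kv_cache_utilization(10, isl, osl, max_kv_tokens), 4))  # 0.8333
```

With the previous strict comparison, a deployment whose KV cache holds exactly ten requests would never have been profiled at concurrency 10.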

--- a/benchmarks/profiler/utils/utils.py
+++ b/benchmarks/profiler/utils/utils.py
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import logging
-import os
-import signal
-import subprocess
-import time
-
-import pynvml
-import requests
-
-logger = logging.getLogger(__name__)
-logger.setLevel(logging.INFO)
-console_handler = logging.StreamHandler()
-console_handler.setLevel(logging.INFO)
-formatter = logging.Formatter(
-    "%(asctime)s - %(name)s - %(levelname)s - %(message)s", "%Y-%m-%d %H:%M:%S"
-)
-console_handler.setFormatter(formatter)
-logger.addHandler(console_handler)
-
-
-def get_dynamo_serve_cmd(config_file_path):
-    config_file_path = os.path.abspath(config_file_path)
-    return [
-        "dynamo",
-        "serve",
-        "graphs.agg:Frontend",
-        "-f",
-        config_file_path,
-    ]
-
-
-def get_available_gpu_count():
-    try:
-        pynvml.nvmlInit()
-        gpu_count = pynvml.nvmlDeviceGetCount()
-
-        if gpu_count > 0:
-            logger.info(f"Detected {gpu_count} GPUs in the system:")
-            for i in range(gpu_count):
-                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
-                name = pynvml.nvmlDeviceGetName(handle)
-                memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
-                total_memory_mb = memory.total / (1024 * 1024)
-                free_memory_mb = memory.free / (1024 * 1024)
-                logger.info(
-                    f"  GPU {i}: {name}, Total Memory: {total_memory_mb:.2f} MB, Free Memory: {free_memory_mb:.2f} MB"
-                )
-        else:
-            logger.warning("No GPUs detected with pynvml.")
-
-        pynvml.nvmlShutdown()
-        return gpu_count
-    except ImportError:
-        logger.error(
-            "pynvml module not found. Please install it with 'pip install pynvml'"
-        )
-        return 0
-    except pynvml.NVMLError as e:
-        logger.error(f"NVML Error: {e}")
-        return 0
-    except Exception as e:
-        logger.error(f"Error detecting GPUs: {e}")
-        return 0
-
-
-def shutdown_deployment(dynamo_process):
-    os.killpg(os.getpgid(dynamo_process.pid), signal.SIGINT)
-    dynamo_process.communicate()
-
-    try:
-        current_pid = os.getpid()
-        ps_cmd = ["ps", "-ef"]
-        ps_output = subprocess.check_output(ps_cmd, text=True)
-        for line in ps_output.splitlines():
-            if "python" in line.lower():
-                parts = line.split()
-                if len(parts) >= 2:
-                    try:
-                        pid = int(parts[1])
-                        if pid != current_pid:  # Exclude current process
-                            os.kill(pid, signal.SIGKILL)
-                    except ValueError:
-                        continue
-    except Exception as e:
-        logger.error(f"Error killing Python processes: {e}")
-    time.sleep(5)
-
-
-def wait_for_server_ready(model_name: str, port: int, timeout: int = 300):
-    logger.info("Waiting for the server to be ready...")
-    endpoint_url = f"http://localhost:{port}/v1/chat/completions"
-    start_time = time.time()
-    server_ready = False
-
-    while time.time() - start_time < timeout:
-        try:
-            # Send a simple request to check if the server is up
-            response = requests.post(
-                endpoint_url,
-                json={
-                    "model": model_name,
-                    "messages": [{"role": "user", "content": "Hello"}],
-                    "max_tokens": 1,
-                },
-                timeout=5,
-            )
-            if response.status_code != 200:
-                logger.info(
-                    f"Server returned status code {response.status_code}, waiting..."
-                )
-                time.sleep(5)
-                continue
-            logger.info(f"Server is ready after {time.time() - start_time:.2f} seconds")
-            server_ready = True
-            break
-
-        except (requests.RequestException, ConnectionError) as e:
-            logger.info(f"Server not ready yet: {e}")
-        time.sleep(5)
-
-    return server_ready
--- a/benchmarks/utils/__init__.py
+++ b/benchmarks/utils/__init__.py
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Package marker for benchmarks utilities
--- a/benchmarks/utils/benchmark.py
+++ b/benchmarks/utils/benchmark.py
+#!/usr/bin/env python3
+
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+import argparse
+import asyncio
+import sys
+from typing import Tuple
+
+from benchmarks.utils.workflow import categorize_inputs, run_benchmark_workflow
+
+
+def parse_input(input_str: str) -> Tuple[str, str]:
+    """Parse input string in format key=value with additional validation"""
+    # partition() splits on the first '=' only; an empty separator means no '=' was present
+    label, sep, value = input_str.partition("=")
+    if not sep:
+        raise ValueError(
+            f"Invalid input format. Expected: <label>=<manifest_path_or_endpoint>, got: {input_str}"
+        )
+
+    if not label.strip():
+        raise ValueError("Label cannot be empty")
+    if not value.strip():
+        raise ValueError("Value cannot be empty")
+
+    label = label.strip()
+    value = value.strip()
+
+    # Validate label characters
+    import re
+
+    if not re.match(r"^[a-zA-Z0-9_-]+$", label):
+        raise ValueError(
+            f"Label must contain only letters, numbers, hyphens, and underscores. Invalid label: {label}"
+        )
+
+    return label, value
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description="Benchmark Orchestrator")
+    parser.add_argument(
+        "--input",
+        action="append",
+        dest="inputs",
+        help="Input in format <label>=<manifest_path_or_endpoint>. Can be specified multiple times for comparisons.",
+    )
+    parser.add_argument("--namespace", required=True, help="Kubernetes namespace")
+    parser.add_argument("--isl", type=int, default=200, help="Input sequence length")
+    parser.add_argument(
+        "--std",
+        type=int,
+        default=10,
+        help="Input sequence standard deviation",
+    )
+    parser.add_argument("--osl", type=int, default=200, help="Output sequence length")
+    parser.add_argument(
+        "--model",
+        default="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
+        help="Model name",
+    )
+    parser.add_argument(
+        "--output-dir", type=str, default="benchmarks/results", help="Output directory"
+    )
+    args = parser.parse_args()
+
+    # Validate inputs
+    if not args.inputs:
+        print("ERROR: At least one --input must be specified", file=sys.stderr)
+        return 1
+
+    # Parse inputs
+    try:
+        parsed_inputs = {}
+        for input_str in args.inputs:
+            label, value = parse_input(input_str)
+            if label in parsed_inputs:
+                print(
+                    f"ERROR: Duplicate label '{label}' found. Each label must be unique."
+                )
+                return 1
+            parsed_inputs[label] = value
+
+        # Check for plotting limitations
+        if len(parsed_inputs) > 12:
+            print(
+                f"WARNING: You provided {len(parsed_inputs)} inputs, but the plotting system supports up to 12 inputs."
+            )
+            print(
+                "Consider running separate benchmark sessions or grouping related comparisons together."
+            )
+            print(
+                "Continuing with benchmark, but some inputs may not appear in plots..."
+            )
+            print()
+
+        endpoints, manifests = categorize_inputs(parsed_inputs)
+
+    except (ValueError, FileNotFoundError) as e:
+        print(f"ERROR: {e}")
+        return 1
+
+    # Run the benchmark workflow with the parsed inputs
+    asyncio.run(
+        run_benchmark_workflow(
+            namespace=args.namespace,
+            inputs=parsed_inputs,
+            isl=args.isl,
+            std=args.std,
+            osl=args.osl,
+            model=args.model,
+            output_dir=args.output_dir,
+        )
+    )
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
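The repeated `--input <label>=<value>` handling in `main()` can be tried in isolation. The following is a minimal sketch, not the orchestrator's code: `split_labeled_input` and `collect_inputs` are hypothetical names standing in for `parse_input` and the loop in `main()`, and the label pattern is an assumption inferred from the error message ("letters, numbers, hyphens, and underscores").

```python
import argparse
import re
from typing import Dict, List, Tuple

# Assumption: mirrors the label rule quoted in the error message above.
LABEL_RE = re.compile(r"^[A-Za-z0-9_-]+$")


def split_labeled_input(raw: str) -> Tuple[str, str]:
    """Hypothetical stand-in for parse_input: split on the first '=' only."""
    label, sep, value = raw.partition("=")
    if not sep or not label or not value:
        raise ValueError(f"Expected <label>=<value>, got: {raw}")
    if not LABEL_RE.match(label):
        raise ValueError(f"Invalid label: {label}")
    return label, value


def collect_inputs(argv: List[str]) -> Dict[str, str]:
    """Collect repeated --input flags into a label -> value mapping."""
    parser = argparse.ArgumentParser()
    # action="append" gathers every occurrence of --input into one list.
    parser.add_argument("--input", action="append", dest="inputs", default=[])
    args = parser.parse_args(argv)

    parsed: Dict[str, str] = {}
    for raw in args.inputs:
        label, value = split_labeled_input(raw)
        if label in parsed:
            raise ValueError(f"Duplicate label '{label}' found.")
        parsed[label] = value
    return parsed
```

Splitting on the first `=` only keeps endpoint URLs that carry a query string intact, for example `collect_inputs(["--input", "ep=http://host:8000/v1?x=1"])`.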
--- a/benchmarks/utils/genai.py
+++ b/benchmarks/utils/genai.py
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+import os
+import subprocess
+from pathlib import Path
+from typing import List
+
+# Default concurrency levels - can be overridden with CONCURRENCIES environment variable
+DEFAULT_CONCURRENCIES: List[int] = [1, 2, 5, 10, 50, 100, 250]
+
+
+def get_concurrency_levels() -> List[int]:
+    """Get sorted concurrency levels from the CONCURRENCIES environment variable, falling back to the defaults if it is unset or invalid"""
+    concurrencies_env = os.getenv("CONCURRENCIES")
+    if concurrencies_env:
+        try:
+            # Parse comma-separated values
+            concurrencies = [int(x.strip()) for x in concurrencies_env.split(",")]
+            # Validate all are positive integers
+            for c in concurrencies:
+                if c <= 0:
+                    raise ValueError(f"Concurrency level must be positive, got: {c}")
+            return sorted(concurrencies)
+        except ValueError as e:
+            print(f"WARNING: Invalid CONCURRENCIES environment variable: {e}")
+            print(f"Using default concurrency levels: {DEFAULT_CONCURRENCIES}")
+            return DEFAULT_CONCURRENCIES
+
+    return DEFAULT_CONCURRENCIES
+
+
+CONCURRENCIES: List[int] = get_concurrency_levels()
+
+
+def run_genai_perf(
+    service_url: str,
+    model_name: str,
+    isl: int,
+    osl: int,
+    stddev: int,
+    concurrency: int,
+    output_dir: Path,
+) -> None:
+    output_dir.mkdir(parents=True, exist_ok=True)
+    cmd = [
+        "genai-perf",
+        "profile",
+        "-m",
+        model_name,
+        "--endpoint-type",
+        "chat",
+        "--streaming",
+        "-u",
+        service_url,
+        "--synthetic-input-tokens-mean",
+        str(isl),
+        "--synthetic-input-tokens-stddev",
+        str(stddev),
+        "--concurrency",
+        str(concurrency),
+        "--output-tokens-mean",
+        str(osl),
+        "--extra-inputs",
+        f"max_tokens:{osl}",
+        "--extra-inputs",
+        f"min_tokens:{osl}",
+        "--extra-inputs",
+        "ignore_eos:true",
+        "--tokenizer",
+        model_name,
+        "--artifact-dir",
+        str(output_dir),
+        "--",
+        "-vv",
+        "--max-threads=300",
+    ]
+    print(
+        f"Running genai-perf with isl {isl}, osl {osl}, concurrency {concurrency}",
+        flush=True,
+    )
+
+    # subprocess.run waits for completion and captures both streams in one call
+    gap_process = subprocess.run(
+        cmd,
+        cwd=str(output_dir),
+        stdout=subprocess.PIPE,
+        stderr=subprocess.PIPE,
+        text=True,
+    )
+    stdout, stderr = gap_process.stdout, gap_process.stderr
+    if gap_process.returncode == 0:
+        print("Genai-perf profiling completed successfully", flush=True)
+        if stdout:
+            print(stdout)
+    else:
+        print(f"Genai-perf failed with error code: {gap_process.returncode}")
+        if stderr:
+            print(f"stderr: {stderr}")
+        raise subprocess.CalledProcessError(
+            gap_process.returncode, cmd, output=stdout, stderr=stderr
+        )
+
+
+def run_concurrency_sweep(
+    service_url: str, model_name: str, isl: int, osl: int, stddev: int, output_dir: Path
+) -> None:
+    concurrency_levels = get_concurrency_levels()
+    print(
+        f"Running concurrency sweep for {model_name} with ISL {isl} and OSL {osl} and standard deviation {stddev}",
+        flush=True,
+    )
+    print(f"Concurrency levels: {concurrency_levels}", flush=True)
+
+    for c in concurrency_levels:
+        print(f"Starting concurrency level {c}", flush=True)
+        run_genai_perf(
+            service_url, model_name, isl, osl, stddev, c, output_dir / f"c{c}"
+        )
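The `CONCURRENCIES` override in `get_concurrency_levels` can be illustrated with a pure function that takes the raw string instead of reading the environment. This is a sketch under that assumption; `parse_concurrencies` is a hypothetical name, and unlike the original it returns a copy of the defaults so callers cannot mutate the shared list.

```python
from typing import List, Optional

DEFAULT_CONCURRENCIES: List[int] = [1, 2, 5, 10, 50, 100, 250]


def parse_concurrencies(raw: Optional[str]) -> List[int]:
    """Parse a comma-separated list of positive integers, else use defaults."""
    if not raw:
        return list(DEFAULT_CONCURRENCIES)
    try:
        levels = [int(part.strip()) for part in raw.split(",")]
    except ValueError:
        # Non-numeric entry: fall back rather than abort the benchmark.
        return list(DEFAULT_CONCURRENCIES)
    if any(level <= 0 for level in levels):
        return list(DEFAULT_CONCURRENCIES)
    return sorted(levels)


print(parse_concurrencies("10, 1,5"))  # [1, 5, 10]
print(parse_concurrencies("0,1"))      # falls back to the defaults
```

With the real module, the equivalent would be to set the variable before launching the benchmark, for example `CONCURRENCIES="1,5,10"` in the shell environment.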
--- a/benchmarks/utils/plot.py
+++ b/benchmarks/utils/plot.py
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+import json
+import re
+from pathlib import Path
+from typing import Dict, List, Tuple
+
+import matplotlib.pyplot as plt
+
+
+def parse_benchmark_results(result_dir: Path) -> List[Tuple[int, Dict]]:
+    """
+    Parse benchmark results from a deployment directory.
+
+    Args:
+        result_dir: Path to the result directory
+
+    Returns:
+        List of (concurrency_level, metrics_dict) tuples sorted by concurrency
+    """
+    results = []
+
+    # Find all concurrency directories (e.g., c1, c2, c5, c10, c50, c100, c250)
+    for concurrency_dir in result_dir.iterdir():
+        if not concurrency_dir.is_dir() or not concurrency_dir.name.startswith("c"):
+            continue
+
+        # Extract concurrency level from directory name (must be exactly "c<digits>")
+        match = re.fullmatch(r"c(\d+)", concurrency_dir.name)
+        if not match:
+            continue
+        concurrency = int(match.group(1))
+
+        # Find the genai-perf JSON file
+        genai_perf_json = None
+        for json_file in concurrency_dir.rglob("profile_export_genai_perf.json"):
+            genai_perf_json = json_file
+            break
+
+        if genai_perf_json and genai_perf_json.exists():
+            try:
+                with open(genai_perf_json, "r") as f:
+                    metrics = json.load(f)
+                results.append((concurrency, metrics))
+                print(f"Loaded metrics for concurrency {concurrency}")
+            except Exception as e:
+                print(f"Error loading {genai_perf_json}: {e}")
+        else:
+            print(f"Warning: No genai-perf JSON found for {concurrency_dir}")
+
+    # Sort by concurrency level
+    results.sort(key=lambda x: x[0])
+    return results
+
+
+def extract_metric_series(
+    results: List[Tuple[int, Dict]], metric_path: str, stat: str = "avg"
+) -> Tuple[List[int], List[float]]:
+    """
+    Extract one statistic of a metric as a series across concurrency levels.
+
+    Args:
+        results: List of (concurrency, metrics) tuples
+        metric_path: Dot-separated path to the metric (e.g., 'inter_token_latency')
+        stat: Statistic to extract ('avg', 'p50', 'p90', etc.)
+
+    Returns:
+        Tuple of (concurrency_levels, metric_values)
+    """
+    concurrencies = []
+    values = []
+
+    path_keys = metric_path.split(".")
+    for concurrency, metrics in results:
+        try:
+            node = metrics
+            for k in path_keys:
+                node = node[k]
+            value = node[stat]
+            concurrencies.append(concurrency)
+            values.append(float(value))
+        except (KeyError, TypeError):
+            print(
+                f"Warning: {metric_path}.{stat} not found for concurrency {concurrency}"
+            )
+            continue
+
+    return concurrencies, values
+
+
+def create_plot(
+    title: str,
+    xlabel: str,
+    ylabel: str,
+    data_series: List[Tuple[str, List[int], List[float]]],
+    output_path: Path,
+    log_scale_x: bool = False,
+    log_scale_y: bool = False,
+) -> None:
+    """
+    Create a line plot with multiple series.
+
+    Args:
+        title: Plot title
+        xlabel: X-axis label
+        ylabel: Y-axis label
+        data_series: List of (label, x_values, y_values) tuples
+        output_path: Path to save the plot
+        log_scale_x: Whether to use log scale for X axis
+        log_scale_y: Whether to use log scale for Y axis
+    """
+    plt.figure(figsize=(10, 6))
+
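+    # Same 12-color palette as create_efficiency_plot, so up to 12 series stay distinct
+    colors = [
+        "#1f77b4",
+        "#ff7f0e",
+        "#2ca02c",
+        "#d62728",
+        "#9467bd",
+        "#8c564b",
+        "#e377c2",
+        "#7f7f7f",
+        "#bcbd22",
+        "#17becf",
+        "#aec7e8",
+        "#ffbb78",
+    ]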
+
+    for i, (label, x_vals, y_vals) in enumerate(data_series):
+        if x_vals and y_vals:  # Only plot if we have data
+            plt.plot(
+                x_vals,
+                y_vals,
+                marker="o",
+                linewidth=2,
+                markersize=6,
+                color=colors[i % len(colors)],
+                label=label,
+            )
+
+    plt.title(title, fontsize=14, fontweight="bold")
+    plt.xlabel(xlabel, fontsize=12)
+    plt.ylabel(ylabel, fontsize=12)
+    plt.grid(True, alpha=0.3)
+
+    if log_scale_x:
+        plt.xscale("log")
+    if log_scale_y:
+        plt.yscale("log")
+
+    plt.legend()
+
+    plt.tight_layout()
+    plt.savefig(output_path, dpi=300, bbox_inches="tight")
+    plt.close()
+    print(f"Saved plot: {output_path}")
+
+
+def create_efficiency_plot(
+    deployment_results: Dict, plots_dir: Path, output_tokens: int = 200
+) -> None:
+    """
+    Create an efficiency plot showing tok/s/gpu vs tok/s/user with concurrency as labeled points.
+
+    Args:
+        deployment_results: Dict of deployment_type -> results
+        plots_dir: Directory to save plots
+        output_tokens: Assumed output tokens per request (default 200); should match the OSL used for the run
+    """
+    plt.figure(figsize=(12, 8))
+
+    # Support for up to 12 deployments in the plots
+    colors = [
+        "#1f77b4",
+        "#ff7f0e",
+        "#2ca02c",
+        "#d62728",
+        "#9467bd",
+        "#8c564b",
+        "#e377c2",
+        "#7f7f7f",
+        "#bcbd22",
+        "#17becf",
+        "#aec7e8",
+        "#ffbb78",
+    ]
+    markers = ["o", "s", "^", "D", "v", "<", ">", "p", "*", "h", "H", "+"]
+
+    for deployment_type, results in deployment_results.items():
+        tok_s_per_user = []
+        tok_s_per_gpu = []
+        concurrency_levels = []
+
+        for concurrency, metrics in results:
+            try:
+                # Get request throughput (requests/sec)
+                request_throughput = metrics["request_throughput"]["avg"]
+
+                # Calculate total tokens per second
+                total_tok_s = request_throughput * output_tokens
+
+                # Skip non-positive concurrency; GPU count falls back to 1 when
+                # the metrics carry no cluster.num_gpus field
+                if concurrency <= 0:
+                    continue
+                num_gpus = metrics.get("cluster", {}).get("num_gpus", 1)
+                tok_s_user = total_tok_s / concurrency
+                tok_s_gpu = total_tok_s / max(1, num_gpus)
+
+                tok_s_per_user.append(tok_s_user)
+                tok_s_per_gpu.append(tok_s_gpu)
+                concurrency_levels.append(concurrency)
+
+            except KeyError as e:
+                print(
+                    f"Warning: Missing metric for {deployment_type} concurrency {concurrency}: {e}"
+                )
+                continue
+
+        if tok_s_per_user and tok_s_per_gpu:
+            # Plot points
+            color_idx = list(deployment_results.keys()).index(deployment_type)
+            color = colors[color_idx % len(colors)]
+            marker = markers[color_idx % len(markers)]
+
+            plt.scatter(
+                tok_s_per_user,
+                tok_s_per_gpu,
+                c=color,
+                marker=marker,
+                s=120,
+                alpha=0.8,
+                label=deployment_type.title(),
+                edgecolors="black",
+                linewidth=1.5,
+            )
+
+            # Add concurrency labels
+            for x, y, c in zip(tok_s_per_user, tok_s_per_gpu, concurrency_levels):
+                plt.annotate(
+                    f"{c}",
+                    (x, y),
+                    xytext=(8, 8),
+                    textcoords="offset points",
+                    fontsize=10,
+                    fontweight="bold",
+                    ha="left",
+                )
+
+    plt.title("GPU Efficiency vs User Experience", fontsize=14, fontweight="bold")
+    plt.xlabel("Tokens/sec per User", fontsize=12)
+    plt.ylabel("Tokens/sec per GPU", fontsize=12)
+    plt.grid(True, alpha=0.3)
+
+    # Add a note about what the numbers represent
+    plt.figtext(
+        0.02,
+        0.02,
+        "Note: Numbers on dots indicate concurrency level",
+        fontsize=10,
+        style="italic",
+        alpha=0.7,
+    )
+
+    plt.legend()
+
+    plt.tight_layout()
+    output_path = plots_dir / "efficiency_tok_s_gpu_vs_user.png"
+    plt.savefig(output_path, dpi=300, bbox_inches="tight")
+    plt.close()
+    print(f"Saved efficiency plot: {output_path}")
+
+
+def generate_plots(base_output_dir: Path, output_dir: Path) -> None:
+    """
+    Generate performance plots from benchmark results.
+
+    Args:
+        base_output_dir: Base directory containing benchmark results
+        output_dir: Directory to save plots
+    """
+    print(f"Generating plots from results in {base_output_dir}")
+
+    # Create plots directory
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    # Parse results for each deployment type
+    deployment_results = {}
+
+    # Find all subdirectories that contain benchmark results
+    for item in base_output_dir.iterdir():
+        if item.is_dir() and item.name != "plots":
+            deployment_type = item.name
+            results = parse_benchmark_results(item)
+            if results:
+                deployment_results[deployment_type] = results
+                print(f"Found {len(results)} concurrency levels for {deployment_type}")
+            else:
+                print(f"No valid results found for {deployment_type}")
+
+    if not deployment_results:
+        print("No benchmark results found to plot!")
+        return
+
+    # 1. P50 Inter-token Latency vs Concurrency
+    p50_data = []
+    for deployment_type, results in deployment_results.items():
+        concurrencies, latencies = extract_metric_series(
+            results, "inter_token_latency", "p50"
+        )
+        if concurrencies:
+            p50_data.append((deployment_type.title(), concurrencies, latencies))
+
+    create_plot(
+        title="P50 Inter-Token Latency vs Concurrency",
+        xlabel="Concurrency Level",
+        ylabel="P50 Inter-Token Latency (ms)",
+        data_series=p50_data,
+        output_path=output_dir / "p50_inter_token_latency_vs_concurrency.png",
+        log_scale_x=True,
+    )
+
+    # 2. Average Inter-token Latency vs Concurrency
+    avg_latency_data = []
+    for deployment_type, results in deployment_results.items():
+        concurrencies, latencies = extract_metric_series(
+            results, "inter_token_latency", "avg"
+        )
+        if concurrencies:
+            avg_latency_data.append((deployment_type.title(), concurrencies, latencies))
+
+    create_plot(
+        title="Average Inter-Token Latency vs Concurrency",
+        xlabel="Concurrency Level",
+        ylabel="Average Inter-Token Latency (ms)",
+        data_series=avg_latency_data,
+        output_path=output_dir / "avg_inter_token_latency_vs_concurrency.png",
+        log_scale_x=True,
+    )
+
+    # 3. Request Throughput vs Concurrency
+    throughput_data = []
+    for deployment_type, results in deployment_results.items():
+        concurrencies, throughputs = extract_metric_series(
+            results, "request_throughput", "avg"
+        )
+        if concurrencies:
+            throughput_data.append(
+                (deployment_type.title(), concurrencies, throughputs)
+            )
+
+    create_plot(
+        title="Request Throughput vs Concurrency",
+        xlabel="Concurrency Level",
+        ylabel="Request Throughput (req/s)",
+        data_series=throughput_data,
+        output_path=output_dir / "request_throughput_vs_concurrency.png",
+        log_scale_x=True,
+    )
+
+    # 4. Average Time to First Token vs Concurrency
+    ttft_data = []
+    for deployment_type, results in deployment_results.items():
+        concurrencies, ttfts = extract_metric_series(
+            results, "time_to_first_token", "avg"
+        )
+        if concurrencies:
+            ttft_data.append((deployment_type.title(), concurrencies, ttfts))
+
+    create_plot(
+        title="Average Time to First Token vs Concurrency",
+        xlabel="Concurrency Level",
+        ylabel="Average Time to First Token (ms)",
+        data_series=ttft_data,
+        output_path=output_dir / "avg_time_to_first_token_vs_concurrency.png",
+        log_scale_x=True,
+    )
+
+    # 5. Efficiency plot: tok/s/gpu vs tok/s/user
+    create_efficiency_plot(deployment_results, output_dir)
+
+    # Generate summary
+    summary_lines = [
+        "Benchmark Results Summary",
+        "=" * 30,
+        "",
+        f"Results directory: {base_output_dir}",
+        f"Plots generated: {output_dir}",
+        "",
+        "Deployment Types Found:",
+    ]
+
+    for deployment_type, results in deployment_results.items():
+        concurrency_levels = [r[0] for r in results]
+        summary_lines.append(
+            f"  {deployment_type}: {len(results)} concurrency levels ({min(concurrency_levels)}-{max(concurrency_levels)})"
+        )
+
+    summary_lines.extend(
+        [
+            "",
+            "Generated Plots:",
+            "  - p50_inter_token_latency_vs_concurrency.png",
+            "  - avg_inter_token_latency_vs_concurrency.png",
+            "  - request_throughput_vs_concurrency.png",
+            "  - avg_time_to_first_token_vs_concurrency.png",
+            "  - efficiency_tok_s_gpu_vs_user.png",
+        ]
+    )
+
+    summary_path = output_dir / "SUMMARY.txt"
+    summary_path.write_text("\n".join(summary_lines))
+    print(f"Generated summary: {summary_path}")
+
+    print(f"All plots saved to: {output_dir}")
+
+
+if __name__ == "__main__":
+    import argparse
+
+    parser = argparse.ArgumentParser(
+        description="Generate performance plots from benchmark results"
+    )
+    parser.add_argument(
+        "--data-dir", required=True, help="Directory containing benchmark results"
+    )
+    parser.add_argument(
+        "--output-dir", help="Output directory for plots (defaults to data-dir/plots)"
+    )
+
+    args = parser.parse_args()
+
+    data_dir = Path(args.data_dir)
+    if args.output_dir:
+        # If output dir specified, use it as base and call generate_plots
+        output_dir = Path(args.output_dir)
+        output_dir.mkdir(parents=True, exist_ok=True)
+        generate_plots(data_dir, output_dir)
+    else:
+        # Use data_dir as base output dir
+        generate_plots(data_dir, data_dir / "plots")
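The arithmetic behind the efficiency plot can be checked by hand. The sketch below restates the formula from `create_efficiency_plot` as a standalone function; `efficiency_point` is a hypothetical name, and it assumes (as the plot does) that every request produces exactly `output_tokens` tokens.

```python
from typing import Tuple


def efficiency_point(
    request_throughput: float,
    output_tokens: int,
    concurrency: int,
    num_gpus: int = 1,
) -> Tuple[float, float]:
    """Return (tokens/sec per user, tokens/sec per GPU) for one sweep point."""
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    # Total output tokens per second across all in-flight requests.
    total_tok_s = request_throughput * output_tokens
    tok_s_user = total_tok_s / concurrency
    tok_s_gpu = total_tok_s / max(1, num_gpus)
    return tok_s_user, tok_s_gpu


# 2.5 req/s at 200 output tokens is 500 tok/s in total;
# spread over 10 concurrent users that is 50 tok/s per user.
print(efficiency_point(2.5, 200, 10))  # (50.0, 500.0)
```

Because the token count is a fixed parameter rather than a measured value, a run with a different OSL would need that value passed in for the points to be meaningful.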
--- a/benchmarks/utils/workflow.py
+++ b/benchmarks/utils/workflow.py
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Callable, Dict, List, Tuple
+
+from benchmarks.utils.genai import run_concurrency_sweep
+from benchmarks.utils.plot import generate_plots
+from deploy.utils.dynamo_deployment import DynamoDeploymentClient
+
+
+@dataclass
+class DeploymentConfig:
+    """Configuration for a single deployment type"""
+
+    name: str  # Human-readable name (e.g., "aggregated")
+    manifest_path: str  # Path to deployment manifest
+    output_subdir: str  # Subdirectory name for results (e.g., "agg")
+    client_factory: Callable  # Function to create the client
+    deploy_func: Callable  # Function to deploy the client
+
+
+def create_dynamo_client(
+    namespace: str, deployment_name: str
+) -> DynamoDeploymentClient:
+    """Factory function for DynamoDeploymentClient"""
+    return DynamoDeploymentClient(namespace=namespace, deployment_name=deployment_name)
+
+
+async def deploy_dynamo_client(
+    client: DynamoDeploymentClient, manifest_path: str
+) -> None:
+    """Deploy a DynamoDeploymentClient"""
+    await client.create_deployment(manifest_path)
+    await client.wait_for_deployment_ready(timeout=1800)
+
+
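+async def teardown(client) -> None:
+    """Clean up deployment and stop port forwarding (best effort)"""
+    try:
+        if hasattr(client, "stop_port_forward"):
+            client.stop_port_forward()
+        await client.delete_deployment()
+    except Exception as e:
+        # Best-effort cleanup: report the failure instead of hiding it
+        print(f"WARNING: Teardown failed: {e}")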
+
+
+def print_deployment_start(config: DeploymentConfig, output_dir: str) -> None:
+    """Print deployment start messages"""
+    print(f"🚀 Starting {config.name} deployment benchmark...")
+    print(f"📄 Manifest: {config.manifest_path}")
+    print(f"📁 Results will be saved to: {Path(output_dir) / config.output_subdir}")
+
+
+def print_concurrency_start(
+    deployment_name: str, model: str, isl: int, osl: int, std: int
+) -> None:
+    """Print concurrency sweep start messages"""
+    print(f"⚙️  Starting {deployment_name} concurrency sweep!", flush=True)
+    print(
+        "⏱️  This may take several minutes - running through multiple concurrency levels...",
+        flush=True,
+    )
+    print(f"🎯 Model: {model} | ISL: {isl} | OSL: {osl} | StdDev: {std}")
+
+
+def print_deployment_complete(config: DeploymentConfig) -> None:
+    """Print deployment completion message"""
+    print(f"✅ {config.name.title()} deployment benchmark completed successfully!")
+
+
+def print_deployment_skip(deployment_type: str) -> None:
+    """Print deployment skip message"""
+    print(f"⏭️  Skipping {deployment_type} deployment (not specified)")
+
+
+async def run_single_deployment_benchmark(
+    config: DeploymentConfig,
+    namespace: str,
+    output_dir: str,
+    model: str,
+    isl: int,
+    osl: int,
+    std: int,
+) -> None:
+    """Run benchmark for a single deployment type"""
+    print_deployment_start(config, output_dir)
+
+    # Create the client, then deploy inside the try block so that a failed or
+    # timed-out deployment is still torn down
+    client = config.client_factory(namespace, config.output_subdir)
+
+    try:
+        await config.deploy_func(client, config.manifest_path)
+        print_concurrency_start(config.name, model, isl, osl, std)
+
+        # Run concurrency sweep
+        (Path(output_dir) / config.output_subdir).mkdir(parents=True, exist_ok=True)
+        run_concurrency_sweep(
+            service_url=client.port_forward_frontend(quiet=True),
+            model_name=model,
+            isl=isl,
+            osl=osl,
+            stddev=std,
+            output_dir=Path(output_dir) / config.output_subdir,
+        )
+
+    finally:
+        await teardown(client)
+
+    print_deployment_complete(config)
+
+
+async def run_endpoint_benchmark(
+    label: str,
+    endpoint: str,
+    model: str,
+    isl: int,
+    osl: int,
+    std: int,
+    output_dir: str,
+) -> None:
+    """Run benchmark for an existing endpoint with custom label"""
+    print(f"🚀 Starting benchmark of endpoint '{label}': {endpoint}")
+    print(f"📁 Results will be saved to: {Path(output_dir) / label}")
+    print_concurrency_start(f"endpoint ({label})", model, isl, osl, std)
+
+    run_concurrency_sweep(
+        service_url=endpoint,
+        model_name=model,
+        isl=isl,
+        osl=osl,
+        stddev=std,
+        output_dir=Path(output_dir) / label,
+    )
+    print("✅ Endpoint benchmark completed successfully!")
+
+
+def print_final_summary(output_dir: str, deployed_types: List[str]) -> None:
+    """Generate the plots and print the final benchmark summary"""
+    print("📊 Generating performance plots...")
+    generate_plots(
+        base_output_dir=Path(output_dir), output_dir=Path(output_dir) / "plots"
+    )
+    print(f"📈 Plots saved to: {Path(output_dir) / 'plots'}")
+    print(f"📋 Summary saved to: {Path(output_dir) / 'plots' / 'SUMMARY.txt'}")
+
+    print()
+    print("🎉 Benchmark workflow completed successfully!")
+    print(f"📁 All results available at: {output_dir}")
+
+    if deployed_types:
+        print(f"🚀 Benchmarked deployments: {', '.join(deployed_types)}")
+
+    print(f"📊 View plots at: {Path(output_dir) / 'plots'}")
+
+
+def categorize_inputs(inputs: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
+    """Categorize inputs into endpoints and manifests"""
+    endpoints = {}
+    manifests = {}
+
+    for label, value in inputs.items():
+        # Validate reserved labels
+        if label.lower() == "plots":
+            raise ValueError(
+                "Label 'plots' is reserved and cannot be used. Please choose a different label."
+            )
+
+        if value.startswith(("http://", "https://")):
+            endpoints[label] = value
+        else:
+            # It should be a file path - validate it exists
+            if not Path(value).is_file():
+                raise FileNotFoundError(
+                    f"Manifest file not found for input '{label}': {value}"
+                )
+            manifests[label] = value
+
+    return endpoints, manifests
+
+
+def validate_dynamo_manifest(manifest_path: str) -> None:
+    """Validate that the manifest is a DynamoGraphDeployment"""
+    try:
+        with open(manifest_path, "r") as f:
+            content = f.read()
+    except FileNotFoundError:
+        raise FileNotFoundError(f"Manifest file not found: {manifest_path}") from None
+    except (OSError, UnicodeDecodeError) as e:
+        raise ValueError(f"Error reading manifest {manifest_path}: {e}") from e
+
+    # Check for DynamoGraphDeployment (outside the try so this error is not re-wrapped)
+    if "kind: DynamoGraphDeployment" not in content:
+        raise ValueError(
+            f"Manifest {manifest_path} is not a DynamoGraphDeployment. Only DynamoGraphDeployments are supported for deployment benchmarking."
+        )
+
+
+async def run_benchmark_workflow(
+    namespace: str,
+    inputs: Dict[str, str],
+    isl: int = 200,
+    std: int = 10,
+    osl: int = 200,
+    model: str = "nvidia/Llama-3.1-8B-Instruct-FP8",
+    output_dir: str = "benchmarks/results",
+) -> None:
+    """Main benchmark workflow orchestrator with dynamic inputs"""
+    Path(output_dir).mkdir(parents=True, exist_ok=True)
+
+    # Categorize inputs into endpoints and manifests
+    endpoints, manifests = categorize_inputs(inputs)
+
+    # Run endpoint benchmarks
+    for label, endpoint in endpoints.items():
+        await run_endpoint_benchmark(label, endpoint, model, isl, osl, std, output_dir)
+
+    # Create deployment configurations for manifests
+    deployment_configs = []
+
+    for label, manifest_path in manifests.items():
+        # Validate that it's a DynamoGraphDeployment
+        validate_dynamo_manifest(manifest_path)
+
+        config = DeploymentConfig(
+            name=label,
+            manifest_path=manifest_path,
+            output_subdir=label,
+            client_factory=create_dynamo_client,
+            deploy_func=deploy_dynamo_client,
+        )
+
+        deployment_configs.append(config)
+
+    # Run benchmarks for each deployment type
+    deployed_labels = list(endpoints.keys())
+    for config in deployment_configs:
+        await run_single_deployment_benchmark(
+            config=config,
+            namespace=namespace,
+            output_dir=output_dir,
+            model=model,
+            isl=isl,
+            osl=osl,
+            std=std,
+        )
+        deployed_labels.append(config.name)
+
+    # Generate final summary
+    print_final_summary(output_dir, deployed_labels)
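The deploy, sweep, teardown lifecycle in `run_single_deployment_benchmark` relies on `try`/`finally` so that cleanup runs even when the sweep fails. A minimal sketch of that pattern follows; `FakeClient`, `benchmark_with_cleanup`, and `run_demo` are hypothetical names that only record calls, not the real `DynamoDeploymentClient` API, and the sketch chooses to place the deploy step inside the `try` block as well.

```python
import asyncio
from typing import Callable, List


class FakeClient:
    """Hypothetical stand-in for a deployment client; records calls only."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    async def create_deployment(self, manifest_path: str) -> None:
        self.calls.append(f"create:{manifest_path}")

    async def delete_deployment(self) -> None:
        self.calls.append("delete")


async def benchmark_with_cleanup(
    client: FakeClient, manifest_path: str, sweep: Callable[[], None]
) -> None:
    try:
        await client.create_deployment(manifest_path)
        sweep()  # blocking benchmark step, e.g. a concurrency sweep
    finally:
        # Runs whether the sweep succeeded or raised.
        await client.delete_deployment()


def failing_sweep() -> None:
    raise RuntimeError("simulated benchmark failure")


def run_demo() -> List[str]:
    client = FakeClient()
    try:
        asyncio.run(benchmark_with_cleanup(client, "agg.yaml", failing_sweep))
    except RuntimeError:
        pass  # the failure still propagates after cleanup
    return client.calls


print(run_demo())  # ['create:agg.yaml', 'delete']
```

The error from the sweep still reaches the caller; the `finally` block only guarantees that the deployment is deleted first.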
--- a/components/backends/sglang/deploy/disagg_planner.yaml
+++ b/components/backends/sglang/deploy/disagg_planner.yaml
@@ -47,7 +47,7 @@ spec:
        failureThreshold: 10
      pvc:
        create: false
-        name: profiling-pvc # Must be pre-created before deployment and SLA profiler must have been run
+        name: dynamo-pvc # Must be pre-created before deployment and SLA profiler must have been run
        mountPoint: /workspace/profiling_results
      extraPodSpec:
        mainContainer:

--- a/components/backends/vllm/deploy/README.md
+++ b/components/backends/vllm/deploy/README.md
@@ -99,7 +99,7 @@ We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/

 ### Pre-Deployment Profiling (SLA Planner Only)

-If using the SLA Planner deployment (`disagg_planner.yaml`), follow the [pre-deployment profiling guide](../../../../docs/architecture/pre_deployment_profiling.md) to run pre-deployment profiling. The results will be saved to the `profiling-pvc` PVC and queried by the SLA Planner.
+If using the SLA Planner deployment (`disagg_planner.yaml`), follow the [pre-deployment profiling guide](../../../../docs/benchmarks/pre_deployment_profiling.md) to run pre-deployment profiling. The results will be saved to the `dynamo-pvc` PVC and queried by the SLA Planner.

 ## Usage


--- a/components/backends/vllm/deploy/agg.yaml
+++ b/components/backends/vllm/deploy/agg.yaml
@@ -13,7 +13,7 @@ spec:
      replicas: 1
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      dynamoNamespace: vllm-agg
@@ -24,7 +24,7 @@ spec:
          gpu: "1"
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh

--- a/components/backends/vllm/deploy/agg_router.yaml
+++ b/components/backends/vllm/deploy/agg_router.yaml
@@ -13,7 +13,7 @@ spec:
      replicas: 1
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv
@@ -27,7 +27,7 @@ spec:
          gpu: "1"
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh

--- a/components/backends/vllm/deploy/disagg.yaml
+++ b/components/backends/vllm/deploy/disagg.yaml
@@ -13,7 +13,7 @@ spec:
      replicas: 1
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
    VllmDecodeWorker:
      dynamoNamespace: vllm-disagg
      envFromSecret: hf-token-secret
@@ -24,7 +24,7 @@ spec:
          gpu: "1"
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
@@ -41,7 +41,7 @@ spec:
          gpu: "1"
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh

--- a/components/backends/vllm/deploy/disagg_planner.yaml
+++ b/components/backends/vllm/deploy/disagg_planner.yaml
@@ -20,7 +20,7 @@ spec:
      replicas: 1
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0814-02
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
    Planner:
      dynamoNamespace: vllm-disagg-planner
      envFromSecret: hf-token-secret
@@ -47,11 +47,11 @@ spec:
        failureThreshold: 10
      pvc:
        create: false
-        name: profiling-pvc # Must be pre-created before deployment and SLA profiler must have been run
+        name: dynamo-pvc # Must be pre-created before deployment and SLA profiler must have been run
        mountPoint: /workspace/profiling_results
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0814-02
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
          workingDir: /workspace/components/planner/src/dynamo/planner
          ports:
            - name: metrics
@@ -95,7 +95,7 @@ spec:
        failureThreshold: 10
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0814-02
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
@@ -118,7 +118,7 @@ spec:
              port: 9090
            periodSeconds: 10
            failureThreshold: 60
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0814-02
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
@@ -141,7 +141,7 @@ spec:
              port: 9090
            periodSeconds: 10
            failureThreshold: 60
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:hzhou-0814-02
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh

--- a/components/backends/vllm/deploy/disagg_router.yaml
+++ b/components/backends/vllm/deploy/disagg_router.yaml
@@ -13,7 +13,7 @@ spec:
      replicas: 1
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv
@@ -27,7 +27,7 @@ spec:
          gpu: "1"
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh
@@ -44,7 +44,7 @@ spec:
          gpu: "1"
      extraPodSpec:
        mainContainer:
-          image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
          workingDir: /workspace/components/backends/vllm
          command:
            - /bin/sh

--- a/deploy/__init__.py
+++ b/deploy/__init__.py
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Package marker for deploy utilities
--- a/deploy/utils/README.md
+++ b/deploy/utils/README.md
+# Kubernetes utilities for Dynamo
+
+This directory contains small utilities and manifests used by benchmarking and profiling flows.
+
+## Contents
+
+- `setup_k8s_namespace.sh` — one-time setup for a Kubernetes namespace. Creates the namespace (if missing), applies common manifests, installs CRDs, and deploys the Dynamo operator. If both `DOCKER_SERVER` and `IMAGE_TAG` are provided, it installs your custom operator image; otherwise it installs the default published image. If your registry is private, provide `DOCKER_USERNAME`/`DOCKER_PASSWORD` or respond to the prompt to create an image pull secret.
+- `manifests/`
+  - `serviceaccount.yaml` — ServiceAccount `dynamo-sa`
+  - `role.yaml` — Role `dynamo-role`
+  - `rolebinding.yaml` — RoleBinding `dynamo-binding`
+  - `pvc.yaml` — PVC `dynamo-pvc`
+  - `pvc-access-pod.yaml` — short‑lived pod for copying profiler results from the PVC
+- `kubernetes.py` — helper used by tooling to apply/read resources (e.g., access pod for PVC downloads).
+
+## Quick start
+
+### Kubernetes Setup (one-time per namespace)
+
+Use the helper script to prepare a Kubernetes namespace in a single step. The script creates the namespace with the given name if it does not yet exist, applies the common manifests (serviceaccount, role, rolebinding, pvc), installs the CRDs, creates secrets, and deploys the Dynamo Cloud Operator to that namespace.
+If your namespace is already set up, you can skip this step.
+
+```bash
+export HF_TOKEN=<HF_TOKEN>
+export DOCKER_SERVER=<YOUR_DOCKER_SERVER>
+
+NAMESPACE=benchmarking HF_TOKEN=$HF_TOKEN DOCKER_SERVER=$DOCKER_SERVER deploy/utils/setup_k8s_namespace.sh
+
+# If you want to build and push a new Docker image for the Dynamo Cloud Operator, also set IMAGE_TAG:
+# NAMESPACE=benchmarking HF_TOKEN=$HF_TOKEN DOCKER_SERVER=$DOCKER_SERVER IMAGE_TAG=latest deploy/utils/setup_k8s_namespace.sh
+```
+
+This script applies the following manifests:
+
+- `deploy/utils/manifests/serviceaccount.yaml` - ServiceAccount `dynamo-sa`
+- `deploy/utils/manifests/role.yaml` - Role `dynamo-role`
+- `deploy/utils/manifests/rolebinding.yaml` - RoleBinding `dynamo-binding`
+- `deploy/utils/manifests/pvc.yaml` - PVC `dynamo-pvc`
+
+If `DOCKER_SERVER` and `IMAGE_TAG` are not both provided, the script deploys the operator using the default published image `nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.4.0`.
+To build/push and use a new image instead, pass both `DOCKER_SERVER` and `IMAGE_TAG`.
+
+This script also installs the Dynamo CRDs if not present.
+
+If the registry is private, either pass credentials or respond to the prompt:
+
+```bash
+NAMESPACE=benchmarking \
+DOCKER_SERVER=my-registry.example.com \
+IMAGE_TAG=latest \
+DOCKER_USERNAME='$oauthtoken' \
+DOCKER_PASSWORD=<token> \
+deploy/utils/setup_k8s_namespace.sh
+```
+
+If `DOCKER_SERVER`/`IMAGE_TAG` are omitted, the script installs the default operator image `nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.4.0`.
+
+After running the setup script, verify the installation by checking the pods:
+
+```bash
+# Assumes NAMESPACE is set in your current shell, e.g. `export NAMESPACE=benchmarking`
+kubectl get pods -n "$NAMESPACE"
+```
+
+The output should look something like:
+
+```
+NAME                                                            READY   STATUS    RESTARTS   AGE
+dynamo-platform-dynamo-operator-controller-manager-xxxxx       2/2     Running   0          5m
+dynamo-platform-etcd-0                                          1/1     Running   0          5m
+dynamo-platform-nats-0                                          2/2     Running   0          5m
+dynamo-platform-nats-box-xxxxx                                  1/1     Running   0          5m
+```
+
+### PVC Manipulation Scripts
+
+These scripts interact with the Persistent Volume Claim (PVC) that stores configuration files and benchmark/profiling results. They're essential for the Dynamo benchmarking and profiling workflows.
+
+#### Why These Scripts Are Needed
+
+1. **For Pre-Deployment Profiling**: The profiling job needs access to your Dynamo deployment configurations (DGD manifests) to test different parallelization strategies
+2. **For Retrieving Results**: Both benchmarking and profiling jobs write their results to the PVC, which you need to download for analysis
+
+#### Script Usage
+
+**Inject deployment configurations for profiling:**
+
+```bash
+# The profiling job reads your DGD config from the PVC
+python3 deploy/utils/inject_manifest.py \
+  --namespace $NAMESPACE \
+  --src ./my-disagg.yaml \
+  --dest /configs/disagg.yaml
+```
+
+**Download benchmark/profiling results:**
+
+```bash
+# After benchmarking or profiling completes, download results
+python3 deploy/utils/download_pvc_results.py \
+  --namespace $NAMESPACE \
+  --output-dir ./pvc_files \
+  --folder /results \
+  --no-config   # optional: skip *.yaml/*.yml in the download
+```
+
+#### Next Steps
+
+For complete benchmarking workflows:
+- **Benchmarking Guide**: See [docs/benchmarks/benchmarking.md](../../docs/benchmarks/benchmarking.md) for comparing DynamoGraphDeployments and external endpoints
+- **Pre-Deployment Profiling**: See [docs/benchmarks/pre_deployment_profiling.md](../../docs/benchmarks/pre_deployment_profiling.md) for optimizing configurations before deployment
+
+## Notes
+
+- Benchmarking scripts (`benchmarks/benchmark.sh`, `benchmarks/deploy_benchmark.sh`) call this setup automatically when present.
+- The profiling job manifest remains in `benchmarks/profiler/deploy/profile_sla_job.yaml` and now relies on the common ServiceAccount and PVC defined here.
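
The operator image rule stated in the `deploy/utils/README.md` changes above (a custom image only when both `DOCKER_SERVER` and `IMAGE_TAG` are set, otherwise the default published image) can be sketched in Python. This is a minimal illustration of the documented rule, not the logic of `setup_k8s_namespace.sh` itself, which is not shown in this diff. The function name `choose_operator_image` and the shape of the returned dictionary are assumptions made for this example.

```python
from typing import Dict, Mapping

# Default operator image named in the README.
DEFAULT_OPERATOR_IMAGE = "nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.4.0"


def choose_operator_image(env: Mapping[str, str]) -> Dict[str, str]:
    """Hypothetical helper that mirrors the documented image-selection rule."""
    server = env.get("DOCKER_SERVER", "").strip()
    tag = env.get("IMAGE_TAG", "").strip()
    if server and tag:
        # Both values present: a custom image is built, pushed, and used.
        return {"mode": "custom", "server": server, "tag": tag}
    # Either value missing: fall back to the default published image.
    return {"mode": "default", "image": DEFAULT_OPERATOR_IMAGE}


# DOCKER_SERVER alone is not enough to select a custom image.
print(choose_operator_image({"DOCKER_SERVER": "my-registry.example.com"}))
# Both variables together select the custom image path.
print(choose_operator_image(
    {"DOCKER_SERVER": "my-registry.example.com", "IMAGE_TAG": "latest"}
))
```

Under this rule, the first quick-start command, which sets `DOCKER_SERVER` without `IMAGE_TAG`, resolves to the default published image.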