Commit 273252e6 authored by Biswa Panda, committed by GitHub

feat(frontend): three-layer frontend perf sweep with local and k8s support (#7700)

parent 023a299c
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->
# Frontend Performance Benchmark Suite
A configurable sweep runner for measuring Dynamo frontend model serving performance. It drives [aiperf](https://github.com/ai-dynamo/aiperf) load against a frontend/mocker (or frontend/vLLM) stack and collects throughput, latency, and observability data across a grid of parameters.
The primary use case is **HuggingFace tokenizer vs. fastokens comparison** -- sweeping across concurrency levels, input sequence lengths (ISL), and worker counts to quantify the tokenizer's impact on end-to-end performance.
---
## Architecture
The codebase follows a three-layer design that separates pure logic from execution and infrastructure concerns.
| Layer | Package | Responsibility |
|-------|---------|----------------|
| **Core** | `scripts/sweep_core/` | Pure data models, plan construction, artifact writing, reporting. No subprocess or kubectl calls. |
| **Executors** | `scripts/sweep_executors/` | `SweepExecutor` protocol with two implementations -- `LocalExecutor` (delegates to `run_perf.sh`) and `K8sDgdExecutor` (DynamoGraphDeployment-based k8s runs). |
| **K8s helpers** | `scripts/sweep_k8s/` | kubectl wrappers, DGD patching, template rendering, aiperf Job launching, and Prometheus metrics capture. |
The entry point is `scripts/sweep_runner.py`, a thin CLI that wires the three layers together: it builds a `SweepPlan` from CLI arguments, selects an executor based on `--mode`, and feeds the plan to the orchestrator.
**Data flow:**
```
CLI args --> SweepConfig --> SweepPlan (Cartesian grid of RunSpecs)
                                 |
                           Orchestrator
                                 |
                  LocalExecutor or K8sDgdExecutor
                        |                 |
                   run_perf.sh    DGD + aiperf Job
                        |                 |
                   artifacts/        artifacts/
```
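The grid expansion in the diagram can be sketched as below. This is a minimal illustration, not the actual `sweep_core` code: the `RunSpec` fields, the `build_plan` helper, and the run-ID format (modeled on artifact directory names such as `mocker_hf_w2_c50_isl512`) are assumptions.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunSpec:
    """Hypothetical stand-in for the real sweep_core RunSpec."""
    backend: str
    tokenizer: str
    workers: int
    concurrency: int
    isl: int

    @property
    def run_id(self) -> str:
        # Assumed naming scheme, modeled on the artifact directory names.
        return (f"{self.backend}_{self.tokenizer}_w{self.workers}"
                f"_c{self.concurrency}_isl{self.isl}")


def build_plan(backend, tokenizers, workers, concurrency, isl):
    """Return one RunSpec per combination of the swept values."""
    return [
        RunSpec(backend, tok, w, c, i)
        for tok, w, c, i in product(tokenizers, workers, concurrency, isl)
    ]


plan = build_plan("mocker", ["hf", "fastokens"], [2], [32, 64, 128], [512, 1024, 2048])
print(len(plan))       # 2 tokenizers x 1 worker count x 3 concurrency x 3 ISL = 18
print(plan[0].run_id)  # mocker_hf_w2_c32_isl512
```

Keeping plan construction free of subprocess or kubectl calls, as the Core layer does, means a grid like this can be inspected or unit-tested without starting any services.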
---
## Quick Start -- Local
Local mode starts a mocker backend and frontend process on the current machine, runs aiperf against them, and tears everything down between runs.
**Prerequisites:**
- `dynamo.mocker` and `dynamo.frontend` installed (from the Dynamo repo)
- `aiperf` installed and on `$PATH`
- A HuggingFace model accessible locally (default: `Qwen/Qwen3-0.6B`)
**Smoke test (2 runs, ~30 s each):**
```bash
cd benchmarks/frontend/scripts
python3 sweep_runner.py \
--tokenizers hf,fastokens \
--concurrency 32 \
--isl 512 \
--benchmark-duration 30 \
--speedup-ratio 1000000
```
**Full local sweep:**
```bash
python3 sweep_runner.py \
--tokenizers hf,fastokens \
--concurrency 32,64,128 \
--isl 512,1024,2048
```
**Transport saturation sweep (high concurrency, vary workers):**
```bash
python3 sweep_runner.py \
--tokenizers hf \
--concurrency 4096 \
--num-requests 16384,32768 \
--workers 1,2,4,8 \
--speedup-ratio 1000000
```
Results are written to `artifacts/sweep_<timestamp>/`.
---
## Quick Start -- Kubernetes
K8s mode deploys a DynamoGraphDeployment (DGD) into a Kubernetes namespace and launches aiperf as an in-cluster Job that targets the frontend service endpoint.
### Prerequisites
1. **Namespace** -- a dedicated namespace for the benchmark (default: `dynamo-bench`).
2. **HuggingFace token secret** -- a Kubernetes Secret named `hf-token-secret`
containing your HF token, if the model requires authentication.
3. **Model cache PVC** -- a PersistentVolumeClaim for caching model weights
(avoids repeated downloads across runs).
4. **DGD deployed** -- either pre-deploy the DGD yourself, or use the
`--deploy --deploy-template` flags to let the sweep runner create it.
5. **kubectl** configured with access to the target cluster and namespace.
### Example: mocker backend
```bash
python3 sweep_runner.py \
--mode k8s \
--dgd-name dynamo-bench-mocker \
--tokenizers hf,fastokens \
--concurrency 50,100 \
--isl 512
```
### Example: template-based deployment
When `--deploy-template` is provided, the runner renders the template with per-run variables (tokenizer, workers, model, etc.) and applies it via kubectl before each run group:
```bash
python3 sweep_runner.py \
--mode k8s \
--deploy \
--deploy-template dgd/templates/mocker.yaml \
--dgd-name dynamo-bench-mocker \
--image nvcr.io/.../image:tag \
--tokenizers hf,fastokens \
--concurrency 50,100 \
--isl 512
```
### How aiperf runs in-cluster
The sweep runner creates a short-lived Kubernetes Job in the same namespace as the DGD. The Job pod runs `aiperf` against the frontend's in-cluster service DNS name (e.g., `dynamo-bench-mocker-frontend:8000`). Once the Job completes, artifacts are copied back to the local host via `kubectl cp`.
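As an illustration of the service address used above, the sketch below builds a `host:port` string from a DGD name. The `<dgd-name>-frontend` pattern is inferred from the example in this section, and `frontend_endpoint` is a hypothetical helper, not the runner's actual derivation logic. The `.svc.cluster.local` suffix is the standard Kubernetes Service DNS form.

```python
from typing import Optional


def frontend_endpoint(dgd_name: str, port: int = 8000,
                      namespace: Optional[str] = None) -> str:
    """Build host:port for the frontend Service (assumed '<dgd>-frontend' naming)."""
    host = f"{dgd_name}-frontend"
    if namespace:
        # Fully qualified form, resolvable from pods in other namespaces.
        host = f"{host}.{namespace}.svc.cluster.local"
    return f"{host}:{port}"


print(frontend_endpoint("dynamo-bench-mocker"))
# dynamo-bench-mocker-frontend:8000
```

The short form is sufficient here because the aiperf Job runs in the same namespace as the DGD.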
### Reset strategy
Between runs, the `--reset-strategy` flag controls how the deployed stack is
recycled:
| Strategy | Behavior |
|----------|----------|
| `none` | No resets; runs back-to-back on the same deployment. |
| `frontend` | Restart only the frontend pod between runs. |
| `graph` (default) | Redeploy the entire DGD graph between run groups. |
---
## CLI Reference
All flags for `sweep_runner.py`:
### Common options
| Flag | Default | Description |
|------|---------|-------------|
| `--mode` | `local` | Execution mode: `local` or `k8s`. |
| `--backend` | `mocker` | Engine backend: `mocker` (synthetic) or `vllm` (real inference). |
| `--model` | `Qwen/Qwen3-0.6B` | HuggingFace model path. |
| `--model-name` | same as `--model` | Served model name (for multi-model setups). |
| `--tokenizers` | `hf,fastokens` | Comma-separated tokenizer backends. |
| `--concurrency` | `50,100,200` | Comma-separated concurrency levels. |
| `--isl` | `512,1024,2048` | Comma-separated input sequence lengths. |
| `--osl` | `256` | Output sequence length. |
| `--workers` | `2` | Comma-separated worker counts per model. |
| `--num-models` | `1` | Number of model instances. |
| `--speedup-ratio` | `1.0` | Mocker speedup divisor; use large values (e.g., 1000000) for near-instant mocker. |
| `--benchmark-duration` | `60` | aiperf duration in seconds. |
| `--num-requests` | none | Comma-separated request counts (overrides `--benchmark-duration`). |
| `--rps` | none | Comma-separated target request rates (req/s). |
| `--output-dir` | auto-timestamped | Output directory. |
| `--cooldown` | `3` | Seconds between runs. |
| `--max-consecutive-fails` | `2` | Abort sweep after N consecutive failures. |
| `--isolation` | `fresh_per_run` | Isolation policy: `fresh_per_run` or `reuse_by_deploy_key`. |
| `--no-report` | off | Skip per-run report generation. |
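Because every comma-separated flag multiplies the number of runs, it can help to estimate a sweep's size before launching it. The sketch below is a rough lower bound under stated assumptions: `estimate_sweep` is a hypothetical helper, and it counts only benchmark duration plus cooldown, ignoring service startup, teardown, and report generation.

```python
from math import prod
from typing import Dict, List, Tuple


def estimate_sweep(dimensions: Dict[str, List], duration_s: int = 60,
                   cooldown_s: int = 3) -> Tuple[int, int]:
    """Return (number of runs, minimum wall-clock seconds) for a Cartesian sweep."""
    runs = prod(len(values) for values in dimensions.values())
    return runs, runs * (duration_s + cooldown_s)


# Defaults taken from the table above.
defaults = {
    "tokenizers": ["hf", "fastokens"],
    "concurrency": [50, 100, 200],
    "isl": [512, 1024, 2048],
    "workers": [2],
}
runs, seconds = estimate_sweep(defaults)
print(runs, seconds)  # 18 1134
```

`--dry-run` prints the actual plan; the arithmetic here only shows why adding one more value to any list-valued flag grows the sweep multiplicatively.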
### Execution control
| Flag | Description |
|------|-------------|
| `--dry-run` | Print the sweep plan without executing any runs. |
| `--emit-plan` | Print the sweep plan as JSON and exit (useful for Argo or MCP integration). |
### K8s mode options
| Flag | Default | Description |
|------|---------|-------------|
| `--namespace` | `dynamo-bench` | Kubernetes namespace. |
| `--endpoint` | auto-derived | Frontend endpoint (`host:port`). |
| `--dgd-name` | none | DynamoGraphDeployment name. |
| `--image` | none | Container image for k8s deployment. |
| `--deploy-template` | none | Path to a DGD YAML template (enables template-based deployment). |
| `--deploy` | off | Deploy infrastructure before sweeping. |
| `--reset-strategy` | `graph` | Per-run reset: `none`, `frontend`, or `graph`. |
| `--frontend-port` | `8000` | Frontend HTTP port. |
| `--worker-replicas` | `1` | Number of worker pod replicas. |
| `--request-plane` | `tcp` | Request plane transport. |
| `--event-plane` | `nats` | Event plane transport. |
| `--router-mode` | `round-robin` | Frontend router mode. |
| `--hf-token` | none | HuggingFace token for k8s. |
| `--image-pull-secret` | none | Image pull secret name. |
| `--export-level` | `summary` | aiperf export level. |
---
## Artifact Structure
Each sweep produces a timestamped output directory:
```
artifacts/sweep_20260330_143000/
  sweep_config.json                  # Full SweepConfig used for this run
  results.csv                        # One row per run with key metrics
  summary.md                         # Markdown summary table
  mocker_hf_w2_c50_isl512/
    aiperf/                          # aiperf JSON output
    prometheus/                      # Prometheus metric snapshots
    report.md                        # Per-run analysis report (unless --no-report)
  mocker_fastokens_w2_c50_isl512/
    aiperf/
    prometheus/
    report.md
  ...
```
**results.csv columns:**
`run_id`, `backend`, `tokenizer`, `concurrency`, `isl`, `osl`, `workers`,
`speedup_ratio`, `status`, `req_per_sec`, `output_tok_per_sec`,
`ttft_p50_ms`, `ttft_p99_ms`, `itl_p50_ms`, `itl_p99_ms`, `duration_sec`,
`run_dir`
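A quick way to compare tokenizers from `results.csv` is to pair rows that share the same concurrency and ISL. The sketch below uses only column names listed above. The `ok` status value and the sample rows are assumptions for illustration, not measurements.

```python
import csv
import io
from collections import defaultdict


def ttft_p99_by_tokenizer(csv_file):
    """Return {(concurrency, isl): {tokenizer: ttft_p99_ms}} for successful runs."""
    table = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        if row["status"] != "ok":  # assumed success marker
            continue
        key = (int(row["concurrency"]), int(row["isl"]))
        table[key][row["tokenizer"]] = float(row["ttft_p99_ms"])
    return dict(table)


# Illustrative rows only; real files carry all columns listed above.
SAMPLE = """run_id,tokenizer,concurrency,isl,status,ttft_p99_ms
mocker_hf_w2_c50_isl512,hf,50,512,ok,40.0
mocker_fastokens_w2_c50_isl512,fastokens,50,512,ok,30.0
mocker_hf_w2_c100_isl512,hf,100,512,fail,
"""

table = ttft_p99_by_tokenizer(io.StringIO(SAMPLE))
for key, by_tok in sorted(table.items()):
    if {"hf", "fastokens"} <= set(by_tok):
        ratio = by_tok["fastokens"] / by_tok["hf"]
        print(key, f"fastokens/hf TTFT p99 ratio: {ratio:.2f}")
```

For a real sweep, pass `open("artifacts/sweep_<timestamp>/results.csv", newline="")` instead of the in-memory sample.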
---
## DGD Templates
The `dgd/templates/` directory contains DynamoGraphDeployment YAML templates
for k8s mode. Template variables (e.g., `${DGD_NAME}`, `${IMAGE}`,
`${DYN_TOKENIZER_BACKEND}`) are substituted by the sweep runner at deploy time.
| Template | Backend | GPU required | Description |
|----------|---------|-------------|-------------|
| `mocker.yaml` | mocker | No | Synthetic backend for isolating frontend/tokenizer overhead. |
| `vllm.yaml` | vLLM | Yes | Real inference backend for end-to-end benchmarking. |
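The `${VAR}` placeholders use the same syntax as Python's `string.Template`, so substitution can be sketched as below. Whether the sweep runner uses `string.Template` internally is an assumption; the `render` helper and the shortened template text are illustrative only.

```python
from string import Template

# Shortened, hypothetical excerpt of a DGD template.
TEMPLATE_SNIPPET = """\
metadata:
  name: ${DGD_NAME}
spec:
  services:
    Frontend:
      replicas: ${FRONTEND_REPLICAS}
"""


def render(template_text: str, variables: dict) -> str:
    """Fill ${VAR} placeholders; a missing variable raises KeyError."""
    return Template(template_text).substitute(variables)


rendered = render(
    TEMPLATE_SNIPPET,
    {"DGD_NAME": "dynamo-bench-mocker", "FRONTEND_REPLICAS": "1"},
)
print(rendered)
```

Using `substitute` rather than `safe_substitute` makes an unset variable fail loudly instead of leaving a literal `${...}` in the manifest handed to kubectl.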
---
## Analysis
Post-sweep analysis scripts live in `scripts/analysis/`:
| Script | Purpose |
|--------|---------|
| `create_report.py` | Generates a per-run observability report from aiperf JSON, Prometheus snapshots, NVTX traces, syscall profiles, and BPF data. |
| `frontend_perf_analysis.py` | Produces scalability curves (TTFT/ITL/throughput vs. concurrency), ISL heatmaps, stage waterfall breakdowns, and regression detection. Supports single-run analysis, A/B comparison, and heatmap generation. |
**Single-run report:**
```bash
python3 scripts/analysis/create_report.py analyze artifacts/sweep_*/mocker_hf_w2_c50_isl512/
```
**A/B comparison:**
```bash
python3 scripts/analysis/frontend_perf_analysis.py compare \
artifacts/sweep_*/mocker_hf_w2_c50_isl512/ \
artifacts/sweep_*/mocker_fastokens_w2_c50_isl512/
```
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Deploy template: Mocker backend (no GPUs required).
#
# Template variables (substituted by sweep_runner.py --deploy-template):
# ${DGD_NAME} - DynamoGraphDeployment name
# ${IMAGE} - Container image
# ${DYN_TOKENIZER_BACKEND} - "default" (hf) or "fast"
# ${FRONTEND_PORT} - Frontend HTTP port
# ${ROUTER_MODE} - Frontend router mode
# ${MODEL_PATH} - HF model ID
# ${MODEL_NAME} - Served model name
# ${NUM_WORKERS} - Mocker workers per pod
# ${FRONTEND_REPLICAS} - Number of frontend pods (default: 1)
# ${WORKER_REPLICAS} - Number of worker pods
# ${SPEEDUP_RATIO} - Mocker speedup ratio (use large value for near-instant)
#
# Usage:
# python3 sweep_runner.py --mode k8s --deploy-template dgd/templates/mocker.yaml \
# --dgd-name dynamo-bench-mocker --image nvcr.io/.../image:tag \
# --tokenizers hf,fastokens --concurrency 50,100 --isl 512
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: ${DGD_NAME}
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: ${FRONTEND_REPLICAS}
      extraPodSpec:
        ${FRONTEND_IMAGE_PULL_SECRETS_BLOCK}
        mainContainer:
          image: ${IMAGE}
          imagePullPolicy: IfNotPresent
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.frontend --router-mode ${ROUTER_MODE} --http-port ${FRONTEND_PORT}
          env:
            - name: DYN_TOKENIZER_BACKEND
              value: "${DYN_TOKENIZER_BACKEND}"
            - name: DYN_PERF_DIAG
              value: "1"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: ${HF_TOKEN_SECRET_NAME}
                  key: HF_TOKEN
    MockerWorker:
      componentType: worker
      replicas: ${WORKER_REPLICAS}
      extraPodSpec:
        ${WORKER_IMAGE_PULL_SECRETS_BLOCK}
        mainContainer:
          image: ${IMAGE}
          imagePullPolicy: IfNotPresent
          command:
            - /bin/sh
            - -c
          args:
            - |
              python3 -m dynamo.mocker \
                --model-path "${MODEL_PATH}" \
                --model-name "${MODEL_NAME}" \
                --num-workers ${NUM_WORKERS} \
                --speedup-ratio ${SPEEDUP_RATIO}
          env:
            - name: MODEL_PATH
              value: "${MODEL_PATH}"
            - name: MODEL_NAME
              value: "${MODEL_NAME}"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: ${HF_TOKEN_SECRET_NAME}
                  key: HF_TOKEN
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Deploy template: vLLM backend for openai/gpt-oss-20b (TP=2, 2 GPUs per worker).
#
# Model: openai/gpt-oss-20b (20B params, BF16/FP8, 131K context)
# - Architecture: gpt_oss, 24 layers, 64 attn heads, 8 KV heads
# - TP options: 1, 2, 4, 8 (all divide heads/kv_heads evenly)
# - Weight size: ~13 GB (safetensors, excluding metal/ and original/)
# - Recommended: TP=2 on H100 for good prefill throughput
#
# Template variables (substituted by sweep_runner.py --deploy-template):
# ${DGD_NAME} - DynamoGraphDeployment name
# ${IMAGE} - Container image
# ${DYN_TOKENIZER_BACKEND} - "default" (hf) or "fast"
# ${FRONTEND_PORT} - Frontend HTTP port (default: 8000)
# ${ROUTER_MODE} - Frontend router mode (default: round-robin)
# ${MODEL} - Model path (HF ID or local path on PVC)
# ${MODEL_NAME} - Served model name (used by aiperf --model)
# ${FRONTEND_REPLICAS} - Number of frontend pods (default: 1)
# ${WORKER_REPLICAS} - Number of vLLM worker pods
#
# Prerequisites:
# - Model downloaded to model-cache PVC (excluding metal/ and original/):
# huggingface-cli download openai/gpt-oss-20b \
# --exclude "metal/*" --exclude "original/*" \
# --local-dir /model-store/hub/models--openai--gpt-oss-20b/snapshots/main
# - hf-token-secret in the target namespace
# - model-cache PVC (>= 100Gi) in the target namespace
# - GPU nodes with nvidia.com/gpu toleration
#
# Usage:
# python3 sweep_runner.py --mode k8s \
# --deploy-template dgd/templates/vllm-gpt-oss-20b.yaml \
# --dgd-name dynamo-bench-vllm \
# --model /model-store/hub/models--openai--gpt-oss-20b/snapshots/main \
# --image nvcr.io/.../vllm-runtime:tag \
# --tokenizers hf,fastokens --concurrency 20 --isl 8192
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: ${DGD_NAME}
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: ${FRONTEND_REPLICAS}
      extraPodSpec:
        ${FRONTEND_IMAGE_PULL_SECRETS_BLOCK}
        mainContainer:
          image: ${IMAGE}
          imagePullPolicy: IfNotPresent
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.frontend --router-mode ${ROUTER_MODE} --http-port ${FRONTEND_PORT}
          env:
            - name: DYN_TOKENIZER_BACKEND
              value: "${DYN_TOKENIZER_BACKEND}"
            - name: DYN_PERF_DIAG
              value: "1"
            - name: HF_HOME
              value: /model-store
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: ${HF_TOKEN_SECRET_NAME}
                  key: HF_TOKEN
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: model-cache
    VllmWorker:
      componentType: worker
      replicas: ${WORKER_REPLICAS}
      extraPodSpec:
        ${WORKER_IMAGE_PULL_SECRETS_BLOCK}
        tolerations:
          - effect: NoSchedule
            key: nvidia.com/gpu
            operator: Exists
        mainContainer:
          image: ${IMAGE}
          imagePullPolicy: IfNotPresent
          command:
            - /bin/sh
            - -c
          args:
            - >-
              python3 -m dynamo.vllm
              --model ${MODEL}
              --served-model-name ${MODEL_NAME}
              --tensor-parallel-size 2
              --max-model-len 65536
              --max-num-batched-tokens 32768
              --gpu-memory-utilization 0.90
          env:
            - name: HF_HOME
              value: /model-store
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: ${HF_TOKEN_SECRET_NAME}
                  key: HF_TOKEN
          resources:
            limits:
              nvidia.com/gpu: "2"
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: model-cache
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Deploy template: vLLM backend (requires GPUs).
#
# Template variables (substituted by sweep_runner.py --deploy-template):
# ${DGD_NAME} - DynamoGraphDeployment name
# ${IMAGE} - Container image
# ${DYN_TOKENIZER_BACKEND} - "default" (hf) or "fast"
# ${FRONTEND_PORT} - Frontend HTTP port
# ${ROUTER_MODE} - Frontend router mode
# ${MODEL} - HF model ID
# ${MODEL_NAME} - Served model name
# ${FRONTEND_REPLICAS} - Number of frontend pods (default: 1)
# ${WORKER_REPLICAS} - Number of vLLM worker pods
#
# Usage:
# python3 sweep_runner.py --mode k8s --deploy-template dgd/templates/vllm.yaml \
# --dgd-name dynamo-bench-vllm --image nvcr.io/.../vllm-runtime:tag \
# --model meta-llama/Llama-3.1-8B-Instruct \
# --tokenizers hf --concurrency 128 --isl 1024
---
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: ${DGD_NAME}
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: ${FRONTEND_REPLICAS}
      extraPodSpec:
        ${FRONTEND_IMAGE_PULL_SECRETS_BLOCK}
        mainContainer:
          image: ${IMAGE}
          imagePullPolicy: IfNotPresent
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.frontend --router-mode ${ROUTER_MODE} --http-port ${FRONTEND_PORT}
          env:
            - name: DYN_TOKENIZER_BACKEND
              value: "${DYN_TOKENIZER_BACKEND}"
            - name: DYN_PERF_DIAG
              value: "1"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: ${HF_TOKEN_SECRET_NAME}
                  key: HF_TOKEN
    VllmWorker:
      componentType: worker
      replicas: ${WORKER_REPLICAS}
      extraPodSpec:
        ${WORKER_IMAGE_PULL_SECRETS_BLOCK}
        mainContainer:
          image: ${IMAGE}
          imagePullPolicy: IfNotPresent
          command:
            - /bin/sh
            - -c
          args:
            - python3 -m dynamo.vllm --model ${MODEL} --tensor-parallel-size 1
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: ${HF_TOKEN_SECRET_NAME}
                  key: HF_TOKEN
          resources:
            limits:
              nvidia.com/gpu: "1"
@@ -11,11 +11,11 @@ source dynamo/bin/activate
 # Single run (mocker + frontend + aiperf + Prometheus)
 cd benchmarks/frontend/scripts
 ./run_perf.sh --model Qwen/Qwen3-0.6B --concurrency 32 --num-requests 640 \
-    --speedup-ratio 0 --skip-bpf --skip-nsys --skip-flamegraph --skip-perf
+    --speedup-ratio 1000000 --skip-bpf --skip-nsys --skip-flamegraph --skip-perf
 # Sweep (multiple config points)
 python3 sweep_runner.py --tokenizers hf --concurrency 32 --isl 512 \
-    --benchmark-duration 30 --speedup-ratio 0 \
+    --benchmark-duration 30 --speedup-ratio 1000000 \
     -- --skip-bpf --skip-nsys --skip-flamegraph --skip-perf
 ```
@@ -132,17 +132,17 @@ The main entry point for running performance sweeps. Iterates over a grid of con
 ```bash
 # Smoke test (1 run)
 python3 sweep_runner.py --tokenizers hf --concurrency 32 --isl 512 \
-    --benchmark-duration 30 --speedup-ratio 0 \
+    --benchmark-duration 30 --speedup-ratio 1000000 \
     -- --skip-bpf --skip-nsys --skip-flamegraph --skip-perf
 # Full tokenizer comparison
 python3 sweep_runner.py --tokenizers hf,fastokens \
     --concurrency 32,64 --isl 512,1024,2048 \
-    --benchmark-duration 60 --speedup-ratio 0
+    --benchmark-duration 60 --speedup-ratio 1000000
 # Transport saturation (vary workers and request count)
 python3 sweep_runner.py --tokenizers hf --concurrency 4096 \
-    --num-requests 16384,32768 --workers 1,2,4,8 --speedup-ratio 0
+    --num-requests 16384,32768 --workers 1,2,4,8 --speedup-ratio 1000000
 # Preview sweep plan without running
 python3 sweep_runner.py --dry-run --tokenizers hf,fastokens \
@@ -168,7 +168,7 @@ for m in 1 2 3 4; do
     --num-models $m \
     --rps 75 \
     --benchmark-duration 60 \
-    --speedup-ratio 0 \
+    --speedup-ratio 1000000 \
     --output-dir artifacts/sweep_models/m${m} \
     -- --skip-bpf
 done
@@ -195,7 +195,7 @@ python3 sweep_runner.py \
     --num-models 1 \
     --rps 75 \
     --benchmark-duration 60 \
-    --speedup-ratio 0 \
+    --speedup-ratio 1000000 \
     --output-dir artifacts/sweep_workers \
     -- --skip-bpf
 ```
@@ -214,7 +214,7 @@ python3 sweep_runner.py \
     --num-models 2 \
     --rps 50 \
     --benchmark-duration 60 \
-    --speedup-ratio 0 \
+    --speedup-ratio 1000000 \
     --output-dir artifacts/sweep_grid \
     -- --skip-bpf
 ```
@@ -237,15 +237,15 @@ python3 sweep_runner.py \
 ```bash
 # With perf stat + flamegraphs (no root needed)
 python3 sweep_runner.py --tokenizers hf --concurrency 64 --isl 1024 \
-    --benchmark-duration 60 --speedup-ratio 0
+    --benchmark-duration 60 --speedup-ratio 1000000
 # With everything including BPF (needs sudo)
 sudo -E python3 sweep_runner.py --tokenizers hf --concurrency 64 --isl 1024 \
-    --benchmark-duration 60 --speedup-ratio 0
+    --benchmark-duration 60 --speedup-ratio 1000000
 # nsys profiling (needs nsys in PATH)
 python3 sweep_runner.py --tokenizers hf --concurrency 64 --isl 1024 \
-    --benchmark-duration 60 --speedup-ratio 0 \
+    --benchmark-duration 60 --speedup-ratio 1000000 \
     -- --nsys-path /opt/nvidia/nsight-systems/bin/nsys
 ```
@@ -272,7 +272,7 @@ Profiler controls are passed through to run_perf.sh after `--`:
 | `--num-models` | `1` | Number of model instances (each gets `--workers` workers) |
 | `--rps` | - | Comma-separated target request rates (req/s) |
 | `--aiperf-targets` | `first` | `first`: model-1 only. `all`: run aiperf for each model |
-| `--speedup-ratio` | `1.0` | Mocker speedup (0 = infinite) |
+| `--speedup-ratio` | `1.0` | Mocker speedup divisor; use large values (e.g., 1000000) for near-instant mocker |
 | `--benchmark-duration` | `60` | aiperf run duration (seconds) |
 | `--num-requests` | - | Comma-separated request counts (overrides duration) |
 | `--output-dir` | auto | Output directory |
@@ -288,21 +288,21 @@ Low-level per-run harness. Normally called by sweep_runner.py, but can be used d
 ```bash
 # Minimal (no profilers)
 ./run_perf.sh --model Qwen/Qwen3-0.6B --concurrency 32 --num-requests 640 \
-    --speedup-ratio 0 --skip-bpf --skip-nsys --skip-flamegraph --skip-perf
+    --speedup-ratio 1000000 --skip-bpf --skip-nsys --skip-flamegraph --skip-perf
 # Full observability (needs sudo for BPF)
 sudo -E ./run_perf.sh --model Qwen/Qwen3-0.6B --concurrency 64 \
-    --benchmark-duration 60 --speedup-ratio 0
+    --benchmark-duration 60 --speedup-ratio 1000000
 # Multi-model with 2 workers each
 ./run_perf.sh --model Qwen/Qwen3-0.6B --num-models 2 --workers 2 \
-    --concurrency 32 --benchmark-duration 30 --speedup-ratio 0 \
+    --concurrency 32 --benchmark-duration 30 --speedup-ratio 1000000 \
     --skip-bpf --skip-nsys --skip-flamegraph --skip-perf
 # 4 models, 1 worker each, rate-limited to 75 rps
 ./run_perf.sh --model Qwen/Qwen3-0.6B --num-models 4 --workers 1 \
     --concurrency 512 --benchmark-duration 60 --request-rate 75 \
-    --speedup-ratio 0 --skip-bpf
+    --speedup-ratio 1000000 --skip-bpf
 ```
 ## Analyzing Results
@@ -121,7 +121,7 @@ Service Options:
   --model PATH Model path (default: nvidia/Llama-3.1-8B-Instruct-FP8)
   --model-name NAME Served model name (default: same as --model)
   --workers N Number of mocker workers (default: 2)
-  --speedup-ratio RATIO Mocker speedup ratio (default: 1.0; 0 = infinite)
+  --speedup-ratio RATIO Mocker speedup ratio (default: 1.0; use large value for near-instant)
   --data-parallel-size N Mocker DP workers (default: 1)
   --request-plane PLANE nats|http|tcp (default: tcp)
   --event-plane PLANE nats|zmq (default: nats)
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->
# Frontend Scaling Test: Finding the Saturation Point
This guide walks through using the sweep runner to find the saturation point of
a Dynamo frontend serving a real vLLM backend. The saturation point is the
request rate at which latency begins to degrade -- prefill requests start
queuing instead of being served immediately, TTFT p99 spikes, and throughput
plateaus.
---
## Overview
The test sweeps increasing request rates (`--rps`) at a fixed input sequence
length while keeping the backend warm (`--reset-strategy frontend`). Each data
point is a 60-second aiperf run at a controlled RPS. The sweep stops
automatically after consecutive failures (`--max-consecutive-fails`).
**What you get:**
- Per-RPS throughput (actual req/s vs target), TTFT p50/p99, ITL p50/p99
- Prometheus pre/post metrics for pipeline stage breakdown
- CSV + summary for easy comparison
---
## Prerequisites
1. **K8s namespace** with:
- `hf-token-secret` (HuggingFace token)
- `nvcrimagepullsecret` (image pull credentials)
- `model-cache` PVC (RWX, large enough for model weights)
- Model weights downloaded to PVC (see "Model Download" below)
2. **DGD deployed** with the target model and backend.
3. **sweep_runner.py** accessible from a machine with `kubectl` access to the
cluster.
---
## Model Download (gpt-oss-20b example)
Download the model to the PVC, excluding large non-inference directories:
```bash
# Create a download Job (adjust image and namespace)
kubectl apply -n <namespace> -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download-gpt-oss-20b
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: nvcrimagepullsecret
      containers:
        - name: download
          image: nvcr.io/nvidian/dynamo-dev/biswa:vllm-runtime-1a8bce12ea
          command: ["python3", "-c"]
          args:
            - |
              import os, subprocess, sys, pathlib
              model = "openai/gpt-oss-20b"
              os.environ["HF_HOME"] = "/model-store"
              cmd = ["huggingface-cli", "download", model,
                     "--exclude", "metal/*", "--exclude", "original/*",
                     "--local-dir", "/model-store/hub/models--openai--gpt-oss-20b/snapshots/main"]
              sys.exit(subprocess.run(cmd).returncode)
          env:
            - name: HF_HOME
              value: /model-store
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: HF_TOKEN
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
EOF
# Monitor
kubectl logs -n <namespace> -l job-name=model-download-gpt-oss-20b -f
```
---
## Deploy the DGD
Use the provided template for gpt-oss-20b with TP=2:
```bash
# Template path (relative to repo root)
# benchmarks/frontend/dgd/templates/vllm-gpt-oss-20b.yaml
#
# Key settings in the template:
# - tensor-parallel-size 2 (2 GPUs per worker)
# - max-model-len 65536
# - gpu-memory-utilization 0.90
# - GPU toleration for scheduling
# Deploy directly (adjust values as needed):
kubectl apply -n <namespace> -f - <<'EOF'
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: gpt-oss-20b-bench
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        imagePullSecrets:
          - name: nvcrimagepullsecret
        mainContainer:
          image: <your-image>
          command: ["/bin/sh", "-c"]
          args: ["python3 -m dynamo.frontend --router-mode round-robin --http-port 8000"]
          env:
            - name: DYN_TOKENIZER_BACKEND
              value: "default"
            - name: DYN_PERF_DIAG
              value: "1"
            - name: HF_HOME
              value: /model-store
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: HF_TOKEN
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: model-cache
    VllmWorker:
      componentType: worker
      replicas: 4 # <-- number of backend replicas
      extraPodSpec:
        imagePullSecrets:
          - name: nvcrimagepullsecret
        tolerations:
          - effect: NoSchedule
            key: nvidia.com/gpu
            operator: Exists
        mainContainer:
          image: <your-image>
          command: ["/bin/sh", "-c"]
          args:
            - >-
              python3 -m dynamo.vllm
              --model /model-store/hub/models--openai--gpt-oss-20b/snapshots/main
              --served-model-name openai/gpt-oss-20b
              --tensor-parallel-size 2
              --max-model-len 65536
              --gpu-memory-utilization 0.90
          env:
            - name: HF_HOME
              value: /model-store
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: HF_TOKEN
          resources:
            limits:
              nvidia.com/gpu: "2" # <-- 2 GPUs for TP=2
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: model-cache
EOF
# Wait for all pods to be ready
kubectl get pods -n <namespace> -w
```
---
## Run the Saturation Sweep
### Baseline: HF tokenizer, RPS sweep
```bash
cd benchmarks/frontend/scripts
python3 sweep_runner.py --mode k8s \
--dgd-name gpt-oss-20b-bench \
--namespace <namespace> \
--endpoint gpt-oss-20b-bench-frontend:8000 \
--model openai/gpt-oss-20b \
--backend vllm \
--image <your-image> \
--tokenizers hf \
--concurrency 200 \
--rps 10,20,30,40,50,60,70,80,90,100 \
--isl 6144 \
--osl 256 \
--benchmark-duration 60 \
--reset-strategy frontend \
--isolation reuse_by_deploy_key \
--worker-replicas 4 \
--max-consecutive-fails 2
```
**Flag explanations:**
| Flag | Value | Purpose |
|------|-------|---------|
| `--rps 10,20,...,100` | Sweep dimension | Each run targets a fixed request rate. aiperf uses `--request-rate` to cap submission. |
| `--concurrency 200` | High ceiling | Maximum in-flight requests. Set high so aiperf can sustain the target RPS without being limited by available connection slots. This is NOT a sweep dimension. |
| `--isl 6144` | Fixed ISL | Holds input length constant to isolate throughput scaling. |
| `--osl 256` | Fixed OSL | Consistent output length across all runs. |
| `--benchmark-duration 60` | 60s per point | Long enough for vLLM scheduling to stabilize. |
| `--reset-strategy frontend` | Frontend-only | Resets Prometheus counters between runs, but keeps vLLM workers alive with warm KV caches and CUDA graphs. Avoids the ~90s full DGD restart per point. |
| `--isolation reuse_by_deploy_key` | Reuse deployment | Since tokenizer=hf is constant, no DGD restart between runs. Only a frontend pod restart for clean metrics. |
| `--max-consecutive-fails 2` | Auto-stop | After 2 consecutive failed runs, the remaining (higher) RPS values are skipped. |
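The auto-stop behavior can be sketched as a simple counter that resets on every successful run. This is an illustration of the rule described in the table, not the orchestrator's actual code; `run_sweep` and the fake `run_one` callback are hypothetical.

```python
def run_sweep(rps_values, run_one, max_consecutive_fails=2):
    """Run each RPS point in order; stop after N failures in a row."""
    results = []
    consecutive_fails = 0
    for rps in rps_values:
        ok = run_one(rps)
        results.append((rps, "ok" if ok else "fail"))
        # Any success resets the streak; only back-to-back failures abort.
        consecutive_fails = 0 if ok else consecutive_fails + 1
        if consecutive_fails >= max_consecutive_fails:
            break
    return results


# Fake runner for illustration: every point at or above 70 RPS fails.
outcome = run_sweep(range(10, 101, 10), lambda rps: rps < 70)
print(outcome[-2:])  # [(70, 'fail'), (80, 'fail')]; 90 and 100 are never run
```

Ordering the `--rps` list from low to high matters here: once the system is overloaded, the sweep stops instead of spending a full benchmark duration on each remaining point.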
### Follow-up: fastokens comparison
Once you have the baseline, run the same sweep with fastokens to see if the
saturation point shifts:
```bash
python3 sweep_runner.py --mode k8s \
--dgd-name gpt-oss-20b-bench \
--namespace <namespace> \
--endpoint gpt-oss-20b-bench-frontend:8000 \
--model openai/gpt-oss-20b \
--backend vllm \
--image <your-image> \
--tokenizers fastokens \
--concurrency 200 \
--rps 10,20,30,40,50,60,70,80,90,100 \
--isl 6144 \
--osl 256 \
--benchmark-duration 60 \
--reset-strategy frontend \
--isolation reuse_by_deploy_key \
--worker-replicas 4 \
--max-consecutive-fails 2
```
### Fine-grained sweep around the inflection
If the baseline shows saturation between, say, RPS=40 and RPS=60:
```bash
python3 sweep_runner.py --mode k8s \
... \
--rps 35,40,45,50,55,60 \
--reset-strategy frontend \
--isolation reuse_by_deploy_key
```
---
## Reading the Results
The sweep produces `results.csv` and `summary.md` in the output directory.
### Identifying the saturation point
Look for these signals in the CSV:
| RPS | Actual Req/s | TTFT p50 | TTFT p99 | ITL p99 | Status |
|----:|-----------:|--------:|--------:|-------:|--------|
| 10 | 10.0 | 800ms | 1200ms | 30ms | ok |
| 20 | 19.8 | 850ms | 1400ms | 32ms | ok |
| 30 | 29.5 | 900ms | 2000ms | 35ms | ok |
| 40 | 38.0 | 1200ms | 5000ms | 45ms | ok -- onset |
| 50 | 42.0 | 3000ms | 15000ms | 80ms | ok -- saturated |
| 60 | 41.5 | 8000ms | 30000ms | 120ms | ok -- overloaded |
| 70 | -- | -- | -- | -- | fail |
**Saturation indicators:**
1. **Actual req/s < target RPS**: The system cannot sustain the requested rate.
At RPS=50, only 42 req/s are achieved.
2. **TTFT p99 spike**: A sharp increase (e.g., 2x-5x) means prefill requests
are queuing behind each other.
3. **ITL p99 degradation**: Decode throughput drops because the vLLM scheduler
is overloaded with concurrent prefills.
4. **Errors/failures**: Timeouts, OOM, or vLLM rejecting requests.
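The **saturation point** in the example above is **RPS ~40** -- the last rate
where actual throughput still tracks the target (38.0 of 40 req/s). TTFT p99
has already begun to climb there (5000 ms), and beyond it throughput plateaus
at roughly 42 req/s.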
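The first two indicators can be checked mechanically. The sketch below uses the rows from the example table above. The thresholds (achieved rate below 90% of target, or TTFT p99 growing at least 2x from the previous point) and the `first_saturated` helper are assumptions chosen for illustration, not values used by the sweep runner.

```python
# (target_rps, actual_req_per_sec, ttft_p99_ms) from the example table.
ROWS = [
    (10, 10.0, 1200),
    (20, 19.8, 1400),
    (30, 29.5, 2000),
    (40, 38.0, 5000),
    (50, 42.0, 15000),
    (60, 41.5, 30000),
]


def first_saturated(rows, min_tracking=0.90, max_ttft_growth=2.0):
    """Return the first target RPS that misses its rate or shows a TTFT p99 jump."""
    prev_ttft = None
    for target, actual, ttft in rows:
        missed_rate = actual / target < min_tracking
        ttft_spike = prev_ttft is not None and ttft / prev_ttft >= max_ttft_growth
        if missed_rate or ttft_spike:
            return target
        prev_ttft = ttft
    return None


print(first_saturated(ROWS))  # 40
```

With these thresholds the check fires at RPS=40 because TTFT p99 jumps 2.5x (2000 ms to 5000 ms), even though throughput is still at 95% of target. Looser thresholds would move the reported point to RPS=50.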
### Prometheus metrics
Each run captures `frontend_metrics_pre.txt` and `frontend_metrics_post.txt`.
Key metrics for saturation analysis:
- `dynamo_frontend_stage_duration_seconds{stage="preprocess"}` -- tokenization time
- `dynamo_frontend_stage_duration_seconds{stage="transport_roundtrip"}` -- backend latency
- `dynamo_frontend_queued_requests` -- requests waiting in HTTP queue (should be 0 below saturation)
- `dynamo_frontend_inflight_requests` -- concurrent in-flight requests
- `dynamo_frontend_time_to_first_token_seconds` -- TTFT histogram buckets
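To turn the pre/post snapshots into a per-run number, subtract the counters and divide. The sketch below assumes the stage metric is exposed as a Prometheus histogram with `_sum` and `_count` series and that `stage` is its only label. Both points are assumptions, and the sample text is illustrative rather than real output.

```python
import re

SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?P<labels>\{[^}]*\})?\s+(?P<value>\S+)'
)


def parse_metrics(text):
    """Map 'name{labels}' -> float for each sample line of Prometheus text format."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_LINE.match(line)
        if match:
            key = match["name"] + (match["labels"] or "")
            samples[key] = float(match["value"])
    return samples


def mean_stage_seconds(pre, post, stage):
    """Mean stage duration over the run: delta(sum) / delta(count)."""
    base = "dynamo_frontend_stage_duration_seconds"
    label = f'{{stage="{stage}"}}'
    d_sum = post[f"{base}_sum{label}"] - pre.get(f"{base}_sum{label}", 0.0)
    d_count = post[f"{base}_count{label}"] - pre.get(f"{base}_count{label}", 0.0)
    return d_sum / d_count if d_count else None


# Illustrative snapshots; real ones come from frontend_metrics_pre/post.txt.
PRE = ('dynamo_frontend_stage_duration_seconds_sum{stage="preprocess"} 2.0\n'
       'dynamo_frontend_stage_duration_seconds_count{stage="preprocess"} 100\n')
POST = ('dynamo_frontend_stage_duration_seconds_sum{stage="preprocess"} 5.0\n'
        'dynamo_frontend_stage_duration_seconds_count{stage="preprocess"} 400\n')

mean = mean_stage_seconds(parse_metrics(PRE), parse_metrics(POST), "preprocess")
print(f"{mean * 1000:.1f} ms per request")  # 10.0 ms per request
```

The regular expression is a simplification: it does not handle label values containing `}` or trailing timestamps beyond ignoring them. A Prometheus client library parser would be more robust for production analysis.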
---
## DGD Template Reference
The `dgd/templates/vllm-gpt-oss-20b.yaml` template is pre-configured for
gpt-oss-20b with TP=2. To use it with `--deploy-template`:
```bash
python3 sweep_runner.py --mode k8s \
--deploy-template benchmarks/frontend/dgd/templates/vllm-gpt-oss-20b.yaml \
--dgd-name gpt-oss-20b-bench \
--model /model-store/hub/models--openai--gpt-oss-20b/snapshots/main \
--image <your-image> \
--worker-replicas 4 \
...
```

The template substitutes these variables at deploy time:
`${DGD_NAME}`, `${IMAGE}`, `${MODEL}`, `${MODEL_NAME}`,
`${WORKER_REPLICAS}`, `${DYN_TOKENIZER_BACKEND}`, `${FRONTEND_PORT}`,
`${ROUTER_MODE}`.
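
The `${NAME}` placeholder style is the same one Python's `string.Template` understands, so the substitution step can be illustrated with the standard library alone. This is a sketch of the idea and not the suite's own renderer: the YAML fragment, the image name, and the values are invented for the example and do not reflect the real template's structure.

```python
from string import Template

# Invented fragment for illustration; the real template has a different layout.
TEMPLATE = """\
metadata:
  name: ${DGD_NAME}
spec:
  image: ${IMAGE}
  replicas: ${WORKER_REPLICAS}
"""

values = {
    "DGD_NAME": "gpt-oss-20b-bench",
    "IMAGE": "example.invalid/dynamo:dev",  # hypothetical image reference
    "WORKER_REPLICAS": "4",
}

# substitute() raises KeyError if the template names a variable that is
# missing from the mapping, which surfaces typos before anything is deployed.
rendered = Template(TEMPLATE).substitute(values)
print(rendered)
```

Using `substitute()` rather than `safe_substitute()` is a deliberate choice here: an unresolved placeholder in a deployment manifest is easier to debug as an immediate error than as a literal `${...}` string in the cluster.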

---

## Tuning Parameters

| Parameter | Recommended Range | Notes |
|-----------|-------------------|-------|
| `--benchmark-duration` | 60-120s | Longer = more stable averages but slower sweep |
| `--concurrency` | 2-4x max target RPS | Must be high enough that aiperf can reach the target rate |
| `--rps` | Start at 10, double until failures | Geometric progression finds the order of magnitude fast |
| `--worker-replicas` | 1-8 | More replicas = higher saturation point but more GPUs |
| `--reset-strategy` | `frontend` for saturation tests | `graph` for clean-baseline TTFT measurements |
| `--isolation` | `reuse_by_deploy_key` for same-tokenizer sweeps | Avoids unnecessary DGD restarts |
| `--max-consecutive-fails` | 2-3 | Higher = more data points at the failure boundary |
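
The `--rps` and `--concurrency` guidance can be turned into concrete flag values with a few lines of arithmetic. The sketch below builds a doubling rate ladder starting at 10 and sizes concurrency at 2x the highest rate. The `rps_ladder` helper and the choice of five steps are assumptions for illustration.

```python
def rps_ladder(start: int = 10, factor: int = 2, steps: int = 5) -> list[int]:
    """Geometric progression of target request rates."""
    return [start * factor**i for i in range(steps)]


rates = rps_ladder()
rps_arg = ",".join(str(r) for r in rates)

# Lower end of the 2-4x recommendation, based on the highest rate in the ladder.
concurrency = 2 * max(rates)

print(f"--rps {rps_arg} --concurrency {concurrency}")
```

Once the doubling ladder brackets the first failing rate, a second sweep with evenly spaced rates between the last passing and first failing value narrows down the saturation point.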
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""sweep_core -- pure-logic library for frontend performance sweeps."""
from sweep_core.models import (
AiperfDimension,
DeployDimension,
DeployKey,
IsolationPolicy,
RunResult,
RunSpec,
SweepConfig,
SweepPlan,
)
__all__ = [
"AiperfDimension",
"DeployDimension",
"DeployKey",
"IsolationPolicy",
"RunResult",
"RunSpec",
"SweepConfig",
"SweepPlan",
]
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""
Artifact writers for sweep results.
Produces CSV, markdown summary, and sweep_config.json -- the contract
consumed by downstream analysis tools (analyze_sweep.py, sweep_data.py).
"""
from __future__ import annotations
import csv
import json
import time
from pathlib import Path
from typing import List
from sweep_core.models import RunResult, SweepConfig
def write_csv(results: List[RunResult], csv_path: Path, config: SweepConfig) -> None:
"""Write incremental CSV results file (called after each run)."""
fieldnames = [
"run_id",
"backend",
"tokenizer",
"concurrency",
"isl",
"osl",
"workers",
"speedup_ratio",
"status",
"req_per_sec",
"output_tok_per_sec",
"ttft_p50_ms",
"ttft_p99_ms",
"itl_p50_ms",
"itl_p99_ms",
"duration_sec",
"run_dir",
]
with open(csv_path, "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
for r in results:
spec = r.run_spec
row = {
"run_id": spec.run_id,
"backend": spec.deploy.backend,
"tokenizer": spec.deploy.tokenizer,
"concurrency": spec.aiperf.concurrency,
"isl": spec.aiperf.isl,
"osl": spec.aiperf.osl,
"workers": spec.deploy.workers,
"speedup_ratio": config.speedup_ratio,
"status": r.status,
"req_per_sec": f"{r.req_per_sec:.2f}"
if r.req_per_sec is not None
else "",
"output_tok_per_sec": f"{r.output_tok_per_sec:.1f}"
if r.output_tok_per_sec is not None
else "",
"ttft_p50_ms": f"{r.ttft_p50_ms:.1f}"
if r.ttft_p50_ms is not None
else "",
"ttft_p99_ms": f"{r.ttft_p99_ms:.1f}"
if r.ttft_p99_ms is not None
else "",
"itl_p50_ms": f"{r.itl_p50_ms:.1f}" if r.itl_p50_ms is not None else "",
"itl_p99_ms": f"{r.itl_p99_ms:.1f}" if r.itl_p99_ms is not None else "",
"duration_sec": f"{r.duration_sec:.1f}"
if r.duration_sec is not None
else "",
"run_dir": r.run_dir,
}
writer.writerow(row)
def write_summary(results: List[RunResult], summary_path: Path) -> None:
"""Write markdown summary table."""
lines = ["# Sweep Summary\n"]
lines.append(f"**Generated:** {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
lines.append(
"| Run ID | Req/s | Tok/s | TTFT p50 | TTFT p99 | ITL p50 | Duration | Status |"
)
lines.append(
"|--------|------:|------:|---------:|---------:|--------:|---------:|--------|"
)
for r in results:
rps = f"{r.req_per_sec:.1f}" if r.req_per_sec is not None else "-"
tps = f"{r.output_tok_per_sec:.0f}" if r.output_tok_per_sec is not None else "-"
tp50 = f"{r.ttft_p50_ms:.1f}ms" if r.ttft_p50_ms is not None else "-"
tp99 = f"{r.ttft_p99_ms:.1f}ms" if r.ttft_p99_ms is not None else "-"
ip50 = f"{r.itl_p50_ms:.1f}ms" if r.itl_p50_ms is not None else "-"
dur = f"{r.duration_sec:.0f}s" if r.duration_sec is not None else "-"
lines.append(
f"| {r.run_spec.run_id} | {rps} | {tps} | {tp50} | {tp99} | {ip50} | {dur} | {r.status} |"
)
lines.append("")
ok = sum(1 for r in results if r.status == "ok")
fail = sum(1 for r in results if r.status == "fail")
skip = sum(1 for r in results if r.status == "skipped")
lines.append(
f"**Totals:** {ok} passed, {fail} failed, {skip} skipped out of {len(results)}"
)
summary_path.write_text("\n".join(lines) + "\n")
def write_sweep_config(
config: SweepConfig, output_dir: Path, total_runs: int = 0
) -> None:
"""Write sweep_config.json for downstream consumers."""
config_path = output_dir / "sweep_config.json"
config_data = {
"timestamp": time.strftime("%Y%m%d_%H%M%S"),
"mode": config.mode,
"model": config.model,
"model_name": config.model_name,
"backend": config.backend,
"backends": config.backend,
"tokenizers": ",".join(config.tokenizers),
"isl_list": ",".join(str(i) for i in config.isls),
"concurrency_list": ",".join(str(c) for c in config.concurrencies),
"benchmark_duration": config.benchmark_duration or "N/A",
"osl": config.osl,
"speedup_ratio": config.speedup_ratio,
"output_dir": config.output_dir,
"total_runs": total_runs,
"isolation_policy": config.isolation_policy,
}
config_path.write_text(json.dumps(config_data, indent=2) + "\n")
def print_results_table(results: List[RunResult]) -> None:
"""Print a compact results table to stdout."""
print(f"\n{'=' * 90}")
print(
f" {'Run ID':<30} {'Req/s':>8} {'Tok/s':>8} {'TTFT p50':>10} {'TTFT p99':>10} {'Status':>8}"
)
print(f" {'-' * 30} {'-' * 8} {'-' * 8} {'-' * 10} {'-' * 10} {'-' * 8}")
for r in results:
rps = f"{r.req_per_sec:.1f}" if r.req_per_sec is not None else "N/A"
tps = (
f"{r.output_tok_per_sec:.0f}" if r.output_tok_per_sec is not None else "N/A"
)
tp50 = f"{r.ttft_p50_ms:.1f}ms" if r.ttft_p50_ms is not None else "N/A"
tp99 = f"{r.ttft_p99_ms:.1f}ms" if r.ttft_p99_ms is not None else "N/A"
print(
f" {r.run_spec.run_id:<30} {rps:>8} {tps:>8} {tp50:>10} {tp99:>10} {r.status:>8}"
)
print(f"{'=' * 90}")
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""
Typed SweepConfig construction from argparse Namespace.
Centralizes the parsing of CLI arguments into the SweepConfig data model.
"""
from __future__ import annotations
import argparse
import time
from pathlib import Path
from typing import List, Optional
from sweep_core.models import K8sConfig, SweepConfig
SCRIPT_DIR = Path(__file__).resolve().parent.parent
REPO_ROOT = SCRIPT_DIR.parent.parent.parent
DEFAULT_MODEL = "Qwen/Qwen3-0.6B"
DEFAULT_OSL = 256
DEFAULT_SPEEDUP = 1.0
DEFAULT_BENCHMARK_DURATION = 60
DEFAULT_MAX_CONSECUTIVE_FAILS = 2
DEFAULT_COOLDOWN = 3
def build_argument_parser() -> argparse.ArgumentParser:
"""Build the argument parser for sweep_runner.py."""
parser = argparse.ArgumentParser(
description="Frontend performance sweep runner",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""Examples:
# Local smoke test
python3 sweep_runner.py --tokenizers hf,fastokens --concurrency 32 --isl 512 \\
--benchmark-duration 30 --speedup-ratio 1000000
# K8s sweep with DGD
python3 sweep_runner.py --mode k8s --tokenizers hf,fastokens --concurrency 50,100 --isl 512
# K8s with custom deploy template
python3 sweep_runner.py --mode k8s --deploy-template dgd/templates/vllm.yaml \\
--tokenizers hf --concurrency 128 --isl 1024
# Transport saturation (high concurrency, vary workers)
python3 sweep_runner.py --tokenizers hf --concurrency 4096 \\
--num-requests 16384,32768 --workers 1,2,4,8 --speedup-ratio 1000000
# Dry run
python3 sweep_runner.py --dry-run --tokenizers hf,fastokens --concurrency 32,64 --isl 512,1024
""",
)
# Common options
parser.add_argument("--model", default=DEFAULT_MODEL, help="HF model path")
parser.add_argument(
"--model-name", default="", help="Served model name (default: same as --model)"
)
parser.add_argument(
"--mode",
choices=["local", "k8s"],
default="local",
help="Execution mode: local (run_perf.sh) or k8s (DGD + aiperf)",
)
parser.add_argument(
"--backend",
choices=["mocker", "vllm"],
default="mocker",
help="Engine backend: mocker (synthetic) or vllm (real inference)",
)
parser.add_argument(
"--tokenizers",
default="hf,fastokens",
help="Comma-separated tokenizer backends (hf, fastokens)",
)
parser.add_argument(
"--concurrency", default="50,100,200", help="Comma-separated concurrency levels"
)
parser.add_argument(
"--isl", default="512,1024,2048", help="Comma-separated ISL values"
)
parser.add_argument(
"--osl", type=int, default=DEFAULT_OSL, help="Output sequence length"
)
parser.add_argument(
"--workers", default="2", help="Comma-separated worker counts per model"
)
parser.add_argument(
"--num-models",
type=int,
default=1,
help="Number of model instances",
)
parser.add_argument(
"--aiperf-targets",
choices=["first", "all"],
default="first",
help="'first': aiperf targets model-1 only. 'all': run aiperf for each model.",
)
parser.add_argument(
"--speedup-ratio",
type=float,
default=DEFAULT_SPEEDUP,
help="Mocker speedup (0=infinite)",
)
parser.add_argument(
"--benchmark-duration",
type=int,
default=DEFAULT_BENCHMARK_DURATION,
help="aiperf duration (seconds)",
)
parser.add_argument(
"--num-requests",
default=None,
help="Comma-separated request counts (overrides --benchmark-duration)",
)
parser.add_argument(
"--rps",
default=None,
help="Comma-separated target request rates (req/s)",
)
parser.add_argument(
"--output-dir",
default=None,
help="Output directory (default: auto timestamped)",
)
parser.add_argument(
"--max-consecutive-fails",
type=int,
default=DEFAULT_MAX_CONSECUTIVE_FAILS,
)
parser.add_argument(
"--cooldown", type=int, default=DEFAULT_COOLDOWN, help="Seconds between runs"
)
parser.add_argument(
"--dry-run", action="store_true", help="Print plan without executing"
)
parser.add_argument(
"--no-report", action="store_true", help="Skip per-run report generation"
)
parser.add_argument(
"--isolation",
choices=["fresh_per_run", "reuse_by_deploy_key"],
default="fresh_per_run",
help="Isolation policy (default: fresh_per_run)",
)
# K8s-specific options
k8s_group = parser.add_argument_group("K8s mode options")
k8s_group.add_argument("--namespace", default="dynamo-bench", help="K8s namespace")
k8s_group.add_argument(
"--endpoint", default=None, help="K8s frontend endpoint (host:port)"
)
k8s_group.add_argument("--dgd-name", default="", help="DynamoGraphDeployment name")
k8s_group.add_argument(
"--image", default="", help="Container image for k8s deployment"
)
k8s_group.add_argument(
"--deploy-template",
default="",
help="Path to deploy.yaml template (enables template-based deployment)",
)
k8s_group.add_argument(
"--reset-strategy",
choices=["none", "frontend", "graph"],
default="graph",
help="K8s reset strategy per run (default: graph)",
)
k8s_group.add_argument(
"--deploy", action="store_true", help="Deploy infrastructure before sweeping"
)
k8s_group.add_argument(
"--frontend-port", type=int, default=8000, help="Frontend HTTP port"
)
k8s_group.add_argument(
"--worker-replicas", type=int, default=1, help="Number of worker pod replicas"
)
k8s_group.add_argument(
"--frontend-replicas",
type=int,
default=1,
help="Number of frontend pod replicas",
)
k8s_group.add_argument(
"--request-plane", default="tcp", help="Request plane transport"
)
k8s_group.add_argument(
"--event-plane", default="nats", help="Event plane transport"
)
k8s_group.add_argument(
"--router-mode", default="round-robin", help="Frontend router mode"
)
k8s_group.add_argument("--hf-token", default="", help="HuggingFace token for k8s")
k8s_group.add_argument(
"--image-pull-secret", default="", help="Image pull secret name"
)
k8s_group.add_argument(
"--export-level", default="summary", help="aiperf export level"
)
# Passthrough args for run_perf.sh
parser.add_argument(
"passthrough", nargs="*", help="Extra args passed to run_perf.sh (after --)"
)
return parser
def config_from_args(args: argparse.Namespace) -> SweepConfig:
"""Convert parsed argparse Namespace to SweepConfig."""
# Parse comma-separated lists
tokenizers = [t.strip() for t in args.tokenizers.split(",")]
concurrencies = [int(c) for c in args.concurrency.split(",")]
isls = [int(i) for i in args.isl.split(",")]
worker_counts = [int(w) for w in args.workers.split(",")]
num_requests_list: List[Optional[int]] = (
[int(n) for n in args.num_requests.split(",")] if args.num_requests else [None]
)
rps_list: List[Optional[int]] = (
[int(r) for r in args.rps.split(",")] if args.rps else [None]
)
# Output directory
if args.output_dir:
output_dir = args.output_dir
else:
ts = time.strftime("%Y%m%d_%H%M%S")
if args.mode == "k8s" and Path("/artifacts").is_dir():
# Inside a k8s pod with /artifacts PVC mounted
output_dir = f"/artifacts/sweep_{ts}"
else:
# Local or k8s-from-host: use repo artifacts directory
output_dir = str(REPO_ROOT / "artifacts" / f"sweep_{ts}")
# Build K8s config
k8s_config = K8sConfig(
namespace=args.namespace,
dgd_name=args.dgd_name,
image=args.image,
frontend_port=args.frontend_port,
worker_replicas=args.worker_replicas,
frontend_replicas=args.frontend_replicas,
deploy_template=args.deploy_template,
reset_strategy=args.reset_strategy,
request_plane=args.request_plane,
event_plane=args.event_plane,
router_mode=args.router_mode,
deploy=args.deploy,
hf_token=args.hf_token,
image_pull_secret=args.image_pull_secret,
export_level=args.export_level,
)
# Compute k8s endpoint
if args.endpoint:
k8s_config.endpoint = args.endpoint
elif k8s_config.dgd_name:
k8s_config.endpoint = (
f"{k8s_config.dgd_name}-frontend:{k8s_config.frontend_port}"
)
else:
k8s_config.endpoint = f"frontend:{k8s_config.frontend_port}"
return SweepConfig(
model=args.model,
model_name=args.model_name or args.model,
mode=args.mode,
backend=args.backend,
tokenizers=tokenizers,
concurrencies=concurrencies,
isls=isls,
osl=args.osl,
worker_counts=worker_counts,
num_models=args.num_models,
aiperf_targets=args.aiperf_targets,
speedup_ratio=args.speedup_ratio,
benchmark_duration=args.benchmark_duration,
num_requests_list=num_requests_list,
rps_list=rps_list,
output_dir=output_dir,
max_consecutive_fails=args.max_consecutive_fails,
cooldown=args.cooldown,
dry_run=args.dry_run,
no_report=args.no_report,
isolation_policy=args.isolation,
passthrough_args=args.passthrough or [],
k8s=k8s_config,
)
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Consecutive-failure skip policy for sweep runs."""
from __future__ import annotations

from typing import Dict, Tuple


class FailureTracker:
    """Track consecutive failures per (backend, concurrency, workers) tuple.

    After max_consecutive_fails consecutive failures at a given key,
    subsequent runs with the same key are skipped.
    """

    def __init__(self, max_consecutive_fails: int = 2):
        self.max_consecutive_fails = max_consecutive_fails
        self._counts: Dict[Tuple[str, int, int], int] = {}

    def should_skip(self, backend: str, concurrency: int, workers: int) -> bool:
        """Check if a run should be skipped due to prior consecutive failures."""
        key = (backend, concurrency, workers)
        return self._counts.get(key, 0) >= self.max_consecutive_fails

    def record_success(self, backend: str, concurrency: int, workers: int) -> None:
        """Record a successful run, resetting the failure count."""
        key = (backend, concurrency, workers)
        self._counts[key] = 0

    def record_failure(self, backend: str, concurrency: int, workers: int) -> int:
        """Record a failed run. Returns the new consecutive failure count."""
        key = (backend, concurrency, workers)
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]

    def get_count(self, backend: str, concurrency: int, workers: int) -> int:
        """Get the current consecutive failure count for a key."""
        key = (backend, concurrency, workers)
        return self._counts.get(key, 0)
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Lifecycle management -- deploy-dimension delta detection and reset strategy."""
from __future__ import annotations

from typing import Optional

from sweep_core.models import IsolationPolicy, RunSpec


def needs_deploy_or_reset(
    current: RunSpec,
    previous: Optional[RunSpec],
    isolation_policy: IsolationPolicy,
) -> bool:
    """Determine if the current run needs a deploy/reset before execution.

    Args:
        current: The run about to execute.
        previous: The run that just completed (None for the first run).
        isolation_policy: The sweep-level isolation policy.

    Returns:
        True if a deploy/reset is needed before this run.
    """
    if previous is None:
        # First run always needs deployment
        return True

    if isolation_policy == "fresh_per_run":
        # Every run gets its own deploy/reset cycle
        return True

    # reuse_by_deploy_key: only reset when the deploy key changes
    return current.deploy_key != previous.deploy_key


def deploy_key_changed(
    current: RunSpec,
    previous: Optional[RunSpec],
) -> bool:
    """Check if the deploy key has changed between consecutive runs."""
    if previous is None:
        return True
    return current.deploy_key != previous.deploy_key
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""
Data models for sweep_core.
All data structures are plain dataclasses that serialize to/from JSON/dict.
No subprocess, kubectl, or argparse imports allowed in this module.
"""
from __future__ import annotations
import json
from dataclasses import dataclass, field
from typing import Dict, List, Literal, Optional
IsolationPolicy = Literal["fresh_per_run", "reuse_by_deploy_key"]
@dataclass(frozen=True)
class DeployKey:
"""Hashable key identifying a unique deployment configuration."""
backend: str
tokenizer: str
workers: int
num_models: int
env_overrides: frozenset[tuple[str, str]] = field(default_factory=frozenset)
def to_dict(self) -> dict:
return {
"backend": self.backend,
"tokenizer": self.tokenizer,
"workers": self.workers,
"num_models": self.num_models,
"env_overrides": dict(self.env_overrides),
}
@classmethod
def from_dict(cls, d: dict) -> DeployKey:
env = d.get("env_overrides", {})
return cls(
backend=d["backend"],
tokenizer=d["tokenizer"],
workers=d["workers"],
num_models=d["num_models"],
env_overrides=frozenset(env.items())
if isinstance(env, dict)
else frozenset(env),
)
@dataclass
class DeployDimension:
"""Configuration for a single deployment state."""
backend: str # "mocker" or "vllm"
tokenizer: str # "hf" or "fastokens"
workers: int = 2
num_models: int = 1
env_overrides: Dict[str, str] = field(default_factory=dict)
@property
def deploy_key(self) -> DeployKey:
return DeployKey(
backend=self.backend,
tokenizer=self.tokenizer,
workers=self.workers,
num_models=self.num_models,
env_overrides=frozenset(self.env_overrides.items()),
)
def to_dict(self) -> dict:
return {
"backend": self.backend,
"tokenizer": self.tokenizer,
"workers": self.workers,
"num_models": self.num_models,
"env_overrides": self.env_overrides,
}
@classmethod
def from_dict(cls, d: dict) -> DeployDimension:
return cls(
backend=d["backend"],
tokenizer=d["tokenizer"],
workers=d.get("workers", 2),
num_models=d.get("num_models", 1),
env_overrides=d.get("env_overrides", {}),
)
@dataclass
class AiperfDimension:
"""Configuration for a single aiperf run."""
concurrency: int
isl: int
osl: int = 256
num_requests: Optional[int] = None
benchmark_duration: Optional[int] = None
request_rate: Optional[int] = None
def to_dict(self) -> dict:
return {
"concurrency": self.concurrency,
"isl": self.isl,
"osl": self.osl,
"num_requests": self.num_requests,
"benchmark_duration": self.benchmark_duration,
"request_rate": self.request_rate,
}
@classmethod
def from_dict(cls, d: dict) -> AiperfDimension:
return cls(
concurrency=d["concurrency"],
isl=d["isl"],
osl=d.get("osl", 256),
num_requests=d.get("num_requests"),
benchmark_duration=d.get("benchmark_duration"),
request_rate=d.get("request_rate"),
)
@dataclass
class RunSpec:
"""One logical perf run -- the atomic unit of execution."""
deploy: DeployDimension
aiperf: AiperfDimension
deploy_key: DeployKey
run_id: str
def to_dict(self) -> dict:
return {
"deploy": self.deploy.to_dict(),
"aiperf": self.aiperf.to_dict(),
"deploy_key": self.deploy_key.to_dict(),
"run_id": self.run_id,
}
@classmethod
def from_dict(cls, d: dict) -> RunSpec:
deploy = DeployDimension.from_dict(d["deploy"])
aiperf = AiperfDimension.from_dict(d["aiperf"])
deploy_key = DeployKey.from_dict(d["deploy_key"])
return cls(
deploy=deploy,
aiperf=aiperf,
deploy_key=deploy_key,
run_id=d["run_id"],
)
@dataclass
class RunResult:
"""Result from a single sweep point."""
run_spec: RunSpec
status: str = "pending" # ok, fail, skipped
req_per_sec: float = 0.0
output_tok_per_sec: float = 0.0
ttft_p50_ms: float = 0.0
ttft_p99_ms: float = 0.0
itl_p50_ms: float = 0.0
itl_p99_ms: float = 0.0
duration_sec: float = 0.0
run_dir: str = ""
def to_dict(self) -> dict:
return {
"run_spec": self.run_spec.to_dict(),
"status": self.status,
"req_per_sec": self.req_per_sec,
"output_tok_per_sec": self.output_tok_per_sec,
"ttft_p50_ms": self.ttft_p50_ms,
"ttft_p99_ms": self.ttft_p99_ms,
"itl_p50_ms": self.itl_p50_ms,
"itl_p99_ms": self.itl_p99_ms,
"duration_sec": self.duration_sec,
"run_dir": self.run_dir,
}
@dataclass
class K8sConfig:
"""K8s-specific configuration."""
namespace: str = "dynamo-bench"
endpoint: str = "frontend:8000"
dgd_name: str = ""
image: str = ""
frontend_port: int = 8000
worker_replicas: int = 1
frontend_replicas: int = 1
deploy_template: str = "" # path to deploy.yaml template
reset_strategy: str = "graph" # none | frontend | graph
request_plane: str = "tcp"
event_plane: str = "nats"
router_mode: str = "round-robin"
deploy: bool = False
hf_token: str = ""
image_pull_secret: str = ""
export_level: str = "summary"
def to_dict(self) -> dict:
return {
"namespace": self.namespace,
"endpoint": self.endpoint,
"dgd_name": self.dgd_name,
"image": self.image,
"frontend_port": self.frontend_port,
"worker_replicas": self.worker_replicas,
"frontend_replicas": self.frontend_replicas,
"deploy_template": self.deploy_template,
"reset_strategy": self.reset_strategy,
"request_plane": self.request_plane,
"event_plane": self.event_plane,
"router_mode": self.router_mode,
"deploy": self.deploy,
"hf_token": "***" if self.hf_token else "",
"image_pull_secret": self.image_pull_secret,
"export_level": self.export_level,
}
@classmethod
def from_dict(cls, d: dict) -> K8sConfig:
return cls(**{k: v for k, v in d.items() if k in cls.__dataclass_fields__})
@dataclass
class SweepConfig:
"""Top-level configuration for a sweep."""
model: str = "Qwen/Qwen3-0.6B"
model_name: str = ""
mode: str = "local" # "local" or "k8s"
backend: str = "mocker"
tokenizers: List[str] = field(default_factory=lambda: ["hf", "fastokens"])
concurrencies: List[int] = field(default_factory=lambda: [50, 100, 200])
isls: List[int] = field(default_factory=lambda: [512, 1024, 2048])
osl: int = 256
worker_counts: List[int] = field(default_factory=lambda: [2])
num_models: int = 1
aiperf_targets: str = "first"
speedup_ratio: float = 1.0
benchmark_duration: Optional[int] = 60
num_requests_list: List[Optional[int]] = field(default_factory=lambda: [None])
rps_list: List[Optional[int]] = field(default_factory=lambda: [None])
output_dir: str = ""
max_consecutive_fails: int = 2
cooldown: int = 3
dry_run: bool = False
no_report: bool = False
isolation_policy: IsolationPolicy = "fresh_per_run"
passthrough_args: List[str] = field(default_factory=list)
k8s: K8sConfig = field(default_factory=K8sConfig)
def __post_init__(self):
if not self.model_name:
self.model_name = self.model
def to_dict(self) -> dict:
return {
"model": self.model,
"model_name": self.model_name,
"mode": self.mode,
"backend": self.backend,
"tokenizers": self.tokenizers,
"concurrencies": self.concurrencies,
"isls": self.isls,
"osl": self.osl,
"worker_counts": self.worker_counts,
"num_models": self.num_models,
"aiperf_targets": self.aiperf_targets,
"speedup_ratio": self.speedup_ratio,
"benchmark_duration": self.benchmark_duration,
"num_requests_list": self.num_requests_list,
"rps_list": self.rps_list,
"output_dir": self.output_dir,
"max_consecutive_fails": self.max_consecutive_fails,
"cooldown": self.cooldown,
"dry_run": self.dry_run,
"no_report": self.no_report,
"isolation_policy": self.isolation_policy,
"passthrough_args": self.passthrough_args,
"k8s": self.k8s.to_dict(),
}
    @classmethod
    def from_dict(cls, d: dict) -> SweepConfig:
        # Read "k8s" with get() rather than pop() so the caller's dict
        # is left unmodified.
        k8s_data = d.get("k8s") or {}
        config = cls(
            **{
                k: v
                for k, v in d.items()
                if k in cls.__dataclass_fields__ and k != "k8s"
            }
        )
        if k8s_data:
            config.k8s = K8sConfig.from_dict(k8s_data)
        return config
@dataclass
class SweepPlan:
"""Serializable execution plan."""
config: SweepConfig
runs: List[RunSpec]
isolation_policy: IsolationPolicy
total_runs: int
def to_dict(self) -> dict:
return {
"config": self.config.to_dict(),
"runs": [r.to_dict() for r in self.runs],
"isolation_policy": self.isolation_policy,
"total_runs": self.total_runs,
}
@classmethod
def from_dict(cls, d: dict) -> SweepPlan:
config = SweepConfig.from_dict(d["config"])
runs = [RunSpec.from_dict(r) for r in d["runs"]]
return cls(
config=config,
runs=runs,
isolation_policy=d["isolation_policy"],
total_runs=d["total_runs"],
)
def to_json(self, indent: int = 2) -> str:
return json.dumps(self.to_dict(), indent=indent)
@classmethod
def from_json(cls, s: str) -> SweepPlan:
return cls.from_dict(json.loads(s))
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Run ID and directory naming conventions for sweep runs."""
from __future__ import annotations

from sweep_core.models import AiperfDimension, DeployDimension


def build_run_id(deploy: DeployDimension, aiperf: AiperfDimension) -> str:
    """Build a human-readable run ID from deploy + aiperf dimensions.

    Format: {tokenizer}_c{concurrency}_isl{isl}_w{workers}[_m{models}][_rps{rate}]

    This matches the naming convention from the original sweep_runner.py.

    Note: num_requests is not encoded in the ID, so runs that differ only in
    num_requests share the same run ID and therefore the same run directory.
    """
    base = f"{deploy.tokenizer}_c{aiperf.concurrency}_isl{aiperf.isl}_w{deploy.workers}"
    if deploy.num_models > 1:
        base += f"_m{deploy.num_models}"
    if aiperf.request_rate is not None:
        base += f"_rps{aiperf.request_rate}"
    return base
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""
Sequential plan runner -- iterates through a SweepPlan using a SweepExecutor.
This module is interface-agnostic: it does not import argparse, subprocess,
or kubectl. It is callable from CLI, MCP server, or test harness.
"""
from __future__ import annotations
import time
from pathlib import Path
from typing import TYPE_CHECKING, List, Optional
from sweep_core.artifacts import (
print_results_table,
write_csv,
write_summary,
write_sweep_config,
)
from sweep_core.failures import FailureTracker
from sweep_core.lifecycle import needs_deploy_or_reset
from sweep_core.models import RunResult, RunSpec, SweepPlan
from sweep_core.reporting import generate_report
if TYPE_CHECKING:
    from sweep_executors.base import SweepExecutor


def run(plan: SweepPlan, executor: "SweepExecutor") -> List[RunResult]:
    """Execute a SweepPlan sequentially using the given executor.

    Args:
        plan: The sweep plan to execute.
        executor: The executor that handles individual runs.

    Returns:
        List of RunResult objects, one per run.
    """
    config = plan.config
    output_root = Path(config.output_dir)
    output_root.mkdir(parents=True, exist_ok=True)

    csv_path = output_root / "results.csv"
    summary_path = output_root / "summary.md"

    # Write sweep config
    write_sweep_config(config, output_root, total_runs=plan.total_runs)

    failure_tracker = FailureTracker(config.max_consecutive_fails)
    results: List[RunResult] = []
    previous_run: Optional[RunSpec] = None

    try:
        # Prepare executor inside try so cleanup() runs on prepare failure
        executor.prepare(config)

        for i, run_spec in enumerate(plan.runs, 1):
            deploy = run_spec.deploy
            aiperf = run_spec.aiperf
            run_dir = output_root / run_spec.run_id

            # Check skip policy
            if failure_tracker.should_skip(
                deploy.backend, aiperf.concurrency, deploy.workers
            ):
                result = RunResult(
                    run_spec=run_spec,
                    status="skipped",
                    run_dir=str(run_dir),
                )
                results.append(result)
                print(
                    f"\n [{i}/{plan.total_runs}] SKIPPED {run_spec.run_id} "
                    f"({config.max_consecutive_fails} consecutive failures)"
                )
                continue

            print(f"\n{'=' * 60}")
            print(f" [{i}/{plan.total_runs}] {run_spec.run_id}")
            print(f"{'=' * 60}")

            # Deploy or reset if needed
            if needs_deploy_or_reset(run_spec, previous_run, plan.isolation_policy):
                prev_deploy = previous_run.deploy if previous_run else None
                executor.apply_deploy(deploy, prev_deploy)

            # Execute the run
            result = executor.execute_run(run_spec, run_dir)
            results.append(result)
            previous_run = run_spec

            # Update failure tracking
            if result.status == "ok":
                failure_tracker.record_success(
                    deploy.backend, aiperf.concurrency, deploy.workers
                )
                rps = f"{result.req_per_sec:.1f}" if result.req_per_sec else "N/A"
                tp50 = f"{result.ttft_p50_ms:.1f}ms" if result.ttft_p50_ms else "N/A"
                print(f" OK: {rps} req/s, TTFT p50={tp50}")
            else:
                count = failure_tracker.record_failure(
                    deploy.backend, aiperf.concurrency, deploy.workers
                )
                print(f" FAIL (consecutive: {count}/{config.max_consecutive_fails})")

            # Generate per-run report
            if not config.no_report and result.status == "ok":
                generate_report(run_dir)

            # Write incremental CSV + summary
            write_csv(results, csv_path, config)
            write_summary(results, summary_path)

            # Cooldown between runs
            if i < plan.total_runs:
                time.sleep(config.cooldown)

    except KeyboardInterrupt:
        print("\n\nInterrupted! Partial results saved.")
    finally:
        # Final write
        write_csv(results, csv_path, config)
        write_summary(results, summary_path)
        # Cleanup executor
        executor.cleanup()

    # Print final table
    print_results_table(results)
    print(f"\nResults: {csv_path}")
    print(f"Summary: {summary_path}")
    print(f"Per-run: {output_root}/<run_id>/report.md")
    return results
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""
SweepPlan builder -- constructs a serializable execution plan from SweepConfig.
The planner builds the Cartesian product of deploy dimensions x aiperf dimensions,
producing a flat list of RunSpec objects. The isolation policy determines how
they are executed by the orchestrator.
"""
from __future__ import annotations
from sweep_core.models import (
AiperfDimension,
DeployDimension,
RunSpec,
SweepConfig,
SweepPlan,
)
from sweep_core.naming import build_run_id
def build_plan(config: SweepConfig) -> SweepPlan:
    """Build a SweepPlan from a SweepConfig.

    The plan contains a flat list of RunSpecs, one per (deploy, aiperf) combination.
    The ordering is: tokenizers -> workers -> concurrencies -> ISLs -> num_requests -> rps
    This matches the grid construction order from the original sweep_runner.py.
    """
    runs: list[RunSpec] = []

    for tokenizer in config.tokenizers:
        for workers in config.worker_counts:
            for concurrency in config.concurrencies:
                for isl in config.isls:
                    for nr in config.num_requests_list:
                        for rps in config.rps_list:
                            deploy = DeployDimension(
                                backend=config.backend,
                                tokenizer=tokenizer,
                                workers=workers,
                                num_models=config.num_models,
                            )
                            aiperf = AiperfDimension(
                                concurrency=concurrency,
                                isl=isl,
                                osl=config.osl,
                                num_requests=nr,
                                benchmark_duration=config.benchmark_duration
                                if nr is None
                                else None,
                                request_rate=rps,
                            )
                            run_id = build_run_id(deploy, aiperf)
                            runs.append(
                                RunSpec(
                                    deploy=deploy,
                                    aiperf=aiperf,
                                    deploy_key=deploy.deploy_key,
                                    run_id=run_id,
                                )
                            )

    return SweepPlan(
        config=config,
        runs=runs,
        isolation_policy=config.isolation_policy,
        total_runs=len(runs),
    )
def print_plan(plan: SweepPlan) -> None:
    """Print a human-readable summary of the sweep plan."""
    config = plan.config
    print(f"Sweep plan: {plan.total_runs} runs")
    print(f" Model: {config.model}")
    print(f" Mode: {config.mode}")
    print(f" Backend: {config.backend}")
    print(f" Tokenizers: {config.tokenizers}")
    print(f" Concurrencies: {config.concurrencies}")
    print(f" ISLs: {config.isls}")
    print(f" Workers/model: {config.worker_counts}")
    print(f" Models: {config.num_models}")
    print(f" Isolation: {plan.isolation_policy}")
    print(f" Benchmark dur: {config.benchmark_duration}s")
    nr_list = [n for n in config.num_requests_list if n is not None]
    if nr_list:
        print(f" Num requests: {nr_list}")
    rps_list = [r for r in config.rps_list if r is not None]
    if rps_list:
        print(f" Request rates: {rps_list} req/s")
    print(f" Output: {config.output_dir}")
    if config.mode == "k8s":
        print(f" Namespace: {config.k8s.namespace}")
        print(f" Endpoint: {config.k8s.endpoint}")
        if config.k8s.frontend_replicas > 1:
            print(f" FE replicas: {config.k8s.frontend_replicas}")
        if config.k8s.dgd_name:
            print(f" DGD: {config.k8s.dgd_name}")
        if config.k8s.deploy_template:
            print(f" Template: {config.k8s.deploy_template}")
        print(f" Reset strategy: {config.k8s.reset_strategy}")
    print()
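The nested loops in `build_plan` amount to a Cartesian product in which the last dimension varies fastest. The following is a minimal, standalone sketch of that idea using `itertools.product`; it is not the module's code. `DemoRun` and `demo_grid` are hypothetical names, the tokenizer labels and grid values are illustrative, and the `rps` dimension is left out for brevity. It also mirrors the rule that a benchmark duration is only set when no request count is given.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class DemoRun:
    """Hypothetical stand-in for RunSpec, holding only the swept values."""

    tokenizer: str
    workers: int
    concurrency: int
    isl: int
    num_requests: Optional[int]
    benchmark_duration: Optional[int]


def demo_grid(
    tokenizers: list[str],
    worker_counts: list[int],
    concurrencies: list[int],
    isls: list[int],
    num_requests_list: list[Optional[int]],
    benchmark_duration: int,
) -> list[DemoRun]:
    """Build a flat run list; product() iterates like the nested for-loops."""
    runs = []
    for tok, workers, conc, isl, nr in product(
        tokenizers, worker_counts, concurrencies, isls, num_requests_list
    ):
        # Duration applies only when the run is not bounded by a request count.
        duration = benchmark_duration if nr is None else None
        runs.append(DemoRun(tok, workers, conc, isl, nr, duration))
    return runs


# Illustrative values only: 2 tokenizers x 2 worker counts x 2 concurrencies
# x 1 ISL x 2 request settings.
runs = demo_grid(["hf", "fastokens"], [1, 2], [8, 16], [512], [None, 100], 60)
print(len(runs))
```

Because the grid is flat and ordered, runs that share a deployment (same tokenizer and worker count) are adjacent, which is what lets an orchestrator skip redeploying between them.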
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Per-run report generation -- wraps analysis/create_report.py."""
from __future__ import annotations
import sys
from pathlib import Path
SCRIPT_DIR = Path(__file__).resolve().parent.parent
ANALYSIS_DIR = SCRIPT_DIR / "analysis"
# Add analysis directory to sys.path once at import time
if str(ANALYSIS_DIR) not in sys.path:
sys.path.insert(0, str(ANALYSIS_DIR))
def generate_report(run_dir: Path) -> None:
    """Run create_report.py on a single run directory, saving report.md."""
    try:
        from create_report import run_analysis

        report = run_analysis(run_dir)
        (run_dir / "report.md").write_text(report)
    except Exception as e:
        # Best-effort: any failure (missing module, I/O error, analysis bug)
        # is logged rather than raised. A single handler suffices because
        # ImportError and OSError are subclasses of Exception.
        print(f" Report generation failed: {e}")
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""sweep_executors -- how individual runs execute."""
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""
SweepExecutor protocol -- the run-level extensibility seam.
Each executor implements this protocol. The orchestrator calls these methods
without knowing whether runs execute locally, in k8s, or elsewhere.
"""
from __future__ import annotations
from pathlib import Path
from typing import Optional, Protocol, runtime_checkable
from sweep_core.models import DeployDimension, RunResult, RunSpec, SweepConfig
@runtime_checkable
class SweepExecutor(Protocol):
    """Protocol for sweep executors."""

    def prepare(self, config: SweepConfig) -> None:
        """One-time setup before the sweep begins (e.g., start infra)."""
        ...

    def apply_deploy(
        self,
        deploy: DeployDimension,
        prev: Optional[DeployDimension],
    ) -> None:
        """Apply a deployment change (e.g., restart frontend, switch backend).

        Args:
            deploy: The deployment configuration to apply.
            prev: The previous deployment configuration (None for first run).
        """
        ...

    def execute_run(self, run_spec: RunSpec, run_dir: Path) -> RunResult:
        """Execute a single run and return results.

        Args:
            run_spec: The run specification.
            run_dir: Directory where artifacts should be written.

        Returns:
            RunResult with status and metrics.
        """
        ...

    def cleanup(self) -> None:
        """Cleanup after the sweep completes (e.g., stop infra)."""
        ...
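As an illustration of how a `runtime_checkable` protocol acts as an extensibility seam, here is a self-contained sketch. It does not import `sweep_core`: `DemoExecutor`, `DryRunExecutor`, `Incomplete`, and `drive` are hypothetical names, plain dicts stand in for the real model classes, and the call order in `drive` is an assumption about how an orchestrator might use the four methods. Note that `isinstance` against a runtime-checkable protocol only checks that the methods exist, not their signatures.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class DemoExecutor(Protocol):
    """Simplified stand-in for SweepExecutor, with dicts instead of models."""

    def prepare(self, config: dict) -> None: ...
    def apply_deploy(self, deploy: dict, prev: Optional[dict]) -> None: ...
    def execute_run(self, run_spec: dict, run_dir: Path) -> dict: ...
    def cleanup(self) -> None: ...


class DryRunExecutor:
    """Hypothetical executor that records calls instead of running anything."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def prepare(self, config: dict) -> None:
        self.calls.append("prepare")

    def apply_deploy(self, deploy: dict, prev: Optional[dict]) -> None:
        self.calls.append("apply_deploy")

    def execute_run(self, run_spec: dict, run_dir: Path) -> dict:
        self.calls.append("execute_run")
        return {"status": "ok", "run_dir": str(run_dir)}

    def cleanup(self) -> None:
        self.calls.append("cleanup")


class Incomplete:
    """Lacks three of the four methods, so it does not satisfy the protocol."""

    def prepare(self, config: dict) -> None: ...


def drive(executor: DemoExecutor, runs: list[dict], root: Path) -> list[dict]:
    """Assumed orchestration order: prepare, per-run deploy + execute, cleanup."""
    executor.prepare({})
    results = []
    prev = None
    try:
        for run in runs:
            executor.apply_deploy(run["deploy"], prev)
            results.append(executor.execute_run(run, root / run["run_id"]))
            prev = run["deploy"]
    finally:
        executor.cleanup()  # always release resources, even if a run raises
    return results


executor = DryRunExecutor()
results = drive(
    executor,
    [{"run_id": "r1", "deploy": {"tokenizer": "hf"}}],
    Path("artifacts"),
)
print(isinstance(executor, DemoExecutor), isinstance(Incomplete(), DemoExecutor))
```

The driver never names a concrete executor class, so a local, k8s, or dry-run implementation can be swapped in without changing it.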
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""
K8sDgdExecutor -- DynamoGraphDeployment-based executor for k8s sweeps.
Handles DGD backend switching, restart strategies, metrics capture,
and aiperf invocation against a k8s-deployed frontend.
When --deploy-template is provided, uses template rendering instead of
DGD patching. This enables arbitrary backend deployments.
"""
from __future__ import annotations
import json
from pathlib import Path
from typing import Optional
from sweep_core.models import DeployDimension, RunResult, RunSpec, SweepConfig
from sweep_k8s import aiperf as k8s_aiperf
from sweep_k8s import dgd as k8s_dgd
from sweep_k8s import template as k8s_template
from sweep_k8s.kubectl import apply_secret_literal
from sweep_k8s.metrics import capture_metrics
class K8sDgdExecutor:
    """Executor for k8s sweeps using DynamoGraphDeployment."""

    def __init__(self) -> None:
        self._config: Optional[SweepConfig] = None
        self._template_path: Optional[Path] = None
        self._incluster_endpoint: str = ""  # in-cluster service DNS for aiperf Jobs

    def prepare(self, config: SweepConfig) -> None:
        """Store config and validate k8s setup."""
        self._config = config
        k8s = config.k8s
        if k8s.deploy and not k8s.deploy_template:
            raise ValueError(
                "--deploy requires --deploy-template; otherwise pre-deploy the DGD and omit --deploy"
            )
        if k8s.deploy_template and not k8s.deploy:
            raise ValueError(
                "--deploy-template mutates cluster resources; pass --deploy to allow template application"
            )
        if k8s.deploy_template:
            self._template_path = Path(k8s.deploy_template)
            if not self._template_path.exists():
                raise FileNotFoundError(
                    f"Deploy template not found: {self._template_path}"
                )
            print(f" Using deploy template: {self._template_path}")
        if k8s.hf_token:
            print(
                f" Updating HuggingFace token secret: {k8s_template.DEFAULT_HF_TOKEN_SECRET_NAME}"
            )
            apply_secret_literal(
                k8s_template.DEFAULT_HF_TOKEN_SECRET_NAME,
                k8s.namespace,
                "HF_TOKEN",
                k8s.hf_token,
            )
        # Compute the in-cluster endpoint for aiperf Jobs.
        # The user-provided --endpoint may be port-forwarded (e.g. localhost:18000),
        # but aiperf Jobs run inside the cluster and need the service DNS name.
        if k8s.dgd_name:
            self._incluster_endpoint = f"{k8s.dgd_name}-frontend:{k8s.frontend_port}"
        else:
            self._incluster_endpoint = k8s.endpoint
        print(f" In-cluster endpoint for aiperf: {self._incluster_endpoint}")
        # Wait for model to be ready before starting sweep.
        # Skip when using deploy templates -- the deployment hasn't been applied yet.
        if not self._template_path:
            print("--- Pre-flight: waiting for frontend ---")
            k8s_dgd.wait_model_ready(
                self._incluster_endpoint,
                config.model_name,
                max_wait=300,
                namespace=k8s.namespace,
            )
    def apply_deploy(
        self,
        deploy: DeployDimension,
        prev: Optional[DeployDimension],
    ) -> None:
        """Apply a deployment change -- template-based or DGD patching."""
        if self._config is None:
            raise RuntimeError("prepare() must be called before apply_deploy()")
        config = self._config
        k8s = config.k8s
        if self._template_path:
            # Template-based deployment: render + apply
            k8s_template.apply_rendered_template(self._template_path, deploy, config)
            print(" Waiting for deployment to be ready...")
            k8s_dgd.wait_model_ready(
                self._incluster_endpoint,
                config.model_name,
                namespace=k8s.namespace,
                max_wait=300,
            )
            return
        # Legacy DGD patching
        if not k8s.dgd_name:
            print(" WARNING: no DGD name set for k8s mode; skipping deploy")
            return
        # Check if tokenizer changed from previous run
        if prev is not None and deploy.tokenizer != prev.tokenizer:
            # Tokenizer changed -- need to switch backend
            k8s_dgd.dgd_switch_backend(
                k8s.dgd_name,
                k8s.namespace,
                k8s.endpoint,
                config.model_name,
                deploy.tokenizer,
            )
            return
        # First run or same tokenizer -- apply reset strategy
        # (On first run the DGD is already deployed with the right backend;
        # we just reset to get a clean baseline for metrics.)
        self._apply_reset_strategy()
    def _apply_reset_strategy(self) -> None:
        """Apply the configured reset strategy."""
        if self._config is None:
            raise RuntimeError(
                "prepare() must be called before _apply_reset_strategy()"
            )
        k8s = self._config.k8s
        strategy = k8s.reset_strategy
        if strategy == "graph":
            if k8s.dgd_name:
                k8s_dgd.dgd_restart_graph(
                    k8s.dgd_name,
                    k8s.namespace,
                    k8s.endpoint,
                    self._config.model_name,
                )
            else:
                print(" WARNING: graph reset requires --dgd-name")
        elif strategy == "frontend":
            if k8s.dgd_name:
                k8s_dgd.dgd_restart_frontend(
                    k8s.dgd_name,
                    k8s.namespace,
                    k8s.endpoint,
                    self._config.model_name,
                )
            else:
                print(" WARNING: frontend reset requires --dgd-name")
        elif strategy == "none":
            # Just wait for readiness
            if k8s.dgd_name:
                k8s_dgd.dgd_wait_all_ready(
                    k8s.dgd_name,
                    k8s.namespace,
                    k8s.endpoint,
                    self._config.model_name,
                    max_wait=60,
                )
            else:
                k8s_dgd.wait_model_ready(
                    k8s.endpoint, self._config.model_name, max_wait=60
                )
    def execute_run(self, run_spec: RunSpec, run_dir: Path) -> RunResult:
        """Execute a single k8s run: metrics capture + aiperf + post-metrics."""
        if self._config is None:
            raise RuntimeError("prepare() must be called before execute_run()")
        config = self._config
        k8s = config.k8s
        aiperf = run_spec.aiperf
        result = RunResult(run_spec=run_spec, run_dir=str(run_dir))
        run_dir.mkdir(parents=True, exist_ok=True)
        # Capture pre-run metrics (use in-cluster endpoint + kubectl exec fallback)
        frontend_label = (
            (
                f"nvidia.com/dynamo-graph-deployment-name={k8s.dgd_name},"
                f"nvidia.com/dynamo-component-type=frontend"
            )
            if k8s.dgd_name
            else None
        )
        capture_metrics(
            self._incluster_endpoint,
            run_dir / "frontend_metrics_pre.txt",
            namespace=k8s.namespace,
            pod_label=frontend_label,
        )
        # Run aiperf as a k8s Job (uses in-cluster service endpoint)
        success = k8s_aiperf.run_aiperf(
            artifact_dir=run_dir / "aiperf",
            endpoint=self._incluster_endpoint,
            model_name=config.model_name,
            concurrency=aiperf.concurrency,
            isl=aiperf.isl,
            namespace=k8s.namespace,
            image=k8s.image,
            run_id=run_spec.run_id,
            osl=aiperf.osl,
            benchmark_duration=aiperf.benchmark_duration,
            num_requests=aiperf.num_requests,
            request_rate=aiperf.request_rate,
            export_level=k8s.export_level,
            image_pull_secret=k8s.image_pull_secret,
            hf_token_secret_name=k8s_template.DEFAULT_HF_TOKEN_SECRET_NAME,
        )
        result.status = "ok" if success else "fail"
        # Capture post-run metrics
        capture_metrics(
            self._incluster_endpoint,
            run_dir / "frontend_metrics_post.txt",
            namespace=k8s.namespace,
            pod_label=frontend_label,
        )
        # Parse aiperf results
        _parse_k8s_aiperf_into_result(result, run_dir)
        return result

    def cleanup(self) -> None:
        """No persistent state to clean up."""
def _parse_k8s_aiperf_into_result(result: RunResult, run_dir: Path) -> None:
    """Parse aiperf results from k8s run directory."""
    aiperf_json = run_dir / "aiperf" / "profile_export_aiperf.json"
    if not aiperf_json.exists():
        return
    try:
        data = json.loads(aiperf_json.read_text())
        # "x.get(key, 0) or 0" maps both a missing key and an explicit null to 0.
        rt = data.get("request_throughput", {})
        result.req_per_sec = rt.get("avg", 0) or 0
        ot = data.get("output_token_throughput", {})
        result.output_tok_per_sec = ot.get("avg", 0) or 0
        ttft = data.get("time_to_first_token", data.get("ttft", {}))
        if isinstance(ttft, dict):
            result.ttft_p50_ms = ttft.get("p50", 0) or 0
            result.ttft_p99_ms = ttft.get("p99", 0) or 0
        itl = data.get("inter_token_latency", data.get("itl", {}))
        if isinstance(itl, dict):
            result.itl_p50_ms = itl.get("p50", 0) or 0
            result.itl_p99_ms = itl.get("p99", 0) or 0
        bd = data.get("benchmark_duration", 0)
        result.duration_sec = bd.get("avg", 0) if isinstance(bd, dict) else (bd or 0)
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Malformed or unexpectedly shaped output leaves the metrics at their
        # defaults. AttributeError covers a throughput block that is null or
        # not a dict, where the .get() call would otherwise raise.
        pass
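The defensive lookups in `_parse_k8s_aiperf_into_result` can be shown in isolation. The sketch below is standalone and makes several assumptions: `metric` is a hypothetical helper, not part of the module, and the sample JSON is invented for illustration rather than taken from a real aiperf export. It demonstrates the fallback key name (`ttft` in place of `time_to_first_token`) and the `or 0` idiom that turns a JSON `null` into `0`.

```python
from __future__ import annotations

import json
from typing import Optional


def metric(
    data: dict, name: str, stat: str, fallback_name: Optional[str] = None
) -> float:
    """Return data[name][stat], tolerating missing keys, nulls, and non-dicts."""
    block = data.get(name)
    if block is None and fallback_name is not None:
        block = data.get(fallback_name)  # e.g. "ttft" for "time_to_first_token"
    if not isinstance(block, dict):
        return 0
    # .get() handles a missing statistic; "or 0" handles an explicit null.
    return block.get(stat, 0) or 0


# Invented sample payload, shaped like the keys the parser reads.
sample = json.loads(
    '{"request_throughput": {"avg": 12.5},'
    ' "ttft": {"p50": 40.0, "p99": null}}'
)

print(metric(sample, "request_throughput", "avg"))
print(metric(sample, "time_to_first_token", "p50", "ttft"))
```

Returning `0` for anything missing keeps the results table well-formed, at the cost of making "not measured" indistinguishable from a true zero; a caller that needs that distinction could return `None` instead.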