feat(frontend): three-layer frontend perf sweep with local and k8s support (#7700)

Commit 273252e6 (unverified) · authored Mar 31, 2026 by Biswa Panda · committed by GitHub Mar 31, 2026
20 changed files
--- a/benchmarks/frontend/README.md
+++ b/benchmarks/frontend/README.md
+<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
+<!-- SPDX-License-Identifier: Apache-2.0 -->
+
+# Frontend Performance Benchmark Suite
+
+A configurable sweep runner for measuring Dynamo frontend model serving performance.  It drives [aiperf](https://github.com/ai-dynamo/aiperf) load against a frontend/mocker (or frontend/vLLM) stack and collects throughput, latency, and observability data across a grid of parameters.
+
+The primary use case is **HuggingFace tokenizer vs. fastokens comparison** -- sweeping across concurrency levels, input sequence lengths (ISL), and worker counts to quantify the tokenizer's impact on end-to-end performance.
+
+---
+
+## Architecture
+
+The codebase follows a three-layer design that separates pure logic from execution and infrastructure concerns.
+
+| Layer | Package | Responsibility |
+|-------|---------|----------------|
+| **Core** | `scripts/sweep_core/` | Pure data models, plan construction, artifact writing, reporting. No subprocess or kubectl calls. |
+| **Executors** | `scripts/sweep_executors/` | `SweepExecutor` protocol with two implementations -- `LocalExecutor` (delegates to `run_perf.sh`) and `K8sDgdExecutor` (DynamoGraphDeployment-based k8s runs). |
+| **K8s helpers** | `scripts/sweep_k8s/` | kubectl wrappers, DGD patching, template rendering, aiperf Job launching, and Prometheus metrics capture. |
+
+The entry point is `scripts/sweep_runner.py`, a thin CLI that wires the three layers together: it builds a `SweepPlan` from CLI arguments, selects an executor based on `--mode`, and feeds the plan to the orchestrator.
+
+**Data flow:**
+
+```
+CLI args --> SweepConfig --> SweepPlan (Cartesian grid of RunSpecs)
+                                |
+                          Orchestrator
+                                |
+                   LocalExecutor  or  K8sDgdExecutor
+                        |                   |
+                  run_perf.sh        DGD + aiperf Job
+                        |                   |
+                  artifacts/           artifacts/
+```
+
+---
+
+## Quick Start -- Local
+
+Local mode starts a mocker backend and frontend process on the current machine, runs aiperf against them, and tears everything down between runs.
+
+**Prerequisites:**
+
+- `dynamo.mocker` and `dynamo.frontend` installed (from the Dynamo repo)
+- `aiperf` installed and on `$PATH`
+- A HuggingFace model accessible locally (default: `Qwen/Qwen3-0.6B`)
+
+**Smoke test (2 runs, ~30 s each):**
+
+```bash
+cd benchmarks/frontend/scripts
+
+python3 sweep_runner.py \
+    --tokenizers hf,fastokens \
+    --concurrency 32 \
+    --isl 512 \
+    --benchmark-duration 30 \
+    --speedup-ratio 1000000
+```
+
+**Full local sweep:**
+
+```bash
+python3 sweep_runner.py \
+    --tokenizers hf,fastokens \
+    --concurrency 32,64,128 \
+    --isl 512,1024,2048
+```
+
+**Transport saturation sweep (high concurrency, vary workers):**
+
+```bash
+python3 sweep_runner.py \
+    --tokenizers hf \
+    --concurrency 4096 \
+    --num-requests 16384,32768 \
+    --workers 1,2,4,8 \
+    --speedup-ratio 1000000
+```
+
+Results are written to `artifacts/sweep_<timestamp>/`.
+
+---
+
+## Quick Start -- Kubernetes
+
+K8s mode deploys a DynamoGraphDeployment (DGD) into a Kubernetes namespace and launches aiperf as an in-cluster Job that targets the frontend service endpoint.
+
+### Prerequisites
+
+1. **Namespace** -- a dedicated namespace for the benchmark (default: `dynamo-bench`).
+2. **HuggingFace token secret** -- a Kubernetes Secret named `hf-token-secret`
+   containing your HF token, if the model requires authentication.
+3. **Model cache PVC** -- a PersistentVolumeClaim for caching model weights
+   (avoids repeated downloads across runs).
+4. **DGD deployed** -- either pre-deploy the DGD yourself, or use the
+   `--deploy --deploy-template` flags to let the sweep runner create it.
+5. **kubectl** configured with access to the target cluster and namespace.
+
+### Example: mocker backend
+
+```bash
+python3 sweep_runner.py \
+    --mode k8s \
+    --dgd-name dynamo-bench-mocker \
+    --tokenizers hf,fastokens \
+    --concurrency 50,100 \
+    --isl 512
+```
+
+### Example: template-based deployment
+
+When `--deploy-template` is provided, the runner renders the template with per-run variables (tokenizer, workers, model, etc.) and applies it via kubectl before each run group:
+
+```bash
+python3 sweep_runner.py \
+    --mode k8s \
+    --deploy \
+    --deploy-template dgd/templates/mocker.yaml \
+    --dgd-name dynamo-bench-mocker \
+    --image nvcr.io/.../image:tag \
+    --tokenizers hf,fastokens \
+    --concurrency 50,100 \
+    --isl 512
+```
+
+### How aiperf runs in-cluster
+
+The sweep runner creates a short-lived Kubernetes Job in the same namespace as the DGD. The Job pod runs `aiperf` against the frontend's in-cluster service DNS name (e.g., `dynamo-bench-mocker-frontend:8000`). Once the Job completes, artifacts are copied back to the local host via `kubectl cp`.
+
+### Reset strategy
+
+Between runs, the `--reset-strategy` flag controls how the deployed stack is
+recycled:
+
+| Strategy | Behavior |
+|----------|----------|
+| `none` | No resets; runs back-to-back on the same deployment. |
+| `frontend` | Restart only the frontend pod between runs. |
+| `graph` (default) | Redeploy the entire DGD graph between run groups. |
+
+---
+
+## CLI Reference
+
+All flags for `sweep_runner.py`:
+
+### Common options
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--mode` | `local` | Execution mode: `local` or `k8s`. |
+| `--backend` | `mocker` | Engine backend: `mocker` (synthetic) or `vllm` (real inference). |
+| `--model` | `Qwen/Qwen3-0.6B` | HuggingFace model path. |
+| `--model-name` | same as `--model` | Served model name (for multi-model setups). |
+| `--tokenizers` | `hf,fastokens` | Comma-separated tokenizer backends. |
+| `--concurrency` | `50,100,200` | Comma-separated concurrency levels. |
+| `--isl` | `512,1024,2048` | Comma-separated input sequence lengths. |
+| `--osl` | `256` | Output sequence length. |
+| `--workers` | `2` | Comma-separated worker counts per model. |
+| `--num-models` | `1` | Number of model instances. |
+| `--speedup-ratio` | `1.0` | Mocker speedup divisor; use large values (e.g., 1000000) for near-instant mocker. |
+| `--benchmark-duration` | `60` | aiperf duration in seconds. |
+| `--num-requests` | none | Comma-separated request counts (overrides `--benchmark-duration`). |
+| `--rps` | none | Comma-separated target request rates (req/s). |
+| `--output-dir` | auto-timestamped | Output directory. |
+| `--cooldown` | `3` | Seconds between runs. |
+| `--max-consecutive-fails` | `2` | Abort sweep after N consecutive failures. |
+| `--isolation` | `fresh_per_run` | Isolation policy: `fresh_per_run` or `reuse_by_deploy_key`. |
+| `--no-report` | off | Skip per-run report generation. |
+
+### Execution control
+
+| Flag | Description |
+|------|-------------|
+| `--dry-run` | Print the sweep plan without executing any runs. |
+| `--emit-plan` | Print the sweep plan as JSON and exit (useful for Argo or MCP integration). |
+
+### K8s mode options
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--namespace` | `dynamo-bench` | Kubernetes namespace. |
+| `--endpoint` | auto-derived | Frontend endpoint (`host:port`). |
+| `--dgd-name` | none | DynamoGraphDeployment name. |
+| `--image` | none | Container image for k8s deployment. |
+| `--deploy-template` | none | Path to a DGD YAML template (enables template-based deployment). |
+| `--deploy` | off | Deploy infrastructure before sweeping. |
+| `--reset-strategy` | `graph` | Per-run reset: `none`, `frontend`, or `graph`. |
+| `--frontend-port` | `8000` | Frontend HTTP port. |
+| `--worker-replicas` | `1` | Number of worker pod replicas. |
+| `--request-plane` | `tcp` | Request plane transport. |
+| `--event-plane` | `nats` | Event plane transport. |
+| `--router-mode` | `round-robin` | Frontend router mode. |
+| `--hf-token` | none | HuggingFace token for k8s. |
+| `--image-pull-secret` | none | Image pull secret name. |
+| `--export-level` | `summary` | aiperf export level. |
+
+---
+
+## Artifact Structure
+
+Each sweep produces a timestamped output directory:
+
+```
+artifacts/sweep_20260330_143000/
+    sweep_config.json        # Full SweepConfig used for this sweep
+    results.csv              # One row per run with key metrics
+    summary.md               # Markdown summary table
+
+    mocker_hf_w2_c50_isl512/
+        aiperf/              # aiperf JSON output
+        prometheus/          # Prometheus metric snapshots
+        report.md            # Per-run analysis report (unless --no-report)
+
+    mocker_fastokens_w2_c50_isl512/
+        aiperf/
+        prometheus/
+        report.md
+    ...
+```
+
+**results.csv columns:**
+
+`run_id`, `backend`, `tokenizer`, `concurrency`, `isl`, `osl`, `workers`,
+`speedup_ratio`, `status`, `req_per_sec`, `output_tok_per_sec`,
+`ttft_p50_ms`, `ttft_p99_ms`, `itl_p50_ms`, `itl_p99_ms`, `duration_sec`,
+`run_dir`
+
+---
+
+## DGD Templates
+
+The `dgd/templates/` directory contains DynamoGraphDeployment YAML templates
+for k8s mode. Template variables (e.g., `${DGD_NAME}`, `${IMAGE}`,
+`${DYN_TOKENIZER_BACKEND}`) are substituted by the sweep runner at deploy time.
+
+| Template | Backend | GPU required | Description |
+|----------|---------|-------------|-------------|
+| `mocker.yaml` | mocker | No | Synthetic backend for isolating frontend/tokenizer overhead. |
+| `vllm.yaml` | vLLM | Yes | Real inference backend for end-to-end benchmarking. |
+| `vllm-gpt-oss-20b.yaml` | vLLM | Yes | vLLM backend for `openai/gpt-oss-20b` (TP=2, 2 GPUs per worker). |
+
+---
+
+## Analysis
+
+Post-sweep analysis scripts live in `scripts/analysis/`:
+
+| Script | Purpose |
+|--------|---------|
+| `create_report.py` | Generates a per-run observability report from aiperf JSON, Prometheus snapshots, NVTX traces, syscall profiles, and BPF data. |
+| `frontend_perf_analysis.py` | Produces scalability curves (TTFT/ITL/throughput vs. concurrency), ISL heatmaps, stage waterfall breakdowns, and regression detection. Supports single-run analysis, A/B comparison, and heatmap generation. |
+
+**Single-run report:**
+
+```bash
+python3 scripts/analysis/create_report.py analyze artifacts/sweep_*/mocker_hf_w2_c50_isl512/
+```
+
+**A/B comparison:**
+
+```bash
+python3 scripts/analysis/frontend_perf_analysis.py compare \
+    artifacts/sweep_*/mocker_hf_w2_c50_isl512/ \
+    artifacts/sweep_*/mocker_fastokens_w2_c50_isl512/
+```
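
Note on `benchmarks/frontend/README.md`: the expansion of CLI arguments into a Cartesian grid of `RunSpec`s, described in the Architecture section, can be pictured with a minimal sketch. The `RunSpec` fields, the `parse_csv_ints` and `build_plan` helpers, and the run-ID format are assumptions inferred from the example directory names (such as `mocker_hf_w2_c50_isl512`); they are not the actual code in `scripts/sweep_core/`.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunSpec:
    """One point in the sweep grid (hypothetical stand-in for the real RunSpec)."""
    backend: str
    tokenizer: str
    workers: int
    concurrency: int
    isl: int

    @property
    def run_id(self) -> str:
        # Format inferred from the README's example directories,
        # e.g. mocker_hf_w2_c50_isl512; the real naming may differ.
        return (
            f"{self.backend}_{self.tokenizer}_w{self.workers}"
            f"_c{self.concurrency}_isl{self.isl}"
        )


def parse_csv_ints(value: str) -> list[int]:
    """Turn a comma-separated flag value such as "32,64,128" into [32, 64, 128]."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_plan(backend: str, tokenizers: str, workers: str,
               concurrency: str, isl: str) -> list[RunSpec]:
    """Expand comma-separated flag values into the Cartesian product of runs."""
    return [
        RunSpec(backend, tok, w, c, i)
        for tok, w, c, i in product(
            tokenizers.split(","),
            parse_csv_ints(workers),
            parse_csv_ints(concurrency),
            parse_csv_ints(isl),
        )
    ]


plan = build_plan("mocker", "hf,fastokens", "2", "32,64,128", "512,1024,2048")
print(len(plan))       # 2 tokenizers x 1 worker count x 3 concurrencies x 3 ISLs = 18
print(plan[0].run_id)  # mocker_hf_w2_c32_isl512
```

Under these assumptions, the flags from the "Full local sweep" example produce 18 runs, which is a quick way to estimate total sweep time before launching (the README's `--dry-run` flag prints the real plan).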
--- a/benchmarks/frontend/dgd/templates/mocker.yaml
+++ b/benchmarks/frontend/dgd/templates/mocker.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Deploy template: Mocker backend (no GPUs required).
+#
+# Template variables (substituted by sweep_runner.py --deploy-template):
+#   ${DGD_NAME}              - DynamoGraphDeployment name
+#   ${IMAGE}                 - Container image
+#   ${DYN_TOKENIZER_BACKEND} - "default" (hf) or "fast"
+#   ${FRONTEND_PORT}         - Frontend HTTP port
+#   ${ROUTER_MODE}           - Frontend router mode
+#   ${MODEL_PATH}            - HF model ID
+#   ${MODEL_NAME}            - Served model name
+#   ${NUM_WORKERS}           - Mocker workers per pod
+#   ${FRONTEND_REPLICAS}     - Number of frontend pods (default: 1)
+#   ${WORKER_REPLICAS}       - Number of worker pods
+#   ${SPEEDUP_RATIO}         - Mocker speedup ratio (use large value for near-instant)
+#   ${HF_TOKEN_SECRET_NAME}  - Secret that provides the HF_TOKEN key
+#   ${FRONTEND_IMAGE_PULL_SECRETS_BLOCK}, ${WORKER_IMAGE_PULL_SECRETS_BLOCK}
+#                            - Image pull secret blocks for the frontend and worker pods
+#
+# Usage:
+#   python3 sweep_runner.py --mode k8s --deploy-template dgd/templates/mocker.yaml \
+#       --dgd-name dynamo-bench-mocker --image nvcr.io/.../image:tag \
+#       --tokenizers hf,fastokens --concurrency 50,100 --isl 512
+---
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: ${DGD_NAME}
+spec:
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: ${FRONTEND_REPLICAS}
+      extraPodSpec:
+${FRONTEND_IMAGE_PULL_SECRETS_BLOCK}
+        mainContainer:
+          image: ${IMAGE}
+          imagePullPolicy: IfNotPresent
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - python3 -m dynamo.frontend --router-mode ${ROUTER_MODE} --http-port ${FRONTEND_PORT}
+          env:
+            - name: DYN_TOKENIZER_BACKEND
+              value: "${DYN_TOKENIZER_BACKEND}"
+            - name: DYN_PERF_DIAG
+              value: "1"
+            - name: HF_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: ${HF_TOKEN_SECRET_NAME}
+                  key: HF_TOKEN
+
+    MockerWorker:
+      componentType: worker
+      replicas: ${WORKER_REPLICAS}
+      extraPodSpec:
+${WORKER_IMAGE_PULL_SECRETS_BLOCK}
+        mainContainer:
+          image: ${IMAGE}
+          imagePullPolicy: IfNotPresent
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - |
+              python3 -m dynamo.mocker \
+                --model-path "${MODEL_PATH}" \
+                --model-name "${MODEL_NAME}" \
+                --num-workers ${NUM_WORKERS} \
+                --speedup-ratio ${SPEEDUP_RATIO}
+          env:
+            - name: MODEL_PATH
+              value: "${MODEL_PATH}"
+            - name: MODEL_NAME
+              value: "${MODEL_NAME}"
+            - name: HF_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: ${HF_TOKEN_SECRET_NAME}
+                  key: HF_TOKEN
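
Note on `mocker.yaml`: the `${VAR}` placeholders use the same syntax as Python's `string.Template`, so the substitution step can be sketched as below. This is an illustration under assumptions, not the runner's actual rendering code: the `render` helper and the trimmed template fragment are hypothetical, and only the `hf` to `default` and `fastokens` to `fast` mapping comes from the template's header comment.

```python
from string import Template

# Trimmed-down fragment in the style of mocker.yaml (not the full template).
TEMPLATE_TEXT = """\
metadata:
  name: ${DGD_NAME}
spec:
  services:
    Frontend:
      replicas: ${FRONTEND_REPLICAS}
      env:
        - name: DYN_TOKENIZER_BACKEND
          value: "${DYN_TOKENIZER_BACKEND}"
"""

# Mapping taken from the template header comment: "default" (hf) or "fast".
TOKENIZER_BACKEND = {"hf": "default", "fastokens": "fast"}


def render(template_text: str, variables: dict[str, str]) -> str:
    """Substitute ${VAR} placeholders; fail loudly if a value is missing."""
    try:
        return Template(template_text).substitute(variables)
    except KeyError as exc:
        raise ValueError(f"template variable {exc} was not provided") from exc


rendered = render(
    TEMPLATE_TEXT,
    {
        "DGD_NAME": "dynamo-bench-mocker",
        "FRONTEND_REPLICAS": "1",
        "DYN_TOKENIZER_BACKEND": TOKENIZER_BACKEND["fastokens"],
    },
)
print(rendered)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: an unset variable raises an error instead of leaving a literal `${...}` in the manifest that kubectl would then apply.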
--- a/benchmarks/frontend/dgd/templates/vllm-gpt-oss-20b.yaml
+++ b/benchmarks/frontend/dgd/templates/vllm-gpt-oss-20b.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Deploy template: vLLM backend for openai/gpt-oss-20b (TP=2, 2 GPUs per worker).
+#
+# Model: openai/gpt-oss-20b (20B params, BF16/FP8, 131K context)
+#   - Architecture: gpt_oss, 24 layers, 64 attn heads, 8 KV heads
+#   - TP options: 1, 2, 4, 8 (all divide heads/kv_heads evenly)
+#   - Weight size: ~13 GB (safetensors, excluding metal/ and original/)
+#   - Recommended: TP=2 on H100 for good prefill throughput
+#
+# Template variables (substituted by sweep_runner.py --deploy-template):
+#   ${DGD_NAME}              - DynamoGraphDeployment name
+#   ${IMAGE}                 - Container image
+#   ${DYN_TOKENIZER_BACKEND} - "default" (hf) or "fast"
+#   ${FRONTEND_PORT}         - Frontend HTTP port (default: 8000)
+#   ${ROUTER_MODE}           - Frontend router mode (default: round-robin)
+#   ${MODEL}                 - Model path (HF ID or local path on PVC)
+#   ${MODEL_NAME}            - Served model name (used by aiperf --model)
+#   ${FRONTEND_REPLICAS}     - Number of frontend pods (default: 1)
+#   ${WORKER_REPLICAS}       - Number of vLLM worker pods
+#   ${HF_TOKEN_SECRET_NAME}  - Secret that provides the HF_TOKEN key
+#   ${FRONTEND_IMAGE_PULL_SECRETS_BLOCK}, ${WORKER_IMAGE_PULL_SECRETS_BLOCK}
+#                            - Image pull secret blocks for the frontend and worker pods
+#
+# Prerequisites:
+#   - Model downloaded to model-cache PVC (excluding metal/ and original/):
+#       huggingface-cli download openai/gpt-oss-20b \
+#           --exclude "metal/*" --exclude "original/*" \
+#           --local-dir /model-store/hub/models--openai--gpt-oss-20b/snapshots/main
+#   - hf-token-secret in the target namespace
+#   - model-cache PVC (>= 100Gi) in the target namespace
+#   - GPU nodes (worker pods tolerate the nvidia.com/gpu NoSchedule taint)
+#
+# Usage:
+#   python3 sweep_runner.py --mode k8s \
+#       --deploy-template dgd/templates/vllm-gpt-oss-20b.yaml \
+#       --dgd-name dynamo-bench-vllm \
+#       --model /model-store/hub/models--openai--gpt-oss-20b/snapshots/main \
+#       --image nvcr.io/.../vllm-runtime:tag \
+#       --tokenizers hf,fastokens --concurrency 20 --isl 8192
+---
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: ${DGD_NAME}
+spec:
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: ${FRONTEND_REPLICAS}
+      extraPodSpec:
+${FRONTEND_IMAGE_PULL_SECRETS_BLOCK}
+        mainContainer:
+          image: ${IMAGE}
+          imagePullPolicy: IfNotPresent
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - python3 -m dynamo.frontend --router-mode ${ROUTER_MODE} --http-port ${FRONTEND_PORT}
+          env:
+            - name: DYN_TOKENIZER_BACKEND
+              value: "${DYN_TOKENIZER_BACKEND}"
+            - name: DYN_PERF_DIAG
+              value: "1"
+            - name: HF_HOME
+              value: /model-store
+            - name: HF_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: ${HF_TOKEN_SECRET_NAME}
+                  key: HF_TOKEN
+          volumeMounts:
+            - name: model-cache
+              mountPath: /model-store
+        volumes:
+          - name: model-cache
+            persistentVolumeClaim:
+              claimName: model-cache
+
+    VllmWorker:
+      componentType: worker
+      replicas: ${WORKER_REPLICAS}
+      extraPodSpec:
+${WORKER_IMAGE_PULL_SECRETS_BLOCK}
+        tolerations:
+          - effect: NoSchedule
+            key: nvidia.com/gpu
+            operator: Exists
+        mainContainer:
+          image: ${IMAGE}
+          imagePullPolicy: IfNotPresent
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - >-
+              python3 -m dynamo.vllm
+              --model ${MODEL}
+              --served-model-name ${MODEL_NAME}
+              --tensor-parallel-size 2
+              --max-model-len 65536
+              --max-num-batched-tokens 32768
+              --gpu-memory-utilization 0.90
+          env:
+            - name: HF_HOME
+              value: /model-store
+            - name: HF_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: ${HF_TOKEN_SECRET_NAME}
+                  key: HF_TOKEN
+          resources:
+            limits:
+              nvidia.com/gpu: "2"
+          volumeMounts:
+            - name: model-cache
+              mountPath: /model-store
+        volumes:
+          - name: model-cache
+            persistentVolumeClaim:
+              claimName: model-cache
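
Note on `vllm-gpt-oss-20b.yaml`: the header comment states that TP sizes 1, 2, 4, and 8 all divide the attention and KV head counts evenly, and the template requests 2 GPUs per worker for TP=2. The arithmetic can be checked with a short sketch; the function names are hypothetical, and the divisibility test encodes only the criterion stated in the comment, not every constraint vLLM may apply.

```python
def valid_tp_sizes(attn_heads: int, kv_heads: int, candidates=(1, 2, 4, 8)) -> list[int]:
    """Return the tensor-parallel sizes that split both head counts evenly."""
    return [tp for tp in candidates if attn_heads % tp == 0 and kv_heads % tp == 0]


def total_gpus(worker_replicas: int, tensor_parallel_size: int) -> int:
    """Each worker pod requests one GPU per tensor-parallel rank."""
    return worker_replicas * tensor_parallel_size


# Head counts from the template comment: 64 attention heads, 8 KV heads.
print(valid_tp_sizes(attn_heads=64, kv_heads=8))              # [1, 2, 4, 8]
# Four TP=2 workers, as in the scaling-test example deployment.
print(total_gpus(worker_replicas=4, tensor_parallel_size=2))  # 8
```

The second figure is worth checking against cluster capacity before setting `--worker-replicas`, since the worker pods stay pending if the GPUs are not available.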
--- a/benchmarks/frontend/dgd/templates/vllm.yaml
+++ b/benchmarks/frontend/dgd/templates/vllm.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Deploy template: vLLM backend (requires GPUs).
+#
+# Template variables (substituted by sweep_runner.py --deploy-template):
+#   ${DGD_NAME}              - DynamoGraphDeployment name
+#   ${IMAGE}                 - Container image
+#   ${DYN_TOKENIZER_BACKEND} - "default" (hf) or "fast"
+#   ${FRONTEND_PORT}         - Frontend HTTP port
+#   ${ROUTER_MODE}           - Frontend router mode
+#   ${MODEL}                 - HF model ID
+#   ${MODEL_NAME}            - Served model name
+#   ${FRONTEND_REPLICAS}     - Number of frontend pods (default: 1)
+#   ${WORKER_REPLICAS}       - Number of vLLM worker pods
+#   ${HF_TOKEN_SECRET_NAME}  - Secret that provides the HF_TOKEN key
+#   ${FRONTEND_IMAGE_PULL_SECRETS_BLOCK}, ${WORKER_IMAGE_PULL_SECRETS_BLOCK}
+#                            - Image pull secret blocks for the frontend and worker pods
+#
+# Usage:
+#   python3 sweep_runner.py --mode k8s --deploy-template dgd/templates/vllm.yaml \
+#       --dgd-name dynamo-bench-vllm --image nvcr.io/.../vllm-runtime:tag \
+#       --model meta-llama/Llama-3.1-8B-Instruct \
+#       --tokenizers hf --concurrency 128 --isl 1024
+---
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: ${DGD_NAME}
+spec:
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: ${FRONTEND_REPLICAS}
+      extraPodSpec:
+${FRONTEND_IMAGE_PULL_SECRETS_BLOCK}
+        mainContainer:
+          image: ${IMAGE}
+          imagePullPolicy: IfNotPresent
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - python3 -m dynamo.frontend --router-mode ${ROUTER_MODE} --http-port ${FRONTEND_PORT}
+          env:
+            - name: DYN_TOKENIZER_BACKEND
+              value: "${DYN_TOKENIZER_BACKEND}"
+            - name: DYN_PERF_DIAG
+              value: "1"
+            - name: HF_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: ${HF_TOKEN_SECRET_NAME}
+                  key: HF_TOKEN
+
+    VllmWorker:
+      componentType: worker
+      replicas: ${WORKER_REPLICAS}
+      extraPodSpec:
+${WORKER_IMAGE_PULL_SECRETS_BLOCK}
+        mainContainer:
+          image: ${IMAGE}
+          imagePullPolicy: IfNotPresent
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - python3 -m dynamo.vllm --model ${MODEL} --tensor-parallel-size 1
+          env:
+            - name: HF_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: ${HF_TOKEN_SECRET_NAME}
+                  key: HF_TOKEN
+          resources:
+            limits:
+              nvidia.com/gpu: "1"
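
Note on the `${FRONTEND_IMAGE_PULL_SECRETS_BLOCK}` and `${WORKER_IMAGE_PULL_SECRETS_BLOCK}` placeholders: they sit on their own line at column 0 directly under `extraPodSpec:`, which suggests they expand to a whole pre-indented YAML block. One way this could work is sketched below. The helper name, its signature, and the empty-string behavior are assumptions; the block shape mirrors the `imagePullSecrets` entry shown in the scaling-test manifest.

```python
from typing import Optional


def image_pull_secrets_block(secret_name: Optional[str], indent: int = 8) -> str:
    """Return an indented imagePullSecrets fragment, or "" when no secret is set."""
    if not secret_name:
        # An empty expansion leaves a blank line in the manifest, which YAML ignores.
        return ""
    pad = " " * indent
    return f"{pad}imagePullSecrets:\n{pad}  - name: {secret_name}"


print(image_pull_secrets_block("nvcrimagepullsecret"))
print(repr(image_pull_secrets_block(None)))  # ''
```

The default indent of 8 spaces matches the depth of `mainContainer:` in these templates, so the rendered block becomes a sibling key under `extraPodSpec`.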
--- a/benchmarks/frontend/scripts/README.md
+++ b/benchmarks/frontend/scripts/README.md
@@ -11,11 +11,11 @@ source dynamo/bin/activate
 # Single run (mocker + frontend + aiperf + Prometheus)
 cd benchmarks/frontend/scripts
 ./run_perf.sh --model Qwen/Qwen3-0.6B --concurrency 32 --num-requests 640 \
-    --speedup-ratio 0 --skip-bpf --skip-nsys --skip-flamegraph --skip-perf
+    --speedup-ratio 1000000 --skip-bpf --skip-nsys --skip-flamegraph --skip-perf

 # Sweep (multiple config points)
 python3 sweep_runner.py --tokenizers hf --concurrency 32 --isl 512 \
-    --benchmark-duration 30 --speedup-ratio 0 \
+    --benchmark-duration 30 --speedup-ratio 1000000 \
    -- --skip-bpf --skip-nsys --skip-flamegraph --skip-perf
 ```

@@ -132,17 +132,17 @@ The main entry point for running performance sweeps. Iterates over a grid of con
 ```bash
 # Smoke test (1 run)
 python3 sweep_runner.py --tokenizers hf --concurrency 32 --isl 512 \
-    --benchmark-duration 30 --speedup-ratio 0 \
+    --benchmark-duration 30 --speedup-ratio 1000000 \
    -- --skip-bpf --skip-nsys --skip-flamegraph --skip-perf

 # Full tokenizer comparison
 python3 sweep_runner.py --tokenizers hf,fastokens \
    --concurrency 32,64 --isl 512,1024,2048 \
-    --benchmark-duration 60 --speedup-ratio 0
+    --benchmark-duration 60 --speedup-ratio 1000000

 # Transport saturation (vary workers and request count)
 python3 sweep_runner.py --tokenizers hf --concurrency 4096 \
-    --num-requests 16384,32768 --workers 1,2,4,8 --speedup-ratio 0
+    --num-requests 16384,32768 --workers 1,2,4,8 --speedup-ratio 1000000

 # Preview sweep plan without running
 python3 sweep_runner.py --dry-run --tokenizers hf,fastokens \
@@ -168,7 +168,7 @@ for m in 1 2 3 4; do
        --num-models $m \
        --rps 75 \
        --benchmark-duration 60 \
-        --speedup-ratio 0 \
+        --speedup-ratio 1000000 \
        --output-dir artifacts/sweep_models/m${m} \
        -- --skip-bpf
 done
@@ -195,7 +195,7 @@ python3 sweep_runner.py \
    --num-models 1 \
    --rps 75 \
    --benchmark-duration 60 \
-    --speedup-ratio 0 \
+    --speedup-ratio 1000000 \
    --output-dir artifacts/sweep_workers \
    -- --skip-bpf
 ```
@@ -214,7 +214,7 @@ python3 sweep_runner.py \
    --num-models 2 \
    --rps 50 \
    --benchmark-duration 60 \
-    --speedup-ratio 0 \
+    --speedup-ratio 1000000 \
    --output-dir artifacts/sweep_grid \
    -- --skip-bpf
 ```
@@ -237,15 +237,15 @@ python3 sweep_runner.py \
 ```bash
 # With perf stat + flamegraphs (no root needed)
 python3 sweep_runner.py --tokenizers hf --concurrency 64 --isl 1024 \
-    --benchmark-duration 60 --speedup-ratio 0
+    --benchmark-duration 60 --speedup-ratio 1000000

 # With everything including BPF (needs sudo)
 sudo -E python3 sweep_runner.py --tokenizers hf --concurrency 64 --isl 1024 \
-    --benchmark-duration 60 --speedup-ratio 0
+    --benchmark-duration 60 --speedup-ratio 1000000

 # nsys profiling (needs nsys in PATH)
 python3 sweep_runner.py --tokenizers hf --concurrency 64 --isl 1024 \
-    --benchmark-duration 60 --speedup-ratio 0 \
+    --benchmark-duration 60 --speedup-ratio 1000000 \
    -- --nsys-path /opt/nvidia/nsight-systems/bin/nsys
 ```

@@ -272,7 +272,7 @@ Profiler controls are passed through to run_perf.sh after `--`:
 | `--num-models` | `1` | Number of model instances (each gets `--workers` workers) |
 | `--rps` | - | Comma-separated target request rates (req/s) |
 | `--aiperf-targets` | `first` | `first`: model-1 only. `all`: run aiperf for each model |
-| `--speedup-ratio` | `1.0` | Mocker speedup (0 = infinite) |
+| `--speedup-ratio` | `1.0` | Mocker speedup divisor; use large values (e.g., 1000000) for near-instant mocker |
 | `--benchmark-duration` | `60` | aiperf run duration (seconds) |
 | `--num-requests` | - | Comma-separated request counts (overrides duration) |
 | `--output-dir` | auto | Output directory |
@@ -288,21 +288,21 @@ Low-level per-run harness. Normally called by sweep_runner.py, but can be used d
 ```bash
 # Minimal (no profilers)
 ./run_perf.sh --model Qwen/Qwen3-0.6B --concurrency 32 --num-requests 640 \
-    --speedup-ratio 0 --skip-bpf --skip-nsys --skip-flamegraph --skip-perf
+    --speedup-ratio 1000000 --skip-bpf --skip-nsys --skip-flamegraph --skip-perf

 # Full observability (needs sudo for BPF)
 sudo -E ./run_perf.sh --model Qwen/Qwen3-0.6B --concurrency 64 \
-    --benchmark-duration 60 --speedup-ratio 0
+    --benchmark-duration 60 --speedup-ratio 1000000

 # Multi-model with 2 workers each
 ./run_perf.sh --model Qwen/Qwen3-0.6B --num-models 2 --workers 2 \
-    --concurrency 32 --benchmark-duration 30 --speedup-ratio 0 \
+    --concurrency 32 --benchmark-duration 30 --speedup-ratio 1000000 \
    --skip-bpf --skip-nsys --skip-flamegraph --skip-perf

 # 4 models, 1 worker each, rate-limited to 75 rps
 ./run_perf.sh --model Qwen/Qwen3-0.6B --num-models 4 --workers 1 \
    --concurrency 512 --benchmark-duration 60 --request-rate 75 \
-    --speedup-ratio 0 --skip-bpf
+    --speedup-ratio 1000000 --skip-bpf
 ```

 ## Analyzing Results

--- a/benchmarks/frontend/scripts/run_perf.sh
+++ b/benchmarks/frontend/scripts/run_perf.sh
@@ -121,7 +121,7 @@ Service Options:
  --model PATH              Model path (default: nvidia/Llama-3.1-8B-Instruct-FP8)
  --model-name NAME         Served model name (default: same as --model)
  --workers N               Number of mocker workers (default: 2)
-  --speedup-ratio RATIO     Mocker speedup ratio (default: 1.0; 0 = infinite)
+  --speedup-ratio RATIO     Mocker speedup ratio (default: 1.0; use large value for near-instant)
  --data-parallel-size N    Mocker DP workers (default: 1)
  --request-plane PLANE     nats|http|tcp (default: tcp)
  --event-plane PLANE       nats|zmq (default: nats)

--- a/benchmarks/frontend/scripts/scaling-test.md
+++ b/benchmarks/frontend/scripts/scaling-test.md
+<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
+<!-- SPDX-License-Identifier: Apache-2.0 -->
+
+# Frontend Scaling Test: Finding the Saturation Point
+
+This guide walks through using the sweep runner to find the saturation point of
+a Dynamo frontend serving a real vLLM backend.  The saturation point is the
+request rate at which latency begins to degrade -- prefill requests start
+queuing instead of being served immediately, TTFT p99 spikes, and throughput
+plateaus.
+
+---
+
+## Overview
+
+The test sweeps increasing request rates (`--rps`) at a fixed input sequence
+length while keeping the backend warm (`--reset-strategy frontend`).  Each data
+point is a 60-second aiperf run at a controlled RPS.  The sweep aborts
+automatically once `--max-consecutive-fails` runs have failed in a row.
+
+**What you get:**
+
+- Per-RPS throughput (actual req/s vs target), TTFT p50/p99, ITL p50/p99
+- Prometheus pre/post metrics for pipeline stage breakdown
+- CSV + summary for easy comparison
+
+---
+
+## Prerequisites
+
+1. **K8s namespace** with:
+   - `hf-token-secret` (HuggingFace token)
+   - `nvcrimagepullsecret` (image pull credentials)
+   - `model-cache` PVC (RWX, large enough for model weights)
+   - Model weights downloaded to PVC (see "Model Download" below)
+
+2. **DGD deployed** with the target model and backend.
+
+3. **sweep_runner.py** accessible from a machine with `kubectl` access to the
+   cluster.
+
+---
+
+## Model Download (gpt-oss-20b example)
+
+Download the model to the PVC, excluding large non-inference directories:
+
+```bash
+# Create a download Job (adjust image and namespace)
+kubectl apply -n <namespace> -f - <<'EOF'
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: model-download-gpt-oss-20b
+spec:
+  backoffLimit: 2
+  template:
+    spec:
+      restartPolicy: Never
+      imagePullSecrets:
+        - name: nvcrimagepullsecret
+      containers:
+        - name: download
+          image: nvcr.io/nvidian/dynamo-dev/biswa:vllm-runtime-1a8bce12ea
+          command: ["python3", "-c"]
+          args:
+            - |
+              import os, subprocess, sys
+              model = "openai/gpt-oss-20b"
+              os.environ["HF_HOME"] = "/model-store"
+              cmd = ["huggingface-cli", "download", model,
+                     "--exclude", "metal/*", "--exclude", "original/*",
+                     "--local-dir", "/model-store/hub/models--openai--gpt-oss-20b/snapshots/main"]
+              sys.exit(subprocess.run(cmd).returncode)
+          env:
+            - name: HF_HOME
+              value: /model-store
+            - name: HF_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: hf-token-secret
+                  key: HF_TOKEN
+          volumeMounts:
+            - name: model-cache
+              mountPath: /model-store
+      volumes:
+        - name: model-cache
+          persistentVolumeClaim:
+            claimName: model-cache
+EOF
+
+# Monitor
+kubectl logs -n <namespace> -l job-name=model-download-gpt-oss-20b -f
+```
+
+---
+
+## Deploy the DGD
+
+A template for gpt-oss-20b with TP=2 is provided; the manifest below applies a similar configuration directly with kubectl:
+
+```bash
+# Template path (relative to repo root)
+# benchmarks/frontend/dgd/templates/vllm-gpt-oss-20b.yaml
+#
+# Key settings in the template:
+#   - tensor-parallel-size 2 (2 GPUs per worker)
+#   - max-model-len 65536
+#   - gpu-memory-utilization 0.90
+#   - GPU toleration for scheduling
+
+# Deploy directly (adjust values as needed):
+kubectl apply -n <namespace> -f - <<'EOF'
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: gpt-oss-20b-bench
+spec:
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        imagePullSecrets:
+          - name: nvcrimagepullsecret
+        mainContainer:
+          image: <your-image>
+          command: ["/bin/sh", "-c"]
+          args: ["python3 -m dynamo.frontend --router-mode round-robin --http-port 8000"]
+          env:
+            - name: DYN_TOKENIZER_BACKEND
+              value: "default"
+            - name: DYN_PERF_DIAG
+              value: "1"
+            - name: HF_HOME
+              value: /model-store
+            - name: HF_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: hf-token-secret
+                  key: HF_TOKEN
+          volumeMounts:
+            - name: model-cache
+              mountPath: /model-store
+        volumes:
+          - name: model-cache
+            persistentVolumeClaim:
+              claimName: model-cache
+
+    VllmWorker:
+      componentType: worker
+      replicas: 4                    # <-- number of backend replicas
+      extraPodSpec:
+        imagePullSecrets:
+          - name: nvcrimagepullsecret
+        tolerations:
+          - effect: NoSchedule
+            key: nvidia.com/gpu
+            operator: Exists
+        mainContainer:
+          image: <your-image>
+          command: ["/bin/sh", "-c"]
+          args:
+            - >-
+              python3 -m dynamo.vllm
+              --model /model-store/hub/models--openai--gpt-oss-20b/snapshots/main
+              --served-model-name openai/gpt-oss-20b
+              --tensor-parallel-size 2
+              --max-model-len 65536
+              --gpu-memory-utilization 0.90
+          env:
+            - name: HF_HOME
+              value: /model-store
+            - name: HF_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: hf-token-secret
+                  key: HF_TOKEN
+          resources:
+            limits:
+              nvidia.com/gpu: "2"    # <-- 2 GPUs for TP=2
+          volumeMounts:
+            - name: model-cache
+              mountPath: /model-store
+        volumes:
+          - name: model-cache
+            persistentVolumeClaim:
+              claimName: model-cache
+EOF
+
+# Wait for all pods to be ready
+kubectl get pods -n <namespace> -w
+```
+
+---
+
+## Run the Saturation Sweep
+
+### Baseline: HF tokenizer, RPS sweep
+
+```bash
+cd benchmarks/frontend/scripts
+
+python3 sweep_runner.py --mode k8s \
+    --dgd-name gpt-oss-20b-bench \
+    --namespace <namespace> \
+    --endpoint gpt-oss-20b-bench-frontend:8000 \
+    --model openai/gpt-oss-20b \
+    --backend vllm \
+    --image <your-image> \
+    --tokenizers hf \
+    --concurrency 200 \
+    --rps 10,20,30,40,50,60,70,80,90,100 \
+    --isl 6144 \
+    --osl 256 \
+    --benchmark-duration 60 \
+    --reset-strategy frontend \
+    --isolation reuse_by_deploy_key \
+    --worker-replicas 4 \
+    --max-consecutive-fails 2
+```
+
+**Flag explanations:**
+
+| Flag | Value | Purpose |
+|------|-------|---------|
+| `--rps 10,20,...,100` | Sweep dimension | Each run targets a fixed request rate. aiperf uses `--request-rate` to cap submission. |
+| `--concurrency 200` | High ceiling | Maximum in-flight requests. Set high so aiperf can sustain the target RPS without being limited by available connection slots. This is NOT a sweep dimension. |
+| `--isl 6144` | Fixed ISL | Holds input length constant to isolate throughput scaling. |
+| `--osl 256` | Fixed OSL | Consistent output length across all runs. |
+| `--benchmark-duration 60` | 60s per point | Long enough for vLLM scheduling to stabilize. |
+| `--reset-strategy frontend` | Frontend-only | Resets Prometheus counters between runs, but keeps vLLM workers alive with warm KV caches and CUDA graphs. Avoids the ~90s full DGD restart per point. |
+| `--isolation reuse_by_deploy_key` | Reuse deployment | Since tokenizer=hf is constant, no DGD restart between runs. Only a frontend pod restart for clean metrics. |
+| `--max-consecutive-fails 2` | Auto-stop | After 2 consecutive failed runs for the same backend, concurrency, and worker count, the remaining runs for that combination (here, the higher RPS values) are skipped. |
+
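The `--concurrency` ceiling can be sanity-checked with Little's law: the average number of in-flight requests is roughly the arrival rate multiplied by the average end-to-end latency. The sketch below is only an estimate under stated assumptions; `min_concurrency` is a hypothetical helper, and the latency figure is a value you supply, not one measured by the sweep.

```python
import math


def min_concurrency(target_rps: float, expected_latency_s: float, headroom: float = 1.0) -> int:
    """Little's law estimate of the in-flight requests needed to sustain target_rps."""
    if target_rps <= 0 or expected_latency_s <= 0:
        raise ValueError("target_rps and expected_latency_s must be positive")
    return math.ceil(target_rps * expected_latency_s * headroom)


# Assumed figure for illustration only: 2 s average request latency at the top rate.
print(min_concurrency(100, 2.0))
```

If latency near saturation is higher than the assumed value, the ceiling needs to rise accordingly, otherwise aiperf is limited by connection slots rather than by the target rate.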
+### Follow-up: fastokens comparison
+
+Once you have the baseline, run the same sweep with fastokens to see if the
+saturation point shifts:
+
+```bash
+python3 sweep_runner.py --mode k8s \
+    --dgd-name gpt-oss-20b-bench \
+    --namespace <namespace> \
+    --endpoint gpt-oss-20b-bench-frontend:8000 \
+    --model openai/gpt-oss-20b \
+    --backend vllm \
+    --image <your-image> \
+    --tokenizers fastokens \
+    --concurrency 200 \
+    --rps 10,20,30,40,50,60,70,80,90,100 \
+    --isl 6144 \
+    --osl 256 \
+    --benchmark-duration 60 \
+    --reset-strategy frontend \
+    --isolation reuse_by_deploy_key \
+    --worker-replicas 4 \
+    --max-consecutive-fails 2
+```
+
+### Fine-grained sweep around the inflection
+
+If the baseline shows saturation between, say, RPS=40 and RPS=60, rerun with a tighter RPS grid around that range:
+
+```bash
+python3 sweep_runner.py --mode k8s \
+    ... \
+    --rps 35,40,45,50,55,60 \
+    --reset-strategy frontend \
+    --isolation reuse_by_deploy_key
+```
+
+---
+
+## Reading the Results
+
+The sweep produces `results.csv` and `summary.md` in the output directory.
+
+### Identifying the saturation point
+
+Look for these signals in the CSV. The table below uses illustrative values; `Actual Req/s` corresponds to the `req_per_sec` column:
+
+| RPS | Actual Req/s | TTFT p50 | TTFT p99 | ITL p99 | Status |
+|----:|-----------:|--------:|--------:|-------:|--------|
+| 10 | 10.0 | 800ms | 1200ms | 30ms | ok |
+| 20 | 19.8 | 850ms | 1400ms | 32ms | ok |
+| 30 | 29.5 | 900ms | 2000ms | 35ms | ok |
+| 40 | 38.0 | 1200ms | 5000ms | 45ms | ok -- onset |
+| 50 | 42.0 | 3000ms | 15000ms | 80ms | ok -- saturated |
+| 60 | 41.5 | 8000ms | 30000ms | 120ms | ok -- overloaded |
+| 70 | -- | -- | -- | -- | fail |
+
+**Saturation indicators:**
+
+1. **Actual req/s < target RPS**: The system cannot sustain the requested rate.
+   At RPS=50, only 42 req/s are achieved.
+2. **TTFT p99 spike**: A sharp increase (e.g., 2x-5x) means prefill requests
+   are queuing behind each other.
+3. **ITL p99 degradation**: Decode throughput drops because the vLLM scheduler
+   is overloaded with concurrent prefills.
+4. **Errors/failures**: Timeouts, OOM, or vLLM rejecting requests.
+
+The **saturation point** in the example above is **RPS ~40** -- the last rate
+where actual throughput tracks the target and TTFT p99 is still reasonable.
+
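The first indicator (actual throughput falling behind the target) can be checked mechanically. This is a minimal sketch, not part of the sweep tooling: `find_saturation_rps` and the 0.9 tracking threshold are assumptions chosen for illustration, and the rows are the example values from the table above. TTFT and ITL checks are deliberately left out.

```python
from typing import List, Optional, Tuple

# (target_rps, actual_req_per_sec) pairs copied from the example table.
# None marks the failed run at RPS=70.
EXAMPLE_ROWS: List[Tuple[int, Optional[float]]] = [
    (10, 10.0), (20, 19.8), (30, 29.5), (40, 38.0),
    (50, 42.0), (60, 41.5), (70, None),
]


def find_saturation_rps(
    rows: List[Tuple[int, Optional[float]]],
    min_tracking_ratio: float = 0.9,  # assumed threshold, tune to taste
) -> Optional[int]:
    """Return the highest target rate reached before throughput stops tracking it."""
    last_ok: Optional[int] = None
    for target, actual in sorted(rows, key=lambda row: row[0]):
        if actual is None or actual / target < min_tracking_ratio:
            break  # first failed or lagging point ends the search
        last_ok = target
    return last_ok


print(find_saturation_rps(EXAMPLE_ROWS))  # prints 40
```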
+### Prometheus metrics
+
+Each run captures `frontend_metrics_pre.txt` and `frontend_metrics_post.txt`.
+Key metrics for saturation analysis:
+
+- `dynamo_frontend_stage_duration_seconds{stage="preprocess"}` -- tokenization time
+- `dynamo_frontend_stage_duration_seconds{stage="transport_roundtrip"}` -- backend latency
+- `dynamo_frontend_queued_requests` -- requests waiting in HTTP queue (should be 0 below saturation)
+- `dynamo_frontend_inflight_requests` -- concurrent in-flight requests
+- `dynamo_frontend_time_to_first_token_seconds` -- TTFT histogram buckets
+
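One way to use the pre/post snapshots is to subtract them and derive a mean stage duration for the run. The sketch below assumes the stage metric is exposed as a Prometheus histogram with `_sum` and `_count` series that carry only a `stage` label and no trailing timestamps; those details are assumptions, and the sample numbers are invented for illustration. `parse_metrics` and `mean_stage_seconds` are hypothetical helper names.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition lines into {"name{labels}": value}."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        key, _, value = line.rpartition(" ")
        try:
            samples[key] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def mean_stage_seconds(pre: Dict[str, float], post: Dict[str, float], stage: str) -> float:
    """Mean duration of one stage over the run, from histogram sum/count deltas."""
    base = "dynamo_frontend_stage_duration_seconds"
    labels = f'{{stage="{stage}"}}'
    d_sum = post[f"{base}_sum{labels}"] - pre.get(f"{base}_sum{labels}", 0.0)
    d_count = post[f"{base}_count{labels}"] - pre.get(f"{base}_count{labels}", 0.0)
    return d_sum / d_count if d_count else 0.0


# Invented sample snapshots (not real measurements).
PRE = (
    'dynamo_frontend_stage_duration_seconds_sum{stage="preprocess"} 1.0\n'
    'dynamo_frontend_stage_duration_seconds_count{stage="preprocess"} 10\n'
)
POST = (
    'dynamo_frontend_stage_duration_seconds_sum{stage="preprocess"} 4.0\n'
    'dynamo_frontend_stage_duration_seconds_count{stage="preprocess"} 40\n'
)

print(mean_stage_seconds(parse_metrics(PRE), parse_metrics(POST), "preprocess"))
```

In practice the two strings would be read from `frontend_metrics_pre.txt` and `frontend_metrics_post.txt` in the run directory.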
+---
+
+## DGD Template Reference
+
+The `dgd/templates/vllm-gpt-oss-20b.yaml` template is pre-configured for
+gpt-oss-20b with TP=2.  To use it with `--deploy-template`:
+
+```bash
+python3 sweep_runner.py --mode k8s \
+    --deploy-template benchmarks/frontend/dgd/templates/vllm-gpt-oss-20b.yaml \
+    --dgd-name gpt-oss-20b-bench \
+    --model /model-store/hub/models--openai--gpt-oss-20b/snapshots/main \
+    --image <your-image> \
+    --worker-replicas 4 \
+    ...
+```
+
+The template substitutes these variables at deploy time:
+`${DGD_NAME}`, `${IMAGE}`, `${MODEL}`, `${MODEL_NAME}`,
+`${WORKER_REPLICAS}`, `${DYN_TOKENIZER_BACKEND}`, `${FRONTEND_PORT}`,
+`${ROUTER_MODE}`.
+
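The `${NAME}` placeholder syntax is the same one Python's standard `string.Template` understands, so the substitution step can be sketched as below. This is not the renderer shipped in `sweep_k8s`; the `render` helper and the small YAML fragment are hypothetical and stand in for the real template.

```python
from string import Template

# Hypothetical fragment; the real template contains many more fields.
FRAGMENT = """\
metadata:
  name: ${DGD_NAME}
spec:
  replicas: ${WORKER_REPLICAS}
"""


def render(template_text: str, values: dict) -> str:
    """Fill ${VAR} placeholders; raises KeyError if a variable is missing."""
    return Template(template_text).substitute(values)


rendered = render(FRAGMENT, {"DGD_NAME": "gpt-oss-20b-bench", "WORKER_REPLICAS": "4"})
print(rendered)
```

Using `substitute` rather than `safe_substitute` makes a forgotten variable fail loudly instead of leaving a literal `${...}` in the manifest.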
+---
+
+## Tuning Parameters
+
+| Parameter | Recommended Range | Notes |
+|-----------|-------------------|-------|
+| `--benchmark-duration` | 60-120s | Longer = more stable averages but slower sweep |
+| `--concurrency` | 2-4x max target RPS | Must be high enough that aiperf can reach the target rate |
+| `--rps` | Start at 10, double until failures | Geometric progression finds the order of magnitude fast |
+| `--worker-replicas` | 1-8 | More replicas = higher saturation point but more GPUs |
+| `--reset-strategy` | `frontend` for saturation tests | `graph` for clean-baseline TTFT measurements |
+| `--isolation` | `reuse_by_deploy_key` for same-tokenizer sweeps | Avoids unnecessary DGD restarts |
+| `--max-consecutive-fails` | 2-3 | Higher = more data points at the failure boundary |
--- a/benchmarks/frontend/scripts/sweep_core/__init__.py
+++ b/benchmarks/frontend/scripts/sweep_core/__init__.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""sweep_core -- pure-logic library for frontend performance sweeps."""
+
+from sweep_core.models import (
+    AiperfDimension,
+    DeployDimension,
+    DeployKey,
+    IsolationPolicy,
+    RunResult,
+    RunSpec,
+    SweepConfig,
+    SweepPlan,
+)
+
+__all__ = [
+    "AiperfDimension",
+    "DeployDimension",
+    "DeployKey",
+    "IsolationPolicy",
+    "RunResult",
+    "RunSpec",
+    "SweepConfig",
+    "SweepPlan",
+]
--- a/benchmarks/frontend/scripts/sweep_core/artifacts.py
+++ b/benchmarks/frontend/scripts/sweep_core/artifacts.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""
+Artifact writers for sweep results.
+
+Produces CSV, markdown summary, and sweep_config.json -- the contract
+consumed by downstream analysis tools (analyze_sweep.py, sweep_data.py).
+"""
+
+from __future__ import annotations
+
+import csv
+import json
+import time
+from pathlib import Path
+from typing import List
+
+from sweep_core.models import RunResult, SweepConfig
+
+
+def write_csv(results: List[RunResult], csv_path: Path, config: SweepConfig) -> None:
+    """Rewrite the full CSV results file from ``results`` (called after each run, so the file grows incrementally)."""
+    fieldnames = [
+        "run_id",
+        "backend",
+        "tokenizer",
+        "concurrency",
+        "isl",
+        "osl",
+        "workers",
+        "speedup_ratio",
+        "status",
+        "req_per_sec",
+        "output_tok_per_sec",
+        "ttft_p50_ms",
+        "ttft_p99_ms",
+        "itl_p50_ms",
+        "itl_p99_ms",
+        "duration_sec",
+        "run_dir",
+    ]
+    with open(csv_path, "w", newline="") as f:
+        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
+        writer.writeheader()
+        for r in results:
+            spec = r.run_spec
+            row = {
+                "run_id": spec.run_id,
+                "backend": spec.deploy.backend,
+                "tokenizer": spec.deploy.tokenizer,
+                "concurrency": spec.aiperf.concurrency,
+                "isl": spec.aiperf.isl,
+                "osl": spec.aiperf.osl,
+                "workers": spec.deploy.workers,
+                "speedup_ratio": config.speedup_ratio,
+                "status": r.status,
+                "req_per_sec": f"{r.req_per_sec:.2f}"
+                if r.req_per_sec is not None
+                else "",
+                "output_tok_per_sec": f"{r.output_tok_per_sec:.1f}"
+                if r.output_tok_per_sec is not None
+                else "",
+                "ttft_p50_ms": f"{r.ttft_p50_ms:.1f}"
+                if r.ttft_p50_ms is not None
+                else "",
+                "ttft_p99_ms": f"{r.ttft_p99_ms:.1f}"
+                if r.ttft_p99_ms is not None
+                else "",
+                "itl_p50_ms": f"{r.itl_p50_ms:.1f}" if r.itl_p50_ms is not None else "",
+                "itl_p99_ms": f"{r.itl_p99_ms:.1f}" if r.itl_p99_ms is not None else "",
+                "duration_sec": f"{r.duration_sec:.1f}"
+                if r.duration_sec is not None
+                else "",
+                "run_dir": r.run_dir,
+            }
+            writer.writerow(row)
+
+
+def write_summary(results: List[RunResult], summary_path: Path) -> None:
+    """Write markdown summary table."""
+    lines = ["# Sweep Summary\n"]
+    lines.append(f"**Generated:** {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
+    lines.append(
+        "| Run ID | Req/s | Tok/s | TTFT p50 | TTFT p99 | ITL p50 | Duration | Status |"
+    )
+    lines.append(
+        "|--------|------:|------:|---------:|---------:|--------:|---------:|--------|"
+    )
+
+    for r in results:
+        rps = f"{r.req_per_sec:.1f}" if r.req_per_sec is not None else "-"
+        tps = f"{r.output_tok_per_sec:.0f}" if r.output_tok_per_sec is not None else "-"
+        tp50 = f"{r.ttft_p50_ms:.1f}ms" if r.ttft_p50_ms is not None else "-"
+        tp99 = f"{r.ttft_p99_ms:.1f}ms" if r.ttft_p99_ms is not None else "-"
+        ip50 = f"{r.itl_p50_ms:.1f}ms" if r.itl_p50_ms is not None else "-"
+        dur = f"{r.duration_sec:.0f}s" if r.duration_sec is not None else "-"
+        lines.append(
+            f"| {r.run_spec.run_id} | {rps} | {tps} | {tp50} | {tp99} | {ip50} | {dur} | {r.status} |"
+        )
+
+    lines.append("")
+    ok = sum(1 for r in results if r.status == "ok")
+    fail = sum(1 for r in results if r.status == "fail")
+    skip = sum(1 for r in results if r.status == "skipped")
+    lines.append(
+        f"**Totals:** {ok} passed, {fail} failed, {skip} skipped out of {len(results)}"
+    )
+
+    summary_path.write_text("\n".join(lines) + "\n")
+
+
+def write_sweep_config(
+    config: SweepConfig, output_dir: Path, total_runs: int = 0
+) -> None:
+    """Write sweep_config.json for downstream consumers."""
+    config_path = output_dir / "sweep_config.json"
+    config_data = {
+        "timestamp": time.strftime("%Y%m%d_%H%M%S"),
+        "mode": config.mode,
+        "model": config.model,
+        "model_name": config.model_name,
+        "backend": config.backend,
+        "backends": config.backend,
+        "tokenizers": ",".join(config.tokenizers),
+        "isl_list": ",".join(str(i) for i in config.isls),
+        "concurrency_list": ",".join(str(c) for c in config.concurrencies),
+        "benchmark_duration": config.benchmark_duration or "N/A",
+        "osl": config.osl,
+        "speedup_ratio": config.speedup_ratio,
+        "output_dir": config.output_dir,
+        "total_runs": total_runs,
+        "isolation_policy": config.isolation_policy,
+    }
+    config_path.write_text(json.dumps(config_data, indent=2) + "\n")
+
+
+def print_results_table(results: List[RunResult]) -> None:
+    """Print a compact results table to stdout."""
+    print(f"\n{'=' * 90}")
+    print(
+        f"  {'Run ID':<30} {'Req/s':>8} {'Tok/s':>8} {'TTFT p50':>10} {'TTFT p99':>10} {'Status':>8}"
+    )
+    print(f"  {'-' * 30} {'-' * 8} {'-' * 8} {'-' * 10} {'-' * 10} {'-' * 8}")
+    for r in results:
+        rps = f"{r.req_per_sec:.1f}" if r.req_per_sec is not None else "N/A"
+        tps = (
+            f"{r.output_tok_per_sec:.0f}" if r.output_tok_per_sec is not None else "N/A"
+        )
+        tp50 = f"{r.ttft_p50_ms:.1f}ms" if r.ttft_p50_ms is not None else "N/A"
+        tp99 = f"{r.ttft_p99_ms:.1f}ms" if r.ttft_p99_ms is not None else "N/A"
+        print(
+            f"  {r.run_spec.run_id:<30} {rps:>8} {tps:>8} {tp50:>10} {tp99:>10} {r.status:>8}"
+        )
+    print(f"{'=' * 90}")
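Because `write_csv` emits an empty string for any metric that is `None`, a consumer of `results.csv` has to map empty cells back to missing values before doing arithmetic. A minimal sketch of that read side, assuming a hypothetical `load_rows` helper and a trimmed-down sample with invented run IDs and values:

```python
import csv
import io
from typing import Dict, List, Optional

# Invented sample with a subset of the real columns.
SAMPLE_CSV = """run_id,status,req_per_sec,ttft_p99_ms
run-a,ok,38.00,5000.0
run-b,fail,,
"""


def to_float(cell: str) -> Optional[float]:
    """Empty cells mean the metric was not recorded."""
    return float(cell) if cell else None


def load_rows(text: str) -> List[Dict[str, object]]:
    """Parse CSV text, turning empty metric cells back into None."""
    rows: List[Dict[str, object]] = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                "run_id": raw["run_id"],
                "status": raw["status"],
                "req_per_sec": to_float(raw["req_per_sec"]),
                "ttft_p99_ms": to_float(raw["ttft_p99_ms"]),
            }
        )
    return rows


for row in load_rows(SAMPLE_CSV):
    print(row)
```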
--- a/benchmarks/frontend/scripts/sweep_core/config.py
+++ b/benchmarks/frontend/scripts/sweep_core/config.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""
+Typed SweepConfig construction from argparse Namespace.
+
+Centralizes the parsing of CLI arguments into the SweepConfig data model.
+"""
+
+from __future__ import annotations
+
+import argparse
+import time
+from pathlib import Path
+from typing import List, Optional
+
+from sweep_core.models import K8sConfig, SweepConfig
+
+SCRIPT_DIR = Path(__file__).resolve().parent.parent
+REPO_ROOT = SCRIPT_DIR.parent.parent.parent
+
+DEFAULT_MODEL = "Qwen/Qwen3-0.6B"
+DEFAULT_OSL = 256
+DEFAULT_SPEEDUP = 1.0
+DEFAULT_BENCHMARK_DURATION = 60
+DEFAULT_MAX_CONSECUTIVE_FAILS = 2
+DEFAULT_COOLDOWN = 3
+
+
+def build_argument_parser() -> argparse.ArgumentParser:
+    """Build the argument parser for sweep_runner.py."""
+    parser = argparse.ArgumentParser(
+        description="Frontend performance sweep runner",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""Examples:
+  # Local smoke test
+  python3 sweep_runner.py --tokenizers hf,fastokens --concurrency 32 --isl 512 \\
+      --benchmark-duration 30 --speedup-ratio 1000000
+
+  # K8s sweep with DGD
+  python3 sweep_runner.py --mode k8s --tokenizers hf,fastokens --concurrency 50,100 --isl 512
+
+  # K8s with custom deploy template
+  python3 sweep_runner.py --mode k8s --deploy-template dgd/templates/vllm.yaml \\
+      --tokenizers hf --concurrency 128 --isl 1024
+
+  # Transport saturation (high concurrency, vary workers)
+  python3 sweep_runner.py --tokenizers hf --concurrency 4096 \\
+      --num-requests 16384,32768 --workers 1,2,4,8 --speedup-ratio 1000000
+
+  # Dry run
+  python3 sweep_runner.py --dry-run --tokenizers hf,fastokens --concurrency 32,64 --isl 512,1024
+""",
+    )
+
+    # Common options
+    parser.add_argument("--model", default=DEFAULT_MODEL, help="HF model path")
+    parser.add_argument(
+        "--model-name", default="", help="Served model name (default: same as --model)"
+    )
+    parser.add_argument(
+        "--mode",
+        choices=["local", "k8s"],
+        default="local",
+        help="Execution mode: local (run_perf.sh) or k8s (DGD + aiperf)",
+    )
+    parser.add_argument(
+        "--backend",
+        choices=["mocker", "vllm"],
+        default="mocker",
+        help="Engine backend: mocker (synthetic) or vllm (real inference)",
+    )
+    parser.add_argument(
+        "--tokenizers",
+        default="hf,fastokens",
+        help="Comma-separated tokenizer backends (hf, fastokens)",
+    )
+    parser.add_argument(
+        "--concurrency", default="50,100,200", help="Comma-separated concurrency levels"
+    )
+    parser.add_argument(
+        "--isl", default="512,1024,2048", help="Comma-separated ISL values"
+    )
+    parser.add_argument(
+        "--osl", type=int, default=DEFAULT_OSL, help="Output sequence length"
+    )
+    parser.add_argument(
+        "--workers", default="2", help="Comma-separated worker counts per model"
+    )
+    parser.add_argument(
+        "--num-models",
+        type=int,
+        default=1,
+        help="Number of model instances",
+    )
+    parser.add_argument(
+        "--aiperf-targets",
+        choices=["first", "all"],
+        default="first",
+        help="'first': aiperf targets model-1 only. 'all': run aiperf for each model.",
+    )
+    parser.add_argument(
+        "--speedup-ratio",
+        type=float,
+        default=DEFAULT_SPEEDUP,
+        help="Mocker speedup (0=infinite)",
+    )
+    parser.add_argument(
+        "--benchmark-duration",
+        type=int,
+        default=DEFAULT_BENCHMARK_DURATION,
+        help="aiperf duration (seconds)",
+    )
+    parser.add_argument(
+        "--num-requests",
+        default=None,
+        help="Comma-separated request counts (overrides --benchmark-duration)",
+    )
+    parser.add_argument(
+        "--rps",
+        default=None,
+        help="Comma-separated target request rates (req/s)",
+    )
+    parser.add_argument(
+        "--output-dir",
+        default=None,
+        help="Output directory (default: auto timestamped)",
+    )
+    parser.add_argument(
+        "--max-consecutive-fails",
+        type=int,
+        default=DEFAULT_MAX_CONSECUTIVE_FAILS,
+        help="Skip remaining runs for a key after this many consecutive failures",
+    )
+    parser.add_argument(
+        "--cooldown", type=int, default=DEFAULT_COOLDOWN, help="Seconds between runs"
+    )
+    parser.add_argument(
+        "--dry-run", action="store_true", help="Print plan without executing"
+    )
+    parser.add_argument(
+        "--no-report", action="store_true", help="Skip per-run report generation"
+    )
+    parser.add_argument(
+        "--isolation",
+        choices=["fresh_per_run", "reuse_by_deploy_key"],
+        default="fresh_per_run",
+        help="Isolation policy (default: fresh_per_run)",
+    )
+
+    # K8s-specific options
+    k8s_group = parser.add_argument_group("K8s mode options")
+    k8s_group.add_argument("--namespace", default="dynamo-bench", help="K8s namespace")
+    k8s_group.add_argument(
+        "--endpoint", default=None, help="K8s frontend endpoint (host:port)"
+    )
+    k8s_group.add_argument("--dgd-name", default="", help="DynamoGraphDeployment name")
+    k8s_group.add_argument(
+        "--image", default="", help="Container image for k8s deployment"
+    )
+    k8s_group.add_argument(
+        "--deploy-template",
+        default="",
+        help="Path to deploy.yaml template (enables template-based deployment)",
+    )
+    k8s_group.add_argument(
+        "--reset-strategy",
+        choices=["none", "frontend", "graph"],
+        default="graph",
+        help="K8s reset strategy per run (default: graph)",
+    )
+    k8s_group.add_argument(
+        "--deploy", action="store_true", help="Deploy infrastructure before sweeping"
+    )
+    k8s_group.add_argument(
+        "--frontend-port", type=int, default=8000, help="Frontend HTTP port"
+    )
+    k8s_group.add_argument(
+        "--worker-replicas", type=int, default=1, help="Number of worker pod replicas"
+    )
+    k8s_group.add_argument(
+        "--frontend-replicas",
+        type=int,
+        default=1,
+        help="Number of frontend pod replicas",
+    )
+    k8s_group.add_argument(
+        "--request-plane", default="tcp", help="Request plane transport"
+    )
+    k8s_group.add_argument(
+        "--event-plane", default="nats", help="Event plane transport"
+    )
+    k8s_group.add_argument(
+        "--router-mode", default="round-robin", help="Frontend router mode"
+    )
+    k8s_group.add_argument("--hf-token", default="", help="HuggingFace token for k8s")
+    k8s_group.add_argument(
+        "--image-pull-secret", default="", help="Image pull secret name"
+    )
+    k8s_group.add_argument(
+        "--export-level", default="summary", help="aiperf export level"
+    )
+
+    # Passthrough args for run_perf.sh
+    parser.add_argument(
+        "passthrough", nargs="*", help="Extra args passed to run_perf.sh (after --)"
+    )
+
+    return parser
+
+
+def config_from_args(args: argparse.Namespace) -> SweepConfig:
+    """Convert parsed argparse Namespace to SweepConfig."""
+    # Parse comma-separated lists
+    tokenizers = [t.strip() for t in args.tokenizers.split(",")]
+    concurrencies = [int(c) for c in args.concurrency.split(",")]
+    isls = [int(i) for i in args.isl.split(",")]
+    worker_counts = [int(w) for w in args.workers.split(",")]
+    num_requests_list: List[Optional[int]] = (
+        [int(n) for n in args.num_requests.split(",")] if args.num_requests else [None]
+    )
+    rps_list: List[Optional[int]] = (
+        [int(r) for r in args.rps.split(",")] if args.rps else [None]
+    )
+
+    # Output directory
+    if args.output_dir:
+        output_dir = args.output_dir
+    else:
+        ts = time.strftime("%Y%m%d_%H%M%S")
+        if args.mode == "k8s" and Path("/artifacts").is_dir():
+            # Inside a k8s pod with /artifacts PVC mounted
+            output_dir = f"/artifacts/sweep_{ts}"
+        else:
+            # Local or k8s-from-host: use repo artifacts directory
+            output_dir = str(REPO_ROOT / "artifacts" / f"sweep_{ts}")
+
+    # Build K8s config
+    k8s_config = K8sConfig(
+        namespace=args.namespace,
+        dgd_name=args.dgd_name,
+        image=args.image,
+        frontend_port=args.frontend_port,
+        worker_replicas=args.worker_replicas,
+        frontend_replicas=args.frontend_replicas,
+        deploy_template=args.deploy_template,
+        reset_strategy=args.reset_strategy,
+        request_plane=args.request_plane,
+        event_plane=args.event_plane,
+        router_mode=args.router_mode,
+        deploy=args.deploy,
+        hf_token=args.hf_token,
+        image_pull_secret=args.image_pull_secret,
+        export_level=args.export_level,
+    )
+
+    # Compute k8s endpoint
+    if args.endpoint:
+        k8s_config.endpoint = args.endpoint
+    elif k8s_config.dgd_name:
+        k8s_config.endpoint = (
+            f"{k8s_config.dgd_name}-frontend:{k8s_config.frontend_port}"
+        )
+    else:
+        k8s_config.endpoint = f"frontend:{k8s_config.frontend_port}"
+
+    return SweepConfig(
+        model=args.model,
+        model_name=args.model_name or args.model,
+        mode=args.mode,
+        backend=args.backend,
+        tokenizers=tokenizers,
+        concurrencies=concurrencies,
+        isls=isls,
+        osl=args.osl,
+        worker_counts=worker_counts,
+        num_models=args.num_models,
+        aiperf_targets=args.aiperf_targets,
+        speedup_ratio=args.speedup_ratio,
+        benchmark_duration=args.benchmark_duration,
+        num_requests_list=num_requests_list,
+        rps_list=rps_list,
+        output_dir=output_dir,
+        max_consecutive_fails=args.max_consecutive_fails,
+        cooldown=args.cooldown,
+        dry_run=args.dry_run,
+        no_report=args.no_report,
+        isolation_policy=args.isolation,
+        passthrough_args=args.passthrough or [],
+        k8s=k8s_config,
+    )
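The list options (`--concurrency`, `--isl`, `--workers`, `--num-requests`, `--rps`) are all parsed with `int()`, which has two practical consequences: spaces around commas are tolerated, and fractional values such as `2.5` are rejected. The sketch below mirrors the `--rps` branch in isolation; `parse_int_list` is a hypothetical name, not a function in this module.

```python
from typing import List, Optional


def parse_int_list(raw: Optional[str]) -> List[Optional[int]]:
    """Mirror of the --rps / --num-requests parsing: unset means one 'no value' entry."""
    if not raw:
        return [None]
    return [int(item) for item in raw.split(",")]


print(parse_int_list("10, 20,30"))  # int() strips surrounding whitespace
print(parse_int_list(None))         # a single None keeps the sweep grid non-empty

try:
    parse_int_list("2.5")
except ValueError as exc:
    print(f"rejected: {exc}")
```

The single-`None` default means that an unset option still contributes exactly one entry when the lists are combined into runs.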
--- a/benchmarks/frontend/scripts/sweep_core/failures.py
+++ b/benchmarks/frontend/scripts/sweep_core/failures.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Consecutive-failure skip policy for sweep runs."""
+
+from __future__ import annotations
+
+from typing import Dict, Tuple
+
+
+class FailureTracker:
+    """Track consecutive failures per (backend, concurrency, workers) tuple.
+
+    After max_consecutive_fails consecutive failures at a given key,
+    subsequent runs with the same key are skipped.
+    """
+
+    def __init__(self, max_consecutive_fails: int = 2):
+        self.max_consecutive_fails = max_consecutive_fails
+        self._counts: Dict[Tuple[str, int, int], int] = {}
+
+    def should_skip(self, backend: str, concurrency: int, workers: int) -> bool:
+        """Check if a run should be skipped due to prior consecutive failures."""
+        key = (backend, concurrency, workers)
+        return self._counts.get(key, 0) >= self.max_consecutive_fails
+
+    def record_success(self, backend: str, concurrency: int, workers: int) -> None:
+        """Record a successful run, resetting the failure count."""
+        key = (backend, concurrency, workers)
+        self._counts[key] = 0
+
+    def record_failure(self, backend: str, concurrency: int, workers: int) -> int:
+        """Record a failed run. Returns the new consecutive failure count."""
+        key = (backend, concurrency, workers)
+        self._counts[key] = self._counts.get(key, 0) + 1
+        return self._counts[key]
+
+    def get_count(self, backend: str, concurrency: int, workers: int) -> int:
+        """Get the current consecutive failure count for a key."""
+        key = (backend, concurrency, workers)
+        return self._counts.get(key, 0)
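The skip policy is easiest to see on a short sequence of outcomes. The sketch below uses a condensed stand-in for `FailureTracker` so that it runs on its own; `MiniFailureTracker` and the `simulate` loop are hypothetical and only approximate how an orchestrator might call the tracker. Note that the key holds backend, concurrency, and workers but no request rate, so failures at different RPS values accumulate under the same key.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int, int]  # (backend, concurrency, workers)


class MiniFailureTracker:
    """Condensed stand-in for FailureTracker with the same counting rule."""

    def __init__(self, max_consecutive_fails: int = 2) -> None:
        self.max_consecutive_fails = max_consecutive_fails
        self._counts: Dict[Key, int] = {}

    def should_skip(self, key: Key) -> bool:
        return self._counts.get(key, 0) >= self.max_consecutive_fails

    def record(self, key: Key, ok: bool) -> None:
        # A success resets the streak; a failure extends it.
        self._counts[key] = 0 if ok else self._counts.get(key, 0) + 1


def simulate(outcomes: List[bool], key: Key = ("vllm", 200, 4)) -> List[str]:
    """Replay run outcomes in order and report the status each run would get."""
    tracker = MiniFailureTracker(max_consecutive_fails=2)
    statuses: List[str] = []
    for ok in outcomes:
        if tracker.should_skip(key):
            statuses.append("skipped")
            continue
        tracker.record(key, ok)
        statuses.append("ok" if ok else "fail")
    return statuses


print(simulate([True, False, True, False, False, True]))
# prints ['ok', 'fail', 'ok', 'fail', 'fail', 'skipped']
```

The isolated failure in second position does not trigger a skip because the following success resets the count; only the two back-to-back failures do.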
--- a/benchmarks/frontend/scripts/sweep_core/lifecycle.py
+++ b/benchmarks/frontend/scripts/sweep_core/lifecycle.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Lifecycle management -- deploy-dimension delta detection and reset strategy."""
+
+from __future__ import annotations
+
+from typing import Optional
+
+from sweep_core.models import IsolationPolicy, RunSpec
+
+
+def needs_deploy_or_reset(
+    current: RunSpec,
+    previous: Optional[RunSpec],
+    isolation_policy: IsolationPolicy,
+) -> bool:
+    """Determine if the current run needs a deploy/reset before execution.
+
+    Args:
+        current: The run about to execute.
+        previous: The run that just completed (None for the first run).
+        isolation_policy: The sweep-level isolation policy.
+
+    Returns:
+        True if a deploy/reset is needed before this run.
+    """
+    if previous is None:
+        # First run always needs deployment
+        return True
+
+    if isolation_policy == "fresh_per_run":
+        # Every run gets its own deploy/reset cycle
+        return True
+
+    # reuse_by_deploy_key: only reset when the deploy key changes
+    return current.deploy_key != previous.deploy_key
+
+
+def deploy_key_changed(
+    current: RunSpec,
+    previous: Optional[RunSpec],
+) -> bool:
+    """Check if the deploy key has changed between consecutive runs."""
+    if previous is None:
+        return True
+    return current.deploy_key != previous.deploy_key
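The practical difference between the two isolation policies is how many deploy/reset cycles a sweep pays for. The sketch below applies the same decision rule as `needs_deploy_or_reset` to a plain sequence of keys; `count_resets` is a hypothetical helper and the tuples stand in for `DeployKey` instances.

```python
from typing import Hashable, Sequence


def count_resets(deploy_keys: Sequence[Hashable], isolation_policy: str) -> int:
    """Count how many runs would trigger a deploy/reset under the given policy."""
    resets = 0
    previous = None
    for index, key in enumerate(deploy_keys):
        first_run = index == 0
        if first_run or isolation_policy == "fresh_per_run" or key != previous:
            resets += 1
        previous = key
    return resets


# Three runs per tokenizer, ordered so equal keys are adjacent.
keys = [("vllm", "hf", 4)] * 3 + [("vllm", "fastokens", 4)] * 3

print(count_resets(keys, "fresh_per_run"))        # prints 6
print(count_resets(keys, "reuse_by_deploy_key"))  # prints 2
```

The saving under `reuse_by_deploy_key` depends on run ordering: if runs with the same key are interleaved rather than adjacent, every change of key triggers another cycle.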
--- a/benchmarks/frontend/scripts/sweep_core/models.py
+++ b/benchmarks/frontend/scripts/sweep_core/models.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""
+Data models for sweep_core.
+
+All data structures are plain dataclasses that serialize to/from JSON/dict.
+No subprocess, kubectl, or argparse imports allowed in this module.
+"""
+
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass, field
+from typing import Dict, List, Literal, Optional
+
+IsolationPolicy = Literal["fresh_per_run", "reuse_by_deploy_key"]
+
+
+@dataclass(frozen=True)
+class DeployKey:
+    """Hashable key identifying a unique deployment configuration."""
+
+    backend: str
+    tokenizer: str
+    workers: int
+    num_models: int
+    env_overrides: frozenset[tuple[str, str]] = field(default_factory=frozenset)
+
+    def to_dict(self) -> dict:
+        return {
+            "backend": self.backend,
+            "tokenizer": self.tokenizer,
+            "workers": self.workers,
+            "num_models": self.num_models,
+            "env_overrides": dict(self.env_overrides),
+        }
+
+    @classmethod
+    def from_dict(cls, d: dict) -> DeployKey:
+        env = d.get("env_overrides", {})
+        return cls(
+            backend=d["backend"],
+            tokenizer=d["tokenizer"],
+            workers=d["workers"],
+            num_models=d["num_models"],
+            env_overrides=frozenset(env.items())
+            if isinstance(env, dict)
+            else frozenset(env),
+        )
+
+
+@dataclass
+class DeployDimension:
+    """Configuration for a single deployment state."""
+
+    backend: str  # "mocker" or "vllm"
+    tokenizer: str  # "hf" or "fastokens"
+    workers: int = 2
+    num_models: int = 1
+    env_overrides: Dict[str, str] = field(default_factory=dict)
+
+    @property
+    def deploy_key(self) -> DeployKey:
+        return DeployKey(
+            backend=self.backend,
+            tokenizer=self.tokenizer,
+            workers=self.workers,
+            num_models=self.num_models,
+            env_overrides=frozenset(self.env_overrides.items()),
+        )
+
+    def to_dict(self) -> dict:
+        return {
+            "backend": self.backend,
+            "tokenizer": self.tokenizer,
+            "workers": self.workers,
+            "num_models": self.num_models,
+            "env_overrides": self.env_overrides,
+        }
+
+    @classmethod
+    def from_dict(cls, d: dict) -> DeployDimension:
+        return cls(
+            backend=d["backend"],
+            tokenizer=d["tokenizer"],
+            workers=d.get("workers", 2),
+            num_models=d.get("num_models", 1),
+            env_overrides=d.get("env_overrides", {}),
+        )
+
+
+@dataclass
+class AiperfDimension:
+    """Configuration for a single aiperf run."""
+
+    concurrency: int
+    isl: int
+    osl: int = 256
+    num_requests: Optional[int] = None
+    benchmark_duration: Optional[int] = None
+    request_rate: Optional[int] = None
+
+    def to_dict(self) -> dict:
+        return {
+            "concurrency": self.concurrency,
+            "isl": self.isl,
+            "osl": self.osl,
+            "num_requests": self.num_requests,
+            "benchmark_duration": self.benchmark_duration,
+            "request_rate": self.request_rate,
+        }
+
+    @classmethod
+    def from_dict(cls, d: dict) -> AiperfDimension:
+        return cls(
+            concurrency=d["concurrency"],
+            isl=d["isl"],
+            osl=d.get("osl", 256),
+            num_requests=d.get("num_requests"),
+            benchmark_duration=d.get("benchmark_duration"),
+            request_rate=d.get("request_rate"),
+        )
+
+
+@dataclass
+class RunSpec:
+    """One logical perf run -- the atomic unit of execution."""
+
+    deploy: DeployDimension
+    aiperf: AiperfDimension
+    deploy_key: DeployKey
+    run_id: str
+
+    def to_dict(self) -> dict:
+        return {
+            "deploy": self.deploy.to_dict(),
+            "aiperf": self.aiperf.to_dict(),
+            "deploy_key": self.deploy_key.to_dict(),
+            "run_id": self.run_id,
+        }
+
+    @classmethod
+    def from_dict(cls, d: dict) -> RunSpec:
+        deploy = DeployDimension.from_dict(d["deploy"])
+        aiperf = AiperfDimension.from_dict(d["aiperf"])
+        deploy_key = DeployKey.from_dict(d["deploy_key"])
+        return cls(
+            deploy=deploy,
+            aiperf=aiperf,
+            deploy_key=deploy_key,
+            run_id=d["run_id"],
+        )
+
+
+@dataclass
+class RunResult:
+    """Result from a single sweep point."""
+
+    run_spec: RunSpec
+    status: str = "pending"  # pending, ok, fail, skipped
+    req_per_sec: float = 0.0
+    output_tok_per_sec: float = 0.0
+    ttft_p50_ms: float = 0.0
+    ttft_p99_ms: float = 0.0
+    itl_p50_ms: float = 0.0
+    itl_p99_ms: float = 0.0
+    duration_sec: float = 0.0
+    run_dir: str = ""
+
+    def to_dict(self) -> dict:
+        return {
+            "run_spec": self.run_spec.to_dict(),
+            "status": self.status,
+            "req_per_sec": self.req_per_sec,
+            "output_tok_per_sec": self.output_tok_per_sec,
+            "ttft_p50_ms": self.ttft_p50_ms,
+            "ttft_p99_ms": self.ttft_p99_ms,
+            "itl_p50_ms": self.itl_p50_ms,
+            "itl_p99_ms": self.itl_p99_ms,
+            "duration_sec": self.duration_sec,
+            "run_dir": self.run_dir,
+        }
+
+
+@dataclass
+class K8sConfig:
+    """K8s-specific configuration."""
+
+    namespace: str = "dynamo-bench"
+    endpoint: str = "frontend:8000"
+    dgd_name: str = ""
+    image: str = ""
+    frontend_port: int = 8000
+    worker_replicas: int = 1
+    frontend_replicas: int = 1
+    deploy_template: str = ""  # path to deploy.yaml template
+    reset_strategy: str = "graph"  # none | frontend | graph
+    request_plane: str = "tcp"
+    event_plane: str = "nats"
+    router_mode: str = "round-robin"
+    deploy: bool = False
+    hf_token: str = ""
+    image_pull_secret: str = ""
+    export_level: str = "summary"
+
+    def to_dict(self) -> dict:
+        return {
+            "namespace": self.namespace,
+            "endpoint": self.endpoint,
+            "dgd_name": self.dgd_name,
+            "image": self.image,
+            "frontend_port": self.frontend_port,
+            "worker_replicas": self.worker_replicas,
+            "frontend_replicas": self.frontend_replicas,
+            "deploy_template": self.deploy_template,
+            "reset_strategy": self.reset_strategy,
+            "request_plane": self.request_plane,
+            "event_plane": self.event_plane,
+            "router_mode": self.router_mode,
+            "deploy": self.deploy,
+            "hf_token": "***" if self.hf_token else "",
+            "image_pull_secret": self.image_pull_secret,
+            "export_level": self.export_level,
+        }
+
+    @classmethod
+    def from_dict(cls, d: dict) -> K8sConfig:
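+        fields = {k: v for k, v in d.items() if k in cls.__dataclass_fields__}
+        # to_dict() masks hf_token as "***"; never load the mask back as a real token.
+        if fields.get("hf_token") == "***":
+            fields["hf_token"] = ""
+        return cls(**fields)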
+
+
+@dataclass
+class SweepConfig:
+    """Top-level configuration for a sweep."""
+
+    model: str = "Qwen/Qwen3-0.6B"
+    model_name: str = ""
+    mode: str = "local"  # "local" or "k8s"
+    backend: str = "mocker"
+    tokenizers: List[str] = field(default_factory=lambda: ["hf", "fastokens"])
+    concurrencies: List[int] = field(default_factory=lambda: [50, 100, 200])
+    isls: List[int] = field(default_factory=lambda: [512, 1024, 2048])
+    osl: int = 256
+    worker_counts: List[int] = field(default_factory=lambda: [2])
+    num_models: int = 1
+    aiperf_targets: str = "first"
+    speedup_ratio: float = 1.0
+    benchmark_duration: Optional[int] = 60
+    num_requests_list: List[Optional[int]] = field(default_factory=lambda: [None])
+    rps_list: List[Optional[int]] = field(default_factory=lambda: [None])
+    output_dir: str = ""
+    max_consecutive_fails: int = 2
+    cooldown: int = 3
+    dry_run: bool = False
+    no_report: bool = False
+    isolation_policy: IsolationPolicy = "fresh_per_run"
+    passthrough_args: List[str] = field(default_factory=list)
+    k8s: K8sConfig = field(default_factory=K8sConfig)
+
+    def __post_init__(self):
+        if not self.model_name:
+            self.model_name = self.model
+
+    def to_dict(self) -> dict:
+        return {
+            "model": self.model,
+            "model_name": self.model_name,
+            "mode": self.mode,
+            "backend": self.backend,
+            "tokenizers": self.tokenizers,
+            "concurrencies": self.concurrencies,
+            "isls": self.isls,
+            "osl": self.osl,
+            "worker_counts": self.worker_counts,
+            "num_models": self.num_models,
+            "aiperf_targets": self.aiperf_targets,
+            "speedup_ratio": self.speedup_ratio,
+            "benchmark_duration": self.benchmark_duration,
+            "num_requests_list": self.num_requests_list,
+            "rps_list": self.rps_list,
+            "output_dir": self.output_dir,
+            "max_consecutive_fails": self.max_consecutive_fails,
+            "cooldown": self.cooldown,
+            "dry_run": self.dry_run,
+            "no_report": self.no_report,
+            "isolation_policy": self.isolation_policy,
+            "passthrough_args": self.passthrough_args,
+            "k8s": self.k8s.to_dict(),
+        }
+
+    @classmethod
+    def from_dict(cls, d: dict) -> SweepConfig:
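+        # Read "k8s" without popping it so the caller's dict is left unmodified.
+        k8s_data = d.get("k8s") or {}
+        config = cls(
+            **{
+                k: v
+                for k, v in d.items()
+                if k in cls.__dataclass_fields__ and k != "k8s"
+            }
+        )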
+        if k8s_data:
+            config.k8s = K8sConfig.from_dict(k8s_data)
+        return config
+
+
+@dataclass
+class SweepPlan:
+    """Serializable execution plan."""
+
+    config: SweepConfig
+    runs: List[RunSpec]
+    isolation_policy: IsolationPolicy
+    total_runs: int
+
+    def to_dict(self) -> dict:
+        return {
+            "config": self.config.to_dict(),
+            "runs": [r.to_dict() for r in self.runs],
+            "isolation_policy": self.isolation_policy,
+            "total_runs": self.total_runs,
+        }
+
+    @classmethod
+    def from_dict(cls, d: dict) -> SweepPlan:
+        config = SweepConfig.from_dict(d["config"])
+        runs = [RunSpec.from_dict(r) for r in d["runs"]]
+        return cls(
+            config=config,
+            runs=runs,
+            isolation_policy=d["isolation_policy"],
+            total_runs=d["total_runs"],
+        )
+
+    def to_json(self, indent: int = 2) -> str:
+        return json.dumps(self.to_dict(), indent=indent)
+
+    @classmethod
+    def from_json(cls, s: str) -> SweepPlan:
+        return cls.from_dict(json.loads(s))
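
The `from_dict` helpers in `K8sConfig` and `SweepConfig` filter incoming keys against `__dataclass_fields__`, so unknown keys are dropped instead of raising `TypeError`. The following is a minimal sketch of that pattern on its own; `Settings` is a hypothetical stand-in and not a class from this commit.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Settings:
    """Hypothetical stand-in for a config dataclass such as K8sConfig."""

    namespace: str = "dynamo-bench"
    frontend_port: int = 8000

    @classmethod
    def from_dict(cls, d: dict) -> "Settings":
        # Keep only keys that are declared dataclass fields; ignore the rest.
        return cls(**{k: v for k, v in d.items() if k in cls.__dataclass_fields__})


# An extra key in the input is silently ignored.
raw = json.loads('{"namespace": "perf", "frontend_port": 9000, "unknown_key": 1}')
settings = Settings.from_dict(raw)

# Serializing and loading again yields an equal object.
round_tripped = Settings.from_dict(json.loads(json.dumps(asdict(settings))))
print(settings)
```

The trade-off of this approach is that a misspelled key falls back to the field default without any warning.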
--- a/benchmarks/frontend/scripts/sweep_core/naming.py
+++ b/benchmarks/frontend/scripts/sweep_core/naming.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Run ID and directory naming conventions for sweep runs."""
+
+from __future__ import annotations
+
+from sweep_core.models import AiperfDimension, DeployDimension
+
+
+def build_run_id(deploy: DeployDimension, aiperf: AiperfDimension) -> str:
+    """Build a human-readable run ID from deploy + aiperf dimensions.
+
+    Format: {tokenizer}_c{concurrency}_isl{isl}_w{workers}[_m{models}][_rps{rate}]
+
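+    This matches the naming convention from the original sweep_runner.py.
+
+    Note: num_requests is not encoded in the ID, so runs that differ only in
+    num_requests share the same run ID (and therefore the same run directory).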
+    """
+    base = f"{deploy.tokenizer}_c{aiperf.concurrency}_isl{aiperf.isl}_w{deploy.workers}"
+    if deploy.num_models > 1:
+        base += f"_m{deploy.num_models}"
+    if aiperf.request_rate is not None:
+        base += f"_rps{aiperf.request_rate}"
+    return base
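
To make the run ID format concrete, here is a small self-contained sketch that applies the same formatting rules as `build_run_id`. `Deploy` and `Load` are hypothetical stand-ins that carry only the fields the ID uses; they are not the real `DeployDimension` and `AiperfDimension`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deploy:
    """Hypothetical stand-in for DeployDimension (ID-relevant fields only)."""

    tokenizer: str
    workers: int = 2
    num_models: int = 1


@dataclass
class Load:
    """Hypothetical stand-in for AiperfDimension (ID-relevant fields only)."""

    concurrency: int
    isl: int
    request_rate: Optional[int] = None


def run_id(deploy: Deploy, load: Load) -> str:
    # Mandatory parts first, then the optional model-count and rate suffixes.
    base = f"{deploy.tokenizer}_c{load.concurrency}_isl{load.isl}_w{deploy.workers}"
    if deploy.num_models > 1:
        base += f"_m{deploy.num_models}"
    if load.request_rate is not None:
        base += f"_rps{load.request_rate}"
    return base


plain_id = run_id(Deploy("hf"), Load(50, 512))
full_id = run_id(Deploy("fastokens", workers=4, num_models=2), Load(100, 1024, request_rate=20))
print(plain_id)  # hf_c50_isl512_w2
print(full_id)   # fastokens_c100_isl1024_w4_m2_rps20
```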
--- a/benchmarks/frontend/scripts/sweep_core/orchestrator.py
+++ b/benchmarks/frontend/scripts/sweep_core/orchestrator.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""
+Sequential plan runner -- iterates through a SweepPlan using a SweepExecutor.
+
+This module is interface-agnostic: it does not import argparse, subprocess,
+or kubectl. It is callable from CLI, MCP server, or test harness.
+"""
+
+from __future__ import annotations
+
+import time
+from pathlib import Path
+from typing import TYPE_CHECKING, List, Optional
+
+from sweep_core.artifacts import (
+    print_results_table,
+    write_csv,
+    write_summary,
+    write_sweep_config,
+)
+from sweep_core.failures import FailureTracker
+from sweep_core.lifecycle import needs_deploy_or_reset
+from sweep_core.models import RunResult, RunSpec, SweepPlan
+from sweep_core.reporting import generate_report
+
+if TYPE_CHECKING:
+    from sweep_executors.base import SweepExecutor
+
+
+def run(plan: SweepPlan, executor: "SweepExecutor") -> List[RunResult]:
+    """Execute a SweepPlan sequentially using the given executor.
+
+    Args:
+        plan: The sweep plan to execute.
+        executor: The executor that handles individual runs.
+
+    Returns:
+        List of RunResult objects, one per run.
+    """
+    config = plan.config
+    output_root = Path(config.output_dir)
+    output_root.mkdir(parents=True, exist_ok=True)
+
+    csv_path = output_root / "results.csv"
+    summary_path = output_root / "summary.md"
+
+    # Write sweep config
+    write_sweep_config(config, output_root, total_runs=plan.total_runs)
+
+    failure_tracker = FailureTracker(config.max_consecutive_fails)
+    results: List[RunResult] = []
+    previous_run: Optional[RunSpec] = None
+
+    try:
+        # Prepare executor inside try so cleanup() runs on prepare failure
+        executor.prepare(config)
+
+        for i, run_spec in enumerate(plan.runs, 1):
+            deploy = run_spec.deploy
+            aiperf = run_spec.aiperf
+            run_dir = output_root / run_spec.run_id
+
+            # Check skip policy
+            if failure_tracker.should_skip(
+                deploy.backend, aiperf.concurrency, deploy.workers
+            ):
+                result = RunResult(
+                    run_spec=run_spec,
+                    status="skipped",
+                    run_dir=str(run_dir),
+                )
+                results.append(result)
+                print(
+                    f"\n  [{i}/{plan.total_runs}] SKIPPED {run_spec.run_id} "
+                    f"({config.max_consecutive_fails} consecutive failures)"
+                )
+                continue
+
+            print(f"\n{'=' * 60}")
+            print(f"  [{i}/{plan.total_runs}] {run_spec.run_id}")
+            print(f"{'=' * 60}")
+
+            # Deploy or reset if needed
+            if needs_deploy_or_reset(run_spec, previous_run, plan.isolation_policy):
+                prev_deploy = previous_run.deploy if previous_run else None
+                executor.apply_deploy(deploy, prev_deploy)
+
+            # Execute the run
+            result = executor.execute_run(run_spec, run_dir)
+            results.append(result)
+            previous_run = run_spec
+
+            # Update failure tracking
+            if result.status == "ok":
+                failure_tracker.record_success(
+                    deploy.backend, aiperf.concurrency, deploy.workers
+                )
+                rps = f"{result.req_per_sec:.1f}" if result.req_per_sec else "N/A"
+                tp50 = f"{result.ttft_p50_ms:.1f}ms" if result.ttft_p50_ms else "N/A"
+                print(f"    OK: {rps} req/s, TTFT p50={tp50}")
+            else:
+                count = failure_tracker.record_failure(
+                    deploy.backend, aiperf.concurrency, deploy.workers
+                )
+                print(f"    FAIL (consecutive: {count}/{config.max_consecutive_fails})")
+
+            # Generate per-run report
+            if not config.no_report and result.status == "ok":
+                generate_report(run_dir)
+
+            # Write incremental CSV + summary
+            write_csv(results, csv_path, config)
+            write_summary(results, summary_path)
+
+            # Cooldown between runs
+            if i < plan.total_runs:
+                time.sleep(config.cooldown)
+
+    except KeyboardInterrupt:
+        print("\n\nInterrupted! Partial results saved.")
+    finally:
+        # Final write
+        write_csv(results, csv_path, config)
+        write_summary(results, summary_path)
+        # Cleanup executor
+        executor.cleanup()
+
+    # Print final table
+    print_results_table(results)
+    print(f"\nResults:  {csv_path}")
+    print(f"Summary:  {summary_path}")
+    print(f"Per-run:  {output_root}/<run_id>/report.md")
+
+    return results
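
The orchestrator relies on `FailureTracker` to decide when a sweep point is skipped, but `sweep_core/failures.py` is not part of this excerpt. The sketch below assumes the simplest possible policy (a per-key counter of consecutive failures that resets on success); the real tracker may behave differently, and `SimpleFailureTracker` is a hypothetical name.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple

Key = Tuple[str, int, int]  # (backend, concurrency, workers)


class SimpleFailureTracker:
    """Hypothetical tracker; the real FailureTracker may use another policy."""

    def __init__(self, max_consecutive_fails: int) -> None:
        self.max_consecutive_fails = max_consecutive_fails
        self._fails: DefaultDict[Key, int] = defaultdict(int)

    def should_skip(self, backend: str, concurrency: int, workers: int) -> bool:
        return self._fails[(backend, concurrency, workers)] >= self.max_consecutive_fails

    def record_success(self, backend: str, concurrency: int, workers: int) -> None:
        # A success breaks the streak for this key.
        self._fails[(backend, concurrency, workers)] = 0

    def record_failure(self, backend: str, concurrency: int, workers: int) -> int:
        self._fails[(backend, concurrency, workers)] += 1
        return self._fails[(backend, concurrency, workers)]


tracker = SimpleFailureTracker(max_consecutive_fails=2)
outcomes = iter([False, False])  # pretend the first two attempts fail
statuses = []
for _ in range(4):
    if tracker.should_skip("mocker", 200, 2):
        statuses.append("skipped")
        continue
    if next(outcomes):
        tracker.record_success("mocker", 200, 2)
        statuses.append("ok")
    else:
        tracker.record_failure("mocker", 200, 2)
        statuses.append("fail")

print(statuses)  # ['fail', 'fail', 'skipped', 'skipped']
```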
--- a/benchmarks/frontend/scripts/sweep_core/planner.py
+++ b/benchmarks/frontend/scripts/sweep_core/planner.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""
+SweepPlan builder -- constructs a serializable execution plan from SweepConfig.
+
+The planner builds the Cartesian product of deploy dimensions x aiperf dimensions,
+producing a flat list of RunSpec objects. The isolation policy determines how
+they are executed by the orchestrator.
+"""
+
+from __future__ import annotations
+
+from sweep_core.models import (
+    AiperfDimension,
+    DeployDimension,
+    RunSpec,
+    SweepConfig,
+    SweepPlan,
+)
+from sweep_core.naming import build_run_id
+
+
+def build_plan(config: SweepConfig) -> SweepPlan:
+    """Build a SweepPlan from a SweepConfig.
+
+    The plan contains a flat list of RunSpecs, one per (deploy, aiperf) combination.
+    The ordering is: tokenizers -> workers -> concurrencies -> ISLs -> num_requests -> rps
+
+    This matches the grid construction order from the original sweep_runner.py.
+    """
+    runs: list[RunSpec] = []
+
+    for tokenizer in config.tokenizers:
+        for workers in config.worker_counts:
+            for concurrency in config.concurrencies:
+                for isl in config.isls:
+                    for nr in config.num_requests_list:
+                        for rps in config.rps_list:
+                            deploy = DeployDimension(
+                                backend=config.backend,
+                                tokenizer=tokenizer,
+                                workers=workers,
+                                num_models=config.num_models,
+                            )
+
+                            aiperf = AiperfDimension(
+                                concurrency=concurrency,
+                                isl=isl,
+                                osl=config.osl,
+                                num_requests=nr,
+                                benchmark_duration=config.benchmark_duration
+                                if nr is None
+                                else None,
+                                request_rate=rps,
+                            )
+
+                            run_id = build_run_id(deploy, aiperf)
+
+                            runs.append(
+                                RunSpec(
+                                    deploy=deploy,
+                                    aiperf=aiperf,
+                                    deploy_key=deploy.deploy_key,
+                                    run_id=run_id,
+                                )
+                            )
+
+    return SweepPlan(
+        config=config,
+        runs=runs,
+        isolation_policy=config.isolation_policy,
+        total_runs=len(runs),
+    )
+
+
+def print_plan(plan: SweepPlan) -> None:
+    """Print a human-readable summary of the sweep plan."""
+    config = plan.config
+    print(f"Sweep plan: {plan.total_runs} runs")
+    print(f"  Model:          {config.model}")
+    print(f"  Mode:           {config.mode}")
+    print(f"  Backend:        {config.backend}")
+    print(f"  Tokenizers:     {config.tokenizers}")
+    print(f"  Concurrencies:  {config.concurrencies}")
+    print(f"  ISLs:           {config.isls}")
+    print(f"  Workers/model:  {config.worker_counts}")
+    print(f"  Models:         {config.num_models}")
+    print(f"  Isolation:      {plan.isolation_policy}")
+    if config.benchmark_duration is not None:
+        print(f"  Benchmark dur:  {config.benchmark_duration}s")
+    nr_list = [n for n in config.num_requests_list if n is not None]
+    if nr_list:
+        print(f"  Num requests:   {nr_list}")
+    rps_list = [r for r in config.rps_list if r is not None]
+    if rps_list:
+        print(f"  Request rates:  {rps_list} req/s")
+    print(f"  Output:         {config.output_dir}")
+    if config.mode == "k8s":
+        print(f"  Namespace:      {config.k8s.namespace}")
+        print(f"  Endpoint:       {config.k8s.endpoint}")
+        if config.k8s.frontend_replicas > 1:
+            print(f"  FE replicas:    {config.k8s.frontend_replicas}")
+        if config.k8s.dgd_name:
+            print(f"  DGD:            {config.k8s.dgd_name}")
+        if config.k8s.deploy_template:
+            print(f"  Template:       {config.k8s.deploy_template}")
+        print(f"  Reset strategy: {config.k8s.reset_strategy}")
+    print()
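
The six nested loops in `build_plan` are equivalent to a Cartesian product in which the last dimension varies fastest. As a rough illustration, the sketch below uses `itertools.product` with the `SweepConfig` defaults to show the resulting grid size and ordering; it builds plain tuples rather than `RunSpec` objects.

```python
from itertools import product

# Default values taken from SweepConfig in models.py.
tokenizers = ["hf", "fastokens"]
worker_counts = [2]
concurrencies = [50, 100, 200]
isls = [512, 1024, 2048]
num_requests_list = [None]
rps_list = [None]

# product() iterates like nested for-loops: the rightmost input changes fastest.
grid = list(
    product(tokenizers, worker_counts, concurrencies, isls, num_requests_list, rps_list)
)

total_runs = len(grid)
first, second = grid[0], grid[1]
print(total_runs)  # 18
print(first)       # ('hf', 2, 50, 512, None, None)
print(second)      # ('hf', 2, 50, 1024, None, None)
```

With the defaults, the sweep therefore has 2 x 1 x 3 x 3 x 1 x 1 = 18 runs, and all `hf` runs come before any `fastokens` run.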
--- a/benchmarks/frontend/scripts/sweep_core/reporting.py
+++ b/benchmarks/frontend/scripts/sweep_core/reporting.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Per-run report generation -- wraps analysis/create_report.py."""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).resolve().parent.parent
+ANALYSIS_DIR = SCRIPT_DIR / "analysis"
+
+# Add analysis directory to sys.path once at import time
+if str(ANALYSIS_DIR) not in sys.path:
+    sys.path.insert(0, str(ANALYSIS_DIR))
+
+
+def generate_report(run_dir: Path) -> None:
+    """Run create_report.py on a single run directory, saving report.md."""
+    try:
+        from create_report import run_analysis
+
+        report = run_analysis(run_dir)
+        (run_dir / "report.md").write_text(report)
+    except Exception as e:
+        # Report generation is best-effort: a failure here must not abort the sweep.
+        print(f"    Report generation failed: {e}")
--- a/benchmarks/frontend/scripts/sweep_executors/__init__.py
+++ b/benchmarks/frontend/scripts/sweep_executors/__init__.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""sweep_executors -- how individual runs execute."""
--- a/benchmarks/frontend/scripts/sweep_executors/base.py
+++ b/benchmarks/frontend/scripts/sweep_executors/base.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""
+SweepExecutor protocol -- the run-level extensibility seam.
+
+Each executor implements this protocol. The orchestrator calls these methods
+without knowing whether runs execute locally, in k8s, or elsewhere.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Optional, Protocol, runtime_checkable
+
+from sweep_core.models import DeployDimension, RunResult, RunSpec, SweepConfig
+
+
+@runtime_checkable
+class SweepExecutor(Protocol):
+    """Protocol for sweep executors."""
+
+    def prepare(self, config: SweepConfig) -> None:
+        """One-time setup before the sweep begins (e.g., start infra)."""
+        ...
+
+    def apply_deploy(
+        self,
+        deploy: DeployDimension,
+        prev: Optional[DeployDimension],
+    ) -> None:
+        """Apply a deployment change (e.g., restart frontend, switch backend).
+
+        Args:
+            deploy: The deployment configuration to apply.
+            prev: The previous deployment configuration (None for first run).
+        """
+        ...
+
+    def execute_run(self, run_spec: RunSpec, run_dir: Path) -> RunResult:
+        """Execute a single run and return results.
+
+        Args:
+            run_spec: The run specification.
+            run_dir: Directory where artifacts should be written.
+
+        Returns:
+            RunResult with status and metrics.
+        """
+        ...
+
+    def cleanup(self) -> None:
+        """Cleanup after the sweep completes (e.g., stop infra)."""
+        ...
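
Because `SweepExecutor` is a `runtime_checkable` `Protocol`, any class with the four methods conforms without inheriting from it. The sketch below illustrates this with a hypothetical `ExecutorLike` protocol (a mirror of `SweepExecutor` with loosened types) and a hypothetical `RecordingExecutor` that only records which hooks are called, in the order the orchestrator would call them.

```python
from pathlib import Path
from typing import Any, List, Optional, Protocol, runtime_checkable


@runtime_checkable
class ExecutorLike(Protocol):
    """Hypothetical mirror of the SweepExecutor protocol, with loosened types."""

    def prepare(self, config: Any) -> None:
        ...

    def apply_deploy(self, deploy: Any, prev: Optional[Any]) -> None:
        ...

    def execute_run(self, run_spec: Any, run_dir: Path) -> Any:
        ...

    def cleanup(self) -> None:
        ...


class RecordingExecutor:
    """Hypothetical no-op executor that only records which hooks were called."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def prepare(self, config: Any) -> None:
        self.calls.append("prepare")

    def apply_deploy(self, deploy: Any, prev: Optional[Any]) -> None:
        self.calls.append("apply_deploy")

    def execute_run(self, run_spec: Any, run_dir: Path) -> dict:
        self.calls.append("execute_run")
        return {"run_id": run_spec, "status": "ok", "run_dir": str(run_dir)}

    def cleanup(self) -> None:
        self.calls.append("cleanup")


executor = RecordingExecutor()
# isinstance() on a runtime_checkable Protocol checks method presence only.
conforms = isinstance(executor, ExecutorLike)

try:
    executor.prepare(config=None)
    executor.apply_deploy(deploy="hf", prev=None)
    result = executor.execute_run("hf_c50_isl512_w2", Path("out") / "hf_c50_isl512_w2")
finally:
    # Mirrors the orchestrator, which calls cleanup() in a finally block.
    executor.cleanup()

print(conforms, executor.calls)
```

Note that the runtime check does not validate signatures or return types, so a static type checker is still needed to catch mismatches.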
--- a/benchmarks/frontend/scripts/sweep_executors/k8s_dgd.py
+++ b/benchmarks/frontend/scripts/sweep_executors/k8s_dgd.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""
+K8sDgdExecutor -- DynamoGraphDeployment-based executor for k8s sweeps.
+
+Handles DGD backend switching, restart strategies, metrics capture,
+and aiperf invocation against a k8s-deployed frontend.
+
+When --deploy-template is provided, uses template rendering instead of
+DGD patching. This enables arbitrary backend deployments.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Optional
+
+from sweep_core.models import DeployDimension, RunResult, RunSpec, SweepConfig
+from sweep_k8s import aiperf as k8s_aiperf
+from sweep_k8s import dgd as k8s_dgd
+from sweep_k8s import template as k8s_template
+from sweep_k8s.kubectl import apply_secret_literal
+from sweep_k8s.metrics import capture_metrics
+
+
+class K8sDgdExecutor:
+    """Executor for k8s sweeps using DynamoGraphDeployment."""
+
+    def __init__(self) -> None:
+        self._config: Optional[SweepConfig] = None
+        self._template_path: Optional[Path] = None
+        self._incluster_endpoint: str = ""  # in-cluster service DNS for aiperf Jobs
+
+    def prepare(self, config: SweepConfig) -> None:
+        """Store config and validate k8s setup."""
+        self._config = config
+        k8s = config.k8s
+
+        if k8s.deploy and not k8s.deploy_template:
+            raise ValueError(
+                "--deploy requires --deploy-template; otherwise pre-deploy the DGD and omit --deploy"
+            )
+        if k8s.deploy_template and not k8s.deploy:
+            raise ValueError(
+                "--deploy-template mutates cluster resources; pass --deploy to allow template application"
+            )
+
+        if k8s.deploy_template:
+            self._template_path = Path(k8s.deploy_template)
+            if not self._template_path.exists():
+                raise FileNotFoundError(
+                    f"Deploy template not found: {self._template_path}"
+                )
+            print(f"  Using deploy template: {self._template_path}")
+
+        if k8s.hf_token:
+            print(
+                f"  Updating HuggingFace token secret: {k8s_template.DEFAULT_HF_TOKEN_SECRET_NAME}"
+            )
+            apply_secret_literal(
+                k8s_template.DEFAULT_HF_TOKEN_SECRET_NAME,
+                k8s.namespace,
+                "HF_TOKEN",
+                k8s.hf_token,
+            )
+
+        # Compute the in-cluster endpoint for aiperf Jobs.
+        # The user-provided --endpoint may be port-forwarded (e.g. localhost:18000),
+        # but aiperf Jobs run inside the cluster and need the service DNS name.
+        if k8s.dgd_name:
+            self._incluster_endpoint = f"{k8s.dgd_name}-frontend:{k8s.frontend_port}"
+        else:
+            self._incluster_endpoint = k8s.endpoint
+        print(f"  In-cluster endpoint for aiperf: {self._incluster_endpoint}")
+
+        # Wait for model to be ready before starting sweep.
+        # Skip when using deploy templates -- the deployment hasn't been applied yet.
+        if not self._template_path:
+            print("--- Pre-flight: waiting for frontend ---")
+            k8s_dgd.wait_model_ready(
+                self._incluster_endpoint,
+                config.model_name,
+                max_wait=300,
+                namespace=k8s.namespace,
+            )
+
+    def apply_deploy(
+        self,
+        deploy: DeployDimension,
+        prev: Optional[DeployDimension],
+    ) -> None:
+        """Apply a deployment change -- template-based or DGD patching."""
+        if self._config is None:
+            raise RuntimeError("prepare() must be called before apply_deploy()")
+        config = self._config
+        k8s = config.k8s
+
+        if self._template_path:
+            # Template-based deployment: render + apply
+            k8s_template.apply_rendered_template(self._template_path, deploy, config)
+            print("  Waiting for deployment to be ready...")
+            k8s_dgd.wait_model_ready(
+                self._incluster_endpoint,
+                config.model_name,
+                namespace=k8s.namespace,
+                max_wait=300,
+            )
+            return
+
+        # Legacy DGD patching
+        if not k8s.dgd_name:
+            print("  WARNING: no DGD name set for k8s mode; skipping deploy")
+            return
+
+        # Check if tokenizer changed from previous run
+        if prev is not None and deploy.tokenizer != prev.tokenizer:
+            # Tokenizer changed -- need to switch backend
+            k8s_dgd.dgd_switch_backend(
+                k8s.dgd_name,
+                k8s.namespace,
+                k8s.endpoint,
+                config.model_name,
+                deploy.tokenizer,
+            )
+            return
+
+        # First run or same tokenizer -- apply reset strategy
+        # (On first run the DGD is already deployed with the right backend;
+        #  we just reset to get a clean baseline for metrics.)
+        self._apply_reset_strategy()
+
+    def _apply_reset_strategy(self) -> None:
+        """Apply the configured reset strategy."""
+        if self._config is None:
+            raise RuntimeError(
+                "prepare() must be called before _apply_reset_strategy()"
+            )
+        k8s = self._config.k8s
+        strategy = k8s.reset_strategy
+
+        if strategy == "graph":
+            if k8s.dgd_name:
+                k8s_dgd.dgd_restart_graph(
+                    k8s.dgd_name,
+                    k8s.namespace,
+                    k8s.endpoint,
+                    self._config.model_name,
+                )
+            else:
+                print("  WARNING: graph reset requires --dgd-name")
+        elif strategy == "frontend":
+            if k8s.dgd_name:
+                k8s_dgd.dgd_restart_frontend(
+                    k8s.dgd_name,
+                    k8s.namespace,
+                    k8s.endpoint,
+                    self._config.model_name,
+                )
+            else:
+                print("  WARNING: frontend reset requires --dgd-name")
+        elif strategy == "none":
+            # Just wait for readiness
+            if k8s.dgd_name:
+                k8s_dgd.dgd_wait_all_ready(
+                    k8s.dgd_name,
+                    k8s.namespace,
+                    k8s.endpoint,
+                    self._config.model_name,
+                    max_wait=60,
+                )
+            else:
+                k8s_dgd.wait_model_ready(
+                    k8s.endpoint, self._config.model_name, max_wait=60
+                )
+        else:
+            print(f"  WARNING: unknown reset strategy '{strategy}'; skipping reset")
+
+    def execute_run(self, run_spec: RunSpec, run_dir: Path) -> RunResult:
+        """Execute a single k8s run: metrics capture + aiperf + post-metrics."""
+        if self._config is None:
+            raise RuntimeError("prepare() must be called before execute_run()")
+        config = self._config
+        k8s = config.k8s
+        aiperf = run_spec.aiperf
+
+        result = RunResult(run_spec=run_spec, run_dir=str(run_dir))
+        run_dir.mkdir(parents=True, exist_ok=True)
+
+        # Capture pre-run metrics (use in-cluster endpoint + kubectl exec fallback)
+        frontend_label = (
+            (
+                f"nvidia.com/dynamo-graph-deployment-name={k8s.dgd_name},"
+                f"nvidia.com/dynamo-component-type=frontend"
+            )
+            if k8s.dgd_name
+            else None
+        )
+        capture_metrics(
+            self._incluster_endpoint,
+            run_dir / "frontend_metrics_pre.txt",
+            namespace=k8s.namespace,
+            pod_label=frontend_label,
+        )
+
+        # Run aiperf as a k8s Job (uses in-cluster service endpoint)
+        success = k8s_aiperf.run_aiperf(
+            artifact_dir=run_dir / "aiperf",
+            endpoint=self._incluster_endpoint,
+            model_name=config.model_name,
+            concurrency=aiperf.concurrency,
+            isl=aiperf.isl,
+            namespace=k8s.namespace,
+            image=k8s.image,
+            run_id=run_spec.run_id,
+            osl=aiperf.osl,
+            benchmark_duration=aiperf.benchmark_duration,
+            num_requests=aiperf.num_requests,
+            request_rate=aiperf.request_rate,
+            export_level=k8s.export_level,
+            image_pull_secret=k8s.image_pull_secret,
+            hf_token_secret_name=k8s_template.DEFAULT_HF_TOKEN_SECRET_NAME,
+        )
+
+        result.status = "ok" if success else "fail"
+
+        # Capture post-run metrics
+        capture_metrics(
+            self._incluster_endpoint,
+            run_dir / "frontend_metrics_post.txt",
+            namespace=k8s.namespace,
+            pod_label=frontend_label,
+        )
+
+        # Parse aiperf results
+        _parse_k8s_aiperf_into_result(result, run_dir)
+
+        return result
+
+    def cleanup(self) -> None:
+        """No persistent state to clean up."""
+        pass
+
+
+def _parse_k8s_aiperf_into_result(result: RunResult, run_dir: Path) -> None:
+    """Parse aiperf results from k8s run directory."""
+    aiperf_json = run_dir / "aiperf" / "profile_export_aiperf.json"
+    if not aiperf_json.exists():
+        return
+
+    try:
+        data = json.loads(aiperf_json.read_text())
+        rt = data.get("request_throughput", {})
+        result.req_per_sec = rt.get("avg", 0) or 0
+        ot = data.get("output_token_throughput", {})
+        result.output_tok_per_sec = ot.get("avg", 0) or 0
+        ttft = data.get("time_to_first_token", data.get("ttft", {}))
+        if isinstance(ttft, dict):
+            result.ttft_p50_ms = ttft.get("p50", 0) or 0
+            result.ttft_p99_ms = ttft.get("p99", 0) or 0
+        itl = data.get("inter_token_latency", data.get("itl", {}))
+        if isinstance(itl, dict):
+            result.itl_p50_ms = itl.get("p50", 0) or 0
+            result.itl_p99_ms = itl.get("p99", 0) or 0
+        bd = data.get("benchmark_duration", 0)
+        result.duration_sec = bd.get("avg", 0) if isinstance(bd, dict) else (bd or 0)
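+    except (json.JSONDecodeError, AttributeError, KeyError, TypeError):
+        # Malformed or unexpected aiperf output: keep the default (zero) metrics.
+        # AttributeError covers non-dict values where a metrics dict is expected.
+        pass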
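
The parser above reads each metric under a long key and falls back to a short alias (`time_to_first_token` or `ttft`, `inter_token_latency` or `itl`). The sketch below shows one way to express that lookup with a small helper. The payload is made up for illustration, `metric` is a hypothetical helper, and it is slightly stricter than the original because it skips non-dict values instead of returning them.

```python
import json

# Made-up sample payload; only the key names follow the parser above.
sample = json.loads(
    '{"request_throughput": {"avg": 120.5},'
    ' "ttft": {"p50": 35.0, "p99": 80.0},'
    ' "benchmark_duration": 60}'
)


def metric(data: dict, *names: str) -> dict:
    """Return the first dict-valued entry found under any of the given keys."""
    for name in names:
        value = data.get(name)
        if isinstance(value, dict):
            return value
    return {}


req_per_sec = metric(sample, "request_throughput").get("avg", 0) or 0
ttft = metric(sample, "time_to_first_token", "ttft")
ttft_p50 = ttft.get("p50", 0) or 0
itl_p50 = metric(sample, "inter_token_latency", "itl").get("p50", 0) or 0

# benchmark_duration may be a plain number or a dict with an "avg" entry.
bd = sample.get("benchmark_duration", 0)
duration_sec = bd.get("avg", 0) if isinstance(bd, dict) else (bd or 0)

print(req_per_sec, ttft_p50, itl_p50, duration_sec)  # 120.5 35.0 0 60
```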