<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->
# Frontend Performance Benchmark Suite
A configurable sweep runner for measuring Dynamo frontend model serving performance. It drives [aiperf](https://github.com/ai-dynamo/aiperf) load against a frontend/mocker (or frontend/vLLM) stack and collects throughput, latency, and observability data across a grid of parameters.
The primary use case is **HuggingFace tokenizer vs. fastokens comparison** -- sweeping across concurrency levels, input sequence lengths (ISL), and worker counts to quantify the tokenizer's impact on end-to-end performance.
---
## Architecture
The codebase follows a three-layer design that separates pure logic from execution and infrastructure concerns.
| Layer | Package | Responsibility |
|-------|---------|----------------|
| **Core** | `scripts/sweep_core/` | Pure data models, plan construction, artifact writing, reporting. No subprocess or kubectl calls. |
| **Executors** | `scripts/sweep_executors/` | `SweepExecutor` protocol with two implementations -- `LocalExecutor` (delegates to `run_perf.sh`) and `K8sDgdExecutor` (DynamoGraphDeployment-based k8s runs). |
The entry point is `scripts/sweep_runner.py`, a thin CLI that wires the three layers together: it builds a `SweepPlan` from CLI arguments, selects an executor based on `--mode`, and feeds the plan to the orchestrator.
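To make the executor layer concrete, here is a minimal sketch of how a `SweepExecutor` protocol and `--mode` dispatch could look. Only the class names `SweepExecutor`, `LocalExecutor`, `K8sDgdExecutor`, and `RunSpec` come from this README; the `run` method, the `RunSpec` fields, the `select_executor` helper, and the `local` mode name are assumptions for illustration and may not match the real code.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class RunSpec:
    """One point in the sweep grid (fields are illustrative assumptions)."""
    tokenizer: str
    concurrency: int
    isl: int


class SweepExecutor(Protocol):
    """Anything that can execute a single run and report success."""

    def run(self, spec: RunSpec) -> bool: ...


class LocalExecutor:
    def run(self, spec: RunSpec) -> bool:
        # The real executor delegates to run_perf.sh; this stub only logs.
        print(f"[local] {spec}")
        return True


class K8sDgdExecutor:
    def run(self, spec: RunSpec) -> bool:
        # The real executor drives a DGD and an aiperf Job; stubbed here.
        print(f"[k8s] {spec}")
        return True


def select_executor(mode: str) -> SweepExecutor:
    """Map a --mode value to an executor ("local" is an assumed mode name)."""
    executors = {"local": LocalExecutor, "k8s": K8sDgdExecutor}
    try:
        return executors[mode]()
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
```

Because the protocol is structural, the orchestrator can depend on `SweepExecutor` alone and never import an executor implementation directly.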
**Data flow:**
```
CLI args --> SweepConfig --> SweepPlan (Cartesian grid of RunSpecs)
                                 |
                           Orchestrator
                                 |
                  LocalExecutor or K8sDgdExecutor
                        |                |
                   run_perf.sh   DGD + aiperf Job
                        |                |
                   artifacts/        artifacts/
```
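The "Cartesian grid of RunSpecs" step can be sketched with `itertools.product`. This is an illustration under assumptions, not the actual `sweep_core` code: the field names and the `build_plan` helper are hypothetical, and the real `SweepPlan` likely carries more dimensions and metadata.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SweepConfig:
    """Sweep dimensions as parsed from CLI args (field names are assumed)."""
    tokenizers: tuple[str, ...]
    concurrency: tuple[int, ...]
    isl: tuple[int, ...]
    workers: tuple[int, ...] = (1,)


@dataclass(frozen=True)
class RunSpec:
    """A single benchmark run: one value chosen from each dimension."""
    tokenizer: str
    concurrency: int
    isl: int
    workers: int


def build_plan(config: SweepConfig) -> list[RunSpec]:
    """Expand the config into every combination of its dimensions."""
    grid = product(config.tokenizers, config.concurrency, config.isl, config.workers)
    return [RunSpec(t, c, i, w) for t, c, i, w in grid]


plan = build_plan(SweepConfig(("hf", "fastokens"), (50, 100), (512,)))
print(len(plan))  # 2 tokenizers x 2 concurrency levels x 1 ISL x 1 worker count = 4
```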
---
## Quick Start -- Local
Local mode starts a mocker backend and frontend process on the current machine, runs aiperf against them, and tears everything down between runs.
**Prerequisites:**
- `dynamo.mocker` and `dynamo.frontend` installed (from the Dynamo repo)
- `aiperf` installed and on `$PATH`
- A HuggingFace model accessible locally (default: `Qwen/Qwen3-0.6B`)
Results are written to `artifacts/sweep_<timestamp>/`.
---
## Quick Start -- Kubernetes
K8s mode deploys a DynamoGraphDeployment (DGD) into a Kubernetes namespace and launches aiperf as an in-cluster Job that targets the frontend service endpoint.
### Prerequisites
1. **Namespace** -- a dedicated namespace for the benchmark (default: `dynamo-bench`).
2. **HuggingFace token secret** -- a Kubernetes Secret named `hf-token-secret`
   containing your HF token, if the model requires authentication.
3. **Model cache PVC** -- a PersistentVolumeClaim for caching model weights
   (avoids repeated downloads across runs).
4. **DGD deployed** -- either pre-deploy the DGD yourself, or use the
   `--deploy --deploy-template` flags to let the sweep runner create it.
5. **kubectl** configured with access to the target cluster and namespace.
### Example: mocker backend
```bash
python3 sweep_runner.py \
    --mode k8s \
    --dgd-name dynamo-bench-mocker \
    --tokenizers hf,fastokens \
    --concurrency 50,100 \
    --isl 512
```
### Example: template-based deployment
When `--deploy-template` is provided, the runner renders the template with per-run variables (tokenizer, workers, model, etc.) and applies it via kubectl before each run group:
```bash
python3 sweep_runner.py \
    --mode k8s \
    --deploy \
    --deploy-template dgd/templates/mocker.yaml \
    --dgd-name dynamo-bench-mocker \
    --image nvcr.io/.../image:tag \
    --tokenizers hf,fastokens \
    --concurrency 50,100 \
    --isl 512
```
### How aiperf runs in-cluster
The sweep runner creates a short-lived Kubernetes Job in the same namespace as the DGD. The Job pod runs `aiperf` against the frontend's in-cluster service DNS name (e.g., `dynamo-bench-mocker-frontend:8000`). Once the Job completes, artifacts are copied back to the local host via `kubectl cp`.
### Reset strategy
Between runs, the `--reset-strategy` flag controls how the deployed stack is
recycled:
| Strategy | Behavior |
|----------|----------|
| `none` | No resets; runs back-to-back on the same deployment. |
| `frontend` | Restart only the frontend pod between runs. |
| `graph` (default) | Redeploy the entire DGD graph between run groups. |

The RPS sweep examples that follow use these flags:

| Flag | Setting | Rationale |
|------|---------|-----------|
| `--rps 10,20,...,100` | Sweep dimension | Each run targets a fixed request rate. aiperf uses `--request-rate` to cap submission. |
| `--concurrency 200` | High ceiling | Maximum in-flight requests. Set high so aiperf can sustain the target RPS without being limited by available connection slots. This is NOT a sweep dimension. |
| `--osl 256` | Fixed OSL | Consistent output length across all runs. |
| `--benchmark-duration 60` | 60s per point | Long enough for vLLM scheduling to stabilize. |
| `--reset-strategy frontend` | Frontend-only | Resets Prometheus counters between runs, but keeps vLLM workers alive with warm KV caches and CUDA graphs. Avoids the ~90s full DGD restart per point. |
| `--isolation reuse_by_deploy_key` | Reuse deployment | Since tokenizer=hf is constant, no DGD restart between runs. Only a frontend pod restart for clean metrics. |
| `--max-consecutive-fails 2` | Auto-stop | After 2 consecutive failures at a given RPS, remaining higher RPS values are skipped. |
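The auto-stop behavior of `--max-consecutive-fails` in the table above can be sketched as a simple loop. This is one plausible reading, not the runner's actual logic: it assumes RPS values are visited in ascending order and that any successful run resets the failure counter. The `run_rps_sweep` function and the `run_point` callback are hypothetical names.

```python
from typing import Callable, Iterable


def run_rps_sweep(
    rps_values: Iterable[int],
    run_point: Callable[[int], bool],
    max_consecutive_fails: int = 2,
) -> dict[int, str]:
    """Run each RPS point in ascending order, skipping the rest after repeated failures."""
    statuses: dict[int, str] = {}
    consecutive_fails = 0
    for rps in sorted(rps_values):
        if consecutive_fails >= max_consecutive_fails:
            # The stack is assumed saturated; higher rates are not attempted.
            statuses[rps] = "skipped"
            continue
        if run_point(rps):
            statuses[rps] = "ok"
            consecutive_fails = 0  # a success resets the streak
        else:
            statuses[rps] = "failed"
            consecutive_fails += 1
    return statuses


# Pretend every point at or above 30 RPS fails.
print(run_rps_sweep([10, 20, 30, 40, 50], lambda rps: rps < 30))
# {10: 'ok', 20: 'ok', 30: 'failed', 40: 'failed', 50: 'skipped'}
```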
### Follow-up: fastokens comparison
Once you have the baseline, run the same sweep with fastokens to see if the
saturation point shifts:
```bash
python3 sweep_runner.py --mode k8s \
    --dgd-name gpt-oss-20b-bench \
    --namespace <namespace> \
    --endpoint gpt-oss-20b-bench-frontend:8000 \
    --model openai/gpt-oss-20b \
    --backend vllm \
    --image <your-image> \
    --tokenizers fastokens \
    --concurrency 200 \
    --rps 10,20,30,40,50,60,70,80,90,100 \
    --isl 6144 \
    --osl 256 \
    --benchmark-duration 60 \
    --reset-strategy frontend \
    --isolation reuse_by_deploy_key \
    --worker-replicas 4 \
    --max-consecutive-fails 2
```
### Fine-grained sweep around the inflection
If the baseline shows saturation between, say, RPS=40 and RPS=60, rerun the sweep with a finer RPS grid over that range:
```bash
python3 sweep_runner.py --mode k8s \
    ... \
    --rps 35,40,45,50,55,60 \
    --reset-strategy frontend \
    --isolation reuse_by_deploy_key
---
## Reading the Results
The sweep produces `results.csv` and `summary.md` in the output directory.
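Once `results.csv` exists, a short script can flag where achieved throughput stops tracking the target rate. The sketch below is an assumption-heavy illustration: the column names `rps` and `actual_rps`, the 95% tolerance, and the `find_saturation_rps` helper are all hypothetical, and the sample rows are made-up values rather than measurements. Adapt the column names to the real CSV header.

```python
import csv
import io


def find_saturation_rps(rows, tolerance=0.95):
    """Return the lowest target RPS whose achieved rate falls below tolerance * target."""
    for row in sorted(rows, key=lambda r: float(r["rps"])):
        target = float(row["rps"])
        achieved = float(row["actual_rps"])
        if achieved < tolerance * target:
            return target
    return None  # throughput kept up with every target in the sweep


# Synthetic sample data for illustration only (not real benchmark output).
SAMPLE = """rps,actual_rps
10,10.0
20,19.9
30,29.6
40,33.1
50,33.4
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(find_saturation_rps(rows))  # 40.0
```

To use it on a real sweep, replace the `io.StringIO(SAMPLE)` source with an open file handle to `results.csv` in the output directory.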
### Identifying the saturation point
Look for these signals in the CSV:
| RPS | Actual Req/s | TTFT p50 | TTFT p99 | ITL p99 | Status |