# Frontend Performance Profiling

Unified observability and benchmarking suite for Dynamo frontend performance.

## Quick Start

```bash
cd ~/dev/dynamo
source dynamo/bin/activate

# Single run (mocker + frontend + aiperf + Prometheus)
cd benchmarks/frontend/scripts
./run_perf.sh --model Qwen/Qwen3-0.6B --concurrency 32 --num-requests 640 \
    --speedup-ratio 0 --skip-bpf --skip-nsys --skip-flamegraph --skip-perf

# Sweep (multiple config points)
python3 sweep_runner.py --tokenizers hf --concurrency 32 --isl 512 \
    --benchmark-duration 30 --speedup-ratio 0 \
    -- --skip-bpf --skip-nsys --skip-flamegraph --skip-perf
```

## Architecture

The benchmarking suite has two layers: a Python sweep orchestrator that builds a grid of configurations, and a shell harness that executes each individual run.

```mermaid
flowchart TB
    subgraph Orchestrator ["sweep_runner.py (Python orchestrator)"]
        direction TB
        grid["Build sweep grid<br/>(tokenizers x concurrency x ISL x workers x models x rps)"]
        loop["For each config point"]
        collect["Collect results into CSV + summary.md"]
        report["Generate per-run report.md"]
        grid --> loop --> collect --> report
    end

    loop -- "invokes" --> run_perf

    subgraph run_perf ["run_perf.sh (per-run harness)"]
        direction TB
        infra["Step 0: Ensure etcd + NATS"]
        mockers["Step 1: Start mocker workers<br/>(N models x M workers)"]
        frontend["Step 2: Start frontend<br/>(optionally under nsys)"]
        ready["Step 3: Wait for /v1/models readiness"]
        captures["Step 4: Start parallel captures<br/>(perf stat, BPF, flamegraph, /proc, Prometheus)"]
        load["Step 5: aiperf load generation"]
        wait["Step 6: Wait for captures to finish"]
        export["Step 7: Final Prometheus snapshot + nsys export"]
        save["Step 8: Save config.json"]
        infra --> mockers --> frontend --> ready --> captures --> load --> wait --> export --> save
    end
```

### Runtime topology

During a benchmark run, the following processes are active. The frontend receives HTTP requests from aiperf, tokenizes the input, routes to a backend model via the request plane (TCP), and streams response tokens back to the client.

```mermaid
flowchart LR
    aiperf["aiperf<br/>(load generator)"]

    subgraph Frontend ["Frontend (Rust, port 8000)"]
        direction TB
        http["HTTP server<br/>/v1/chat/completions"]
        preprocess["Preprocess<br/>(template + tokenize)"]
        router["Router<br/>(model lookup)"]
        transport["Transport<br/>(TCP request plane)"]
        http --> preprocess --> router --> transport
    end

    subgraph Models ["Mocker Workers"]
        direction TB
        subgraph model1 ["model-1"]
            w1a["worker 1<br/>port 8081"]
            w1b["worker 2<br/>port 8082"]
        end
        subgraph model2 ["model-2"]
            w2a["worker 1<br/>port 8083"]
            w2b["worker 2<br/>port 8084"]
        end
    end

    subgraph Infra ["Infrastructure"]
        etcd["etcd<br/>(service discovery)"]
        nats["NATS<br/>(event plane)"]
    end

    subgraph Observability ["Parallel Captures"]
        prom["Prometheus<br/>(/metrics scraping)"]
        perf["perf stat<br/>(HW counters)"]
        nsys["nsys<br/>(NVTX + OS runtime)"]
        flame["flamegraph<br/>(CPU + off-CPU)"]
        bpf["BPF traces<br/>(kernel-level)"]
    end

    aiperf -- "HTTP/SSE" --> http
    transport -- "TCP" --> w1a & w1b & w2a & w2b
    Frontend -. "register/discover" .-> etcd
    Models -. "register/discover" .-> etcd
    Models -. "events" .-> nats
    Frontend -. "events" .-> nats
    prom -. "scrape" .-> Frontend & Models
    perf -. "attach" .-> Frontend
    nsys -. "profile" .-> Frontend
    flame -. "sample" .-> Frontend
    bpf -. "trace" .-> Frontend
```

### Multi-model naming

When `--num-models` is 1, the served model name matches the HF model path (e.g., `Qwen/Qwen3-0.6B`). When `--num-models` is greater than 1, each model instance gets a synthetic name (`model-1`, `model-2`, ...) but all share the same underlying `--model-path` for weights and tokenizer config.

## Prerequisites

| Tool | Required | Install |
|------|----------|---------|
| etcd | Yes | `apt install etcd` or [releases](https://github.com/etcd-io/etcd/releases) |
| nats-server | Yes | `apt install nats-server` or [nats.io](https://nats.io/download/) |
| aiperf | Yes | `uv pip install "git+https://github.com/ai-dynamo/aiperf.git@main"` (in dynamo venv) |
| jq | Yes | `apt install jq` |
| perf | Optional | `apt install linux-tools-$(uname -r)` |
| bpftrace | Optional | `apt install bpftrace` (needs root or CAP_BPF + CAP_PERFMON) |
| inferno | Optional | `cargo install inferno` (for flamegraphs) |
| nsys | Optional | NVIDIA Nsight Systems |

## sweep_runner.py

The main entry point for running performance sweeps. Iterates over a grid of configurations and delegates each point to `run_perf.sh`.

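
The grid expansion described above can be pictured as a Cartesian product over the comma-separated option values. The following is a minimal sketch, not the actual `sweep_runner.py` code: the `SweepPoint` type, the `parse_csv_ints` and `build_grid` helpers, and the run-name format (modelled on the `hf_c32_isl512_w2` directory name that appears elsewhere in this README) are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SweepPoint:
    """One configuration point (hypothetical type, for illustration only)."""

    tokenizer: str
    concurrency: int
    isl: int
    workers: int

    @property
    def run_name(self) -> str:
        # Assumed naming scheme, modelled on the `hf_c32_isl512_w2` example.
        return f"{self.tokenizer}_c{self.concurrency}_isl{self.isl}_w{self.workers}"


def parse_csv_ints(value: str) -> list[int]:
    """Split a comma-separated option value such as "32,64" into integers."""
    return [int(item) for item in value.split(",") if item.strip()]


def build_grid(tokenizers: str, concurrency: str, isl: str, workers: str) -> list[SweepPoint]:
    """Expand comma-separated option strings into the full Cartesian grid."""
    tokenizer_names = [name.strip() for name in tokenizers.split(",") if name.strip()]
    return [
        SweepPoint(tok, c, i, w)
        for tok, c, i, w in product(
            tokenizer_names,
            parse_csv_ints(concurrency),
            parse_csv_ints(isl),
            parse_csv_ints(workers),
        )
    ]


if __name__ == "__main__":
    # 2 tokenizers x 2 concurrency levels x 2 ISLs x 1 worker count
    grid = build_grid("hf,fastokens", "32,64", "512,1024", "2")
    print(f"{len(grid)} runs planned")
    for point in grid:
        print(point.run_name)
```

A sketch like this is also a quick way to estimate sweep cost before launching: the number of grid points multiplied by the benchmark duration plus cooldown gives a rough lower bound on wall-clock time.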
### Basic Usage

```bash
# Smoke test (1 run)
python3 sweep_runner.py --tokenizers hf --concurrency 32 --isl 512 \
    --benchmark-duration 30 --speedup-ratio 0 \
    -- --skip-bpf --skip-nsys --skip-flamegraph --skip-perf

# Full tokenizer comparison
python3 sweep_runner.py --tokenizers hf,fastokens \
    --concurrency 32,64 --isl 512,1024,2048 \
    --benchmark-duration 60 --speedup-ratio 0

# Transport saturation (vary workers and request count)
python3 sweep_runner.py --tokenizers hf --concurrency 4096 \
    --num-requests 16384,32768 --workers 1,2,4,8 --speedup-ratio 0

# Preview sweep plan without running
python3 sweep_runner.py --dry-run --tokenizers hf,fastokens \
    --concurrency 32,64 --isl 512,1024
```

### Multi-Model and Worker Sweeps

The `--num-models` and `--workers` flags control how many model instances and backend workers per model are launched. These are the primary knobs for studying frontend scalability under multi-tenant and parallel-worker configurations.

#### Scaling models (fixed workers per model)

Useful for measuring how adding more served models affects frontend routing, transport fan-out, and per-model latency.

```bash
# Sweep across 1, 2, 3, 4 model instances, 1 worker each, at 75 rps
for m in 1 2 3 4; do
    python3 sweep_runner.py \
        --tokenizers hf \
        --concurrency 512 \
        --isl 512 \
        --workers 1 \
        --num-models "$m" \
        --rps 75 \
        --benchmark-duration 60 \
        --speedup-ratio 0 \
        --output-dir "artifacts/sweep_models/m${m}" \
        -- --skip-bpf
done

# Compare results
for m in 1 2 3 4; do
    echo "=== m=$m ==="
    cat "artifacts/sweep_models/m${m}/summary.md"
    echo
done
```

#### Scaling workers per model (fixed model count)

Useful for measuring whether adding more backend workers relieves transport bottlenecks for a single model under heavy load.

```bash
# Sweep across 1, 2, 4, 8 workers for a single model
python3 sweep_runner.py \
    --tokenizers hf \
    --concurrency 512 \
    --isl 512 \
    --workers 1,2,4,8 \
    --num-models 1 \
    --rps 75 \
    --benchmark-duration 60 \
    --speedup-ratio 0 \
    --output-dir artifacts/sweep_workers \
    -- --skip-bpf
```

#### Combined model + worker grid

For a full factorial sweep over both dimensions, pass a comma-separated list to `--workers` and loop externally over `--num-models`, which accepts only one value per invocation. Each combination produces a separate run.

```bash
# 2x3 grid: (1 model, 2 models) x (1, 2, 4 workers)
for m in 1 2; do
    python3 sweep_runner.py \
        --tokenizers hf \
        --concurrency 256 \
        --isl 512 \
        --workers 1,2,4 \
        --num-models "$m" \
        --rps 50 \
        --benchmark-duration 60 \
        --speedup-ratio 0 \
        --output-dir "artifacts/sweep_grid/m${m}" \
        -- --skip-bpf
done
```

> **Note:** `--num-models` is a single integer (not comma-separated). To sweep across model counts, loop externally as shown in the "Scaling models" example above.

#### What to look for in the results

| Metric | Where to find it | What it tells you |
|--------|-----------------|-------------------|
| Req/s and Tok/s | `summary.md` | Whether the frontend can sustain the target load |
| TTFT p50/p99 | `summary.md` | End-to-end first-token latency (includes preprocess + routing + transport) |
| `transport_roundtrip` p50 | `report.md` section 4 | Time spent in the TCP request plane; grows when workers are saturated |
| Tokio worker busy ratio | `report.md` section 7 | Fraction of time each async worker is busy; values above 0.95 indicate saturation |
| Event loop stalls | `report.md` section 7 | How often the Tokio runtime stalled; high counts suggest blocking work on the async executor |
| `preprocess.tokenize` | `report.md` section 5 (NVTX) | Per-request tokenization cost; varies by tokenizer backend |

### With Profilers

```bash
# With perf stat + flamegraphs (no root needed)
python3 sweep_runner.py --tokenizers hf --concurrency 64 --isl 1024 \
    --benchmark-duration 60 --speedup-ratio 0

# With everything including BPF (needs sudo)
sudo -E python3 sweep_runner.py --tokenizers hf --concurrency 64 --isl 1024 \
    --benchmark-duration 60 --speedup-ratio 0

# nsys profiling (needs nsys in PATH)
python3 sweep_runner.py --tokenizers hf --concurrency 64 --isl 1024 \
    --benchmark-duration 60 --speedup-ratio 0 \
    -- --nsys-path /opt/nvidia/nsight-systems/bin/nsys
```

Profiler controls are passed through to `run_perf.sh` after `--`:

| Flag | Effect |
|------|--------|
| `--skip-bpf` | Skip BPF tracing |
| `--skip-nsys` | Skip Nsight Systems |
| `--skip-flamegraph` | Skip CPU/off-CPU flamegraphs |
| `--skip-perf` | Skip perf stat hardware counters |

### All Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model` | `Qwen/Qwen3-0.6B` | HF model path |
| `--backend` | `mocker` | Engine: `mocker` (synthetic) or `vllm` |
| `--tokenizers` | `hf,fastokens` | Comma-separated tokenizer backends |
| `--concurrency` | `50,100,200` | Comma-separated concurrency levels |
| `--isl` | `512,1024,2048` | Comma-separated input sequence lengths |
| `--osl` | `256` | Output sequence length |
| `--workers` | `2` | Comma-separated worker counts per model |
| `--num-models` | `1` | Number of model instances (each gets `--workers` workers) |
| `--rps` | - | Comma-separated target request rates (req/s) |
| `--aiperf-targets` | `first` | `first`: model-1 only. `all`: run aiperf for each model |
| `--speedup-ratio` | `1.0` | Mocker speedup (0 = infinite) |
| `--benchmark-duration` | `60` | aiperf run duration (seconds) |
| `--num-requests` | - | Comma-separated request counts (overrides duration) |
| `--output-dir` | auto | Output directory |
| `--max-consecutive-fails` | `2` | Skip remaining ISLs after N failures |
| `--cooldown` | `3` | Seconds between runs |
| `--dry-run` | - | Print plan without executing |
| `--no-report` | - | Skip per-run report generation |

## run_perf.sh

Low-level per-run harness. Normally called by `sweep_runner.py`, but can be used directly for single runs.

```bash
# Minimal (no profilers)
./run_perf.sh --model Qwen/Qwen3-0.6B --concurrency 32 --num-requests 640 \
    --speedup-ratio 0 --skip-bpf --skip-nsys --skip-flamegraph --skip-perf

# Full observability (needs sudo for BPF)
sudo -E ./run_perf.sh --model Qwen/Qwen3-0.6B --concurrency 64 \
    --benchmark-duration 60 --speedup-ratio 0

# Multi-model with 2 workers each
./run_perf.sh --model Qwen/Qwen3-0.6B --num-models 2 --workers 2 \
    --concurrency 32 --benchmark-duration 30 --speedup-ratio 0 \
    --skip-bpf --skip-nsys --skip-flamegraph --skip-perf

# 4 models, 1 worker each, rate-limited to 75 rps
./run_perf.sh --model Qwen/Qwen3-0.6B --num-models 4 --workers 1 \
    --concurrency 512 --benchmark-duration 60 --request-rate 75 \
    --speedup-ratio 0 --skip-bpf
```

## Analyzing Results

```bash
# Per-run report (generated automatically by sweep_runner.py)
python3 analysis/create_report.py analyze artifacts/sweep_/hf_c32_isl512_w2

# Auto-find latest run
python3 analysis/create_report.py analyze

# Prometheus delta (initial vs final snapshot)
diff <(grep "^dynamo_frontend" artifacts/.../prometheus/initial_snapshot.txt | sort) \
     <(grep "^dynamo_frontend" artifacts/.../prometheus/final_snapshot.txt | sort)

# nsys SQLite queries (when nsys was enabled)
sqlite3 artifacts/.../nsys/frontend.sqlite \
    "SELECT name, COUNT(*), ROUND(AVG(end-start)/1e3,1) as avg_us
     FROM NVTX_EVENTS WHERE end > start
     GROUP BY name ORDER BY avg_us DESC"
```

## Output Structure

```text
artifacts/sweep_YYYYMMDD_HHMMSS/
  results.csv                        Sweep results (all runs)
  summary.md                         Comparison table
  hf_c32_isl512_w2/                  Per-run directory
    config.json                      Run parameters
    report.md                        Analysis report
    aiperf/
      profile_export_aiperf.json     aiperf metrics
    prometheus/
      initial_snapshot.txt           Pre-load metrics
      final_snapshot.txt             Post-load metrics
      timeseries.jsonl               Per-second scrapes
    system/
      thread_count.txt               Thread count over time
      fd_count.txt                   FD count over time
      proc_status.txt                /proc/PID/status snapshots
    logs/
      frontend.log
      mocker_*.log
    perf/                            (unless --skip-perf)
      perf_stat.txt
      cpu_flamegraph.svg
    bpf/                             (unless --skip-bpf, needs root)
      runqlat.txt
      syscall_latency.txt
      ...
    nsys/                            (unless --skip-nsys)
      frontend.nsys-rep
      frontend.sqlite
```
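
As an alternative to the `diff` one-liner under Analyzing Results, the two Prometheus snapshots in a run directory can be compared numerically. This is a minimal sketch under stated assumptions, not part of the suite: it assumes the snapshots use the standard Prometheus text exposition format, it handles only simple `name{labels} value` lines (no trailing timestamps), and the `parse_snapshot` and `snapshot_delta` helpers are hypothetical names.

```python
import math
import sys


def parse_snapshot(text: str, prefix: str = "dynamo_frontend") -> dict[str, float]:
    """Map each sample (metric name plus labels) to its value."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and metrics outside the prefix.
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        # Assumption: the value is the last whitespace-separated field.
        # Splitting from the right keeps label values with spaces intact.
        series, _, value = line.rpartition(" ")
        try:
            samples[series.strip()] = float(value)
        except ValueError:
            continue  # line is beyond what this simple parser understands
    return samples


def snapshot_delta(initial: dict[str, float], final: dict[str, float]) -> dict[str, float]:
    """Return final - initial for every series whose value changed.

    Series missing from the initial snapshot are treated as starting at 0,
    which is the natural reading for counters created during the run.
    """
    deltas: dict[str, float] = {}
    for series, end_value in final.items():
        delta = end_value - initial.get(series, 0.0)
        if math.isfinite(delta) and delta != 0.0:
            deltas[series] = delta
    return deltas


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: prom_delta.py INITIAL_SNAPSHOT FINAL_SNAPSHOT")
    with open(sys.argv[1]) as f:
        before = parse_snapshot(f.read())
    with open(sys.argv[2]) as f:
        after = parse_snapshot(f.read())
    # Largest absolute changes first.
    for series, delta in sorted(snapshot_delta(before, after).items(), key=lambda kv: -abs(kv[1])):
        print(f"{delta:>16.3f}  {series}")
```

Saved under a name of your choosing (`prom_delta.py` here is only a placeholder), the script takes the paths to `initial_snapshot.txt` and `final_snapshot.txt` as its two arguments. Deltas are meaningful for counters and histogram buckets; for gauges the difference only shows the net change between the two scrape instants.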