The benchmarking suite has two layers: a Python sweep orchestrator that builds a grid of configurations, and a shell harness that executes each individual run.
export["Step 7: Final Prometheus snapshot + nsys export"]
save["Step 8: Save config.json"]
infra --> mockers --> frontend --> ready --> captures --> load --> wait --> export --> save
end
```
### Runtime topology
During a benchmark run, the following processes are active. The frontend receives HTTP requests from aiperf, tokenizes the input, routes each request to a backend model over the request plane (TCP), and streams response tokens back to the client.
When `--num-models` is 1, the served model name matches the HF model path (e.g., `Qwen/Qwen3-0.6B`). When `--num-models` is greater than 1, each model instance gets a synthetic name (`model-1`, `model-2`, ...) but all share the same underlying `--model-path` for weights and tokenizer config.
The `--num-models` and `--workers` flags control how many model instances and backend workers per model are launched. These are the primary knobs for studying frontend scalability under multi-tenant and parallel-worker configurations.
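The naming rule and the resulting process count can be restated as a small shell sketch. The helper names `served_model_names` and `total_workers` are hypothetical and not part of the suite; the sketch assumes the total number of backend workers is simply model instances times workers per model.

```bash
# Illustrative helpers (hypothetical, not part of the suite): they restate
# the naming rule and the worker count described above.

# Print the served model name(s) for a given --model-path and --num-models.
served_model_names() {
  model_path=$1
  num_models=$2
  if [ "$num_models" -eq 1 ]; then
    # Single model: the served name is the HF model path itself.
    echo "$model_path"
  else
    # Multiple models: synthetic names, all backed by the same --model-path.
    i=1
    while [ "$i" -le "$num_models" ]; do
      echo "model-$i"
      i=$((i + 1))
    done
  fi
}

# Assumed total: model instances x backend workers per model.
total_workers() {
  echo $(( $1 * $2 ))
}

served_model_names Qwen/Qwen3-0.6B 1
served_model_names Qwen/Qwen3-0.6B 3
total_workers 3 2
```

With these inputs the first call prints `Qwen/Qwen3-0.6B`, the second prints `model-1`, `model-2`, `model-3` on separate lines, and the last prints `6`.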
#### Scaling models (fixed workers per model)
Useful for measuring how adding more served models affects frontend routing, transport fan-out, and per-model latency.
```bash
# Sweep across 1, 2, 3, 4 model instances, 1 worker each, at 75 rps
for m in 1 2 3 4; do
  python3 sweep_runner.py \
    --tokenizers hf \
    --concurrency 512 \
    --isl 512 \
    --workers 1 \
    --num-models "$m" \
    --rps 75 \
    --benchmark-duration 60 \
    --speedup-ratio 0 \
    --output-dir "artifacts/sweep_models/m${m}" \
    -- --skip-bpf
done
# Compare results
for m in 1 2 3 4; do
  echo "=== m=$m ==="
  cat "artifacts/sweep_models/m${m}/summary.md"
  echo
done
```
#### Scaling workers per model (fixed model count)
Useful for measuring whether adding more backend workers relieves transport bottlenecks for a single model under heavy load.
```bash
# Sweep across 1, 2, 4, 8 workers for a single model
python3 sweep_runner.py \
  --tokenizers hf \
  --concurrency 512 \
  --isl 512 \
  --workers 1,2,4,8 \
  --num-models 1 \
  --rps 75 \
  --benchmark-duration 60 \
  --speedup-ratio 0 \
  --output-dir artifacts/sweep_workers \
  -- --skip-bpf
```
#### Combined model + worker grid
For a full factorial sweep over both dimensions, pass a comma-separated list to `--workers` and vary `--num-models` across invocations. Each combination produces a separate run.
> **Note:** `--num-models` is a single integer (not comma-separated). To sweep across model counts, loop externally as shown in the "Scaling models" example above.
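A minimal sketch of such a grid, assuming the same flags as the examples above: an outer loop over model counts, with `--workers 1,2,4,8` expanded by the sweep runner inside each invocation. The `run_grid` function, the `DRY_RUN` switch, and the `artifacts/sweep_grid` output path are illustrative assumptions, not part of the suite. By default the sketch only prints the commands so the grid can be inspected before anything is launched.

```bash
# Sketch: model x worker grid. --num-models takes one integer per
# invocation, so it is looped here; --workers takes the comma-separated list.
# DRY_RUN, run_grid, and artifacts/sweep_grid are hypothetical names.
set -eu

run_grid() {
  for m in 1 2 3 4; do
    # Build the command line for this model count as positional parameters.
    set -- python3 sweep_runner.py \
      --tokenizers hf \
      --concurrency 512 \
      --isl 512 \
      --workers 1,2,4,8 \
      --num-models "$m" \
      --rps 75 \
      --benchmark-duration 60 \
      --speedup-ratio 0 \
      --output-dir "artifacts/sweep_grid/m${m}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "$*"   # print the command only
    else
      "$@"        # execute the sweep for this model count
    fi
  done
}

run_grid
```

Set `DRY_RUN=0` to execute the sweeps instead of printing them. Writing each model count to its own output directory keeps the per-run `summary.md` files from overwriting one another.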
#### What to look for in the results
| Metric | Where to find it | What it tells you |
|--------|-----------------|-------------------|
| Req/s and Tok/s | `summary.md` | Whether the frontend can sustain the target load |