# Frontend Performance Profiling

Unified observability and benchmarking suite for Dynamo frontend performance.

## Quick Start

```bash
cd ~/dev/dynamo
source dynamo/bin/activate

# Single run (mocker + frontend + aiperf + Prometheus)
cd benchmarks/frontend/scripts
./run_perf.sh --model Qwen/Qwen3-0.6B --concurrency 32 --num-requests 640 \
    --speedup-ratio 1000000 --skip-bpf --skip-nsys --skip-flamegraph --skip-perf

# Sweep (multiple config points)
python3 sweep_runner.py --tokenizers hf --concurrency 32 --isl 512 \
    --benchmark-duration 30 --speedup-ratio 1000000 \
    -- --skip-bpf --skip-nsys --skip-flamegraph --skip-perf
```

## Architecture

The benchmarking suite has two layers: a Python sweep orchestrator that builds a grid of configurations, and a shell harness that executes each individual run.

```mermaid
flowchart TB
    subgraph Orchestrator ["sweep_runner.py (Python orchestrator)"]
        direction TB
        grid["Build sweep grid<br/>(tokenizers x concurrency x ISL x workers x models x rps)"]
        loop["For each config point"]
        collect["Collect results into CSV + summary.md"]
        report["Generate per-run report.md"]
        grid --> loop --> collect --> report
    end

    loop -- "invokes" --> run_perf

    subgraph run_perf ["run_perf.sh (per-run harness)"]
        direction TB
        infra["Step 0: Ensure etcd + NATS"]
        mockers["Step 1: Start mocker workers<br/>(N models x M workers)"]
        frontend["Step 2: Start frontend<br/>(optionally under nsys)"]
        ready["Step 3: Wait for /v1/models readiness"]
        captures["Step 4: Start parallel captures<br/>(perf stat, BPF, flamegraph, /proc, Prometheus)"]
        load["Step 5: aiperf load generation"]
        wait["Step 6: Wait for captures to finish"]
        export["Step 7: Final Prometheus snapshot + nsys export"]
        save["Step 8: Save config.json"]
        infra --> mockers --> frontend --> ready --> captures --> load --> wait --> export --> save
    end
```
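
The "Build sweep grid" step amounts to a Cartesian product over the swept dimensions. The following is a minimal sketch of that idea, not the actual `sweep_runner.py` code: `SweepPoint`, `parse_csv_ints`, and `build_grid` are hypothetical names, and only four of the dimensions are covered for brevity.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SweepPoint:
    """One configuration point in the sweep (hypothetical structure)."""
    tokenizer: str
    concurrency: int
    isl: int
    workers: int


def parse_csv_ints(value: str) -> list[int]:
    """Turn a comma-separated flag value such as '32,64' into [32, 64]."""
    return [int(item) for item in value.split(",") if item.strip()]


def build_grid(tokenizers: str, concurrency: str, isl: str, workers: str) -> list[SweepPoint]:
    """Expand comma-separated flag values into every combination."""
    tokenizer_list = [t.strip() for t in tokenizers.split(",") if t.strip()]
    return [
        SweepPoint(tok, c, i, w)
        for tok, c, i, w in product(
            tokenizer_list,
            parse_csv_ints(concurrency),
            parse_csv_ints(isl),
            parse_csv_ints(workers),
        )
    ]


grid = build_grid("hf,fastokens", "32,64", "512,1024,2048", "2")
print(len(grid))  # 2 tokenizers x 2 concurrency levels x 3 ISLs x 1 worker count = 12
```

Because the run count is the product of the list lengths, adding one more value to any dimension multiplies the sweep duration accordingly; `--dry-run` is the supported way to check the real plan before starting.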

### Runtime topology

During a benchmark run, the following processes are active. The frontend receives HTTP requests from aiperf, tokenizes the input, routes to a backend model via the request plane (TCP), and streams response tokens back to the client.

```mermaid
flowchart LR
    aiperf["aiperf<br/>(load generator)"]

    subgraph Frontend ["Frontend (Rust, port 8000)"]
        direction TB
        http["HTTP server<br/>/v1/chat/completions"]
        preprocess["Preprocess<br/>(template + tokenize)"]
        router["Router<br/>(model lookup)"]
        transport["Transport<br/>(TCP request plane)"]
        http --> preprocess --> router --> transport
    end

    subgraph Models ["Mocker Workers"]
        direction TB
        subgraph model1 ["model-1"]
            w1a["worker 1<br/>port 8081"]
            w1b["worker 2<br/>port 8082"]
        end
        subgraph model2 ["model-2"]
            w2a["worker 1<br/>port 8083"]
            w2b["worker 2<br/>port 8084"]
        end
    end

    subgraph Infra ["Infrastructure"]
        etcd["etcd<br/>(service discovery)"]
        nats["NATS<br/>(event plane)"]
    end

    subgraph Observability ["Parallel Captures"]
        prom["Prometheus<br/>(/metrics scraping)"]
        perf["perf stat<br/>(HW counters)"]
        nsys["nsys<br/>(NVTX + OS runtime)"]
        flame["flamegraph<br/>(CPU + off-CPU)"]
        bpf["BPF traces<br/>(kernel-level)"]
    end

    aiperf -- "HTTP/SSE" --> http
    transport -- "TCP" --> w1a & w1b & w2a & w2b
    Frontend -. "register/discover" .-> etcd
    Models -. "register/discover" .-> etcd
    Models -. "events" .-> nats
    Frontend -. "events" .-> nats
    prom -. "scrape" .-> Frontend & Models
    perf -. "attach" .-> Frontend
    nsys -. "profile" .-> Frontend
    flame -. "sample" .-> Frontend
    bpf -. "trace" .-> Frontend
```

### Multi-model naming

When `--num-models` is 1, the served model name matches the HF model path (e.g., `Qwen/Qwen3-0.6B`). When `--num-models` is greater than 1, each model instance gets a synthetic name (`model-1`, `model-2`, ...) but all share the same underlying `--model-path` for weights and tokenizer config.
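
The naming rule can be written down as a small function. This is an illustrative sketch only; `served_model_names` is a hypothetical helper, not a function from the scripts.

```python
def served_model_names(model: str, num_models: int) -> list[str]:
    """Return the served model names under the naming rule described above."""
    if num_models < 1:
        raise ValueError("num_models must be at least 1")
    if num_models == 1:
        # A single model is served under its HF model path.
        return [model]
    # Multiple models get synthetic names model-1, model-2, ...
    return [f"model-{i}" for i in range(1, num_models + 1)]


print(served_model_names("Qwen/Qwen3-0.6B", 1))  # ['Qwen/Qwen3-0.6B']
print(served_model_names("Qwen/Qwen3-0.6B", 3))  # ['model-1', 'model-2', 'model-3']
```

These are the names a client would pass in the `model` field of a request when targeting a specific instance.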

## Prerequisites

| Tool | Required | Install |
|------|----------|---------|
| etcd | Yes | `apt install etcd` or [releases](https://github.com/etcd-io/etcd/releases) |
| nats-server | Yes | `apt install nats-server` or [nats.io](https://nats.io/download/) |
| aiperf | Yes | `uv pip install "git+https://github.com/ai-dynamo/aiperf.git@main"` (in dynamo venv) |
| jq | Yes | `apt install jq` |
| perf | Optional | `apt install linux-tools-$(uname -r)` |
| bpftrace | Optional | `apt install bpftrace` (needs root or CAP_BPF + CAP_PERFMON) |
| inferno | Optional | `cargo install inferno` (for flamegraphs) |
| nsys | Optional | NVIDIA Nsight Systems |

## sweep_runner.py

The main entry point for running performance sweeps. Iterates over a grid of configurations and delegates each point to `run_perf.sh`.

### Basic Usage

```bash
# Smoke test (1 run)
python3 sweep_runner.py --tokenizers hf --concurrency 32 --isl 512 \
    --benchmark-duration 30 --speedup-ratio 1000000 \
    -- --skip-bpf --skip-nsys --skip-flamegraph --skip-perf

# Full tokenizer comparison
python3 sweep_runner.py --tokenizers hf,fastokens \
    --concurrency 32,64 --isl 512,1024,2048 \
    --benchmark-duration 60 --speedup-ratio 1000000

# Transport saturation (vary workers and request count)
python3 sweep_runner.py --tokenizers hf --concurrency 4096 \
    --num-requests 16384,32768 --workers 1,2,4,8 --speedup-ratio 1000000

# Preview sweep plan without running
python3 sweep_runner.py --dry-run --tokenizers hf,fastokens \
    --concurrency 32,64 --isl 512,1024
```

### Multi-Model and Worker Sweeps

The `--num-models` and `--workers` flags control how many model instances and backend workers per model are launched. These are the primary knobs for studying frontend scalability under multi-tenant and parallel-worker configurations.

#### Scaling models (fixed workers per model)

Useful for measuring how adding more served models affects frontend routing, transport fan-out, and per-model latency.

```bash
# Sweep across 1, 2, 3, 4 model instances, 1 worker each, at 75 rps
for m in 1 2 3 4; do
    python3 sweep_runner.py \
        --tokenizers hf \
        --concurrency 512 \
        --isl 512 \
        --workers 1 \
        --num-models $m \
        --rps 75 \
        --benchmark-duration 60 \
        --speedup-ratio 1000000 \
        --output-dir artifacts/sweep_models/m${m} \
        -- --skip-bpf
done

# Compare results
for m in 1 2 3 4; do
    echo "=== m=$m ==="
    cat artifacts/sweep_models/m${m}/summary.md
    echo
done
```

#### Scaling workers per model (fixed model count)

Useful for measuring whether adding more backend workers relieves transport bottlenecks for a single model under heavy load.

```bash
# Sweep across 1, 2, 4, 8 workers for a single model
python3 sweep_runner.py \
    --tokenizers hf \
    --concurrency 512 \
    --isl 512 \
    --workers 1,2,4,8 \
    --num-models 1 \
    --rps 75 \
    --benchmark-duration 60 \
    --speedup-ratio 1000000 \
    --output-dir artifacts/sweep_workers \
    -- --skip-bpf
```

#### Combined model + worker grid

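`--workers` accepts a comma-separated list, so a single invocation sweeps worker counts for a fixed number of models. Each worker count produces a separate run. To cover several model counts as well, wrap the command in an external loop over `--num-models`.

```bash
# 1x3 grid: 2 models x (1, 2, 4 workers)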
python3 sweep_runner.py \
    --tokenizers hf \
    --concurrency 256 \
    --isl 512 \
    --workers 1,2,4 \
    --num-models 2 \
    --rps 50 \
    --benchmark-duration 60 \
    --speedup-ratio 1000000 \
    --output-dir artifacts/sweep_grid \
    -- --skip-bpf
```

> **Note:** `--num-models` is a single integer (not comma-separated). To sweep across model counts, loop externally as shown in the "Scaling models" example above.

#### What to look for in the results

| Metric | Where to find it | What it tells you |
|--------|-----------------|-------------------|
| Req/s and Tok/s | `summary.md` | Whether the frontend can sustain the target load |
| TTFT p50/p99 | `summary.md` | End-to-end first-token latency (includes preprocess + routing + transport) |
| `transport_roundtrip` p50 | `report.md` section 4 | Time spent in the TCP request plane; grows when workers are saturated |
| Tokio worker busy ratio | `report.md` section 7 | Fraction of time each async worker is busy; values above 0.95 indicate saturation |
| Event loop stalls | `report.md` section 7 | How often the Tokio runtime stalled; high counts suggest blocking work on the async executor |
| `preprocess.tokenize` | `report.md` section 5 (NVTX) | Per-request tokenization cost; varies by tokenizer backend |

### With Profilers

```bash
# With perf stat + flamegraphs (no root needed)
python3 sweep_runner.py --tokenizers hf --concurrency 64 --isl 1024 \
    --benchmark-duration 60 --speedup-ratio 1000000

# With everything including BPF (needs sudo)
sudo -E python3 sweep_runner.py --tokenizers hf --concurrency 64 --isl 1024 \
    --benchmark-duration 60 --speedup-ratio 1000000

# nsys profiling (needs nsys in PATH)
python3 sweep_runner.py --tokenizers hf --concurrency 64 --isl 1024 \
    --benchmark-duration 60 --speedup-ratio 1000000 \
    -- --nsys-path /opt/nvidia/nsight-systems/bin/nsys
```

Profiler controls are passed through to `run_perf.sh` after `--`:

| Flag | Effect |
|------|--------|
| `--skip-bpf` | Skip BPF tracing |
| `--skip-nsys` | Skip Nsight Systems |
| `--skip-flamegraph` | Skip CPU/off-CPU flamegraphs |
| `--skip-perf` | Skip perf stat hardware counters |

### All Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model` | `Qwen/Qwen3-0.6B` | HF model path |
| `--backend` | `mocker` | Engine: `mocker` (synthetic) or `vllm` |
| `--tokenizers` | `hf,fastokens` | Comma-separated tokenizer backends |
| `--concurrency` | `50,100,200` | Comma-separated concurrency levels |
| `--isl` | `512,1024,2048` | Comma-separated input sequence lengths |
| `--osl` | `256` | Output sequence length |
| `--workers` | `2` | Comma-separated worker counts per model |
| `--num-models` | `1` | Number of model instances (each gets `--workers` workers) |
| `--rps` | - | Comma-separated target request rates (req/s) |
| `--aiperf-targets` | `first` | `first`: model-1 only. `all`: run aiperf for each model |
| `--speedup-ratio` | `1.0` | Mocker speedup divisor; use large values (e.g., 1000000) for near-instant mocker |
| `--benchmark-duration` | `60` | aiperf run duration (seconds) |
| `--num-requests` | - | Comma-separated request counts (overrides duration) |
| `--output-dir` | auto | Output directory |
| `--max-consecutive-fails` | `2` | Skip remaining ISLs after N consecutive failures |
| `--cooldown` | `3` | Seconds between runs |
| `--dry-run` | - | Print plan without executing |
| `--no-report` | - | Skip per-run report generation |

## run_perf.sh

Low-level per-run harness. Normally called by `sweep_runner.py`, but can be run directly for single runs.

```bash
# Minimal (no profilers)
./run_perf.sh --model Qwen/Qwen3-0.6B --concurrency 32 --num-requests 640 \
    --speedup-ratio 1000000 --skip-bpf --skip-nsys --skip-flamegraph --skip-perf

# Full observability (needs sudo for BPF)
sudo -E ./run_perf.sh --model Qwen/Qwen3-0.6B --concurrency 64 \
    --benchmark-duration 60 --speedup-ratio 1000000

# Multi-model with 2 workers each
./run_perf.sh --model Qwen/Qwen3-0.6B --num-models 2 --workers 2 \
    --concurrency 32 --benchmark-duration 30 --speedup-ratio 1000000 \
    --skip-bpf --skip-nsys --skip-flamegraph --skip-perf

# 4 models, 1 worker each, rate-limited to 75 rps
./run_perf.sh --model Qwen/Qwen3-0.6B --num-models 4 --workers 1 \
    --concurrency 512 --benchmark-duration 60 --request-rate 75 \
    --speedup-ratio 1000000 --skip-bpf
```

## Analyzing Results

```bash
# Per-run report (generated automatically by sweep_runner.py)
python3 analysis/create_report.py analyze artifacts/sweep_<ts>/hf_c32_isl512_w2

# Auto-find latest run
python3 analysis/create_report.py analyze

# Prometheus delta (initial vs final snapshot)
diff <(grep "^dynamo_frontend" artifacts/.../prometheus/initial_snapshot.txt | sort) \
     <(grep "^dynamo_frontend" artifacts/.../prometheus/final_snapshot.txt | sort)

# nsys SQLite queries (when nsys was enabled)
sqlite3 artifacts/.../nsys/frontend.sqlite \
    "SELECT name, COUNT(*), ROUND(AVG(end-start)/1e3,1) as avg_us
     FROM NVTX_EVENTS WHERE end > start GROUP BY name ORDER BY avg_us DESC"
```
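
The `diff` above shows which lines changed; to get numeric deltas per series, the two snapshots can be parsed and subtracted. The sketch below assumes the snapshots are in the Prometheus text exposition format without trailing timestamps. `parse_snapshot` and `snapshot_delta` are hypothetical helpers, and the metric name in the example is made up for illustration.

```python
def parse_snapshot(text: str, prefix: str = "dynamo_frontend") -> dict[str, float]:
    """Map each series (metric name plus labels) to its sample value.

    Assumes one sample per line with no trailing timestamp.
    """
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and metrics outside the prefix.
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def snapshot_delta(initial: str, final: str) -> dict[str, float]:
    """Return final minus initial for every series whose value changed."""
    before = parse_snapshot(initial)
    after = parse_snapshot(final)
    return {
        series: value - before.get(series, 0.0)
        for series, value in after.items()
        if value != before.get(series, 0.0)
    }


# Hypothetical metric name and values, for illustration only.
initial = (
    "# TYPE dynamo_frontend_example_total counter\n"
    'dynamo_frontend_example_total{model="model-1"} 10\n'
)
final = (
    'dynamo_frontend_example_total{model="model-1"} 650\n'
    "other_metric 3\n"
)
print(snapshot_delta(initial, final))
# {'dynamo_frontend_example_total{model="model-1"}': 640.0}
```

For a real run, read the two inputs with `pathlib.Path(...).read_text()` from the `prometheus/initial_snapshot.txt` and `prometheus/final_snapshot.txt` files of the run directory. Subtraction is only meaningful for counters and histogram buckets; gauges should be compared rather than differenced.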

## Output Structure

```text
artifacts/sweep_YYYYMMDD_HHMMSS/
    results.csv                        Sweep results (all runs)
    summary.md                         Comparison table
    hf_c32_isl512_w2/                  Per-run directory
        config.json                    Run parameters
        report.md                      Analysis report
        aiperf/
            profile_export_aiperf.json aiperf metrics
        prometheus/
            initial_snapshot.txt        Pre-load metrics
            final_snapshot.txt          Post-load metrics
            timeseries.jsonl            Per-second scrapes
        system/
            thread_count.txt            Thread count over time
            fd_count.txt                FD count over time
            proc_status.txt             /proc/PID/status snapshots
        logs/
            frontend.log
            mocker_*.log
        perf/                           (unless --skip-perf)
            perf_stat.txt
            cpu_flamegraph.svg
        bpf/                            (unless --skip-bpf; needs root)
            runqlat.txt
            syscall_latency.txt
            ...
        nsys/                           (unless --skip-nsys)
            frontend.nsys-rep
            frontend.sqlite
```
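
When post-processing many runs, it can help to recover the configuration from the per-run directory name. The sketch below assumes the pattern suggested by the `hf_c32_isl512_w2` example (tokenizer, concurrency, ISL, workers); real sweeps may append further fields, so treat the regular expression and the `parse_run_dir` helper as hypothetical and prefer `config.json` as the authoritative source.

```python
import re

# Assumed layout: <tokenizer>_c<concurrency>_isl<isl>_w<workers>
RUN_DIR_PATTERN = re.compile(
    r"^(?P<tokenizer>[A-Za-z0-9]+)_c(?P<concurrency>\d+)_isl(?P<isl>\d+)_w(?P<workers>\d+)$"
)


def parse_run_dir(name: str) -> dict[str, object]:
    """Split a run directory name into its configuration fields."""
    match = RUN_DIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized run directory name: {name!r}")
    return {
        "tokenizer": match["tokenizer"],
        "concurrency": int(match["concurrency"]),
        "isl": int(match["isl"]),
        "workers": int(match["workers"]),
    }


print(parse_run_dir("hf_c32_isl512_w2"))
# {'tokenizer': 'hf', 'concurrency': 32, 'isl': 512, 'workers': 2}
```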