---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Dynamo Benchmarking
subtitle: Benchmark and compare performance across Dynamo deployment configurations
---

This guide shows how to benchmark Dynamo deployments using [AIPerf](https://github.com/ai-dynamo/aiperf), a comprehensive tool for measuring generative AI inference performance. AIPerf provides detailed metrics, real-time dashboards, and automatic visualization — you call it directly against your endpoints.

You can benchmark any combination of:
- **DynamoGraphDeployments**
- **External HTTP endpoints** (vLLM, llm-d, AIBrix, etc.)

## Choosing Your Benchmarking Approach

**Client-side** runs benchmarks on your local machine via port-forwarding. **Server-side** runs benchmarks directly within the Kubernetes cluster using internal service URLs.

**TL;DR:**
- Need high-performance or high-load testing? Use server-side.
- Just quick testing or comparison? Use client-side.

### Use Client-Side Benchmarking When:
- You want to quickly test deployments
- You want immediate access to results on your local machine
- You're comparing external services or deployments (not necessarily just Dynamo deployments)
- You need to run benchmarks from your laptop/workstation

**[Go to Client-Side Benchmarking (Local)](#client-side-benchmarking-local)**

### Use Server-Side Benchmarking When:
- You have a development environment with kubectl access
- You're doing performance validation with high load/speed requirements
- You're experiencing timeouts or performance issues with client-side benchmarking
- You want optimal network performance (no port-forwarding overhead)
- You're running automated CI/CD pipelines
- You need isolated execution environments
- You want persistent result storage in the cluster

**[Go to Server-Side Benchmarking (In-Cluster)](#server-side-benchmarking-in-cluster)**

### Quick Comparison

| Feature | Client-Side | Server-Side |
|---------|-------------|-------------|
| **Location** | Your local machine | Kubernetes cluster |
| **Network** | Port-forwarding required | Direct service DNS |
| **Setup** | Quick and simple | Requires cluster resources |
| **Performance** | Limited by local resources, may time out under high load | Optimal cluster performance, handles high load |
| **Isolation** | Shared environment | Isolated job execution |
| **Results** | Local filesystem | Persistent volumes |
| **Best for** | Light load | High load |

## AIPerf Overview

[AIPerf](https://github.com/ai-dynamo/aiperf) is a standalone benchmarking tool available on [PyPI](https://pypi.org/project/aiperf/). It is pre-installed in Dynamo container images. Key features:

- Measures latency, throughput, TTFT, inter-token latency, and more
- Multiple load modes: concurrency, request-rate, trace replay
- Automatic visualization with `aiperf plot` (Pareto curves, time series, GPU telemetry)
- Interactive dashboard mode for real-time exploration
- Arrival patterns (Poisson, constant, gamma) for realistic traffic simulation
- Warmup phases, gradual ramping, and multi-URL load balancing

**Important**: The `--model` parameter must match the model deployed at the endpoint.

For full documentation, see the [AIPerf docs](https://github.com/ai-dynamo/aiperf/tree/main/docs).

---

# Client-Side Benchmarking (Local)

Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding.

## Prerequisites

1. **Dynamo container environment** - Run inside a Dynamo container, where AIPerf is pre-installed, or install AIPerf locally:
   ```bash
   pip install aiperf
   ```

2. **HTTP endpoints** - Ensure you have HTTP endpoints available for benchmarking. These can be:
   - DynamoGraphDeployments exposed via HTTP endpoints
   - External services (vLLM, llm-d, AIBrix, etc.)
   - Any HTTP endpoint serving OpenAI-compatible models

## User Workflow

### Step 1: Set Up Cluster and Deploy

Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform following the [installation guide](../kubernetes/installation-guide.md). Then deploy your DynamoGraphDeployments using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends).

### Step 2: Port-Forward and Run a Single Benchmark

> **Wait for model readiness.** Before benchmarking, ensure your deployment has fully loaded the model. Check pod logs or, once the port-forward below is running, hit the health endpoint (`curl http://localhost:8000/health`); it should return `200 OK` before you proceed.

```bash
# Port-forward the frontend service
kubectl port-forward -n <namespace> svc/<frontend-service-name> 8000:8000 > /dev/null 2>&1 &

# Run a single benchmark
aiperf profile \
    --model <your-model-name> \
    --url http://localhost:8000 \
    --endpoint-type chat \
    --streaming \
    --concurrency 10 \
    --request-count 100 \
    --synthetic-input-tokens-mean 2000 \
    --output-tokens-mean 256
```

This produces results in `artifacts/` and prints a summary table to the console:

```text
                                NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃              Metric ┃     avg ┃     min ┃     max ┃     p99 ┃     p90 ┃     p50 ┃     std ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Time to First Token │  234.56 │  189.23 │  298.45 │  289.34 │  267.12 │  231.12 │   28.45 │
│                (ms) │         │         │         │         │         │         │         │
│     Request Latency │ 1234.56 │  987.34 │ 1567.89 │ 1534.23 │ 1456.78 │ 1223.45 │  156.78 │
│                (ms) │         │         │         │         │         │         │         │
│ Inter Token Latency │   15.67 │   12.34 │   19.45 │   19.01 │   18.23 │   15.45 │    1.89 │
│                (ms) │         │         │         │         │         │         │         │
│  Request Throughput │   31.45 │     N/A │     N/A │     N/A │     N/A │     N/A │     N/A │
│      (requests/sec) │         │         │         │         │         │         │         │
└─────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
```

*Actual numbers will vary based on model size, hardware, batch size, and network conditions. Client-side benchmarks include port-forwarding overhead — use [server-side benchmarking](#server-side-benchmarking-in-cluster) for accurate performance measurement.*

To stop the port-forward when done: `kill %1` (or `kill <PID>`).
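Note that `kill %1` only targets the right process if the port-forward is the first background job in the current shell. A slightly more robust pattern is to capture the PID from `$!` when starting it. The sketch below assumes `kubectl` is on your `PATH` and that the service listens on port 8000 as in the commands above; the function names and the `PF_PID` variable are hypothetical.

```bash
# Hypothetical helpers: start a port-forward in the background and stop it by PID.
# Usage: start_port_forward <namespace> <service> [local-port]
start_port_forward() {
    local ns="$1" svc="$2" port="${3:-8000}"
    kubectl port-forward -n "$ns" "svc/$svc" "$port:8000" > /dev/null 2>&1 &
    PF_PID=$!   # PID of the background port-forward process
}

# Stop the port-forward started by start_port_forward, if one is recorded.
stop_port_forward() {
    if [ -n "${PF_PID:-}" ]; then
        kill "$PF_PID" 2>/dev/null || true
        wait "$PF_PID" 2>/dev/null || true
        PF_PID=
    fi
    return 0
}
```

Call `start_port_forward <namespace> <frontend-service-name>` before benchmarking and `stop_port_forward` afterwards, or register it with `trap stop_port_forward EXIT` so the port-forward is cleaned up even if a benchmark run fails.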

### Step 3: Concurrency Sweep for Pareto Analysis

To understand how your deployment behaves across load levels, run a concurrency sweep. Each concurrency level sends enough requests for stable measurements (`max(c*3, 10)`):

```bash
MODEL="<your-model-name>"
URL="http://localhost:8000"

for c in 1 2 5 10 50 100; do
    aiperf profile \
        --model "$MODEL" \
        --url "$URL" \
        --endpoint-type chat \
        --streaming \
        --concurrency $c \
        --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \
        --synthetic-input-tokens-mean 2000 \
        --output-tokens-mean 256 \
        --artifact-dir "artifacts/deployment-a/c$c"
done
```
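The request count in the loop above follows the `max(c*3, 10)` rule. As a quick check of what that yields at each level, here is a small sketch; the `request_count` helper is a hypothetical name and not part of AIPerf.

```bash
# Hypothetical helper (not part of AIPerf): requests to send at concurrency c,
# using the same max(c*3, 10) rule as the sweep loop.
request_count() {
    local c="$1"
    echo $(( c * 3 > 10 ? c * 3 : 10 ))
}

# Print the request count for each concurrency level in the sweep.
for c in 1 2 5 10 50 100; do
    echo "c=$c -> $(request_count "$c") requests"
done
```

For the levels used here, concurrency 1 and 2 are floored at 10 requests, while 5, 10, 50, and 100 send 15, 30, 150, and 300 requests respectively.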

**Note**: Adjust concurrency levels to match your deployment's capacity. Very high concurrency on a small deployment (e.g., a concurrency of 250 on a single GPU) can cause server errors. Start with lower values and increase until you find the saturation point.

### Step 4: [If Comparative] Benchmark a Second Deployment

Tear down deployment A and deploy deployment B with a different configuration. Kill the previous port-forward (`kill %1`), update `MODEL` if deployment B serves a different model, then repeat:

```bash
kubectl port-forward -n <namespace> svc/<frontend-service-b> 8000:8000 > /dev/null 2>&1 &

for c in 1 2 5 10 50 100; do
    aiperf profile \
        --model "$MODEL" \
        --url "$URL" \
        --endpoint-type chat \
        --streaming \
        --concurrency $c \
        --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \
        --synthetic-input-tokens-mean 2000 \
        --output-tokens-mean 256 \
        --artifact-dir "artifacts/deployment-b/c$c"
done
```
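Since the sweep loop is identical for both deployments apart from the artifact directory, it can be wrapped in a function. This is one possible sketch, not an AIPerf feature: `run_sweep` and the `AIPERF` override variable are hypothetical names, and the flags are the same ones used in the loops above.

```bash
# Hypothetical wrapper around the sweep loop shown above.
# Usage: run_sweep <label> <model> <url> <concurrency>...
# Set AIPERF=echo to print the commands instead of running them (dry run).
run_sweep() {
    local label="$1" model="$2" url="$3" c
    shift 3
    for c in "$@"; do
        "${AIPERF:-aiperf}" profile \
            --model "$model" \
            --url "$url" \
            --endpoint-type chat \
            --streaming \
            --concurrency "$c" \
            --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \
            --synthetic-input-tokens-mean 2000 \
            --output-tokens-mean 256 \
            --artifact-dir "artifacts/$label/c$c"
    done
}
```

With this in place, each deployment needs a single call such as `run_sweep deployment-a "$MODEL" "$URL" 1 2 5 10 50 100`, which keeps the load parameters identical across the runs being compared.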

### Step 5: Generate Visualizations

```bash
# Compare all runs — auto-detects multi-run directories
aiperf plot artifacts/deployment-a artifacts/deployment-b

# Or compare all subdirectories under a parent
aiperf plot artifacts/

# Launch interactive dashboard for exploration
aiperf plot artifacts/ --dashboard
```

AIPerf automatically generates plots based on available data:
- **TTFT vs Throughput** — find the sweet spot between responsiveness and capacity (always generated for multi-run comparisons)
- **Pareto Curves** — throughput per GPU vs latency and interactivity (only generated when GPU telemetry data is available — add `--gpu-telemetry` during profiling if DCGM is running)
- **Time series** — per-request TTFT, ITL, and latency over time (generated for single-run analysis)

Here is an example Pareto frontier from a concurrency sweep of Qwen3-0.6B on 8x H200 with vLLM, showing the tradeoff between user experience (tokens/sec per user) and resource efficiency (tokens/sec per GPU):

![AIPerf Pareto Frontier](../assets/img/aiperf-pareto-frontier.png)

See the [AIPerf Visualization Guide](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/plot.md) for full details on plot customization, experiment classification, and themes.

## Use Cases

- **Compare DynamoGraphDeployments** (e.g., aggregated vs disaggregated configurations)
- **Compare different backends** (e.g., SGLang vs TensorRT-LLM vs vLLM)
- **Compare Dynamo vs other platforms** (e.g., Dynamo vs llm-d vs AIBrix)
- **Compare different models** (e.g., Llama-3-8B vs Llama-3-70B vs Qwen3-0.6B)
- **Compare different hardware configurations** (e.g., H100 vs A100 vs H200)
- **Compare different parallelization strategies** (e.g., different GPU counts or memory configurations)

## AIPerf Quick Reference

### Commonly Used Options

```text
aiperf profile [OPTIONS]

REQUIRED:
  --model MODEL               Model name (must match the deployed model)
  --url URL                   Endpoint URL (e.g., http://localhost:8000)

COMMON OPTIONS:
  --endpoint-type TYPE        Endpoint type: chat, completions, embeddings (default: chat)
  --streaming                 Enable streaming responses
  --concurrency N             Number of concurrent requests
  --request-rate N            Target requests per second (alternative to --concurrency)
  --request-count N           Total number of requests to send
  --benchmark-duration N      Run for N seconds instead of a fixed request count
  --synthetic-input-tokens-mean N   Average input sequence length in tokens
  --output-tokens-mean N      Average output sequence length in tokens
  --artifact-dir DIR          Output directory for results (default: artifacts/)
  --warmup-request-count N    Warmup requests before measurement
  --ui TYPE                   UI mode: dashboard, simple, none (default: dashboard)
```

For the complete CLI reference, see `aiperf profile --help` or the [CLI docs](https://github.com/ai-dynamo/aiperf/blob/main/docs/cli-options.md).

### Output Sequence Length

To enforce a specific output length, set `max_tokens` and `min_tokens` to the same value and pass them together with `ignore_eos` via `--extra-inputs`:

```bash
aiperf profile \
    --model <model> \
    --url http://localhost:8000 \
    --endpoint-type chat \
    --streaming \
    --concurrency 10 \
    --output-tokens-mean 256 \
    --extra-inputs max_tokens:256 \
    --extra-inputs min_tokens:256 \
    --extra-inputs ignore_eos:true
```

### Understanding Results

Each `aiperf profile` run produces an artifact directory containing:
- **`profile_export_aiperf.json`** — Structured metrics (latency, throughput, TTFT, ITL, etc.)
- **`profile_export.jsonl`** — Per-request raw data
- **`profile_export_aiperf.csv`** — CSV format metrics

Results are organized by the `--artifact-dir` you specify. For concurrency sweeps, a common pattern is:

```text
artifacts/
├── deployment-a/
│   ├── c1/
│   │   ├── profile_export_aiperf.json
│   │   └── profile_export.jsonl
│   ├── c10/
│   ├── c50/
│   └── c100/
├── deployment-b/
│   ├── c1/
│   ├── c10/
│   ├── c50/
│   └── c100/
└── plots/                    # Generated by aiperf plot
    ├── ttft_vs_throughput.png
    ├── pareto_curve_throughput_per_gpu_vs_latency.png      # If GPU telemetry available
    └── pareto_curve_throughput_per_gpu_vs_interactivity.png # If GPU telemetry available
```
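Before plotting, it can help to confirm that every run directory in the sweep actually produced a JSON summary, since a run that failed (for example, under too much load) may leave its directory incomplete. The following is a minimal sketch assuming the `<deployment>/c<N>/` layout shown above; `missing_results` is a hypothetical helper, not an AIPerf command.

```bash
# Hypothetical helper: report sweep run directories that lack a JSON summary.
# Usage: missing_results <artifacts-root>
# Prints one path per <deployment>/c<N> directory without profile_export_aiperf.json.
missing_results() {
    local root="$1" dir
    for dir in "$root"/*/c*/; do
        [ -d "$dir" ] || continue    # the glob matched nothing
        if [ ! -f "${dir}profile_export_aiperf.json" ]; then
            echo "${dir%/}"
        fi
    done
}
```

Running `missing_results artifacts` prints nothing when every run directory contains its summary file; any path it does print is a candidate for re-running at that concurrency level.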

---

# Server-Side Benchmarking (In-Cluster)

Server-side benchmarking runs directly within the Kubernetes cluster, eliminating port-forwarding overhead and enabling high-load testing.

## Prerequisites

1. **Kubernetes cluster** with NVIDIA GPUs and Dynamo namespace setup (see [Dynamo Kubernetes Platform docs](../kubernetes/README.md))
2. **Storage**: PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md))
3. **Docker image** containing AIPerf (Dynamo runtime images include it)

## Quick Start

### Step 1: Deploy Your DynamoGraphDeployment
Deploy using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Ensure it has a frontend service exposed and the model is fully loaded before running benchmarks — check pod logs or verify the health endpoint returns `200 OK`.
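The readiness check can be scripted rather than done by hand. The sketch below polls a health URL until it returns HTTP 200; it assumes `curl` is available wherever it runs and that the frontend exposes a `/health` path, and the `wait_for_ready` name and its defaults are hypothetical.

```bash
# Hypothetical helper: poll a health URL until it returns HTTP 200.
# Usage: wait_for_ready <health-url> [max-attempts] [delay-seconds]
wait_for_ready() {
    local url="$1" attempts="${2:-60}" delay="${3:-5}" i=1 code
    while [ "$i" -le "$attempts" ]; do
        # -s: silent, -o /dev/null: discard the body, -w: print only the status code
        code="$(curl -s -o /dev/null -w '%{http_code}' "$url" 2>/dev/null || true)"
        if [ "$code" = "200" ]; then
            return 0
        fi
        sleep "$delay"
        i=$(( i + 1 ))
    done
    echo "endpoint not ready after $attempts attempts: $url" >&2
    return 1
}
```

The URL must be reachable from where the function runs: inside the cluster that is the frontend service address, while from a workstation it requires a port-forward. A call such as `wait_for_ready http://<frontend-service>:8000/health` returns non-zero if the endpoint never becomes ready, so it can gate the benchmark step in a script.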

### Step 2: Configure and Run Benchmark Job

First, edit `benchmarks/incluster/benchmark_job.yaml` to match your deployment:

- **Model name**: Update the `MODEL` variable
- **Service URL**: Update the `URL` variable (use `<svc_name>.<namespace>.svc.cluster.local:port` for cross-namespace access)
- **Concurrency levels**: Adjust the `for c in ...` loop
- **Docker image**: Update the `image` field if needed

Then deploy:

```bash
export NAMESPACE=benchmarking

# Deploy the benchmark job
kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE

# Monitor the job
kubectl logs -f job/dynamo-benchmark -n $NAMESPACE
```
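In scripts or CI pipelines, following logs is less convenient than blocking until the job finishes. One way to do that is `kubectl wait` with the Job's `complete` condition, sketched below. The job name `dynamo-benchmark` is taken from the commands above; the `wait_for_benchmark` wrapper and its default timeout are hypothetical.

```bash
# Hypothetical helper: block until the benchmark Job completes or the timeout expires.
# Usage: wait_for_benchmark <namespace> [timeout]   (timeout such as 30m; default 1h)
wait_for_benchmark() {
    local ns="$1" timeout="${2:-1h}"
    kubectl wait --for=condition=complete job/dynamo-benchmark \
        -n "$ns" --timeout="$timeout"
}
```

Note that waiting on `condition=complete` does not return early when the job fails; in that case the command runs until the timeout, so choose a timeout that reflects how long the sweep is expected to take.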

### Step 3: Retrieve Results
```bash
# Create access pod (skip if already running)
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s

# Download the results
kubectl cp $NAMESPACE/pvc-access-pod:/data/results ./results

# Cleanup
kubectl delete pod pvc-access-pod -n $NAMESPACE
```

### Step 4: Generate Plots
```bash
aiperf plot ./results
```

## Cross-Namespace Service Access

When referencing services in other namespaces, use full Kubernetes DNS:

```bash
# Same namespace
--url http://vllm-agg-frontend:8000

# Different namespace
--url http://vllm-agg-frontend.production.svc.cluster.local:8000
```
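The two URL forms above differ only in whether the namespace suffix is present. As an illustration, here is a small sketch that builds either form; `svc_url` is a hypothetical helper, and the service name and port in the usage note are the example values from above.

```bash
# Hypothetical helper: build the in-cluster URL for a service.
# Usage: svc_url <service> <port> [namespace]
svc_url() {
    local svc="$1" port="$2" ns="${3:-}"
    if [ -n "$ns" ]; then
        # Different namespace: use the full Kubernetes DNS name
        echo "http://$svc.$ns.svc.cluster.local:$port"
    else
        # Same namespace: the short service name resolves
        echo "http://$svc:$port"
    fi
}
```

For example, `svc_url vllm-agg-frontend 8000 production` prints `http://vllm-agg-frontend.production.svc.cluster.local:8000`, and omitting the namespace prints `http://vllm-agg-frontend:8000`.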

## Monitoring and Debugging

```bash
# Check job status
kubectl describe job dynamo-benchmark -n $NAMESPACE

# Follow logs
kubectl logs -f job/dynamo-benchmark -n $NAMESPACE

# Check pod status
kubectl get pods -n $NAMESPACE -l job-name=dynamo-benchmark

# Debug failed pod
kubectl describe pod <pod-name> -n $NAMESPACE
```

### Troubleshooting

1. **Service not found**: Ensure your DynamoGraphDeployment frontend service is running
2. **PVC access**: Check that `dynamo-pvc` is properly configured and accessible
3. **Image pull issues**: Ensure the Docker image is accessible from the cluster
4. **Resource constraints**: Adjust resource limits if the job is being evicted

```bash
# Check PVC status
kubectl get pvc dynamo-pvc -n $NAMESPACE

# Verify service exists and has endpoints
kubectl get svc -n $NAMESPACE
kubectl get endpoints <service-name> -n $NAMESPACE
```

---

## Testing with Mocker Backend

For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) that simulates LLM inference without requiring actual GPU resources. This is useful for:

- **Testing deployments** without expensive GPU infrastructure
- **Developing and debugging** router, planner, or frontend logic
- **CI/CD pipelines** that need to validate infrastructure without model execution
- **Benchmarking framework validation** to ensure your setup works before using real backends

The mocker backend mimics the API and behavior of real backends (SGLang, TensorRT-LLM, vLLM) but generates mock responses instead of running actual inference.

See the [mocker directory](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) for usage examples and configuration options.

---

## Advanced AIPerf Features

AIPerf has many capabilities beyond basic profiling. The following are particularly useful for Dynamo benchmarking:

| Feature | Description | Docs |
|---------|-------------|------|
| Trace Replay | Replay production traces for deterministic benchmarking | [Trace Replay](https://github.com/ai-dynamo/aiperf/blob/main/docs/benchmark-modes/trace-replay.md) |
| Arrival Patterns | Poisson, constant, gamma traffic distributions | [Arrival Patterns](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/arrival-patterns.md) |
| Gradual Ramping | Smooth ramp-up of concurrency and request rate | [Ramping](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/ramping.md) |
| Warmup Phase | Eliminate cold-start effects from measurements | [Warmup](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/warmup.md) |
| Multi-URL Load Balancing | Distribute requests across multiple endpoints | [Multi-URL](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/multi-url-load-balancing.md) |
| GPU Telemetry | Collect DCGM metrics during benchmarking | [GPU Telemetry](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/gpu-telemetry.md) |
| Goodput Analysis | SLO-based throughput measurement | [Goodput](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/goodput.md) |
| Timeslice Analysis | Per-timeslice performance breakdown | [Timeslices](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/timeslices.md) |
| Multi-Turn Conversations | Benchmark multi-turn chat workloads | [Multi-Turn](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/multi-turn.md) |
| Experiment Classification | Baseline vs treatment semantic colors in plots | [Plotting](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/plot.md) |