<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# Frontend Scaling Test: Finding the Saturation Point

This guide walks through using the sweep runner to find the saturation point of
a Dynamo frontend serving a real vLLM backend.  The saturation point is the
request rate at which latency begins to degrade -- prefill requests start
queuing instead of being served immediately, TTFT p99 spikes, and throughput
plateaus.

---

## Overview

The test sweeps increasing request rates (`--rps`) at a fixed input sequence
length while keeping the backend warm (`--reset-strategy frontend`).  Each data
point is a 60-second aiperf run at a controlled RPS.  The sweep stops
automatically once the number of consecutive failed runs reaches
`--max-consecutive-fails`.

**What you get:**

- Per-RPS throughput (actual req/s vs target), TTFT p50/p99, ITL p50/p99
- Prometheus pre/post metrics for pipeline stage breakdown
- CSV + summary for easy comparison

---

## Prerequisites

1. **K8s namespace** with:
   - `hf-token-secret` (HuggingFace token)
   - `nvcrimagepullsecret` (image pull credentials)
   - `model-cache` PVC (RWX, large enough for model weights)
   - Model weights downloaded to PVC (see "Model Download" below)

2. **DGD deployed** with the target model and backend.

3. **sweep_runner.py** accessible from a machine with `kubectl` access to the
   cluster.

---

## Model Download (gpt-oss-20b example)

Download the model to the PVC, excluding large non-inference directories:

```bash
# Create a download Job (adjust image and namespace)
kubectl apply -n <namespace> -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download-gpt-oss-20b
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: nvcrimagepullsecret
      containers:
        - name: download
          image: nvcr.io/nvidian/dynamo-dev/biswa:vllm-runtime-1a8bce12ea
          command: ["python3", "-c"]
          args:
            - |
              import os, subprocess, sys
              model = "openai/gpt-oss-20b"
              os.environ["HF_HOME"] = "/model-store"
              cmd = ["huggingface-cli", "download", model,
                     "--exclude", "metal/*", "--exclude", "original/*",
                     "--local-dir", "/model-store/hub/models--openai--gpt-oss-20b/snapshots/main"]
              sys.exit(subprocess.run(cmd).returncode)
          env:
            - name: HF_HOME
              value: /model-store
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: HF_TOKEN
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
EOF

# Monitor
kubectl logs -n <namespace> -l job-name=model-download-gpt-oss-20b -f
```

---

## Deploy the DGD

Use the provided template for gpt-oss-20b with TP=2:

```bash
# Template path (relative to repo root)
# benchmarks/frontend/dgd/templates/vllm-gpt-oss-20b.yaml
#
# Key settings in the template:
#   - tensor-parallel-size 2 (2 GPUs per worker)
#   - max-model-len 65536
#   - gpu-memory-utilization 0.90
#   - GPU toleration for scheduling

# Deploy directly (adjust values as needed):
kubectl apply -n <namespace> -f - <<'EOF'
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: gpt-oss-20b-bench
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        imagePullSecrets:
          - name: nvcrimagepullsecret
        mainContainer:
          image: <your-image>
          command: ["/bin/sh", "-c"]
          args: ["python3 -m dynamo.frontend --router-mode round-robin --http-port 8000"]
          env:
            - name: DYN_TOKENIZER_BACKEND
              value: "default"
            - name: DYN_PERF_DIAG
              value: "1"
            - name: HF_HOME
              value: /model-store
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: HF_TOKEN
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: model-cache

    VllmWorker:
      componentType: worker
      replicas: 4                    # <-- number of backend replicas
      extraPodSpec:
        imagePullSecrets:
          - name: nvcrimagepullsecret
        tolerations:
          - effect: NoSchedule
            key: nvidia.com/gpu
            operator: Exists
        mainContainer:
          image: <your-image>
          command: ["/bin/sh", "-c"]
          args:
            - >-
              python3 -m dynamo.vllm
              --model /model-store/hub/models--openai--gpt-oss-20b/snapshots/main
              --served-model-name openai/gpt-oss-20b
              --tensor-parallel-size 2
              --max-model-len 65536
              --gpu-memory-utilization 0.90
          env:
            - name: HF_HOME
              value: /model-store
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: HF_TOKEN
          resources:
            limits:
              nvidia.com/gpu: "2"    # <-- 2 GPUs for TP=2
          volumeMounts:
            - name: model-cache
              mountPath: /model-store
        volumes:
          - name: model-cache
            persistentVolumeClaim:
              claimName: model-cache
EOF

# Wait for all pods to be ready
kubectl get pods -n <namespace> -w
```

---

## Run the Saturation Sweep

### Baseline: HF tokenizer, RPS sweep

```bash
cd benchmarks/frontend/scripts

python3 sweep_runner.py --mode k8s \
    --dgd-name gpt-oss-20b-bench \
    --namespace <namespace> \
    --endpoint gpt-oss-20b-bench-frontend:8000 \
    --model openai/gpt-oss-20b \
    --backend vllm \
    --image <your-image> \
    --tokenizers hf \
    --concurrency 200 \
    --rps 10,20,30,40,50,60,70,80,90,100 \
    --isl 6144 \
    --osl 256 \
    --benchmark-duration 60 \
    --reset-strategy frontend \
    --isolation reuse_by_deploy_key \
    --worker-replicas 4 \
    --max-consecutive-fails 2
```

**Flag explanations:**

| Flag | Value | Purpose |
|------|-------|---------|
| `--rps 10,20,...,100` | Sweep dimension | Each run targets a fixed request rate. aiperf uses `--request-rate` to cap submission. |
| `--concurrency 200` | High ceiling | Maximum in-flight requests. Set high so aiperf can sustain the target RPS without being limited by available connection slots. This is NOT a sweep dimension. |
| `--isl 6144` | Fixed ISL | Holds input length constant to isolate throughput scaling. |
| `--osl 256` | Fixed OSL | Consistent output length across all runs. |
| `--benchmark-duration 60` | 60s per point | Long enough for vLLM scheduling to stabilize. |
| `--reset-strategy frontend` | Frontend-only | Resets Prometheus counters between runs, but keeps vLLM workers alive with warm KV caches and CUDA graphs. Avoids the ~90s full DGD restart per point. |
| `--isolation reuse_by_deploy_key` | Reuse deployment | Because the tokenizer (`hf`) is constant across the sweep, the DGD is not restarted between runs; only the frontend pod restarts, to give clean metrics. |
| `--max-consecutive-fails 2` | Auto-stop | After 2 consecutive failures at a given RPS, remaining higher RPS values are skipped. |
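
The auto-stop behavior can be pictured with a small sketch. This is not the sweep runner's code; `run_sweep` and `run_point` are hypothetical names, and the sketch assumes that failures are counted across consecutive sweep points and that a success resets the count.

```python
from typing import Callable, Dict, List


def run_sweep(
    rps_values: List[int],
    run_point: Callable[[int], bool],
    max_consecutive_fails: int = 2,
) -> Dict[int, str]:
    """Run RPS points in order; skip the remainder after N back-to-back failures."""
    status: Dict[int, str] = {}
    consecutive_fails = 0
    for rps in rps_values:
        if consecutive_fails >= max_consecutive_fails:
            # Threshold reached earlier: higher rates are not attempted.
            status[rps] = "skipped"
            continue
        ok = run_point(rps)
        status[rps] = "ok" if ok else "fail"
        # Assumption: a successful point resets the failure streak.
        consecutive_fails = 0 if ok else consecutive_fails + 1
    return status


# Hypothetical backend that starts failing above 60 req/s.
result = run_sweep(
    [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    run_point=lambda rps: rps <= 60,
)
for rps, state in result.items():
    print(rps, state)
```

With this toy backend, 70 and 80 are attempted and fail, and 90 and 100 are skipped.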

### Follow-up: fastokens comparison

Once you have the baseline, run the same sweep with `--tokenizers fastokens` to
see if the saturation point shifts:

```bash
python3 sweep_runner.py --mode k8s \
    --dgd-name gpt-oss-20b-bench \
    --namespace <namespace> \
    --endpoint gpt-oss-20b-bench-frontend:8000 \
    --model openai/gpt-oss-20b \
    --backend vllm \
    --image <your-image> \
    --tokenizers fastokens \
    --concurrency 200 \
    --rps 10,20,30,40,50,60,70,80,90,100 \
    --isl 6144 \
    --osl 256 \
    --benchmark-duration 60 \
    --reset-strategy frontend \
    --isolation reuse_by_deploy_key \
    --worker-replicas 4 \
    --max-consecutive-fails 2
```

### Fine-grained sweep around the inflection

If the baseline shows saturation between, say, RPS=40 and RPS=60, rerun with a
finer step across that range:

```bash
python3 sweep_runner.py --mode k8s \
    ... \
    --rps 35,40,45,50,55,60 \
    --reset-strategy frontend \
    --isolation reuse_by_deploy_key
```
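
If you prefer to generate the `--rps` value rather than type it, a helper along these lines would do. `fine_grained_rps` is a hypothetical name, and the 5 req/s step and one-step margin below the lower bound are assumptions chosen to reproduce the list above.

```python
def fine_grained_rps(lo: int, hi: int, step: int = 5, margin_steps: int = 1) -> str:
    """Build a comma-separated --rps value covering [lo, hi] plus a margin below lo."""
    # Start slightly below the suspected onset so the last healthy point is captured.
    start = max(step, lo - margin_steps * step)
    return ",".join(str(v) for v in range(start, hi + 1, step))


print(fine_grained_rps(40, 60))  # 35,40,45,50,55,60
```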

---

## Reading the Results

The sweep produces `results.csv` and `summary.md` in the output directory.

### Identifying the saturation point

Look for these signals in the CSV:

| RPS | Actual Req/s | TTFT p50 | TTFT p99 | ITL p99 | Status |
|----:|-----------:|--------:|--------:|-------:|--------|
| 10 | 10.0 | 800ms | 1200ms | 30ms | ok |
| 20 | 19.8 | 850ms | 1400ms | 32ms | ok |
| 30 | 29.5 | 900ms | 2000ms | 35ms | ok |
| 40 | 38.0 | 1200ms | 5000ms | 45ms | ok -- onset |
| 50 | 42.0 | 3000ms | 15000ms | 80ms | ok -- saturated |
| 60 | 41.5 | 8000ms | 30000ms | 120ms | ok -- overloaded |
| 70 | -- | -- | -- | -- | fail |

**Saturation indicators:**

1. **Actual req/s < target RPS**: The system cannot sustain the requested rate.
   At RPS=50, only 42 req/s are achieved.
2. **TTFT p99 spike**: A sharp increase (e.g., 2x-5x) means prefill requests
   are queuing behind each other.
3. **ITL p99 degradation**: Decode throughput drops because the vLLM scheduler
   is overloaded with concurrent prefills.
4. **Errors/failures**: Timeouts, OOM, or vLLM rejecting requests.

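The **saturation point** in the example above is **RPS ~40**: the last rate at
which actual throughput still tracks the target (38.0 of 40 req/s) and the
first at which TTFT p99 jumps sharply (2000 ms to 5000 ms).  Above it,
throughput plateaus near 42 req/s while latency keeps climbing.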
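
Indicators 1 and 2 can be applied mechanically. The sketch below uses the rows from the example table; the dict keys are hypothetical and may not match the real column names in `results.csv`, and the 90% tracking threshold and 2x spike factor are assumptions to tune for your own data.

```python
from typing import Dict, List, Optional

# Rows copied from the example table (successful runs only).
# Key names are illustrative, not the actual results.csv headers.
ROWS: List[Dict[str, float]] = [
    {"target_rps": 10, "actual_rps": 10.0, "ttft_p99_ms": 1200},
    {"target_rps": 20, "actual_rps": 19.8, "ttft_p99_ms": 1400},
    {"target_rps": 30, "actual_rps": 29.5, "ttft_p99_ms": 2000},
    {"target_rps": 40, "actual_rps": 38.0, "ttft_p99_ms": 5000},
    {"target_rps": 50, "actual_rps": 42.0, "ttft_p99_ms": 15000},
    {"target_rps": 60, "actual_rps": 41.5, "ttft_p99_ms": 30000},
]


def find_saturation_onset(
    rows: List[Dict[str, float]],
    min_ratio: float = 0.90,
    ttft_spike: float = 2.0,
) -> Optional[float]:
    """Return the first target RPS where throughput lags or TTFT p99 spikes."""
    prev_ttft = None
    for row in rows:
        # Indicator 1: achieved rate falls below a fraction of the target.
        tracking = row["actual_rps"] / row["target_rps"] >= min_ratio
        # Indicator 2: TTFT p99 grows by a large factor over the previous point.
        spiking = prev_ttft is not None and row["ttft_p99_ms"] / prev_ttft >= ttft_spike
        if not tracking or spiking:
            return row["target_rps"]
        prev_ttft = row["ttft_p99_ms"]
    return None


print(find_saturation_onset(ROWS))  # 40
```

On the example data the TTFT criterion fires first, at RPS=40 (2000 ms to 5000 ms); the throughput criterion alone would not fire until RPS=50.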

### Prometheus metrics

Each run captures `frontend_metrics_pre.txt` and `frontend_metrics_post.txt`.
Key metrics for saturation analysis:

- `dynamo_frontend_stage_duration_seconds{stage="preprocess"}` -- tokenization time
- `dynamo_frontend_stage_duration_seconds{stage="transport_roundtrip"}` -- backend latency
- `dynamo_frontend_queued_requests` -- requests waiting in HTTP queue (should be 0 below saturation)
- `dynamo_frontend_inflight_requests` -- concurrent in-flight requests
- `dynamo_frontend_time_to_first_token_seconds` -- TTFT histogram buckets
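
To turn the pre/post snapshots into a per-run number, subtract the two. The sketch below assumes the stage-duration metric is a Prometheus histogram that exposes `_sum` and `_count` series carrying only a `stage` label; real output may include extra labels, so adjust the key matching accordingly. The inline `PRE` and `POST` strings are made-up stand-ins for the contents of `frontend_metrics_pre.txt` and `frontend_metrics_post.txt`.

```python
import re
from typing import Dict

# Matches "name{labels} value" or "name value" in Prometheus text format.
# Simplification: label values containing "}" are not handled.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition into {"name{labels}": value}."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m:
            samples[m.group(1) + (m.group(2) or "")] = float(m.group(3))
    return samples


def mean_stage_seconds(pre: Dict[str, float], post: Dict[str, float], stage: str) -> float:
    """Mean duration of one stage during the run, from histogram _sum/_count deltas."""
    base = "dynamo_frontend_stage_duration_seconds"
    labels = '{stage="%s"}' % stage
    d_sum = post[f"{base}_sum{labels}"] - pre.get(f"{base}_sum{labels}", 0.0)
    d_count = post[f"{base}_count{labels}"] - pre.get(f"{base}_count{labels}", 0.0)
    return d_sum / d_count if d_count else float("nan")


# Made-up sample values for illustration only.
PRE = """
# TYPE dynamo_frontend_stage_duration_seconds histogram
dynamo_frontend_stage_duration_seconds_sum{stage="preprocess"} 2.0
dynamo_frontend_stage_duration_seconds_count{stage="preprocess"} 100
"""
POST = """
# TYPE dynamo_frontend_stage_duration_seconds histogram
dynamo_frontend_stage_duration_seconds_sum{stage="preprocess"} 14.0
dynamo_frontend_stage_duration_seconds_count{stage="preprocess"} 700
"""

pre, post = parse_metrics(PRE), parse_metrics(POST)
print(mean_stage_seconds(pre, post, "preprocess"))
```

Comparing this mean for `preprocess` against `transport_roundtrip` at each RPS point shows whether tokenization or the backend dominates as load rises.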

---

## DGD Template Reference

The `dgd/templates/vllm-gpt-oss-20b.yaml` template is pre-configured for
gpt-oss-20b with TP=2.  To use it with `--deploy-template`:

```bash
python3 sweep_runner.py --mode k8s \
    --deploy-template benchmarks/frontend/dgd/templates/vllm-gpt-oss-20b.yaml \
    --dgd-name gpt-oss-20b-bench \
    --model /model-store/hub/models--openai--gpt-oss-20b/snapshots/main \
    --image <your-image> \
    --worker-replicas 4 \
    ...
```

These variables in the template are substituted at deploy time:
`${DGD_NAME}`, `${IMAGE}`, `${MODEL}`, `${MODEL_NAME}`,
`${WORKER_REPLICAS}`, `${DYN_TOKENIZER_BACKEND}`, `${FRONTEND_PORT}`,
`${ROUTER_MODE}`.
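
To preview what a rendered template might look like, Python's `string.Template` uses the same `${NAME}` placeholder syntax. This is only an illustration: the fragment below is hypothetical (it is not the real template file), and the sweep runner's actual substitution mechanism may differ.

```python
from string import Template

# Hypothetical fragment using two of the variables listed above.
fragment = Template(
    "metadata:\n"
    "  name: ${DGD_NAME}\n"
    "spec:\n"
    "  services:\n"
    "    VllmWorker:\n"
    "      replicas: ${WORKER_REPLICAS}\n"
)

# substitute() raises KeyError if any placeholder is left without a value.
rendered = fragment.substitute(DGD_NAME="gpt-oss-20b-bench", WORKER_REPLICAS=4)
print(rendered)
```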

---

## Tuning Parameters

| Parameter | Recommended Range | Notes |
|-----------|-------------------|-------|
| `--benchmark-duration` | 60-120s | Longer = more stable averages but slower sweep |
| `--concurrency` | 2-4x max target RPS | Must be high enough that aiperf can reach the target rate |
| `--rps` | Start at 10, double until failures | Geometric progression finds the order of magnitude fast |
| `--worker-replicas` | 1-8 | More replicas = higher saturation point but more GPUs |
| `--reset-strategy` | `frontend` for saturation tests | `graph` for clean-baseline TTFT measurements |
| `--isolation` | `reuse_by_deploy_key` for same-tokenizer sweeps | Avoids unnecessary DGD restarts |
| `--max-consecutive-fails` | 2-3 | Higher = more data points at the failure boundary |
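
Two of these rows can be worked out numerically. By Little's Law, the mean number of in-flight requests equals the arrival rate times the mean end-to-end request latency, so the "2-4x max target RPS" rule of thumb corresponds to mean latencies of roughly 2-4 seconds. The sketch below is illustrative only: the function names, the 1.5x headroom factor, and the sample latency are assumptions, not sweep runner behavior.

```python
import math
from typing import List


def required_concurrency(target_rps: float, mean_latency_s: float, headroom: float = 1.5) -> int:
    """Little's Law (in-flight = rate * latency), padded with headroom for bursts."""
    return math.ceil(target_rps * mean_latency_s * headroom)


def geometric_rps(start: int = 10, factor: int = 2, max_rps: int = 100) -> List[int]:
    """Doubling progression for a coarse first sweep."""
    values = []
    rps = start
    while rps <= max_rps:
        values.append(rps)
        rps *= factor
    return values


# Assumed example: 40 req/s with a 2.5 s mean end-to-end latency.
print(required_concurrency(40, 2.5))             # 150
print(",".join(map(str, geometric_rps())))       # 10,20,40,80
```

If measured latency near saturation is much higher than a few seconds, the concurrency ceiling may need to rise accordingly so that aiperf, rather than the system under test, does not become the limit.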