---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Disaggregated Serving
subtitle: Find optimal prefill/decode configuration for disaggregated serving deployments
---

[AIConfigurator](https://github.com/ai-dynamo/aiconfigurator/tree/main) is a performance optimization tool that helps you find the optimal configuration for deploying LLMs with Dynamo. It automatically determines the best number of prefill and decode workers, parallelism settings, and deployment parameters to meet your SLA targets while maximizing throughput.

## Why Use AIConfigurator?

When deploying LLMs with Dynamo, you need to make several critical decisions:
- **Aggregated vs Disaggregated**: Which architecture gives better performance for your workload?
- **Worker Configuration**: How many prefill and decode workers to deploy?
- **Parallelism Settings**: What tensor/pipeline parallel configuration to use?
- **SLA Compliance**: How to meet your TTFT and TPOT targets?

AIConfigurator answers these questions in seconds, providing:
- Recommended configurations that meet your SLA requirements
- Ready-to-deploy Dynamo configuration files (including Kubernetes manifests)
- Performance comparisons between different deployment strategies
- Up to 1.7x better throughput compared to manual configuration

### End-to-End Workflow

![AIConfigurator end-to-end workflow](../../assets/img/e2e-workflow.svg)

### Aggregated vs Disaggregated Architecture

AIConfigurator evaluates two deployment architectures and recommends the best one for your workload:

![Aggregated vs Disaggregated architecture comparison](../../assets/img/arch-comparison.svg)

### When to Use Each Architecture

![Decision flowchart for choosing aggregated vs disaggregated](../../assets/img/decision-flowchart.svg)

## Quick Start

```bash
# Install
pip3 install aiconfigurator

# Find optimal configuration for vLLM backend
aiconfigurator cli default \
  --model Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 16.67 \
  --save-dir ./results_vllm

# Deploy on Kubernetes
kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yaml
```

## Complete Walkthrough: vLLM on H200

This section walks through a validated example deploying Qwen3-32B-FP8 on 8× H200 GPUs using vLLM.

### Step 1: Run AIConfigurator

```bash
aiconfigurator cli default \
  --model Qwen/Qwen3-32B-FP8 \
  --system h200_sxm \
  --total-gpus 8 \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 25 \
  --backend vllm \
  --backend-version 0.12.0 \
  --generator-dynamo-version 0.8.0 \
  --generator-set K8sConfig.k8s_namespace=$YOUR_NAMESPACE \
  --generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC \
  --save-dir ./results_vllm
```

**Parameters explained:**
- `--model`: HuggingFace model ID or local path (e.g., `Qwen/Qwen3-32B-FP8`)
- `--system`: GPU system type (`h200_sxm`, `h100_sxm`, `a100_sxm`)
- `--total-gpus`: Number of GPUs available for deployment
- `--isl` / `--osl`: Input/Output sequence lengths in tokens
- `--ttft` / `--tpot`: SLA targets - Time To First Token (ms) and Time Per Output Token (ms)
- `--backend`: Inference backend (`vllm`, `trtllm`, or `sglang`)
- `--backend-version`: Backend version (e.g., `0.12.0` for vLLM)
- `--save-dir`: Directory to save generated deployment configs
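
The `--tpot` target is easier to choose when it is expressed as a per-user generation rate, since the two are inverses of each other. The following is a minimal Python sketch of that conversion; the function names are illustrative and are not part of the AIConfigurator CLI or API.

```python
def tpot_ms_from_rate(tokens_per_s_per_user: float) -> float:
    """Convert a per-user generation rate (tokens/s) into a TPOT budget in ms."""
    if tokens_per_s_per_user <= 0:
        raise ValueError("rate must be positive")
    return 1000.0 / tokens_per_s_per_user


def rate_from_tpot_ms(tpot_ms: float) -> float:
    """Convert a TPOT budget in ms into a per-user generation rate (tokens/s)."""
    if tpot_ms <= 0:
        raise ValueError("TPOT must be positive")
    return 1000.0 / tpot_ms


# 60 tokens/s per user corresponds to the 16.67 ms target used in Quick Start.
print(round(tpot_ms_from_rate(60), 2))  # 16.67
# The 25 ms target used in this walkthrough corresponds to 40 tokens/s per user.
print(rate_from_tpot_ms(25))            # 40.0
```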

### Step 2: Review the Results

AIConfigurator outputs a comparison of aggregated vs disaggregated deployment strategies:

```text
********************************************************************************
*                     Dynamo aiconfigurator Final Results                      *
********************************************************************************
  ----------------------------------------------------------------------------
  Input Configuration & SLA Target:
    Model: Qwen/Qwen3-32B-FP8 (is_moe: False)
    Total GPUs: 8
    Best Experiment Chosen: disagg at 446.85 tokens/s/gpu (disagg 1.38x better)
  ----------------------------------------------------------------------------
  Overall Best Configuration:
    - Best Throughput: 3,574.80 tokens/s
    - Per-GPU Throughput: 446.85 tokens/s/gpu
    - Per-User Throughput: 53.58 tokens/s/user
    - TTFT: 453.18ms
    - TPOT: 18.66ms
    - Request Latency: 9766.51ms
  ----------------------------------------------------------------------------
  Pareto Frontier:
      Qwen/Qwen3-32B-FP8 Pareto Frontier: tokens/s/gpu_cluster vs tokens/s/user
     ┌─────────────────────────────────────────────────────────────────────────┐
850.0┤ •• agg                                                                  │
     │ ff disagg                                                               │
     │ xx disagg best                                                          │
     │                                                                         │
708.3┤                                                                         │
     │         f                                                               │
     │         f                                                               │
     │          fff                                                            │
566.7┤             f                                                           │
     │             f                                                           │
     │              f                                                          │
     │    ••         fffffffffffffffffx                                        │
425.0┤     ••••                        ff                                      │
     │        •••                       f                                      │
     │           •••••                  f                                      │
     │                ••••••••••        f                                      │
283.3┤                          •••     f                                      │
     │                             ••    f                                     │
     │                               ••  f                                     │
     │                                ••••f                                    │
141.7┤                                   •f•                                   │
     │                                     f•••••                              │
     │                                      f    •••••••                       │
     │                                       fffff      ••••                   │
  0.0┤                                                      ••••               │
     └┬─────────────────┬─────────────────┬─────────────────┬─────────────────┬┘
      0                30                60                90               120
tokens/s/gpu_cluster                tokens/s/user

  ----------------------------------------------------------------------------
  Deployment Details:
    (p) stands for prefill, (d) stands for decode, bs stands for batch size, a replica stands for the smallest scalable unit xPyD of the disagg system
    Some math: total gpus used = replicas * gpus/replica
               gpus/replica = (p)gpus/worker * (p)workers + (d)gpus/worker * (d)workers; for Agg, gpus/replica = gpus/worker
               gpus/worker = tp * pp * dp = etp * ep * pp for MoE models; tp * pp for dense models (underlined numbers are the actual values in math)

agg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
| Rank | backend | tokens/s/gpu | tokens/s/user |  TTFT  | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | gpus/worker | parallel | bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
|  1   |   vllm  |    322.69    |     41.78     | 546.92 |     12490.03    |  64 (=32x2) |     8 (8=2x4)     |    2     |      4       |  4 (=4x1x1) |  tp4pp1  | 32 |
|  2   |   vllm  |    293.94    |     44.43     | 593.10 |     11823.67    |  56 (=14x4) |     8 (8=4x2)     |    4     |      2       |  2 (=2x1x1) |  tp2pp1  | 14 |
|  3   |   vllm  |    208.87    |     42.90     | 460.58 |     12093.52    |  40 (=40x1) |     8 (8=1x8)     |    1     |      8       |  8 (=8x1x1) |  tp8pp1  | 40 |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+

disagg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| Rank | backend | tokens/s/gpu | tokens/s/user |  TTFT  | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | (p)workers | (p)gpus/worker | (p)parallel | (p)bs | (d)workers | (d)gpus/worker | (d)parallel | (d)bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
|  1   |   vllm  |    446.85    |     53.58     | 453.18 |     9766.51     |  76 (=76x1) |     8 (8=1x8)     |    1     | 8 (=2x2+1x4) |     2      |    2 (=2x1)    |    tp2pp1   |   1   |     1      |    4 (=4x1)    |    tp4pp1   |   76  |
|  2   |   vllm  |    446.85    |     41.14     | 453.18 |     12581.87    | 144 (=72x2) |     8 (8=2x4)     |    2     | 4 (=1x2+1x2) |     1      |    2 (=2x1)    |    tp2pp1   |   1   |     1      |    2 (=2x1)    |    tp2pp1   |   72  |
|  3   |   vllm  |    333.73    |     40.22     | 453.18 |     12860.32    |  72 (=36x2) |     8 (8=2x4)     |    2     | 4 (=1x2+2x1) |     1      |    2 (=2x1)    |    tp2pp1   |   1   |     2      |    1 (=1x1)    |    tp1pp1   |   18  |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
```

**Reading the output:**
- **tokens/s/gpu**: Overall throughput efficiency — higher is better
- **tokens/s/user**: Per-request generation speed, the inverse of TPOT (1000 / TPOT in ms)
- **TTFT**: Predicted time to first token
- **concurrency**: Total concurrent requests across all replicas (e.g., `56 (=14x4)` means batch size 14 × 4 replicas)
- **agg Rank 1** recommends TP4 with 2 replicas — simpler to deploy
- **disagg Rank 1** recommends 2 prefill workers (TP2) + 1 decode worker (TP4) — higher throughput but requires RDMA
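
The "Some math" lines in the output can be checked by hand against the disagg Rank 1 row. The Python sketch below does so using only numbers printed above; the class and function names are hypothetical. The request-latency estimate `TTFT + (OSL - 1) × TPOT` is an approximation that happens to land within a few milliseconds of the reported values here, not a formula documented by AIConfigurator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisaggReplica:
    """One xPyD replica: prefill and decode workers with their GPU counts."""
    p_workers: int
    p_gpus_per_worker: int
    d_workers: int
    d_gpus_per_worker: int

    @property
    def gpus(self) -> int:
        # gpus/replica = (p)gpus/worker * (p)workers + (d)gpus/worker * (d)workers
        return (self.p_gpus_per_worker * self.p_workers
                + self.d_gpus_per_worker * self.d_workers)


def approx_request_latency_ms(ttft_ms: float, tpot_ms: float, osl: int) -> float:
    """Assumed approximation: first token, then (osl - 1) further tokens."""
    return ttft_ms + (osl - 1) * tpot_ms


rank1 = DisaggReplica(p_workers=2, p_gpus_per_worker=2,
                      d_workers=1, d_gpus_per_worker=4)
replicas = 1

print(rank1.gpus)                    # 8  (= 2x2 + 1x4)
print(replicas * rank1.gpus)         # 8  total GPUs used
print(round(446.85 * 8, 2))          # 3574.8 tokens/s across the cluster
print(round(1000 / 18.66, 2))        # 53.59, vs. 53.58 reported (TPOT is rounded)
print(round(approx_request_latency_ms(453.18, 18.66, 500), 2))  # 9764.52 vs. 9766.51
```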

### Step 3: Deploy on Kubernetes

The directory passed to `--save-dir` contains ready-to-use Kubernetes manifests:

```
├── agg
│   ├── best_config_topn.csv
│   ├── exp_config.yaml
│   ├── pareto.csv
│   ├── top1
│   │   ├── agg_config.yaml
│   │   ├── bench_run.sh          # aiperf benchmark sweep script (bare-metal)
│   │   ├── generator_config.yaml
│   │   ├── k8s_bench.yaml        # aiperf benchmark sweep Job (Kubernetes)
│   │   ├── k8s_deploy.yaml       # Kubernetes DynamoGraphDeployment
│   │   └── run_0.sh
│   ...
├── disagg
│   ├── best_config_topn.csv
│   ├── exp_config.yaml
│   ├── pareto.csv
│   ├── top1
│   │   ├── bench_run.sh          # aiperf benchmark sweep script (bare-metal)
│   │   ├── decode_config.yaml
│   │   ├── generator_config.yaml
│   │   ├── k8s_bench.yaml        # aiperf benchmark sweep Job (Kubernetes)
│   │   ├── k8s_deploy.yaml       # Kubernetes DynamoGraphDeployment
│   │   ├── prefill_config.yaml
│   │   ├── run_0.sh
│   │   └── run_1.sh  (for multi-node setups)
│   ...
└── pareto_frontier.png
```

#### Prerequisites

Before deploying, ensure you have:

1. **HuggingFace Token Secret** (for gated models):
   ```bash
   kubectl create secret generic hf-token-secret \
     -n your-namespace \
     --from-literal=HF_TOKEN="your-huggingface-token"
   ```

2. **Model Cache PVC** (recommended for faster restarts):
   ```yaml
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: model-cache
     namespace: your-namespace
   spec:
     accessModes:
       - ReadWriteMany
     resources:
       requests:
         storage: 100Gi
   ```

#### Deploy the Configuration

The generated `k8s_deploy.yaml` provides a starting point. You'll typically need to customize it for your environment:

```bash
kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yaml
```

**Complete deployment example** with model cache and production settings:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-agg
  namespace: your-namespace
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false           # Use existing PVC
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          imagePullPolicy: IfNotPresent

    VLLMWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 4
      resources:
        limits:
          gpu: "2"
      sharedMemory:
        size: 16Gi            # Required for vLLM
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--no-enable-prefix-caching"
            - "--tensor-parallel-size"
            - "2"
            - "--pipeline-parallel-size"
            - "1"
            - "--data-parallel-size"
            - "1"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-model-len"
            - "6000"
            - "--max-num-seqs"
            - "1024"
```

**Key deployment settings:**

| Setting | Purpose | Notes |
|---------|---------|-------|
| `backendFramework: vllm` | Tells Dynamo which runtime to use | Required at spec level |
| `pvcs` + `volumeMounts` | Caches model weights across restarts | Mount at `/opt/models` (not `/root/`) |
| `HF_HOME` env var | Points HuggingFace to cache location | Must match `mountPoint` |
| `sharedMemory.size: 16Gi` | IPC memory for vLLM | 16Gi for vLLM, 80Gi for TRT-LLM |
| `envFromSecret` | Injects HF_TOKEN | Required for gated models |
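
Several of these settings have to agree with each other, which is easy to break when editing the manifest by hand. The Python sketch below checks a worker entry, represented as a plain dict that mirrors the `VLLMWorker` block above, for three conditions taken from this walkthrough: `HF_HOME` matches a `mountPoint`, the GPU limit equals `tp × pp × dp`, and `replicas × gpu` fits the GPU budget. The function name and dict layout are assumptions for illustration, not part of Dynamo or AIConfigurator.

```python
def check_agg_worker(worker: dict, total_gpus: int) -> list[str]:
    """Return human-readable problems found in an aggregated worker spec."""
    problems = []

    # HF_HOME should point at the mounted model cache.
    envs = {e["name"]: e["value"] for e in worker.get("envs", [])}
    mounts = [m["mountPoint"] for m in worker.get("volumeMounts", [])]
    if envs.get("HF_HOME") not in mounts:
        problems.append("HF_HOME does not match any volumeMounts mountPoint")

    # gpus/worker = tp * pp * dp; flags that are absent count as 1 here.
    args = worker["extraPodSpec"]["mainContainer"]["args"]

    def flag(name: str) -> int:
        return int(args[args.index(name) + 1]) if name in args else 1

    gpus_per_worker = (flag("--tensor-parallel-size")
                       * flag("--pipeline-parallel-size")
                       * flag("--data-parallel-size"))
    gpu_limit = int(worker["resources"]["limits"]["gpu"])
    if gpu_limit != gpus_per_worker:
        problems.append(f"gpu limit {gpu_limit} != tp*pp*dp {gpus_per_worker}")

    # All replicas together must fit in the GPU budget.
    if worker["replicas"] * gpu_limit > total_gpus:
        problems.append("replicas x gpu exceeds total GPUs")
    return problems


worker = {
    "replicas": 4,
    "resources": {"limits": {"gpu": "2"}},
    "volumeMounts": [{"name": "model-cache", "mountPoint": "/opt/models"}],
    "envs": [{"name": "HF_HOME", "value": "/opt/models"}],
    "extraPodSpec": {"mainContainer": {"args": [
        "--model", "Qwen/Qwen3-32B-FP8",
        "--tensor-parallel-size", "2",
        "--pipeline-parallel-size", "1",
        "--data-parallel-size", "1",
    ]}},
}
print(check_agg_worker(worker, total_gpus=8))  # []
```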

### Step 4: Validate with AIPerf

After deployment, validate the predictions against actual performance using [AIPerf](https://github.com/ai-dynamo/aiperf).

> ℹ️ Run AIPerf **inside the cluster** to avoid network latency affecting measurements.

AIConfigurator (AIC) automatically generates AIPerf scripts alongside the Dynamo configs and stores them in the results folder when `--save-dir` is specified. For Kubernetes deployments, run the benchmark with `k8s_bench.yaml`; on bare-metal systems, use the `bench_run.sh` script. Both execute AIPerf across a concurrency list: the default set (`1 2 8 16 32 64 128`) plus `BenchConfig.estimated_concurrency` and the values within ±5% of it. You can customize this list as needed.
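
To see what such a sweep list looks like for a given estimate, the Python sketch below merges the default set with the estimated concurrency and its ±5% neighbours. The function name is hypothetical and the rounding (floor below, ceiling above) is an assumption; the generated scripts may round differently, so treat their contents as authoritative.

```python
import math

DEFAULT_CONCURRENCY = (1, 2, 8, 16, 32, 64, 128)


def sweep_concurrency(estimated: int, spread: float = 0.05) -> list[int]:
    """Merge the default list with the estimate and its +/- spread neighbours."""
    low = max(1, math.floor(estimated * (1 - spread)))   # assumed rounding
    high = math.ceil(estimated * (1 + spread))           # assumed rounding
    return sorted({*DEFAULT_CONCURRENCY, low, estimated, high})


# Estimated concurrency 56, as in the agg Rank 2 row of the walkthrough.
print(sweep_concurrency(56))  # [1, 2, 8, 16, 32, 53, 56, 59, 64, 128]
```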

By default, AIPerf results are saved in `/tmp/bench_artifacts` inside the container. If a PVC name is specified with `--generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC`, the result artifacts are saved to the PVC volume mount instead.

![AIC-to-AIPerf parameter mapping](../../assets/img/param-mapping.svg)

| AIC Output | AIPerf Parameter | Notes |
|------------|-----------------|-------|
| `concurrency: 56 (=14x4)` | `--concurrency 56` | Use total concurrency when benchmarking via the frontend |
| ISL/OSL targets | `--isl 4000 --osl 500` | Match your AIC inputs |
| - | `--num-requests 800` | Use `concurrency × 40` minimum for statistical stability |
| - | `--extra-inputs "ignore_eos:true"` | Ensures exact OSL tokens generated |

> **Note on concurrency**: AIC reports concurrency as `total (=bs × replicas)`. When benchmarking through the frontend (which routes to all replicas), use the total value. If benchmarking a single replica directly, use the per-replica `bs` value instead.
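
When scripting benchmarks, the concurrency cell can be parsed rather than copied by hand. The following Python sketch splits a cell such as `56 (=14x4)` into its parts and picks the value to pass to `--concurrency` according to the note above. The names `Concurrency`, `parse_concurrency`, and `aiperf_concurrency` are illustrative, and the regular expression assumes the cell format shown in the tables in this guide.

```python
import re
from typing import NamedTuple


class Concurrency(NamedTuple):
    total: int
    batch_size: int
    replicas: int


# Matches cells like "56 (=14x4)": total (= batch size x replicas).
_CELL = re.compile(r"^\s*(\d+)\s*\(=\s*(\d+)\s*x\s*(\d+)\s*\)\s*$")


def parse_concurrency(cell: str) -> Concurrency:
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized concurrency cell: {cell!r}")
    total, batch_size, replicas = map(int, match.groups())
    if total != batch_size * replicas:
        raise ValueError(f"inconsistent concurrency cell: {cell!r}")
    return Concurrency(total, batch_size, replicas)


def aiperf_concurrency(cell: str, via_frontend: bool = True) -> int:
    """Total when going through the frontend, per-replica bs otherwise."""
    parsed = parse_concurrency(cell)
    return parsed.total if via_frontend else parsed.batch_size


print(aiperf_concurrency("56 (=14x4)"))                      # 56
print(aiperf_concurrency("56 (=14x4)", via_frontend=False))  # 14
```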

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
  namespace: your-namespace
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: aiperf
        image: python:3.10
        command:
        - /bin/bash
        - -c
        - |
          pip install aiperf
          aiperf profile \
            -m Qwen/Qwen3-32B-FP8 \
            --endpoint-type chat \
            -u http://dynamo-agg-frontend:8000 \
            --isl 4000 --isl-stddev 0 \
            --osl 500 --osl-stddev 0 \
            --num-requests 800 \
            --concurrency 56 \
            --streaming \
            --extra-inputs "ignore_eos:true" \
            --num-warmup-requests 40 \
            --ui-type simple
```

```bash
kubectl apply -f k8s_bench.yaml
kubectl logs -f -l job-name=aiperf-benchmark
```

**Validated results** (Qwen3-32B-FP8, 8× H200, TP2×4 replicas, aggregated):

| Metric | AIC Prediction | Actual (avg) | Status |
|--------|---------------|--------------|--------|
| TTFT (ms) | 509 | 209 | Better than target |
| ITL/TPOT (ms) | 16.49 | 15.06 | Within 10% |
| Throughput (req/s) | ~6.3 | 6.9 | Within 10% |
| Total Output TPS | ~3,178 | 3,462 | Within 10% |

<Note>
Actual throughput typically reaches ~85-90% of AIC predictions, with ITL/TPOT being the most accurate metric. Expect some variance between benchmark runs; running multiple times is recommended. Enable prefix caching (`--enable-prefix-caching`) for additional TTFT improvements with repeated prompts.
</Note>
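
The "Within 10%" status in the table can be reproduced from the two numeric columns. The Python sketch below computes the signed relative error of each measurement against its prediction; the helper name is illustrative and the numbers are copied from the table above.

```python
def relative_error(predicted: float, actual: float) -> float:
    """Signed error of the measurement relative to the prediction."""
    return (actual - predicted) / predicted


# (AIC prediction, actual average) pairs from the validated-results table.
results = {
    "TTFT (ms)": (509, 209),
    "ITL/TPOT (ms)": (16.49, 15.06),
    "Throughput (req/s)": (6.3, 6.9),
    "Total Output TPS": (3178, 3462),
}

for metric, (predicted, actual) in results.items():
    error = relative_error(predicted, actual)
    within = abs(error) <= 0.10
    print(f"{metric:<20} {error:+.1%}  within 10%: {within}")
```

For latency metrics a negative error means the deployment is faster than predicted, which is why the TTFT row is reported as better than target rather than as a miss.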

## Fine-Tuning Your Deployment

AIConfigurator provides a strong starting point. Here's how to iterate for production:

### Adjusting for Actual Workload

If your real workload differs from the benchmark parameters:

```bash
# For longer outputs (chat/code generation):
# increase OSL, relax TTFT target
aiconfigurator cli default \
  --model Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --isl 2000 \
  --osl 2000 \
  --ttft 1000 \
  --tpot 10 \
  --save-dir ./results_long_output
```

### Exploring Alternative Configurations

Use `exp` mode to compare custom configurations:

```yaml
# custom_exp.yaml
exps:
  - exp_tp2
  - exp_tp4

exp_tp2:
  mode: "patch"
  serving_mode: "agg"
  model_path: "Qwen/Qwen3-32B-FP8"
  total_gpus: 8
  system_name: "h200_sxm"
  backend_name: "vllm"
  backend_version: "0.12.0"
  isl: 4000
  osl: 500
  ttft: 600
  tpot: 16.67
  config:
    agg_worker_config:
      tp_list: [2]

exp_tp4:
  mode: "patch"
  serving_mode: "agg"
  model_path: "Qwen/Qwen3-32B-FP8"
  total_gpus: 8
  system_name: "h200_sxm"
  backend_name: "vllm"
  backend_version: "0.12.0"
  isl: 4000
  osl: 500
  ttft: 600
  tpot: 16.67
  config:
    agg_worker_config:
      tp_list: [4]
```

```bash
aiconfigurator cli exp --yaml-path custom_exp.yaml --save-dir ./results_custom
```

> **Critical**: Disaggregated deployments **require RDMA** for KV cache transfer. Without RDMA, performance degrades by **40x** (TTFT increases from 355ms to 10+ seconds). See the Deploying Disaggregated (RDMA Required) section below.

### Deploying Disaggregated (RDMA Required)

Disaggregated deployments transfer KV cache between prefill and decode workers. **Without RDMA, this transfer becomes a severe bottleneck**, causing 40x performance degradation.

#### Prerequisites for Disaggregated

1. **RDMA-capable network** (InfiniBand or RoCE)
2. **RDMA device plugin** installed on the cluster (provides `rdma/ib` resources)
3. **ETCD and NATS** deployed (for coordination)

#### Disaggregated DGD with RDMA

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-disagg
  namespace: your-namespace
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          imagePullPolicy: IfNotPresent

    VLLMPrefillWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: prefill
      replicas: 2
      resources:
        limits:
          gpu: "2"
      sharedMemory:
        size: 16Gi
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
        - name: UCX_TLS
          value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"  # Enable RDMA transports
        - name: UCX_RNDV_SCHEME
          value: "get_zcopy"
        - name: UCX_RNDV_THRESH
          value: "0"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]  # Required for RDMA memory registration
          resources:
            limits:
              rdma/ib: "2"      # Request RDMA resources
            requests:
              rdma/ib: "2"
          command: ["python3", "-m", "dynamo.vllm"]
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--tensor-parallel-size"
            - "2"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-num-seqs"
            - "1"               # Prefill workers use batch size 1
            - --disaggregation-mode
            - prefill

    VLLMDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: decode
      replicas: 1
      resources:
        limits:
          gpu: "4"
      sharedMemory:
        size: 16Gi
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
        - name: UCX_TLS
          value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
        - name: UCX_RNDV_SCHEME
          value: "get_zcopy"
        - name: UCX_RNDV_THRESH
          value: "0"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.0
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          resources:
            limits:
              rdma/ib: "4"
            requests:
              rdma/ib: "4"
          command: ["python3", "-m", "dynamo.vllm"]
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--tensor-parallel-size"
            - "4"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-num-seqs"
            - "1024"            # Decode workers handle high concurrency
            - --disaggregation-mode
            - decode
```

**Critical RDMA settings:**

| Setting | Purpose |
|---------|---------|
| `rdma/ib: "N"` | Request N RDMA resources (match TP size) |
| `IPC_LOCK` capability | Required for RDMA memory registration |
| `UCX_TLS` env var | Enables RDMA transports (rc_x, dc_x) |
| `UCX_RNDV_SCHEME=get_zcopy` | Zero-copy RDMA transfers |
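
Because a missing RDMA setting fails silently (the deployment runs, only much slower), it can help to lint a worker spec before applying it. The Python sketch below checks the four settings from the table against a worker represented as a plain dict mirroring the manifest above. The function name and dict layout are assumptions for illustration; this is not a Dynamo or AIConfigurator API.

```python
RDMA_TRANSPORTS = {"rc_x", "rc", "dc_x", "dc"}


def rdma_problems(worker: dict, tp_size: int) -> list[str]:
    """Return the RDMA-related settings that are missing or inconsistent."""
    problems = []
    container = worker.get("extraPodSpec", {}).get("mainContainer", {})

    # rdma/ib should be requested and match the tensor-parallel size.
    limits = container.get("resources", {}).get("limits", {})
    if str(limits.get("rdma/ib")) != str(tp_size):
        problems.append(f"rdma/ib limit should be {tp_size}")

    # IPC_LOCK is needed for RDMA memory registration.
    added = (container.get("securityContext", {})
             .get("capabilities", {}).get("add", []))
    if "IPC_LOCK" not in added:
        problems.append("IPC_LOCK capability missing")

    # UCX must be allowed to use at least one RDMA transport, with zero-copy.
    envs = {e["name"]: e["value"] for e in worker.get("envs", [])}
    if not RDMA_TRANSPORTS & set(envs.get("UCX_TLS", "").split(",")):
        problems.append("UCX_TLS lists no RDMA transport")
    if envs.get("UCX_RNDV_SCHEME") != "get_zcopy":
        problems.append("UCX_RNDV_SCHEME is not get_zcopy")
    return problems


prefill = {
    "envs": [
        {"name": "UCX_TLS", "value": "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"},
        {"name": "UCX_RNDV_SCHEME", "value": "get_zcopy"},
    ],
    "extraPodSpec": {"mainContainer": {
        "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
        "resources": {"limits": {"rdma/ib": "2"}},
    }},
}
print(rdma_problems(prefill, tp_size=2))  # []
print(len(rdma_problems({}, tp_size=2)))  # 4
```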

#### Verifying RDMA is Active

After deployment, check the worker logs for UCX initialization:

```bash
kubectl logs <prefill-worker-pod> | grep -i "UCX\|NIXL"
```

You should see:
```
NIXL INFO Backend UCX was instantiated
```

If you see only TCP transports, RDMA is not active; check your RDMA device plugin and resource requests.
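
The same check can be automated across several pods. The Python sketch below looks for the NIXL line shown above in a block of log text; the function name is hypothetical and the match string is taken from the expected output in this section. A match only confirms that the UCX backend was instantiated, so it complements rather than replaces a look at which transports are in use.

```python
def nixl_ucx_lines(log_text: str) -> list[str]:
    """Return log lines reporting that NIXL instantiated the UCX backend."""
    return [
        line for line in log_text.splitlines()
        if "NIXL" in line and "Backend UCX was instantiated" in line
    ]


sample_log = "worker starting\nNIXL INFO Backend UCX was instantiated\nready"
print(bool(nixl_ucx_lines(sample_log)))          # True
print(bool(nixl_ucx_lines("worker starting")))   # False
```

In practice the log text could come from `subprocess.run(["kubectl", "logs", pod_name], capture_output=True, text=True).stdout`, where `pod_name` is whichever worker pod you want to inspect.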

### Tuning vLLM-Specific Parameters

Override vLLM engine parameters with `--generator-set`:

```bash
aiconfigurator cli default \
  --model Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --isl 4000 --osl 500 \
  --ttft 600 --tpot 16.67 \
  --save-dir ./results_tuned \
  --generator-set Workers.agg.kv_cache_free_gpu_memory_fraction=0.85 \
  --generator-set Workers.agg.max_num_seqs=2048
```

Run `aiconfigurator cli default --generator-help` to see all available parameters.

### Prefix Caching Considerations

For workloads with repeated prefixes (e.g., system prompts):

- **Enable prefix caching** when you have high prefix hit rates
- **Disable prefix caching** (`--no-enable-prefix-caching`) for diverse prompts

AIConfigurator's default predictions assume no prefix caching. Enable it post-deployment if your workload benefits.

## Supported Configurations

### Backends and Versions

For a comprehensive breakdown of which model/system/backend/version combinations are supported in both aggregated and disaggregated modes, refer to the [**support matrix CSV**](https://github.com/ai-dynamo/aiconfigurator/blob/main/src/aiconfigurator/systems/support_matrix.csv). This file is automatically generated and tested to ensure accuracy across all supported configurations.

You can also check if a system / framework version is supported via the `aiconfigurator cli support` command. For example:
```bash
aiconfigurator cli support --model Qwen/Qwen3-32B-FP8 --system h100_sxm --backend-version 1.2.0rc5
```


## Common Use Cases

```bash
# Strict latency SLAs (real-time chat)
aiconfigurator cli default \
  --model meta-llama/Llama-3.1-70B \
  --total-gpus 16 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --ttft 200 --tpot 8

# High throughput (batch processing)
aiconfigurator cli default \
  --model Qwen/Qwen3-32B-FP8 \
  --total-gpus 32 \
  --system h200_sxm \
  --backend trtllm \
  --ttft 2000 --tpot 50

# Request latency constraint (end-to-end SLA)
aiconfigurator cli default \
  --model Qwen/Qwen3-32B-FP8 \
  --total-gpus 16 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --request-latency 12000 \
  --isl 4000 --osl 500
```

## Additional Options

```bash
# Web interface for interactive exploration
pip3 install "aiconfigurator[webapp]"  # quoted so shells such as zsh do not expand the brackets
aiconfigurator webapp  # Visit http://127.0.0.1:7860

# Quick config generation (no parameter sweep)
aiconfigurator cli generate \
  --model Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm

# Check model/system support
aiconfigurator cli support \
  --model Qwen/Qwen3-32B-FP8 \
  --system h200_sxm \
  --backend vllm
```

## Troubleshooting

### AIConfigurator Issues

**Model not found**: Use the full HuggingFace path (e.g., `Qwen/Qwen3-32B-FP8` not `QWEN3_32B`)

**Backend version mismatch**: Check supported versions with `aiconfigurator cli support --model <model> --system <system> --backend <backend>`

### Deployment Issues

**Pods crash with "Permission denied" on cache directory**:
- Mount the PVC at `/opt/models` instead of `/root/.cache/huggingface`
- Set `HF_HOME=/opt/models` environment variable
- Ensure the PVC has `ReadWriteMany` access mode

**Workers stuck in CrashLoopBackOff**:
- Check logs: `kubectl logs <pod-name> --previous`
- Verify `sharedMemory.size` is set (16Gi for vLLM, 80Gi for TRT-LLM)
- Ensure HuggingFace token secret exists and is named correctly

**Model download slow on every restart**:
- Add PVC for model caching (see deployment example above)
- Verify `volumeMounts` and `HF_HOME` are configured on workers

**"Context stopped or killed" errors (disaggregated only)**:
- Deploy ETCD and NATS infrastructure (required for KV cache transfer)
- See [Dynamo Kubernetes Guide](../../kubernetes/README.md) for platform setup

### Performance Issues

**OOM errors**: Reduce `--max-num-seqs` or increase tensor parallelism

**Performance below predictions**:
- Verify warmup requests are sufficient (40+ recommended)
- Check for competing workloads on the cluster
- Ensure KV cache memory fraction is optimized
- Run benchmarks from inside the cluster to eliminate network latency

**Disaggregated TTFT extremely high (10+ seconds)**:
This is almost always caused by **missing RDMA configuration**. Without RDMA, KV cache transfer falls back to TCP and becomes a severe bottleneck.

To diagnose:
```bash
# Check if RDMA resources are allocated
kubectl get pod <worker-pod> -o yaml | grep -A5 "resources:"

# Check UCX transport in logs
kubectl logs <worker-pod> | grep -i "UCX\|transport"
```

To fix:
1. Ensure your cluster has RDMA device plugin installed
2. Add `rdma/ib` resource requests to worker pods
3. Add `IPC_LOCK` capability to security context
4. Add UCX environment variables (see the Deploying Disaggregated (RDMA Required) section)

**Disaggregated working but throughput lower than aggregated**:
For balanced workloads (ISL/OSL ratio between 2:1 and 10:1), aggregated is often better. Disaggregated shines for:
- Very long inputs (ISL > 8000) with short outputs
- Workloads needing independent prefill/decode scaling

## Learn More

- [AIConfigurator CLI Guide](https://github.com/ai-dynamo/aiconfigurator/blob/main/docs/cli_user_guide.md)
- [Dynamo Deployment Guide](https://github.com/ai-dynamo/aiconfigurator/blob/main/docs/dynamo_deployment_guide.md)
- [Dynamo Installation Guide](../../kubernetes/installation-guide.md)
- [Benchmarking Guide](../../benchmarks/benchmarking.md)