<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Dynamo Production-Ready Recipes

Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo.

> **Prerequisites:** This guide assumes you have already installed the Dynamo Kubernetes Platform.
> If not, follow the **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** first.

## Available Recipes

### Feature Comparison Recipes

These recipes compare Dynamo performance features with benchmark results, each including both baseline and optimized deployment configurations:

| Model | Framework | Configuration | GPUs | Features |
|-------|-----------|---------------|------|----------|
| **[Qwen3-32B](qwen3-32b/)** | vLLM | Disagg + KV-Router | 16x H200 | **Disaggregated Serving + KV-Aware Routing** — benchmark comparison with real-world Mooncake traces |
| **[DeepSeek-V3.2-NVFP4](deepseek-v32-fp4/)** | TensorRT-LLM | Agg + Disagg WideEP | 32x GB200 | **Disaggregated Serving + KV-Aware Routing** — benchmark comparison with Mooncake-based synthetic coding trace |
| **[Qwen3-VL-30B-A3B-FP8](qwen3-vl-30b/)** | vLLM | Agg + Embedding Cache | 1x GB200 | **Multimodal Embedding Cache** — benchmark comparison showing +16% throughput, -28% TTFT |

### Aggregated & Disaggregated Recipes

These recipes demonstrate aggregated or disaggregated serving:

**GAIE Column**: Indicates whether the recipe includes integration with the [Gateway API Inference Extension (GAIE)](../deploy/inference-gateway/README.md) — a Kubernetes SIG project that extends the Gateway API for AI inference workloads, providing load balancing, model routing, and request management.

| Model | Framework | Mode | GPUs | Deployment | Benchmark | Notes | GAIE |
|-------|-----------|------|------|------------|-----------|-------|------|
| **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization | ✅ |
| **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation | ❌ |
| **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each | ❌ |
| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 2x H100/H200/A100 | ✅ | ✅ | FP8 quantization | ❌ |
| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x H100/H200/A100 | ✅ | ✅ | Prefill + Decode separation | ❌ |
| **[Qwen3-32B-FP8](qwen3-32b-fp8/vllm/disagg/)** | vLLM | Disagg (Single-Node) | 8x A100 | ✅ | ✅ | 2× TP2 prefill + 1× TP4 decode, NixlConnector KV transfer | ❌ |
| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 16x H100/H200 | ✅ | ✅ | MoE model, TP4×EP4 | ❌ |
| **[Qwen3-235B-A22B-FP8](qwen3-235b-a22b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 16x H100/H200 | ✅ | ✅ | MoE model, Prefill + Decode | ❌ |
| **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ |
| **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 5x Blackwell (GB200/B200) | ✅ | ✅ | Prefill/Decode split | ❌ |
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | TP=8, single-node. Use `model-download-sglang.yaml` | ❌ |
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 32x H200 | ✅ | ❌ | TP=16, multi-node. Use `model-download-sglang.yaml` | ❌ |
| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 36x GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |
| **[DeepSeek-R1](deepseek-r1/)** | vLLM | Disagg DEP16 | 32x H200 | ✅ | ❌ | Multi-node, data-expert parallel | ❌ |

**Legend:**
- **Deployment**: ✅ = Complete `deploy.yaml` manifest available
- **Benchmark**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks

### Functional Recipes (Not Yet Benchmarked)

These recipes demonstrate functional deployments with Dynamo features, but have not yet been performance-tuned or paired with benchmark manifests.

| Model | Framework | Mode | GPUs | Deployment | Notes |
|-------|-----------|-------|------|------------|-------|
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | TP=4, KV-aware routing |
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/agg/)** | SGLang | Aggregated | 4x H100/H200 | ✅ | TP=4, KV-aware routing, 1.0+ |
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, UCX KV transfer |
| **[Nemotron-3-Super-FP8](nemotron-3-super-fp8/sglang/disagg/)** | SGLang | Disaggregated | 4x H100/H200 | ✅ | TP=2 prefill/decode split, nixl KV transfer, 1.0+ |
| **[Kimi-K2.5 (Baseten)](kimi-k2.5/trtllm/agg/baseten/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling |

### Experimental Recipes

These recipes are under active development and may require additional setup steps (e.g., container patching). They are functional but not yet fully validated for production use.

| Model | Framework | Mode | GPUs | Deployment | Notes |
|-------|-----------|------|------|------------|-------|
| **[GLM-5-NVFP4](glm-5-nvfp4/sglang/disagg/)** | SGLang | Disagg Prefill/Decode | 20x GB200 | ✅ | NVFP4, EAGLE speculative decoding, TP16 decode + TP4 prefill. Requires [custom container build](glm-5-nvfp4/). |
| **[nvidia/Kimi-K2.5-NVFP4](kimi-k2.5/trtllm/agg/nvidia/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling. Vision input not yet functional. |
| **[DeepSeek-V4-Flash](deepseek-v4-flash/vllm/agg/)** | vLLM | Aggregated | 4x B200 | ✅ | Text only — MoE model (284B / 13B active), DP=4 + EP, FP8 KV cache, reasoning + tool calling. Requires [custom container build](deepseek-v4-flash/container/). |
| **[DeepSeek-V4-Pro](deepseek-v4-pro/vllm/agg/)** | vLLM | Aggregated | 8x B200 | ✅ | Text only — MoE model (1.6T / 49B active, 1M context), TP=8 + EP, FP4+FP8 mixed checkpoint, FP8 KV cache, CSA+HCA attention, three reasoning effort modes, tool calling. Requires [custom container build](deepseek-v4-pro/container/). |

## Recipe Structure

Each complete recipe follows this standard structure:

```
<model-name>/
├── README.md (optional)           # Model-specific deployment notes
├── model-cache/
│   ├── model-cache.yaml          # PersistentVolumeClaim for model storage
│   └── model-download.yaml       # Job to download model from HuggingFace
└── <framework>/                  # vllm, sglang, or trtllm
    └── <deployment-mode>/        # agg, disagg, disagg-single-node, etc.
        ├── deploy.yaml           # Complete DynamoGraphDeployment manifest
        └── perf.yaml (optional)  # AIPerf benchmark job
```
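
The layout above can be checked before anything is applied to a cluster. The sketch below is a hypothetical helper (`check_recipe` is not part of this repository) that assumes the standard structure: it reports missing model cache manifests and lists every `deploy.yaml` it finds, noting whether a `perf.yaml` sits next to it.

```bash
# check_recipe: hypothetical helper, assumes the standard recipe layout.
# Usage: check_recipe <model-dir>
check_recipe() {
  model_dir=${1%/}   # drop a trailing slash so relative paths print cleanly
  status=0

  # Model cache manifests shared by all deployment modes of the model.
  for f in model-cache.yaml model-download.yaml; do
    if [ ! -f "$model_dir/model-cache/$f" ]; then
      echo "missing: model-cache/$f"
      status=1
    fi
  done

  # One line per deployment manifest; perf.yaml is optional, so it is only noted.
  find "$model_dir" -name deploy.yaml | sort | while read -r deploy; do
    rel=${deploy#"$model_dir"/}
    if [ -f "$(dirname "$deploy")/perf.yaml" ]; then
      echo "deploy: $rel (perf.yaml present)"
    else
      echo "deploy: $rel (no perf.yaml)"
    fi
  done

  return "$status"
}
```

Run from the `recipes` directory, `check_recipe llama-3-70b` would print one line per deployment mode. Recipes that deviate from the standard layout, for example those that ship a differently named download job such as `model-download-sglang.yaml`, may be reported as incomplete even though they deploy correctly.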

## Quick Start

### Prerequisites

**1. Dynamo Platform Installed**

The recipes require the Dynamo Kubernetes Platform to be installed. Follow the installation guide:

- **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Quickstart (~10 minutes)
- **[Detailed Installation Guide](../docs/kubernetes/installation-guide.md)** - Advanced options

**2. GPU Cluster Requirements**

Ensure your cluster has:
- GPU nodes matching the recipe's requirements (see the GPUs column in the tables above)
- GPU operator installed
- Appropriate GPU drivers and container runtime

**3. HuggingFace Access**

Configure authentication to download models:

```bash
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}
```
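
`kubectl create secret` returns an error when the secret already exists, and a literal token on the command line tends to end up in shell history. One possible variant, sketched here as a hypothetical helper (`create_hf_secret` is not part of the recipes), reads the token from an `HF_TOKEN` environment variable and renders the Secret client-side before applying it, so the command can be re-run safely.

```bash
# create_hf_secret: hypothetical helper. Expects HF_TOKEN and NAMESPACE
# to be set in the environment.
create_hf_secret() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  if [ -z "${NAMESPACE:-}" ]; then
    echo "NAMESPACE is not set" >&2
    return 1
  fi
  # --dry-run=client -o yaml renders the Secret locally without creating it;
  # piping the result to `kubectl apply` creates or updates the secret.
  kubectl create secret generic hf-token-secret \
    --from-literal=HF_TOKEN="$HF_TOKEN" \
    -n "$NAMESPACE" --dry-run=client -o yaml |
    kubectl apply -f - -n "$NAMESPACE"
}
```

The secret name `hf-token-secret` and the key `HF_TOKEN` are the ones used in the command above; the recipes' download jobs presumably reference them, so they are kept unchanged.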

**4. Storage Configuration**

Update the `storageClassName` in `<model>/model-cache/model-cache.yaml` to match your cluster:

```bash
# Find your storage class name
kubectl get storageclass

# Edit the model-cache.yaml file and update:
# spec:
#   storageClassName: "your-actual-storage-class"
```
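
If you prefer to script the edit rather than open the file by hand, something like the following could work. It is a minimal sketch, assuming the manifest carries `storageClassName:` on a line of its own; `set_storage_class` is a hypothetical helper, not a tool shipped with the recipes.

```bash
# set_storage_class: hypothetical helper.
# Usage: set_storage_class <path-to-model-cache.yaml> <storage-class-name>
set_storage_class() {
  file=$1
  class=$2
  tmp=$(mktemp) || return 1
  # Rewrite the value of each `storageClassName:` line, keeping its indentation.
  # Commented-out lines are left alone because the pattern only allows
  # whitespace before the key.
  sed "s|^\([[:space:]]*storageClassName:\).*|\1 \"$class\"|" "$file" > "$tmp" &&
    cat "$tmp" > "$file"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

For example, `set_storage_class <model>/model-cache/model-cache.yaml your-actual-storage-class`, using a name reported by `kubectl get storageclass`.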

### Deploy a Recipe

**Step 1: Download Model**

```bash
cd recipes
# Update storageClassName in model-cache.yaml first!
kubectl apply -f <model>/model-cache/ -n ${NAMESPACE}

# Wait for download to complete (may take 10-60 minutes depending on model size)
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s

# Monitor progress (run in a second terminal, since the wait above blocks until the job completes)
kubectl logs -f job/model-download -n ${NAMESPACE}
```

**Step 2: Deploy Service**

Update the container `image` in `<model>/<framework>/<mode>/deploy.yaml` to the runtime image and version tag you intend to run, then apply the manifest:

```bash
kubectl apply -f <model>/<framework>/<mode>/deploy.yaml -n ${NAMESPACE}

# Check deployment status
kubectl get dynamographdeployment -n ${NAMESPACE}

# Check pod status
kubectl get pods -n ${NAMESPACE}

# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=<deployment-name> -n ${NAMESPACE} --timeout=600s
```

**Step 3: Test Deployment**

```bash
# Port forward to access the service locally
kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n ${NAMESPACE}

# In another terminal, test the endpoint
curl http://localhost:8000/v1/models

# Send a test request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-name>",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
```
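
Right after the port-forward starts, the frontend may not answer immediately. A small polling loop can stand in for retrying by hand. This is a sketch under assumptions: `wait_for_frontend` is a hypothetical helper, and the default URL matches the `8000:8000` port-forward shown above.

```bash
# wait_for_frontend: hypothetical helper that polls /v1/models.
# Usage: wait_for_frontend [url] [attempts] [delay-seconds]
wait_for_frontend() {
  url=${1:-http://localhost:8000/v1/models}
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s silences progress output, -f turns HTTP errors into a non-zero exit.
    if curl -sf "$url" >/dev/null; then
      echo "frontend ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "frontend not ready after $attempts attempts" >&2
  return 1
}
```

With the defaults this waits roughly a minute. Chaining it as `wait_for_frontend && curl http://localhost:8000/v1/models` keeps the test request from racing the port-forward.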

**Step 4: Run Benchmark (Optional)**

```bash
# Only if perf.yaml exists in the recipe directory
kubectl apply -f <model>/<framework>/<mode>/perf.yaml -n ${NAMESPACE}

# Monitor benchmark progress
kubectl logs -f job/<benchmark-job-name> -n ${NAMESPACE}

# View results after completion
kubectl logs job/<benchmark-job-name> -n ${NAMESPACE} | tail -50
```
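
Because `perf.yaml` is optional, a script that walks several recipes may want to skip the benchmark step when the file is absent. A minimal sketch, assuming `NAMESPACE` is exported as in the earlier steps (`run_benchmark` is a hypothetical helper):

```bash
# run_benchmark: hypothetical helper.
# Usage: run_benchmark <model>/<framework>/<mode>
run_benchmark() {
  recipe_dir=${1%/}   # tolerate a trailing slash
  if [ ! -f "$recipe_dir/perf.yaml" ]; then
    echo "no perf.yaml in $recipe_dir, skipping benchmark"
    return 0
  fi
  # Abort with a message if NAMESPACE was never exported.
  kubectl apply -f "$recipe_dir/perf.yaml" -n "${NAMESPACE:?NAMESPACE must be set}"
}
```

Returning success for a missing file is a deliberate choice here so that a loop over recipes keeps going; return a non-zero status instead if a missing benchmark should stop the script.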


## Example Deployments

### Llama-3-70B with vLLM (Aggregated)

```bash
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# Create HF token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token" \
  -n ${NAMESPACE}

# Deploy
cd recipes
kubectl apply -f llama-3-70b/model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}

# Test
kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
```

### Inference Gateway (GAIE) Integration (Optional)

For Llama-3-70B with vLLM (Aggregated), an example of integration with the Inference Gateway is provided.

First, deploy the Dynamo Graph per instructions above.

Then follow [Deploy Inference Gateway Section 2](../deploy/inference-gateway/README.md#2-deploy-inference-gateway) to install GAIE.

Update `containers.epp.image` in the EPP deployment file, `llama-3-70b/vllm/agg/gaie/k8s-manifests/epp/deployment.yaml`. The image should match the release tag and use the format `nvcr.io/nvidia/ai-dynamo/frontend:<version>`, for example `nvcr.io/nvidia/ai-dynamo/frontend:0.9.0`.

The recipe assumes the Kubernetes discovery backend and sets the `DYN_DISCOVERY_BACKEND` environment variable in the EPP deployment. To use etcd instead, enable the lines below and remove the `DYN_DISCOVERY_BACKEND` environment variable:

```yaml
- name: ETCD_ENDPOINTS
  value: "dynamo-platform-etcd.$(PLATFORM_NAMESPACE):2379" # update dynamo-platform to appropriate namespace
```

```bash
export DEPLOY_PATH=llama-3-70b/vllm/agg/
# DEPLOY_PATH=<model>/<framework>/<mode>/
kubectl apply -R -f "$DEPLOY_PATH/gaie/k8s-manifests" -n "$NAMESPACE"
```

### DeepSeek-R1 on GB200 (Multi-node)

See [deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml](deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml) for the complete multi-node WideEP configuration.

## Customization

Each `deploy.yaml` contains:
- **ConfigMap**: Engine-specific configuration (embedded in the manifest)
- **DynamoGraphDeployment**: Kubernetes resource definitions
- **Resource limits**: GPU count, memory, CPU requests/limits
- **Image references**: Container images with version tags

### Key Customization Points

**Model Configuration:**
```yaml
# In deploy.yaml under worker args:
args:
  - python3 -m dynamo.vllm --model <your-model-path> --served-model-name <name>
```

**GPU Resources:**
```yaml
resources:
  limits:
    gpu: "4"  # Adjust based on your requirements
  requests:
    gpu: "4"
```

**Scaling:**
```yaml
services:
  VllmDecodeWorker:
    replicas: 2  # Scale to multiple workers
```
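
Changing `replicas` changes how many GPUs the deployment needs in total. As a back-of-the-envelope check, assuming every replica of a worker requests the same per-worker GPU count, the total is the sum of replicas times GPUs per replica over all workers. The `total_gpus` function below is a hypothetical helper for that arithmetic; the example numbers come from the Qwen3-32B-FP8 vLLM disaggregated recipe in the table above (2× TP2 prefill + 1× TP4 decode on 8 GPUs).

```bash
# total_gpus: hypothetical helper.
# Usage: total_gpus REPLICAS:GPUS_PER_REPLICA [REPLICAS:GPUS_PER_REPLICA ...]
total_gpus() {
  total=0
  for spec in "$@"; do
    replicas=${spec%%:*}   # text before the colon
    gpus=${spec##*:}       # text after the colon
    total=$((total + replicas * gpus))
  done
  echo "$total"
}

# 2 prefill workers at TP2 plus 1 decode worker at TP4
total_gpus 2:2 1:4   # prints 8
```

Comparing the result with the GPUs actually available in the cluster before raising `replicas` helps avoid pods that stay in Pending.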

**Router Mode:**
```yaml
# In Frontend args:
args:
  - python3 -m dynamo.frontend --router-mode kv --http-port 8000
# Options: round-robin, kv (KV-aware routing)
```

**Container Images:**
```yaml
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z
# Update version tag as needed
```

## Troubleshooting

### Common Issues

**Pods stuck in Pending:**
- Check GPU availability: `kubectl describe node <node-name>`
- Verify storage class exists: `kubectl get storageclass`
- Check resource requests vs. available resources
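
The checks above can be narrowed to the pods that are actually stuck. The following is a sketch, not an official tool: `describe_pending_pods` is a hypothetical helper that lists Pending pods in a namespace and prints only the Events section of each, which is usually where the scheduler states why a pod could not be placed.

```bash
# describe_pending_pods: hypothetical helper.
# Usage: describe_pending_pods <namespace>
describe_pending_pods() {
  ns=$1
  # -o name prints entries such as "pod/<name>", which `describe` accepts as-is.
  kubectl get pods -n "$ns" --field-selector=status.phase=Pending -o name |
    while read -r pod; do
      echo "== $pod"
      # Keep only the Events section at the end of the describe output.
      kubectl describe "$pod" -n "$ns" </dev/null | sed -n '/^Events:/,$p'
    done
}
```

For example, `describe_pending_pods ${NAMESPACE}`. An empty result means no pod in that namespace is currently in the Pending phase.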

**Model download fails:**
- Verify HuggingFace token is correct
- Check network connectivity from cluster
- Review job logs: `kubectl logs job/model-download -n ${NAMESPACE}`

**Workers fail to start:**
- Check GPU compatibility (driver version, CUDA version)
- Verify image pull secrets if using private registries
- Review pod logs: `kubectl logs <pod-name> -n ${NAMESPACE}`

**For more troubleshooting:**
- [Kubernetes Deployment Guide](../docs/kubernetes/README.md#troubleshooting)
- [Observability Documentation](../docs/kubernetes/observability/)

## Related Documentation

- **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Platform installation and concepts
- **[API Reference](../docs/kubernetes/api-reference.md)** - DynamoGraphDeployment CRD specification
- **[vLLM Backend Guide](../docs/backends/vllm/README.md)** - vLLM-specific features
- **[SGLang Backend Guide](../docs/backends/sglang/README.md)** - SGLang-specific features
- **[TensorRT-LLM Backend Guide](../docs/backends/trtllm/README.md)** - TensorRT-LLM features
- **[Observability](../docs/kubernetes/observability/)** - Monitoring and logging
- **[Benchmarking Guide](../docs/benchmarks/benchmarking.md)** - Performance testing

## Contributing

We welcome contributions of new recipes! See [CONTRIBUTING.md](CONTRIBUTING.md) for:
- Recipe submission guidelines
- Required components checklist
- Testing and validation requirements
- Documentation standards

### Recipe Quality Standards

A production-ready recipe must include:
- ✅ Complete `deploy.yaml` with DynamoGraphDeployment
- ✅ Model cache PVC and download job
- ✅ Benchmark recipe (`perf.yaml`) for performance testing
- ✅ Verification on target hardware
- ✅ Documentation of GPU requirements