# Dynamo Production-Ready Recipes

Production-tested Kubernetes deployment recipes for LLM inference using NVIDIA Dynamo.

> **Prerequisites:** This guide assumes you have already installed the Dynamo Kubernetes Platform.
> If not, follow the **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** first.

## Available Recipes

| Model | Framework | Mode | GPUs | Deployment | Benchmark Recipe | Notes | GAIE integration |
|-------|-----------|------|------|------------|------------------|-------|------------------|
| **[Llama-3-70B](llama-3-70b/vllm/agg/)** | vLLM | Aggregated | 4x H100/H200 | ✅ | ✅ | FP8 dynamic quantization | ✅ |
| **[Llama-3-70B](llama-3-70b/vllm/disagg-single-node/)** | vLLM | Disagg (Single-Node) | 8x H100/H200 | ✅ | ✅ | Prefill + Decode separation | ❌ |
| **[Llama-3-70B](llama-3-70b/vllm/disagg-multi-node/)** | vLLM | Disagg (Multi-Node) | 16x H100/H200 | ✅ | ✅ | 2 nodes, 8 GPUs each | ❌ |
| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GPU | ✅ | ✅ | FP8 quantization | ❌ |
| **[Qwen3-32B-FP8](qwen3-32b-fp8/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | 8x GPU | ✅ | ✅ | Prefill + Decode separation | ❌ |
| **[GPT-OSS-120B](gpt-oss-120b/trtllm/agg/)** | TensorRT-LLM | Aggregated | 4x GB200 | ✅ | ✅ | Blackwell only, WideEP | ❌ |
| **[GPT-OSS-120B](gpt-oss-120b/trtllm/disagg/)** | TensorRT-LLM | Disaggregated | TBD | ❌ | ❌ | Engine configs only, no K8s manifest | ❌ |
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-8gpu/)** | SGLang | Disagg WideEP | 8x H200 | ✅ | ❌ | Benchmark recipe pending | ❌ |
| **[DeepSeek-R1](deepseek-r1/sglang/disagg-16gpu/)** | SGLang | Disagg WideEP | 16x H200 | ✅ | ❌ | Benchmark recipe pending | ❌ |
| **[DeepSeek-R1](deepseek-r1/trtllm/disagg/wide_ep/gb200/)** | TensorRT-LLM | Disagg WideEP (GB200) | 32+4 GB200 | ✅ | ✅ | Multi-node: 8 decode + 1 prefill nodes | ❌ |

**Legend:**
- **Deployment**: ✅ = Complete `deploy.yaml` manifest available | ❌ = Missing or incomplete
- **Benchmark Recipe**: ✅ = Includes `perf.yaml` for running AIPerf benchmarks | ❌ = No benchmark recipe provided
- **GAIE integration**: ✅ = Includes example Inference Gateway (GAIE) manifests | ❌ = No GAIE example provided

## Recipe Structure

Each complete recipe follows this standard structure:

```
<model-name>/
├── README.md (optional)          # Model-specific deployment notes
├── model-cache/
│   ├── model-cache.yaml          # PersistentVolumeClaim for model storage
│   └── model-download.yaml       # Job to download model from HuggingFace
└── <framework>/                  # vllm, sglang, or trtllm
    └── <deployment-mode>/        # agg, disagg, disagg-single-node, etc.
        ├── deploy.yaml           # Complete DynamoGraphDeployment manifest
        └── perf.yaml (optional)  # AIPerf benchmark job
```
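
As a quick sanity check before deploying, the layout above can be verified with a small shell function. This is a minimal sketch and not part of the repository: `check_recipe` is a hypothetical helper name, and it assumes you run it from the recipes directory and that the file names match the tree shown above.

```bash
# check_recipe MODEL FRAMEWORK MODE
# Hypothetical helper: verifies that a recipe directory follows the layout
# shown above. Returns non-zero if a required file is missing.
check_recipe() {
  local model="$1" framework="$2" mode="$3" status=0 f

  # Files that every complete recipe is expected to provide.
  for f in \
    "$model/model-cache/model-cache.yaml" \
    "$model/model-cache/model-download.yaml" \
    "$model/$framework/$mode/deploy.yaml"
  do
    if [ ! -f "$f" ]; then
      echo "missing: $f"
      status=1
    fi
  done

  # perf.yaml is optional, so only report whether it is present.
  if [ -f "$model/$framework/$mode/perf.yaml" ]; then
    echo "benchmark: $model/$framework/$mode/perf.yaml"
  else
    echo "benchmark: none"
  fi

  return "$status"
}
```

For example, `check_recipe llama-3-70b vllm agg` would list any missing required files and report whether a benchmark recipe exists.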

## Quick Start

### Prerequisites

**1. Dynamo Platform Installed**

The recipes require the Dynamo Kubernetes Platform. To install it, follow one of these guides:

- **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Quickstart (~10 minutes)
- **[Detailed Installation Guide](../docs/kubernetes/installation_guide.md)** - Advanced options

**2. GPU Cluster Requirements**

Ensure your cluster has:
- GPU nodes matching recipe requirements (see table above)
- GPU operator installed
- Appropriate GPU drivers and container runtime

**3. HuggingFace Access**

Configure authentication to download models:

```bash
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}
```
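
If the token is already exported as an environment variable, the secret can be created without pasting the literal value into the command. The following is a minimal sketch: `create_hf_secret` is a hypothetical wrapper name, and it assumes `HF_TOKEN` and `NAMESPACE` are set in the current shell. It only adds a fail-fast check around the same `kubectl create secret` call shown above.

```bash
# create_hf_secret
# Hypothetical wrapper around the command above. Refuses to run when
# NAMESPACE or HF_TOKEN is unset, so an empty secret is never created.
create_hf_secret() {
  if [ -z "${NAMESPACE:-}" ]; then
    echo "NAMESPACE is not set" >&2
    return 1
  fi
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  kubectl create secret generic hf-token-secret \
    --from-literal=HF_TOKEN="${HF_TOKEN}" \
    -n "${NAMESPACE}"
}
```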

**4. Storage Configuration**

Update the `storageClassName` in `<model>/model-cache/model-cache.yaml` to match your cluster:

```bash
# Find your storage class name
kubectl get storageclass

# Edit the model-cache.yaml file and update:
# spec:
#   storageClassName: "your-actual-storage-class"
```
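
The edit can also be scripted. The sketch below is one possible approach rather than a tool shipped with the recipes: `set_storage_class` is a hypothetical helper, and it assumes the manifest contains a plain `storageClassName:` key on its own line.

```bash
# set_storage_class FILE CLASS
# Hypothetical helper: rewrites every `storageClassName:` line in FILE to use
# CLASS, preserving indentation. Goes through a temp file instead of
# `sed -i`, so it behaves the same with GNU and BSD sed.
set_storage_class() {
  local file="$1" class="$2" tmp
  tmp="$(mktemp)" || return 1
  sed "s|^\([[:space:]]*storageClassName:\).*|\1 \"${class}\"|" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  local rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

For example: `set_storage_class <model>/model-cache/model-cache.yaml your-actual-storage-class`.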

### Deploy a Recipe

**Step 1: Download Model**

```bash
# Update storageClassName in model-cache.yaml first!
kubectl apply -f <model>/model-cache/ -n ${NAMESPACE}

# Wait for download to complete (may take 10-60 minutes depending on model size)
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s

# Monitor progress (run in a second terminal, since `kubectl wait` blocks)
kubectl logs -f job/model-download -n ${NAMESPACE}
```

**Step 2: Deploy Service**

```bash
kubectl apply -f <model>/<framework>/<mode>/deploy.yaml -n ${NAMESPACE}

# Check deployment status
kubectl get dynamographdeployment -n ${NAMESPACE}

# Check pod status
kubectl get pods -n ${NAMESPACE}

# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=<deployment-name> -n ${NAMESPACE} --timeout=600s
```
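
The `<deployment-name>` placeholder is the `metadata.name` of the DynamoGraphDeployment in `deploy.yaml`. One way to read it from the manifest is sketched below. `dgd_name` is a hypothetical helper, and it relies on an assumption about formatting: that `name:` appears two-space-indented under `metadata:` after the `kind:` line. It is a text match, not a YAML parser, so verify the result against `kubectl get dynamographdeployment`.

```bash
# dgd_name FILE
# Hypothetical helper: prints metadata.name of the first DynamoGraphDeployment
# document in a multi-document manifest. Assumes conventional formatting
# (kind: before metadata:, two-space indentation).
dgd_name() {
  awk '
    /^---/                               { in_dgd = 0 }
    /^kind:[ \t]*DynamoGraphDeployment/  { in_dgd = 1 }
    in_dgd && /^  name:/                 { print $2; exit }
  ' "$1"
}
```

With the name in a variable, for example `DEPLOYMENT="$(dgd_name <model>/<framework>/<mode>/deploy.yaml)"`, the label selector above becomes `nvidia.com/dynamo-graph-deployment-name=${DEPLOYMENT}`.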

**Step 3: Test Deployment**

```bash
# Port forward to access the service locally
kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n ${NAMESPACE}

# In another terminal, test the endpoint
curl http://localhost:8000/v1/models

# Send a test request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-name>",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
```
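
Workers can take a while to load the model, so the first requests may fail even after the port-forward is up. A simple retry loop around the `/v1/models` check is sketched below. `wait_for_models` is a hypothetical helper, and the default URL assumes the port-forward from this step is running.

```bash
# wait_for_models [URL] [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: retries the models endpoint until it answers with a
# successful HTTP status, or gives up after ATTEMPTS tries.
wait_for_models() {
  local url="${1:-http://localhost:8000/v1/models}" attempts="${2:-30}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s hides progress output.
    if curl -fs "$url" >/dev/null; then
      echo "endpoint ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "endpoint not ready after $attempts attempts" >&2
  return 1
}
```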

**Step 4: Run Benchmark (Optional)**

```bash
# Only if perf.yaml exists in the recipe directory
kubectl apply -f <model>/<framework>/<mode>/perf.yaml -n ${NAMESPACE}

# Monitor benchmark progress
kubectl logs -f job/<benchmark-job-name> -n ${NAMESPACE}

# View results after completion
kubectl logs job/<benchmark-job-name> -n ${NAMESPACE} | tail -50
```

**Inference Gateway (GAIE) Integration (Optional)**

For Llama-3-70B with vLLM (Aggregated), an example of integration with the Inference Gateway is provided.

Follow [Deploy Inference Gateway Section 2](../deploy/inference-gateway/README.md#2-deploy-inference-gateway) to install GAIE, then apply the manifests:

```bash
export DEPLOY_PATH=llama-3-70b/vllm/agg
# General form: DEPLOY_PATH=<model>/<framework>/<mode>
kubectl apply -R -f "$DEPLOY_PATH/gaie/k8s-manifests" -n "$NAMESPACE"
```

## Example Deployments

### Llama-3-70B with vLLM (Aggregated)

```bash
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# Create HF token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token" \
  -n ${NAMESPACE}

# Deploy
kubectl apply -f llama-3-70b/model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=6000s
kubectl apply -f llama-3-70b/vllm/agg/deploy.yaml -n ${NAMESPACE}

# Test
kubectl port-forward svc/llama3-70b-agg-frontend 8000:8000 -n ${NAMESPACE}
```

### DeepSeek-R1 on GB200 (Multi-node)

See [deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml](deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml) for the complete multi-node WideEP configuration.

## Customization

Each `deploy.yaml` contains:
- **ConfigMap**: Engine-specific configuration (embedded in the manifest)
- **DynamoGraphDeployment**: Kubernetes resource definitions
- **Resource limits**: GPU count, memory, CPU requests/limits
- **Image references**: Container images with version tags

### Key Customization Points

**Model Configuration:**
```yaml
# In deploy.yaml under worker args:
args:
  - python3 -m dynamo.vllm --model <your-model-path> --served-model-name <name>
```

**GPU Resources:**
```yaml
resources:
  limits:
    gpu: "4"  # Adjust based on your requirements
  requests:
    gpu: "4"
```

**Scaling:**
```yaml
services:
  VllmDecodeWorker:
    replicas: 2  # Scale to multiple workers
```
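
Replica counts can also be changed on a running deployment instead of editing and re-applying `deploy.yaml`. The following is a sketch under stated assumptions, not a documented workflow: `replicas_patch` and `scale_service` are hypothetical helpers, and they assume that `services` sits under `spec` in the DynamoGraphDeployment and that the operator reconciles a changed `replicas` value. Check the API Reference before relying on it.

```bash
# replicas_patch SERVICE COUNT
# Prints a JSON merge patch that sets spec.services.<SERVICE>.replicas.
replicas_patch() {
  printf '{"spec":{"services":{"%s":{"replicas":%d}}}}\n' "$1" "$2"
}

# scale_service DEPLOYMENT SERVICE COUNT
# Applies the patch to a DynamoGraphDeployment in ${NAMESPACE}.
# Custom resources do not support strategic merge, hence --type merge.
scale_service() {
  kubectl patch dynamographdeployment "$1" -n "${NAMESPACE}" \
    --type merge -p "$(replicas_patch "$2" "$3")"
}
```

For example: `scale_service <deployment-name> VllmDecodeWorker 3`.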

**Router Mode:**
```yaml
# In Frontend args:
args:
  - python3 -m dynamo.frontend --router-mode kv --http-port 8000
# Options: round-robin, kv (KV-aware routing)
```

**Container Images:**
```yaml
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z
# Update version tag as needed
```

## Troubleshooting

### Common Issues

**Pods stuck in Pending:**
- Check GPU availability: `kubectl describe node <node-name>`
- Verify storage class exists: `kubectl get storageclass`
- Check resource requests vs. available resources
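
The checks for Pending pods can be combined into a few shell functions. This is a minimal sketch: the function names are hypothetical, `NAMESPACE` is assumed to be exported, and `gpu_capacity` assumes GPUs are advertised under the `nvidia.com/gpu` resource name.

```bash
# List pods that are still Pending in the namespace.
pending_pods() {
  kubectl get pods -n "${NAMESPACE}" --field-selector=status.phase=Pending -o name
}

# Show allocatable GPUs per node (assumes the nvidia.com/gpu resource name).
gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Print recent events for each Pending pod; scheduling failures show up here.
explain_pending() {
  local pod
  for pod in $(pending_pods); do
    echo "== $pod"
    kubectl get events -n "${NAMESPACE}" \
      --field-selector "involvedObject.name=${pod#pod/}" \
      --sort-by=.lastTimestamp
  done
}
```

Comparing the `explain_pending` output with `gpu_capacity` usually shows whether the cause is insufficient GPUs or an unbound PersistentVolumeClaim.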

**Model download fails:**
- Verify HuggingFace token is correct
- Check network connectivity from cluster
- Review job logs: `kubectl logs job/model-download -n ${NAMESPACE}`

**Workers fail to start:**
- Check GPU compatibility (driver version, CUDA version)
- Verify image pull secrets if using private registries
- Review pod logs: `kubectl logs <pod-name> -n ${NAMESPACE}`

**For more troubleshooting:**
- [Kubernetes Deployment Guide](../docs/kubernetes/README.md#troubleshooting)
- [Observability Documentation](../docs/kubernetes/observability/)

## Related Documentation

- **[Kubernetes Deployment Guide](../docs/kubernetes/README.md)** - Platform installation and concepts
- **[API Reference](../docs/kubernetes/api_reference.md)** - DynamoGraphDeployment CRD specification
- **[vLLM Backend Guide](../docs/backends/vllm/README.md)** - vLLM-specific features
- **[SGLang Backend Guide](../docs/backends/sglang/README.md)** - SGLang-specific features
- **[TensorRT-LLM Backend Guide](../docs/backends/trtllm/README.md)** - TensorRT-LLM features
- **[Observability](../docs/kubernetes/observability/)** - Monitoring and logging
- **[Benchmarking Guide](../docs/benchmarks/benchmarking.md)** - Performance testing

## Contributing

We welcome contributions of new recipes! See [CONTRIBUTING.md](CONTRIBUTING.md) for:
- Recipe submission guidelines
- Required components checklist
- Testing and validation requirements
- Documentation standards

### Recipe Quality Standards

A production-ready recipe must include:
- ✅ Complete `deploy.yaml` with DynamoGraphDeployment
- ✅ Model cache PVC and download job
- ✅ Benchmark recipe (`perf.yaml`) for performance testing
- ✅ Verification on target hardware
- ✅ Documentation of GPU requirements