README.md 9.87 KB
Newer Older
1
<!--
2
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

18
# Deploying Dynamo on Kubernetes
19

20
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
atchernych's avatar
atchernych committed
21

22
23
24
25
## Important Terminology

**Kubernetes Namespace**: The K8s namespace where your DynamoGraphDeployment resource is created.
- Used for: Resource isolation, RBAC, organizing deployments
26
- Example: `dynamo-system`, `team-a-namespace`
27

28
**Dynamo Namespace**: The logical namespace used by Dynamo components for [service discovery](./service_discovery.md).
29
30
31
- Used for: Runtime component communication, service discovery
- Specified in: `.spec.services.<ServiceName>.dynamoNamespace` field
- Example: `my-llm`, `production-model`, `dynamo-dev`
32

33
These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
34

35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
## Prerequisites

Before you begin, ensure you have the following tools installed:

| Tool | Minimum Version | Installation Guide |
|------|-----------------|-------------------|
| **kubectl** | v1.24+ | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
| **Helm** | v3.0+ | [Install Helm](https://helm.sh/docs/intro/install/) |

Verify your installation:
```bash
kubectl version --client  # Should show v1.24+
helm version              # Should show v3.0+
```

For detailed installation instructions, see the [Prerequisites section](./installation_guide.md#prerequisites) in the Installation Guide.

52
53
## Pre-deployment Checks

54
55
56
57
58
59
60
Before deploying the platform, run the pre-deployment checks to ensure the cluster is ready:

```bash
./deploy/pre-deployment/pre-deployment-check.sh
```

This validates kubectl connectivity, StorageClass configuration, and GPU availability. See [pre-deployment checks](../../deploy/pre-deployment/README.md) for more details.
61

62
## 1. Install Platform First
63
64
65

```bash
# 1. Set environment
66
export NAMESPACE=dynamo-system
67
68
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases

69
# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
70
71
72
73
74
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

# 3. Install Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
75
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
76
77
```

78
79
80
81
82
83
84
85
**For Shared/Multi-Tenant Clusters:**

If your cluster has namespace-restricted Dynamo operators, add this flag to step 3:
```bash
--set dynamo-operator.namespaceRestriction.enabled=true
```

For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](./installation_guide.md)**.
atchernych's avatar
atchernych committed
86

87
## 2. Choose Your Backend
88

89
Each backend has deployment examples and configuration options:
90

91
92
| Backend      | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node |
|--------------|:----------:|:-------------------:|:-------------:|:----------------------:|:-----------------------:|:------------------------:|
93
94
95
| **[SGLang](../../examples/backends/sglang/deploy/README.md)**       | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **[TensorRT-LLM](../../examples/backends/trtllm/deploy/README.md)** | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ |
| **[vLLM](../../examples/backends/vllm/deploy/README.md)**           | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
atchernych's avatar
atchernych committed
96

97
## 3. Deploy Your First Model
atchernych's avatar
atchernych committed
98

99
```bash
100
export NAMESPACE=dynamo-system
101
kubectl create namespace ${NAMESPACE}
atchernych's avatar
atchernych committed
102

Alec's avatar
Alec committed
103
104
105
106
107
108
# to pull model from HF
export HF_TOKEN=<Token-Here>
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="$HF_TOKEN" \
  -n ${NAMESPACE};

109
# Deploy any example (this uses vLLM with Qwen model using aggregated serving)
110
kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
atchernych's avatar
atchernych committed
111

112
113
# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
atchernych's avatar
atchernych committed
114

115
# Test it
Alec's avatar
Alec committed
116
kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
117
curl http://localhost:8000/v1/models
atchernych's avatar
atchernych committed
118
119
```

120
121
For SLA-based autoscaling, see [SLA Planner Quick Start Guide](../planner/sla_planner_quickstart.md).

122
## Understanding Dynamo's Custom Resources
atchernych's avatar
atchernych committed
123

124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
Dynamo provides two main Kubernetes Custom Resources for deploying models:

### DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration

The **recommended approach** for generating optimal configurations. DGDR provides a high-level interface where you specify:
- Model name and backend framework
- SLA targets (latency requirements)
- GPU type (optional)

Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:
- SLA-driven configuration generation
- Automated resource optimization
- Users who want simplicity over control

**Note**: DGDR generates a DGD spec which you can then use to deploy.

### DynamoGraphDeployment (DGD) - Direct Configuration

A lower-level interface that defines your complete inference pipeline:
143
144
145
146
- Model configuration
- Resource allocation (GPUs, memory)
- Scaling policies
- Frontend/backend connections
atchernych's avatar
atchernych committed
147

148
149
Use this when you need fine-grained control or have already completed profiling.

150
Refer to the [API Reference and Documentation](./api_reference.md) for more details.
atchernych's avatar
atchernych committed
151

152
153
154
155
## 📖 API Reference & Documentation

For detailed technical specifications of Dynamo's Kubernetes resources:

156
157
158
- **[API Reference](./api_reference.md)** - Complete CRD field specifications for all Dynamo resources
- **[Create Deployment](./deployment/create_deployment.md)** - Step-by-step deployment creation with DynamoGraphDeployment
- **[Operator Guide](./dynamo_operator.md)** - Dynamo operator configuration and management
159

160
### Choosing Your Architecture Pattern
atchernych's avatar
atchernych committed
161

162
When creating a deployment, select the architecture pattern that best fits your use case:
atchernych's avatar
atchernych committed
163

164
165
166
- **Development / Testing** - Use `agg.yaml` as the base configuration
- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference
- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability
atchernych's avatar
atchernych committed
167

168
### Frontend and Worker Components
atchernych's avatar
atchernych committed
169

170
You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:
atchernych's avatar
atchernych committed
171

172
- Provides OpenAI-compatible `/v1/chat/completions` endpoint
173
- Auto-discovers backend workers via [service discovery](./service_discovery.md) (Kubernetes-native by default)
174
175
- Routes requests and handles load balancing
- Validates and preprocesses requests
atchernych's avatar
atchernych committed
176

177
### Customizing Your Deployment
atchernych's avatar
atchernych committed
178

179
180
181
182
183
184
185
186
187
188
189
190
191
Example structure:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
atchernych's avatar
atchernych committed
192
        mainContainer:
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
          image: your-image
    VllmDecodeWorker:  # or SGLangDecodeWorker, TrtllmDecodeWorker
      dynamoNamespace: dynamo-dev
      componentType: worker
      replicas: 1
      envFromSecret: hf-token-secret  # for HuggingFace models
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: your-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
atchernych's avatar
atchernych committed
208
209
```

210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
Worker command examples per backend:
```yaml
# vLLM worker
args:
  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B

# SGLang worker
args:
  - >-
    python3 -m dynamo.sglang
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --tp 1
    --trust-remote-code

# TensorRT-LLM worker
args:
  - python3 -m dynamo.trtllm
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
229
    --extra-engine-args /workspace/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml
atchernych's avatar
atchernych committed
230
```
231

232
233
234
235
236
237
Key customization points include:
- **Model Configuration**: Specify model in the args command
- **Resource Allocation**: Configure GPU requirements under `resources.limits`
- **Scaling**: Set `replicas` for number of worker instances
- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs
- **Worker Specialization**: Add `--is-prefill-worker` flag for disaggregated prefill workers
238

239
## Additional Resources
240

241
242
- **[Examples](../examples/README.md)** - Complete working examples
- **[Create Custom Deployments](./deployment/create_deployment.md)** - Build your own CRDs
243
- **[Managing Models with DynamoModel](./deployment/dynamomodel-guide.md)** - Deploy LoRA adapters and manage models
244
- **[Operator Documentation](./dynamo_operator.md)** - How the platform works
245
- **[Service Discovery](./service_discovery.md)** - Discovery backends and configuration
246
247
248
249
250
251
252
- **[Helm Charts](../../deploy/helm/README.md)** - For advanced users
- **[GitOps Deployment with FluxCD](./fluxcd.md)** - For advanced users
- **[Logging](./observability/logging.md)** - For logging setup
- **[Multinode Deployment](./deployment/multinode-deployment.md)** - For multinode deployment
- **[Grove](./grove.md)** - For grove details and custom installation
- **[Monitoring](./observability/metrics.md)** - For monitoring setup
- **[Model Caching with Fluid](./model_caching_with_fluid.md)** - For model caching with Fluid