<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Installation Guide for Dynamo Kubernetes Platform

Deploy and manage Dynamo inference graphs on Kubernetes with automated orchestration and scaling, using the Dynamo Kubernetes Platform.

## Before You Start

Determine your cluster environment:

**Shared/Multi-Tenant Cluster** (K8s cluster with existing Dynamo artifacts):
- CRDs already installed cluster-wide - skip CRD installation step
- Must use namespace-restricted installation (see note in installation steps)

**Dedicated Cluster** (full cluster admin access):
- You install CRDs yourself
- Can use cluster-wide operator (default)

**Local Development** (Minikube, testing):
- See [Minikube Setup](deployment/minikube.md) first, then follow installation steps below

To check if CRDs already exist:
```bash
kubectl get crd | grep dynamo
# If you see dynamographdeployments, dynamocomponentdeployments, etc., CRDs are already installed
```

## Installation Paths

The platform is installed using the Dynamo Kubernetes Platform [Helm chart](../../deploy/cloud/helm/platform/README.md).

**Path A: Pre-built Artifacts**
- Use case: Production deployment, shared or dedicated clusters
- Source: NGC published Helm charts
- Time: ~10 minutes
- Jump to: [Path A](#path-a-production-install)

**Path B: Custom Build from Source**
- Use case: Contributing to Dynamo, using latest features from main branch, customization
- Requirements: Docker build environment
- Time: ~30 minutes
- Jump to: [Path B](#path-b-custom-build-from-source)

You can override the defaults of any helm install command by editing the chart's values.yaml file or by passing in your own values file:

```bash
helm install ... \
  -f your-values.yaml
```

and/or setting values as flags to the helm install command, as follows:

```bash
helm install ... \
  --set "your-value=your-value"
```
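
As a sketch of how the two override styles relate: a dotted `--set` path corresponds to nested keys in a values file. The snippet below writes a values file equivalent to the `dynamo-operator.namespaceRestriction.enabled=true` flag used later in this guide. The file name `my-values.yaml` is an arbitrary choice for illustration, not something the chart requires.

```bash
# Write a values file (hypothetical name: my-values.yaml) whose nested keys
# mirror the dotted path dynamo-operator.namespaceRestriction.enabled.
cat > my-values.yaml <<'EOF'
dynamo-operator:
  namespaceRestriction:
    enabled: true
EOF

# Print the file that would be passed to helm with: -f my-values.yaml
cat my-values.yaml
```

Passing this file with `-f my-values.yaml` should have the same effect as the corresponding `--set` flag. When both are given, Helm lets `--set` values take precedence over values files.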

## Prerequisites

Verify before proceeding:

- Access to a Kubernetes cluster (v1.24+)
- kubectl v1.24+ installed and configured
- Helm v3.0+ installed
- Cluster type determined (shared vs dedicated)
- CRD status checked if on shared cluster
- NGC credentials if using NVIDIA images (optional for public images)

Estimated time: 5-30 minutes depending on path

```bash
# Check required tools
kubectl version --client  # v1.24+
helm version             # v3.0+
docker version           # Running daemon (for Path B only)

# Set your release version
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
```
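
The version requirements above can also be checked in a script. The following is a minimal sketch, not part of the Dynamo tooling: `version_ge` is a hypothetical helper, `CANDIDATE_VERSION` is a placeholder value, and the comparison assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```bash
# version_ge A B: succeeds when version A >= version B.
# Hypothetical helper; relies on `sort -V` (version sort).
version_ge() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
}

# Example: compare a candidate release (placeholder value) against the 0.3.2 minimum.
CANDIDATE_VERSION=0.4.0
if version_ge "$CANDIDATE_VERSION" 0.3.2; then
  echo "version ${CANDIDATE_VERSION} meets the 0.3.2 minimum"
else
  echo "version ${CANDIDATE_VERSION} is older than 0.3.2"
fi
```

The same helper can compare the version numbers reported by `kubectl` and `helm` against the v1.24 and v3.0 minimums once they have been extracted from the tool output.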

> No cluster? See [Minikube Setup](deployment/minikube.md) for local development.

## Path A: Production Install

Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts).

```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases

# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

# 3. Install Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
```

**For Shared/Multi-Tenant Clusters:**

If your cluster has namespace-restricted Dynamo operators, you MUST add namespace restriction to your installation:

```bash
# Add this flag to the helm install command above
--set dynamo-operator.namespaceRestriction.enabled=true
```

Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).

If you see this validation error, you need namespace restriction:
```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```
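
One way to keep the shared-versus-dedicated decision in a script is to append the restriction flag conditionally, similar to the `NS_RESTRICT_FLAGS` variable used in Path B. This is a sketch only: `platform_install_cmd` and its `shared`/`dedicated` argument are hypothetical names, and the function prints the command instead of running it.

```bash
NAMESPACE=${NAMESPACE:-dynamo-system}
RELEASE_VERSION=${RELEASE_VERSION:-0.x.x}  # placeholder; set to a real release

# Hypothetical helper: prints the Path A platform install command,
# adding the namespace restriction flag for shared clusters.
platform_install_cmd() {
  cluster_type="$1"
  cmd="helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz"
  cmd="${cmd} --namespace ${NAMESPACE} --create-namespace"
  if [ "${cluster_type}" = "shared" ]; then
    cmd="${cmd} --set dynamo-operator.namespaceRestriction.enabled=true"
  fi
  printf '%s\n' "${cmd}"
}

platform_install_cmd shared
```

Printing the command first makes it easy to review the flags before running the install against a shared cluster.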

> [!TIP]
> For multinode deployments, you need to install multinode orchestration components:
>
> **Option 1 (Recommended): Grove + KAI Scheduler**
> - Grove and KAI Scheduler can be installed manually or through the dynamo-platform helm install command.
> - When using the dynamo-platform helm install command, Grove and KAI Scheduler are NOT installed by default. You can enable their installation by setting the following flags:
>
> ```bash
> --set "grove.enabled=true"
> --set "kai-scheduler.enabled=true"
> ```
>
> **Option 2: LeaderWorkerSet (LWS) + Volcano**
> - If using LWS for multinode deployments, you must also install Volcano (required dependency):
>   - [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
>   - [Volcano Installation](https://volcano.sh/en/docs/installation/) (required for gang scheduling with LWS)
> - These must be installed manually before deploying multinode workloads with LWS.
>
> See the [Multinode Deployment Guide](./deployment/multinode-deployment.md) for details on orchestrator selection.

> [!TIP]
> By default, Model Express Server is not used.
> If you wish to use an existing Model Express Server, you can set the modelExpressURL to the existing server's URL in the helm install command:

```bash
--set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
```

> [!TIP]
> By default, Dynamo Operator is installed cluster-wide and will monitor all namespaces.
> If you wish to restrict the operator to monitor only a specific namespace (the helm release namespace by default), you can set the namespaceRestriction.enabled to true.
> You can also change the restricted namespace by setting the targetNamespace property.

```bash
--set "dynamo-operator.namespaceRestriction.enabled=true"
--set "dynamo-operator.namespaceRestriction.targetNamespace=dynamo-namespace" # optional
```

[Verify Installation](#verify-installation)

## Path B: Custom Build from Source

Build and deploy from source for customization, contributing to Dynamo, or using the latest features from the main branch.

Note: This gives you access to the latest unreleased features and fixes on the main branch.

```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo  # or your registry (no trailing slash)
export DOCKER_USERNAME='$oauthtoken'
export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY>
export IMAGE_TAG=${RELEASE_VERSION}  # RELEASE_VERSION is set in Prerequisites

# 2. Build operator
cd deploy/cloud/operator

# 2.1 Alternative 1: Build and push the operator image for multiple platforms
docker buildx create --name multiplatform --driver docker-container --bootstrap
docker buildx use multiplatform
docker buildx build --platform linux/amd64,linux/arm64 -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG --push .

# 2.2 Alternative 2: Build and push the operator image for a single platform
docker build -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG . && docker push $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG

cd -

# 3. Create namespace and secrets to be able to pull the operator image (only needed if you pushed the operator image to a private registry)
kubectl create namespace ${NAMESPACE}
kubectl create secret docker-registry docker-imagepullsecret \
  --docker-server=${DOCKER_SERVER} \
  --docker-username=${DOCKER_USERNAME} \
  --docker-password=${DOCKER_PASSWORD} \
  --namespace=${NAMESPACE}

cd deploy/cloud/helm

# 4. Install CRDs
helm upgrade --install dynamo-crds ./crds/ --namespace default

# 5. Install Platform
helm dep build ./platform/

# To install cluster-wide instead, set NS_RESTRICT_FLAGS="" (empty) or omit that line entirely.
NS_RESTRICT_FLAGS="--set dynamo-operator.namespaceRestriction.enabled=true"
helm install dynamo-platform ./platform/ \
  --namespace "${NAMESPACE}" \
  --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
  --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
  --set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret" \
  ${NS_RESTRICT_FLAGS}
```
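
The build and install commands above form the image reference as `$DOCKER_SERVER/dynamo-operator:$IMAGE_TAG`, so a registry value that ends in `/` produces a double slash, which Docker generally rejects as an invalid reference. A minimal sketch of guarding against that with shell parameter expansion follows; `OPERATOR_IMAGE` is a hypothetical variable name and the values shown are examples.

```bash
DOCKER_SERVER="nvcr.io/nvidia/ai-dynamo/"   # example value with a trailing slash
IMAGE_TAG="${IMAGE_TAG:-0.x.x}"             # placeholder tag if none is set

# Strip one trailing slash, if present, before composing the reference.
DOCKER_SERVER="${DOCKER_SERVER%/}"
OPERATOR_IMAGE="${DOCKER_SERVER}/dynamo-operator:${IMAGE_TAG}"  # hypothetical variable

echo "${OPERATOR_IMAGE}"
```

The `${VAR%/}` expansion leaves values without a trailing slash unchanged, so it is safe to apply unconditionally.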

[Verify Installation](#verify-installation)

## Verify Installation

```bash
# Check CRDs
kubectl get crd | grep dynamo

# Check operator and platform pods
kubectl get pods -n ${NAMESPACE}
# Expected: dynamo-operator-* and etcd-* and nats-* pods Running
```
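
To turn the expected-state comment above into a check, the `kubectl get pods` listing can be filtered for pods that are not yet `Running`. This is a minimal sketch, assuming the default column layout (NAME, READY, STATUS, RESTARTS, AGE); `pods_not_running` is a hypothetical helper, and the sample listing is made up for illustration.

```bash
# Reads `kubectl get pods` output on stdin and prints the names of pods
# whose STATUS column (3rd field) is neither Running nor Completed.
pods_not_running() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Illustrative sample listing (not real cluster output).
pods_not_running <<'EOF'
NAME                     READY   STATUS             RESTARTS   AGE
dynamo-operator-abc      1/1     Running            0          2m
etcd-0                   0/1     ImagePullBackOff   0          2m
nats-0                   1/1     Running            0          2m
EOF
```

Against a live cluster, the helper would be used as `kubectl get pods -n ${NAMESPACE} | pods_not_running`. Empty output means every listed pod is `Running` or `Completed`.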

## Next Steps

1. **Deploy Model/Workflow**
   ```bash
   # Example: Deploy a vLLM workflow with Qwen3-0.6B using aggregated serving
   kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}

   # Port forward (runs in the foreground; run curl from a second terminal) and test
   kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
   curl http://localhost:8000/v1/models
   ```

2. **Explore Backend Guides**
   - [vLLM Deployments](../../examples/backends/vllm/deploy/README.md)
   - [SGLang Deployments](../../examples/backends/sglang/deploy/README.md)
   - [TensorRT-LLM Deployments](../../examples/backends/trtllm/deploy/README.md)

3. **Optional:**
   - [Set up Prometheus & Grafana](./observability/metrics.md)
   - [SLA Planner Quickstart Guide](../planner/sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)

## Troubleshooting

**"VALIDATION ERROR: Cannot install cluster-wide Dynamo operator"**

```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```

Cause: Attempting cluster-wide install on a shared cluster with existing namespace-restricted operators.

Solution: Add namespace restriction to your installation:
```bash
--set dynamo-operator.namespaceRestriction.enabled=true
```

Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).

**CRDs already exist**

Cause: Installing CRDs on a cluster where they're already present (common on shared clusters).

Solution: Skip the CRD installation step (step 2 in Path A, step 4 in Path B) and proceed directly to platform installation.

To check if CRDs exist:
```bash
kubectl get crd | grep dynamo
```

**Pods not starting?**
```bash
kubectl describe pod <pod-name> -n ${NAMESPACE}
kubectl logs <pod-name> -n ${NAMESPACE}
```

**HuggingFace model access?**
```bash
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

**Bitnami etcd "unrecognized" image?**

```
ERROR: Original containers have been substituted for unrecognized ones. Deploying this chart with non-standard containers is likely to cause degraded security and performance, broken chart features, and missing environment variables.
```
You might encounter this error during helm install because Bitnami changed its Docker repository to a [secure one](https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog).

To work around it, add the following flags to the helm install command:
```bash
--set "etcd.image.repository=bitnamilegacy/etcd" --set "etcd.global.security.allowInsecureImages=true"
```

**Clean uninstall?**
```bash
./uninstall.sh  # Removes all CRDs and platform
```

## Advanced Options

- [Helm Chart Configuration](../../deploy/cloud/helm/platform/README.md)
- [Create custom deployments](./deployment/create_deployment.md)
- [Dynamo Operator details](./dynamo_operator.md)
- [Model Express Server details](https://github.com/ai-dynamo/modelexpress)