<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Installation Guide for Dynamo Kubernetes Platform

Deploy and manage Dynamo inference graphs on Kubernetes with automated orchestration and scaling, using the Dynamo Kubernetes Platform.

## Before You Start

Determine your cluster environment:

**Shared/Multi-Tenant Cluster** (K8s cluster with existing Dynamo artifacts):
- CRDs already installed cluster-wide - skip CRD installation step
- A cluster-wide Dynamo operator is likely already running
- **Do NOT install another operator** - use the existing cluster-wide operator
- Only install a namespace-restricted operator if you specifically need to prevent the cluster-wide operator from managing your namespace (e.g., testing operator features you're developing)

**Dedicated Cluster** (full cluster admin access):
- You install CRDs yourself
- Can use cluster-wide operator (default)

**Local Development** (Minikube, testing):
- See [Minikube Setup](deployment/minikube.md) first, then follow installation steps below

To check if CRDs already exist:
```bash
kubectl get crd | grep dynamo
# If you see dynamographdeployments, dynamocomponentdeployments, etc., CRDs are already installed
```

To check if a cluster-wide operator already exists:
```bash
# Check for cluster-wide operator and show its namespace
kubectl get clusterrolebinding -o json | \
  jq -r '.items[] | select(.metadata.name | contains("dynamo-operator-manager")) |
  "Cluster-wide operator found in namespace: \(.subjects[0].namespace)"'

# If a cluster-wide operator exists: Do NOT install another operator
# Only install namespace-restricted mode if you specifically need namespace isolation
```
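
The two checks above can be folded into a single decision. The sketch below is a hypothetical helper (`recommend_install` and its three result strings are illustrative, not part of Dynamo). It only inspects text passed to it, and it assumes the operator's ClusterRoleBinding name contains `dynamo-operator-manager`, as in the `jq` filter above.

```bash
# Hypothetical helper: recommend an install mode from two facts about the cluster.
#   $1 = output of `kubectl get crd` (may be empty)
#   $2 = ClusterRoleBinding names, one per line (may be empty)
recommend_install() {
  local crds="$1" bindings="$2"
  if printf '%s\n' "$bindings" | grep -q 'dynamo-operator-manager'; then
    # A cluster-wide operator exists: do not install another one.
    echo "use-existing-operator"
  elif printf '%s\n' "$crds" | grep -q 'dynamo'; then
    # CRDs exist but no cluster-wide operator was found.
    echo "skip-crds-install-platform"
  else
    echo "install-crds-and-platform"
  fi
}

# Usage against a live cluster (requires kubectl):
#   recommend_install "$(kubectl get crd)" \
#     "$(kubectl get clusterrolebinding -o name)"
```

The result is a hint only; on a shared cluster, confirm with the cluster administrator before installing anything.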

## Installation Paths

The platform is installed using the Dynamo Kubernetes Platform [Helm chart](../../deploy/helm/charts/platform/README.md).

**Path A: Pre-built Artifacts**
- Use case: Production deployment, shared or dedicated clusters
- Source: NGC published Helm charts
- Time: ~10 minutes
- Jump to: [Path A](#path-a-production-install)

**Path B: Custom Build from Source**
- Use case: Contributing to Dynamo, using latest features from main branch, customization
- Requirements: Docker build environment
- Time: ~30 minutes
- Jump to: [Path B](#path-b-custom-build-from-source)

You can override the default values of any `helm install` command by passing in your own values file:

```bash
helm install ... \
  -f your-values.yaml
```

and/or by setting individual values as flags on the `helm install` command:

```bash
helm install ... \
  --set "your-value=your-value"
```
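
As a sketch of the values-file option, the snippet below writes a file that mirrors `--set` paths used elsewhere in this guide (`dynamo-operator.namespaceRestriction.enabled`, `grove.enabled`, `kai-scheduler.enabled`). The file name and the chosen values are illustrative assumptions; the chart README remains the authoritative list of keys.

```bash
# Write an illustrative values file; nested YAML keys correspond to dotted --set paths.
cat > your-values.yaml <<'EOF'
dynamo-operator:
  namespaceRestriction:
    enabled: true      # same as --set dynamo-operator.namespaceRestriction.enabled=true
grove:
  enabled: false       # Grove is not installed by default
kai-scheduler:
  enabled: false       # KAI Scheduler is not installed by default
EOF

# Then pass the file to the install command, for example:
#   helm install ... -f your-values.yaml
```

A values file is easier to review and version than a long list of `--set` flags, and Helm lets you combine both on the same command.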

## Prerequisites

Before installing the Dynamo Kubernetes Platform, ensure you have the following tools and access:

### Required Tools

| Tool | Minimum Version | Description | Installation |
|------|-----------------|-------------|--------------|
| **kubectl** | v1.24+ | Kubernetes command-line tool | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
| **Helm** | v3.0+ | Kubernetes package manager | [Install Helm](https://helm.sh/docs/intro/install/) |
| **Docker** | Latest | Container runtime (Path B only) | [Install Docker](https://docs.docker.com/get-docker/) |

### Cluster and Access Requirements

- **Kubernetes cluster v1.24+** with admin or namespace-scoped access
- **Cluster type determined** (shared vs dedicated) — see [Before You Start](#before-you-start)
- **CRD status checked** if on a shared cluster
- **NGC credentials** (optional) — required only if pulling NVIDIA images from NGC

### Verify Tool Installation

Run the following to confirm your tools are correctly installed:

```bash
# Verify tools and versions
kubectl version --client  # Should show v1.24+
helm version              # Should show v3.0+
docker version            # Required for Path B only

# Set your release version
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
```
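
If you would rather script the minimum-version comparison than read it by eye, a small helper is enough. This is a sketch: `version_ge` is a hypothetical name, and it assumes a `sort` that supports version ordering with `-V` (GNU coreutils does).

```bash
# True (exit 0) when version $1 is greater than or equal to version $2.
# A leading "v" is ignored, so "v1.28.3" and "1.28.3" compare the same.
version_ge() {
  local have="${1#v}" want="${2#v}"
  # Version-sort both values; if the required one sorts first, the check passes.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Example usage (requires helm, kubectl and jq on the PATH):
#   version_ge "$(helm version --template '{{.Version}}')" 3.0 || echo "Helm is too old"
#   version_ge "$(kubectl version --client -o json | jq -r .clientVersion.gitVersion)" 1.24 \
#     || echo "kubectl is too old"
```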

### Pre-Deployment Checks

Before proceeding, run the pre-deployment check script to verify your cluster meets all requirements:

```bash
./deploy/cloud/pre-deployment/pre-deployment-check.sh
```

This script validates kubectl connectivity, default StorageClass configuration, and GPU node availability. See [Pre-Deployment Checks](../../deploy/cloud/pre-deployment/README.md) for details.

> **No cluster?** See [Minikube Setup](deployment/minikube.md) for local development.

**Estimated installation time:** 10-30 minutes, depending on the path

## Path A: Production Install

Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts).

```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases

# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

# 3. Install Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
```

**For Shared/Multi-Tenant Clusters:**

If your cluster has namespace-restricted Dynamo operators, you MUST add namespace restriction to your installation:

```bash
# Add this flag to the helm install command above
--set dynamo-operator.namespaceRestriction.enabled=true
```
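
For reference, this is how the flag fits into the Path A platform install. The wrapper name `install_platform_restricted` is hypothetical; the `helm install` arguments are the ones from step 3 plus the namespace-restriction flag.

```bash
# Assumes NAMESPACE and RELEASE_VERSION are exported as in step 1,
# and that the chart archive was already fetched in step 3.
install_platform_restricted() {
  helm install dynamo-platform "dynamo-platform-${RELEASE_VERSION}.tgz" \
    --namespace "${NAMESPACE}" \
    --create-namespace \
    --set dynamo-operator.namespaceRestriction.enabled=true
}

# Run it once the variables are set:
#   install_platform_restricted
```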

Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).

If you see this validation error, you need namespace restriction:
```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```

> [!TIP]
> For multinode deployments, you need to install multinode orchestration components:
>
> **Option 1 (Recommended): Grove + KAI Scheduler**
> - Grove and KAI Scheduler can be installed manually or through the dynamo-platform helm install command.
> - When using the dynamo-platform helm install command, Grove and KAI Scheduler are NOT installed by default. You can enable their installation by setting the following flags:
>
> ```bash
> --set "grove.enabled=true"
> --set "kai-scheduler.enabled=true"
> ```
>
> **Option 2: LeaderWorkerSet (LWS) + Volcano**
> - If using LWS for multinode deployments, you must also install Volcano (required dependency):
>   - [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
>   - [Volcano Installation](https://volcano.sh/en/docs/installation/) (required for gang scheduling with LWS)
> - These must be installed manually before deploying multinode workloads with LWS.
>
> See the [Multinode Deployment Guide](./deployment/multinode-deployment.md) for details on orchestrator selection.

> [!TIP]
> By default, Model Express Server is not used.
> If you wish to use an existing Model Express Server, you can set the modelExpressURL to the existing server's URL in the helm install command:

```bash
--set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
```

> [!TIP]
> By default, the Dynamo Operator is installed cluster-wide and monitors all namespaces.
> To restrict the operator to a single namespace (the Helm release namespace by default), set `dynamo-operator.namespaceRestriction.enabled` to `true`.
> To restrict it to a different namespace, also set `dynamo-operator.namespaceRestriction.targetNamespace`.

```bash
--set "dynamo-operator.namespaceRestriction.enabled=true"
--set "dynamo-operator.namespaceRestriction.targetNamespace=dynamo-namespace" # optional
```

[Verify Installation](#verify-installation)

## Path B: Custom Build from Source

Build and deploy from source for customization, contributing to Dynamo, or using the latest unreleased features and fixes from the main branch.

```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo  # or your registry (no trailing slash)
export DOCKER_USERNAME='$oauthtoken'
export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY>
export IMAGE_TAG=${RELEASE_VERSION}  # RELEASE_VERSION is set in the Prerequisites section

# 2. Build operator
cd deploy/operator

# 2.1 Alternative 1: Build and push the operator image for multiple platforms
docker buildx create --name multiplatform --driver docker-container --bootstrap
docker buildx use multiplatform
docker buildx build --platform linux/amd64,linux/arm64 -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG --push .

# 2.2 Alternative 2: Build and push the operator image for a single platform
docker build -t $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG . && docker push $DOCKER_SERVER/dynamo-operator:$IMAGE_TAG

cd -

# 3. Create namespace and secrets to be able to pull the operator image (only needed if you pushed the operator image to a private registry)
kubectl create namespace ${NAMESPACE}
kubectl create secret docker-registry docker-imagepullsecret \
  --docker-server=${DOCKER_SERVER} \
  --docker-username=${DOCKER_USERNAME} \
  --docker-password=${DOCKER_PASSWORD} \
  --namespace=${NAMESPACE}

cd deploy/helm/charts

# 4. Install CRDs
helm upgrade --install dynamo-crds ./crds/ --namespace default

# 5. Install Platform
helm dep build ./platform/

# The flag below installs a namespace-restricted operator.
# To install cluster-wide instead, set NS_RESTRICT_FLAGS="" (empty) or omit that line entirely.
NS_RESTRICT_FLAGS="--set dynamo-operator.namespaceRestriction.enabled=true"
helm install dynamo-platform ./platform/ \
  --namespace "${NAMESPACE}" \
  --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
  --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
  --set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret" \
  ${NS_RESTRICT_FLAGS}
```
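
Every command in the block above depends on the variables exported in step 1, and `IMAGE_TAG` ends up empty if `RELEASE_VERSION` was never set. A small guard can catch that before a build starts. `require_env` is a hypothetical helper for illustration, not part of the Dynamo repository.

```bash
# Returns non-zero and names the offenders if any listed variable is unset or empty.
require_env() {
  local name value missing=""
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    [ -n "$value" ] || missing="${missing} ${name}"
  done
  if [ -n "$missing" ]; then
    echo "missing required variables:${missing}" >&2
    return 1
  fi
}

# Example usage before step 2:
#   require_env NAMESPACE DOCKER_SERVER DOCKER_USERNAME DOCKER_PASSWORD IMAGE_TAG || exit 1
```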

[Verify Installation](#verify-installation)

## Verify Installation

```bash
# Check CRDs
kubectl get crd | grep dynamo

# Check operator and platform pods
kubectl get pods -n ${NAMESPACE}
# Expected: dynamo-operator-* and etcd-* and nats-* pods Running
```
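
The pod check can also be automated. The sketch below reads `kubectl get pods` output and reports which of the expected components has no pod in the `Running` state. `check_platform_pods` is a hypothetical helper, and it assumes the pod names contain `dynamo-operator`, `etcd`, and `nats` somewhere in the name, as the expected output above suggests.

```bash
# Reads `kubectl get pods` output on stdin (header line first).
# Prints "not running: <component>" for each expected component without a
# Running pod, and exits non-zero if any component is missing.
check_platform_pods() {
  awk '
    BEGIN { split("dynamo-operator etcd nats", want, " ") }
    NR > 1 && $3 == "Running" {
      for (i in want) if (index($1, want[i]) > 0) seen[want[i]] = 1
    }
    END {
      rc = 0
      for (i in want) if (!(want[i] in seen)) { print "not running: " want[i]; rc = 1 }
      exit rc
    }
  '
}

# Example usage:
#   kubectl get pods -n "${NAMESPACE}" | check_platform_pods && echo "platform pods are running"
```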

## Next Steps

1. **Deploy Model/Workflow**
   ```bash
   # Example: Deploy a vLLM workflow with Qwen3-0.6B using aggregated serving
   kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}

   # Port forward (this blocks; run it in a separate terminal or append & to background it), then test
   kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
   curl http://localhost:8000/v1/models
   ```

2. **Explore Backend Guides**
   - [vLLM Deployments](../../examples/backends/vllm/deploy/README.md)
   - [SGLang Deployments](../../examples/backends/sglang/deploy/README.md)
   - [TensorRT-LLM Deployments](../../examples/backends/trtllm/deploy/README.md)

3. **Optional:**
   - [Set up Prometheus & Grafana](./observability/metrics.md)
   - [SLA Planner Quickstart Guide](../planner/sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)

## Troubleshooting

**"VALIDATION ERROR: Cannot install cluster-wide Dynamo operator"**

```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```

Cause: Attempting cluster-wide install on a shared cluster with existing namespace-restricted operators.

Solution: Add namespace restriction to your installation:
```bash
--set dynamo-operator.namespaceRestriction.enabled=true
```

Note: Use the full path `dynamo-operator.namespaceRestriction.enabled=true` (not just `namespaceRestriction.enabled=true`).

**CRDs already exist**

Cause: Installing CRDs on a cluster where they're already present (common on shared clusters).

Solution: Skip the CRD installation step (step 2 in Path A, step 4 in Path B) and proceed directly to platform installation.

To check if CRDs exist:
```bash
kubectl get crd | grep dynamo
```

**Pods not starting?**
```bash
kubectl describe pod <pod-name> -n ${NAMESPACE}
kubectl logs <pod-name> -n ${NAMESPACE}
```
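
Scheduling and image-pull failures often show up as namespace events before anything reaches the pod logs. As a sketch, the hypothetical helper `recent_events` below lists the most recent events; it assumes `NAMESPACE` is exported as in the install steps.

```bash
# Show the most recent events in the namespace, oldest first.
# $1 is the number of lines to show (default 20).
recent_events() {
  kubectl get events -n "${NAMESPACE}" --sort-by=.lastTimestamp | tail -n "${1:-20}"
}

# Example usage:
#   recent_events 10
```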

**HuggingFace model access?**
```bash
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

**Bitnami etcd "unrecognized" image?**

```bash
ERROR: Original containers have been substituted for unrecognized ones. Deploying this chart with non-standard containers is likely to cause degraded security and performance, broken chart features, and missing environment variables.
```
This error can occur during `helm install` because Bitnami changed its Docker repository to a [secure one](https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog).

To work around it, add the following to the `helm install` command:
```bash
--set "etcd.image.repository=bitnamilegacy/etcd" --set "etcd.global.security.allowInsecureImages=true"
```

**Clean uninstall?**

To uninstall the platform, you can run the following command:
```bash
helm uninstall dynamo-platform --namespace ${NAMESPACE}
```

To uninstall the CRDs, follow these steps:

Get all of the dynamo CRDs installed in your cluster:
```bash
kubectl get crd | grep "dynamo.*nvidia.com"
```

You should see something like this:
```
dynamocomponentdeployments.nvidia.com               2025-10-21T14:49:52Z
dynamocomponents.nvidia.com                         2025-10-25T05:16:10Z
dynamographdeploymentrequests.nvidia.com            2025-11-24T05:26:04Z
dynamographdeployments.nvidia.com                   2025-09-04T20:56:40Z
dynamographdeploymentscalingadapters.nvidia.com     2025-12-09T21:05:59Z
dynamomodels.nvidia.com                             2025-11-07T00:19:43Z
```

Delete each CRD one by one:
```bash
kubectl delete crd <crd-name>
```
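
To avoid deleting the CRDs by hand, the listing and deletion can be combined in a loop. This is a sketch with hypothetical function names; it reuses the `dynamo.*nvidia.com` pattern from above and forwards any extra arguments to `kubectl delete`, so `--dry-run=client` gives a preview. Keep in mind that deleting a CRD also deletes every custom resource of that kind.

```bash
# List Dynamo CRDs in TYPE/NAME form, which `kubectl delete` accepts directly.
list_dynamo_crds() {
  kubectl get crd -o name | grep 'dynamo.*nvidia\.com'
}

# Delete each matching CRD; extra arguments are passed through to kubectl.
delete_dynamo_crds() {
  list_dynamo_crds | while read -r crd; do
    kubectl delete "$crd" "$@"
  done
}

# Preview first, then delete:
#   delete_dynamo_crds --dry-run=client
#   delete_dynamo_crds
```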

## Advanced Options

- [Helm Chart Configuration](../../deploy/helm/charts/platform/README.md)
- [Create custom deployments](./deployment/create_deployment.md)
- [Dynamo Operator details](./dynamo_operator.md)
- [Model Express Server details](https://github.com/ai-dynamo/modelexpress)