---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Detailed Installation Guide
---

Deploy and manage Dynamo inference graphs on Kubernetes with automated orchestration and scaling, using the Dynamo Kubernetes Platform.

## Before You Start

Determine your cluster environment:

**Shared/Multi-Tenant Cluster** (K8s cluster with existing Dynamo artifacts):
- CRDs already installed cluster-wide - skip CRD installation step
- A cluster-wide Dynamo operator is likely already running
- **Do NOT install another operator** - use the existing cluster-wide operator

**Dedicated Cluster** (full cluster admin access):
- You install CRDs yourself
- Can use cluster-wide operator (default)

**Local Development** (Minikube, testing):
- See [Minikube Setup](deployment/minikube.md) first, then follow installation steps below

To check if CRDs already exist:
```bash
kubectl get crd | grep dynamo
# If you see dynamographdeployments, dynamocomponentdeployments, etc., CRDs are already installed
```

To check if a cluster-wide operator already exists:
```bash
# Check for cluster-wide operator and show its namespace
kubectl get clusterrolebinding -o json | \
  jq -r '.items[] | select(.metadata.name | contains("dynamo-operator-manager")) |
  "Cluster-wide operator found in namespace: \(.subjects[0].namespace)"'

# If a cluster-wide operator exists: Do NOT install another operator
# Only install namespace-restricted mode if you specifically need namespace isolation
```
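
The two checks above can be folded into one recommendation. The following is only a sketch: `recommend_install_steps` is a hypothetical helper (not part of Dynamo), and it assumes you pass it the raw `kubectl get crd` output and the operator namespace you found, if any.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn the results of the two checks into a recommendation.
#   $1: output of `kubectl get crd` (may be empty)
#   $2: namespace of an existing cluster-wide operator (empty if none was found)
recommend_install_steps() {
  local crd_listing="$1" operator_ns="$2"

  if [ -n "$operator_ns" ]; then
    echo "operator: reuse the existing cluster-wide operator in namespace ${operator_ns}"
  else
    echo "operator: none found, install the platform chart"
  fi

  if printf '%s\n' "$crd_listing" | grep -q 'dynamo'; then
    echo "crds: already installed, skip CRD installation"
  else
    echo "crds: not found, they will be installed"
  fi
}

# Usage against a live cluster (requires kubectl and jq):
#   crds="$(kubectl get crd)"
#   ns="$(kubectl get clusterrolebinding -o json | jq -r \
#     '[.items[] | select(.metadata.name | contains("dynamo-operator-manager"))
#       | .subjects[0].namespace] | first // empty')"
#   recommend_install_steps "$crds" "$ns"
```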

## Installation Paths

The platform is installed using the Dynamo Kubernetes Platform [Helm chart](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/platform/README.md).

**Path A: Pre-built Artifacts**
- Use case: Production deployment, shared or dedicated clusters
- Source: NGC published Helm charts
- Time: ~10 minutes
- Jump to: [Path A](#path-a-production-install)

**Path B: Custom Build from Source**
- Use case: Contributing to Dynamo, using latest features from main branch, customization
- Requirements: Docker build environment
- Time: ~30 minutes
- Jump to: [Path B](#path-b-custom-build-from-source)

The default values used by any `helm install` command in this guide can be overridden by passing your own values file:

```bash
helm install ... \
  -f your-values.yaml
```

and/or setting values as flags to the helm install command, as follows:

```bash
helm install ... \
  --set "your-value=your-value"
```
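
As an illustration, the same override can be written either way. The sketch below creates a values file equivalent to a single `--set` flag. The key is the operator image tag that appears elsewhere in this guide; the file name `your-values.yaml` and the tag value `0.9.0` are placeholders.

```bash
# Write a values file that is equivalent to:
#   --set "dynamo-operator.controllerManager.manager.image.tag=0.9.0"
cat > your-values.yaml <<'EOF'
dynamo-operator:
  controllerManager:
    manager:
      image:
        tag: "0.9.0"
EOF

# Then pass it to the install command (other arguments elided, as above):
#   helm install ... -f your-values.yaml
```

When both are given, Helm applies values from `--set` on top of those from `-f`, so a flag can still override a single key from the file.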

## Prerequisites

Before installing the Dynamo Kubernetes Platform, ensure you have the following tools and access:

### Required Tools

| Tool | Minimum Version | Description | Installation |
|------|-----------------|-------------|--------------|
| **kubectl** | v1.24+ | Kubernetes command-line tool | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) |
| **Helm** | v3.0+ | Kubernetes package manager | [Install Helm](https://helm.sh/docs/intro/install/) |
| **Docker** | Latest | Container runtime (Path B only) | [Install Docker](https://docs.docker.com/get-docker/) |

### Cluster and Access Requirements

- **Kubernetes cluster v1.24+** with admin or namespace-scoped access
- **Cluster type determined** (shared vs dedicated) — see [Before You Start](#before-you-start)
- **CRD status checked** if on a shared cluster
- **NGC credentials** (optional) — required only if pulling NVIDIA images from NGC

### Verify Tool Installation

Run the following to confirm your tools are correctly installed:

```bash
# Verify tools and versions
kubectl version --client  # Should show v1.24+
helm version              # Should show v3.0+
docker version            # Required for Path B only

# Set your release version
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
```
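
If you would rather script the version check than read the output by eye, a comparison along these lines may help. It is a sketch only: `version_ge` is a hypothetical helper, and it relies on `sort -V` from GNU coreutils.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if version $1 is at least version $2.
# A leading "v" is ignored; ordering is delegated to `sort -V`.
version_ge() {
  local have="${1#v}" want="${2#v}"
  # The smaller of the two versions sorts first; if that is the required
  # minimum, then the installed version is at least as new.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Usage (requires the tools themselves, plus jq for the kubectl line):
#   version_ge "$(helm version --template '{{.Version}}')" 3.0 \
#     || echo "Helm is older than v3.0"
#   version_ge "$(kubectl version --client -o json | jq -r '.clientVersion.gitVersion')" 1.24 \
#     || echo "kubectl is older than v1.24"
```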

### Pre-Deployment Checks

Before proceeding, run the pre-deployment check script to verify your cluster meets all requirements:

```bash
./deploy/pre-deployment/pre-deployment-check.sh
```

This script validates kubectl connectivity, default StorageClass configuration, and GPU node availability. See [Pre-Deployment Checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README.md) for details.

> **No cluster?** See [Minikube Setup](deployment/minikube.md) for local development.

**Estimated installation time:** about 10 minutes for Path A, about 30 minutes for Path B

## Path A: Production Install

Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts).

```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases

# 2. Install Platform (CRDs are automatically installed by the chart)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
```

> [!WARNING]
> **v0.9.0 Helm Chart Issue:** The initial v0.9.0 `dynamo-platform` Helm chart sets the operator image to v0.7.1 instead of v0.9.0. Use `RELEASE_VERSION=0.9.0-post1` or add `--set dynamo-operator.controllerManager.manager.image.tag=0.9.0` to your helm install command.
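
The second workaround can be applied conditionally, so that the same install script works for any release. This is a minimal sketch: `operator_tag_override` is a hypothetical helper, and it assumes only the exact version string `0.9.0` is affected, as the warning states.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the extra helm flag needed for a given release version.
# Prints nothing for any version other than the initial 0.9.0 chart.
operator_tag_override() {
  local version="$1"
  if [ "$version" = "0.9.0" ]; then
    echo "--set dynamo-operator.controllerManager.manager.image.tag=0.9.0"
  fi
}

# Usage (word splitting of the unquoted substitution is intentional):
#   helm install dynamo-platform "dynamo-platform-${RELEASE_VERSION}.tgz" \
#     --namespace "${NAMESPACE}" --create-namespace \
#     $(operator_tag_override "${RELEASE_VERSION}")
```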

**For Shared/Multi-Tenant Clusters:**

> **DEPRECATED:** Namespace-restricted mode (`namespaceRestriction.enabled=true`) is deprecated and will be removed in a future release. New deployments should use the default cluster-wide mode. If you are currently using namespace-restricted mode, plan to migrate to cluster-wide mode.

> [!TIP]
> For multinode deployments, you need to install multinode orchestration components:
>
> **Option 1 (Recommended): Grove + KAI Scheduler**
>
> For production environments, Grove and KAI Scheduler should be installed **separately** from the dynamo-platform chart. This allows independent lifecycle management, version pinning, and upgrade control.
>
> **Compatibility Matrix:**
>
> | dynamo-platform | kai-scheduler | Grove |
> |-----------------|---------------|-------|
> | 1.0.x           | >= v0.13.0    | >= v0.1.0-alpha.6 |
>
> After installing them separately, enable Dynamo integration:
>
> ```bash
> --set "global.kai-scheduler.enabled=true"
> --set "global.grove.enabled=true"
158
> ```
159
>
> For **development/testing only**, you can install them as bundled subcharts:
>
> ```bash
> --set "global.grove.install=true"
> --set "global.kai-scheduler.install=true"
> ```
>
> Note: `global.kai-scheduler.install` / `global.grove.install` control whether the bundled subcharts are deployed. When set, integration is automatically enabled. `global.kai-scheduler.enabled` / `global.grove.enabled` can be set independently when using externally-managed installations.
>
> **Option 2: LeaderWorkerSet (LWS) + Volcano**
> - If using LWS for multinode deployments, you must also install Volcano (required dependency):
>   - [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
>   - [Volcano Installation](https://volcano.sh/en/docs/installation/) (required for gang scheduling with LWS)
> - These must be installed manually before deploying multinode workloads with LWS.
>
> See the [Multinode Deployment Guide](./deployment/multinode-deployment.md) for details on orchestrator selection.
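
For Option 1, the flag fragments in the tip above have to be appended to a full `helm install` command. One way to keep the two modes apart is sketched below. `multinode_flags` and its mode names `external` and `bundled` are hypothetical; only the `--set` keys come from this guide.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the Grove/KAI Scheduler flags for an install mode,
# one argument per line.
#   external: Grove and KAI Scheduler are installed separately (production)
#   bundled:  deploy them as subcharts of dynamo-platform (development/testing only)
multinode_flags() {
  case "$1" in
    external)
      printf '%s\n' '--set' 'global.kai-scheduler.enabled=true' \
                    '--set' 'global.grove.enabled=true' ;;
    bundled)
      printf '%s\n' '--set' 'global.grove.install=true' \
                    '--set' 'global.kai-scheduler.install=true' ;;
    *)
      echo "unknown mode: $1 (expected 'external' or 'bundled')" >&2
      return 1 ;;
  esac
}

# Usage: collect the flags into an array and append them to the Path A command.
#   mapfile -t extra_flags < <(multinode_flags external)
#   helm install dynamo-platform "dynamo-platform-${RELEASE_VERSION}.tgz" \
#     --namespace "${NAMESPACE}" --create-namespace "${extra_flags[@]}"
```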

> [!TIP]
> By default, Model Express Server is not used.
> If you wish to use an existing Model Express Server, you can set the modelExpressURL to the existing server's URL in the helm install command:

```bash
--set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
```

> [!WARNING]
> **DEPRECATED:** Namespace-restricted mode is deprecated and will be removed in a future release.
> By default, Dynamo Operator is installed cluster-wide and will monitor all namespaces. This is the recommended and only supported mode going forward.

### GPU Discovery for DynamoGraphDeploymentRequests (Deprecated Namespace-Scoped Mode)

> **DEPRECATED:** This section applies only to the deprecated namespace-restricted mode. New deployments should use cluster-wide mode, which has GPU discovery enabled by default.

GPU discovery is **enabled by default** for namespace-scoped operators. The Helm chart automatically provisions a ClusterRole/ClusterRoleBinding granting the operator read-only access to node GPU labels.

**To disable GPU discovery** (if your installer lacks ClusterRole creation permissions):

```bash
helm install dynamo-platform ... --set dynamo-operator.gpuDiscovery.enabled=false
```

When GPU discovery is disabled, you must provide hardware configuration manually in each DynamoGraphDeploymentRequest:

```yaml
spec:
  hardware:
    numGpusPerNode: 8
    gpuSku: "H100-SXM5-80GB"
    vramMb: 81920
```

> **Note**: If GPU discovery is disabled and no hardware config is provided, the DGDR will be rejected at admission time with a clear error message.

[Verify Installation](#verify-installation)

## Path B: Custom Build from Source

Build and deploy from source for customization, contributing to Dynamo, or using the latest features from the main branch.

Note: This gives you access to the latest unreleased features and fixes on the main branch.

```bash
# 1. Set environment
export NAMESPACE=dynamo-system
export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo  # or your registry (no trailing slash; "/" is added where the image name is appended)
export DOCKER_USERNAME='$oauthtoken'
export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY>
export IMAGE_TAG=${RELEASE_VERSION}

# 2. Build operator
cd deploy/operator

# 2.1 Alternative 1 : Build and push the operator image for multiple platforms
docker buildx create --name multiplatform --driver docker-container --bootstrap
docker buildx use multiplatform
docker buildx build --platform linux/amd64,linux/arm64 -t $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG --push .

# 2.2 Alternative 2 : Build and push the operator image for a single platform
docker build -t $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG . && docker push $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG

cd -

# 3. Create namespace and secrets to be able to pull the operator image (only needed if you pushed the operator image to a private registry)
kubectl create namespace ${NAMESPACE}
kubectl create secret docker-registry docker-imagepullsecret \
  --docker-server=${DOCKER_SERVER} \
  --docker-username=${DOCKER_USERNAME} \
  --docker-password=${DOCKER_PASSWORD} \
  --namespace=${NAMESPACE}

cd deploy/helm/charts

# 4. Install Platform (CRDs are automatically installed by the chart)
helm dep build ./platform/

# NOTE: Namespace-restricted mode is DEPRECATED. Use cluster-wide mode (the default).

helm install dynamo-platform ./platform/ \
  --namespace "${NAMESPACE}" \
  --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/kubernetes-operator" \
  --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
  --set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret"

```

[Verify Installation](#verify-installation)

## Verify Installation

```bash
# Check CRDs
kubectl get crd | grep dynamo

# Check operator and platform pods
kubectl get pods -n ${NAMESPACE}
# Expected: dynamo-operator-* and etcd-* and nats-* pods Running
```
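
The expected-pods check can also be scripted. The sketch below reads the pod listing from stdin and reports any component without a Running pod. `platform_pods_running` is a hypothetical helper, and it matches component names as substrings on the assumption that pod names may carry a release-name prefix.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin and
# succeed only if a Running pod exists for each expected component.
platform_pods_running() {
  local listing component missing=0
  listing="$(cat)"
  for component in dynamo-operator etcd nats; do
    # Column 1 is the pod name, column 3 is the STATUS column.
    if ! printf '%s\n' "$listing" | awk -v c="$component" \
        'index($1, c) > 0 && $3 == "Running" { found = 1 } END { exit !found }'; then
      echo "no Running pod matching ${component}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage against a live cluster:
#   kubectl get pods -n "${NAMESPACE}" --no-headers | platform_pods_running \
#     && echo "platform pods are running"
```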

## Next Steps

1. **Deploy Model/Workflow**
   ```bash
   # Example: Deploy a vLLM workflow with Qwen3-0.6B using aggregated serving
   kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}

   # Port forward and test
   kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
   curl http://localhost:8000/v1/models
   ```

2. **Explore Backend Guides**
   - [vLLM Deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)
   - [SGLang Deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)
   - [TensorRT-LLM Deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md)

3. **Optional:**
   - [Set up Prometheus & Grafana](./observability/metrics.md)
   - [SLA Planner Guide](../components/planner/planner-guide.md) (for SLA-aware scheduling and autoscaling)

## Troubleshooting

**"VALIDATION ERROR: Cannot install cluster-wide Dynamo operator"**

```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```

Cause: Attempting cluster-wide install on a shared cluster with existing namespace-restricted operators.

Solution: Migrate the existing namespace-restricted operators to cluster-wide mode. Namespace-restricted mode is deprecated and should no longer be used.

**CRDs already exist**

Cause: Installing CRDs on a cluster where they're already present (common on shared clusters).

Solution: Skip the CRD installation step and proceed directly to platform installation.

To check if CRDs exist:
```bash
kubectl get crd | grep dynamo
```

**Pods not starting?**
```bash
kubectl describe pod <pod-name> -n ${NAMESPACE}
kubectl logs <pod-name> -n ${NAMESPACE}
```

**HuggingFace model access?**
```bash
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

**Bitnami etcd "unrecognized" image?**

```bash
ERROR: Original containers have been substituted for unrecognized ones. Deploying this chart with non-standard containers is likely to cause degraded security and performance, broken chart features, and missing environment variables.
```
You might encounter this error during `helm install` because Bitnami moved its images to a [secure Docker repository](https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog).

To work around it, add the following to the `helm install` command:
```bash
--set "etcd.image.repository=bitnamilegacy/etcd" --set "etcd.global.security.allowInsecureImages=true"
```

**Clean uninstall?**

To uninstall the platform, run:
```bash
helm uninstall dynamo-platform --namespace ${NAMESPACE}
```

To uninstall the CRDs, follow these steps:

Get all of the dynamo CRDs installed in your cluster:
```bash
kubectl get crd | grep "dynamo.*nvidia.com"
```

You should see something like this:
```
dynamocomponentdeployments.nvidia.com               2025-10-21T14:49:52Z
dynamocomponents.nvidia.com                         2025-10-25T05:16:10Z
dynamographdeploymentrequests.nvidia.com            2025-11-24T05:26:04Z
dynamographdeployments.nvidia.com                   2025-09-04T20:56:40Z
dynamographdeploymentscalingadapters.nvidia.com     2025-12-09T21:05:59Z
dynamomodels.nvidia.com                             2025-11-07T00:19:43Z
```

Delete each CRD one by one:
```bash
kubectl delete crd <crd-name>
```
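
If there are many CRDs, the one-by-one deletion can be looped. The sketch below extracts the CRD names with the same pattern as the `grep` above. `dynamo_crd_names` is a hypothetical helper, and `xargs -r` assumes GNU xargs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get crd` output on stdin and print only the
# names (first column) of CRDs matching dynamo.*nvidia.com.
dynamo_crd_names() {
  awk '$1 ~ /dynamo.*nvidia\.com/ { print $1 }'
}

# Usage against a live cluster. Deleting a CRD also deletes every custom
# resource of that kind, so review the list before running the second command:
#   kubectl get crd | dynamo_crd_names
#   kubectl get crd | dynamo_crd_names | xargs -r -n 1 kubectl delete crd
```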

## Advanced Options

- [Helm Chart Configuration](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/platform/README.md)
- [Create custom deployments](./deployment/create-deployment.md)
- [Dynamo Operator details](./dynamo-operator.md)
- [Model Express Server details](https://github.com/ai-dynamo/modelexpress)