--- # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 title: Detailed Installation Guide --- Deploy and manage Dynamo inference graphs on Kubernetes with automated orchestration and scaling, using the Dynamo Kubernetes Platform. ## Before You Start Determine your cluster environment: **Shared/Multi-Tenant Cluster** (K8s cluster with existing Dynamo artifacts): - CRDs already installed cluster-wide - skip CRD installation step - A cluster-wide Dynamo operator is likely already running - **Do NOT install another operator** - use the existing cluster-wide operator **Dedicated Cluster** (full cluster admin access): - You install CRDs yourself - Can use cluster-wide operator (default) **Local Development** (Minikube, testing): - See [Minikube Setup](deployment/minikube.md) first, then follow installation steps below To check if CRDs already exist: ```bash kubectl get crd | grep dynamo # If you see dynamographdeployments, dynamocomponentdeployments, etc., CRDs are already installed ``` To check if a cluster-wide operator already exists: ```bash # Check for cluster-wide operator and show its namespace kubectl get clusterrolebinding -o json | \ jq -r '.items[] | select(.metadata.name | contains("dynamo-operator-manager")) | "Cluster-wide operator found in namespace: \(.subjects[0].namespace)"' # If a cluster-wide operator exists: Do NOT install another operator # Only install namespace-restricted mode if you specifically need namespace isolation ``` ## Installation Paths Platform is installed using Dynamo Kubernetes Platform [helm chart](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/platform/README.md). **Path A: Pre-built Artifacts** - Use case: Production deployment, shared or dedicated clusters - Source: NGC published Helm charts - Time: ~10 minutes - Jump to: [Path A](#path-a-production-install) **Path B: Custom Build from Source** - Use case: Contributing to Dynamo, using latest features from main branch, customization - Requirements: Docker build environment - Time: ~30 minutes - Jump to: [Path B](#path-b-custom-build-from-source) All helm install commands could be overridden by either setting the values.yaml file or by passing in your own values.yaml: ```bash helm install ... -f your-values.yaml ``` and/or setting values as flags to the helm install command, as follows: ```bash helm install ... --set "your-value=your-value" ``` ## Prerequisites Before installing the Dynamo Kubernetes Platform, ensure you have the following tools and access: ### Required Tools | Tool | Minimum Version | Description | Installation | |------|-----------------|-------------|--------------| | **kubectl** | v1.24+ | Kubernetes command-line tool | [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) | | **Helm** | v3.0+ | Kubernetes package manager | [Install Helm](https://helm.sh/docs/intro/install/) | | **Docker** | Latest | Container runtime (Path B only) | [Install Docker](https://docs.docker.com/get-docker/) | ### Cluster and Access Requirements - **Kubernetes cluster v1.24+** with admin or namespace-scoped access - **Cluster type determined** (shared vs dedicated) — see [Before You Start](#before-you-start) - **CRD status checked** if on a shared cluster - **NGC credentials** (optional) — required only if pulling NVIDIA images from NGC ### Verify Installation Run the following to confirm your tools are correctly installed: ```bash # Verify tools and versions kubectl version --client # Should show v1.24+ helm version # Should show v3.0+ docker version # Required for Path B only # Set your release version export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases ``` ### Pre-Deployment Checks Before proceeding, run the pre-deployment check script to verify your cluster meets all requirements: ```bash ./deploy/pre-deployment/pre-deployment-check.sh ``` This script validates kubectl connectivity, default StorageClass configuration, and GPU node availability. See [Pre-Deployment Checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README.md) for details. > **No cluster?** See [Minikube Setup](deployment/minikube.md) for local development. **Estimated installation time:** 5-30 minutes depending on path ## Path A: Production Install Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). ```bash # 1. Set environment export NAMESPACE=dynamo-system export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases # 2. Install Platform (CRDs are automatically installed by the chart) helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace ``` > [!WARNING] > **v0.9.0 Helm Chart Issue:** The initial v0.9.0 `dynamo-platform` Helm chart sets the operator image to v0.7.1 instead of v0.9.0. Use `RELEASE_VERSION=0.9.0-post1` or add `--set dynamo-operator.controllerManager.manager.image.tag=0.9.0` to your helm install command. **For Shared/Multi-Tenant Clusters:** > **DEPRECATED:** Namespace-restricted mode (`namespaceRestriction.enabled=true`) is deprecated and will be removed in a future release. New deployments should use the default cluster-wide mode. If you are currently using namespace-restricted mode, plan to migrate to cluster-wide mode. > [!TIP] > For multinode deployments, you need to install multinode orchestration components: > > **Option 1 (Recommended): Grove + KAI Scheduler** > > For production environments, Grove and KAI Scheduler should be installed **separately** from the dynamo-platform chart. This allows independent lifecycle management, version pinning, and upgrade control. > > **Compatibility Matrix:** > > | dynamo-platform | kai-scheduler | Grove | > |-----------------|---------------|-------| > | 1.0.x | >= v0.13.0 | >= v0.1.0-alpha.6 | > > After installing them separately, enable Dynamo integration: > > ```bash > --set "global.kai-scheduler.enabled=true" > --set "global.grove.enabled=true" > ``` > > For **development/testing only**, you can install them as bundled subcharts: > > ```bash > --set "global.grove.install=true" > --set "global.kai-scheduler.install=true" > ``` > > Note: `global.kai-scheduler.install` / `global.grove.install` control whether the bundled subcharts are deployed. When set, integration is automatically enabled. `global.kai-scheduler.enabled` / `global.grove.enabled` can be set independently when using externally-managed installations. > > **Option 2: LeaderWorkerSet (LWS) + Volcano** > - If using LWS for multinode deployments, you must also install Volcano (required dependency): > - [LWS Installation](https://github.com/kubernetes-sigs/lws#installation) > - [Volcano Installation](https://volcano.sh/en/docs/installation/) (required for gang scheduling with LWS) > - These must be installed manually before deploying multinode workloads with LWS. > > See the [Multinode Deployment Guide](./deployment/multinode-deployment.md) for details on orchestrator selection. > [!TIP] > By default, Model Express Server is not used. > If you wish to use an existing Model Express Server, you can set the modelExpressURL to the existing server's URL in the helm install command: ```bash --set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080" ``` > [!WARNING] > **DEPRECATED:** Namespace-restricted mode is deprecated and will be removed in a future release. > By default, Dynamo Operator is installed cluster-wide and will monitor all namespaces. This is the recommended and only supported mode going forward. ### GPU Discovery for DynamoGraphDeploymentRequests (Deprecated Namespace-Scoped Mode) > **DEPRECATED:** This section applies only to the deprecated namespace-restricted mode. New deployments should use cluster-wide mode, which has GPU discovery by default. GPU discovery is **enabled by default** for namespace-scoped operators. The Helm chart automatically provisions a ClusterRole/ClusterRoleBinding granting the operator read-only access to node GPU labels. **To disable GPU discovery** (if your installer lacks ClusterRole creation permissions): ```bash helm install dynamo-platform ... --set dynamo-operator.gpuDiscovery.enabled=false ``` When GPU discovery is disabled, you must provide hardware configuration manually in each DynamoGraphDeploymentRequest: ```yaml spec: hardware: numGpusPerNode: 8 gpuSku: "H100-SXM5-80GB" vramMb: 81920 ``` > **Note**: If GPU discovery is disabled and no hardware config is provided, the DGDR will be rejected at admission time with a clear error message. → [Verify Installation](#verify-installation) ## Path B: Custom Build from Source Build and deploy from source for customization, contributing to Dynamo, or using the latest features from the main branch. Note: This gives you access to the latest unreleased features and fixes on the main branch. ```bash # 1. Set environment export NAMESPACE=dynamo-system export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo/ # or your registry export DOCKER_USERNAME='$oauthtoken' export DOCKER_PASSWORD= export IMAGE_TAG=${RELEASE_VERSION} # 2. Build operator cd deploy/operator # 2.1 Alternative 1 : Build and push the operator image for multiple platforms docker buildx create --name multiplatform --driver docker-container --bootstrap docker buildx use multiplatform docker buildx build --platform linux/amd64,linux/arm64 -t $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG --push . # 2.2 Alternative 2 : Build and push the operator image for a single platform docker build -t $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG . && docker push $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG cd - # 3. Create namespace and secrets to be able to pull the operator image (only needed if you pushed the operator image to a private registry) kubectl create namespace ${NAMESPACE} kubectl create secret docker-registry docker-imagepullsecret \ --docker-server=${DOCKER_SERVER} \ --docker-username=${DOCKER_USERNAME} \ --docker-password=${DOCKER_PASSWORD} \ --namespace=${NAMESPACE} cd deploy/helm/charts # 4. Install Platform (CRDs are automatically installed by the chart) helm dep build ./platform/ # NOTE: Namespace-restricted mode is DEPRECATED. Use cluster-wide mode (the default). helm install dynamo-platform ./platform/ \ --namespace "${NAMESPACE}" \ --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/kubernetes-operator" \ --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \ --set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret" ``` → [Verify Installation](#verify-installation) ## Verify Installation ```bash # Check CRDs kubectl get crd | grep dynamo # Check operator and platform pods kubectl get pods -n ${NAMESPACE} # Expected: dynamo-operator-* and etcd-* and nats-* pods Running ``` ## Next Steps 1. **Deploy Model/Workflow** ```bash # Example: Deploy a vLLM workflow with Qwen3-0.6B using aggregated serving kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE} # Port forward and test kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE} curl http://localhost:8000/v1/models ``` 2. **Explore Backend Guides** - [vLLM Deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md) - [SGLang Deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md) - [TensorRT-LLM Deployments](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/README.md) 3. **Optional:** - [Set up Prometheus & Grafana](./observability/metrics.md) - [SLA Planner Guide](../components/planner/planner-guide.md) (for SLA-aware scheduling and autoscaling) ## Troubleshooting **"VALIDATION ERROR: Cannot install cluster-wide Dynamo operator"** ``` VALIDATION ERROR: Cannot install cluster-wide Dynamo operator. Found existing namespace-restricted Dynamo operators in namespaces: ... ``` Cause: Attempting cluster-wide install on a shared cluster with existing namespace-restricted operators. Solution: Migrate the existing namespace-restricted operators to cluster-wide mode. Namespace-restricted mode is deprecated and should no longer be used. **CRDs already exist** Cause: Installing CRDs on a cluster where they're already present (common on shared clusters). Solution: Skip step 2 (CRD installation), proceed directly to platform installation. To check if CRDs exist: ```bash kubectl get crd | grep dynamo ``` **Pods not starting?** ```bash kubectl describe pod -n ${NAMESPACE} kubectl logs -n ${NAMESPACE} ``` **HuggingFace model access?** ```bash kubectl create secret generic hf-token-secret \ --from-literal=HF_TOKEN=${HF_TOKEN} \ -n ${NAMESPACE} ``` **Bitnami etcd "unrecognized" image?** ```bash ERROR: Original containers have been substituted for unrecognized ones. Deploying this chart with non-standard containers is likely to cause degraded security and performance, broken chart features, and missing environment variables. ``` This error that you might encounter during helm install is due to bitnami changing their docker repository to a [secure one](https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog). just add the following to the helm install command: ```bash --set "etcd.image.repository=bitnamilegacy/etcd" --set "etcd.global.security.allowInsecureImages=true" ``` **Clean uninstall?** To uninstall the platform, you can run the following command: ``` helm uninstall dynamo-platform --namespace ${NAMESPACE} ``` To uninstall the CRDs, follow these steps: Get all of the dynamo CRDs installed in your cluster: ```bash kubectl get crd | grep "dynamo.*nvidia.com" ``` You should see something like this: ``` dynamocomponentdeployments.nvidia.com 2025-10-21T14:49:52Z dynamocomponents.nvidia.com 2025-10-25T05:16:10Z dynamographdeploymentrequests.nvidia.com 2025-11-24T05:26:04Z dynamographdeployments.nvidia.com 2025-09-04T20:56:40Z dynamographdeploymentscalingadapters.nvidia.com 2025-12-09T21:05:59Z dynamomodels.nvidia.com 2025-11-07T00:19:43Z ``` Delete each CRD one by one: ```bash kubectl delete crd ``` ## Advanced Options - [Helm Chart Configuration](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/platform/README.md) - [Create custom deployments](./deployment/create-deployment.md) - [Dynamo Operator details](./dynamo-operator.md) - [Model Express Server details](https://github.com/ai-dynamo/modelexpress)