Commit e28ff8d2 authored by julienmancuso, committed by GitHub

feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart (#2755)


Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
parent 8c2072cf
@@ -34,3 +34,12 @@ dependencies:
version: 11.1.0
repository: "https://charts.bitnami.com/bitnami"
condition: etcd.enabled
- name: kai-scheduler
version: v0.8.1
repository: oci://ghcr.io/nvidia/kai-scheduler
condition: kai-scheduler.enabled
- name: grove-charts
alias: grove
version: v0.0.0-6e30275
repository: oci://ghcr.io/nvidia/grove
condition: grove.enabled
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# dynamo-platform
A Helm chart for NVIDIA Dynamo Platform.
![Version: 0.5.0](https://img.shields.io/badge/Version-0.5.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square)
## 🚀 Overview
The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure on Kubernetes, including:
- **Dynamo Operator**: Kubernetes operator for managing Dynamo deployments
- **NATS**: High-performance messaging system for component communication
- **etcd**: Distributed key-value store for operator state management
- **Grove**: Multi-node inference orchestration (optional)
- **Kai Scheduler**: Advanced workload scheduling (optional)
## 📋 Prerequisites
- Kubernetes cluster (v1.20+)
- Helm 3.8+
- Sufficient cluster resources for your deployment scale
- Container registry access (if using private images)
## 🔧 Configuration
## Requirements
| Repository | Name | Version |
|------------|------|---------|
| file://components/operator | dynamo-operator | 0.5.0 |
| https://charts.bitnami.com/bitnami | etcd | 11.1.0 |
| https://nats-io.github.io/k8s/helm/charts/ | nats | 1.3.2 |
| oci://ghcr.io/nvidia/grove | grove(grove-charts) | v0.0.0-6e30275 |
| oci://ghcr.io/nvidia/kai-scheduler | kai-scheduler | v0.8.1 |
## Values
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| dynamo-operator.enabled | bool | `true` | Whether to enable the Dynamo Kubernetes operator deployment |
| dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" |
| dynamo-operator.etcdAddr | string | `""` | etcd server address for operator state storage (leave empty to use the bundled etcd chart). Format: "http://hostname:port" or "https://hostname:port" |
| dynamo-operator.namespaceRestriction.enabled | bool | `true` | Whether to restrict operator to specific namespaces |
| dynamo-operator.namespaceRestriction.targetNamespace | string | `nil` | Target namespace for operator deployment (leave empty for current namespace) |
| dynamo-operator.controllerManager.tolerations | list | `[]` | Node tolerations for controller manager pods |
| dynamo-operator.controllerManager.manager.image.repository | string | `"nvcr.io/nvidia/ai-dynamo/kubernetes-operator"` | Official NVIDIA Dynamo operator image repository |
| dynamo-operator.controllerManager.manager.image.tag | string | `""` | Image tag (leave empty to use chart default) |
| dynamo-operator.controllerManager.manager.image.pullPolicy | string | `"IfNotPresent"` | Image pull policy - when to pull the image |
| dynamo-operator.controllerManager.manager.args[0] | string | `"--health-probe-bind-address=:8081"` | Health probe endpoint for Kubernetes health checks |
| dynamo-operator.controllerManager.manager.args[1] | string | `"--metrics-bind-address=127.0.0.1:8080"` | Metrics endpoint for Prometheus scraping (localhost only for security) |
| dynamo-operator.imagePullSecrets | list | `[]` | Secrets for pulling private container images |
| dynamo-operator.dynamo.groveTerminationDelay | string | `"15m"` | How long to wait before forcefully terminating Grove instances |
| dynamo-operator.dynamo.internalImages.debugger | string | `"python:3.12-slim"` | Debugger image for troubleshooting deployments |
| dynamo-operator.dynamo.enableRestrictedSecurityContext | bool | `false` | Whether to enable restricted security contexts for enhanced security |
| dynamo-operator.dynamo.dockerRegistry.useKubernetesSecret | bool | `false` | Whether to use Kubernetes secrets for registry authentication |
| dynamo-operator.dynamo.dockerRegistry.server | string | `nil` | Docker registry server URL |
| dynamo-operator.dynamo.dockerRegistry.username | string | `nil` | Registry username |
| dynamo-operator.dynamo.dockerRegistry.password | string | `nil` | Registry password (consider using existingSecretName instead) |
| dynamo-operator.dynamo.dockerRegistry.existingSecretName | string | `nil` | Name of existing Kubernetes secret containing registry credentials |
| dynamo-operator.dynamo.dockerRegistry.secure | bool | `true` | Whether the registry uses HTTPS |
| dynamo-operator.dynamo.ingress.enabled | bool | `false` | Whether to create ingress resources |
| dynamo-operator.dynamo.ingress.className | string | `nil` | Ingress class name (e.g., "nginx", "traefik") |
| dynamo-operator.dynamo.ingress.tlsSecretName | string | `"my-tls-secret"` | Secret name containing TLS certificates |
| dynamo-operator.dynamo.istio.enabled | bool | `false` | Whether to enable Istio integration |
| dynamo-operator.dynamo.istio.gateway | string | `nil` | Istio gateway name for routing |
| dynamo-operator.dynamo.ingressHostSuffix | string | `""` | Host suffix for generated ingress hostnames |
| dynamo-operator.dynamo.virtualServiceSupportsHTTPS | bool | `false` | Whether VirtualServices should support HTTPS routing |
| grove.enabled | bool | `false` | Whether to enable Grove for multi-node inference coordination. If enabled, the Grove operator is deployed cluster-wide |
| kai-scheduler.enabled | bool | `false` | Whether to enable Kai Scheduler for intelligent resource allocation. If enabled, the Kai Scheduler operator is deployed cluster-wide |
| etcd.enabled | bool | `true` | Whether to deploy the bundled etcd chart. Disable to use an external etcd instance |
| nats.enabled | bool | `true` | Whether to deploy the bundled NATS chart. Disable to use an external NATS instance |
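
As an illustration of the table above, the following override is a minimal sketch that enables the two optional components and points the operator at external NATS and etcd instances. The keys are the ones documented in the table; the file name and both hostnames are hypothetical placeholders.

```yaml
# custom-values.yaml (example file name)
# Optional components: both default to false and deploy cluster-wide operators when enabled.
grove:
  enabled: true
kai-scheduler:
  enabled: true

# Skip the bundled charts and use external services instead.
etcd:
  enabled: false
nats:
  enabled: false
dynamo-operator:
  # Hypothetical addresses; the formats follow the natsAddr/etcdAddr descriptions above.
  natsAddr: "nats://nats.example.internal:4222"
  etcdAddr: "http://etcd.example.internal:2379"
```

A file like this can be passed to Helm with `-f custom-values.yaml` when installing or upgrading the chart.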
### NATS Configuration
For detailed NATS configuration options beyond `nats.enabled`, please refer to the official NATS Helm chart documentation:
**[NATS Helm Chart Documentation](https://github.com/nats-io/k8s/tree/main/helm/charts/nats)**
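
Values under the `nats` key are passed through to the NATS subchart. As a small sketch, the override below enlarges the JetStream volume using the key path that appears in this chart's `values.yaml`; the size is illustrative, not a recommendation.

```yaml
nats:
  enabled: true
  config:
    jetstream:
      enabled: true
      fileStore:
        enabled: true
        pvc:
          enabled: true
          size: 20Gi   # illustrative; this chart's values.yaml sets 10Gi
```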
### etcd Configuration
For detailed etcd configuration options beyond `etcd.enabled`, please refer to the official Bitnami etcd Helm chart documentation:
**[etcd Helm Chart Documentation](https://github.com/bitnami/charts/tree/main/bitnami/etcd)**
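
Values under the `etcd` key are likewise passed through to the Bitnami etcd subchart. The sketch below assumes a deployment that wants three replicas and a larger data volume; it uses only keys present in this chart's `values.yaml`, and the numbers are illustrative.

```yaml
etcd:
  enabled: true
  replicaCount: 3      # values.yaml notes 1 for single-node, 3+ for HA
  persistence:
    enabled: true
    size: 5Gi          # illustrative; this chart's values.yaml sets 1Gi
```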
## 📚 Additional Resources
- [Dynamo Cloud Deployment Guide](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
- [NATS Documentation](https://docs.nats.io/)
- [etcd Documentation](https://etcd.io/docs/)
- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
----------------------------------------------
Autogenerated from chart metadata using [helm-docs v1.14.2](https://github.com/norwoodj/helm-docs/releases/v1.14.2)
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
{{ template "chart.header" . }}
{{ template "chart.description" . }}
{{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }}
## 🚀 Overview
The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure on Kubernetes, including:
- **Dynamo Operator**: Kubernetes operator for managing Dynamo deployments
- **NATS**: High-performance messaging system for component communication
- **etcd**: Distributed key-value store for operator state management
- **Grove**: Multi-node inference orchestration (optional)
- **Kai Scheduler**: Advanced workload scheduling (optional)
## 📋 Prerequisites
- Kubernetes cluster (v1.20+)
- Helm 3.8+
- Sufficient cluster resources for your deployment scale
- Container registry access (if using private images)
## 🔧 Configuration
{{ template "chart.requirementsSection" . }}
{{ template "chart.valuesSection" . }}
### NATS Configuration
For detailed NATS configuration options beyond `nats.enabled`, please refer to the official NATS Helm chart documentation:
**[NATS Helm Chart Documentation](https://github.com/nats-io/k8s/tree/main/helm/charts/nats)**
### etcd Configuration
For detailed etcd configuration options beyond `etcd.enabled`, please refer to the official Bitnami etcd Helm chart documentation:
**[etcd Helm Chart Documentation](https://github.com/bitnami/charts/tree/main/bitnami/etcd)**
## 📚 Additional Resources
- [Dynamo Cloud Deployment Guide](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
- [NATS Documentation](https://docs.nats.io/)
- [etcd Documentation](https://etcd.io/docs/)
- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
{{ template "helm-docs.versionFooter" . }}
@@ -491,7 +491,7 @@ subjects:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: {{ include "dynamo-operator.fullname" . }}-queue-reader
name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-queue-reader
labels:
app.kubernetes.io/component: rbac
app.kubernetes.io/created-by: dynamo-operator
@@ -510,7 +510,7 @@ rules:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: {{ include "dynamo-operator.fullname" . }}-queue-reader-binding
name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-queue-reader-binding
labels:
app.kubernetes.io/component: rbac
app.kubernetes.io/created-by: dynamo-operator
@@ -519,7 +519,7 @@ metadata:
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: {{ include "dynamo-operator.fullname" . }}-queue-reader
name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-queue-reader
subjects:
- kind: ServiceAccount
name: '{{ include "dynamo-operator.fullname" . }}-controller-manager'
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
{{- if .Capabilities.APIVersions.Has "scheduling.run.ai/v2" }}
{{- /* Create parent queue first */ -}}
{{- $defaultQueue := lookup "scheduling.run.ai/v2" "Queue" "" "dynamo-default" }}
{{- if not $defaultQueue }}
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
name: dynamo-default
annotations:
"helm.sh/hook": post-install,post-upgrade
"helm.sh/hook-weight": "100"
"helm.sh/hook-delete-policy": before-hook-creation
spec:
resources:
cpu:
quota: -1
limit: -1
overQuotaWeight: 1
gpu:
quota: -1
limit: -1
overQuotaWeight: 1
memory:
quota: -1
limit: -1
overQuotaWeight: 1
{{- end }}
{{- /* Create child queue second */ -}}
{{- $dynamoQueue := lookup "scheduling.run.ai/v2" "Queue" "" "dynamo" }}
{{- if not $dynamoQueue }}
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
name: dynamo
annotations:
"helm.sh/hook": post-install,post-upgrade
"helm.sh/hook-weight": "110"
"helm.sh/hook-delete-policy": before-hook-creation
spec:
parentQueue: dynamo-default
resources:
cpu:
quota: -1
limit: -1
overQuotaWeight: 1
gpu:
quota: -1
limit: -1
overQuotaWeight: 1
memory:
quota: -1
limit: -1
overQuotaWeight: 1
{{- end }}
{{- end }}
\ No newline at end of file
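
The template above creates a parent queue `dynamo-default` and a child queue `dynamo`. As a rough sketch of how a workload might target the child queue, the Pod below assumes that the installed Kai Scheduler release selects queues through the `kai.scheduler/queue` label and registers a scheduler named `kai-scheduler`. Neither detail is stated in this commit, so both should be checked against the Kai Scheduler documentation for the deployed version. The Pod name and image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: queue-demo                 # hypothetical name
  labels:
    kai.scheduler/queue: dynamo    # assumed label key; "dynamo" is the child queue created above
spec:
  schedulerName: kai-scheduler     # assumed scheduler name
  containers:
    - name: main
      image: busybox:1.36          # hypothetical image
      command: ["sleep", "3600"]
```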
@@ -14,170 +14,290 @@
# limitations under the License.
# Used to generate top-level secrets (overridden by custom-values.yaml)
# Subcharts
# Subcharts configuration
# Dynamo operator configuration
dynamo-operator:
# -- Whether to enable the Dynamo Kubernetes operator deployment
enabled: true
# -- NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port"
natsAddr: ""
# -- etcd server address for operator state storage (leave empty to use the bundled etcd chart). Format: "http://hostname:port" or "https://hostname:port"
etcdAddr: ""
# Namespace access controls for the operator
namespaceRestriction:
# -- Whether to restrict operator to specific namespaces
enabled: true
# -- Target namespace for operator deployment (leave empty for current namespace)
targetNamespace:
# Controller manager configuration
controllerManager:
# -- Node tolerations for controller manager pods
tolerations: []
manager:
# Container image configuration for the operator manager
image:
# -- Official NVIDIA Dynamo operator image repository
repository: "nvcr.io/nvidia/ai-dynamo/kubernetes-operator"
# -- Image tag (leave empty to use chart default)
tag: ""
# -- Image pull policy - when to pull the image
pullPolicy: IfNotPresent
# Command line arguments for the operator manager
args:
# -- Health probe endpoint for Kubernetes health checks
- --health-probe-bind-address=:8081
# -- Metrics endpoint for Prometheus scraping (localhost only for security)
- --metrics-bind-address=127.0.0.1:8080
# -- Secrets for pulling private container images
imagePullSecrets: []
# Core Dynamo platform configuration
dynamo:
# -- How long to wait before forcefully terminating Grove instances
groveTerminationDelay: 15m
# Internal utility images used by the platform
internalImages:
# -- Debugger image for troubleshooting deployments
debugger: python:3.12-slim
# -- Whether to enable restricted security contexts for enhanced security
enableRestrictedSecurityContext: false
# Docker registry configuration for private repositories
dockerRegistry:
# -- Whether to use Kubernetes secrets for registry authentication
useKubernetesSecret: false
# -- Docker registry server URL
server:
# -- Registry username
username:
# -- Registry password (consider using existingSecretName instead)
password:
# -- Name of existing Kubernetes secret containing registry credentials
existingSecretName:
# -- Whether the registry uses HTTPS
secure: true
# Ingress configuration for external access
ingress:
# -- Whether to create ingress resources
enabled: false
# -- Ingress class name (e.g., "nginx", "traefik")
className:
# -- Secret name containing TLS certificates
tlsSecretName: my-tls-secret
# Istio service mesh configuration
istio:
# -- Whether to enable Istio integration
enabled: false
# -- Istio gateway name for routing
gateway:
# -- Host suffix for generated ingress hostnames
ingressHostSuffix: ""
# -- Whether VirtualServices should support HTTPS routing
virtualServiceSupportsHTTPS: false
# Grove component - distributed inference orchestration
grove:
# -- Whether to enable Grove for multi-node inference coordination. If enabled, the Grove operator is deployed cluster-wide
enabled: false
# Kai Scheduler component - advanced workload scheduling
kai-scheduler:
# -- Whether to enable Kai Scheduler for intelligent resource allocation. If enabled, the Kai Scheduler operator is deployed cluster-wide
enabled: false
# etcd configuration - distributed key-value store for operator state
# For complete configuration options, see: https://github.com/bitnami/charts/tree/main/bitnami/etcd
etcd:
# -- Whether to deploy the bundled etcd chart. Disable to use an external etcd instance
enabled: true
# Persistent storage configuration for etcd data
persistence:
# Whether to enable persistent storage (recommended for production)
enabled: true
# Use the cluster default storage-class or override with a named class
storageClass: null
# Size of persistent volume for etcd data
size: 1Gi
# Pre-upgrade job configuration
preUpgrade:
# Whether to run pre-upgrade validation jobs
enabled: false
# Number of etcd replicas (1 for single-node, 3+ for HA)
replicaCount: 1
# Explicitly remove authentication
# Authentication and authorization settings
# Explicitly remove authentication for simplified internal communication
auth:
rbac:
# Whether to create RBAC authentication (disabled for internal use)
create: false
# Health check configuration
readinessProbe:
# Whether to enable readiness probes (disabled to reduce startup complexity)
enabled: false
livenessProbe:
# Whether to enable liveness probes (disabled to reduce startup complexity)
enabled: false
# Node tolerations for etcd pods (allows scheduling on specific nodes)
tolerations: []
# NATS configuration - messaging system for operator communication
# For complete configuration options, see: https://github.com/nats-io/k8s/tree/main/helm/charts/nats
nats:
# -- Whether to deploy the bundled NATS chart. Disable to use an external NATS instance
enabled: true
# reference a common CA Certificate or Bundle in all nats config `tls` blocks and nats-box contexts
# note: `tls.verify` still must be set in the appropriate nats config `tls` blocks to require mTLS
# TLS Certificate Authority configuration for secure communication
# Reference a common CA Certificate or Bundle in all nats config `tls` blocks and nats-box contexts
# Note: `tls.verify` still must be set in the appropriate nats config `tls` blocks to require mTLS
tlsCA:
# Whether to enable TLS CA configuration
enabled: false
# Core NATS server configuration
config:
# NATS clustering for high availability (multiple NATS servers)
cluster:
# Whether to enable NATS clustering (disabled for single-node setups)
enabled: false
# JetStream - persistent messaging and streaming capabilities
jetstream:
# Whether to enable JetStream (recommended for persistent messaging)
enabled: true
# File-based storage for JetStream streams and consumers
fileStore:
# Whether to enable file storage (persistent across restarts)
enabled: true
# Directory path for JetStream file storage
dir: /data
############################################################
# stateful set -> volume claim templates -> jetstream pvc
# Persistent Volume Claim for JetStream file storage
############################################################
pvc:
# Whether to create a PVC for JetStream storage
enabled: true
# Size of the persistent volume for JetStream data
size: 10Gi
# Storage class name (leave empty for default)
storageClassName:
# merge or patch the jetstream pvc
# Advanced PVC configuration (merge additional fields)
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#persistentvolumeclaim-v1-core
merge: {}
patch: []
# defaults to "{{ include "nats.fullname" $ }}-js"
# PVC name (defaults to "{{ include "nats.fullname" $ }}-js")
name:
# defaults to the PVC size
# Maximum size for JetStream file storage (defaults to PVC size)
maxSize:
# Memory-based storage for JetStream (non-persistent)
memoryStore:
# Whether to enable memory storage (faster but not persistent)
enabled: false
# merge or patch the jetstream config
# https://docs.nats.io/running-a-nats-service/configuration#jetstream
# Advanced JetStream configuration
# For options see: https://docs.nats.io/running-a-nats-service/configuration#jetstream
merge: {}
patch: []
# Core NATS server settings
nats:
# Port for NATS client connections
port: 4222
# TLS configuration for encrypted connections
tls:
# Whether to enable TLS encryption
enabled: false
# merge or patch the tls config
# https://docs.nats.io/running-a-nats-service/configuration/securing_nats/tls
# Advanced TLS configuration
# For options see: https://docs.nats.io/running-a-nats-service/configuration/securing_nats/tls
merge: {}
patch: []
# Leaf nodes for creating NATS topologies and remote connections
leafnodes:
# Whether to enable leaf node connections
enabled: false
# WebSocket support for browser-based NATS clients
websocket:
# Whether to enable WebSocket protocol support
enabled: false
# MQTT protocol bridge for IoT device connectivity
mqtt:
# Whether to enable MQTT protocol support
enabled: false
# Gateway connections for multi-cluster NATS deployments
gateway:
# Whether to enable gateway connections
enabled: false
# HTTP monitoring endpoint for NATS server metrics
monitor:
# Whether to enable HTTP monitoring interface
enabled: true
# Port for monitoring HTTP endpoint
port: 8222
# TLS configuration for monitoring endpoint
tls:
# config.nats.tls must be enabled also
# when enabled, monitoring port will use HTTPS with the options from config.nats.tls
# Whether to enable HTTPS for monitoring (requires config.nats.tls enabled)
# When enabled, monitoring port will use HTTPS with the options from config.nats.tls
enabled: false
# Go pprof profiling endpoint for performance debugging
profiling:
# Whether to enable profiling endpoint (for debugging only)
enabled: false
# Port for profiling endpoint
port: 65432
# Account resolver for multi-tenant NATS deployments
resolver:
# Whether to enable account resolution (for advanced multi-tenancy)
enabled: false
# adds a prefix to the server name, which defaults to the pod name
# helpful for ensuring server name is unique in a super cluster
# Server naming configuration
# Adds a prefix to the server name, which defaults to the pod name
# Helpful for ensuring server name is unique in a super cluster
serverNamePrefix: ""
# merge or patch the nats config
# https://docs.nats.io/running-a-nats-service/configuration
# following special rules apply
# Advanced NATS configuration merging and patching
# For complete options see: https://docs.nats.io/running-a-nats-service/configuration
# Special rules apply:
# 1. strings that start with << and end with >> will be unquoted
# use this for variables and numbers with units
# 2. keys ending in $include will be switched to include directives
# keys are sorted alphabetically, use prefix before $includes to control includes ordering
# paths should be relative to /etc/nats-config/nats.conf
# example:
#
# Example:
# merge:
# $include: ./my-config.conf
# zzz$include: ./my-config-last.conf
@@ -186,48 +306,48 @@ nats:
# token: << $TOKEN >>
# jetstream:
# max_memory_store: << 1GB >>
#
# will yield the config:
# {
# include ./my-config.conf;
# "authorization": {
# "token": $TOKEN
# },
# "jetstream": {
# "max_memory_store": 1GB
# },
# "server_name": "nats",
# include ./my-config-last.conf;
# }
merge: {}
patch: []
############################################################
# stateful set -> pod template -> nats container
# NATS container configuration in StatefulSet
############################################################
container:
# NATS server container image configuration
image:
# Official NATS server repository
repository: nats
# NATS server version (Alpine-based for smaller size)
tag: 2.10.21-alpine
# Image pull policy (leave empty for chart default)
pullPolicy:
# Custom registry URL (leave empty for Docker Hub)
registry:
# container port options
# must be enabled in the config section also
# Container port configuration
# Note: Ports must also be enabled in the config section above
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#containerport-v1-core
ports:
# Main NATS client connection port
nats: {}
# Leaf node connection port
leafnodes: {}
# WebSocket connection port
websocket: {}
# MQTT protocol port
mqtt: {}
# Cluster communication port
cluster: {}
# Gateway connection port
gateway: {}
# HTTP monitoring port
monitor: {}
# Go profiling port
profiling: {}
# map with key as env var name, value can be string or map
# example:
#
# Environment variables for the NATS container
# Map with key as env var name, value can be string or map
# Example:
# env:
# GOMEMLIMIT: 7GiB
# TOKEN:
@@ -237,211 +357,245 @@ nats:
# key: token
env: {}
# merge or patch the container
# Advanced container configuration merging and patching
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#container-v1-core
merge: {}
patch: []
############################################################
# stateful set -> pod template -> reloader container
# Configuration reloader container for hot config updates
############################################################
reloader:
# Whether to enable the config reloader sidecar container
enabled: true
# Config reloader container image
image:
# Official NATS config reloader repository
repository: natsio/nats-server-config-reloader
# Config reloader version
tag: 0.16.0
# Image pull policy (leave empty for chart default)
pullPolicy:
# Custom registry URL (leave empty for Docker Hub)
registry:
# env var map, see nats.env for an example
# Environment variables for the reloader container
env: {}
# all nats container volume mounts with the following prefixes
# will be mounted into the reloader container
# Volume mount prefixes from NATS container to share with reloader
# All NATS container volume mounts with these prefixes will be mounted into the reloader
natsVolumeMountPrefixes:
- /etc/
# merge or patch the container
# Advanced reloader container configuration
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#container-v1-core
merge: {}
patch: []
############################################################
# stateful set -> pod template -> prom-exporter container
# Prometheus metrics exporter container (optional)
############################################################
# config.monitor must be enabled
# Note: config.monitor must be enabled for this to work
promExporter:
# Whether to enable Prometheus metrics exporter sidecar
enabled: false
############################################################
# service
# Kubernetes Service for NATS access
############################################################
service:
# Whether to create a Kubernetes Service for NATS
enabled: true
# service port options
# additional boolean field enable to control whether port is exposed in the service
# must be enabled in the config section also
# Service port configuration
# Additional boolean field 'enabled' controls whether port is exposed in the service
# Note: Ports must also be enabled in the config section above
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#serviceport-v1-core
ports:
# Main NATS client connection port
nats:
enabled: true
# Leaf node connection port
leafnodes:
enabled: true
# WebSocket connection port
websocket:
enabled: true
# MQTT protocol port
mqtt:
enabled: true
# Cluster communication port (typically internal only)
cluster:
enabled: false
# Gateway connection port (typically internal only)
gateway:
enabled: false
# HTTP monitoring port (typically internal only)
monitor:
enabled: false
# Go profiling port (typically internal only)
profiling:
enabled: false
# merge or patch the service
# Advanced service configuration
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#service-v1-core
merge: {}
patch: []
# defaults to "{{ include "nats.fullname" $ }}"
# Service name (defaults to "{{ include "nats.fullname" $ }}")
name:
############################################################
# other nats extension points
# Advanced NATS Kubernetes resource configuration
############################################################
# stateful set
# StatefulSet configuration for NATS server persistence
statefulSet:
# merge or patch the stateful set
# Advanced StatefulSet configuration merging and patching
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#statefulset-v1-apps
merge: {}
patch: []
# defaults to "{{ include "nats.fullname" $ }}"
# StatefulSet name (defaults to "{{ include "nats.fullname" $ }}")
name:
# stateful set -> pod template
# Pod template configuration for NATS StatefulSet
podTemplate:
# adds a hash of the ConfigMap as a pod annotation
# this will cause the StatefulSet to roll when the ConfigMap is updated
# Whether to add a hash of the ConfigMap as a pod annotation
# This will cause the StatefulSet to roll when the ConfigMap is updated
configChecksumAnnotation: true
# map of topologyKey: topologySpreadConstraint
# labelSelector will be added to match StatefulSet pods
#
# Pod topology spread constraints for better distribution across nodes
# Map of topologyKey: topologySpreadConstraint
# labelSelector will be added automatically to match StatefulSet pods
# Example:
# topologySpreadConstraints:
# kubernetes.io/hostname:
# maxSkew: 1
#
topologySpreadConstraints: {}
# merge or patch the pod template
# Advanced pod template configuration
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#pod-v1-core
merge:
spec:
# Node tolerations for NATS pods (allows scheduling on specific nodes)
tolerations: []
patch: []
# headless service
# Headless service for StatefulSet pod discovery
headlessService:
# merge or patch the headless service
# Advanced headless service configuration
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#service-v1-core
merge: {}
patch: []
# defaults to "{{ include "nats.fullname" $ }}-headless"
# Headless service name (defaults to "{{ include "nats.fullname" $ }}-headless")
name:
# config map
# ConfigMap for NATS server configuration
configMap:
# merge or patch the config map
# Advanced ConfigMap configuration
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#configmap-v1-core
merge: {}
patch: []
# defaults to "{{ include "nats.fullname" $ }}-config"
# ConfigMap name (defaults to "{{ include "nats.fullname" $ }}-config")
name:
# pod disruption budget
# Pod Disruption Budget for controlled rolling updates
podDisruptionBudget:
# Whether to create a PodDisruptionBudget (recommended for production)
enabled: true
# service account
# Service Account for NATS server pods
serviceAccount:
# Whether to create and use a dedicated service account
enabled: false
############################################################
# natsBox
#
# NATS Box Deployment and associated resources
# NATS Box - CLI tools and debugging container
# NATS Box provides CLI tools for interacting with NATS server
############################################################
natsBox:
# Whether to deploy NATS Box for CLI access and debugging
enabled: true
############################################################
# NATS contexts
# NATS client contexts for authentication and connection
############################################################
contexts:
# Default context configuration
default:
# Credentials-based authentication
creds:
# set contents in order to create a secret with the creds file contents
# Inline credentials file contents (base64 encoded)
contents:
# set secretName in order to mount an existing secret to dir
# Name of existing secret containing credentials file
secretName:
# defaults to /etc/nats-creds/<context-name>
# Directory to mount credentials (defaults to /etc/nats-creds/<context-name>)
dir:
# Key name in secret for credentials file
key: nats.creds
# NKey-based authentication (public/private key pairs)
nkey:
# set contents in order to create a secret with the nkey file contents
# Inline NKey file contents (base64 encoded)
contents:
# set secretName in order to mount an existing secret to dir
# Name of existing secret containing NKey file
secretName:
# defaults to /etc/nats-nkeys/<context-name>
# Directory to mount NKey (defaults to /etc/nats-nkeys/<context-name>)
dir:
# Key name in secret for NKey file
key: nats.nk
# used to connect with client certificates
# TLS client certificate authentication
tls:
# set secretName in order to mount an existing secret to dir
# Name of existing secret containing TLS client certificates
secretName:
# defaults to /etc/nats-certs/<context-name>
# Directory to mount certificates (defaults to /etc/nats-certs/<context-name>)
dir:
# Certificate file name in secret
cert: tls.crt
# Private key file name in secret
key: tls.key
# merge or patch the context
# https://docs.nats.io/using-nats/nats-tools/nats_cli#nats-contexts
# Advanced context configuration
# For options see: https://docs.nats.io/using-nats/nats-tools/nats_cli#nats-contexts
merge: {}
patch: []
# name of context to select by default
# Name of context to select by default for NATS CLI operations
defaultContextName: default
############################################################
# deployment -> pod template -> nats-box container
# NATS Box container configuration
############################################################
container:
# NATS Box container image
image:
# Official NATS Box repository with CLI tools
repository: natsio/nats-box
# NATS Box version
tag: 0.14.5
# Image pull policy (leave empty for chart default)
pullPolicy:
# Custom registry URL (leave empty for Docker Hub)
registry:
# env var map, see nats.env for an example
# Environment variables for NATS Box container
env: {}
# merge or patch the container
# Advanced container configuration
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#container-v1-core
merge: {}
patch: []
# service account
# Service Account for NATS Box deployment
serviceAccount:
# Whether to create and use a dedicated service account for NATS Box
enabled: false
# Pod template configuration for NATS Box deployment
podTemplate:
merge:
spec:
# Node tolerations for NATS Box pods
tolerations: []
patch: []
......@@ -57,7 +57,7 @@ ensure-yq:
fi
.PHONY: manifests
manifests: controller-gen ensure-yq ## Generate WebhookConfiguration, ClusterRole and CustomResourceDefinition objects.
manifests: controller-gen ensure-yq generate-api-docs ## Generate WebhookConfiguration, ClusterRole and CustomResourceDefinition objects.
# Use a large maxDescLen to ensure all field comments are included as OpenAPI descriptions
$(CONTROLLER_GEN) rbac:roleName=manager-role crd:maxDescLen=100000 webhook paths="./..." output:crd:artifacts:config=config/crd/bases
echo "Removing name from mainContainer required fields"
......@@ -266,6 +266,27 @@ $(HELMIFY): $(LOCALBIN)
helm: manifests kustomize helmify
$(KUSTOMIZE) build config/default | $(HELMIFY) -image-pull-secrets charts/dynamo-kubernetes-operator
######################### CRD Reference Docs
CRD_REF_DOCS_VERSION ?= v0.0.12
CRD_REF_DOCS ?= $(LOCALBIN)/crd-ref-docs
.PHONY: crd-ref-docs
crd-ref-docs: $(CRD_REF_DOCS) ## Download crd-ref-docs locally if necessary.
$(CRD_REF_DOCS): $(LOCALBIN)
@echo "Installing crd-ref-docs $(CRD_REF_DOCS_VERSION)"
@GOBIN=$(LOCALBIN) go install github.com/elastic/crd-ref-docs@$(CRD_REF_DOCS_VERSION)
@echo "✅ crd-ref-docs $(CRD_REF_DOCS_VERSION) installed successfully"
.PHONY: generate-api-docs
generate-api-docs: crd-ref-docs ## Generate API reference documentation from CRDs
@echo "📚 Generating CRD API reference documentation..."
@mkdir -p docs
@$(CRD_REF_DOCS) \
--source-path=api \
--config=docs/crd-ref-docs-config.yaml \
--renderer=markdown \
--output-path=docs/api_reference.md
@echo "✅ Generated API reference at docs/api_reference.md"
.PHONY: coverage
coverage: test
......
......@@ -67,13 +67,13 @@ type DynamoComponentDeploymentSharedSpec struct {
// Labels to add to generated Kubernetes resources for this component.
Labels map[string]string `json:"labels,omitempty"`
// contains the name of the component
// The name of the component
ServiceName string `json:"serviceName,omitempty"`
// ComponentType indicates the role of this component (for example, "main").
ComponentType string `json:"componentType,omitempty"`
// dynamo namespace of the service (allows to override the dynamo namespace of the service defined in annotations inside the dynamo archive)
// Dynamo namespace of the service (allows overriding the Dynamo namespace of the service defined in annotations inside the Dynamo archive)
DynamoNamespace *string `json:"dynamoNamespace,omitempty"`
// Resources requested and limits for this component, including CPU, memory,
......@@ -99,8 +99,9 @@ type DynamoComponentDeploymentSharedSpec struct {
// ExtraPodMetadata adds labels/annotations to the created Pods.
ExtraPodMetadata *dynamoCommon.ExtraPodMetadata `json:"extraPodMetadata,omitempty"`
// +optional
// ExtraPodSpec merges additional fields into the generated PodSpec for advanced
// customization (tolerations, node selectors, affinity, etc.).
// ExtraPodSpec allows overriding the main pod spec configuration.
// It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field
// that allows overriding the main container configuration.
ExtraPodSpec *dynamoCommon.ExtraPodSpec `json:"extraPodSpec,omitempty"`
// LivenessProbe to detect and restart unhealthy containers.
......
......@@ -17,7 +17,7 @@
* Modifications Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES
*/
// Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group
// Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group.
// +kubebuilder:object:generate=true
// +groupName=nvidia.com
package v1alpha1
......
# API Reference
## Packages
- [nvidia.com/v1alpha1](#nvidiacomv1alpha1)
## nvidia.com/v1alpha1
Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group.
### Resource Types
- [DynamoComponentDeployment](#dynamocomponentdeployment)
- [DynamoGraphDeployment](#dynamographdeployment)
#### Autoscaling
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `enabled` _boolean_ | | | |
| `minReplicas` _integer_ | | | |
| `maxReplicas` _integer_ | | | |
| `behavior` _[HorizontalPodAutoscalerBehavior](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#horizontalpodautoscalerbehavior-v2-autoscaling)_ | | | |
| `metrics` _[MetricSpec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#metricspec-v2-autoscaling) array_ | | | |
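
As an illustration of the fields in the table above, a component spec might configure autoscaling as in the fragment below. This is a minimal sketch, not taken from the operator's own examples: the replica bounds and the CPU utilization target are placeholder values, and the `metrics` entry simply follows the standard Kubernetes `autoscaling/v2` `MetricSpec` shape linked in the table.

```yaml
# Fragment of a component spec; all values are illustrative assumptions.
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 4
  metrics:
    # Standard autoscaling/v2 MetricSpec: scale on average CPU utilization.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```
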
#### DynamoComponentDeployment
DynamoComponentDeployment is the Schema for the dynamocomponentdeployments API
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
| `kind` _string_ | `DynamoComponentDeployment` | | |
| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | |
| `spec` _[DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)_ | Spec defines the desired state for this Dynamo component deployment. | | |
#### DynamoComponentDeploymentSharedSpec
_Appears in:_
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `annotations` _object (keys:string, values:string)_ | Annotations to add to generated Kubernetes resources for this component<br />(such as Pod, Service, and Ingress when applicable). | | |
| `labels` _object (keys:string, values:string)_ | Labels to add to generated Kubernetes resources for this component. | | |
| `serviceName` _string_ | The name of the component | | |
| `componentType` _string_ | ComponentType indicates the role of this component (for example, "main"). | | |
| `dynamoNamespace` _string_ | Dynamo namespace of the service (allows overriding the Dynamo namespace of the service defined in annotations inside the Dynamo archive) | | |
| `resources` _[Resources](#resources)_ | Resources requested and limits for this component, including CPU, memory,<br />GPUs/devices, and any runtime-specific resources. | | |
| `autoscaling` _[Autoscaling](#autoscaling)_ | Autoscaling config for this component (replica range, target utilization, etc.). | | |
| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs defines additional environment variables to inject into the component containers. | | |
| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br />environment variables in the component containers. | | |
| `pvc` _[PVC](#pvc)_ | PVC config describing volumes to be mounted by the component. | | |
| `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). | | |
| `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). | | |
| `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. | | |
| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows overriding the main pod spec configuration.<br />It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field<br />that allows overriding the main container configuration. | | |
| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. | | |
| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. | | |
| `replicas` _integer_ | Replicas is the desired number of Pods for this component when autoscaling is not used. | | |
| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. | | |
#### DynamoComponentDeploymentSpec
DynamoComponentDeploymentSpec defines the desired state of DynamoComponentDeployment
_Appears in:_
- [DynamoComponentDeployment](#dynamocomponentdeployment)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `dynamoComponent` _string_ | DynamoComponent selects the Dynamo component from the archive to deploy.<br />Typically corresponds to a component defined in the packaged Dynamo artifacts. | | |
| `dynamoTag` _string_ | DynamoTag contains the tag of the DynamoComponent, for example "my_package:MyService". | | |
| `backendFramework` _string_ | BackendFramework specifies the backend framework (e.g., "sglang", "vllm", "trtllm") | | Enum: [sglang vllm trtllm] <br /> |
| `annotations` _object (keys:string, values:string)_ | Annotations to add to generated Kubernetes resources for this component<br />(such as Pod, Service, and Ingress when applicable). | | |
| `labels` _object (keys:string, values:string)_ | Labels to add to generated Kubernetes resources for this component. | | |
| `serviceName` _string_ | The name of the component | | |
| `componentType` _string_ | ComponentType indicates the role of this component (for example, "main"). | | |
| `dynamoNamespace` _string_ | Dynamo namespace of the service (allows overriding the Dynamo namespace of the service defined in annotations inside the Dynamo archive) | | |
| `resources` _[Resources](#resources)_ | Resources requested and limits for this component, including CPU, memory,<br />GPUs/devices, and any runtime-specific resources. | | |
| `autoscaling` _[Autoscaling](#autoscaling)_ | Autoscaling config for this component (replica range, target utilization, etc.). | | |
| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs defines additional environment variables to inject into the component containers. | | |
| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br />environment variables in the component containers. | | |
| `pvc` _[PVC](#pvc)_ | PVC config describing volumes to be mounted by the component. | | |
| `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). | | |
| `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). | | |
| `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. | | |
| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows overriding the main pod spec configuration.<br />It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field<br />that allows overriding the main container configuration. | | |
| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. | | |
| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. | | |
| `replicas` _integer_ | Replicas is the desired number of Pods for this component when autoscaling is not used. | | |
| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. | | |
#### DynamoGraphDeployment
DynamoGraphDeployment is the Schema for the dynamographdeployments API.
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
| `kind` _string_ | `DynamoGraphDeployment` | | |
| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | |
| `spec` _[DynamoGraphDeploymentSpec](#dynamographdeploymentspec)_ | Spec defines the desired state for this graph deployment. | | |
| `status` _[DynamoGraphDeploymentStatus](#dynamographdeploymentstatus)_ | Status reflects the current observed state of this graph deployment. | | |
#### DynamoGraphDeploymentSpec
DynamoGraphDeploymentSpec defines the desired state of DynamoGraphDeployment.
_Appears in:_
- [DynamoGraphDeployment](#dynamographdeployment)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `dynamoGraph` _string_ | DynamoGraph selects the graph (workflow/topology) to deploy. This must match<br />a graph name packaged with the Dynamo archive. | | |
| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs are environment variables applied to all services in the graph unless<br />overridden by service-specific configuration. | | Optional: {} <br /> |
| `backendFramework` _string_ | BackendFramework specifies the backend framework (e.g., "sglang", "vllm", "trtllm"). | | Enum: [sglang vllm trtllm] <br /> |
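
To make the three fields above concrete, a manifest could look roughly like the following. It is a sketch under stated assumptions: the resource name and the `dynamoGraph` value are hypothetical, the environment variable is a placeholder, and the actual CRD may require or accept fields that this table does not list.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: example-graph            # hypothetical name
spec:
  dynamoGraph: example-graph     # hypothetical; must match a graph packaged with the Dynamo archive
  backendFramework: vllm         # one of: sglang, vllm, trtllm
  envs:
    # Applied to all services in the graph unless a service overrides it.
    - name: GLOBAL_ENV_VAR
      value: some_global_value
```
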
#### DynamoGraphDeploymentStatus
DynamoGraphDeploymentStatus defines the observed state of DynamoGraphDeployment.
_Appears in:_
- [DynamoGraphDeployment](#dynamographdeployment)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `state` _string_ | State is a high-level textual status of the graph deployment lifecycle. | | |
| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions contains the latest observed conditions of the graph deployment.<br />The slice is merged by type on patch updates. | | |
#### IngressSpec
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `enabled` _boolean_ | Enabled exposes the component through an ingress or virtual service when true. | | |
| `host` _string_ | Host is the base host name to route external traffic to this component. | | |
| `useVirtualService` _boolean_ | UseVirtualService indicates whether to configure a service-mesh VirtualService instead of a standard Ingress. | | |
| `virtualServiceGateway` _string_ | VirtualServiceGateway optionally specifies the gateway name to attach the VirtualService to. | | |
| `hostPrefix` _string_ | HostPrefix is an optional prefix added before the host. | | |
| `annotations` _object (keys:string, values:string)_ | Annotations to set on the generated Ingress/VirtualService resources. | | |
| `labels` _object (keys:string, values:string)_ | Labels to set on the generated Ingress/VirtualService resources. | | |
| `tls` _[IngressTLSSpec](#ingresstlsspec)_ | TLS holds the TLS configuration used by the Ingress/VirtualService. | | |
| `hostSuffix` _string_ | HostSuffix is an optional suffix appended after the host. | | |
| `ingressControllerClassName` _string_ | IngressControllerClassName selects the ingress controller class (e.g., "nginx"). | | |
#### IngressTLSSpec
_Appears in:_
- [IngressSpec](#ingressspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `secretName` _string_ | SecretName is the name of a Kubernetes Secret containing the TLS certificate and key. | | |
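
The `IngressSpec` and `IngressTLSSpec` fields combine as in the fragment below. This is only a sketch: the host, the Secret name, and the choice of the `nginx` class are assumptions for illustration, and the prefix/suffix fields are left out because the table does not spell out how they are joined to the host.

```yaml
# Fragment of a component spec; all values are illustrative assumptions.
ingress:
  enabled: true
  host: my-model                        # hypothetical base host name
  ingressControllerClassName: nginx
  tls:
    secretName: my-model-tls            # hypothetical Secret holding the TLS certificate and key
```
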
#### MultinodeSpec
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `nodeCount` _integer_ | NodeCount is the number of nodes to deploy for multinode components.<br />The total number of GPUs is `nodeCount` * GPU limit.<br />Must be greater than 1. | 2 | Minimum: 2 <br /> |
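
A short worked example of the GPU arithmetic described above, assuming a GPU limit of 8 for the component (the figure is an assumption chosen only for illustration):

```yaml
# Fragment of a component spec.
multinode:
  nodeCount: 2   # the documented default and minimum
# With an assumed GPU limit of 8 for this component,
# the total is 2 nodes * 8 GPUs = 16 GPUs.
```
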
#### PVC
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `create` _boolean_ | Create indicates whether to create a new PVC. | | |
| `name` _string_ | Name is the name of the PVC. | | |
| `storageClass` _string_ | StorageClass to be used for PVC creation. Leave it empty if the PVC already exists. | | |
| `size` _[Quantity](#quantity)_ | Size of the NIM cache in Gi, used during PVC creation | | |
| `volumeAccessMode` _[PersistentVolumeAccessMode](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#persistentvolumeaccessmode-v1-core)_ | VolumeAccessMode is the volume access mode of the PVC | | |
| `mountPoint` _string_ | | | |
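
A possible PVC block using the fields above is sketched here. The names, storage class, size, and mount path are all hypothetical, and `mountPoint` is assumed to be the path at which the volume is mounted in the container, since the table gives it no description.

```yaml
# Fragment of a component spec; all values are illustrative assumptions.
pvc:
  create: true                      # ask the operator to create the PVC
  name: model-cache                 # hypothetical PVC name
  storageClass: standard            # hypothetical; only relevant when create is true
  size: 50Gi                        # Kubernetes Quantity
  volumeAccessMode: ReadWriteMany   # a standard PersistentVolumeAccessMode
  mountPoint: /models               # assumed to be the in-container mount path
```
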
#### SharedMemorySpec
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `disabled` _boolean_ | | | |
| `size` _[Quantity](#quantity)_ | | | |
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Configuration file for crd-ref-docs
# https://github.com/elastic/crd-ref-docs
processor:
# Ignore common metadata fields that are not user-configurable
ignoreFields:
- "metadata.creationTimestamp"
- "metadata.generation"
- "metadata.resourceVersion"
- "metadata.selfLink"
- "metadata.uid"
- "status.conditions[*].lastTransitionTime"
- "status.observedGeneration"
- "TypeMeta$"
ignoreTypes:
- "List$"
- "ParseError$"
# Ignore only the override wrapper type to reduce repetition
# Keep SharedSpec so embedded fields are documented once
- "DynamoComponentDeploymentOverridesSpec$"
- "DynamoComponentDeploymentStatus$"
- "BaseStatus$"
render:
# Output format - use markdown instead of default asciidoc
format: markdown
# Kubernetes version for API compatibility info
kubernetesVersion: "1.28"
# Group related resources together
groupByKind: true
# Include resource descriptions
includeDescription: true
# Reduce repetition in links and references
allowDangerousTypes: false
# Sort types alphabetically for better organization
sortByName: true
......@@ -23,12 +23,68 @@ SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = build
##@ General
# Put it first so that "make" without argument is like "make help".
help:
help: ## Display help for all targets
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
@echo ""
@echo "Additional documentation targets:"
@awk 'BEGIN {FS = ":.*##"; printf " \033[36m%-20s\033[0m %s\n", "TARGET", "DESCRIPTION"} /^[a-zA-Z_0-9-]+:.*?##/ { printf " \033[36m%-20s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) }' $(MAKEFILE_LIST)
clean:
clean: ## Clean build artifacts
@rm -fr ${BUILDDIR}
##@ Helm Documentation
## Location to install dependencies to
LOCALBIN ?= $(shell pwd)/bin
$(LOCALBIN):
mkdir -p $(LOCALBIN)
## Tool Versions
HELM_DOCS_VERSION ?= 1.14.2
## Tool Binaries
HELM_DOCS ?= $(LOCALBIN)/helm-docs-$(HELM_DOCS_VERSION)
.PHONY: helm-docs-install
helm-docs-install: $(HELM_DOCS) ## Download helm-docs locally if necessary
$(HELM_DOCS): $(LOCALBIN)
@echo "📥 Downloading helm-docs $(HELM_DOCS_VERSION)..."
@ARCH=$$(uname -m | sed 's/x86_64/amd64/' | sed 's/aarch64/arm64/'); \
OS=$$(uname -s | tr '[:upper:]' '[:lower:]'); \
curl -sSL "https://github.com/norwoodj/helm-docs/releases/download/v$(HELM_DOCS_VERSION)/helm-docs_$(HELM_DOCS_VERSION)_$${OS}_$${ARCH}.tar.gz" | \
tar xz -C $(LOCALBIN) helm-docs && \
mv $(LOCALBIN)/helm-docs $(HELM_DOCS) && \
echo "✅ helm-docs $(HELM_DOCS_VERSION) installed successfully"
.PHONY: generate-helm-docs
generate-helm-docs: helm-docs-install ## Generate README.md for Helm charts from values.yaml
@echo "📚 Generating Helm chart documentation..."
@cd ../deploy/cloud/helm/platform && $(realpath $(HELM_DOCS)) \
--template-files=README.md.gotmpl \
--output-file=README.md \
--sort-values-order=file \
--chart-to-generate=. \
--ignore-non-descriptions
@echo "✅ Generated documentation at ../deploy/cloud/helm/platform/README.md"
.PHONY: helm-docs-clean
helm-docs-clean: ## Remove generated helm documentation
@echo "🧹 Cleaning generated helm documentation..."
@rm -f ../deploy/cloud/helm/platform/README.md
@echo "✅ Cleaned helm documentation"
.PHONY: generate-crd-docs
generate-crd-docs: ## Generate CRD API reference documentation
@echo "📚 Generating CRD API reference documentation..."
@cd ../deploy/cloud/operator && $(MAKE) generate-api-docs
@echo "✅ CRD API reference generated"
.PHONY: docs-all
docs-all: generate-helm-docs generate-crd-docs html ## Generate all documentation (Sphinx + Helm + CRDs)
.PHONY: help Makefile clean
......
......@@ -59,6 +59,14 @@ It's a Kubernetes Custom Resource that defines your inference pipeline:
The scripts in the `components/<backend>/launch` folder, such as `agg.sh`, demonstrate how you can serve your models locally. The corresponding YAML files, such as `agg.yaml`, show how you could create a Kubernetes deployment for your inference graph.
## 📖 API Reference & Documentation
For detailed technical specifications of Dynamo's Kubernetes resources:
- **[API Reference](api-reference.md)** - Complete CRD field specifications for `DynamoGraphDeployment` and `DynamoComponentDeployment`
- **[Operator Guide](dynamo_operator.md)** - Dynamo operator configuration and management
- **[Create Deployment](create_deployment.md)** - Step-by-step deployment creation examples
### Choosing Your Architecture Pattern
When creating a deployment, select the architecture pattern that best fits your use case:
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo CRD API Reference
For the complete technical API reference for Dynamo Custom Resource Definitions, see:
**📖 [Dynamo CRD API Reference](../../../deploy/cloud/operator/docs/api_reference.md)**
......@@ -39,7 +39,7 @@ helm version # v3.0+
docker version # Running daemon
# Set your inference runtime image
export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
# Also available: sglang-runtime, tensorrtllm-runtime
```
......@@ -53,7 +53,7 @@ Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidi
```bash
# 1. Set environment
export NAMESPACE=dynamo-kubernetes
export RELEASE_VERSION=0.4.1 # any version of Dynamo 0.3.2+
export RELEASE_VERSION=0.5.0 # any version of Dynamo 0.3.2+
# 2. Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
......@@ -65,6 +65,15 @@ helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE}
```
> [!TIP]
> By default, Grove and Kai Scheduler are NOT installed. To enable them, add the following flags to the `helm install` command:
```bash
--set "grove.enabled=true"
--set "kai-scheduler.enabled=true"
```
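
The same two settings can also be kept in a values file instead of being passed as flags. The sketch below mirrors the flag names above; the file name `values-grove-kai.yaml` is a hypothetical choice.

```yaml
# values-grove-kai.yaml (hypothetical file name)
# Equivalent to --set "grove.enabled=true" --set "kai-scheduler.enabled=true".
grove:
  enabled: true
kai-scheduler:
  enabled: true
```

Such a file would be supplied to the `helm install` command shown earlier with Helm's `-f` (`--values`) option.
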
[Verify Installation](#verify-installation)
## Path C: Custom Development
......@@ -79,7 +88,7 @@ export NAMESPACE=dynamo-cloud
export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo/ # or your registry
export DOCKER_USERNAME='$oauthtoken'
export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY>
export IMAGE_TAG=0.4.1
export IMAGE_TAG=0.5.0
# 2. Build operator
cd deploy/cloud/operator
......@@ -176,6 +185,7 @@ kubectl create secret generic hf-token-secret \
## Advanced Options
- [Helm Chart Configuration](../../../deploy/cloud/helm/platform/README.md)
- [GKE-specific setup](gke_setup.md)
- [Create custom deployments](create_deployment.md)
- [Dynamo Operator details](dynamo_operator.md)
......@@ -23,50 +23,9 @@ Dynamo operator is a Kubernetes operator that simplifies the deployment, configu
## Custom Resource Definitions (CRDs)
### CRD: `DynamoGraphDeployment`
For the complete technical API reference for Dynamo Custom Resource Definitions, see:
| Field | Type | Description | Required | Default |
|------------------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------|----------|---------|
| `services` | map | Map of service names to runtime configurations. This allows the user to override the service configuration defined in the DynamoComponentDeployment. | Yes | |
| `envs` | list | list of global environment variables. | No | |
**API Version:** `nvidia.com/v1alpha1`
**Scope:** Namespaced
#### Example
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: disagg
spec:
envs:
- name: GLOBAL_ENV_VAR
value: some_global_value
services:
Frontend:
replicas: 1
envs:
- name: SPECIFIC_ENV_VAR
value: some_specific_value
Processor:
replicas: 1
envs:
- name: SPECIFIC_ENV_VAR
value: some_specific_value
VllmWorker:
replicas: 1
envs:
- name: SPECIFIC_ENV_VAR
value: some_specific_value
PrefillWorker:
replicas: 1
envs:
- name: SPECIFIC_ENV_VAR
value: some_specific_value
```
**📖 [Dynamo CRD API Reference](../../../deploy/cloud/operator/docs/api_reference.md)**
## Installation
......@@ -151,25 +110,6 @@ export NAMESPACE=<namespace-with-the-dynamo-cloud-operator>
kubectl get dynamographdeployment llm-agg -n $NAMESPACE
```
## Reconciliation Logic
### DynamoGraphDeployment
- **Actions:**
- Create a DynamoComponent CR to build the docker image
- Create a DynamoComponentDeployment CR for each component defined in the Dynamo graph being deployed
- **Status Management:**
- `.status.conditions`: Reflects readiness, failure, progress states
- `.status.state`: overall state of the deployment, based on the state of the DynamoComponentDeployments
### DynamoComponentDeployment
- **Actions:**
- Create a Deployment, Service, and Ingress for the service
- **Status Management:**
- `.status.conditions`: Reflects readiness, failure, progress states
## Configuration
......
......@@ -87,10 +87,14 @@ Grove represents a significant advancement in Kubernetes-based orchestration for
## Getting Started
> **Note**: Grove is currently in development and aligning with NVIDIA Dynamo's release schedule.
Grove relies on KAI Scheduler for resource allocation and scheduling.
For KAI Scheduler, see the [KAI Scheduler Deployment Guide](https://github.com/NVIDIA/KAI-Scheduler).
For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md).
For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](multinode-deployment.md), which demonstrates multi-node disaggregated serving scenarios.
For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove).
Dynamo Cloud also allows you to install Grove and KAI Scheduler as part of the platform installation. See the [Dynamo Cloud Deployment Guide](dynamo_cloud.md) for more details.
\ No newline at end of file