feat: Add Grove and Kai scheduler as part of dynamo cloud helm chart (#2755)

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

Unverified Commit e28ff8d2 authored Aug 28, 2025 by julienmancuso Committed by GitHub Aug 29, 2025
17 changed files
--- a/deploy/cloud/helm/platform/Chart.yaml
+++ b/deploy/cloud/helm/platform/Chart.yaml
@@ -34,3 +34,12 @@ dependencies:
    version: 11.1.0
    repository: "https://charts.bitnami.com/bitnami"
    condition: etcd.enabled
+  - name: kai-scheduler
+    version: v0.8.1
+    repository: oci://ghcr.io/nvidia/kai-scheduler
+    condition: kai-scheduler.enabled
+  - name: grove-charts
+    alias: grove
+    version: v0.0.0-6e30275
+    repository: oci://ghcr.io/nvidia/grove
+    condition: grove.enabled
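Both new dependencies are gated by their `condition` keys, so neither is installed unless a values override opts in. A minimal sketch of such an override follows. The file name `custom-values.yaml` is borrowed from the comment in `values.yaml`, and the top-level key is `grove` rather than `grove-charts` because of the `alias` declared above.

```yaml
# custom-values.yaml (illustrative override, not part of this commit)
# Opt in to the two optional subcharts added to Chart.yaml.
grove:
  enabled: true          # matches `condition: grove.enabled`
kai-scheduler:
  enabled: true          # matches `condition: kai-scheduler.enabled`
```

Because both charts are pulled from OCI registries, the dependencies would typically be fetched with `helm dependency update` before installing with `-f custom-values.yaml`. The exact release name and namespace are left to the deployment.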
--- a/deploy/cloud/helm/platform/README.md
+++ b/deploy/cloud/helm/platform/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# dynamo-platform
+
+A Helm chart for NVIDIA Dynamo Platform.
+
+![Version: 0.5.0](https://img.shields.io/badge/Version-0.5.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square)
+
+## 🚀 Overview
+
+The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure on Kubernetes, including:
+
+- **Dynamo Operator**: Kubernetes operator for managing Dynamo deployments
+- **NATS**: High-performance messaging system for component communication
+- **etcd**: Distributed key-value store for operator state management
+- **Grove**: Multi-node inference orchestration (optional)
+- **Kai Scheduler**: Advanced workload scheduling (optional)
+
+## 📋 Prerequisites
+
+- Kubernetes cluster (v1.20+)
+- Helm 3.8+
+- Sufficient cluster resources for your deployment scale
+- Container registry access (if using private images)
+
+## 🔧 Configuration
+
+## Requirements
+
+| Repository | Name | Version |
+|------------|------|---------|
+| file://components/operator | dynamo-operator | 0.5.0 |
+| https://charts.bitnami.com/bitnami | etcd | 11.1.0 |
+| https://nats-io.github.io/k8s/helm/charts/ | nats | 1.3.2 |
+| oci://ghcr.io/nvidia/grove | grove(grove-charts) | v0.0.0-6e30275 |
+| oci://ghcr.io/nvidia/kai-scheduler | kai-scheduler | v0.8.1 |
+
+## Values
+
+| Key | Type | Default | Description |
+|-----|------|---------|-------------|
+| dynamo-operator.enabled | bool | `true` | Whether to enable the Dynamo Kubernetes operator deployment |
+| dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" |
+| dynamo-operator.etcdAddr | string | `""` | etcd server address for operator state storage (leave empty to use the bundled etcd chart). Format: "http://hostname:port" or "https://hostname:port" |
+| dynamo-operator.namespaceRestriction.enabled | bool | `true` | Whether to restrict operator to specific namespaces |
+| dynamo-operator.namespaceRestriction.targetNamespace | string | `nil` | Target namespace for operator deployment (leave empty for current namespace) |
+| dynamo-operator.controllerManager.tolerations | list | `[]` | Node tolerations for controller manager pods |
+| dynamo-operator.controllerManager.manager.image.repository | string | `"nvcr.io/nvidia/ai-dynamo/kubernetes-operator"` | Official NVIDIA Dynamo operator image repository |
+| dynamo-operator.controllerManager.manager.image.tag | string | `""` | Image tag (leave empty to use chart default) |
+| dynamo-operator.controllerManager.manager.image.pullPolicy | string | `"IfNotPresent"` | Image pull policy - when to pull the image |
+| dynamo-operator.controllerManager.manager.args[0] | string | `"--health-probe-bind-address=:8081"` | Health probe endpoint for Kubernetes health checks |
+| dynamo-operator.controllerManager.manager.args[1] | string | `"--metrics-bind-address=127.0.0.1:8080"` | Metrics endpoint for Prometheus scraping (localhost only for security) |
+| dynamo-operator.imagePullSecrets | list | `[]` | Secrets for pulling private container images |
+| dynamo-operator.dynamo.groveTerminationDelay | string | `"15m"` | How long to wait before forcefully terminating Grove instances |
+| dynamo-operator.dynamo.internalImages.debugger | string | `"python:3.12-slim"` | Debugger image for troubleshooting deployments |
+| dynamo-operator.dynamo.enableRestrictedSecurityContext | bool | `false` | Whether to enable restricted security contexts for enhanced security |
+| dynamo-operator.dynamo.dockerRegistry.useKubernetesSecret | bool | `false` | Whether to use Kubernetes secrets for registry authentication |
+| dynamo-operator.dynamo.dockerRegistry.server | string | `nil` | Docker registry server URL |
+| dynamo-operator.dynamo.dockerRegistry.username | string | `nil` | Registry username |
+| dynamo-operator.dynamo.dockerRegistry.password | string | `nil` | Registry password (consider using existingSecretName instead) |
+| dynamo-operator.dynamo.dockerRegistry.existingSecretName | string | `nil` | Name of existing Kubernetes secret containing registry credentials |
+| dynamo-operator.dynamo.dockerRegistry.secure | bool | `true` | Whether the registry uses HTTPS |
+| dynamo-operator.dynamo.ingress.enabled | bool | `false` | Whether to create ingress resources |
+| dynamo-operator.dynamo.ingress.className | string | `nil` | Ingress class name (e.g., "nginx", "traefik") |
+| dynamo-operator.dynamo.ingress.tlsSecretName | string | `"my-tls-secret"` | Secret name containing TLS certificates |
+| dynamo-operator.dynamo.istio.enabled | bool | `false` | Whether to enable Istio integration |
+| dynamo-operator.dynamo.istio.gateway | string | `nil` | Istio gateway name for routing |
+| dynamo-operator.dynamo.ingressHostSuffix | string | `""` | Host suffix for generated ingress hostnames |
+| dynamo-operator.dynamo.virtualServiceSupportsHTTPS | bool | `false` | Whether VirtualServices should support HTTPS routing |
+| grove.enabled | bool | `false` | Whether to enable Grove for multi-node inference coordination. If enabled, the Grove operator is deployed cluster-wide |
+| kai-scheduler.enabled | bool | `false` | Whether to enable Kai Scheduler for intelligent resource allocation. If enabled, the Kai Scheduler operator is deployed cluster-wide |
+| etcd.enabled | bool | `true` | Whether to enable etcd deployment. Disable to use an external etcd instance |
+| nats.enabled | bool | `true` | Whether to enable NATS deployment. Disable to use an external NATS instance |
+
+### NATS Configuration
+
+For detailed NATS configuration options beyond `nats.enabled`, please refer to the official NATS Helm chart documentation:
+**[NATS Helm Chart Documentation](https://github.com/nats-io/k8s/tree/main/helm/charts/nats)**
+
+### etcd Configuration
+
+For detailed etcd configuration options beyond `etcd.enabled`, please refer to the official Bitnami etcd Helm chart documentation:
+**[etcd Helm Chart Documentation](https://github.com/bitnami/charts/tree/main/bitnami/etcd)**
+
+## 📚 Additional Resources
+
+- [Dynamo Cloud Deployment Guide](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
+- [NATS Documentation](https://docs.nats.io/)
+- [etcd Documentation](https://etcd.io/docs/)
+- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
+
+----------------------------------------------
+Autogenerated from chart metadata using [helm-docs v1.14.2](https://github.com/norwoodj/helm-docs/releases/v1.14.2)
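The values table implies a pairing: when the bundled NATS or etcd chart is disabled, the operator needs an explicit address for the external instance. A minimal sketch of that combination is shown here. The hostnames are placeholders, and the ports are the conventional NATS client port (4222, also the chart default) and etcd client port (2379).

```yaml
# Illustrative override: point the operator at externally managed NATS and etcd.
nats:
  enabled: false         # skip the bundled NATS chart
etcd:
  enabled: false         # skip the bundled etcd chart
dynamo-operator:
  # Placeholder addresses; replace with the real service endpoints.
  natsAddr: "nats://nats.example.internal:4222"
  etcdAddr: "http://etcd.example.internal:2379"
```

Leaving `natsAddr` and `etcdAddr` empty keeps the default behavior of using the bundled charts.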
--- a/deploy/cloud/helm/platform/README.md.gotmpl
+++ b/deploy/cloud/helm/platform/README.md.gotmpl
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+{{ template "chart.header" . }}
+
+{{ template "chart.description" . }}
+
+{{ template "chart.versionBadge" . }}{{ template "chart.typeBadge" . }}{{ template "chart.appVersionBadge" . }}
+
+## 🚀 Overview
+
+The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure on Kubernetes, including:
+
+- **Dynamo Operator**: Kubernetes operator for managing Dynamo deployments
+- **NATS**: High-performance messaging system for component communication
+- **etcd**: Distributed key-value store for operator state management
+- **Grove**: Multi-node inference orchestration (optional)
+- **Kai Scheduler**: Advanced workload scheduling (optional)
+
+## 📋 Prerequisites
+
+- Kubernetes cluster (v1.20+)
+- Helm 3.8+
+- Sufficient cluster resources for your deployment scale
+- Container registry access (if using private images)
+
+## 🔧 Configuration
+
+{{ template "chart.requirementsSection" . }}
+
+{{ template "chart.valuesSection" . }}
+
+### NATS Configuration
+
+For detailed NATS configuration options beyond `nats.enabled`, please refer to the official NATS Helm chart documentation:
+**[NATS Helm Chart Documentation](https://github.com/nats-io/k8s/tree/main/helm/charts/nats)**
+
+### etcd Configuration
+
+For detailed etcd configuration options beyond `etcd.enabled`, please refer to the official Bitnami etcd Helm chart documentation:
+**[etcd Helm Chart Documentation](https://github.com/bitnami/charts/tree/main/bitnami/etcd)**
+
+
+## 📚 Additional Resources
+
+- [Dynamo Cloud Deployment Guide](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
+- [NATS Documentation](https://docs.nats.io/)
+- [etcd Documentation](https://etcd.io/docs/)
+- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
+
+{{ template "helm-docs.versionFooter" . }}
--- a/deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml
+++ b/deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml
@@ -491,7 +491,7 @@ subjects:
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRole
 metadata:
-  name: {{ include "dynamo-operator.fullname" . }}-queue-reader
+  name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-queue-reader
  labels:
    app.kubernetes.io/component: rbac
    app.kubernetes.io/created-by: dynamo-operator
@@ -510,7 +510,7 @@ rules:
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:
-  name: {{ include "dynamo-operator.fullname" . }}-queue-reader-binding
+  name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-queue-reader-binding
  labels:
    app.kubernetes.io/component: rbac
    app.kubernetes.io/created-by: dynamo-operator
@@ -519,7 +519,7 @@ metadata:
 roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
-  name: {{ include "dynamo-operator.fullname" . }}-queue-reader
+  name: {{ include "dynamo-operator.fullname" . }}-{{ .Release.Namespace }}-queue-reader
 subjects:
 - kind: ServiceAccount
  name: '{{ include "dynamo-operator.fullname" . }}-controller-manager'
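The three hunks above add the release namespace to the names of the cluster-scoped `queue-reader` objects, presumably so that installs of the same release name in different namespaces no longer collide. As a sketch of the rendered result, assume the `dynamo-operator.fullname` helper renders to `dynamo-platform-dynamo-operator` and the release namespace is `team-a`. Both values are hypothetical, and `rules` and `subjects` are omitted for brevity.

```yaml
# Rendered sketch under the assumptions stated above (names are hypothetical).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dynamo-platform-dynamo-operator-team-a-queue-reader
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dynamo-platform-dynamo-operator-team-a-queue-reader-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  # Must match the ClusterRole name above, which is why all three hunks change together.
  name: dynamo-platform-dynamo-operator-team-a-queue-reader
```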

--- a/deploy/cloud/helm/platform/templates/kai.yaml
+++ b/deploy/cloud/helm/platform/templates/kai.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+---
+{{- if .Capabilities.APIVersions.Has "scheduling.run.ai/v2" }}
+
+{{- /* Create parent queue first */ -}}
+{{- $defaultQueue := lookup "scheduling.run.ai/v2" "Queue" "" "dynamo-default" }}
+{{- if not $defaultQueue }}
+---
+apiVersion: scheduling.run.ai/v2
+kind: Queue
+metadata:
+  name: dynamo-default
+  annotations:
+    "helm.sh/hook": post-install,post-upgrade
+    "helm.sh/hook-weight": "100"
+    "helm.sh/hook-delete-policy": before-hook-creation
+spec:
+  resources:
+    cpu:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+    gpu:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+    memory:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+{{- end }}
+
+{{- /* Create child queue second */ -}}
+{{- $dynamoQueue := lookup "scheduling.run.ai/v2" "Queue" "" "dynamo" }}
+{{- if not $dynamoQueue }}
+---
+apiVersion: scheduling.run.ai/v2
+kind: Queue
+metadata:
+  name: dynamo
+  annotations:
+    "helm.sh/hook": post-install,post-upgrade
+    "helm.sh/hook-weight": "110"
+    "helm.sh/hook-delete-policy": before-hook-creation
+spec:
+  parentQueue: dynamo-default
+  resources:
+    cpu:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+    gpu:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+    memory:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+{{- end }}
+
+{{- end }}
\ No newline at end of file
--- a/deploy/cloud/helm/platform/values.yaml
+++ b/deploy/cloud/helm/platform/values.yaml
@@ -14,170 +14,290 @@
 # limitations under the License.
 # Used to generate top-level secrets (overridden by custom-values.yaml)

-# Subcharts
+# Subcharts configuration
+
+# Dynamo operator configuration
 dynamo-operator:
+  # -- Whether to enable the Dynamo Kubernetes operator deployment
  enabled: true
+
+  # -- NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port"
  natsAddr: ""
+
+  # -- etcd server address for operator state storage (leave empty to use the bundled etcd chart). Format: "http://hostname:port" or "https://hostname:port"
  etcdAddr: ""
+
+  # Namespace access controls for the operator
  namespaceRestriction:
+    # -- Whether to restrict operator to specific namespaces
    enabled: true
+    # -- Target namespace for operator deployment (leave empty for current namespace)
    targetNamespace:
+
+  # Controller manager configuration
  controllerManager:
+    # -- Node tolerations for controller manager pods
    tolerations: []
+
    manager:
+      # Container image configuration for the operator manager
      image:
+        # -- Official NVIDIA Dynamo operator image repository
        repository: "nvcr.io/nvidia/ai-dynamo/kubernetes-operator"
+        # -- Image tag (leave empty to use chart default)
        tag: ""
+        # -- Image pull policy - when to pull the image
        pullPolicy: IfNotPresent
+
+      # Command line arguments for the operator manager
      args:
+        # -- Health probe endpoint for Kubernetes health checks
        - --health-probe-bind-address=:8081
+        # -- Metrics endpoint for Prometheus scraping (localhost only for security)
        - --metrics-bind-address=127.0.0.1:8080
+
+  # -- Secrets for pulling private container images
  imagePullSecrets: []
+
+  # Core Dynamo platform configuration
  dynamo:
+    # -- How long to wait before forcefully terminating Grove instances
    groveTerminationDelay: 15m
+
+    # Internal utility images used by the platform
    internalImages:
+      # -- Debugger image for troubleshooting deployments
      debugger: python:3.12-slim
+
+    # -- Whether to enable restricted security contexts for enhanced security
    enableRestrictedSecurityContext: false
+
+    # Docker registry configuration for private repositories
    dockerRegistry:
+      # -- Whether to use Kubernetes secrets for registry authentication
      useKubernetesSecret: false
+      # -- Docker registry server URL
      server:
+      # -- Registry username
      username:
+      # -- Registry password (consider using existingSecretName instead)
      password:
+      # -- Name of existing Kubernetes secret containing registry credentials
      existingSecretName:
+      # -- Whether the registry uses HTTPS
      secure: true
+
+    # Ingress configuration for external access
    ingress:
+      # -- Whether to create ingress resources
      enabled: false
+      # -- Ingress class name (e.g., "nginx", "traefik")
      className:
+      # -- Secret name containing TLS certificates
      tlsSecretName: my-tls-secret
+
+    # Istio service mesh configuration
    istio:
+      # -- Whether to enable Istio integration
      enabled: false
+      # -- Istio gateway name for routing
      gateway:
+
+    # -- Host suffix for generated ingress hostnames
    ingressHostSuffix: ""
+
+    # -- Whether VirtualServices should support HTTPS routing
    virtualServiceSupportsHTTPS: false

+
+# Grove component - distributed inference orchestration
+grove:
+  # -- Whether to enable Grove for multi-node inference coordination. If enabled, the Grove operator is deployed cluster-wide
+  enabled: false
+
+# Kai Scheduler component - advanced workload scheduling
+kai-scheduler:
+  # -- Whether to enable Kai Scheduler for intelligent resource allocation. If enabled, the Kai Scheduler operator is deployed cluster-wide
+  enabled: false
+
+# etcd configuration - distributed key-value store for operator state
+# For complete configuration options, see: https://github.com/bitnami/charts/tree/main/bitnami/etcd
 etcd:
+  # -- Whether to enable etcd deployment. Disable to use an external etcd instance
  enabled: true
+
+  # Persistent storage configuration for etcd data
  persistence:
+    # Whether to enable persistent storage (recommended for production)
    enabled: true
    # Use the cluster default storage-class or override with a named class
    storageClass: null
+    # Size of persistent volume for etcd data
    size: 1Gi
+
+  # Pre-upgrade job configuration
  preUpgrade:
+    # Whether to run pre-upgrade validation jobs
    enabled: false
+
+  # Number of etcd replicas (1 for single-node, 3+ for HA)
  replicaCount: 1
-  # Explicitly remove authentication
+
+  # Authentication and authorization settings
+  # Explicitly remove authentication for simplified internal communication
  auth:
    rbac:
+      # Whether to create RBAC authentication (disabled for internal use)
      create: false

+  # Health check configuration
  readinessProbe:
+    # Whether to enable readiness probes (disabled to reduce startup complexity)
    enabled: false

  livenessProbe:
+    # Whether to enable liveness probes (disabled to reduce startup complexity)
    enabled: false

+  # Node tolerations for etcd pods (allow scheduling onto tainted nodes)
  tolerations: []

+# NATS configuration - messaging system for operator communication
+# For complete configuration options, see: https://github.com/nats-io/k8s/tree/main/helm/charts/nats
 nats:
+  # -- Whether to enable NATS deployment. Disable to use an external NATS instance
  enabled: true
-  # reference a common CA Certificate or Bundle in all nats config `tls` blocks and nats-box contexts
-  # note: `tls.verify` still must be set in the appropriate nats config `tls` blocks to require mTLS
+
+  # TLS Certificate Authority configuration for secure communication
+  # Reference a common CA Certificate or Bundle in all nats config `tls` blocks and nats-box contexts
+  # Note: `tls.verify` still must be set in the appropriate nats config `tls` blocks to require mTLS
  tlsCA:
+    # Whether to enable TLS CA configuration
    enabled: false

+  # Core NATS server configuration
  config:
+    # NATS clustering for high availability (multiple NATS servers)
    cluster:
+      # Whether to enable NATS clustering (disabled for single-node setups)
      enabled: false

-
+    # JetStream - persistent messaging and streaming capabilities
    jetstream:
+      # Whether to enable JetStream (recommended for persistent messaging)
      enabled: true

+      # File-based storage for JetStream streams and consumers
      fileStore:
+        # Whether to enable file storage (persistent across restarts)
        enabled: true
+        # Directory path for JetStream file storage
        dir: /data

        ############################################################
-        # stateful set -> volume claim templates -> jetstream pvc
+        # Persistent Volume Claim for JetStream file storage
        ############################################################
        pvc:
+          # Whether to create a PVC for JetStream storage
          enabled: true
+          # Size of the persistent volume for JetStream data
          size: 10Gi
+          # Storage class name (leave empty for default)
          storageClassName:

-          # merge or patch the jetstream pvc
+          # Advanced PVC configuration (merge additional fields)
          # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#persistentvolumeclaim-v1-core
          merge: {}
          patch: []
-          # defaults to "{{ include "nats.fullname" $ }}-js"
+          # PVC name (defaults to "{{ include "nats.fullname" $ }}-js")
          name:

-        # defaults to the PVC size
+        # Maximum size for JetStream file storage (defaults to PVC size)
        maxSize:

+      # Memory-based storage for JetStream (non-persistent)
      memoryStore:
+        # Whether to enable memory storage (faster but not persistent)
        enabled: false

-      # merge or patch the jetstream config
-      # https://docs.nats.io/running-a-nats-service/configuration#jetstream
+      # Advanced JetStream configuration
+      # For options see: https://docs.nats.io/running-a-nats-service/configuration#jetstream
      merge: {}
      patch: []

+    # Core NATS server settings
    nats:
+      # Port for NATS client connections
      port: 4222
+
+      # TLS configuration for encrypted connections
      tls:
+        # Whether to enable TLS encryption
        enabled: false
-        # merge or patch the tls config
-        # https://docs.nats.io/running-a-nats-service/configuration/securing_nats/tls
+        # Advanced TLS configuration
+        # For options see: https://docs.nats.io/running-a-nats-service/configuration/securing_nats/tls
        merge: {}
        patch: []

+    # Leaf nodes for creating NATS topologies and remote connections
    leafnodes:
+      # Whether to enable leaf node connections
      enabled: false

-
+    # WebSocket support for browser-based NATS clients
    websocket:
+      # Whether to enable WebSocket protocol support
      enabled: false

-
+    # MQTT protocol bridge for IoT device connectivity
    mqtt:
+      # Whether to enable MQTT protocol support
      enabled: false

-
+    # Gateway connections for multi-cluster NATS deployments
    gateway:
+      # Whether to enable gateway connections
      enabled: false

-
+    # HTTP monitoring endpoint for NATS server metrics
    monitor:
+      # Whether to enable HTTP monitoring interface
      enabled: true
+      # Port for monitoring HTTP endpoint
      port: 8222
+
+      # TLS configuration for monitoring endpoint
      tls:
-        # config.nats.tls must be enabled also
-        # when enabled, monitoring port will use HTTPS with the options from config.nats.tls
+        # Whether to enable HTTPS for monitoring (requires config.nats.tls enabled)
+        # When enabled, monitoring port will use HTTPS with the options from config.nats.tls
        enabled: false

+    # Go pprof profiling endpoint for performance debugging
    profiling:
+      # Whether to enable profiling endpoint (for debugging only)
      enabled: false
+      # Port for profiling endpoint
      port: 65432

+    # Account resolver for multi-tenant NATS deployments
    resolver:
+      # Whether to enable account resolution (for advanced multi-tenancy)
      enabled: false

-
-    # adds a prefix to the server name, which defaults to the pod name
-    # helpful for ensuring server name is unique in a super cluster
+    # Server naming configuration
+    # Adds a prefix to the server name, which defaults to the pod name
+    # Helpful for ensuring server name is unique in a super cluster
    serverNamePrefix: ""

-    # merge or patch the nats config
-    # https://docs.nats.io/running-a-nats-service/configuration
-    # following special rules apply
+    # Advanced NATS configuration merging and patching
+    # For complete options see: https://docs.nats.io/running-a-nats-service/configuration
+    # Special rules apply:
    #  1. strings that start with << and end with >> will be unquoted
    #     use this for variables and numbers with units
    #  2. keys ending in $include will be switched to include directives
    #     keys are sorted alphabetically, use prefix before $includes to control includes ordering
    #     paths should be relative to /etc/nats-config/nats.conf
-    # example:
-    #
+    # Example:
    #   merge:
    #     $include: ./my-config.conf
    #     zzz$include: ./my-config-last.conf
@@ -186,48 +306,48 @@ nats:
    #       token: << $TOKEN >>
    #     jetstream:
    #       max_memory_store: << 1GB >>
-    #
-    # will yield the config:
-    # {
-    #   include ./my-config.conf;
-    #   "authorization": {
-    #     "token": $TOKEN
-    #   },
-    #   "jetstream": {
-    #     "max_memory_store": 1GB
-    #   },
-    #   "server_name": "nats",
-    #   include ./my-config-last.conf;
-    # }
    merge: {}
    patch: []

  ############################################################
-  # stateful set -> pod template -> nats container
+  # NATS container configuration in StatefulSet
  ############################################################
  container:
+    # NATS server container image configuration
    image:
+      # Official NATS server repository
      repository: nats
+      # NATS server version (Alpine-based for smaller size)
      tag: 2.10.21-alpine
+      # Image pull policy (leave empty for chart default)
      pullPolicy:
+      # Custom registry URL (leave empty for Docker Hub)
      registry:

-    # container port options
-    # must be enabled in the config section also
+    # Container port configuration
+    # Note: Ports must also be enabled in the config section above
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#containerport-v1-core
    ports:
+      # Main NATS client connection port
      nats: {}
+      # Leaf node connection port
      leafnodes: {}
+      # WebSocket connection port
      websocket: {}
+      # MQTT protocol port
      mqtt: {}
+      # Cluster communication port
      cluster: {}
+      # Gateway connection port
      gateway: {}
+      # HTTP monitoring port
      monitor: {}
+      # Go profiling port
      profiling: {}

-    # map with key as env var name, value can be string or map
-    # example:
-    #
+    # Environment variables for the NATS container
+    # Map with key as env var name, value can be string or map
+    # Example:
    #   env:
    #     GOMEMLIMIT: 7GiB
    #     TOKEN:
@@ -237,211 +357,245 @@ nats:
    #           key: token
    env: {}

-    # merge or patch the container
+    # Advanced container configuration merging and patching
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#container-v1-core
    merge: {}
    patch: []

  ############################################################
-  # stateful set -> pod template -> reloader container
+  # Configuration reloader container for hot config updates
  ############################################################
  reloader:
+    # Whether to enable the config reloader sidecar container
    enabled: true
+
+    # Config reloader container image
    image:
+      # Official NATS config reloader repository
      repository: natsio/nats-server-config-reloader
+      # Config reloader version
      tag: 0.16.0
+      # Image pull policy (leave empty for chart default)
      pullPolicy:
+      # Custom registry URL (leave empty for Docker Hub)
      registry:

-    # env var map, see nats.env for an example
+    # Environment variables for the reloader container
    env: {}

-    # all nats container volume mounts with the following prefixes
-    # will be mounted into the reloader container
+    # Volume mount prefixes from NATS container to share with reloader
+    # All NATS container volume mounts with these prefixes will be mounted into the reloader
    natsVolumeMountPrefixes:
    - /etc/

-    # merge or patch the container
+    # Advanced reloader container configuration
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#container-v1-core
    merge: {}
    patch: []

  ############################################################
-  # stateful set -> pod template -> prom-exporter container
+  # Prometheus metrics exporter container (optional)
  ############################################################
-  # config.monitor must be enabled
+  # Note: config.monitor must be enabled for this to work
  promExporter:
+    # Whether to enable Prometheus metrics exporter sidecar
    enabled: false

-
  ############################################################
-  # service
+  # Kubernetes Service for NATS access
  ############################################################
  service:
+    # Whether to create a Kubernetes Service for NATS
    enabled: true

-    # service port options
-    # additional boolean field enable to control whether port is exposed in the service
-    # must be enabled in the config section also
+    # Service port configuration
+    # Additional boolean field 'enabled' controls whether port is exposed in the service
+    # Note: Ports must also be enabled in the config section above
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#serviceport-v1-core
    ports:
+      # Main NATS client connection port
      nats:
        enabled: true
+      # Leaf node connection port
      leafnodes:
        enabled: true
+      # WebSocket connection port
      websocket:
        enabled: true
+      # MQTT protocol port
      mqtt:
        enabled: true
+      # Cluster communication port (typically internal only)
      cluster:
        enabled: false
+      # Gateway connection port (typically internal only)
      gateway:
        enabled: false
+      # HTTP monitoring port (typically internal only)
      monitor:
        enabled: false
+      # Go profiling port (typically internal only)
      profiling:
        enabled: false

-    # merge or patch the service
+    # Advanced service configuration
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#service-v1-core
    merge: {}
    patch: []
-    # defaults to "{{ include "nats.fullname" $ }}"
+    # Service name (defaults to "{{ include "nats.fullname" $ }}")
    name:

  ############################################################
-  # other nats extension points
+  # Advanced NATS Kubernetes resource configuration
  ############################################################

-  # stateful set
+  # StatefulSet configuration for NATS server persistence
  statefulSet:
-    # merge or patch the stateful set
+    # Advanced StatefulSet configuration merging and patching
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#statefulset-v1-apps
    merge: {}
    patch: []
-    # defaults to "{{ include "nats.fullname" $ }}"
+    # StatefulSet name (defaults to "{{ include "nats.fullname" $ }}")
    name:

-  # stateful set -> pod template
+  # Pod template configuration for NATS StatefulSet
  podTemplate:
-    # adds a hash of the ConfigMap as a pod annotation
-    # this will cause the StatefulSet to roll when the ConfigMap is updated
+    # Whether to add a hash of the ConfigMap as a pod annotation
+    # This will cause the StatefulSet to roll when the ConfigMap is updated
    configChecksumAnnotation: true

-    # map of topologyKey: topologySpreadConstraint
-    # labelSelector will be added to match StatefulSet pods
-    #
+    # Pod topology spread constraints for better distribution across nodes
+    # Map of topologyKey: topologySpreadConstraint
+    # labelSelector will be added automatically to match StatefulSet pods
+    # Example:
    #   topologySpreadConstraints:
    #     kubernetes.io/hostname:
    #       maxSkew: 1
-    #
    topologySpreadConstraints: {}

-    # merge or patch the pod template
+    # Advanced pod template configuration
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#pod-v1-core
    merge:
      spec:
+        # Tolerations for NATS pods (allow scheduling onto tainted nodes)
        tolerations: []
    patch: []

-  # headless service
+  # Headless service for StatefulSet pod discovery
  headlessService:
-    # merge or patch the headless service
+    # Advanced headless service configuration
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#service-v1-core
    merge: {}
    patch: []
-    # defaults to "{{ include "nats.fullname" $ }}-headless"
+    # Headless service name (defaults to "{{ include "nats.fullname" $ }}-headless")
    name:

-  # config map
+  # ConfigMap for NATS server configuration
  configMap:
-    # merge or patch the config map
+    # Advanced ConfigMap configuration
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#configmap-v1-core
    merge: {}
    patch: []
-    # defaults to "{{ include "nats.fullname" $ }}-config"
+    # ConfigMap name (defaults to "{{ include "nats.fullname" $ }}-config")
    name:

-  # pod disruption budget
+  # Pod Disruption Budget to limit voluntary disruptions (e.g. node drains)
  podDisruptionBudget:
+    # Whether to create a PodDisruptionBudget (recommended for production)
    enabled: true

-
-  # service account
+  # Service Account for NATS server pods
  serviceAccount:
+    # Whether to create and use a dedicated service account
    enabled: false

-
-
  ############################################################
-  # natsBox
-  #
-  # NATS Box Deployment and associated resources
+  # NATS Box - CLI tools and debugging container
+  # NATS Box provides CLI tools for interacting with NATS server
  ############################################################
  natsBox:
+    # Whether to deploy NATS Box for CLI access and debugging
    enabled: true

    ############################################################
-    # NATS contexts
+    # NATS client contexts for authentication and connection
    ############################################################
    contexts:
+      # Default context configuration
      default:
+        # Credentials-based authentication
        creds:
-          # set contents in order to create a secret with the creds file contents
+          # Inline credentials file contents (base64 encoded)
          contents:
-          # set secretName in order to mount an existing secret to dir
+          # Name of existing secret containing credentials file
          secretName:
-          # defaults to /etc/nats-creds/<context-name>
+          # Directory to mount credentials (defaults to /etc/nats-creds/<context-name>)
          dir:
+          # Key name in secret for credentials file
          key: nats.creds
+
+        # NKey-based authentication (public/private key pairs)
        nkey:
-          # set contents in order to create a secret with the nkey file contents
+          # Inline NKey file contents (base64 encoded)
          contents:
-          # set secretName in order to mount an existing secret to dir
+          # Name of existing secret containing NKey file
          secretName:
-          # defaults to /etc/nats-nkeys/<context-name>
+          # Directory to mount NKey (defaults to /etc/nats-nkeys/<context-name>)
          dir:
+          # Key name in secret for NKey file
          key: nats.nk
-        # used to connect with client certificates
+
+        # TLS client certificate authentication
        tls:
-          # set secretName in order to mount an existing secret to dir
+          # Name of existing secret containing TLS client certificates
          secretName:
-          # defaults to /etc/nats-certs/<context-name>
+          # Directory to mount certificates (defaults to /etc/nats-certs/<context-name>)
          dir:
+          # Certificate file name in secret
          cert: tls.crt
+          # Private key file name in secret
          key: tls.key

-        # merge or patch the context
-        # https://docs.nats.io/using-nats/nats-tools/nats_cli#nats-contexts
+        # Advanced context configuration
+        # For options see: https://docs.nats.io/using-nats/nats-tools/nats_cli#nats-contexts
        merge: {}
        patch: []

-    # name of context to select by default
+    # Name of context to select by default for NATS CLI operations
    defaultContextName: default

    ############################################################
-    # deployment -> pod template -> nats-box container
+    # NATS Box container configuration
    ############################################################
    container:
+      # NATS Box container image
      image:
+        # Official NATS Box repository with CLI tools
        repository: natsio/nats-box
+        # NATS Box version
        tag: 0.14.5
+        # Image pull policy (leave empty for chart default)
        pullPolicy:
+        # Custom registry URL (leave empty for Docker Hub)
        registry:

-      # env var map, see nats.env for an example
+      # Environment variables for NATS Box container
      env: {}

-      # merge or patch the container
+      # Advanced container configuration
      # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#container-v1-core
      merge: {}
      patch: []
-    # service account
+
+    # Service Account for NATS Box deployment
    serviceAccount:
+      # Whether to create and use a dedicated service account for NATS Box
      enabled: false

+    # Pod template configuration for NATS Box deployment
    podTemplate:
      merge:
        spec:
+          # Node tolerations for NATS Box pods
          tolerations: []
      patch: []
--- a/deploy/cloud/operator/Makefile
+++ b/deploy/cloud/operator/Makefile
@@ -57,7 +57,7 @@ ensure-yq:
 	fi

 .PHONY: manifests
-manifests: controller-gen ensure-yq ## Generate WebhookConfiguration, ClusterRole and CustomResourceDefinition objects.
+manifests: controller-gen ensure-yq generate-api-docs ## Generate WebhookConfiguration, ClusterRole and CustomResourceDefinition objects.
 	# Use a large maxDescLen to ensure all field comments are included as OpenAPI descriptions
 	$(CONTROLLER_GEN) rbac:roleName=manager-role crd:maxDescLen=100000 webhook paths="./..." output:crd:artifacts:config=config/crd/bases
 	echo "Removing name from mainContainer required fields"
@@ -266,6 +266,27 @@ $(HELMIFY): $(LOCALBIN)
 helm: manifests kustomize helmify
 	$(KUSTOMIZE) build config/default | $(HELMIFY) -image-pull-secrets charts/dynamo-kubernetes-operator

+######################### CRD Reference Docs
+CRD_REF_DOCS_VERSION ?= v0.0.12
+CRD_REF_DOCS ?= $(LOCALBIN)/crd-ref-docs
+
+.PHONY: crd-ref-docs
+crd-ref-docs: $(CRD_REF_DOCS) ## Download crd-ref-docs locally if necessary.
+$(CRD_REF_DOCS): $(LOCALBIN)
+	@echo "Installing crd-ref-docs $(CRD_REF_DOCS_VERSION)"
+	@GOBIN=$(LOCALBIN) go install github.com/elastic/crd-ref-docs@$(CRD_REF_DOCS_VERSION)
+	@echo "✅ crd-ref-docs $(CRD_REF_DOCS_VERSION) installed successfully"
+
+.PHONY: generate-api-docs
+generate-api-docs: crd-ref-docs ## Generate API reference documentation from CRDs
+	@echo "📚 Generating CRD API reference documentation..."
+	@mkdir -p docs
+	@$(CRD_REF_DOCS) \
+		--source-path=api \
+		--config=docs/crd-ref-docs-config.yaml \
+		--renderer=markdown \
+		--output-path=docs/api_reference.md
+	@echo "✅ Generated API reference at docs/api_reference.md"

 .PHONY: coverage
 coverage: test

--- a/deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go
+++ b/deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go
@@ -67,13 +67,13 @@ type DynamoComponentDeploymentSharedSpec struct {
 	// Labels to add to generated Kubernetes resources for this component.
 	Labels map[string]string `json:"labels,omitempty"`

-	// contains the name of the component
+	// The name of the component
 	ServiceName string `json:"serviceName,omitempty"`

 	// ComponentType indicates the role of this component (for example, "main").
 	ComponentType string `json:"componentType,omitempty"`

-	// dynamo namespace of the service (allows to override the dynamo namespace of the service defined in annotations inside the dynamo archive)
+	// Dynamo namespace of the service (overrides the Dynamo namespace defined in annotations inside the Dynamo archive)
 	DynamoNamespace *string `json:"dynamoNamespace,omitempty"`

 	// Resources requested and limits for this component, including CPU, memory,
@@ -99,8 +99,9 @@ type DynamoComponentDeploymentSharedSpec struct {
 	// ExtraPodMetadata adds labels/annotations to the created Pods.
 	ExtraPodMetadata *dynamoCommon.ExtraPodMetadata `json:"extraPodMetadata,omitempty"`
 	// +optional
-	// ExtraPodSpec merges additional fields into the generated PodSpec for advanced
-	// customization (tolerations, node selectors, affinity, etc.).
+	// ExtraPodSpec allows overriding the main pod spec configuration.
+	// It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field
+	// that allows overriding the main container configuration.
 	ExtraPodSpec *dynamoCommon.ExtraPodSpec `json:"extraPodSpec,omitempty"`

 	// LivenessProbe to detect and restart unhealthy containers.

--- a/deploy/cloud/operator/api/v1alpha1/groupversion_info.go
+++ b/deploy/cloud/operator/api/v1alpha1/groupversion_info.go
@@ -17,7 +17,7 @@
 * Modifications Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES
 */

-// Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group
+// Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group.
 // +kubebuilder:object:generate=true
 // +groupName=nvidia.com
 package v1alpha1

--- a/deploy/cloud/operator/docs/api_reference.md
+++ b/deploy/cloud/operator/docs/api_reference.md
+# API Reference
+
+## Packages
+- [nvidia.com/v1alpha1](#nvidiacomv1alpha1)
+
+
+## nvidia.com/v1alpha1
+
+Package v1alpha1 contains API Schema definitions for the nvidia.com v1alpha1 API group.
+
+### Resource Types
+- [DynamoComponentDeployment](#dynamocomponentdeployment)
+- [DynamoGraphDeployment](#dynamographdeployment)
+
+
+
+#### Autoscaling
+
+
+
+
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `enabled` _boolean_ |  |  |  |
+| `minReplicas` _integer_ |  |  |  |
+| `maxReplicas` _integer_ |  |  |  |
+| `behavior` _[HorizontalPodAutoscalerBehavior](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#horizontalpodautoscalerbehavior-v2-autoscaling)_ |  |  |  |
+| `metrics` _[MetricSpec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#metricspec-v2-autoscaling) array_ |  |  |  |
+
+
+
+
+#### DynamoComponentDeployment
+
+
+
+DynamoComponentDeployment is the Schema for the dynamocomponentdeployments API
+
+
+
+
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
+| `kind` _string_ | `DynamoComponentDeployment` | | |
+| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. |  |  |
+| `spec` _[DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)_ | Spec defines the desired state for this Dynamo component deployment. |  |  |
+
+
+#### DynamoComponentDeploymentSharedSpec
+
+
+
+
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `annotations` _object (keys:string, values:string)_ | Annotations to add to generated Kubernetes resources for this component<br />(such as Pod, Service, and Ingress when applicable). |  |  |
+| `labels` _object (keys:string, values:string)_ | Labels to add to generated Kubernetes resources for this component. |  |  |
+| `serviceName` _string_ | The name of the component |  |  |
+| `componentType` _string_ | ComponentType indicates the role of this component (for example, "main"). |  |  |
+| `dynamoNamespace` _string_ | Dynamo namespace of the service (overrides the Dynamo namespace defined in annotations inside the Dynamo archive) |  |  |
+| `resources` _[Resources](#resources)_ | Resources requested and limits for this component, including CPU, memory,<br />GPUs/devices, and any runtime-specific resources. |  |  |
+| `autoscaling` _[Autoscaling](#autoscaling)_ | Autoscaling config for this component (replica range, target utilization, etc.). |  |  |
+| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs defines additional environment variables to inject into the component containers. |  |  |
+| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br />environment variables in the component containers. |  |  |
+| `pvc` _[PVC](#pvc)_ | PVC config describing volumes to be mounted by the component. |  |  |
+| `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). |  |  |
+| `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). |  |  |
+| `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. |  |  |
+| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows overriding the main pod spec configuration.<br />It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field<br />that allows overriding the main container configuration. |  |  |
+| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. |  |  |
+| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. |  |  |
+| `replicas` _integer_ | Replicas is the desired number of Pods for this component when autoscaling is not used. |  |  |
+| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. |  |  |
+
+
+#### DynamoComponentDeploymentSpec
+
+
+
+DynamoComponentDeploymentSpec defines the desired state of DynamoComponentDeployment
+
+
+
+_Appears in:_
+- [DynamoComponentDeployment](#dynamocomponentdeployment)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `dynamoComponent` _string_ | DynamoComponent selects the Dynamo component from the archive to deploy.<br />Typically corresponds to a component defined in the packaged Dynamo artifacts. |  |  |
+| `dynamoTag` _string_ | contains the tag of the DynamoComponent: for example, "my_package:MyService" |  |  |
+| `backendFramework` _string_ | BackendFramework specifies the backend framework (e.g., "sglang", "vllm", "trtllm") |  | Enum: [sglang vllm trtllm] <br /> |
+| `annotations` _object (keys:string, values:string)_ | Annotations to add to generated Kubernetes resources for this component<br />(such as Pod, Service, and Ingress when applicable). |  |  |
+| `labels` _object (keys:string, values:string)_ | Labels to add to generated Kubernetes resources for this component. |  |  |
+| `serviceName` _string_ | The name of the component |  |  |
+| `componentType` _string_ | ComponentType indicates the role of this component (for example, "main"). |  |  |
+| `dynamoNamespace` _string_ | Dynamo namespace of the service (overrides the Dynamo namespace defined in annotations inside the Dynamo archive) |  |  |
+| `resources` _[Resources](#resources)_ | Resources requested and limits for this component, including CPU, memory,<br />GPUs/devices, and any runtime-specific resources. |  |  |
+| `autoscaling` _[Autoscaling](#autoscaling)_ | Autoscaling config for this component (replica range, target utilization, etc.). |  |  |
+| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs defines additional environment variables to inject into the component containers. |  |  |
+| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br />environment variables in the component containers. |  |  |
+| `pvc` _[PVC](#pvc)_ | PVC config describing volumes to be mounted by the component. |  |  |
+| `ingress` _[IngressSpec](#ingressspec)_ | Ingress config to expose the component outside the cluster (or through a service mesh). |  |  |
+| `sharedMemory` _[SharedMemorySpec](#sharedmemoryspec)_ | SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). |  |  |
+| `extraPodMetadata` _[ExtraPodMetadata](#extrapodmetadata)_ | ExtraPodMetadata adds labels/annotations to the created Pods. |  |  |
+| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows overriding the main pod spec configuration.<br />It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field<br />that allows overriding the main container configuration. |  |  |
+| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. |  |  |
+| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. |  |  |
+| `replicas` _integer_ | Replicas is the desired number of Pods for this component when autoscaling is not used. |  |  |
+| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. |  |  |
+
+
+#### DynamoGraphDeployment
+
+
+
+DynamoGraphDeployment is the Schema for the dynamographdeployments API.
+
+
+
+
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `apiVersion` _string_ | `nvidia.com/v1alpha1` | | |
+| `kind` _string_ | `DynamoGraphDeployment` | | |
+| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. |  |  |
+| `spec` _[DynamoGraphDeploymentSpec](#dynamographdeploymentspec)_ | Spec defines the desired state for this graph deployment. |  |  |
+| `status` _[DynamoGraphDeploymentStatus](#dynamographdeploymentstatus)_ | Status reflects the current observed state of this graph deployment. |  |  |
+
+
+#### DynamoGraphDeploymentSpec
+
+
+
+DynamoGraphDeploymentSpec defines the desired state of DynamoGraphDeployment.
+
+
+
+_Appears in:_
+- [DynamoGraphDeployment](#dynamographdeployment)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `dynamoGraph` _string_ | DynamoGraph selects the graph (workflow/topology) to deploy. This must match<br />a graph name packaged with the Dynamo archive. |  |  |
+| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs are environment variables applied to all services in the graph unless<br />overridden by service-specific configuration. |  | Optional: {} <br /> |
+| `backendFramework` _string_ | BackendFramework specifies the backend framework (e.g., "sglang", "vllm", "trtllm"). |  | Enum: [sglang vllm trtllm] <br /> |
+
+
+#### DynamoGraphDeploymentStatus
+
+
+
+DynamoGraphDeploymentStatus defines the observed state of DynamoGraphDeployment.
+
+
+
+_Appears in:_
+- [DynamoGraphDeployment](#dynamographdeployment)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `state` _string_ | State is a high-level textual status of the graph deployment lifecycle. |  |  |
+| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#condition-v1-meta) array_ | Conditions contains the latest observed conditions of the graph deployment.<br />The slice is merged by type on patch updates. |  |  |
+
+
+#### IngressSpec
+
+
+
+
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `enabled` _boolean_ | Enabled exposes the component through an ingress or virtual service when true. |  |  |
+| `host` _string_ | Host is the base host name to route external traffic to this component. |  |  |
+| `useVirtualService` _boolean_ | UseVirtualService indicates whether to configure a service-mesh VirtualService instead of a standard Ingress. |  |  |
+| `virtualServiceGateway` _string_ | VirtualServiceGateway optionally specifies the gateway name to attach the VirtualService to. |  |  |
+| `hostPrefix` _string_ | HostPrefix is an optional prefix added before the host. |  |  |
+| `annotations` _object (keys:string, values:string)_ | Annotations to set on the generated Ingress/VirtualService resources. |  |  |
+| `labels` _object (keys:string, values:string)_ | Labels to set on the generated Ingress/VirtualService resources. |  |  |
+| `tls` _[IngressTLSSpec](#ingresstlsspec)_ | TLS holds the TLS configuration used by the Ingress/VirtualService. |  |  |
+| `hostSuffix` _string_ | HostSuffix is an optional suffix appended after the host. |  |  |
+| `ingressControllerClassName` _string_ | IngressControllerClassName selects the ingress controller class (e.g., "nginx"). |  |  |
+
+
+#### IngressTLSSpec
+
+
+
+
+
+
+
+_Appears in:_
+- [IngressSpec](#ingressspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `secretName` _string_ | SecretName is the name of a Kubernetes Secret containing the TLS certificate and key. |  |  |
+
+
+#### MultinodeSpec
+
+
+
+
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `nodeCount` _integer_ | Indicates the number of nodes to deploy for multinode components.<br />Total number of GPUs is `nodeCount` * GPU limit.<br />Must be greater than 1. | 2 | Minimum: 2 <br /> |
+
+
+#### PVC
+
+
+
+
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `create` _boolean_ | Create indicates to create a new PVC |  |  |
+| `name` _string_ | Name is the name of the PVC |  |  |
+| `storageClass` _string_ | StorageClass to be used for PVC creation. Leave it as empty if the PVC is already created. |  |  |
+| `size` _[Quantity](#quantity)_ | Size of the NIM cache in Gi, used during PVC creation |  |  |
+| `volumeAccessMode` _[PersistentVolumeAccessMode](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#persistentvolumeaccessmode-v1-core)_ | VolumeAccessMode is the volume access mode of the PVC |  |  |
+| `mountPoint` _string_ |  |  |  |
+
+
+#### SharedMemorySpec
+
+
+
+
+
+
+
+_Appears in:_
+- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
+- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `disabled` _boolean_ |  |  |  |
+| `size` _[Quantity](#quantity)_ |  |  |  |
+
+
--- a/deploy/cloud/operator/docs/crd-ref-docs-config.yaml
+++ b/deploy/cloud/operator/docs/crd-ref-docs-config.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Configuration file for crd-ref-docs
+# https://github.com/elastic/crd-ref-docs
+
+processor:
+  # Ignore common metadata fields that are not user-configurable
+  ignoreFields:
+    - "metadata.creationTimestamp"
+    - "metadata.generation"
+    - "metadata.resourceVersion"
+    - "metadata.selfLink"
+    - "metadata.uid"
+    - "status.conditions[*].lastTransitionTime"
+    - "status.observedGeneration"
+    - "TypeMeta$"
+  ignoreTypes:
+    - "List$"
+    - "ParseError$"
+    # Ignore only the override wrapper type to reduce repetition
+    # Keep SharedSpec so embedded fields are documented once
+    - "DynamoComponentDeploymentOverridesSpec$"
+    - "DynamoComponentDeploymentStatus$"
+    - "BaseStatus$"
+
+render:
+  # Output format - use markdown instead of default asciidoc
+  format: markdown
+
+  # Kubernetes version for API compatibility info
+  kubernetesVersion: "1.28"
+
+  # Group related resources together
+  groupByKind: true
+
+  # Include resource descriptions
+  includeDescription: true
+
+  # Reduce repetition in links and references
+  allowDangerousTypes: false
+
+  # Sort types alphabetically for better organization
+  sortByName: true
--- a/docs/Makefile
+++ b/docs/Makefile
@@ -23,12 +23,68 @@ SPHINXBUILD       ?= sphinx-build
 SOURCEDIR          = .
 BUILDDIR           = build

+##@ General
+
 # Put it first so that "make" without argument is like "make help".
-help:
+help: ## Display help for all targets
 	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+	@echo ""
+	@echo "Additional documentation targets:"
+	@awk 'BEGIN {FS = ":.*##"; printf "  \033[36m%-20s\033[0m %s\n", "TARGET", "DESCRIPTION"} /^[a-zA-Z_0-9-]+:.*?##/ { printf "  \033[36m%-20s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) }' $(MAKEFILE_LIST)

-clean:
+clean: ## Clean build artifacts
 	@rm -fr ${BUILDDIR}
+
+##@ Helm Documentation
+
+## Location to install dependencies to
+LOCALBIN ?= $(shell pwd)/bin
+$(LOCALBIN):
+	mkdir -p $(LOCALBIN)
+
+## Tool Versions
+HELM_DOCS_VERSION ?= 1.14.2
+
+## Tool Binaries
+HELM_DOCS ?= $(LOCALBIN)/helm-docs-$(HELM_DOCS_VERSION)
+
+.PHONY: helm-docs-install
+helm-docs-install: $(HELM_DOCS) ## Download helm-docs locally if necessary
+$(HELM_DOCS): $(LOCALBIN)
+	@echo "📥 Downloading helm-docs $(HELM_DOCS_VERSION)..."
+	@ARCH=$$(uname -m | sed 's/x86_64/amd64/' | sed 's/aarch64/arm64/'); \
+	OS=$$(uname -s | tr '[:upper:]' '[:lower:]'); \
+	curl -sSL "https://github.com/norwoodj/helm-docs/releases/download/v$(HELM_DOCS_VERSION)/helm-docs_$(HELM_DOCS_VERSION)_$${OS}_$${ARCH}.tar.gz" | \
+	tar xz -C $(LOCALBIN) helm-docs && \
+	mv $(LOCALBIN)/helm-docs $(HELM_DOCS) && \
+	echo "✅ helm-docs $(HELM_DOCS_VERSION) installed successfully"
+
+.PHONY: generate-helm-docs
+generate-helm-docs: helm-docs-install ## Generate README.md for Helm charts from values.yaml
+	@echo "📚 Generating Helm chart documentation..."
+	@cd ../deploy/cloud/helm/platform && $(realpath $(HELM_DOCS)) \
+		--template-files=README.md.gotmpl \
+		--output-file=README.md \
+		--sort-values-order=file \
+		--chart-to-generate=. \
+		--ignore-non-descriptions
+	@echo "✅ Generated documentation at ../deploy/cloud/helm/platform/README.md"
+
+.PHONY: helm-docs-clean
+helm-docs-clean: ## Remove generated helm documentation
+	@echo "🧹 Cleaning generated helm documentation..."
+	@rm -f ../deploy/cloud/helm/platform/README.md
+	@echo "✅ Cleaned helm documentation"
+
+.PHONY: generate-crd-docs
+generate-crd-docs: ## Generate CRD API reference documentation
+	@echo "📚 Generating CRD API reference documentation..."
+	@cd ../deploy/cloud/operator && $(MAKE) generate-api-docs
+	@echo "✅ CRD API reference generated"
+
+.PHONY: docs-all
+docs-all: generate-helm-docs generate-crd-docs html ## Generate all documentation (Sphinx + Helm + CRDs)
+
 .PHONY: help Makefile clean



--- a/docs/guides/dynamo_deploy/README.md
+++ b/docs/guides/dynamo_deploy/README.md
@@ -59,6 +59,14 @@ It's a Kubernetes Custom Resource that defines your inference pipeline:

 The scripts in the `components/<backend>/launch` folder like `agg.sh` demonstrate how you can serve your models locally. The corresponding YAML files like `agg.yaml` show you how you could create a kubernetes deployment for your inference graph.

+## 📖 API Reference & Documentation
+
+For detailed technical specifications of Dynamo's Kubernetes resources:
+
+- **[API Reference](api-reference.md)** - Complete CRD field specifications for `DynamoGraphDeployment` and `DynamoComponentDeployment`
+- **[Operator Guide](dynamo_operator.md)** - Dynamo operator configuration and management
+- **[Create Deployment](create_deployment.md)** - Step-by-step deployment creation examples
+
 ### Choosing Your Architecture Pattern

 When creating a deployment, select the architecture pattern that best fits your use case:

--- a/docs/guides/dynamo_deploy/api-reference.md
+++ b/docs/guides/dynamo_deploy/api-reference.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Dynamo CRD API Reference
+
+For the complete technical API reference for Dynamo Custom Resource Definitions, see:
+
+**📖 [Dynamo CRD API Reference](../../../deploy/cloud/operator/docs/api_reference.md)**
--- a/docs/guides/dynamo_deploy/dynamo_cloud.md
+++ b/docs/guides/dynamo_deploy/dynamo_cloud.md
@@ -39,7 +39,7 @@ helm version             # v3.0+
 docker version           # Running daemon

 # Set your inference runtime image
-export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
+export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
 # Also available: sglang-runtime, tensorrtllm-runtime
 ```

@@ -53,7 +53,7 @@ Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidi
 ```bash
 # 1. Set environment
 export NAMESPACE=dynamo-kubernetes
-export RELEASE_VERSION=0.4.1 # any version of Dynamo 0.3.2+
+export RELEASE_VERSION=0.5.0 # any version of Dynamo 0.3.2+

 # 2. Install CRDs
 helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
@@ -65,6 +65,15 @@ helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$
 helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE}
 ```

+> [!TIP]
+> By default, Grove and KAI Scheduler are NOT installed. To enable them, add the following flags to the `helm install` command:
+
+```bash
+--set "grove.enabled=true"
+--set "kai-scheduler.enabled=true"
+```
+
+
 → [Verify Installation](#verify-installation)

 ## Path C: Custom Development
@@ -79,7 +88,7 @@ export NAMESPACE=dynamo-cloud
 export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo/  # or your registry
 export DOCKER_USERNAME='$oauthtoken'
 export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY>
-export IMAGE_TAG=0.4.1
+export IMAGE_TAG=0.5.0

 # 2. Build operator
 cd deploy/cloud/operator
@@ -176,6 +185,7 @@ kubectl create secret generic hf-token-secret \

 ## Advanced Options

+- [Helm Chart Configuration](../../../deploy/cloud/helm/platform/README.md)
 - [GKE-specific setup](gke_setup.md)
 - [Create custom deployments](create_deployment.md)
 - [Dynamo Operator details](dynamo_operator.md)
--- a/docs/guides/dynamo_deploy/dynamo_operator.md
+++ b/docs/guides/dynamo_deploy/dynamo_operator.md
@@ -23,50 +23,9 @@ Dynamo operator is a Kubernetes operator that simplifies the deployment, configu

 ## Custom Resource Definitions (CRDs)

-### CRD: `DynamoGraphDeployment`
+For the complete technical API reference for Dynamo Custom Resource Definitions, see:

-
-| Field            | Type   | Description                                                                                                                                          | Required | Default |
-|------------------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------|----------|---------|
-| `services`       | map    | Map of service names to runtime configurations. This allows the user to override the service configuration defined in the DynamoComponentDeployment. | Yes      |         |
-| `envs`           | list   | list of global environment variables.                                                                                                                | No       |         |
-
-
-**API Version:** `nvidia.com/v1alpha1`
-**Scope:** Namespaced
-
-#### Example
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: disagg
-spec:
-  envs:
-  - name: GLOBAL_ENV_VAR
-    value: some_global_value
-  services:
-    Frontend:
-      replicas: 1
-      envs:
-      - name: SPECIFIC_ENV_VAR
-        value: some_specific_value
-    Processor:
-      replicas: 1
-      envs:
-      - name: SPECIFIC_ENV_VAR
-        value: some_specific_value
-    VllmWorker:
-      replicas: 1
-      envs:
-      - name: SPECIFIC_ENV_VAR
-        value: some_specific_value
-    PrefillWorker:
-      replicas: 1
-      envs:
-      - name: SPECIFIC_ENV_VAR
-        value: some_specific_value
-```
+**📖 [Dynamo CRD API Reference](../../../deploy/cloud/operator/docs/api_reference.md)**

 ## Installation

@@ -151,25 +110,6 @@ export NAMESPACE=<namespace-with-the-dynamo-cloud-operator>
 kubectl get dynamographdeployment llm-agg -n $NAMESPACE
 ```

-
-## Reconciliation Logic
-
-### DynamoGraphDeployment
-
- **Actions:**
-  - Create a DynamoComponent CR to build the docker image
-  - Create a DynamoComponentDeployment CR for each component defined in the Dynamo graph being deployed
- **Status Management:**
-  - `.status.conditions`: Reflects readiness, failure, progress states
-  - `.status.state`: overall state of the deployment, based on the state of the DynamoComponentDeployments
-
-### DynamoComponentDeployment
-
- **Actions:**
-  - Create a Deployment, Service, and Ingress for the service
- **Status Management:**
-  - `.status.conditions`: Reflects readiness, failure, progress states
-
 ## Configuration



--- a/docs/guides/dynamo_deploy/grove.md
+++ b/docs/guides/dynamo_deploy/grove.md
@@ -87,10 +87,14 @@ Grove represents a significant advancement in Kubernetes-based orchestration for

 ## Getting Started

-> **Note**: Grove is currently in development and aligning with NVIDIA Dynamo's release schedule.
+Grove relies on KAI Scheduler for resource allocation and scheduling.
+
+For KAI Scheduler, see the [KAI Scheduler Deployment Guide](https://github.com/NVIDIA/KAI-Scheduler).

 For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md).

 For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](multinode-deployment.md), which demonstrates multi-node disaggregated serving scenarios.

 For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove).
+
+Dynamo Cloud also allows you to install Grove and KAI Scheduler as part of the platform installation. See the [Dynamo Cloud Deployment Guide](dynamo_cloud.md) for more details.
\ No newline at end of file