Unverified commit f83d9a53, authored by Julien Mancuso and committed by GitHub

fix: decouple Helm chart from runtime cluster state for helm template and GitOps compatibility (#6754)
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
parent 63070ddf
@@ -35,11 +35,11 @@ dependencies:
repository: "https://charts.bitnami.com/bitnami"
condition: global.etcd.install
- name: kai-scheduler
version: v0.9.4
version: v0.13.0-rc1
repository: oci://ghcr.io/nvidia/kai-scheduler
condition: kai-scheduler.enabled
condition: global.kai-scheduler.install
- name: grove-charts
alias: grove
version: v0.1.0-alpha.6
repository: oci://ghcr.io/ai-dynamo/grove
condition: grove.enabled
condition: global.grove.install
@@ -101,13 +101,17 @@ The chart includes built-in validation to prevent all operator conflicts:
| https://charts.bitnami.com/bitnami | etcd | 12.0.18 |
| https://nats-io.github.io/k8s/helm/charts/ | nats | 1.3.2 |
| oci://ghcr.io/ai-dynamo/grove | grove(grove-charts) | v0.1.0-alpha.6 |
| oci://ghcr.io/nvidia/kai-scheduler | kai-scheduler | v0.9.4 |
| oci://ghcr.io/nvidia/kai-scheduler | kai-scheduler | v0.13.0-rc1 |
## Values
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| global.etcd.install | bool | `false` | Whether this chart should install the bundled etcd subchart. When true, deploys etcd and auto-configures the operator with its address. When false, etcd is not deployed. Use dynamo-operator.etcdAddr to point at an external instance if you are bringing your own etcd. |
| global.kai-scheduler.install | bool | `false` | Whether this chart should install the bundled kai-scheduler subchart. When true, deploys kai-scheduler and its CRDs. Integration is automatically enabled. NOTE: For production environments, it is recommended to install kai-scheduler separately. |
| global.kai-scheduler.enabled | bool | `false` | Whether to enable Kai Scheduler integration (queue creation, schedulerName injection). Set to true when kai-scheduler is available in the cluster (installed externally). Automatically true when install=true. The operator uses this to decide whether to inject schedulerName and queue labels into pod templates. |
| global.grove.install | bool | `false` | Whether this chart should install the bundled Grove subchart. When true, deploys the Grove operator cluster-wide. Integration is automatically enabled. NOTE: For production environments, it is recommended to install Grove separately. |
| global.grove.enabled | bool | `false` | Whether to enable Grove integration (multinode orchestration via PodCliqueSets). Set to true when Grove is available in the cluster (installed externally). Automatically true when install=true. The operator uses this to decide whether to create PodCliqueSets for multinode deployments. |
| dynamo-operator.enabled | bool | `true` | Whether to enable the Dynamo Kubernetes operator deployment |
| dynamo-operator.upgradeCRD | bool | `true` | Whether to manage CRDs via a pre-install/pre-upgrade hook Job. The Job runs the operator image with the crd-apply tool to apply CRDs via server-side apply. |
| dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" |
@@ -170,10 +174,8 @@ The chart includes built-in validation to prevent all operator conflicts:
| dynamo-operator.checkpoint.storage.s3.credentialsSecretRef | string | `""` | Reference to a secret containing AWS credentials |
| dynamo-operator.checkpoint.storage.oci.uri | string | `""` | OCI URI in format: oci://registry/repository |
| dynamo-operator.checkpoint.storage.oci.credentialsSecretRef | string | `""` | Reference to a docker config secret for registry authentication |
| grove.enabled | bool | `false` | Whether to enable Grove for multi-node inference coordination, if enabled, the Grove operator will be deployed cluster-wide |
| grove.tolerations | list | `[]` | Node tolerations for Grove pods |
| grove.affinity | object | `{}` | Affinity for Grove pods |
| kai-scheduler.enabled | bool | `false` | Whether to enable Kai Scheduler for intelligent resource allocation, if enabled, the Kai Scheduler operator will be deployed cluster-wide |
| kai-scheduler.global.tolerations | list | `[]` | Node tolerations for kai-scheduler pods |
| kai-scheduler.global.affinity | object | `{}` | Affinity for kai-scheduler pods |
| etcd.image.repository | string | `"bitnamilegacy/etcd"` | Following the Bitnami brownout announcement (https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog), use the legacy repository until the chart migrates to the new "secure" repository. |
@@ -206,6 +208,38 @@ dynamo-operator:
For detailed etcd configuration options, please refer to the official Bitnami etcd Helm chart documentation:
**[etcd Helm Chart Documentation](https://github.com/bitnami/charts/tree/main/bitnami/etcd)**
### Kai Scheduler and Grove Configuration
For **production environments**, Kai Scheduler and Grove should be installed separately from this chart to allow independent lifecycle management, version pinning, and upgrade control.
**Compatibility Matrix:**
| dynamo-platform | kai-scheduler | Grove |
|-----------------|---------------|-------|
| 1.0.x | >= v0.13.0 | >= v0.1.0-alpha.6 |
After installing them separately, enable Dynamo integration:
```yaml
global:
kai-scheduler:
enabled: true # Enables queue creation and schedulerName injection
grove:
enabled: true # Enables multinode orchestration via PodCliqueSets
```
For **development/testing only**, you can deploy them as bundled subcharts:
```yaml
global:
kai-scheduler:
install: true # Deploys the bundled kai-scheduler subchart (integration auto-enabled)
grove:
install: true # Deploys the bundled Grove subchart (integration auto-enabled)
```
Note: `global.*.install` controls whether the bundled subcharts are deployed. When set, integration is automatically enabled. `global.*.enabled` can be set independently when using externally-managed installations.
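The two flags can also be mixed per component. As an illustrative sketch (the combination below is an assumption chosen for the example, not a recommendation taken from the chart), these values treat kai-scheduler as externally managed while deploying Grove from the bundled subchart:

```yaml
global:
  kai-scheduler:
    install: false  # kai-scheduler is installed and upgraded outside this chart
    enabled: true   # integration on: queue creation and schedulerName injection
  grove:
    install: true   # deploy the bundled Grove subchart (development/testing only)
    enabled: false  # may stay false; install=true enables integration automatically
```

With these values, integration should be active for both components, because either `enabled` or `install` being true is enough to turn it on.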
## 📚 Additional Resources
- [Dynamo Cloud Deployment Installation Guide](../../../../docs/kubernetes/installation-guide.md)
......
@@ -124,6 +124,37 @@ dynamo-operator:
For detailed etcd configuration options, please refer to the official Bitnami etcd Helm chart documentation:
**[etcd Helm Chart Documentation](https://github.com/bitnami/charts/tree/main/bitnami/etcd)**
### Kai Scheduler and Grove Configuration
For **production environments**, Kai Scheduler and Grove should be installed separately from this chart to allow independent lifecycle management, version pinning, and upgrade control.
**Compatibility Matrix:**
| dynamo-platform | kai-scheduler | Grove |
|-----------------|---------------|-------|
| 1.0.x | >= v0.13.0 | >= v0.1.0-alpha.6 |
After installing them separately, enable Dynamo integration:
```yaml
global:
kai-scheduler:
enabled: true # Enables queue creation and schedulerName injection
grove:
enabled: true # Enables multinode orchestration via PodCliqueSets
```
For **development/testing only**, you can deploy them as bundled subcharts:
```yaml
global:
kai-scheduler:
install: true # Deploys the bundled kai-scheduler subchart (integration auto-enabled)
grove:
install: true # Deploys the bundled Grove subchart (integration auto-enabled)
```
Note: `global.*.install` controls whether the bundled subcharts are deployed. When set, integration is automatically enabled. `global.*.enabled` can be set independently when using externally-managed installations.
## 📚 Additional Resources
......
@@ -12,7 +12,10 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
{{- if .Values.metricsService.enabled }}
# metricsService.enabled: null/true = create, false = skip.
# The Service is a core K8s resource (no CRD dependency), so null and true
# both result in creation. Only explicit false disables it.
{{- if not (eq (toString .Values.metricsService.enabled) "false") }}
---
apiVersion: v1
kind: Service
......
@@ -49,11 +49,14 @@ data:
{{- end }}
{{- end }}
{{- end }}
{{- if and .Values.dynamo.groveTerminationDelay (ne (.Values.dynamo.groveTerminationDelay | toString) "15m") }}
orchestrators:
kaiScheduler:
enabled: {{ or (and .Values.global (index .Values.global "kai-scheduler") (index .Values.global "kai-scheduler").enabled) (and .Values.global (index .Values.global "kai-scheduler") (index .Values.global "kai-scheduler").install) }}
grove:
enabled: {{ or (and .Values.global .Values.global.grove .Values.global.grove.enabled) (and .Values.global .Values.global.grove .Values.global.grove.install) }}
{{- if and .Values.dynamo.groveTerminationDelay (ne (.Values.dynamo.groveTerminationDelay | toString) "15m") }}
terminationDelay: {{ .Values.dynamo.groveTerminationDelay | quote }}
{{- end }}
{{- end }}
{{- $natsAddr := "" }}
{{- if .Values.natsAddr }}
{{- $natsAddr = .Values.natsAddr }}
......
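The `or`/`and` expressions in the hunk above reduce to "integration is on if either `enabled` or `install` is truthy". The following is a sketch of the rendered fragment, assuming `global.kai-scheduler.install=true`, every other `install`/`enabled` flag left at `false`, and `dynamo.groveTerminationDelay` unset or equal to `15m` (so no `terminationDelay` line is emitted). The nesting shown is inferred from the key names, since this diff view does not preserve indentation, and the comments are annotations rather than rendered output:

```yaml
orchestrators:
  kaiScheduler:
    enabled: true   # enabled=false, install=true  -> true
  grove:
    enabled: false  # enabled=false, install=false -> false
```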
@@ -12,7 +12,17 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
{{- if and .Values.metricsService.enabled (.Capabilities.APIVersions.Has "monitoring.coreos.com/v1") }}
# Tri-state for ServiceMonitor creation (metricsService.enabled):
# null = auto-detect via .Capabilities (works with helm install/upgrade)
# true = always create (use with helm template / GitOps)
# false = never create (opt-out even if prometheus-operator is installed)
{{- $serviceMonitorEnabled := false }}
{{- if kindIs "invalid" .Values.metricsService.enabled }}
{{- $serviceMonitorEnabled = (.Capabilities.APIVersions.Has "monitoring.coreos.com/v1") }}
{{- else }}
{{- $serviceMonitorEnabled = .Values.metricsService.enabled }}
{{- end }}
{{- if $serviceMonitorEnabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
......
@@ -12,7 +12,17 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
{{- if .Capabilities.APIVersions.Has "monitoring.coreos.com/v1" }}
# Tri-state for PodMonitor creation (dynamo.metrics.podMonitors.enabled):
# null = auto-detect via .Capabilities (works with helm install/upgrade)
# true = always create (use with helm template / GitOps)
# false = never create (opt-out even if prometheus-operator is installed)
{{- $podMonitorsEnabled := false }}
{{- if kindIs "invalid" .Values.dynamo.metrics.podMonitors.enabled }}
{{- $podMonitorsEnabled = (.Capabilities.APIVersions.Has "monitoring.coreos.com/v1") }}
{{- else }}
{{- $podMonitorsEnabled = .Values.dynamo.metrics.podMonitors.enabled }}
{{- end }}
{{- if $podMonitorsEnabled }}
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
......
@@ -120,6 +120,12 @@ dynamo:
metrics:
prometheusEndpoint: ""
podMonitors:
# -- Whether to create PodMonitor resources for Prometheus scraping of dynamo components.
# null=auto-detect (creates PodMonitors if prometheus-operator CRDs exist in cluster),
# true=always create (use for helm template / GitOps workflows),
# false=never create
enabled: null
mpiRun:
secretName: "mpi-run-ssh-secret"
@@ -129,8 +135,13 @@ dynamo:
#imagePullSecrets: []
kubernetesClusterDomain: cluster.local
metricsService:
enabled: true
# -- Whether to create the operator metrics Service and ServiceMonitor.
# null=auto-detect (Service always created; ServiceMonitor created if prometheus-operator CRDs exist),
# true=always create both (use for helm template / GitOps workflows),
# false=never create either
enabled: null
ports:
- name: https
port: 8443
......
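For `helm template` or GitOps renders, where `.Capabilities` does not reflect the target cluster, the tri-state flags can be pinned explicitly. The following is a minimal sketch of a parent-chart override file. It assumes the operator subchart's values are nested under the `dynamo-operator` key (as the other `dynamo-operator.*` keys in the README suggest) and that the target cluster already has the prometheus-operator CRDs installed. The file name is hypothetical:

```yaml
# values-gitops.yaml (hypothetical file name)
dynamo-operator:
  metricsService:
    enabled: true      # always render the metrics Service and ServiceMonitor
  dynamo:
    metrics:
      podMonitors:
        enabled: true  # always render PodMonitors, skipping .Capabilities detection
```

Setting either flag to `false` instead suppresses the corresponding resources. Leaving it at `null` keeps the auto-detect behavior under `helm install` and `helm upgrade`.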
@@ -12,12 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
{{- if .Capabilities.APIVersions.Has "scheduling.run.ai/v2" }}
{{- /* Create parent queue first */ -}}
{{- $defaultQueue := lookup "scheduling.run.ai/v2" "Queue" "" "dynamo-default" }}
{{- if not $defaultQueue }}
{{- if or (index .Values.global "kai-scheduler").enabled (index .Values.global "kai-scheduler").install }}
---
apiVersion: scheduling.run.ai/v2
kind: Queue
@@ -25,8 +20,8 @@ metadata:
name: dynamo-default
annotations:
"helm.sh/hook": post-install,post-upgrade
"helm.sh/hook-weight": "100"
"helm.sh/hook-delete-policy": before-hook-creation
"helm.sh/hook-weight": "100"
spec:
resources:
cpu:
@@ -41,11 +36,6 @@ spec:
quota: -1
limit: -1
overQuotaWeight: 1
{{- end }}
{{- /* Create child queue second */ -}}
{{- $dynamoQueue := lookup "scheduling.run.ai/v2" "Queue" "" "dynamo" }}
{{- if not $dynamoQueue }}
---
apiVersion: scheduling.run.ai/v2
kind: Queue
@@ -53,8 +43,8 @@ metadata:
name: dynamo
annotations:
"helm.sh/hook": post-install,post-upgrade
"helm.sh/hook-weight": "110"
"helm.sh/hook-delete-policy": before-hook-creation
"helm.sh/hook-weight": "110"
spec:
parentQueue: dynamo-default
resources:
@@ -70,6 +60,4 @@ spec:
quota: -1
limit: -1
overQuotaWeight: 1
{{- end }}
{{- end }}
\ No newline at end of file
@@ -21,6 +21,28 @@ global:
# When false, etcd is not deployed. Use dynamo-operator.etcdAddr to point at an external instance if you are bringing your own etcd.
install: false
kai-scheduler:
# -- Whether this chart should install the bundled kai-scheduler subchart.
# When true, deploys kai-scheduler and its CRDs. Integration is automatically enabled.
# NOTE: For production environments, it is recommended to install kai-scheduler separately.
install: false
# -- Whether to enable Kai Scheduler integration (queue creation, schedulerName injection).
# Set to true when kai-scheduler is available in the cluster (installed externally).
# Automatically enabled when install=true. The operator uses this to decide whether to
# inject schedulerName and queue labels into pod templates.
enabled: false
grove:
# -- Whether this chart should install the bundled Grove subchart.
# When true, deploys the Grove operator cluster-wide. Integration is automatically enabled.
# NOTE: For production environments, it is recommended to install Grove separately.
install: false
# -- Whether to enable Grove integration (multinode orchestration via PodCliqueSets).
# Set to true when Grove is available in the cluster (installed externally).
# Automatically true when install=true. The operator uses this to decide whether to
# create PodCliqueSets for multinode deployments.
enabled: false
# Subcharts configuration
# Dynamo operator configuration
@@ -255,18 +277,17 @@ dynamo-operator:
credentialsSecretRef: ""
# Grove component - distributed inference orchestration
# Installation is controlled by global.grove.install above.
grove:
# -- Whether to enable Grove for multi-node inference coordination, if enabled, the Grove operator will be deployed cluster-wide
enabled: false
# -- Node tolerations for Grove pods
tolerations: []
# -- Affinity for Grove pods
affinity: {}
# Kai Scheduler component - advanced workload scheduling
# Installation is controlled by global.kai-scheduler.install above.
# Integration is controlled by global.kai-scheduler.enabled above.
kai-scheduler:
# -- Whether to enable Kai Scheduler for intelligent resource allocation, if enabled, the Kai Scheduler operator will be deployed cluster-wide
enabled: false
# Global configuration for kai-scheduler (applies to all components including crd-upgrader)
global:
# -- Node tolerations for kai-scheduler pods
......
@@ -155,14 +155,31 @@ Found existing namespace-restricted Dynamo operators in namespaces: ...
> For multinode deployments, you need to install multinode orchestration components:
>
> **Option 1 (Recommended): Grove + KAI Scheduler**
> - Grove and KAI Scheduler can be installed manually or through the dynamo-platform helm install command.
> - When using the dynamo-platform helm install command, Grove and KAI Scheduler are NOT installed by default. You can enable their installation by setting the following flags:
>
> For production environments, Grove and KAI Scheduler should be installed **separately** from the dynamo-platform chart. This allows independent lifecycle management, version pinning, and upgrade control.
>
> **Compatibility Matrix:**
>
> | dynamo-platform | kai-scheduler | Grove |
> |-----------------|---------------|-------|
> | 1.0.x | >= v0.13.0 | >= v0.1.0-alpha.6 |
>
> After installing them separately, enable Dynamo integration:
>
> ```bash
> --set "grove.enabled=true"
> --set "kai-scheduler.enabled=true"
> --set "global.kai-scheduler.enabled=true"
> --set "global.grove.enabled=true"
> ```
>
> For **development/testing only**, you can install them as bundled subcharts:
>
> ```bash
> --set "global.grove.install=true"
> --set "global.kai-scheduler.install=true"
> ```
>
> Note: `global.kai-scheduler.install` / `global.grove.install` control whether the bundled subcharts are deployed. When set, integration is automatically enabled. `global.kai-scheduler.enabled` / `global.grove.enabled` can be set independently when using externally-managed installations.
>
> **Option 2: LeaderWorkerSet (LWS) + Volcano**
> - If using LWS for multinode deployments, you must also install Volcano (required dependency):
> - [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
......