"docs/backends/vscode:/vscode.git/clone" did not exist on "60ba7b258655f9387a7789d821e55f534e2f7f41"
Commit 7c620812 authored by Julien Mancuso, committed by GitHub

feat: install dynamo operator cluster-wide by default (#3199)


Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
parent 088295e0
...@@ -38,6 +38,48 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure ...@@ -38,6 +38,48 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure
- Sufficient cluster resources for your deployment scale - Sufficient cluster resources for your deployment scale
- Container registry access (if using private images) - Container registry access (if using private images)
## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment
### Single Cluster-Wide Operator (Recommended)
**By default, the Dynamo operator runs with cluster-wide permissions and should only be deployed ONCE per cluster.**
- ✅ **Recommended**: Deploy one cluster-wide operator per cluster
- ❌ **Not Recommended**: Multiple cluster-wide operators in the same cluster
### Multiple Namespace-Scoped Operators (Advanced)
If you need multiple operator instances (e.g., for multi-tenancy), use namespace-scoped deployment:
```yaml
# values.yaml
dynamo-operator:
namespaceRestriction:
enabled: true
targetNamespace: "my-tenant-namespace" # Optional, defaults to release namespace
```
### Validation and Safety
The chart includes built-in validation to prevent all operator conflicts:
- **Automatic Detection**: Scans for existing operators (both cluster-wide and namespace-restricted) during installation
- **Prevents Multiple Cluster-Wide**: Installation will fail if another cluster-wide operator exists
- **Prevents Mixed Deployments (Type 1)**: Installation will fail if trying to install namespace-restricted operator when cluster-wide exists
- **Prevents Mixed Deployments (Type 2)**: Installation will fail if trying to install cluster-wide operator when namespace-restricted operators exist
- **Safe Defaults**: Leader election uses shared ID for proper coordination
#### 🚫 **Blocked Conflict Scenarios**
| Existing Operator | New Operator | Status | Reason |
|-------------------|--------------|---------|--------|
| None | Cluster-wide | ✅ **Allowed** | No conflicts |
| None | Namespace-restricted | ✅ **Allowed** | No conflicts |
| Cluster-wide | Cluster-wide | ❌ **Blocked** | Multiple cluster managers |
| Cluster-wide | Namespace-restricted | ❌ **Blocked** | Cluster-wide already manages target namespace |
| Namespace-restricted | Cluster-wide | ❌ **Blocked** | Would conflict with existing namespace operators |
| Namespace-restricted A | Namespace-restricted B (diff ns) | ✅ **Allowed** | Different scopes |
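The conflict table above can be read as a small decision function. The following is a minimal Go sketch of that reading, not the chart's implementation (the chart enforces these rules in a Helm template). The `Scope` type, the `allowed` function, and the treatment of two namespace-restricted operators that target the same namespace are assumptions made for illustration; the table itself does not cover that last case.

```go
package main

import "fmt"

// Scope is a hypothetical type describing how an operator release is scoped.
type Scope int

const (
	None Scope = iota // no operator present
	ClusterWide
	NamespaceRestricted
)

// allowed mirrors the rows of the conflict table. incoming is expected to be
// ClusterWide or NamespaceRestricted; the namespaces are only compared when
// both operators are namespace-restricted.
func allowed(existing, incoming Scope, existingNS, incomingNS string) (bool, string) {
	switch {
	case existing == None:
		return true, "no conflicts"
	case existing == ClusterWide && incoming == ClusterWide:
		return false, "multiple cluster managers"
	case existing == ClusterWide:
		return false, "cluster-wide already manages target namespace"
	case incoming == ClusterWide:
		return false, "would conflict with existing namespace operators"
	case existingNS != incomingNS:
		return true, "different scopes"
	default:
		// Assumption: this case is not in the table; it is treated as a conflict here.
		return false, "same namespace already has an operator"
	}
}

func main() {
	ok, reason := allowed(ClusterWide, NamespaceRestricted, "", "tenant-a")
	fmt.Println(ok, reason)
	ok, reason = allowed(NamespaceRestricted, NamespaceRestricted, "tenant-a", "tenant-b")
	fmt.Println(ok, reason)
}
```

Keeping the reason string next to the verdict matches the table's "Reason" column and makes the function easy to check row by row.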
## 🔧 Configuration ## 🔧 Configuration
## Requirements ## Requirements
...@@ -58,11 +100,13 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure ...@@ -58,11 +100,13 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure
| dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" | | dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" |
| dynamo-operator.etcdAddr | string | `""` | etcd server address for operator state storage (leave empty to use the bundled etcd chart). Format: "http://hostname:port" or "https://hostname:port" | | dynamo-operator.etcdAddr | string | `""` | etcd server address for operator state storage (leave empty to use the bundled etcd chart). Format: "http://hostname:port" or "https://hostname:port" |
| dynamo-operator.modelExpressURL | string | `""` | URL for the Model Express server if not deployed by this helm chart. This is ignored if Model Express server is installed by this helm chart (global.model-express.enabled is true). | | dynamo-operator.modelExpressURL | string | `""` | URL for the Model Express server if not deployed by this helm chart. This is ignored if Model Express server is installed by this helm chart (global.model-express.enabled is true). |
| dynamo-operator.namespaceRestriction | object | `{"enabled":true,"targetNamespace":null}` | Namespace access controls for the operator | | dynamo-operator.namespaceRestriction | object | `{"enabled":false,"targetNamespace":null}` | Namespace access controls for the operator |
| dynamo-operator.namespaceRestriction.enabled | bool | `true` | Whether to restrict operator to specific namespaces | | dynamo-operator.namespaceRestriction.enabled | bool | `false` | Whether to restrict operator to specific namespaces. By default, the operator will run with cluster-wide permissions. Only 1 instance of the operator should be deployed in the cluster. If you want to deploy multiple operator instances, you can set this to true and specify the target namespace (by default, the target namespace is the helm release namespace). |
| dynamo-operator.namespaceRestriction.targetNamespace | string | `nil` | Target namespace for operator deployment (leave empty for current namespace) | | dynamo-operator.namespaceRestriction.targetNamespace | string | `nil` | Target namespace for operator deployment (leave empty for current namespace) |
| dynamo-operator.controllerManager.tolerations | list | `[]` | Node tolerations for controller manager pods | | dynamo-operator.controllerManager.tolerations | list | `[]` | Node tolerations for controller manager pods |
| dynamo-operator.controllerManager.affinity | list | `[]` | Affinity for controller manager pods | | dynamo-operator.controllerManager.affinity | list | `[]` | Affinity for controller manager pods |
| dynamo-operator.controllerManager.leaderElection.id | string | `""` | Leader election ID for cluster-wide coordination. WARNING: All cluster-wide operators must use the SAME ID to prevent split-brain. Different IDs would allow multiple leaders simultaneously. |
| dynamo-operator.controllerManager.leaderElection.namespace | string | `""` | Namespace for leader election leases (only used in cluster-wide mode). If empty, defaults to kube-system for cluster-wide coordination. All cluster-wide operators should use the SAME namespace for proper leader election. |
| dynamo-operator.controllerManager.manager.image.repository | string | `"nvcr.io/nvidia/ai-dynamo/kubernetes-operator"` | Official NVIDIA Dynamo operator image repository | | dynamo-operator.controllerManager.manager.image.repository | string | `"nvcr.io/nvidia/ai-dynamo/kubernetes-operator"` | Official NVIDIA Dynamo operator image repository |
| dynamo-operator.controllerManager.manager.image.tag | string | `""` | Image tag (leave empty to use chart default) | | dynamo-operator.controllerManager.manager.image.tag | string | `""` | Image tag (leave empty to use chart default) |
| dynamo-operator.controllerManager.manager.image.pullPolicy | string | `"IfNotPresent"` | Image pull policy - when to pull the image | | dynamo-operator.controllerManager.manager.image.pullPolicy | string | `"IfNotPresent"` | Image pull policy - when to pull the image |
......
...@@ -38,6 +38,48 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure ...@@ -38,6 +38,48 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Cloud infrastructure
- Sufficient cluster resources for your deployment scale - Sufficient cluster resources for your deployment scale
- Container registry access (if using private images) - Container registry access (if using private images)
## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment
### Single Cluster-Wide Operator (Recommended)
**By default, the Dynamo operator runs with cluster-wide permissions and should only be deployed ONCE per cluster.**
- ✅ **Recommended**: Deploy one cluster-wide operator per cluster
- ❌ **Not Recommended**: Multiple cluster-wide operators in the same cluster
### Multiple Namespace-Scoped Operators (Advanced)
If you need multiple operator instances (e.g., for multi-tenancy), use namespace-scoped deployment:
```yaml
# values.yaml
dynamo-operator:
namespaceRestriction:
enabled: true
targetNamespace: "my-tenant-namespace" # Optional, defaults to release namespace
```
### Validation and Safety
The chart includes built-in validation to prevent all operator conflicts:
- **Automatic Detection**: Scans for existing operators (both cluster-wide and namespace-restricted) during installation
- **Prevents Multiple Cluster-Wide**: Installation will fail if another cluster-wide operator exists
- **Prevents Mixed Deployments (Type 1)**: Installation will fail if trying to install namespace-restricted operator when cluster-wide exists
- **Prevents Mixed Deployments (Type 2)**: Installation will fail if trying to install cluster-wide operator when namespace-restricted operators exist
- **Safe Defaults**: Leader election uses shared ID for proper coordination
#### 🚫 **Blocked Conflict Scenarios**
| Existing Operator | New Operator | Status | Reason |
|-------------------|--------------|---------|--------|
| None | Cluster-wide | ✅ **Allowed** | No conflicts |
| None | Namespace-restricted | ✅ **Allowed** | No conflicts |
| Cluster-wide | Cluster-wide | ❌ **Blocked** | Multiple cluster managers |
| Cluster-wide | Namespace-restricted | ❌ **Blocked** | Cluster-wide already manages target namespace |
| Namespace-restricted | Cluster-wide | ❌ **Blocked** | Would conflict with existing namespace operators |
| Namespace-restricted A | Namespace-restricted B (diff ns) | ✅ **Allowed** | Different scopes |
## 🔧 Configuration ## 🔧 Configuration
{{ template "chart.requirementsSection" . }} {{ template "chart.requirementsSection" . }}
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
{{/*
Validation to prevent operator conflicts
Prevents all conflict scenarios:
1. Multiple cluster-wide operators (multiple cluster managers)
2. Namespace-restricted operator when cluster-wide exists (both would manage same resources)
3. Cluster-wide operator when namespace-restricted exist (both would manage same resources)
*/}}
{{- define "dynamo-operator.validateClusterWideInstallation" -}}
{{- $currentReleaseName := .Release.Name -}}
{{/* Check for existing namespace-restricted operators (only when installing cluster-wide) */}}
{{- if not .Values.namespaceRestriction.enabled -}}
{{- $allRoles := lookup "rbac.authorization.k8s.io/v1" "Role" "" "" -}}
{{- $namespaceRestrictedOperators := list -}}
{{- if $allRoles -}}
{{- range $role := $allRoles.items -}}
{{- if and (contains "-dynamo-operator-" $role.metadata.name) (hasSuffix "-manager-role" $role.metadata.name) -}}
{{- $namespaceRestrictedOperators = append $namespaceRestrictedOperators $role.metadata.namespace -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- if $namespaceRestrictedOperators -}}
{{- fail (printf "VALIDATION ERROR: Cannot install cluster-wide Dynamo operator. Found existing namespace-restricted Dynamo operators in namespaces: %s. This would create resource conflicts as both the cluster-wide operator and namespace-restricted operators would manage the same DGDs/DCDs. Either:\n1. Use one of the existing namespace-restricted operators for your specific namespace, or\n2. Uninstall all existing namespace-restricted operators first, or\n3. Install this operator in namespace-restricted mode: --set namespaceRestriction.enabled=true" (join ", " ($namespaceRestrictedOperators | uniq))) -}}
{{- end -}}
{{- end -}}
{{/* Check for existing ClusterRoles that would indicate other cluster-wide installations */}}
{{- $existingClusterRoles := lookup "rbac.authorization.k8s.io/v1" "ClusterRole" "" "" -}}
{{- $foundExistingClusterWideOperator := false -}}
{{- $existingOperatorRelease := "" -}}
{{- $existingOperatorRoleName := "" -}}
{{- $existingOperatorNamespace := "" -}}
{{- if $existingClusterRoles -}}
{{- range $cr := $existingClusterRoles.items -}}
{{- if and (contains "-dynamo-operator-" $cr.metadata.name) (hasSuffix "-manager-role" $cr.metadata.name) -}}
{{- $currentRoleName := printf "%s-dynamo-operator-manager-role" $currentReleaseName -}}
{{- if ne $cr.metadata.name $currentRoleName -}}
{{- $foundExistingClusterWideOperator = true -}}
{{- $existingOperatorRoleName = $cr.metadata.name -}}
{{- if $cr.metadata.labels -}}
{{- if $cr.metadata.labels.release -}}
{{- $existingOperatorRelease = $cr.metadata.labels.release -}}
{{- else if index $cr.metadata.labels "app.kubernetes.io/instance" -}}
{{- $existingOperatorRelease = index $cr.metadata.labels "app.kubernetes.io/instance" -}}
{{- end -}}
{{- end -}}
{{/* Find the namespace by looking at ClusterRoleBinding subjects */}}
{{- $clusterRoleBindings := lookup "rbac.authorization.k8s.io/v1" "ClusterRoleBinding" "" "" -}}
{{- if $clusterRoleBindings -}}
{{- range $crb := $clusterRoleBindings.items -}}
{{- if eq $crb.roleRef.name $cr.metadata.name -}}
{{- range $subject := $crb.subjects -}}
{{- if and (eq $subject.kind "ServiceAccount") $subject.namespace -}}
{{- $existingOperatorNamespace = $subject.namespace -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- if $foundExistingClusterWideOperator -}}
{{- $uninstallCmd := printf "helm uninstall %s" $existingOperatorRelease -}}
{{- if $existingOperatorNamespace -}}
{{- $uninstallCmd = printf "helm uninstall %s -n %s" $existingOperatorRelease $existingOperatorNamespace -}}
{{- end -}}
{{- if .Values.namespaceRestriction.enabled -}}
{{- if $existingOperatorNamespace -}}
{{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' in namespace '%s' (ClusterRole: %s). Cannot install namespace-restricted operator because the cluster-wide operator already manages resources in all namespaces, including the target namespace. This would create resource conflicts. Either:\n1. Use the existing cluster-wide operator, or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorNamespace $existingOperatorRoleName $uninstallCmd) -}}
{{- else -}}
{{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' (ClusterRole: %s). Cannot install namespace-restricted operator because the cluster-wide operator already manages resources in all namespaces, including the target namespace. This would create resource conflicts. Either:\n1. Use the existing cluster-wide operator, or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorRoleName $uninstallCmd) -}}
{{- end -}}
{{- else -}}
{{- if $existingOperatorNamespace -}}
{{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' in namespace '%s' (ClusterRole: %s). Only one cluster-wide Dynamo operator should be deployed per cluster. Either:\n1. Use the existing cluster-wide operator (no need to install another), or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorNamespace $existingOperatorRoleName $uninstallCmd) -}}
{{- else -}}
{{- fail (printf "VALIDATION ERROR: Found existing cluster-wide Dynamo operator from release '%s' (ClusterRole: %s). Only one cluster-wide Dynamo operator should be deployed per cluster. Either:\n1. Use the existing cluster-wide operator (no need to install another), or\n2. Uninstall the existing cluster-wide operator first: %s" $existingOperatorRelease $existingOperatorRoleName $uninstallCmd) -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{/* Additional validation for cluster-wide mode */}}
{{- if not .Values.namespaceRestriction.enabled -}}
{{/* Warn if using different leader election IDs */}}
{{- $leaderElectionId := default "dynamo.nvidia.com" .Values.controllerManager.leaderElection.id -}}
{{- if ne $leaderElectionId "dynamo.nvidia.com" -}}
{{- fail (printf "VALIDATION WARNING: Using custom leader election ID '%s' in cluster-wide mode. For proper coordination, all cluster-wide Dynamo operators should use the SAME leader election ID. Different IDs will allow multiple leaders simultaneously (split-brain scenario)." $leaderElectionId) -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{/*
Validation for configuration consistency
*/}}
{{- define "dynamo-operator.validateConfiguration" -}}
{{/* Validate leader election namespace setting */}}
{{- if and (not .Values.namespaceRestriction.enabled) .Values.controllerManager.leaderElection.namespace -}}
{{- if eq .Values.controllerManager.leaderElection.namespace .Release.Namespace -}}
{{- printf "\nWARNING: Leader election namespace is set to the same as release namespace (%s) in cluster-wide mode. This may prevent proper coordination between multiple releases. Consider using 'kube-system' or leaving empty for default.\n" .Release.Namespace | fail -}}
{{- end -}}
{{- end -}}
{{- end -}}
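The detection in `dynamo-operator.validateClusterWideInstallation` rests on a name check: a Role or ClusterRole counts as belonging to a Dynamo operator when its name contains `-dynamo-operator-` and ends in `-manager-role`, and the current release's own ClusterRole is skipped. A minimal Go sketch of that matching follows, assuming plain string names as input; the function names `isOperatorManagerRole` and `conflictingClusterRoles` are hypothetical and do not appear in the chart.

```go
package main

import (
	"fmt"
	"strings"
)

// isOperatorManagerRole applies the same two string checks as the template:
// the name contains "-dynamo-operator-" and has the suffix "-manager-role".
func isOperatorManagerRole(name string) bool {
	return strings.Contains(name, "-dynamo-operator-") &&
		strings.HasSuffix(name, "-manager-role")
}

// conflictingClusterRoles returns the matching role names that do not belong
// to the current release, mirroring the `ne $cr.metadata.name $currentRoleName`
// comparison in the template.
func conflictingClusterRoles(names []string, currentRelease string) []string {
	own := currentRelease + "-dynamo-operator-manager-role"
	var out []string
	for _, n := range names {
		if isOperatorManagerRole(n) && n != own {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// Example role names are made up for illustration.
	roles := []string{
		"dynamo-platform-dynamo-operator-manager-role",
		"other-release-dynamo-operator-manager-role",
		"dynamo-platform-dynamo-operator-leader-election-role",
	}
	fmt.Println(conflictingClusterRoles(roles, "dynamo-platform"))
}
```

Because the check is purely name-based, an unrelated role whose name happens to match both patterns would also be reported.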
...@@ -12,6 +12,11 @@ ...@@ -12,6 +12,11 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
{{/* Validate installation to prevent conflicts */}}
{{- include "dynamo-operator.validateClusterWideInstallation" . -}}
{{- include "dynamo-operator.validateConfiguration" . -}}
--- ---
apiVersion: apps/v1 apiVersion: apps/v1
kind: Deployment kind: Deployment
...@@ -76,7 +81,8 @@ spec: ...@@ -76,7 +81,8 @@ spec:
- --leader-elect=false - --leader-elect=false
{{- else }} {{- else }}
- --leader-elect - --leader-elect
- --leader-election-id=dynamo.nvidia.com - --leader-election-id={{ default "dynamo.nvidia.com" .Values.controllerManager.leaderElection.id }}
- --leader-election-namespace={{ default "kube-system" .Values.controllerManager.leaderElection.namespace }}
{{- end }} {{- end }}
{{- if .Values.natsAddr }} {{- if .Values.natsAddr }}
- --natsAddr={{ .Values.natsAddr }} - --natsAddr={{ .Values.natsAddr }}
......
...@@ -12,8 +12,14 @@ ...@@ -12,8 +12,14 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
{{/*
Only create leader election RBAC when leader election is enabled.
When namespaceRestriction.enabled=true, leader election is disabled (--leader-elect=false),
so these permissions are not needed.
*/}}
{{- if not .Values.namespaceRestriction.enabled }}
apiVersion: rbac.authorization.k8s.io/v1 apiVersion: rbac.authorization.k8s.io/v1
kind: Role kind: ClusterRole
metadata: metadata:
name: {{ include "dynamo-operator.fullname" . }}-leader-election-role name: {{ include "dynamo-operator.fullname" . }}-leader-election-role
labels: labels:
...@@ -55,7 +61,7 @@ rules: ...@@ -55,7 +61,7 @@ rules:
- patch - patch
--- ---
apiVersion: rbac.authorization.k8s.io/v1 apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding kind: ClusterRoleBinding
metadata: metadata:
name: {{ include "dynamo-operator.fullname" . }}-leader-election-rolebinding name: {{ include "dynamo-operator.fullname" . }}-leader-election-rolebinding
labels: labels:
...@@ -65,9 +71,10 @@ metadata: ...@@ -65,9 +71,10 @@ metadata:
{{- include "dynamo-operator.labels" . | nindent 4 }} {{- include "dynamo-operator.labels" . | nindent 4 }}
roleRef: roleRef:
apiGroup: rbac.authorization.k8s.io apiGroup: rbac.authorization.k8s.io
kind: Role kind: ClusterRole
name: '{{ include "dynamo-operator.fullname" . }}-leader-election-role' name: '{{ include "dynamo-operator.fullname" . }}-leader-election-role'
subjects: subjects:
- kind: ServiceAccount - kind: ServiceAccount
name: '{{ include "dynamo-operator.fullname" . }}-controller-manager' name: '{{ include "dynamo-operator.fullname" . }}-controller-manager'
namespace: '{{ .Release.Namespace }}' namespace: '{{ .Release.Namespace }}'
{{- end }}
\ No newline at end of file
...@@ -27,6 +27,19 @@ namespaceRestriction: ...@@ -27,6 +27,19 @@ namespaceRestriction:
targetNamespace: "" targetNamespace: ""
controllerManager: controllerManager:
tolerations: [] tolerations: []
# Leader election configuration
leaderElection:
# Leader election ID for cluster-wide coordination
# WARNING: All cluster-wide operators must use the SAME ID to prevent split-brain
# Different IDs would allow multiple leaders simultaneously
id: "" # If empty, defaults to: dynamo.nvidia.com (shared across all cluster-wide operators)
# Namespace for leader election leases (only used in cluster-wide mode)
# If empty, defaults to kube-system for cluster-wide coordination
# All cluster-wide operators should use the SAME namespace for proper leader election
namespace: ""
kubeRbacProxy: kubeRbacProxy:
args: args:
- --secure-listen-address=0.0.0.0:8443 - --secure-listen-address=0.0.0.0:8443
......
...@@ -31,8 +31,8 @@ dynamo-operator: ...@@ -31,8 +31,8 @@ dynamo-operator:
modelExpressURL: "" modelExpressURL: ""
# -- Namespace access controls for the operator # -- Namespace access controls for the operator
namespaceRestriction: namespaceRestriction:
# -- Whether to restrict operator to specific namespaces # -- Whether to restrict operator to specific namespaces. By default, the operator will run with cluster-wide permissions. Only 1 instance of the operator should be deployed in the cluster. If you want to deploy multiple operator instances, you can set this to true and specify the target namespace (by default, the target namespace is the helm release namespace).
enabled: true enabled: false
# -- Target namespace for operator deployment (leave empty for current namespace) # -- Target namespace for operator deployment (leave empty for current namespace)
targetNamespace: targetNamespace:
...@@ -44,6 +44,13 @@ dynamo-operator: ...@@ -44,6 +44,13 @@ dynamo-operator:
# -- Affinity for controller manager pods # -- Affinity for controller manager pods
affinity: [] affinity: []
# Leader election configuration for cluster-wide coordination
leaderElection:
# -- Leader election ID for cluster-wide coordination. WARNING: All cluster-wide operators must use the SAME ID to prevent split-brain. Different IDs would allow multiple leaders simultaneously.
id: "" # If empty, defaults to: dynamo.nvidia.com (shared across all cluster-wide operators)
# -- Namespace for leader election leases (only used in cluster-wide mode). If empty, defaults to kube-system for cluster-wide coordination. All cluster-wide operators should use the SAME namespace for proper leader election.
namespace: ""
manager: manager:
# Container image configuration for the operator manager # Container image configuration for the operator manager
image: image:
......
...@@ -124,6 +124,7 @@ func main() { ...@@ -124,6 +124,7 @@ func main() {
var enableHTTP2 bool var enableHTTP2 bool
var restrictedNamespace string var restrictedNamespace string
var leaderElectionID string var leaderElectionID string
var leaderElectionNamespace string
var natsAddr string var natsAddr string
var etcdAddr string var etcdAddr string
var istioVirtualServiceGateway string var istioVirtualServiceGateway string
...@@ -149,6 +150,9 @@ func main() { ...@@ -149,6 +150,9 @@ func main() {
"Enable resources filtering, only the resources belonging to the given namespace will be handled.") "Enable resources filtering, only the resources belonging to the given namespace will be handled.")
flag.StringVar(&leaderElectionID, "leader-election-id", "", "Leader election id"+ flag.StringVar(&leaderElectionID, "leader-election-id", "", "Leader election id"+
"Id to use for the leader election.") "Id to use for the leader election.")
flag.StringVar(&leaderElectionNamespace,
"leader-election-namespace", "",
"Namespace where the leader election resource will be created (default: same as operator namespace)")
flag.StringVar(&natsAddr, "natsAddr", "", "address of the NATS server") flag.StringVar(&natsAddr, "natsAddr", "", "address of the NATS server")
flag.StringVar(&etcdAddr, "etcdAddr", "", "address of the etcd server") flag.StringVar(&etcdAddr, "etcdAddr", "", "address of the etcd server")
flag.StringVar(&istioVirtualServiceGateway, "istio-virtual-service-gateway", "", flag.StringVar(&istioVirtualServiceGateway, "istio-virtual-service-gateway", "",
...@@ -257,6 +261,7 @@ func main() { ...@@ -257,6 +261,7 @@ func main() {
HealthProbeBindAddress: probeAddr, HealthProbeBindAddress: probeAddr,
LeaderElection: enableLeaderElection, LeaderElection: enableLeaderElection,
LeaderElectionID: leaderElectionID, LeaderElectionID: leaderElectionID,
LeaderElectionNamespace: leaderElectionNamespace,
// LeaderElectionReleaseOnCancel defines if the leader should step down voluntarily // LeaderElectionReleaseOnCancel defines if the leader should step down voluntarily
// when the Manager ends. This requires the binary to immediately end when the // when the Manager ends. This requires the binary to immediately end when the
// Manager is stopped, otherwise, this setting is unsafe. Setting this significantly // Manager is stopped, otherwise, this setting is unsafe. Setting this significantly
......
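The hunks above add a `--leader-election-namespace` flag and pass its value to the manager's `LeaderElectionNamespace` option. A standalone Go sketch of the flag handling follows, using only the standard `flag` package. The `leaderElectionConfig` struct, the `parseLeaderElection` helper, and the `false` default for `--leader-elect` are assumptions for illustration; the operator itself registers these flags in `main()` on the default flag set.

```go
package main

import (
	"flag"
	"fmt"
)

// leaderElectionConfig is a hypothetical container for the three settings
// that the operator forwards to its manager options.
type leaderElectionConfig struct {
	Enabled   bool
	ID        string
	Namespace string
}

// parseLeaderElection parses the leader election flags from args.
// Empty ID and Namespace mean "not set on the command line"; the Helm chart
// fills in dynamo.nvidia.com and kube-system before the binary is started.
func parseLeaderElection(args []string) (leaderElectionConfig, error) {
	var cfg leaderElectionConfig
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.BoolVar(&cfg.Enabled, "leader-elect", false, "Enable leader election.")
	fs.StringVar(&cfg.ID, "leader-election-id", "", "ID to use for the leader election.")
	fs.StringVar(&cfg.Namespace, "leader-election-namespace", "",
		"Namespace where the leader election resource will be created.")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	return cfg, nil
}

func main() {
	// Arguments as the chart renders them in cluster-wide mode with default values.
	cfg, err := parseLeaderElection([]string{
		"--leader-elect",
		"--leader-election-id=dynamo.nvidia.com",
		"--leader-election-namespace=kube-system",
	})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

In namespace-restricted mode the chart passes `--leader-elect=false`, in which case the ID and namespace values are not used.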
...@@ -19,7 +19,7 @@ func (m *MockSimpleDeployer) GetHostNames(serviceName string, numberOfNodes int3 ...@@ -19,7 +19,7 @@ func (m *MockSimpleDeployer) GetHostNames(serviceName string, numberOfNodes int3
hostnames := make([]string, numberOfNodes) hostnames := make([]string, numberOfNodes)
hostnames[0] = m.GetLeaderHostname(serviceName) hostnames[0] = m.GetLeaderHostname(serviceName)
for i := int32(1); i < numberOfNodes; i++ { for i := int32(1); i < numberOfNodes; i++ {
hostnames[i] = "worker" + string(rune('0'+i)) + ".example.com" hostnames[i] = "worker" + string('0'+i) + ".example.com"
} }
return hostnames return hostnames
} }
...@@ -39,7 +39,7 @@ func (m *MockShellDeployer) GetHostNames(serviceName string, numberOfNodes int32 ...@@ -39,7 +39,7 @@ func (m *MockShellDeployer) GetHostNames(serviceName string, numberOfNodes int32
hostnames := make([]string, numberOfNodes) hostnames := make([]string, numberOfNodes)
hostnames[0] = m.GetLeaderHostname(serviceName) hostnames[0] = m.GetLeaderHostname(serviceName)
for i := int32(1); i < numberOfNodes; i++ { for i := int32(1); i < numberOfNodes; i++ {
hostnames[i] = "$(WORKER_" + string(rune('0'+i)) + "_HOST)" hostnames[i] = "$(WORKER_" + string('0'+i) + "_HOST)"
} }
return hostnames return hostnames
} }
......
...@@ -23,7 +23,7 @@ High-level guide to Dynamo Kubernetes deployments. Start here, then dive into sp ...@@ -23,7 +23,7 @@ High-level guide to Dynamo Kubernetes deployments. Start here, then dive into sp
```bash ```bash
# 1. Set environment # 1. Set environment
export NAMESPACE=dynamo-kubernetes export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# 2. Install CRDs # 2. Install CRDs
...@@ -50,8 +50,8 @@ Each backend has deployment examples and configuration options: ...@@ -50,8 +50,8 @@ Each backend has deployment examples and configuration options:
## 3. Deploy Your First Model ## 3. Deploy Your First Model
```bash ```bash
# Set same namespace from platform install
export NAMESPACE=dynamo-cloud export NAMESPACE=dynamo-cloud
kubectl create namespace ${NAMESPACE}
# Deploy any example (this uses vLLM with Qwen model using aggregated serving) # Deploy any example (this uses vLLM with Qwen model using aggregated serving)
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE} kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
......
...@@ -69,7 +69,7 @@ Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidi ...@@ -69,7 +69,7 @@ Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidi
```bash ```bash
# 1. Set environment # 1. Set environment
export NAMESPACE=dynamo-kubernetes export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# 2. Install CRDs # 2. Install CRDs
...@@ -99,6 +99,15 @@ helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ...@@ -99,6 +99,15 @@ helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace
--set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080" --set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
``` ```
> [!TIP]
> By default, the Dynamo Operator is installed cluster-wide and monitors all namespaces.
> To restrict the operator to a single namespace (the Helm release namespace by default), set `dynamo-operator.namespaceRestriction.enabled` to `true`.
> To restrict it to a different namespace, also set `dynamo-operator.namespaceRestriction.targetNamespace`.
```bash
--set "dynamo-operator.namespaceRestriction.enabled=true"
--set "dynamo-operator.namespaceRestriction.targetNamespace=dynamo-namespace" # optional
```
[Verify Installation](#verify-installation) [Verify Installation](#verify-installation)
...@@ -108,7 +117,7 @@ Build and deploy from source for customization. ...@@ -108,7 +117,7 @@ Build and deploy from source for customization.
```bash ```bash
# 1. Set environment # 1. Set environment
export NAMESPACE=dynamo-cloud export NAMESPACE=dynamo-system
export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo/ # or your registry export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo/ # or your registry
export DOCKER_USERNAME='$oauthtoken' export DOCKER_USERNAME='$oauthtoken'
export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY> export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY>
......
...@@ -31,7 +31,7 @@ The following env variables are set: ...@@ -31,7 +31,7 @@ The following env variables are set:
```bash ```bash
export MONITORING_NAMESPACE=monitoring export MONITORING_NAMESPACE=monitoring
export DYNAMO_NAMESPACE=dynamo-cloud export DYNAMO_NAMESPACE=dynamo-system
``` ```
## Installation Steps ## Installation Steps
......