Commit fc2d01d1 authored by Julien Mancuso, committed by GitHub

refactor: move MPI SSH key generation from Helm hook Job into operator reconciliation (#6940)


Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
parent c09a9aad
......@@ -37,7 +37,7 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Kubernetes Platform i
- Helm 3.8+
- Sufficient cluster resources for your deployment scale
- Container registry access (if using private images)
-- TLS certificate infrastructure for admission webhooks (auto-generated via Helm hooks by default, or [cert-manager](https://cert-manager.io/), or externally managed)
+- TLS certificate infrastructure for admission webhooks (auto-generated by the operator's built-in cert-controller by default, or [cert-manager](https://cert-manager.io/), or externally managed)
## 🔄 Upgrading Notes
......@@ -45,7 +45,7 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Kubernetes Platform i
The `webhook.enabled` Helm value has been removed. Admission webhooks are now a required component of the operator and cannot be disabled. This change aligns with the upcoming addition of CRD conversion webhooks, which are mandatory for multi-version API support.
-No action is required for most upgrades — Helm hooks automatically generate TLS certificates and inject the CA bundle during `helm upgrade`. If you use cert-manager or externally managed certificates, ensure your existing configuration is correct before upgrading.
+No action is required for most upgrades — the operator's built-in cert-controller automatically generates and rotates TLS certificates at startup. If you use cert-manager or externally managed certificates, ensure your existing configuration is correct before upgrading.
---
......@@ -149,18 +149,13 @@ The chart includes built-in validation to prevent all operator conflicts:
| dynamo-operator.dynamo.virtualServiceSupportsHTTPS | bool | `false` | Whether VirtualServices should support HTTPS routing |
| dynamo-operator.dynamo.metrics.prometheusEndpoint | string | `""` | Endpoint that services can use to retrieve metrics. If set, dynamo operator will automatically inject the PROMETHEUS_ENDPOINT environment variable into services it manages. Users can override the value of the PROMETHEUS_ENDPOINT environment variable by modifying the corresponding deployment's environment variables |
| dynamo-operator.dynamo.mpiRun.secretName | string | `"mpi-run-ssh-secret"` | Name of the secret containing the SSH key for MPI Run |
-| dynamo-operator.dynamo.mpiRun.sshKeygen.enabled | bool | `true` | Whether to enable SSH key generation for MPI Run |
| dynamo-operator.webhook.certificateSecret.name | string | `"webhook-server-cert"` | Name of the Kubernetes secret containing webhook TLS certificates. The secret must contain three keys: tls.crt (server certificate), tls.key (server private key), and ca.crt (Certificate Authority certificate). |
-| dynamo-operator.webhook.certificateSecret.external | bool | `false` | Whether to manage the certificate secret externally. When false (default), certificates are automatically generated via Helm hooks during installation. When true, you must create the secret manually before installing the chart. |
-| dynamo-operator.webhook.certificateValidity | int | `365` | Certificate validity duration in days for auto-generated certificates. Only used when certManager.enabled=false and certificateSecret.external=false. After this duration, certificates will expire and need to be regenerated. |
-| dynamo-operator.webhook.certGenerator.image.repository | string | `"bitnami/kubectl"` | Container image repository for certificate generation jobs. This image must contain both openssl and kubectl commands. |
-| dynamo-operator.webhook.certGenerator.image.tag | string | `"latest"` | Container image tag for certificate generation jobs |
-| dynamo-operator.webhook.certGenerator.image.pullPolicy | string | `"IfNotPresent"` | Image pull policy for certificate generation jobs |
+| dynamo-operator.webhook.certificateSecret.external | bool | `false` | Whether to manage the certificate secret externally. When false (default), the operator's built-in cert-controller generates and rotates certificates automatically. When true, you must create the secret manually before installing the chart. |
| dynamo-operator.webhook.caBundle | string | `""` | CA bundle (base64 encoded) for webhook validation. Only used when certificateSecret.external=true. For automatic certificate generation or cert-manager integration, leave this empty as it will be injected automatically. |
| dynamo-operator.webhook.failurePolicy | string | `"Fail"` | Webhook failure policy controls how Kubernetes handles requests when the webhook is unavailable. 'Fail' (recommended for production) rejects requests if the webhook cannot be reached, ensuring strict validation. 'Ignore' allows requests through if the webhook is unavailable, providing availability over validation guarantees. |
| dynamo-operator.webhook.timeoutSeconds | int | `10` | Timeout in seconds for webhook validation calls. If the webhook doesn't respond within this time, the request will be handled according to the failurePolicy. |
| dynamo-operator.webhook.namespaceSelector | object | `{}` | Custom namespace selector for webhook validation. Use this to include or exclude specific namespaces from webhook validation. For CLUSTER-WIDE operators, you can exclude namespaces managed by namespace-restricted operators by using: matchExpressions: [{ key: "dynamo-operator", operator: "NotIn", values: ["namespace-restricted"] }]. For NAMESPACE-RESTRICTED operators, leave empty as it will be auto-configured to match only the operator's namespace. |
-| dynamo-operator.webhook.certManager.enabled | bool | `false` | Whether to use cert-manager for automatic certificate management. Requires cert-manager to be installed in the cluster. When enabled, cert-manager will automatically generate, renew, and rotate certificates, and the automatic certificate generation via Helm hooks will be disabled. |
+| dynamo-operator.webhook.certManager.enabled | bool | `false` | Whether to use cert-manager for automatic certificate management. Requires cert-manager to be installed in the cluster. When enabled, cert-manager will provision and rotate certificates instead of the operator's built-in cert-controller. |
| dynamo-operator.webhook.certManager.certificate.duration | string | `"8760h"` | Certificate duration for webhook certificates managed by cert-manager (e.g., "8760h" for 1 year). cert-manager will automatically renew the certificate before it expires. |
| dynamo-operator.webhook.certManager.certificate.renewBefore | string | `"360h"` | Time before certificate expiration to trigger renewal (e.g., "360h" for 15 days). cert-manager will attempt to renew the certificate when this threshold is reached. |
| dynamo-operator.webhook.certManager.certificate.rootCA.duration | string | `"87600h"` | Duration for the root CA certificate (e.g., "87600h" for 10 years). The root CA typically has a much longer lifetime than the leaf certificates it signs. |
......
......@@ -37,7 +37,7 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Kubernetes Platform i
- Helm 3.8+
- Sufficient cluster resources for your deployment scale
- Container registry access (if using private images)
-- TLS certificate infrastructure for admission webhooks (auto-generated via Helm hooks by default, or [cert-manager](https://cert-manager.io/), or externally managed)
+- TLS certificate infrastructure for admission webhooks (auto-generated by the operator's built-in cert-controller by default, or [cert-manager](https://cert-manager.io/), or externally managed)
## 🔄 Upgrading Notes
......@@ -45,7 +45,7 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Kubernetes Platform i
The `webhook.enabled` Helm value has been removed. Admission webhooks are now a required component of the operator and cannot be disabled. This change aligns with the upcoming addition of CRD conversion webhooks, which are mandatory for multi-version API support.
-No action is required for most upgrades — Helm hooks automatically generate TLS certificates and inject the CA bundle during `helm upgrade`. If you use cert-manager or externally managed certificates, ensure your existing configuration is correct before upgrading.
+No action is required for most upgrades — the operator's built-in cert-controller automatically generates and rotates TLS certificates at startup. If you use cert-manager or externally managed certificates, ensure your existing configuration is correct before upgrading.
---
......
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This job is used to generate an SSH key pair and create a Kubernetes secret with the key pair.
# The secret is used when mpi is in use by dynamo workers.
{{- if .Values.dynamo.mpiRun.sshKeygen.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "dynamo-operator.fullname" . }}-ssh-keygen
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-weight": "-5"
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
backoffLimit: 1
activeDeadlineSeconds: 300
template:
spec:
restartPolicy: Never
serviceAccountName: {{ include "dynamo-operator.fullname" . }}-ssh-keygen
{{- $sshKeygen := .Values.dynamo.mpiRun.sshKeygen | default dict -}}
{{- $controller := .Values.controllerManager | default dict -}}
{{- $tolerations := ternary $sshKeygen.tolerations $controller.tolerations (hasKey $sshKeygen "tolerations") -}}
{{- if $tolerations }}
tolerations:
{{- toYaml $tolerations | nindent 8 }}
{{- end }}
{{- $affinity := ternary $sshKeygen.affinity $controller.affinity (hasKey $sshKeygen "affinity") -}}
{{- if $affinity }}
affinity:
{{- toYaml $affinity | nindent 8 }}
{{- end }}
securityContext:
runAsNonRoot: true
runAsUser: 65534
fsGroup: 65534
initContainers:
- name: keygen
image: bitnamisecure/git:latest
volumeMounts:
- name: shared
mountPath: /shared
env:
- name: SECRET_NAME
value: "{{ .Values.dynamo.mpiRun.secretName }}"
- name: NAMESPACE
value: "{{ .Release.Namespace }}"
command:
- /bin/bash
- -e
- -c
- |
echo "Generating SSH key pair with ssh-keygen..."
ssh-keygen -t rsa -b 2048 -f /shared/private.key -N ""
echo "SSH keys generated and saved to shared volume"
containers:
- name: kubectl-create-secret
image: alpine/k8s:1.34.1
volumeMounts:
- name: shared
mountPath: /shared
env:
- name: SECRET_NAME
value: "{{ .Values.dynamo.mpiRun.secretName }}"
- name: NAMESPACE
value: "{{ .Release.Namespace }}"
command:
- /bin/bash
- -e
- -c
- |
# Check if secret already exists
if kubectl get secret "$SECRET_NAME" -n "$NAMESPACE" &>/dev/null; then
echo "Secret $SECRET_NAME already exists, skipping creation"
exit 0
fi
echo "Creating Kubernetes secret..."
kubectl create secret generic "$SECRET_NAME" \
--from-file=private.key=/shared/private.key \
--from-file=private.key.pub=/shared/private.key.pub \
-n "$NAMESPACE"
echo "SSH key secret created successfully"
volumes:
- name: shared
emptyDir: {}
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: {{ include "dynamo-operator.fullname" . }}-ssh-keygen
labels:
{{- include "dynamo-operator.labels" . | nindent 4 }}
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-weight": "-10"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: {{ include "dynamo-operator.fullname" . }}-ssh-keygen
labels:
{{- include "dynamo-operator.labels" . | nindent 4 }}
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-weight": "-10"
rules:
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: {{ include "dynamo-operator.fullname" . }}-ssh-keygen
labels:
{{- include "dynamo-operator.labels" . | nindent 4 }}
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-weight": "-10"
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: {{ include "dynamo-operator.fullname" . }}-ssh-keygen
subjects:
- kind: ServiceAccount
name: {{ include "dynamo-operator.fullname" . }}-ssh-keygen
namespace: {{ .Release.Namespace }}
---
{{- end }}
......@@ -130,8 +130,6 @@ dynamo:
mpiRun:
secretName: "mpi-run-ssh-secret"
-sshKeygen:
-enabled: true
#imagePullSecrets: []
......
......@@ -175,10 +175,6 @@ dynamo-operator:
mpiRun:
# -- Name of the secret containing the SSH key for MPI Run
secretName: "mpi-run-ssh-secret"
-# SSH key generation configuration
-sshKeygen:
-# -- Whether to enable SSH key generation for MPI Run
-enabled: true
# Webhook configuration for admission control and validation
webhook:
......@@ -187,23 +183,9 @@ dynamo-operator:
# -- Name of the Kubernetes secret containing webhook TLS certificates. The secret must contain three keys: tls.crt (server certificate), tls.key (server private key), and ca.crt (Certificate Authority certificate).
name: webhook-server-cert
-# -- Whether to manage the certificate secret externally. When false (default), certificates are automatically generated via Helm hooks during installation. When true, you must create the secret manually before installing the chart.
+# -- Whether to manage the certificate secret externally. When false (default), the operator's built-in cert-controller generates and rotates certificates automatically. When true, you must create the secret manually before installing the chart.
external: false
-# -- Certificate validity duration in days for auto-generated certificates. Only used when certManager.enabled=false and certificateSecret.external=false. After this duration, certificates will expire and need to be regenerated.
-certificateValidity: 365
-# Container image for certificate generation and CA injection jobs
-# Only used when certManager.enabled=false and certificateSecret.external=false
-certGenerator:
-image:
-# -- Container image repository for certificate generation jobs. This image must contain both openssl and kubectl commands.
-repository: bitnami/kubectl
-# -- Container image tag for certificate generation jobs
-tag: latest
-# -- Image pull policy for certificate generation jobs
-pullPolicy: IfNotPresent
# -- CA bundle (base64 encoded) for webhook validation. Only used when certificateSecret.external=true. For automatic certificate generation or cert-manager integration, leave this empty as it will be injected automatically.
caBundle: ""
......@@ -219,7 +201,7 @@ dynamo-operator:
# cert-manager integration for automated certificate lifecycle management
certManager:
-# -- Whether to use cert-manager for automatic certificate management. Requires cert-manager to be installed in the cluster. When enabled, cert-manager will automatically generate, renew, and rotate certificates, and the automatic certificate generation via Helm hooks will be disabled.
+# -- Whether to use cert-manager for automatic certificate management. Requires cert-manager to be installed in the cluster. When enabled, cert-manager will provision and rotate certificates instead of the operator's built-in cert-controller.
enabled: false
# Certificate configuration for cert-manager
......
......@@ -521,12 +521,7 @@ func main() {
}
}()
-// Create MPI SSH SecretReplicator for cross-namespace secret replication
-mpiSecretReplicator := secret.NewSecretReplicator(
-mgr.GetClient(),
-operatorCfg.MPI.SSHSecretNamespace,
-operatorCfg.MPI.SSHSecretName,
-)
+sshKeyManager := secret.NewSSHKeyManager(mgr.GetClient(), operatorCfg.MPI)
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
setupLog.Error(err, "unable to set up health check")
......@@ -549,7 +544,7 @@ func main() {
// Controllers don't depend on TLS certificates.
if err := registerControllers(
mgr, operatorCfg, runtimeConfig,
-dockerSecretRetriever, mpiSecretReplicator,
+dockerSecretRetriever, sshKeyManager,
); err != nil {
setupLog.Error(err, "failed to register controllers")
os.Exit(1)
......@@ -591,7 +586,7 @@ func registerControllers(
operatorCfg *configv1alpha1.OperatorConfiguration,
runtimeConfig *commonController.RuntimeConfig,
dockerSecretRetriever *secrets.DockerSecretIndexer,
-mpiSecretReplicator *secret.SecretReplicator,
+sshKeyManager *secret.SSHKeyManager,
) error {
if err := (&controller.DynamoComponentDeploymentReconciler{
Client: mgr.GetClient(),
......@@ -617,7 +612,7 @@ func registerControllers(
RuntimeConfig: runtimeConfig,
DockerSecretRetriever: dockerSecretRetriever,
ScaleClient: scaleClient,
-MPISecretReplicator: mpiSecretReplicator,
+SSHKeyManager: sshKeyManager,
RBACManager: rbacManager,
}).SetupWithManager(mgr); err != nil {
return fmt.Errorf("unable to create DynamoGraphDeployment controller: %w", err)
......
......@@ -16,6 +16,7 @@ require (
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.71.2
github.com/prometheus/client_golang v1.23.2
github.com/stretchr/testify v1.11.1
+golang.org/x/crypto v0.48.0
istio.io/api v1.23.1
istio.io/client-go v1.23.1
k8s.io/api v0.34.3
......@@ -68,15 +69,15 @@ require (
go.uber.org/zap v1.27.1 // indirect
go.yaml.in/yaml/v2 v2.4.3 // indirect
go.yaml.in/yaml/v3 v3.0.4 // indirect
-golang.org/x/mod v0.30.0 // indirect
-golang.org/x/net v0.48.0 // indirect
+golang.org/x/mod v0.32.0 // indirect
+golang.org/x/net v0.49.0 // indirect
golang.org/x/oauth2 v0.34.0 // indirect
golang.org/x/sync v0.19.0 // indirect
-golang.org/x/sys v0.39.0 // indirect
-golang.org/x/term v0.38.0 // indirect
-golang.org/x/text v0.32.0 // indirect
+golang.org/x/sys v0.41.0 // indirect
+golang.org/x/term v0.40.0 // indirect
+golang.org/x/text v0.34.0 // indirect
golang.org/x/time v0.13.0 // indirect
-golang.org/x/tools v0.39.0 // indirect
+golang.org/x/tools v0.41.0 // indirect
gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect
google.golang.org/genproto/googleapis/api v0.0.0-20251202230838-ff82c1b0f217 // indirect
google.golang.org/protobuf v1.36.11 // indirect
......
......@@ -155,16 +155,18 @@ go.yaml.in/yaml/v3 v3.0.4/go.mod h1:DhzuOOF2ATzADvBadXxruRBLzYTpT36CKvDb3+aBEFg=
golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9/go.mod h1:LzIPMQfyMNhhGPhUkYOs5KpL4U8rLKemX1yGLhDgUto=
+golang.org/x/crypto v0.48.0 h1:/VRzVqiRSggnhY7gNRxPauEQ5Drw9haKdM0jqfcCFts=
+golang.org/x/crypto v0.48.0/go.mod h1:r0kV5h3qnFPlQnBSrULhlsRfryS2pmewsg+XfMgkVos=
golang.org/x/mod v0.2.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA=
golang.org/x/mod v0.3.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA=
-golang.org/x/mod v0.30.0 h1:fDEXFVZ/fmCKProc/yAXXUijritrDzahmwwefnjoPFk=
-golang.org/x/mod v0.30.0/go.mod h1:lAsf5O2EvJeSFMiBxXDki7sCgAxEUcZHXoXMKT4GJKc=
+golang.org/x/mod v0.32.0 h1:9F4d3PHLljb6x//jOyokMv3eX+YDeepZSEo3mFJy93c=
+golang.org/x/mod v0.32.0/go.mod h1:SgipZ/3h2Ci89DlEtEXWUk/HteuRin+HHhN+WbNhguU=
golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
golang.org/x/net v0.0.0-20200226121028-0de0cce0169b/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
golang.org/x/net v0.0.0-20201021035429-f5854403a974/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU=
-golang.org/x/net v0.48.0 h1:zyQRTTrjc33Lhh0fBgT/H3oZq9WuvRR5gPC70xpDiQU=
-golang.org/x/net v0.48.0/go.mod h1:+ndRgGjkh8FGtu1w1FGbEC31if4VrNVMuKTgcAAnQRY=
+golang.org/x/net v0.49.0 h1:eeHFmOGUTtaaPSGNmjBKpbng9MulQsJURQUAfUwY++o=
+golang.org/x/net v0.49.0/go.mod h1:/ysNB2EvaqvesRkuLAyjI1ycPZlQHM3q01F02UY/MV8=
golang.org/x/oauth2 v0.34.0 h1:hqK/t4AKgbqWkdkcAeI8XLmbK+4m4G5YeQRrmiotGlw=
golang.org/x/oauth2 v0.34.0/go.mod h1:lzm5WQJQwKZ3nwavOZ3IS5Aulzxi68dUSgRHujetwEA=
golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
......@@ -175,22 +177,22 @@ golang.org/x/sync v0.19.0/go.mod h1:9KTHXmSnoGruLpwFjVSX0lNNA75CykiMECbovNTZqGI=
golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
-golang.org/x/sys v0.39.0 h1:CvCKL8MeisomCi6qNZ+wbb0DN9E5AATixKsvNtMoMFk=
-golang.org/x/sys v0.39.0/go.mod h1:OgkHotnGiDImocRcuBABYBEXf8A9a87e/uXjp9XT3ks=
-golang.org/x/term v0.38.0 h1:PQ5pkm/rLO6HnxFR7N2lJHOZX6Kez5Y1gDSJla6jo7Q=
-golang.org/x/term v0.38.0/go.mod h1:bSEAKrOT1W+VSu9TSCMtoGEOUcKxOKgl3LE5QEF/xVg=
+golang.org/x/sys v0.41.0 h1:Ivj+2Cp/ylzLiEU89QhWblYnOE9zerudt9Ftecq2C6k=
+golang.org/x/sys v0.41.0/go.mod h1:OgkHotnGiDImocRcuBABYBEXf8A9a87e/uXjp9XT3ks=
+golang.org/x/term v0.40.0 h1:36e4zGLqU4yhjlmxEaagx2KuYbJq3EwY8K943ZsHcvg=
+golang.org/x/term v0.40.0/go.mod h1:w2P8uVp06p2iyKKuvXIm7N/y0UCRt3UfJTfZ7oOpglM=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
-golang.org/x/text v0.32.0 h1:ZD01bjUt1FQ9WJ0ClOL5vxgxOI/sVCNgX1YtKwcY0mU=
-golang.org/x/text v0.32.0/go.mod h1:o/rUWzghvpD5TXrTIBuJU77MTaN0ljMWE47kxGJQ7jY=
+golang.org/x/text v0.34.0 h1:oL/Qq0Kdaqxa1KbNeMKwQq0reLCCaFtqu2eNuSeNHbk=
+golang.org/x/text v0.34.0/go.mod h1:homfLqTYRFyVYemLBFl5GgL/DWEiH5wcsQ5gSh1yziA=
golang.org/x/time v0.13.0 h1:eUlYslOIt32DgYD6utsuUeHs4d7AsEYLuIAdg7FlYgI=
golang.org/x/time v0.13.0/go.mod h1:eL/Oa2bBBK0TkX57Fyni+NgnyQQN4LitPmob2Hjnqw4=
golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.0.0-20200619180055-7c47624df98f/go.mod h1:EkVYQZoAsY45+roYkvgYkIh4xh/qjgUK9TdY2XT94GE=
golang.org/x/tools v0.0.0-20210106214847-113979e3529a/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA=
-golang.org/x/tools v0.39.0 h1:ik4ho21kwuQln40uelmciQPp9SipgNDdrafrYA4TmQQ=
-golang.org/x/tools v0.39.0/go.mod h1:JnefbkDPyD8UU2kI5fuf8ZX4/yUeh9W877ZeBONxUqQ=
+golang.org/x/tools v0.41.0 h1:a9b8iMweWG+S0OBnlU36rzLp20z1Rp10w+IY2czHTQc=
+golang.org/x/tools v0.41.0/go.mod h1:XSY6eDqxVNiYgezAVqqCeihT4j1U2CCsqvH3WhQpnlg=
golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
......
......@@ -74,7 +74,7 @@ type DynamoGraphDeploymentReconciler struct {
Recorder record.EventRecorder
DockerSecretRetriever dockerSecretRetriever
ScaleClient scale.ScalesGetter
-MPISecretReplicator *secret.SecretReplicator
+SSHKeyManager *secret.SSHKeyManager
RBACManager rbacManager
}
......@@ -322,12 +322,10 @@ func (r *DynamoGraphDeploymentReconciler) reconcileResources(ctx context.Context
// Determine if any service is multinode
hasMultinode := dynamoDeployment.HasAnyMultinodeService()
-// Always ensure MPI SSH secret is available in this namespace
-if r.MPISecretReplicator != nil {
-err := r.MPISecretReplicator.Replicate(ctx, dynamoDeployment.Namespace)
-if err != nil {
-logger.Error(err, "Failed to replicate MPI secret", "namespace", dynamoDeployment.Namespace)
-return ReconcileResult{}, fmt.Errorf("failed to replicate MPI secret: %w", err)
+if r.SSHKeyManager != nil && hasMultinode {
+if err := r.SSHKeyManager.EnsureAndReplicate(ctx, dynamoDeployment.Namespace); err != nil {
+logger.Error(err, "Failed to ensure MPI SSH key secret", "namespace", dynamoDeployment.Namespace)
+return ReconcileResult{}, fmt.Errorf("failed to ensure MPI SSH key secret: %w", err)
}
}
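The hunk above replaces an unconditional replication step with a call that runs only for multinode deployments. The body of `EnsureAndReplicate` is not part of this hunk, so the following is only a rough sketch of what an idempotent "ensure, then replicate" step could look like. It uses an in-memory map in place of the Kubernetes API; `secretStore`, `ensureAndReplicate`, and the namespace names are all hypothetical and do not come from the commit.

```go
package main

import (
	"errors"
	"fmt"
)

// secretStore is a hypothetical in-memory stand-in for the Kubernetes API.
// Keys have the form "namespace/name".
type secretStore map[string][]byte

var errNotFound = errors.New("secret not found")

func (s secretStore) get(ns, name string) ([]byte, error) {
	v, ok := s[ns+"/"+name]
	if !ok {
		return nil, errNotFound
	}
	return v, nil
}

// ensureAndReplicate creates the source secret if it is missing, then copies
// it into targetNS unless a secret with that name already exists there.
// This is an assumed shape, not the operator's actual implementation.
func ensureAndReplicate(s secretStore, sourceNS, targetNS, name string, generate func() ([]byte, error)) error {
	src, err := s.get(sourceNS, name)
	if errors.Is(err, errNotFound) {
		if src, err = generate(); err != nil {
			return fmt.Errorf("generate key: %w", err)
		}
		s[sourceNS+"/"+name] = src
	} else if err != nil {
		return err
	}
	if _, err := s.get(targetNS, name); err == nil {
		return nil // an existing target secret is left untouched
	}
	s[targetNS+"/"+name] = src
	return nil
}

func main() {
	store := secretStore{}
	calls := 0
	gen := func() ([]byte, error) { calls++; return []byte("key-material"), nil }
	// Calling twice mimics two reconcile passes over the same deployment.
	for i := 0; i < 2; i++ {
		if err := ensureAndReplicate(store, "operator-ns", "workload-ns", "mpi-run-ssh-secret", gen); err != nil {
			panic(err)
		}
	}
	fmt.Println(len(store), calls) // prints "2 1"
}
```

The second pass finds both secrets in place and generates nothing, which is the property a reconcile loop needs because it may run many times for the same object.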
......
/*
* SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package secret
import (
"context"
"fmt"
corev1 "k8s.io/api/core/v1"
k8serrors "k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
"sigs.k8s.io/controller-runtime/pkg/client"
)
// SecretReplicator handles replication of secrets across namespaces
type SecretReplicator struct {
client.Client
sourceNamespace string
secretName string
}
// NewSecretReplicator creates a new SecretReplicator for replicating a specific secret
func NewSecretReplicator(client client.Client, sourceNamespace, secretName string) *SecretReplicator {
return &SecretReplicator{
Client: client,
sourceNamespace: sourceNamespace,
secretName: secretName,
}
}
// Replicate ensures the secret exists in the target namespace by copying from source namespace
func (r *SecretReplicator) Replicate(ctx context.Context, targetNamespace string) error {
// Check if secret already exists in target namespace
targetSecret := &corev1.Secret{}
err := r.Get(ctx, types.NamespacedName{
Name: r.secretName,
Namespace: targetNamespace,
}, targetSecret)
if err == nil {
// Secret already exists - do nothing
return nil
}
if !k8serrors.IsNotFound(err) {
return fmt.Errorf("failed to check target secret: %w", err)
}
// Get source secret
sourceSecret := &corev1.Secret{}
err = r.Get(ctx, types.NamespacedName{
Name: r.secretName,
Namespace: r.sourceNamespace,
}, sourceSecret)
if err != nil {
return fmt.Errorf("error getting source secret: %w", err)
}
// Create replica secret
replicaSecret := &corev1.Secret{
ObjectMeta: metav1.ObjectMeta{
Name: r.secretName,
Namespace: targetNamespace,
},
Type: sourceSecret.Type,
Data: sourceSecret.Data,
}
// Create the replica
err = r.Create(ctx, replicaSecret)
if err != nil && !k8serrors.IsAlreadyExists(err) {
return fmt.Errorf("failed to create replica: %w", err)
}
return nil
}
/*
* SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: Apache-2.0
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package secret
import (
"context"
"strings"
"testing"
corev1 "k8s.io/api/core/v1"
k8serrors "k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/runtime/schema"
"k8s.io/apimachinery/pkg/types"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/client/fake"
)
func TestSecretReplicator_Replicate(t *testing.T) {
sourceSecret := &corev1.Secret{
ObjectMeta: metav1.ObjectMeta{
Name: "test-secret",
Namespace: "source-ns",
},
Type: corev1.SecretTypeOpaque,
Data: map[string][]byte{
"private.key": []byte("private-key-content"),
"private.key.pub": []byte("public-key-content"),
},
}
existingTargetSecret := &corev1.Secret{
ObjectMeta: metav1.ObjectMeta{
Name: "test-secret",
Namespace: "target-ns",
},
Type: corev1.SecretTypeOpaque,
Data: map[string][]byte{
"private.key": []byte("existing-private-key"),
"private.key.pub": []byte("existing-public-key"),
},
}
tests := []struct {
name string
sourceNamespace string
secretName string
targetNamespace string
existingSecrets []client.Object
mockGetError error
mockCreateError error
wantError bool
wantErrorContains string
validateResult func(t *testing.T, client client.Client)
}{
{
name: "secret already exists in target namespace - does nothing",
sourceNamespace: "source-ns",
secretName: "test-secret",
targetNamespace: "target-ns",
existingSecrets: []client.Object{sourceSecret, existingTargetSecret},
wantError: false,
validateResult: func(t *testing.T, client client.Client) {
// Should not have modified existing secret
var secret corev1.Secret
err := client.Get(context.Background(), types.NamespacedName{
Name: "test-secret",
Namespace: "target-ns",
}, &secret)
if err != nil {
t.Errorf("Expected secret to exist in target namespace")
}
if string(secret.Data["private.key"]) != "existing-private-key" {
t.Errorf("Expected existing secret to remain unchanged")
}
},
},
{
name: "source secret does not exist - returns error",
sourceNamespace: "source-ns",
secretName: "missing-secret",
targetNamespace: "target-ns",
existingSecrets: []client.Object{},
wantError: true,
wantErrorContains: "error getting source secret",
},
{
name: "successful replication",
sourceNamespace: "source-ns",
secretName: "test-secret",
targetNamespace: "target-ns",
existingSecrets: []client.Object{sourceSecret},
wantError: false,
validateResult: func(t *testing.T, client client.Client) {
var secret corev1.Secret
err := client.Get(context.Background(), types.NamespacedName{
Name: "test-secret",
Namespace: "target-ns",
}, &secret)
if err != nil {
t.Errorf("Expected secret to be created in target namespace: %v", err)
}
if secret.Type != corev1.SecretTypeOpaque {
t.Errorf("Expected secret type %v, got %v", corev1.SecretTypeOpaque, secret.Type)
}
if string(secret.Data["private.key"]) != "private-key-content" {
t.Errorf("Expected private key data to be copied")
}
if string(secret.Data["private.key.pub"]) != "public-key-content" {
t.Errorf("Expected public key data to be copied")
}
},
},
{
name: "race condition - AlreadyExists error is ignored",
sourceNamespace: "source-ns",
secretName: "test-secret",
targetNamespace: "target-ns",
existingSecrets: []client.Object{sourceSecret},
mockCreateError: k8serrors.NewAlreadyExists(schema.GroupResource{Resource: "secrets"}, "test-secret"),
wantError: false,
},
{
name: "create error other than AlreadyExists - returns error",
sourceNamespace: "source-ns",
secretName: "test-secret",
targetNamespace: "target-ns",
existingSecrets: []client.Object{sourceSecret},
mockCreateError: k8serrors.NewServiceUnavailable("mock error"),
wantError: true,
wantErrorContains: "failed to create replica",
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
// Create fake client with existing secrets
scheme := runtime.NewScheme()
_ = corev1.AddToScheme(scheme)
clientBuilder := fake.NewClientBuilder().WithScheme(scheme)
if len(tt.existingSecrets) > 0 {
clientBuilder = clientBuilder.WithObjects(tt.existingSecrets...)
}
fakeClient := clientBuilder.Build()
// Wrap client to inject errors if needed
var testClient client.Client = fakeClient
if tt.mockCreateError != nil {
testClient = &errorInjectingClient{
Client: fakeClient,
createError: tt.mockCreateError,
}
}
replicator := NewSecretReplicator(testClient, tt.sourceNamespace, tt.secretName)
err := replicator.Replicate(context.Background(), tt.targetNamespace)
if tt.wantError {
if err == nil {
t.Errorf("Replicate() expected error, got nil")
} else if tt.wantErrorContains != "" && !strings.Contains(err.Error(), tt.wantErrorContains) {
t.Errorf("Replicate() error = %v, want error containing %v", err, tt.wantErrorContains)
}
} else {
if err != nil {
t.Errorf("Replicate() unexpected error = %v", err)
}
}
if tt.validateResult != nil {
tt.validateResult(t, fakeClient)
}
})
}
}
// errorInjectingClient wraps a client to inject specific errors for testing
type errorInjectingClient struct {
client.Client
createError error
}
func (c *errorInjectingClient) Create(ctx context.Context, obj client.Object, opts ...client.CreateOption) error {
if c.createError != nil {
return c.createError
}
return c.Client.Create(ctx, obj, opts...)
}
/*
* SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: Apache-2.0
*/
package secret
import (
"context"
"crypto/rand"
"crypto/rsa"
"crypto/x509"
"encoding/pem"
"fmt"
configv1alpha1 "github.com/ai-dynamo/dynamo/deploy/operator/api/config/v1alpha1"
"github.com/go-logr/logr"
"golang.org/x/crypto/ssh"
corev1 "k8s.io/api/core/v1"
apierrors "k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
)
const (
rsaKeyBits = 2048
privateKeyFile = "private.key"
publicKeyFile = "private.key.pub"
)
// KeyPairGenerator abstracts SSH key pair generation for testability.
type KeyPairGenerator interface {
Generate() (privateKeyPEM, publicKeyAuthorized []byte, err error)
}
// rsaKeyPairGenerator generates 2048-bit RSA key pairs: a PKCS#1 PEM private
// key (the format written by ssh-keygen -t rsa -b 2048 -m PEM) and a public
// key in authorized_keys format.
type rsaKeyPairGenerator struct{}
func (rsaKeyPairGenerator) Generate() ([]byte, []byte, error) {
key, err := rsa.GenerateKey(rand.Reader, rsaKeyBits)
if err != nil {
return nil, nil, fmt.Errorf("generating RSA key: %w", err)
}
privPEM := pem.EncodeToMemory(&pem.Block{
Type: "RSA PRIVATE KEY",
Bytes: x509.MarshalPKCS1PrivateKey(key),
})
sshPub, err := ssh.NewPublicKey(&key.PublicKey)
if err != nil {
return nil, nil, fmt.Errorf("converting to SSH public key: %w", err)
}
return privPEM, ssh.MarshalAuthorizedKey(sshPub), nil
}
// SSHKeyManager ensures the MPI SSH key pair secret exists and can replicate
// it to target namespaces for cross-namespace deployments. Keys are generated
// lazily on first use.
type SSHKeyManager struct {
client client.Client
cfg configv1alpha1.MPIConfiguration
generator KeyPairGenerator
logger logr.Logger
}
// NewSSHKeyManager creates an SSHKeyManager backed by the production RSA
// key generator. The client should be the manager's cached client.
func NewSSHKeyManager(cl client.Client, cfg configv1alpha1.MPIConfiguration) *SSHKeyManager {
return &SSHKeyManager{
client: cl,
cfg: cfg,
generator: rsaKeyPairGenerator{},
logger: ctrl.Log.WithName("ssh-key-manager"),
}
}
// EnsureAndReplicate ensures the SSH key pair secret exists in the source
// namespace (generating keys if needed) and replicates it to the target
// namespace if different from the source.
func (m *SSHKeyManager) EnsureAndReplicate(ctx context.Context, targetNamespace string) error {
if err := m.ensureSourceSecret(ctx); err != nil {
return fmt.Errorf("ensuring SSH key secret in %s: %w", m.cfg.SSHSecretNamespace, err)
}
if targetNamespace == m.cfg.SSHSecretNamespace {
return nil
}
return m.replicateToNamespace(ctx, targetNamespace)
}
// secretExists returns true if the SSH key secret already exists in the given namespace.
func (m *SSHKeyManager) secretExists(ctx context.Context, namespace string) (bool, error) {
key := types.NamespacedName{Namespace: namespace, Name: m.cfg.SSHSecretName}
err := m.client.Get(ctx, key, &corev1.Secret{})
if err == nil {
return true, nil
}
if apierrors.IsNotFound(err) {
return false, nil
}
return false, err
}
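// ensureSourceSecret creates the SSH key secret in the source namespace if it
// does not exist yet. An AlreadyExists error from a concurrent create is
// treated as success.
func (m *SSHKeyManager) ensureSourceSecret(ctx context.Context) error {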
exists, err := m.secretExists(ctx, m.cfg.SSHSecretNamespace)
if err != nil {
return err
}
if exists {
return nil
}
privKey, pubKey, err := m.generator.Generate()
if err != nil {
return fmt.Errorf("generating SSH key pair: %w", err)
}
secret := &corev1.Secret{
ObjectMeta: metav1.ObjectMeta{
Namespace: m.cfg.SSHSecretNamespace,
Name: m.cfg.SSHSecretName,
Labels: map[string]string{
"app.kubernetes.io/managed-by": "dynamo-operator",
},
},
Data: map[string][]byte{
privateKeyFile: privKey,
publicKeyFile: pubKey,
},
}
if err := m.client.Create(ctx, secret); err != nil {
if apierrors.IsAlreadyExists(err) {
return nil
}
return fmt.Errorf("creating SSH key secret: %w", err)
}
m.logger.Info("Created MPI SSH key pair secret",
"namespace", m.cfg.SSHSecretNamespace, "name", m.cfg.SSHSecretName)
return nil
}
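// replicateToNamespace copies the source secret into targetNamespace. An
// existing replica is left untouched, and an AlreadyExists error from a
// concurrent create is treated as success.
func (m *SSHKeyManager) replicateToNamespace(ctx context.Context, targetNamespace string) error {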
exists, err := m.secretExists(ctx, targetNamespace)
if err != nil {
return fmt.Errorf("checking target secret: %w", err)
}
if exists {
return nil
}
source := &corev1.Secret{}
sourceKey := types.NamespacedName{Namespace: m.cfg.SSHSecretNamespace, Name: m.cfg.SSHSecretName}
if err := m.client.Get(ctx, sourceKey, source); err != nil {
return fmt.Errorf("reading source secret: %w", err)
}
replica := &corev1.Secret{
ObjectMeta: metav1.ObjectMeta{
Namespace: targetNamespace,
Name: m.cfg.SSHSecretName,
Labels: map[string]string{
"app.kubernetes.io/managed-by": "dynamo-operator",
},
},
Type: source.Type,
Data: source.Data,
}
if err := m.client.Create(ctx, replica); err != nil {
if apierrors.IsAlreadyExists(err) {
return nil
}
return fmt.Errorf("replicating secret to %s: %w", targetNamespace, err)
}
m.logger.Info("Replicated MPI SSH key secret",
"source", m.cfg.SSHSecretNamespace, "target", targetNamespace)
return nil
}
/*
* SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: Apache-2.0
*/
package secret
import (
"context"
"crypto/x509"
"encoding/pem"
"fmt"
"reflect"
"testing"
configv1alpha1 "github.com/ai-dynamo/dynamo/deploy/operator/api/config/v1alpha1"
"github.com/go-logr/logr"
"golang.org/x/crypto/ssh"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/types"
"sigs.k8s.io/controller-runtime/pkg/client/fake"
)
const (
testSecretName = "mpi-ssh-secret"
testSourceNS = "operator-ns"
testTargetNS = "workload-ns"
testPrivateKey = "fake-private-key"
testPublicKey = "fake-public-key"
)
type fakeKeyPairGenerator struct {
called bool
err error
}
func (f *fakeKeyPairGenerator) Generate() ([]byte, []byte, error) {
f.called = true
if f.err != nil {
return nil, nil, f.err
}
return []byte(testPrivateKey), []byte(testPublicKey), nil
}
func newScheme() *runtime.Scheme {
s := runtime.NewScheme()
_ = corev1.AddToScheme(s)
return s
}
func newTestManager(builder *fake.ClientBuilder, gen KeyPairGenerator) *SSHKeyManager {
return &SSHKeyManager{
client: builder.Build(),
cfg: configv1alpha1.MPIConfiguration{
SSHSecretName: testSecretName,
SSHSecretNamespace: testSourceNS,
},
generator: gen,
logger: logr.Discard(),
}
}
func TestRSAKeyPairGenerator(t *testing.T) {
gen := rsaKeyPairGenerator{}
privPEM, pubAuthorized, err := gen.Generate()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
block, _ := pem.Decode(privPEM)
if block == nil {
t.Fatal("expected PEM block in private key")
}
if block.Type != "RSA PRIVATE KEY" {
t.Errorf("expected PEM type RSA PRIVATE KEY, got %s", block.Type)
}
rsaKey, err := x509.ParsePKCS1PrivateKey(block.Bytes)
if err != nil {
t.Fatalf("failed to parse private key: %v", err)
}
if rsaKey.N.BitLen() != rsaKeyBits {
t.Errorf("expected %d-bit key, got %d", rsaKeyBits, rsaKey.N.BitLen())
}
_, _, _, _, err = ssh.ParseAuthorizedKey(pubAuthorized)
if err != nil {
t.Fatalf("failed to parse public key as authorized_keys format: %v", err)
}
}
func TestEnsureAndReplicate_CreatesSourceWhenMissing(t *testing.T) {
gen := &fakeKeyPairGenerator{}
mgr := newTestManager(fake.NewClientBuilder().WithScheme(newScheme()), gen)
ctx := context.Background()
if err := mgr.EnsureAndReplicate(ctx, testSourceNS); err != nil {
t.Fatalf("unexpected error: %v", err)
}
if !gen.called {
t.Fatal("expected generator to be called")
}
secret := &corev1.Secret{}
if err := mgr.client.Get(ctx, types.NamespacedName{Namespace: testSourceNS, Name: testSecretName}, secret); err != nil {
t.Fatalf("source secret should exist: %v", err)
}
if !reflect.DeepEqual(secret.Data[privateKeyFile], []byte(testPrivateKey)) {
t.Error("private key data mismatch")
}
if !reflect.DeepEqual(secret.Data[publicKeyFile], []byte(testPublicKey)) {
t.Error("public key data mismatch")
}
if secret.Labels["app.kubernetes.io/managed-by"] != "dynamo-operator" {
t.Error("expected managed-by label")
}
}
func TestEnsureAndReplicate_SkipsExistingSource(t *testing.T) {
existing := &corev1.Secret{
ObjectMeta: metav1.ObjectMeta{
Namespace: testSourceNS,
Name: testSecretName,
},
Data: map[string][]byte{
privateKeyFile: []byte("existing-private"),
publicKeyFile: []byte("existing-public"),
},
}
gen := &fakeKeyPairGenerator{}
mgr := newTestManager(fake.NewClientBuilder().WithScheme(newScheme()).WithObjects(existing), gen)
ctx := context.Background()
if err := mgr.EnsureAndReplicate(ctx, testSourceNS); err != nil {
t.Fatalf("unexpected error: %v", err)
}
if gen.called {
t.Error("generator should not be called when source secret exists")
}
}
func TestEnsureAndReplicate_ReplicatesToTargetNamespace(t *testing.T) {
source := &corev1.Secret{
ObjectMeta: metav1.ObjectMeta{
Namespace: testSourceNS,
Name: testSecretName,
},
Data: map[string][]byte{
privateKeyFile: []byte("source-private"),
publicKeyFile: []byte("source-public"),
},
}
gen := &fakeKeyPairGenerator{}
mgr := newTestManager(fake.NewClientBuilder().WithScheme(newScheme()).WithObjects(source), gen)
ctx := context.Background()
if err := mgr.EnsureAndReplicate(ctx, testTargetNS); err != nil {
t.Fatalf("unexpected error: %v", err)
}
replica := &corev1.Secret{}
if err := mgr.client.Get(ctx, types.NamespacedName{Namespace: testTargetNS, Name: testSecretName}, replica); err != nil {
t.Fatalf("replica secret should exist: %v", err)
}
if !reflect.DeepEqual(replica.Data[privateKeyFile], []byte("source-private")) {
t.Error("replica private key should match source")
}
if !reflect.DeepEqual(replica.Data[publicKeyFile], []byte("source-public")) {
t.Error("replica public key should match source")
}
}
func TestEnsureAndReplicate_SameNamespaceSkipsReplicate(t *testing.T) {
gen := &fakeKeyPairGenerator{}
mgr := newTestManager(fake.NewClientBuilder().WithScheme(newScheme()), gen)
ctx := context.Background()
if err := mgr.EnsureAndReplicate(ctx, testSourceNS); err != nil {
t.Fatalf("unexpected error: %v", err)
}
// Only the source secret should exist, no replica
list := &corev1.SecretList{}
if err := mgr.client.List(ctx, list); err != nil {
t.Fatalf("failed to list secrets: %v", err)
}
if len(list.Items) != 1 {
t.Errorf("expected exactly 1 secret (source only), got %d", len(list.Items))
}
}
func TestEnsureAndReplicate_SkipsExistingReplica(t *testing.T) {
source := &corev1.Secret{
ObjectMeta: metav1.ObjectMeta{Namespace: testSourceNS, Name: testSecretName},
Data: map[string][]byte{privateKeyFile: []byte("key"), publicKeyFile: []byte("pub")},
}
existingReplica := &corev1.Secret{
ObjectMeta: metav1.ObjectMeta{Namespace: testTargetNS, Name: testSecretName},
Data: map[string][]byte{privateKeyFile: []byte("old-key"), publicKeyFile: []byte("old-pub")},
}
gen := &fakeKeyPairGenerator{}
mgr := newTestManager(fake.NewClientBuilder().WithScheme(newScheme()).WithObjects(source, existingReplica), gen)
ctx := context.Background()
if err := mgr.EnsureAndReplicate(ctx, testTargetNS); err != nil {
t.Fatalf("unexpected error: %v", err)
}
replica := &corev1.Secret{}
if err := mgr.client.Get(ctx, types.NamespacedName{Namespace: testTargetNS, Name: testSecretName}, replica); err != nil {
t.Fatalf("failed to get replica: %v", err)
}
if !reflect.DeepEqual(replica.Data[privateKeyFile], []byte("old-key")) {
t.Error("existing replica should not be overwritten")
}
}
func TestEnsureAndReplicate_GeneratorError(t *testing.T) {
gen := &fakeKeyPairGenerator{err: fmt.Errorf("keygen failed")}
mgr := newTestManager(fake.NewClientBuilder().WithScheme(newScheme()), gen)
err := mgr.EnsureAndReplicate(context.Background(), testSourceNS)
if err == nil {
t.Fatal("expected error from generator")
}
if !gen.called {
t.Fatal("expected generator to be called")
}
}
......@@ -288,8 +288,7 @@ For TensorRT-LLM multinode deployments, the operator configures MPI-based commun
#### Additional Configuration
- **Environment Variable**: `OMPI_MCA_orte_keep_fqdn_hostnames=1` is added to all nodes
- **SSH Volume**: Automatically mounts the SSH keypair secret (typically named `mpirun-ssh-key-<deployment-name>`)
**Important:** TensorRT-LLM requires an SSH keypair secret to be created before deployment. The secret name follows the pattern `mpirun-ssh-key-<component-name>`.
- **Automatic SSH key generation**: The operator automatically generates the SSH keypair secret when it detects a multi-node `DynamoGraphDeployment`. No manual secret creation is required.
### Compilation Cache Configuration
......
......@@ -126,8 +126,8 @@ The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validat
**Key Features:**
- ✅ Shared certificate infrastructure across all webhook types
- ✅ Automatic certificate generation (for testing/development)
- ✅ cert-manager integration (for production)
- ✅ Automatic certificate generation and rotation (default, all environments)
- ✅ cert-manager integration (optional, for custom PKI)
- ✅ Multi-operator support with lease-based coordination
- ✅ Immutability enforcement for critical fields
......
......@@ -32,8 +32,8 @@ All webhook types (validating, mutating, conversion, etc.) share the same **webh
- **Always enabled** - Webhooks are a required component of the operator
- **Shared certificate infrastructure** - All webhook types use the same TLS certificates
- **Automatic certificate generation** - No manual certificate management required
- **cert-manager integration** - Optional integration for automated certificate lifecycle
- **Automatic certificate generation and rotation** - Built-in cert-controller, no manual management required
- **cert-manager integration** - Optional integration for custom PKI or organizational certificate policies
- **Multi-operator support** - Lease-based coordination for cluster-wide and namespace-restricted deployments
- **Immutability enforcement** - Critical fields protected via CEL validation rules
......@@ -102,7 +102,7 @@ The `webhook.enabled` Helm value has been removed. Webhooks are now a required c
1. **Remove `webhook.enabled`** from any custom values files. Helm will ignore the unknown key, but it should be cleaned up to avoid confusion.
2. **Ensure port 9443 is reachable** from the Kubernetes API server to the operator pod. If you have `NetworkPolicy` rules or firewall configurations restricting traffic, add an ingress rule allowing the API server to reach the webhook server on port 9443.
3. **Ensure webhook TLS certificates are available.** By default, Helm hooks generate self-signed certificates automatically during `helm upgrade` — no action needed. If you use cert-manager or externally managed certificates, verify your configuration is in place before upgrading.
3. **Ensure webhook TLS certificates are available.** By default, the operator's built-in cert-controller generates and rotates self-signed certificates automatically at startup — no action needed. If you use cert-manager or externally managed certificates, verify your configuration is in place before upgrading.
---
......@@ -114,9 +114,9 @@ The operator supports three certificate management modes:
| Mode | Description | Use Case |
|------|-------------|----------|
| **Automatic (Default)** | Helm hooks generate self-signed certificates | Testing and development environments |
| **cert-manager** | Integrate with cert-manager for automated lifecycle | Production deployments with cert-manager |
| **External** | Bring your own certificates | Production deployments with custom PKI |
| **Automatic (Default)** | Operator's built-in cert-controller generates and rotates certificates | All environments (recommended) |
| **cert-manager** | Integrate with cert-manager for certificate lifecycle management | Clusters with cert-manager and custom PKI requirements |
| **External** | Bring your own certificates | Environments with externally managed PKI |
---
......@@ -127,7 +127,7 @@ The operator supports three certificate management modes:
```yaml
dynamo-operator:
webhook:
# Certificate management
# Certificate management (optional, to use cert-manager instead of built-in)
certManager:
enabled: false
issuerRef:
......@@ -137,16 +137,7 @@ dynamo-operator:
# Certificate secret configuration
certificateSecret:
name: webhook-server-cert
external: false
# Certificate validity period (automatic generation only)
certificateValidity: 3650 # 10 years
# Certificate generator image (automatic generation only)
certGenerator:
image:
repository: bitnami/kubectl
tag: latest
external: false # Set to true for externally managed certificates
# Webhook behavior configuration
failurePolicy: Fail # Fail (reject on error) or Ignore (allow on error)
......@@ -198,49 +189,46 @@ webhook:
### Automatic Certificates (Default)
**Zero configuration required!** Certificates are automatically generated during `helm install` and `helm upgrade`.
**Zero configuration required!** The operator's built-in cert-controller generates and rotates certificates automatically at startup.
#### How It Works
1. **Pre-install/pre-upgrade hook**: Generates self-signed TLS certificates
- Root CA (valid 10 years)
- Server certificate (valid 10 years)
- Stores in Secret: `<release>-webhook-server-cert`
1. **Operator starts**: The `CertManager` checks for an existing certificate Secret (configured via `webhook.certificateSecret.name`, default: `webhook-server-cert`). If missing or invalid, it generates a self-signed Root CA and server certificate and writes them to the Secret.
2. **CA bundle injection**: The `CABundleInjector` reads `ca.crt` from the Secret and patches both the `ValidatingWebhookConfiguration` and `MutatingWebhookConfiguration` with the base64-encoded CA bundle.
2. **Post-install/post-upgrade hook**: Injects CA bundle into `ValidatingWebhookConfiguration`
- Reads `ca.crt` from Secret
- Patches `ValidatingWebhookConfiguration` with base64-encoded CA bundle
3. **Certificate rotation**: The cert-controller monitors certificate validity and regenerates certificates before they expire.
3. **Operator pod**: Mounts certificate secret and serves webhook on port 9443
4. **Webhook server starts**: The webhook server only begins serving after certificates are confirmed ready, preventing startup races.
#### Certificate Validity
- **Root CA**: 10 years
- **Server Certificate**: 10 years (same as Root CA)
- **Automatic rotation**: Certificates are re-generated on every `helm upgrade`
- **Automatic rotation**: The cert-controller monitors validity and regenerates before expiration
#### Smart Certificate Generation
#### Smart Certificate Management
The certificate generation hook is intelligent:
- **Checks existing certificates** before generating new ones
- **Skips generation** if valid certificates exist (valid for 30+ days with correct SANs)
The cert-controller is intelligent about certificate lifecycle:
- **Checks existing certificates** at startup before generating new ones
- **Skips generation** if valid certificates already exist in the Secret
- **Regenerates** only when needed (missing, expiring soon, or incorrect SANs)
This means:
- Fast `helm upgrade` operations (no unnecessary cert generation)
- Safe to run `helm upgrade` frequently
- Certificates persist across reinstalls (stored in Secret)
- Fast operator restarts (no unnecessary cert generation)
- No dependency on Helm hooks or external Jobs
- Certificates persist across pod restarts (stored in Secret)
#### Manual Certificate Rotation
If you need to rotate certificates manually:
```bash
# Delete the certificate secret
# Delete the certificate secret -- the operator will regenerate it on restart
kubectl delete secret <release>-webhook-server-cert -n <namespace>
# Upgrade the release to regenerate certificates
helm upgrade <release> dynamo-platform -n <namespace>
# Restart the operator pod to trigger regeneration
kubectl rollout restart deployment/<release>-dynamo-operator -n <namespace>
```
---
......@@ -274,12 +262,11 @@ dynamo-operator:
4. **cert-manager ca-injector**: Automatically injects CA bundle into `ValidatingWebhookConfiguration`
5. **Operator pod**: Mounts certificate secret and serves webhook
#### Benefits Over Automatic Mode
#### When to Use cert-manager
- **Automated rotation**: cert-manager renews certificates before expiration
- **Custom validity periods**: Configure certificate lifetime
- **CA rotation support**: ca-injector handles CA updates automatically
- **Custom validity periods**: Configure certificate lifetime to match organizational policy
- **Integration with existing PKI**: Use your organization's certificate infrastructure
- **Centralized certificate management**: Manage all cluster certificates through cert-manager
#### Certificate Rotation
......@@ -550,53 +537,45 @@ kubectl get secret -n <namespace> <release>-webhook-server-cert -o jsonpath='{.d
# - SAN includes: <service-name>.<namespace>.svc
```
4. **Check CA injection job logs**:
4. **Check operator logs for CA injection errors**:
```bash
kubectl logs -n <namespace> job/<release>-webhook-ca-inject-<revision>
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep -i "cert\|ca.*bundle\|inject"
```
---
### Helm Hook Job Failures
### Certificate Controller Errors
**Symptoms:**
- `helm install` or `helm upgrade` hangs or fails
- Certificate generation errors
- Operator logs show cert-controller errors
- Certificate Secret is not created
- CA bundle is not injected into webhook configurations
**Checks:**
1. **List hook jobs**:
1. **Check cert-controller logs**:
```bash
kubectl get jobs -n <namespace> | grep webhook
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep -i "cert-manager\|cert-rotation\|cert-controller"
```
2. **Check job logs**:
2. **Verify RBAC permissions**:
```bash
# Certificate generation
kubectl logs -n <namespace> job/<release>-webhook-cert-gen-<revision>
# CA injection
kubectl logs -n <namespace> job/<release>-webhook-ca-inject-<revision>
# The operator needs permissions to manage Secrets, ValidatingWebhookConfigurations,
# MutatingWebhookConfigurations, and CustomResourceDefinitions
kubectl auth can-i create secrets -n <namespace> --as=system:serviceaccount:<namespace>:<release>-dynamo-operator
kubectl auth can-i patch validatingwebhookconfigurations --as=system:serviceaccount:<namespace>:<release>-dynamo-operator
```
3. **Check RBAC permissions**:
3. **Check if the certificate Secret was created**:
```bash
# Verify ServiceAccount exists
kubectl get sa -n <namespace> <release>-webhook-ca-inject
# Verify ClusterRole and ClusterRoleBinding exist
kubectl get clusterrole <release>-webhook-ca-inject
kubectl get clusterrolebinding <release>-webhook-ca-inject
kubectl get secret -n <namespace> <release>-webhook-server-cert
```
4. **Manual cleanup**:
4. **Force certificate regeneration**:
```bash
# Delete failed jobs
kubectl delete job -n <namespace> <release>-webhook-cert-gen-<revision>
kubectl delete job -n <namespace> <release>-webhook-ca-inject-<revision>
# Retry helm upgrade
helm upgrade <release> dynamo-platform -n <namespace>
# Delete the certificate secret and restart the operator
kubectl delete secret <release>-webhook-server-cert -n <namespace>
kubectl rollout restart deployment/<release>-dynamo-operator -n <namespace>
```
---
......@@ -666,13 +645,13 @@ helm upgrade <release> dynamo-platform -n <namespace>
1. **Use `failurePolicy: Fail`** (default) to ensure validation is enforced
2. **Monitor webhook latency** - Validation adds ~10-50ms per resource operation
3. **Use cert-manager** for automated certificate lifecycle in large deployments
3. **Automatic certificates work well for production** - The built-in cert-controller handles generation and rotation; use cert-manager only if you need integration with organizational PKI
4. **Test webhook configuration** in staging before production
### Development Deployments
1. **Use `failurePolicy: Ignore`** if webhook availability is problematic during development
2. **Keep automatic certificates** (simpler than cert-manager for dev)
2. **Keep automatic certificates** (zero configuration, built into the operator)
### Multi-Tenant Deployments
......