Unverified Commit f385661a authored by Julien Mancuso's avatar Julien Mancuso Committed by GitHub
Browse files

chore: remove mechanism to disable webhooks (#6441)


Signed-off-by: default avatarJulien Mancuso <jmancuso@nvidia.com>
parent c82fe888
...@@ -19,7 +19,7 @@ limitations under the License. ...@@ -19,7 +19,7 @@ limitations under the License.
A Helm chart for NVIDIA Dynamo Platform. A Helm chart for NVIDIA Dynamo Platform.
![Version: 0.9.0](https://img.shields.io/badge/Version-0.9.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![Version: 1.0.0-dev](https://img.shields.io/badge/Version-1.0.0--dev-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square)
## 🚀 Overview ## 🚀 Overview
...@@ -37,6 +37,17 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Kubernetes Platform i ...@@ -37,6 +37,17 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Kubernetes Platform i
- Helm 3.8+ - Helm 3.8+
- Sufficient cluster resources for your deployment scale - Sufficient cluster resources for your deployment scale
- Container registry access (if using private images) - Container registry access (if using private images)
- TLS certificate infrastructure for admission webhooks (auto-generated via Helm hooks by default, or [cert-manager](https://cert-manager.io/), or externally managed)
## 🔄 Upgrading Notes
### Webhooks are now mandatory (v1.0.0+)
The `webhook.enabled` Helm value has been removed. Admission webhooks are now a required component of the operator and cannot be disabled. This change aligns with the upcoming addition of CRD conversion webhooks, which are mandatory for multi-version API support.
No action is required for most upgrades — Helm hooks automatically generate TLS certificates and inject the CA bundle during `helm upgrade`. If you use cert-manager or externally managed certificates, ensure your existing configuration is correct before upgrading.
---
## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment ## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment
...@@ -86,7 +97,7 @@ The chart includes built-in validation to prevent all operator conflicts: ...@@ -86,7 +97,7 @@ The chart includes built-in validation to prevent all operator conflicts:
| Repository | Name | Version | | Repository | Name | Version |
|------------|------|---------| |------------|------|---------|
| file://components/operator | dynamo-operator | 0.7.1 | | file://components/operator | dynamo-operator | 1.0.0-dev |
| https://charts.bitnami.com/bitnami | etcd | 12.0.18 | | https://charts.bitnami.com/bitnami | etcd | 12.0.18 |
| https://nats-io.github.io/k8s/helm/charts/ | nats | 1.3.2 | | https://nats-io.github.io/k8s/helm/charts/ | nats | 1.3.2 |
| oci://ghcr.io/ai-dynamo/grove | grove(grove-charts) | v0.1.0-alpha.6 | | oci://ghcr.io/ai-dynamo/grove | grove(grove-charts) | v0.1.0-alpha.6 |
...@@ -96,6 +107,7 @@ The chart includes built-in validation to prevent all operator conflicts: ...@@ -96,6 +107,7 @@ The chart includes built-in validation to prevent all operator conflicts:
| Key | Type | Default | Description | | Key | Type | Default | Description |
|-----|------|---------|-------------| |-----|------|---------|-------------|
| global.etcd.install | bool | `false` | Whether this chart should install the bundled etcd subchart. When true, deploys etcd and auto-configures the operator with its address. When false, etcd is not deployed. Use dynamo-operator.etcdAddr to point at an external instance if you are bringing your own etcd. |
| dynamo-operator.enabled | bool | `true` | Whether to enable the Dynamo Kubernetes operator deployment | | dynamo-operator.enabled | bool | `true` | Whether to enable the Dynamo Kubernetes operator deployment |
| dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" | | dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" |
| dynamo-operator.etcdAddr | string | `""` | etcd server address for an external etcd instance. Only needed when using external etcd without the bundled subchart. Format: "http://hostname:port" or "https://hostname:port" | | dynamo-operator.etcdAddr | string | `""` | etcd server address for an external etcd instance. Only needed when using external etcd without the bundled subchart. Format: "http://hostname:port" or "https://hostname:port" |
...@@ -104,6 +116,8 @@ The chart includes built-in validation to prevent all operator conflicts: ...@@ -104,6 +116,8 @@ The chart includes built-in validation to prevent all operator conflicts:
| dynamo-operator.namespaceRestriction | object | `{"enabled":false,"lease":{"duration":"30s","renewInterval":"10s"},"targetNamespace":null}` | Namespace access controls for the operator | | dynamo-operator.namespaceRestriction | object | `{"enabled":false,"lease":{"duration":"30s","renewInterval":"10s"},"targetNamespace":null}` | Namespace access controls for the operator |
| dynamo-operator.namespaceRestriction.enabled | bool | `false` | Whether to restrict operator to specific namespaces. By default, the operator will run with cluster-wide permissions. Only 1 instance of the operator should be deployed in the cluster. If you want to deploy multiple operator instances, you can set this to true and specify the target namespace (by default, the target namespace is the helm release namespace). | | dynamo-operator.namespaceRestriction.enabled | bool | `false` | Whether to restrict operator to specific namespaces. By default, the operator will run with cluster-wide permissions. Only 1 instance of the operator should be deployed in the cluster. If you want to deploy multiple operator instances, you can set this to true and specify the target namespace (by default, the target namespace is the helm release namespace). |
| dynamo-operator.namespaceRestriction.targetNamespace | string | `nil` | Target namespace for operator deployment (leave empty for current namespace) | | dynamo-operator.namespaceRestriction.targetNamespace | string | `nil` | Target namespace for operator deployment (leave empty for current namespace) |
| dynamo-operator.gpuDiscovery | object | `{"enabled":true}` | GPU discovery configuration (only applies when namespaceRestriction.enabled=true) |
| dynamo-operator.gpuDiscovery.enabled | bool | `true` | Whether to provision a ClusterRole for the namespace-scoped operator to read GPU node labels. When true (default), Helm creates a ClusterRole/ClusterRoleBinding granting node read access. Set to false if your installer lacks ClusterRole creation permissions. |
| dynamo-operator.controllerManager.tolerations | list | `[]` | Node tolerations for controller manager pods | | dynamo-operator.controllerManager.tolerations | list | `[]` | Node tolerations for controller manager pods |
| dynamo-operator.controllerManager.affinity | object | `{}` | Affinity for controller manager pods | | dynamo-operator.controllerManager.affinity | object | `{}` | Affinity for controller manager pods |
| dynamo-operator.controllerManager.leaderElection.id | string | `""` | Leader election ID for cluster-wide coordination. WARNING: All cluster-wide operators must use the SAME ID to prevent split-brain. Different IDs would allow multiple leaders simultaneously. | | dynamo-operator.controllerManager.leaderElection.id | string | `""` | Leader election ID for cluster-wide coordination. WARNING: All cluster-wide operators must use the SAME ID to prevent split-brain. Different IDs would allow multiple leaders simultaneously. |
...@@ -131,7 +145,6 @@ The chart includes built-in validation to prevent all operator conflicts: ...@@ -131,7 +145,6 @@ The chart includes built-in validation to prevent all operator conflicts:
| dynamo-operator.dynamo.metrics.prometheusEndpoint | string | `""` | Endpoint that services can use to retrieve metrics. If set, dynamo operator will automatically inject the PROMETHEUS_ENDPOINT environment variable into services it manages. Users can override the value of the PROMETHEUS_ENDPOINT environment variable by modifying the corresponding deployment's environment variables | | dynamo-operator.dynamo.metrics.prometheusEndpoint | string | `""` | Endpoint that services can use to retrieve metrics. If set, dynamo operator will automatically inject the PROMETHEUS_ENDPOINT environment variable into services it manages. Users can override the value of the PROMETHEUS_ENDPOINT environment variable by modifying the corresponding deployment's environment variables |
| dynamo-operator.dynamo.mpiRun.secretName | string | `"mpi-run-ssh-secret"` | Name of the secret containing the SSH key for MPI Run | | dynamo-operator.dynamo.mpiRun.secretName | string | `"mpi-run-ssh-secret"` | Name of the secret containing the SSH key for MPI Run |
| dynamo-operator.dynamo.mpiRun.sshKeygen.enabled | bool | `true` | Whether to enable SSH key generation for MPI Run | | dynamo-operator.dynamo.mpiRun.sshKeygen.enabled | bool | `true` | Whether to enable SSH key generation for MPI Run |
| dynamo-operator.webhook.enabled | bool | `true` | Whether to enable admission webhooks for resource validation. When enabled, the operator will validate DynamoComponentDeployment and DynamoGraphDeployment resources before they are created or updated in the cluster. Enabled by default for production-ready validation and better error reporting. |
| dynamo-operator.webhook.certificateSecret.name | string | `"webhook-server-cert"` | Name of the Kubernetes secret containing webhook TLS certificates. The secret must contain three keys: tls.crt (server certificate), tls.key (server private key), and ca.crt (Certificate Authority certificate). | | dynamo-operator.webhook.certificateSecret.name | string | `"webhook-server-cert"` | Name of the Kubernetes secret containing webhook TLS certificates. The secret must contain three keys: tls.crt (server certificate), tls.key (server private key), and ca.crt (Certificate Authority certificate). |
| dynamo-operator.webhook.certificateSecret.external | bool | `false` | Whether to manage the certificate secret externally. When false (default), certificates are automatically generated via Helm hooks during installation. When true, you must create the secret manually before installing the chart. | | dynamo-operator.webhook.certificateSecret.external | bool | `false` | Whether to manage the certificate secret externally. When false (default), certificates are automatically generated via Helm hooks during installation. When true, you must create the secret manually before installing the chart. |
| dynamo-operator.webhook.certificateValidity | int | `365` | Certificate validity duration in days for auto-generated certificates. Only used when certManager.enabled=false and certificateSecret.external=false. After this duration, certificates will expire and need to be regenerated. | | dynamo-operator.webhook.certificateValidity | int | `365` | Certificate validity duration in days for auto-generated certificates. Only used when certManager.enabled=false and certificateSecret.external=false. After this duration, certificates will expire and need to be regenerated. |
...@@ -148,6 +161,9 @@ The chart includes built-in validation to prevent all operator conflicts: ...@@ -148,6 +161,9 @@ The chart includes built-in validation to prevent all operator conflicts:
| dynamo-operator.webhook.certManager.certificate.rootCA.duration | string | `"87600h"` | Duration for the root CA certificate (e.g., "87600h" for 10 years). The root CA typically has a much longer lifetime than the leaf certificates it signs. | | dynamo-operator.webhook.certManager.certificate.rootCA.duration | string | `"87600h"` | Duration for the root CA certificate (e.g., "87600h" for 10 years). The root CA typically has a much longer lifetime than the leaf certificates it signs. |
| dynamo-operator.webhook.certManager.certificate.rootCA.renewBefore | string | `"720h"` | Time before root CA expiration to trigger renewal (e.g., "720h" for 30 days). Renewing a CA can be disruptive as all signed certificates must be reissued. | | dynamo-operator.webhook.certManager.certificate.rootCA.renewBefore | string | `"720h"` | Time before root CA expiration to trigger renewal (e.g., "720h" for 30 days). Renewing a CA can be disruptive as all signed certificates must be reissued. |
| dynamo-operator.checkpoint.enabled | bool | `false` | Whether to enable checkpoint/restore functionality | | dynamo-operator.checkpoint.enabled | bool | `false` | Whether to enable checkpoint/restore functionality |
| dynamo-operator.checkpoint.initContainerImage | string | `"busybox:latest"` | Image used for init containers in checkpoint jobs (e.g., signal file cleanup) |
| dynamo-operator.checkpoint.readyForCheckpointFilePath | string | `"/tmp/ready-for-checkpoint"` | Path written by worker when model is loaded and ready for checkpointing |
| dynamo-operator.checkpoint.restoreMarkerFilePath | string | `"/tmp/dynamo-restored"` | Path written by restore-entrypoint after successful CRIU restore |
| dynamo-operator.checkpoint.storage.type | string | `"pvc"` | Storage backend type: pvc, s3, or oci | | dynamo-operator.checkpoint.storage.type | string | `"pvc"` | Storage backend type: pvc, s3, or oci |
| dynamo-operator.checkpoint.storage.signalHostPath | string | `"/var/lib/chrek/signals"` | Host path for signal files (communication between checkpoint pod and DaemonSet) | | dynamo-operator.checkpoint.storage.signalHostPath | string | `"/var/lib/chrek/signals"` | Host path for signal files (communication between checkpoint pod and DaemonSet) |
| dynamo-operator.checkpoint.storage.pvc.pvcName | string | `"chrek-pvc"` | Name of the PVC created by the chrek chart | | dynamo-operator.checkpoint.storage.pvc.pvcName | string | `"chrek-pvc"` | Name of the PVC created by the chrek chart |
...@@ -156,14 +172,12 @@ The chart includes built-in validation to prevent all operator conflicts: ...@@ -156,14 +172,12 @@ The chart includes built-in validation to prevent all operator conflicts:
| dynamo-operator.checkpoint.storage.s3.credentialsSecretRef | string | `""` | Reference to a secret containing AWS credentials | | dynamo-operator.checkpoint.storage.s3.credentialsSecretRef | string | `""` | Reference to a secret containing AWS credentials |
| dynamo-operator.checkpoint.storage.oci.uri | string | `""` | OCI URI in format: oci://registry/repository | | dynamo-operator.checkpoint.storage.oci.uri | string | `""` | OCI URI in format: oci://registry/repository |
| dynamo-operator.checkpoint.storage.oci.credentialsSecretRef | string | `""` | Reference to a docker config secret for registry authentication | | dynamo-operator.checkpoint.storage.oci.credentialsSecretRef | string | `""` | Reference to a docker config secret for registry authentication |
| dynamo-operator.checkpoint.criu.timeout | string | `"21600"` | CRIU operation timeout in seconds. Default: 21600 (6 hours) |
| grove.enabled | bool | `false` | Whether to enable Grove for multi-node inference coordination, if enabled, the Grove operator will be deployed cluster-wide | | grove.enabled | bool | `false` | Whether to enable Grove for multi-node inference coordination, if enabled, the Grove operator will be deployed cluster-wide |
| grove.tolerations | list | `[]` | Node tolerations for Grove pods | | grove.tolerations | list | `[]` | Node tolerations for Grove pods |
| grove.affinity | object | `{}` | Affinity for Grove pods | | grove.affinity | object | `{}` | Affinity for Grove pods |
| kai-scheduler.enabled | bool | `false` | Whether to enable Kai Scheduler for intelligent resource allocation, if enabled, the Kai Scheduler operator will be deployed cluster-wide | | kai-scheduler.enabled | bool | `false` | Whether to enable Kai Scheduler for intelligent resource allocation, if enabled, the Kai Scheduler operator will be deployed cluster-wide |
| kai-scheduler.global.tolerations | list | `[]` | Node tolerations for kai-scheduler pods | | kai-scheduler.global.tolerations | list | `[]` | Node tolerations for kai-scheduler pods |
| kai-scheduler.global.affinity | object | `{}` | Affinity for kai-scheduler pods | | kai-scheduler.global.affinity | object | `{}` | Affinity for kai-scheduler pods |
| global.etcd.install | bool | `false` | Whether to install the bundled etcd subchart. When true, deploys etcd and auto-configures the operator with its address. Use dynamo-operator.etcdAddr to point at an external instance instead. |
| etcd.image.repository | string | `"bitnamilegacy/etcd"` | following bitnami announcement for brownout - https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog, we need to use the legacy repository until we migrate to the new "secure" repository | | etcd.image.repository | string | `"bitnamilegacy/etcd"` | following bitnami announcement for brownout - https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog, we need to use the legacy repository until we migrate to the new "secure" repository |
| nats.enabled | bool | `true` | Whether to enable NATS deployment, disable if you want to use an external NATS instance. For complete configuration options, see: https://github.com/nats-io/k8s/tree/main/helm/charts/nats , all nats settings should be prefixed with "nats." | | nats.enabled | bool | `true` | Whether to enable NATS deployment, disable if you want to use an external NATS instance. For complete configuration options, see: https://github.com/nats-io/k8s/tree/main/helm/charts/nats , all nats settings should be prefixed with "nats." |
......
...@@ -37,6 +37,17 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Kubernetes Platform i ...@@ -37,6 +37,17 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Kubernetes Platform i
- Helm 3.8+ - Helm 3.8+
- Sufficient cluster resources for your deployment scale - Sufficient cluster resources for your deployment scale
- Container registry access (if using private images) - Container registry access (if using private images)
- TLS certificate infrastructure for admission webhooks (auto-generated via Helm hooks by default, or [cert-manager](https://cert-manager.io/), or externally managed)
## 🔄 Upgrading Notes
### Webhooks are now mandatory (v1.0.0+)
The `webhook.enabled` Helm value has been removed. Admission webhooks are now a required component of the operator and cannot be disabled. This change aligns with the upcoming addition of CRD conversion webhooks, which are mandatory for multi-version API support.
No action is required for most upgrades — Helm hooks automatically generate TLS certificates and inject the CA bundle during `helm upgrade`. If you use cert-manager or externally managed certificates, ensure your existing configuration is correct before upgrading.
---
## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment ## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment
......
...@@ -145,9 +145,6 @@ spec: ...@@ -145,9 +145,6 @@ spec:
- --gpu-discovery-enabled={{ .Values.gpuDiscovery.enabled }} - --gpu-discovery-enabled={{ .Values.gpuDiscovery.enabled }}
{{- end }} {{- end }}
- --operator-version={{ .Chart.AppVersion }} - --operator-version={{ .Chart.AppVersion }}
{{- if .Values.webhook.enabled }}
- --enable-webhooks=true
{{- end }}
{{- if .Values.checkpoint.enabled }} {{- if .Values.checkpoint.enabled }}
- --checkpoint-enabled=true - --checkpoint-enabled=true
- --checkpoint-storage-type={{ .Values.checkpoint.storage.type }} - --checkpoint-storage-type={{ .Values.checkpoint.storage.type }}
...@@ -187,12 +184,10 @@ spec: ...@@ -187,12 +184,10 @@ spec:
initialDelaySeconds: 15 initialDelaySeconds: 15
periodSeconds: 20 periodSeconds: 20
name: manager name: manager
{{- if .Values.webhook.enabled }}
ports: ports:
- containerPort: 9443 - containerPort: 9443
name: webhook-server name: webhook-server
protocol: TCP protocol: TCP
{{- end }}
readinessProbe: readinessProbe:
httpGet: httpGet:
path: /readyz path: /readyz
...@@ -203,20 +198,16 @@ spec: ...@@ -203,20 +198,16 @@ spec:
10 }} 10 }}
securityContext: {{- toYaml .Values.controllerManager.manager.containerSecurityContext securityContext: {{- toYaml .Values.controllerManager.manager.containerSecurityContext
| nindent 10 }} | nindent 10 }}
{{- if .Values.webhook.enabled }}
volumeMounts: volumeMounts:
- mountPath: /tmp/k8s-webhook-server/serving-certs - mountPath: /tmp/k8s-webhook-server/serving-certs
name: cert name: cert
readOnly: true readOnly: true
{{- end }}
securityContext: securityContext:
runAsNonRoot: true runAsNonRoot: true
serviceAccountName: {{ include "dynamo-operator.fullname" . }}-controller-manager serviceAccountName: {{ include "dynamo-operator.fullname" . }}-controller-manager
terminationGracePeriodSeconds: 30 terminationGracePeriodSeconds: 30
{{- if .Values.webhook.enabled }}
volumes: volumes:
- name: cert - name: cert
secret: secret:
defaultMode: 420 defaultMode: 420
secretName: {{ .Values.webhook.certificateSecret.name }} secretName: {{ .Values.webhook.certificateSecret.name }}
{{- end }}
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
{{- if and .Values.webhook.enabled (not .Values.webhook.certManager.enabled) (not .Values.webhook.certificateSecret.external) }} {{- if and (not .Values.webhook.certManager.enabled) (not .Values.webhook.certificateSecret.external) }}
--- ---
# ServiceAccount for CA bundle injection job # ServiceAccount for CA bundle injection job
apiVersion: v1 apiVersion: v1
......
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
{{- if and .Values.webhook.enabled (not .Values.webhook.certManager.enabled) (not .Values.webhook.certificateSecret.external) }} {{- if and (not .Values.webhook.certManager.enabled) (not .Values.webhook.certificateSecret.external) }}
--- ---
# ServiceAccount for certificate generation job # ServiceAccount for certificate generation job
apiVersion: v1 apiVersion: v1
......
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
{{- if and .Values.webhook.enabled .Values.webhook.certManager.enabled }} {{- if .Values.webhook.certManager.enabled }}
--- ---
# Self-signed issuer to bootstrap the CA # Self-signed issuer to bootstrap the CA
apiVersion: cert-manager.io/v1 apiVersion: cert-manager.io/v1
......
...@@ -13,7 +13,6 @@ ...@@ -13,7 +13,6 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
{{- if .Values.webhook.enabled }}
--- ---
apiVersion: admissionregistration.k8s.io/v1 apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration kind: ValidatingWebhookConfiguration
...@@ -221,5 +220,4 @@ webhooks: ...@@ -221,5 +220,4 @@ webhooks:
- dynamographdeployments - dynamographdeployments
sideEffects: None sideEffects: None
timeoutSeconds: {{ .Values.webhook.timeoutSeconds }} timeoutSeconds: {{ .Values.webhook.timeoutSeconds }}
{{- end }}
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
{{- if and .Values.webhook.enabled .Values.webhook.certManager.enabled }} {{- if .Values.webhook.certManager.enabled }}
--- ---
# ServiceAccount for the CRD conversion patch job # ServiceAccount for the CRD conversion patch job
apiVersion: v1 apiVersion: v1
......
...@@ -13,7 +13,6 @@ ...@@ -13,7 +13,6 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
{{- if .Values.webhook.enabled }}
--- ---
# Role to read the webhook certificate secret # Role to read the webhook certificate secret
apiVersion: rbac.authorization.k8s.io/v1 apiVersion: rbac.authorization.k8s.io/v1
...@@ -56,5 +55,4 @@ subjects: ...@@ -56,5 +55,4 @@ subjects:
- kind: ServiceAccount - kind: ServiceAccount
name: {{ include "dynamo-operator.fullname" . }}-controller-manager name: {{ include "dynamo-operator.fullname" . }}-controller-manager
namespace: {{ .Release.Namespace }} namespace: {{ .Release.Namespace }}
{{- end }}
...@@ -13,7 +13,6 @@ ...@@ -13,7 +13,6 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
{{- if .Values.webhook.enabled }}
--- ---
apiVersion: v1 apiVersion: v1
kind: Service kind: Service
...@@ -34,5 +33,4 @@ spec: ...@@ -34,5 +33,4 @@ spec:
selector: selector:
control-plane: controller-manager control-plane: controller-manager
{{- include "dynamo-operator.selectorLabels" . | nindent 4 }} {{- include "dynamo-operator.selectorLabels" . | nindent 4 }}
{{- end }}
...@@ -208,10 +208,6 @@ checkpoint: ...@@ -208,10 +208,6 @@ checkpoint:
# Webhook configuration # Webhook configuration
webhook: webhook:
# Enable admission webhooks for validation
# Enabled by default for production-ready validation and better error reporting
enabled: true
# Certificate configuration # Certificate configuration
certificateSecret: certificateSecret:
# Name of the secret containing webhook TLS certificates # Name of the secret containing webhook TLS certificates
......
...@@ -156,9 +156,6 @@ dynamo-operator: ...@@ -156,9 +156,6 @@ dynamo-operator:
# Webhook configuration for admission control and validation # Webhook configuration for admission control and validation
webhook: webhook:
# -- Whether to enable admission webhooks for resource validation. When enabled, the operator will validate DynamoComponentDeployment and DynamoGraphDeployment resources before they are created or updated in the cluster. Enabled by default for production-ready validation and better error reporting.
enabled: true
# Certificate configuration for webhook TLS # Certificate configuration for webhook TLS
certificateSecret: certificateSecret:
# -- Name of the Kubernetes secret containing webhook TLS certificates. The secret must contain three keys: tls.crt (server certificate), tls.key (server private key), and ca.crt (Certificate Authority certificate). # -- Name of the Kubernetes secret containing webhook TLS certificates. The secret must contain three keys: tls.crt (server certificate), tls.key (server private key), and ca.crt (Certificate Authority certificate).
......
...@@ -157,7 +157,6 @@ func main() { ...@@ -157,7 +157,6 @@ func main() {
var namespaceScopeLeaseRenewInterval time.Duration var namespaceScopeLeaseRenewInterval time.Duration
var operatorVersion string var operatorVersion string
var discoveryBackend string var discoveryBackend string
var enableWebhooks bool
var gpuDiscoveryEnabled bool var gpuDiscoveryEnabled bool
// Checkpoint configuration // Checkpoint configuration
var checkpointEnabled bool var checkpointEnabled bool
...@@ -181,9 +180,6 @@ func main() { ...@@ -181,9 +180,6 @@ func main() {
"If set the metrics endpoint is served securely") "If set the metrics endpoint is served securely")
flag.BoolVar(&enableHTTP2, "enable-http2", false, flag.BoolVar(&enableHTTP2, "enable-http2", false,
"If set, HTTP/2 will be enabled for the metrics and webhook servers") "If set, HTTP/2 will be enabled for the metrics and webhook servers")
flag.BoolVar(&enableWebhooks, "enable-webhooks", false,
"Enable admission webhooks for validation. When enabled, controllers skip validation "+
"(webhooks handle it). When disabled, controllers perform validation.")
flag.BoolVar(&gpuDiscoveryEnabled, "gpu-discovery-enabled", true, flag.BoolVar(&gpuDiscoveryEnabled, "gpu-discovery-enabled", true,
"Whether GPU discovery is enabled for namespace-scoped operators. When true (default), "+ "Whether GPU discovery is enabled for namespace-scoped operators. When true (default), "+
"the Helm chart has provisioned a ClusterRole granting node read access for GPU hardware discovery.") "the Helm chart has provisioned a ClusterRole granting node read access for GPU hardware discovery.")
...@@ -692,80 +688,68 @@ func main() { ...@@ -692,80 +688,68 @@ func main() {
os.Exit(1) os.Exit(1)
} }
// Set webhooks enabled flag in config
ctrlConfig.WebhooksEnabled = enableWebhooks
ctrlConfig.GPUDiscoveryEnabled = gpuDiscoveryEnabled ctrlConfig.GPUDiscoveryEnabled = gpuDiscoveryEnabled
if enableWebhooks { // Configure webhooks with lease-based namespace exclusion
setupLog.Info("Webhooks are enabled - webhooks will validate, controllers will skip validation")
} else {
setupLog.Info("Webhooks are disabled - controllers will validate (defense in depth)")
}
// Configure webhooks with lease-based namespace exclusion (only if enabled)
// In cluster-wide mode, inject ctrlConfig.ExcludedNamespaces (leaseWatcher) so webhooks can defer // In cluster-wide mode, inject ctrlConfig.ExcludedNamespaces (leaseWatcher) so webhooks can defer
// to namespace-restricted operators. In namespace-restricted mode, webhooks validate without checking // to namespace-restricted operators. In namespace-restricted mode, webhooks validate without checking
// leases (ExcludedNamespaces is nil). The webhooks use LeaseAwareValidator wrapper to add coordination. // leases (ExcludedNamespaces is nil). The webhooks use LeaseAwareValidator wrapper to add coordination.
if enableWebhooks { isClusterWide := ctrlConfig.RestrictedNamespace == ""
if ctrlConfig.RestrictedNamespace == "" { if isClusterWide {
// Cluster-wide mode: inject the same ExcludedNamespaces used by controllers setupLog.Info("Configuring webhooks with lease-based namespace exclusion for cluster-wide mode")
setupLog.Info("Configuring webhooks with lease-based namespace exclusion for cluster-wide mode") internalwebhook.SetExcludedNamespaces(ctrlConfig.ExcludedNamespaces)
internalwebhook.SetExcludedNamespaces(ctrlConfig.ExcludedNamespaces) } else {
} else { setupLog.Info("Configuring webhooks for namespace-restricted mode (no lease checking)",
// Namespace-restricted mode: no exclusion checking needed (validators not wrapped) "restrictedNamespace", ctrlConfig.RestrictedNamespace)
setupLog.Info("Configuring webhooks for namespace-restricted mode (no lease checking)", internalwebhook.SetExcludedNamespaces(nil)
"restrictedNamespace", ctrlConfig.RestrictedNamespace) }
internalwebhook.SetExcludedNamespaces(nil)
}
// Register validation webhook handlers
setupLog.Info("Registering validation webhooks")
dcdHandler := webhookvalidation.NewDynamoComponentDeploymentHandler() // Register validation webhook handlers
if err = dcdHandler.RegisterWithManager(mgr); err != nil { setupLog.Info("Registering validation webhooks")
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoComponentDeployment")
os.Exit(1)
}
dgdHandler := webhookvalidation.NewDynamoGraphDeploymentHandler(mgr) dcdHandler := webhookvalidation.NewDynamoComponentDeploymentHandler()
if err = dgdHandler.RegisterWithManager(mgr); err != nil { if err = dcdHandler.RegisterWithManager(mgr); err != nil {
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoGraphDeployment") setupLog.Error(err, "unable to register webhook", "webhook", "DynamoComponentDeployment")
os.Exit(1) os.Exit(1)
} }
dmHandler := webhookvalidation.NewDynamoModelHandler() dgdHandler := webhookvalidation.NewDynamoGraphDeploymentHandler(mgr)
if err = dmHandler.RegisterWithManager(mgr); err != nil { if err = dgdHandler.RegisterWithManager(mgr); err != nil {
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoModel") setupLog.Error(err, "unable to register webhook", "webhook", "DynamoGraphDeployment")
os.Exit(1) os.Exit(1)
} }
isClusterWide := ctrlConfig.RestrictedNamespace == "" dmHandler := webhookvalidation.NewDynamoModelHandler()
dgdrHandler := webhookvalidation.NewDynamoGraphDeploymentRequestHandler(isClusterWide, gpuDiscoveryEnabled) if err = dmHandler.RegisterWithManager(mgr); err != nil {
if err = dgdrHandler.RegisterWithManager(mgr); err != nil { setupLog.Error(err, "unable to register webhook", "webhook", "DynamoModel")
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoGraphDeploymentRequest") os.Exit(1)
os.Exit(1) }
}
if err = ctrl.NewWebhookManagedBy(mgr). dgdrHandler := webhookvalidation.NewDynamoGraphDeploymentRequestHandler(isClusterWide, gpuDiscoveryEnabled)
For(&nvidiacomv1alpha1.DynamoGraphDeploymentRequest{}). if err = dgdrHandler.RegisterWithManager(mgr); err != nil {
Complete(); err != nil { setupLog.Error(err, "unable to register webhook", "webhook", "DynamoGraphDeploymentRequest")
setupLog.Error(err, "unable to register conversion webhook", "webhook", "DynamoGraphDeploymentRequest-conversion") os.Exit(1)
os.Exit(1) }
}
setupLog.Info("Validation webhooks registered successfully") if err = ctrl.NewWebhookManagedBy(mgr).
For(&nvidiacomv1alpha1.DynamoGraphDeploymentRequest{}).
Complete(); err != nil {
setupLog.Error(err, "unable to register conversion webhook", "webhook", "DynamoGraphDeploymentRequest-conversion")
os.Exit(1)
}
// Register defaulting (mutating) webhook handlers setupLog.Info("Validation webhooks registered successfully")
setupLog.Info("Registering defaulting webhooks")
dgdDefaulter := webhookdefaulting.NewDGDDefaulter(operatorVersion) // Register defaulting (mutating) webhook handlers
if err = dgdDefaulter.RegisterWithManager(mgr); err != nil { setupLog.Info("Registering defaulting webhooks")
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoGraphDeployment-defaulting")
os.Exit(1)
}
setupLog.Info("Defaulting webhooks registered successfully") dgdDefaulter := webhookdefaulting.NewDGDDefaulter(operatorVersion)
if err = dgdDefaulter.RegisterWithManager(mgr); err != nil {
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoGraphDeployment-defaulting")
os.Exit(1)
} }
setupLog.Info("Defaulting webhooks registered successfully")
//+kubebuilder:scaffold:builder //+kubebuilder:scaffold:builder
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil { if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
......
...@@ -43,7 +43,6 @@ import ( ...@@ -43,7 +43,6 @@ import (
"github.com/ai-dynamo/dynamo/deploy/operator/internal/dynamo" "github.com/ai-dynamo/dynamo/deploy/operator/internal/dynamo"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/modelendpoint" "github.com/ai-dynamo/dynamo/deploy/operator/internal/modelendpoint"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/observability" "github.com/ai-dynamo/dynamo/deploy/operator/internal/observability"
webhookvalidation "github.com/ai-dynamo/dynamo/deploy/operator/internal/webhook/validation"
) )
const ( const (
...@@ -98,45 +97,6 @@ func (r *DynamoModelReconciler) Reconcile(ctx context.Context, req ctrl.Request) ...@@ -98,45 +97,6 @@ func (r *DynamoModelReconciler) Reconcile(ctx context.Context, req ctrl.Request)
logs = logs.WithValues("dynamoModel", model.Name, "namespace", model.Namespace, "baseModelName", model.Spec.BaseModelName) logs = logs.WithValues("dynamoModel", model.Name, "namespace", model.Namespace, "baseModelName", model.Spec.BaseModelName)
logs.Info("Reconciling DynamoModel") logs.Info("Reconciling DynamoModel")
// Validate the DynamoModel spec (defense in depth - only when webhooks are disabled)
if !r.Config.WebhooksEnabled {
validator := webhookvalidation.NewDynamoModelValidator(model)
if _, err := validator.Validate(); err != nil {
logs.Error(err, "DynamoModel validation failed, refusing to reconcile")
// Set validation error condition
meta.SetStatusCondition(&model.Status.Conditions, metav1.Condition{
Type: "Valid",
Status: metav1.ConditionFalse,
ObservedGeneration: model.Generation,
Reason: "ValidationFailed",
Message: fmt.Sprintf("Validation failed: %v", err),
})
// Update status and don't requeue (user must fix the spec)
if statusErr := r.Status().Update(ctx, model); statusErr != nil {
logs.Error(statusErr, "Failed to update DynamoModel status with validation error")
return ctrl.Result{}, statusErr
}
// Record event for visibility
r.Recorder.Event(model, corev1.EventTypeWarning, "ValidationFailed", err.Error())
// Don't requeue - user must fix the spec
logs.Info("DynamoModel is invalid, not reconciling until spec is fixed")
return ctrl.Result{}, nil
}
// Set Valid condition to True
meta.SetStatusCondition(&model.Status.Conditions, metav1.Condition{
Type: "Valid",
Status: metav1.ConditionTrue,
ObservedGeneration: model.Generation,
Reason: "ValidationPassed",
Message: "DynamoModel spec is valid",
})
}
// Handle finalizer using common handler // Handle finalizer using common handler
finalized, err := commoncontroller.HandleFinalizer(ctx, model, r.Client, r) finalized, err := commoncontroller.HandleFinalizer(ctx, model, r.Client, r)
if err != nil { if err != nil {
......
...@@ -40,7 +40,6 @@ import ( ...@@ -40,7 +40,6 @@ import (
commonController "github.com/ai-dynamo/dynamo/deploy/operator/internal/controller_common" commonController "github.com/ai-dynamo/dynamo/deploy/operator/internal/controller_common"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/dynamo" "github.com/ai-dynamo/dynamo/deploy/operator/internal/dynamo"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/observability" "github.com/ai-dynamo/dynamo/deploy/operator/internal/observability"
webhookvalidation "github.com/ai-dynamo/dynamo/deploy/operator/internal/webhook/validation"
networkingv1beta1 "istio.io/client-go/pkg/apis/networking/v1beta1" networkingv1beta1 "istio.io/client-go/pkg/apis/networking/v1beta1"
k8serrors "k8s.io/apimachinery/pkg/api/errors" k8serrors "k8s.io/apimachinery/pkg/api/errors"
"k8s.io/apimachinery/pkg/api/meta" "k8s.io/apimachinery/pkg/api/meta"
...@@ -146,52 +145,6 @@ func (r *DynamoComponentDeploymentReconciler) Reconcile(ctx context.Context, req ...@@ -146,52 +145,6 @@ func (r *DynamoComponentDeploymentReconciler) Reconcile(ctx context.Context, req
} }
}() }()
// Validate the DynamoComponentDeployment spec (defense in depth - only when webhooks are disabled)
if !r.Config.WebhooksEnabled {
validator := webhookvalidation.NewDynamoComponentDeploymentValidator(dynamoComponentDeployment)
if _, validationErr := validator.Validate(ctx); validationErr != nil {
logs.Error(validationErr, "DynamoComponentDeployment validation failed, refusing to reconcile")
// Set validation error condition
meta.SetStatusCondition(&dynamoComponentDeployment.Status.Conditions, metav1.Condition{
Type: "Valid",
Status: metav1.ConditionFalse,
ObservedGeneration: dynamoComponentDeployment.Generation,
Reason: "ValidationFailed",
Message: fmt.Sprintf("Validation failed: %v", validationErr),
})
// Update status and don't requeue (user must fix the spec)
if statusErr := r.Status().Update(ctx, dynamoComponentDeployment); statusErr != nil {
logs.Error(statusErr, "Failed to update DynamoComponentDeployment status with validation error")
err = statusErr
return ctrl.Result{}, err
}
// Record event for visibility
r.Recorder.Event(dynamoComponentDeployment, corev1.EventTypeWarning, "ValidationFailed", validationErr.Error())
// Don't requeue - user must fix the spec
logs.Info("DynamoComponentDeployment is invalid, not reconciling until spec is fixed")
err = nil
return ctrl.Result{}, nil
}
// Set Valid condition to True and persist it
meta.SetStatusCondition(&dynamoComponentDeployment.Status.Conditions, metav1.Condition{
Type: "Valid",
Status: metav1.ConditionTrue,
ObservedGeneration: dynamoComponentDeployment.Generation,
Reason: "ValidationPassed",
Message: "DynamoComponentDeployment spec is valid",
})
if statusErr := r.Status().Update(ctx, dynamoComponentDeployment); statusErr != nil {
logs.Error(statusErr, "Failed to update DynamoComponentDeployment status with validation success")
err = statusErr
return ctrl.Result{}, err
}
}
deleted, err := commonController.HandleFinalizer(ctx, dynamoComponentDeployment, r.Client, r) deleted, err := commonController.HandleFinalizer(ctx, dynamoComponentDeployment, r.Client, r)
if err != nil { if err != nil {
logs.Error(err, "Failed to handle finalizer") logs.Error(err, "Failed to handle finalizer")
......
...@@ -53,7 +53,6 @@ import ( ...@@ -53,7 +53,6 @@ import (
"github.com/ai-dynamo/dynamo/deploy/operator/internal/dynamo" "github.com/ai-dynamo/dynamo/deploy/operator/internal/dynamo"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/dynamo/epp" "github.com/ai-dynamo/dynamo/deploy/operator/internal/dynamo/epp"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/observability" "github.com/ai-dynamo/dynamo/deploy/operator/internal/observability"
webhookvalidation "github.com/ai-dynamo/dynamo/deploy/operator/internal/webhook/validation"
rbacv1 "k8s.io/api/rbac/v1" rbacv1 "k8s.io/api/rbac/v1"
gaiev1 "sigs.k8s.io/gateway-api-inference-extension/api/v1" gaiev1 "sigs.k8s.io/gateway-api-inference-extension/api/v1"
) )
...@@ -164,28 +163,6 @@ func (r *DynamoGraphDeploymentReconciler) Reconcile(ctx context.Context, req ctr ...@@ -164,28 +163,6 @@ func (r *DynamoGraphDeploymentReconciler) Reconcile(ctx context.Context, req ctr
return ctrl.Result{}, nil return ctrl.Result{}, nil
} }
// Validate the DynamoGraphDeployment spec (defense in depth - only when webhooks are disabled)
if !r.Config.WebhooksEnabled {
validator := webhookvalidation.NewDynamoGraphDeploymentValidator(dynamoDeployment)
if _, validationErr := validator.Validate(ctx); validationErr != nil {
logger.Error(validationErr, "DynamoGraphDeployment validation failed, refusing to reconcile")
// Set validation error state and reason (defer will update status)
state = nvidiacomv1alpha1.DGDStateFailed
reason = Reason("ValidationFailed")
message = Message(fmt.Sprintf("Validation failed: %v", validationErr))
// Record event for visibility
r.Recorder.Event(dynamoDeployment, corev1.EventTypeWarning, "ValidationFailed", validationErr.Error())
// Don't requeue - user must fix the spec
logger.Info("DynamoGraphDeployment is invalid, not reconciling until spec is fixed")
// Return without error so defer updates status but doesn't requeue
return ctrl.Result{}, nil
}
}
if r.supportsManagedRollingUpdate(dynamoDeployment) { if r.supportsManagedRollingUpdate(dynamoDeployment) {
if err = r.initializeWorkerHashIfNeeded(ctx, dynamoDeployment); err != nil { if err = r.initializeWorkerHashIfNeeded(ctx, dynamoDeployment); err != nil {
logger.Error(err, "Failed to initialize worker hash") logger.Error(err, "Failed to initialize worker hash")
......
...@@ -49,7 +49,6 @@ import ( ...@@ -49,7 +49,6 @@ import (
commonController "github.com/ai-dynamo/dynamo/deploy/operator/internal/controller_common" commonController "github.com/ai-dynamo/dynamo/deploy/operator/internal/controller_common"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/gpu" "github.com/ai-dynamo/dynamo/deploy/operator/internal/gpu"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/observability" "github.com/ai-dynamo/dynamo/deploy/operator/internal/observability"
webhookvalidation "github.com/ai-dynamo/dynamo/deploy/operator/internal/webhook/validation"
) )
const ( const (
...@@ -890,24 +889,6 @@ func isOnlineProfiling(dgdr *nvidiacomv1alpha1.DynamoGraphDeploymentRequest) boo ...@@ -890,24 +889,6 @@ func isOnlineProfiling(dgdr *nvidiacomv1alpha1.DynamoGraphDeploymentRequest) boo
// validateSpec validates the DGDR spec // validateSpec validates the DGDR spec
func (r *DynamoGraphDeploymentRequestReconciler) validateSpec(ctx context.Context, dgdr *nvidiacomv1alpha1.DynamoGraphDeploymentRequest) error { func (r *DynamoGraphDeploymentRequestReconciler) validateSpec(ctx context.Context, dgdr *nvidiacomv1alpha1.DynamoGraphDeploymentRequest) error {
// Use the validator for simple validation (defense in depth - only when webhooks are disabled)
if !r.Config.WebhooksEnabled {
isClusterWide := r.Config.RestrictedNamespace == ""
validator := webhookvalidation.NewDynamoGraphDeploymentRequestValidator(dgdr, isClusterWide, r.Config.GPUDiscoveryEnabled)
warnings, err := validator.Validate()
if err != nil {
return err
}
// Log warnings if any
if len(warnings) > 0 {
logger := log.FromContext(ctx)
for _, warning := range warnings {
logger.Info("Validation warning", "warning", warning)
}
}
}
// Validate ConfigMap if provided (for the DGD base config) // Validate ConfigMap if provided (for the DGD base config)
// This requires cluster access and cannot be done in the stateless validator // This requires cluster access and cannot be done in the stateless validator
if dgdr.Spec.ProfilingConfig.ConfigMapRef != nil { if dgdr.Spec.ProfilingConfig.ConfigMapRef != nil {
......
...@@ -80,11 +80,6 @@ type Config struct { ...@@ -80,11 +80,6 @@ type Config struct {
// DiscoveryBackend is the discovery backend to use. Default is "kubernetes" for Kubernetes API service discovery. Set to "etcd" to use ETCD for discovery. // DiscoveryBackend is the discovery backend to use. Default is "kubernetes" for Kubernetes API service discovery. Set to "etcd" to use ETCD for discovery.
DiscoveryBackend string DiscoveryBackend string
// WebhooksEnabled indicates whether admission webhooks are enabled
// When true, controllers skip validation (webhooks handle it)
// When false, controllers perform validation (defense in depth)
WebhooksEnabled bool
// GPUDiscoveryEnabled indicates whether Helm provisioned node read access for the namespace-scoped operator. // GPUDiscoveryEnabled indicates whether Helm provisioned node read access for the namespace-scoped operator.
// Only relevant for namespace-scoped operators (RestrictedNamespace != ""). // Only relevant for namespace-scoped operators (RestrictedNamespace != "").
GPUDiscoveryEnabled bool GPUDiscoveryEnabled bool
......
...@@ -122,7 +122,7 @@ For a user-focused guide on deploying and managing models with DynamoModel, see: ...@@ -122,7 +122,7 @@ For a user-focused guide on deploying and managing models with DynamoModel, see:
## Webhooks ## Webhooks
The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validation of custom resources before they are persisted to the cluster. Webhooks are **enabled by default** and ensure that invalid configurations are rejected immediately at the API server level. The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validation and mutation of custom resources before they are persisted to the cluster. Webhooks are a required component of the operator and ensure that invalid configurations are rejected immediately at the API server level.
**Key Features:** **Key Features:**
- ✅ Shared certificate infrastructure across all webhook types - ✅ Shared certificate infrastructure across all webhook types
......
...@@ -11,7 +11,6 @@ This document describes the webhook functionality in the Dynamo Operator, includ ...@@ -11,7 +11,6 @@ This document describes the webhook functionality in the Dynamo Operator, includ
- [Overview](#overview) - [Overview](#overview)
- [Architecture](#architecture) - [Architecture](#architecture)
- [Configuration](#configuration) - [Configuration](#configuration)
- [Enabling/Disabling Webhooks](#enablingdisabling-webhooks)
- [Certificate Management Options](#certificate-management-options) - [Certificate Management Options](#certificate-management-options)
- [Advanced Configuration](#advanced-configuration) - [Advanced Configuration](#advanced-configuration)
- [Certificate Management](#certificate-management) - [Certificate Management](#certificate-management)
...@@ -31,10 +30,9 @@ All webhook types (validating, mutating, conversion, etc.) share the same **webh ...@@ -31,10 +30,9 @@ All webhook types (validating, mutating, conversion, etc.) share the same **webh
### Key Features ### Key Features
-**Enabled by default** - Zero-touch validation out of the box -**Always enabled** - Webhooks are a required component of the operator
-**Shared certificate infrastructure** - All webhook types use the same TLS certificates -**Shared certificate infrastructure** - All webhook types use the same TLS certificates
-**Automatic certificate generation** - No manual certificate management required -**Automatic certificate generation** - No manual certificate management required
-**Defense in depth** - Controllers validate when webhooks are disabled
-**cert-manager integration** - Optional integration for automated certificate lifecycle -**cert-manager integration** - Optional integration for automated certificate lifecycle
-**Multi-operator support** - Lease-based coordination for cluster-wide and namespace-restricted deployments -**Multi-operator support** - Lease-based coordination for cluster-wide and namespace-restricted deployments
-**Immutability enforcement** - Critical fields protected via CEL validation rules -**Immutability enforcement** - Critical fields protected via CEL validation rules
...@@ -45,8 +43,11 @@ All webhook types (validating, mutating, conversion, etc.) share the same **webh ...@@ -45,8 +43,11 @@ All webhook types (validating, mutating, conversion, etc.) share the same **webh
- `DynamoComponentDeployment` validation - `DynamoComponentDeployment` validation
- `DynamoGraphDeployment` validation - `DynamoGraphDeployment` validation
- `DynamoModel` validation - `DynamoModel` validation
- `DynamoGraphDeploymentRequest` validation
- **Mutating Webhooks**: Apply default values to resources on creation
- `DynamoGraphDeployment` defaulting
**Note:** Future releases may add mutating webhooks (for defaults/transformations) and conversion webhooks (for CRD version migrations). All will use the same certificate infrastructure described in this document. **Note:** All webhook types use the same certificate infrastructure described in this document.
--- ---
...@@ -56,54 +57,57 @@ All webhook types (validating, mutating, conversion, etc.) share the same **webh ...@@ -56,54 +57,57 @@ All webhook types (validating, mutating, conversion, etc.) share the same **webh
┌─────────────────────────────────────────────────────────────────┐ ┌─────────────────────────────────────────────────────────────────┐
│ API Server │ │ API Server │
│ 1. User submits CR (kubectl apply) │ │ 1. User submits CR (kubectl apply) │
│ 2. API server calls ValidatingWebhookConfiguration │ │ 2. API server calls MutatingWebhookConfiguration
└────────────────────────┬────────────────────────────────────────┘ └────────────────────────┬────────────────────────────────────────┘
│ HTTPS (TLS required) │ HTTPS (TLS required)
┌─────────────────────────────────────────────────────────────────┐ ┌─────────────────────────────────────────────────────────────────┐
│ Webhook Server (in Operator Pod) │ │ Webhook Server (in Operator Pod) │
│ 3. Validates CR against business rules │ │ 3. Applies defaults (e.g., operator version annotation) │
│ 4. Returns admit/deny decision + warnings │ │ 4. Returns mutated CR │
└─────────────────────────────────────────────────────────────────┘ └────────────────────────┬────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ API Server │
│ 5. API server calls ValidatingWebhookConfiguration │
└────────────────────────┬────────────────────────────────────────┘
│ HTTPS (TLS required)
┌─────────────────────────────────────────────────────────────────┐
│ Webhook Server (in Operator Pod) │
│ 6. Validates CR against business rules │
│ 7. Returns admit/deny decision + warnings │
└────────────────────────┬────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐ ┌─────────────────────────────────────────────────────────────────┐
│ API Server │ │ API Server │
5. If admitted: Persist CR to etcd │ 8. If admitted: Persist CR to etcd │
6. If denied: Return error to user │ 9. If denied: Return error to user │
└─────────────────────────────────────────────────────────────────┘ └─────────────────────────────────────────────────────────────────┘
``` ```
### Validation Flow ### Admission Flow
1. **Webhook validation** (if enabled): Validates at API server level 1. **Mutating webhooks**: Apply defaults and transformations before validation
2. **CEL validation**: Kubernetes-native immutability checks (always active) 2. **Validating webhooks**: Validate the (possibly mutated) CR against business rules
3. **Controller validation** (if webhooks disabled): Defense-in-depth validation during reconciliation 3. **CEL validation**: Kubernetes-native immutability checks (always active)
--- ---
## Configuration ## Upgrading from versions with `webhook.enabled: false`
### Enabling/Disabling Webhooks The `webhook.enabled` Helm value has been removed. Webhooks are now a required component of the operator and are always active. If you previously ran with `webhook.enabled: false`, take the following steps before upgrading:
Webhooks are **enabled by default**. To disable them: 1. **Remove `webhook.enabled`** from any custom values files. Helm will ignore the unknown key, but it should be cleaned up to avoid confusion.
2. **Ensure port 9443 is reachable** from the Kubernetes API server to the operator pod. If you have `NetworkPolicy` rules or firewall configurations restricting traffic, add an ingress rule allowing the API server to reach the webhook server on port 9443.
```yaml 3. **Ensure webhook TLS certificates are available.** By default, Helm hooks generate self-signed certificates automatically during `helm upgrade` — no action needed. If you use cert-manager or externally managed certificates, verify your configuration is in place before upgrading.
# Platform-level values.yaml
dynamo-operator:
webhook:
enabled: false
```
**When to disable webhooks:**
- During development/testing when rapid iteration is needed
- In environments where admission webhooks are not supported
- When troubleshooting validation issues
**Note:** When webhooks are disabled, controllers perform validation during reconciliation (defense in depth).
--- ---
## Configuration
### Certificate Management Options ### Certificate Management Options
The operator supports three certificate management modes: The operator supports three certificate management modes:
...@@ -123,9 +127,6 @@ The operator supports three certificate management modes: ...@@ -123,9 +127,6 @@ The operator supports three certificate management modes:
```yaml ```yaml
dynamo-operator: dynamo-operator:
webhook: webhook:
# Enable/disable validation webhooks
enabled: true
# Certificate management # Certificate management
certManager: certManager:
enabled: false enabled: false
...@@ -455,7 +456,7 @@ If the namespace-restricted operator is deleted or becomes unhealthy: ...@@ -455,7 +456,7 @@ If the namespace-restricted operator is deleted or becomes unhealthy:
**Checks:** **Checks:**
1. **Verify webhook is enabled**: 1. **Verify webhook configuration exists**:
```bash ```bash
kubectl get validatingwebhookconfiguration | grep dynamo kubectl get validatingwebhookconfiguration | grep dynamo
``` ```
...@@ -477,7 +478,7 @@ kubectl get service -n <namespace> | grep webhook ...@@ -477,7 +478,7 @@ kubectl get service -n <namespace> | grep webhook
4. **Check operator logs for webhook startup**: 4. **Check operator logs for webhook startup**:
```bash ```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep webhook kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep webhook
# Should see: "Webhooks are enabled - webhooks will validate, controllers will skip validation" # Should see: "Registering validation webhooks"
# Should see: "Starting webhook server" # Should see: "Starting webhook server"
``` ```
...@@ -636,15 +637,15 @@ kubectl describe <resource-type> <name> -n <namespace> ...@@ -636,15 +637,15 @@ kubectl describe <resource-type> <name> -n <namespace>
# Look for events mentioning webhook errors # Look for events mentioning webhook errors
``` ```
2. **Temporarily disable webhook**: 2. **Temporarily work around the webhook**:
```bash ```bash
# Option 1: Delete ValidatingWebhookConfiguration # Option 1: Set failurePolicy to Ignore
kubectl delete validatingwebhookconfiguration <name>
# Option 2: Set failurePolicy to Ignore
kubectl patch validatingwebhookconfiguration <name> \ kubectl patch validatingwebhookconfiguration <name> \
--type='json' \ --type='json' \
-p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]' -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
# Option 2 (last resort): Delete ValidatingWebhookConfiguration
kubectl delete validatingwebhookconfiguration <name>
``` ```
3. **Delete resource again**: 3. **Delete resource again**:
...@@ -652,7 +653,7 @@ kubectl patch validatingwebhookconfiguration <name> \ ...@@ -652,7 +653,7 @@ kubectl patch validatingwebhookconfiguration <name> \
kubectl delete <resource-type> <name> -n <namespace> kubectl delete <resource-type> <name> -n <namespace>
``` ```
4. **Re-enable webhook**: 4. **Restore webhook configuration**:
```bash ```bash
helm upgrade <release> dynamo-platform -n <namespace> helm upgrade <release> dynamo-platform -n <namespace>
``` ```
...@@ -663,17 +664,15 @@ helm upgrade <release> dynamo-platform -n <namespace> ...@@ -663,17 +664,15 @@ helm upgrade <release> dynamo-platform -n <namespace>
### Production Deployments ### Production Deployments
1.**Keep webhooks enabled** (default) for real-time validation 1.**Use `failurePolicy: Fail`** (default) to ensure validation is enforced
2.**Use `failurePolicy: Fail`** (default) to ensure validation is enforced 2.**Monitor webhook latency** - Validation adds ~10-50ms per resource operation
3.**Monitor webhook latency** - Validation adds ~10-50ms per resource operation 3.**Use cert-manager** for automated certificate lifecycle in large deployments
4.**Use cert-manager** for automated certificate lifecycle in large deployments 4.**Test webhook configuration** in staging before production
5.**Test webhook configuration** in staging before production
### Development Deployments ### Development Deployments
1.**Disable webhooks** for rapid iteration if needed 1.**Use `failurePolicy: Ignore`** if webhook availability is problematic during development
2.**Use `failurePolicy: Ignore`** if webhook availability is problematic 2.**Keep automatic certificates** (simpler than cert-manager for dev)
3.**Keep automatic certificates** (simpler than cert-manager for dev)
### Multi-Tenant Deployments ### Multi-Tenant Deployments
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment