Unverified Commit f385661a authored by Julien Mancuso's avatar Julien Mancuso Committed by GitHub
Browse files

chore: remove mechanism to disable webhooks (#6441)


Signed-off-by: default avatarJulien Mancuso <jmancuso@nvidia.com>
parent c82fe888
......@@ -19,7 +19,7 @@ limitations under the License.
A Helm chart for NVIDIA Dynamo Platform.
![Version: 0.9.0](https://img.shields.io/badge/Version-0.9.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square)
![Version: 1.0.0-dev](https://img.shields.io/badge/Version-1.0.0--dev-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square)
## 🚀 Overview
......@@ -37,6 +37,17 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Kubernetes Platform i
- Helm 3.8+
- Sufficient cluster resources for your deployment scale
- Container registry access (if using private images)
- TLS certificate infrastructure for admission webhooks (auto-generated via Helm hooks by default, or [cert-manager](https://cert-manager.io/), or externally managed)
## 🔄 Upgrading Notes
### Webhooks are now mandatory (v1.0.0+)
The `webhook.enabled` Helm value has been removed. Admission webhooks are now a required component of the operator and cannot be disabled. This change aligns with the upcoming addition of CRD conversion webhooks, which are mandatory for multi-version API support.
No action is required for most upgrades — Helm hooks automatically generate TLS certificates and inject the CA bundle during `helm upgrade`. If you use cert-manager or externally managed certificates, ensure your existing configuration is correct before upgrading.
---
## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment
......@@ -86,7 +97,7 @@ The chart includes built-in validation to prevent all operator conflicts:
| Repository | Name | Version |
|------------|------|---------|
| file://components/operator | dynamo-operator | 0.7.1 |
| file://components/operator | dynamo-operator | 1.0.0-dev |
| https://charts.bitnami.com/bitnami | etcd | 12.0.18 |
| https://nats-io.github.io/k8s/helm/charts/ | nats | 1.3.2 |
| oci://ghcr.io/ai-dynamo/grove | grove(grove-charts) | v0.1.0-alpha.6 |
......@@ -96,6 +107,7 @@ The chart includes built-in validation to prevent all operator conflicts:
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| global.etcd.install | bool | `false` | Whether this chart should install the bundled etcd subchart. When true, deploys etcd and auto-configures the operator with its address. When false, etcd is not deployed. Use dynamo-operator.etcdAddr to point at an external instance if you are bringing your own etcd. |
| dynamo-operator.enabled | bool | `true` | Whether to enable the Dynamo Kubernetes operator deployment |
| dynamo-operator.natsAddr | string | `""` | NATS server address for operator communication (leave empty to use the bundled NATS chart). Format: "nats://hostname:port" |
| dynamo-operator.etcdAddr | string | `""` | etcd server address for an external etcd instance. Only needed when using external etcd without the bundled subchart. Format: "http://hostname:port" or "https://hostname:port" |
......@@ -104,6 +116,8 @@ The chart includes built-in validation to prevent all operator conflicts:
| dynamo-operator.namespaceRestriction | object | `{"enabled":false,"lease":{"duration":"30s","renewInterval":"10s"},"targetNamespace":null}` | Namespace access controls for the operator |
| dynamo-operator.namespaceRestriction.enabled | bool | `false` | Whether to restrict operator to specific namespaces. By default, the operator will run with cluster-wide permissions. Only 1 instance of the operator should be deployed in the cluster. If you want to deploy multiple operator instances, you can set this to true and specify the target namespace (by default, the target namespace is the helm release namespace). |
| dynamo-operator.namespaceRestriction.targetNamespace | string | `nil` | Target namespace for operator deployment (leave empty for current namespace) |
| dynamo-operator.gpuDiscovery | object | `{"enabled":true}` | GPU discovery configuration (only applies when namespaceRestriction.enabled=true) |
| dynamo-operator.gpuDiscovery.enabled | bool | `true` | Whether to provision a ClusterRole for the namespace-scoped operator to read GPU node labels. When true (default), Helm creates a ClusterRole/ClusterRoleBinding granting node read access. Set to false if your installer lacks ClusterRole creation permissions. |
| dynamo-operator.controllerManager.tolerations | list | `[]` | Node tolerations for controller manager pods |
| dynamo-operator.controllerManager.affinity | object | `{}` | Affinity for controller manager pods |
| dynamo-operator.controllerManager.leaderElection.id | string | `""` | Leader election ID for cluster-wide coordination. WARNING: All cluster-wide operators must use the SAME ID to prevent split-brain. Different IDs would allow multiple leaders simultaneously. |
......@@ -131,7 +145,6 @@ The chart includes built-in validation to prevent all operator conflicts:
| dynamo-operator.dynamo.metrics.prometheusEndpoint | string | `""` | Endpoint that services can use to retrieve metrics. If set, dynamo operator will automatically inject the PROMETHEUS_ENDPOINT environment variable into services it manages. Users can override the value of the PROMETHEUS_ENDPOINT environment variable by modifying the corresponding deployment's environment variables |
| dynamo-operator.dynamo.mpiRun.secretName | string | `"mpi-run-ssh-secret"` | Name of the secret containing the SSH key for MPI Run |
| dynamo-operator.dynamo.mpiRun.sshKeygen.enabled | bool | `true` | Whether to enable SSH key generation for MPI Run |
| dynamo-operator.webhook.enabled | bool | `true` | Whether to enable admission webhooks for resource validation. When enabled, the operator will validate DynamoComponentDeployment and DynamoGraphDeployment resources before they are created or updated in the cluster. Enabled by default for production-ready validation and better error reporting. |
| dynamo-operator.webhook.certificateSecret.name | string | `"webhook-server-cert"` | Name of the Kubernetes secret containing webhook TLS certificates. The secret must contain three keys: tls.crt (server certificate), tls.key (server private key), and ca.crt (Certificate Authority certificate). |
| dynamo-operator.webhook.certificateSecret.external | bool | `false` | Whether to manage the certificate secret externally. When false (default), certificates are automatically generated via Helm hooks during installation. When true, you must create the secret manually before installing the chart. |
| dynamo-operator.webhook.certificateValidity | int | `365` | Certificate validity duration in days for auto-generated certificates. Only used when certManager.enabled=false and certificateSecret.external=false. After this duration, certificates will expire and need to be regenerated. |
......@@ -148,6 +161,9 @@ The chart includes built-in validation to prevent all operator conflicts:
| dynamo-operator.webhook.certManager.certificate.rootCA.duration | string | `"87600h"` | Duration for the root CA certificate (e.g., "87600h" for 10 years). The root CA typically has a much longer lifetime than the leaf certificates it signs. |
| dynamo-operator.webhook.certManager.certificate.rootCA.renewBefore | string | `"720h"` | Time before root CA expiration to trigger renewal (e.g., "720h" for 30 days). Renewing a CA can be disruptive as all signed certificates must be reissued. |
| dynamo-operator.checkpoint.enabled | bool | `false` | Whether to enable checkpoint/restore functionality |
| dynamo-operator.checkpoint.initContainerImage | string | `"busybox:latest"` | Image used for init containers in checkpoint jobs (e.g., signal file cleanup) |
| dynamo-operator.checkpoint.readyForCheckpointFilePath | string | `"/tmp/ready-for-checkpoint"` | Path written by worker when model is loaded and ready for checkpointing |
| dynamo-operator.checkpoint.restoreMarkerFilePath | string | `"/tmp/dynamo-restored"` | Path written by restore-entrypoint after successful CRIU restore |
| dynamo-operator.checkpoint.storage.type | string | `"pvc"` | Storage backend type: pvc, s3, or oci |
| dynamo-operator.checkpoint.storage.signalHostPath | string | `"/var/lib/chrek/signals"` | Host path for signal files (communication between checkpoint pod and DaemonSet) |
| dynamo-operator.checkpoint.storage.pvc.pvcName | string | `"chrek-pvc"` | Name of the PVC created by the chrek chart |
......@@ -156,14 +172,12 @@ The chart includes built-in validation to prevent all operator conflicts:
| dynamo-operator.checkpoint.storage.s3.credentialsSecretRef | string | `""` | Reference to a secret containing AWS credentials |
| dynamo-operator.checkpoint.storage.oci.uri | string | `""` | OCI URI in format: oci://registry/repository |
| dynamo-operator.checkpoint.storage.oci.credentialsSecretRef | string | `""` | Reference to a docker config secret for registry authentication |
| dynamo-operator.checkpoint.criu.timeout | string | `"21600"` | CRIU operation timeout in seconds. Default: 21600 (6 hours) |
| grove.enabled | bool | `false` | Whether to enable Grove for multi-node inference coordination, if enabled, the Grove operator will be deployed cluster-wide |
| grove.tolerations | list | `[]` | Node tolerations for Grove pods |
| grove.affinity | object | `{}` | Affinity for Grove pods |
| kai-scheduler.enabled | bool | `false` | Whether to enable Kai Scheduler for intelligent resource allocation, if enabled, the Kai Scheduler operator will be deployed cluster-wide |
| kai-scheduler.global.tolerations | list | `[]` | Node tolerations for kai-scheduler pods |
| kai-scheduler.global.affinity | object | `{}` | Affinity for kai-scheduler pods |
| global.etcd.install | bool | `false` | Whether to install the bundled etcd subchart. When true, deploys etcd and auto-configures the operator with its address. Use dynamo-operator.etcdAddr to point at an external instance instead. |
| etcd.image.repository | string | `"bitnamilegacy/etcd"` | following bitnami announcement for brownout - https://github.com/bitnami/charts/tree/main/bitnami/etcd#%EF%B8%8F-important-notice-upcoming-changes-to-the-bitnami-catalog, we need to use the legacy repository until we migrate to the new "secure" repository |
| nats.enabled | bool | `true` | Whether to enable NATS deployment, disable if you want to use an external NATS instance. For complete configuration options, see: https://github.com/nats-io/k8s/tree/main/helm/charts/nats , all nats settings should be prefixed with "nats." |
......
......@@ -37,6 +37,17 @@ The Dynamo Platform Helm chart deploys the complete Dynamo Kubernetes Platform i
- Helm 3.8+
- Sufficient cluster resources for your deployment scale
- Container registry access (if using private images)
- TLS certificate infrastructure for admission webhooks (auto-generated via Helm hooks by default, or [cert-manager](https://cert-manager.io/), or externally managed)
## 🔄 Upgrading Notes
### Webhooks are now mandatory (v1.0.0+)
The `webhook.enabled` Helm value has been removed. Admission webhooks are now a required component of the operator and cannot be disabled. This change aligns with the upcoming addition of CRD conversion webhooks, which are mandatory for multi-version API support.
No action is required for most upgrades — Helm hooks automatically generate TLS certificates and inject the CA bundle during `helm upgrade`. If you use cert-manager or externally managed certificates, ensure your existing configuration is correct before upgrading.
---
## ⚠️ Important: Cluster-Wide vs Namespace-Scoped Deployment
......
......@@ -145,9 +145,6 @@ spec:
- --gpu-discovery-enabled={{ .Values.gpuDiscovery.enabled }}
{{- end }}
- --operator-version={{ .Chart.AppVersion }}
{{- if .Values.webhook.enabled }}
- --enable-webhooks=true
{{- end }}
{{- if .Values.checkpoint.enabled }}
- --checkpoint-enabled=true
- --checkpoint-storage-type={{ .Values.checkpoint.storage.type }}
......@@ -187,12 +184,10 @@ spec:
initialDelaySeconds: 15
periodSeconds: 20
name: manager
{{- if .Values.webhook.enabled }}
ports:
- containerPort: 9443
name: webhook-server
protocol: TCP
{{- end }}
readinessProbe:
httpGet:
path: /readyz
......@@ -203,20 +198,16 @@ spec:
10 }}
securityContext: {{- toYaml .Values.controllerManager.manager.containerSecurityContext
| nindent 10 }}
{{- if .Values.webhook.enabled }}
volumeMounts:
- mountPath: /tmp/k8s-webhook-server/serving-certs
name: cert
readOnly: true
{{- end }}
securityContext:
runAsNonRoot: true
serviceAccountName: {{ include "dynamo-operator.fullname" . }}-controller-manager
terminationGracePeriodSeconds: 30
{{- if .Values.webhook.enabled }}
volumes:
- name: cert
secret:
defaultMode: 420
secretName: {{ .Values.webhook.certificateSecret.name }}
{{- end }}
......@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
{{- if and .Values.webhook.enabled (not .Values.webhook.certManager.enabled) (not .Values.webhook.certificateSecret.external) }}
{{- if and (not .Values.webhook.certManager.enabled) (not .Values.webhook.certificateSecret.external) }}
---
# ServiceAccount for CA bundle injection job
apiVersion: v1
......
......@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
{{- if and .Values.webhook.enabled (not .Values.webhook.certManager.enabled) (not .Values.webhook.certificateSecret.external) }}
{{- if and (not .Values.webhook.certManager.enabled) (not .Values.webhook.certificateSecret.external) }}
---
# ServiceAccount for certificate generation job
apiVersion: v1
......
......@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
{{- if and .Values.webhook.enabled .Values.webhook.certManager.enabled }}
{{- if .Values.webhook.certManager.enabled }}
---
# Self-signed issuer to bootstrap the CA
apiVersion: cert-manager.io/v1
......
......@@ -13,7 +13,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
{{- if .Values.webhook.enabled }}
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
......@@ -221,5 +220,4 @@ webhooks:
- dynamographdeployments
sideEffects: None
timeoutSeconds: {{ .Values.webhook.timeoutSeconds }}
{{- end }}
......@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
{{- if and .Values.webhook.enabled .Values.webhook.certManager.enabled }}
{{- if .Values.webhook.certManager.enabled }}
---
# ServiceAccount for the CRD conversion patch job
apiVersion: v1
......
......@@ -13,7 +13,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
{{- if .Values.webhook.enabled }}
---
# Role to read the webhook certificate secret
apiVersion: rbac.authorization.k8s.io/v1
......@@ -56,5 +55,4 @@ subjects:
- kind: ServiceAccount
name: {{ include "dynamo-operator.fullname" . }}-controller-manager
namespace: {{ .Release.Namespace }}
{{- end }}
......@@ -13,7 +13,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
{{- if .Values.webhook.enabled }}
---
apiVersion: v1
kind: Service
......@@ -34,5 +33,4 @@ spec:
selector:
control-plane: controller-manager
{{- include "dynamo-operator.selectorLabels" . | nindent 4 }}
{{- end }}
......@@ -208,10 +208,6 @@ checkpoint:
# Webhook configuration
webhook:
# Enable admission webhooks for validation
# Enabled by default for production-ready validation and better error reporting
enabled: true
# Certificate configuration
certificateSecret:
# Name of the secret containing webhook TLS certificates
......
......@@ -156,9 +156,6 @@ dynamo-operator:
# Webhook configuration for admission control and validation
webhook:
# -- Whether to enable admission webhooks for resource validation. When enabled, the operator will validate DynamoComponentDeployment and DynamoGraphDeployment resources before they are created or updated in the cluster. Enabled by default for production-ready validation and better error reporting.
enabled: true
# Certificate configuration for webhook TLS
certificateSecret:
# -- Name of the Kubernetes secret containing webhook TLS certificates. The secret must contain three keys: tls.crt (server certificate), tls.key (server private key), and ca.crt (Certificate Authority certificate).
......
......@@ -157,7 +157,6 @@ func main() {
var namespaceScopeLeaseRenewInterval time.Duration
var operatorVersion string
var discoveryBackend string
var enableWebhooks bool
var gpuDiscoveryEnabled bool
// Checkpoint configuration
var checkpointEnabled bool
......@@ -181,9 +180,6 @@ func main() {
"If set the metrics endpoint is served securely")
flag.BoolVar(&enableHTTP2, "enable-http2", false,
"If set, HTTP/2 will be enabled for the metrics and webhook servers")
flag.BoolVar(&enableWebhooks, "enable-webhooks", false,
"Enable admission webhooks for validation. When enabled, controllers skip validation "+
"(webhooks handle it). When disabled, controllers perform validation.")
flag.BoolVar(&gpuDiscoveryEnabled, "gpu-discovery-enabled", true,
"Whether GPU discovery is enabled for namespace-scoped operators. When true (default), "+
"the Helm chart has provisioned a ClusterRole granting node read access for GPU hardware discovery.")
......@@ -692,80 +688,68 @@ func main() {
os.Exit(1)
}
// Set webhooks enabled flag in config
ctrlConfig.WebhooksEnabled = enableWebhooks
ctrlConfig.GPUDiscoveryEnabled = gpuDiscoveryEnabled
if enableWebhooks {
setupLog.Info("Webhooks are enabled - webhooks will validate, controllers will skip validation")
} else {
setupLog.Info("Webhooks are disabled - controllers will validate (defense in depth)")
}
// Configure webhooks with lease-based namespace exclusion (only if enabled)
// Configure webhooks with lease-based namespace exclusion
// In cluster-wide mode, inject ctrlConfig.ExcludedNamespaces (leaseWatcher) so webhooks can defer
// to namespace-restricted operators. In namespace-restricted mode, webhooks validate without checking
// leases (ExcludedNamespaces is nil). The webhooks use LeaseAwareValidator wrapper to add coordination.
if enableWebhooks {
if ctrlConfig.RestrictedNamespace == "" {
// Cluster-wide mode: inject the same ExcludedNamespaces used by controllers
setupLog.Info("Configuring webhooks with lease-based namespace exclusion for cluster-wide mode")
internalwebhook.SetExcludedNamespaces(ctrlConfig.ExcludedNamespaces)
} else {
// Namespace-restricted mode: no exclusion checking needed (validators not wrapped)
setupLog.Info("Configuring webhooks for namespace-restricted mode (no lease checking)",
"restrictedNamespace", ctrlConfig.RestrictedNamespace)
internalwebhook.SetExcludedNamespaces(nil)
}
// Register validation webhook handlers
setupLog.Info("Registering validation webhooks")
isClusterWide := ctrlConfig.RestrictedNamespace == ""
if isClusterWide {
setupLog.Info("Configuring webhooks with lease-based namespace exclusion for cluster-wide mode")
internalwebhook.SetExcludedNamespaces(ctrlConfig.ExcludedNamespaces)
} else {
setupLog.Info("Configuring webhooks for namespace-restricted mode (no lease checking)",
"restrictedNamespace", ctrlConfig.RestrictedNamespace)
internalwebhook.SetExcludedNamespaces(nil)
}
dcdHandler := webhookvalidation.NewDynamoComponentDeploymentHandler()
if err = dcdHandler.RegisterWithManager(mgr); err != nil {
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoComponentDeployment")
os.Exit(1)
}
// Register validation webhook handlers
setupLog.Info("Registering validation webhooks")
dgdHandler := webhookvalidation.NewDynamoGraphDeploymentHandler(mgr)
if err = dgdHandler.RegisterWithManager(mgr); err != nil {
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoGraphDeployment")
os.Exit(1)
}
dcdHandler := webhookvalidation.NewDynamoComponentDeploymentHandler()
if err = dcdHandler.RegisterWithManager(mgr); err != nil {
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoComponentDeployment")
os.Exit(1)
}
dmHandler := webhookvalidation.NewDynamoModelHandler()
if err = dmHandler.RegisterWithManager(mgr); err != nil {
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoModel")
os.Exit(1)
}
dgdHandler := webhookvalidation.NewDynamoGraphDeploymentHandler(mgr)
if err = dgdHandler.RegisterWithManager(mgr); err != nil {
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoGraphDeployment")
os.Exit(1)
}
isClusterWide := ctrlConfig.RestrictedNamespace == ""
dgdrHandler := webhookvalidation.NewDynamoGraphDeploymentRequestHandler(isClusterWide, gpuDiscoveryEnabled)
if err = dgdrHandler.RegisterWithManager(mgr); err != nil {
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoGraphDeploymentRequest")
os.Exit(1)
}
dmHandler := webhookvalidation.NewDynamoModelHandler()
if err = dmHandler.RegisterWithManager(mgr); err != nil {
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoModel")
os.Exit(1)
}
if err = ctrl.NewWebhookManagedBy(mgr).
For(&nvidiacomv1alpha1.DynamoGraphDeploymentRequest{}).
Complete(); err != nil {
setupLog.Error(err, "unable to register conversion webhook", "webhook", "DynamoGraphDeploymentRequest-conversion")
os.Exit(1)
}
dgdrHandler := webhookvalidation.NewDynamoGraphDeploymentRequestHandler(isClusterWide, gpuDiscoveryEnabled)
if err = dgdrHandler.RegisterWithManager(mgr); err != nil {
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoGraphDeploymentRequest")
os.Exit(1)
}
setupLog.Info("Validation webhooks registered successfully")
if err = ctrl.NewWebhookManagedBy(mgr).
For(&nvidiacomv1alpha1.DynamoGraphDeploymentRequest{}).
Complete(); err != nil {
setupLog.Error(err, "unable to register conversion webhook", "webhook", "DynamoGraphDeploymentRequest-conversion")
os.Exit(1)
}
// Register defaulting (mutating) webhook handlers
setupLog.Info("Registering defaulting webhooks")
setupLog.Info("Validation webhooks registered successfully")
dgdDefaulter := webhookdefaulting.NewDGDDefaulter(operatorVersion)
if err = dgdDefaulter.RegisterWithManager(mgr); err != nil {
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoGraphDeployment-defaulting")
os.Exit(1)
}
// Register defaulting (mutating) webhook handlers
setupLog.Info("Registering defaulting webhooks")
setupLog.Info("Defaulting webhooks registered successfully")
dgdDefaulter := webhookdefaulting.NewDGDDefaulter(operatorVersion)
if err = dgdDefaulter.RegisterWithManager(mgr); err != nil {
setupLog.Error(err, "unable to register webhook", "webhook", "DynamoGraphDeployment-defaulting")
os.Exit(1)
}
setupLog.Info("Defaulting webhooks registered successfully")
//+kubebuilder:scaffold:builder
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
......
......@@ -43,7 +43,6 @@ import (
"github.com/ai-dynamo/dynamo/deploy/operator/internal/dynamo"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/modelendpoint"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/observability"
webhookvalidation "github.com/ai-dynamo/dynamo/deploy/operator/internal/webhook/validation"
)
const (
......@@ -98,45 +97,6 @@ func (r *DynamoModelReconciler) Reconcile(ctx context.Context, req ctrl.Request)
logs = logs.WithValues("dynamoModel", model.Name, "namespace", model.Namespace, "baseModelName", model.Spec.BaseModelName)
logs.Info("Reconciling DynamoModel")
// Validate the DynamoModel spec (defense in depth - only when webhooks are disabled)
if !r.Config.WebhooksEnabled {
validator := webhookvalidation.NewDynamoModelValidator(model)
if _, err := validator.Validate(); err != nil {
logs.Error(err, "DynamoModel validation failed, refusing to reconcile")
// Set validation error condition
meta.SetStatusCondition(&model.Status.Conditions, metav1.Condition{
Type: "Valid",
Status: metav1.ConditionFalse,
ObservedGeneration: model.Generation,
Reason: "ValidationFailed",
Message: fmt.Sprintf("Validation failed: %v", err),
})
// Update status and don't requeue (user must fix the spec)
if statusErr := r.Status().Update(ctx, model); statusErr != nil {
logs.Error(statusErr, "Failed to update DynamoModel status with validation error")
return ctrl.Result{}, statusErr
}
// Record event for visibility
r.Recorder.Event(model, corev1.EventTypeWarning, "ValidationFailed", err.Error())
// Don't requeue - user must fix the spec
logs.Info("DynamoModel is invalid, not reconciling until spec is fixed")
return ctrl.Result{}, nil
}
// Set Valid condition to True
meta.SetStatusCondition(&model.Status.Conditions, metav1.Condition{
Type: "Valid",
Status: metav1.ConditionTrue,
ObservedGeneration: model.Generation,
Reason: "ValidationPassed",
Message: "DynamoModel spec is valid",
})
}
// Handle finalizer using common handler
finalized, err := commoncontroller.HandleFinalizer(ctx, model, r.Client, r)
if err != nil {
......
......@@ -40,7 +40,6 @@ import (
commonController "github.com/ai-dynamo/dynamo/deploy/operator/internal/controller_common"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/dynamo"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/observability"
webhookvalidation "github.com/ai-dynamo/dynamo/deploy/operator/internal/webhook/validation"
networkingv1beta1 "istio.io/client-go/pkg/apis/networking/v1beta1"
k8serrors "k8s.io/apimachinery/pkg/api/errors"
"k8s.io/apimachinery/pkg/api/meta"
......@@ -146,52 +145,6 @@ func (r *DynamoComponentDeploymentReconciler) Reconcile(ctx context.Context, req
}
}()
// Validate the DynamoComponentDeployment spec (defense in depth - only when webhooks are disabled)
if !r.Config.WebhooksEnabled {
validator := webhookvalidation.NewDynamoComponentDeploymentValidator(dynamoComponentDeployment)
if _, validationErr := validator.Validate(ctx); validationErr != nil {
logs.Error(validationErr, "DynamoComponentDeployment validation failed, refusing to reconcile")
// Set validation error condition
meta.SetStatusCondition(&dynamoComponentDeployment.Status.Conditions, metav1.Condition{
Type: "Valid",
Status: metav1.ConditionFalse,
ObservedGeneration: dynamoComponentDeployment.Generation,
Reason: "ValidationFailed",
Message: fmt.Sprintf("Validation failed: %v", validationErr),
})
// Update status and don't requeue (user must fix the spec)
if statusErr := r.Status().Update(ctx, dynamoComponentDeployment); statusErr != nil {
logs.Error(statusErr, "Failed to update DynamoComponentDeployment status with validation error")
err = statusErr
return ctrl.Result{}, err
}
// Record event for visibility
r.Recorder.Event(dynamoComponentDeployment, corev1.EventTypeWarning, "ValidationFailed", validationErr.Error())
// Don't requeue - user must fix the spec
logs.Info("DynamoComponentDeployment is invalid, not reconciling until spec is fixed")
err = nil
return ctrl.Result{}, nil
}
// Set Valid condition to True and persist it
meta.SetStatusCondition(&dynamoComponentDeployment.Status.Conditions, metav1.Condition{
Type: "Valid",
Status: metav1.ConditionTrue,
ObservedGeneration: dynamoComponentDeployment.Generation,
Reason: "ValidationPassed",
Message: "DynamoComponentDeployment spec is valid",
})
if statusErr := r.Status().Update(ctx, dynamoComponentDeployment); statusErr != nil {
logs.Error(statusErr, "Failed to update DynamoComponentDeployment status with validation success")
err = statusErr
return ctrl.Result{}, err
}
}
deleted, err := commonController.HandleFinalizer(ctx, dynamoComponentDeployment, r.Client, r)
if err != nil {
logs.Error(err, "Failed to handle finalizer")
......
......@@ -53,7 +53,6 @@ import (
"github.com/ai-dynamo/dynamo/deploy/operator/internal/dynamo"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/dynamo/epp"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/observability"
webhookvalidation "github.com/ai-dynamo/dynamo/deploy/operator/internal/webhook/validation"
rbacv1 "k8s.io/api/rbac/v1"
gaiev1 "sigs.k8s.io/gateway-api-inference-extension/api/v1"
)
......@@ -164,28 +163,6 @@ func (r *DynamoGraphDeploymentReconciler) Reconcile(ctx context.Context, req ctr
return ctrl.Result{}, nil
}
// Validate the DynamoGraphDeployment spec (defense in depth - only when webhooks are disabled)
if !r.Config.WebhooksEnabled {
validator := webhookvalidation.NewDynamoGraphDeploymentValidator(dynamoDeployment)
if _, validationErr := validator.Validate(ctx); validationErr != nil {
logger.Error(validationErr, "DynamoGraphDeployment validation failed, refusing to reconcile")
// Set validation error state and reason (defer will update status)
state = nvidiacomv1alpha1.DGDStateFailed
reason = Reason("ValidationFailed")
message = Message(fmt.Sprintf("Validation failed: %v", validationErr))
// Record event for visibility
r.Recorder.Event(dynamoDeployment, corev1.EventTypeWarning, "ValidationFailed", validationErr.Error())
// Don't requeue - user must fix the spec
logger.Info("DynamoGraphDeployment is invalid, not reconciling until spec is fixed")
// Return without error so defer updates status but doesn't requeue
return ctrl.Result{}, nil
}
}
if r.supportsManagedRollingUpdate(dynamoDeployment) {
if err = r.initializeWorkerHashIfNeeded(ctx, dynamoDeployment); err != nil {
logger.Error(err, "Failed to initialize worker hash")
......
......@@ -49,7 +49,6 @@ import (
commonController "github.com/ai-dynamo/dynamo/deploy/operator/internal/controller_common"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/gpu"
"github.com/ai-dynamo/dynamo/deploy/operator/internal/observability"
webhookvalidation "github.com/ai-dynamo/dynamo/deploy/operator/internal/webhook/validation"
)
const (
......@@ -890,24 +889,6 @@ func isOnlineProfiling(dgdr *nvidiacomv1alpha1.DynamoGraphDeploymentRequest) boo
// validateSpec validates the DGDR spec
func (r *DynamoGraphDeploymentRequestReconciler) validateSpec(ctx context.Context, dgdr *nvidiacomv1alpha1.DynamoGraphDeploymentRequest) error {
// Use the validator for simple validation (defense in depth - only when webhooks are disabled)
if !r.Config.WebhooksEnabled {
isClusterWide := r.Config.RestrictedNamespace == ""
validator := webhookvalidation.NewDynamoGraphDeploymentRequestValidator(dgdr, isClusterWide, r.Config.GPUDiscoveryEnabled)
warnings, err := validator.Validate()
if err != nil {
return err
}
// Log warnings if any
if len(warnings) > 0 {
logger := log.FromContext(ctx)
for _, warning := range warnings {
logger.Info("Validation warning", "warning", warning)
}
}
}
// Validate ConfigMap if provided (for the DGD base config)
// This requires cluster access and cannot be done in the stateless validator
if dgdr.Spec.ProfilingConfig.ConfigMapRef != nil {
......
......@@ -80,11 +80,6 @@ type Config struct {
// DiscoveryBackend is the discovery backend to use. Default is "kubernetes" for Kubernetes API service discovery. Set to "etcd" to use ETCD for discovery.
DiscoveryBackend string
// WebhooksEnabled indicates whether admission webhooks are enabled
// When true, controllers skip validation (webhooks handle it)
// When false, controllers perform validation (defense in depth)
WebhooksEnabled bool
// GPUDiscoveryEnabled indicates whether Helm provisioned node read access for the namespace-scoped operator.
// Only relevant for namespace-scoped operators (RestrictedNamespace != "").
GPUDiscoveryEnabled bool
......
......@@ -122,7 +122,7 @@ For a user-focused guide on deploying and managing models with DynamoModel, see:
## Webhooks
The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validation of custom resources before they are persisted to the cluster. Webhooks are **enabled by default** and ensure that invalid configurations are rejected immediately at the API server level.
The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validation and mutation of custom resources before they are persisted to the cluster. Webhooks are a required component of the operator and ensure that invalid configurations are rejected immediately at the API server level.
**Key Features:**
- ✅ Shared certificate infrastructure across all webhook types
......
......@@ -11,7 +11,6 @@ This document describes the webhook functionality in the Dynamo Operator, includ
- [Overview](#overview)
- [Architecture](#architecture)
- [Configuration](#configuration)
- [Enabling/Disabling Webhooks](#enablingdisabling-webhooks)
- [Certificate Management Options](#certificate-management-options)
- [Advanced Configuration](#advanced-configuration)
- [Certificate Management](#certificate-management)
......@@ -31,10 +30,9 @@ All webhook types (validating, mutating, conversion, etc.) share the same **webh
### Key Features
-**Enabled by default** - Zero-touch validation out of the box
-**Always enabled** - Webhooks are a required component of the operator
-**Shared certificate infrastructure** - All webhook types use the same TLS certificates
-**Automatic certificate generation** - No manual certificate management required
-**Defense in depth** - Controllers validate when webhooks are disabled
-**cert-manager integration** - Optional integration for automated certificate lifecycle
-**Multi-operator support** - Lease-based coordination for cluster-wide and namespace-restricted deployments
-**Immutability enforcement** - Critical fields protected via CEL validation rules
......@@ -45,8 +43,11 @@ All webhook types (validating, mutating, conversion, etc.) share the same **webh
- `DynamoComponentDeployment` validation
- `DynamoGraphDeployment` validation
- `DynamoModel` validation
- `DynamoGraphDeploymentRequest` validation
- **Mutating Webhooks**: Apply default values to resources on creation
- `DynamoGraphDeployment` defaulting
**Note:** Future releases may add mutating webhooks (for defaults/transformations) and conversion webhooks (for CRD version migrations). All will use the same certificate infrastructure described in this document.
**Note:** All webhook types use the same certificate infrastructure described in this document.
---
......@@ -56,54 +57,57 @@ All webhook types (validating, mutating, conversion, etc.) share the same **webh
┌─────────────────────────────────────────────────────────────────┐
│ API Server │
│ 1. User submits CR (kubectl apply) │
│ 2. API server calls ValidatingWebhookConfiguration │
│ 2. API server calls MutatingWebhookConfiguration
└────────────────────────┬────────────────────────────────────────┘
│ HTTPS (TLS required)
┌─────────────────────────────────────────────────────────────────┐
│ Webhook Server (in Operator Pod) │
│ 3. Validates CR against business rules │
│ 4. Returns admit/deny decision + warnings │
└─────────────────────────────────────────────────────────────────┘
│ 3. Applies defaults (e.g., operator version annotation) │
│ 4. Returns mutated CR │
└────────────────────────┬────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ API Server │
│ 5. API server calls ValidatingWebhookConfiguration │
└────────────────────────┬────────────────────────────────────────┘
│ HTTPS (TLS required)
┌─────────────────────────────────────────────────────────────────┐
│ Webhook Server (in Operator Pod) │
│ 6. Validates CR against business rules │
│ 7. Returns admit/deny decision + warnings │
└────────────────────────┬────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ API Server │
5. If admitted: Persist CR to etcd │
6. If denied: Return error to user │
8. If admitted: Persist CR to etcd │
9. If denied: Return error to user │
└─────────────────────────────────────────────────────────────────┘
```
### Validation Flow
### Admission Flow
1. **Webhook validation** (if enabled): Validates at API server level
2. **CEL validation**: Kubernetes-native immutability checks (always active)
3. **Controller validation** (if webhooks disabled): Defense-in-depth validation during reconciliation
1. **Mutating webhooks**: Apply defaults and transformations before validation
2. **Validating webhooks**: Validate the (possibly mutated) CR against business rules
3. **CEL validation**: Kubernetes-native immutability checks (always active)
---
## Configuration
## Upgrading from versions with `webhook.enabled: false`
### Enabling/Disabling Webhooks
The `webhook.enabled` Helm value has been removed. Webhooks are now a required component of the operator and are always active. If you previously ran with `webhook.enabled: false`, take the following steps before upgrading:
Webhooks are **enabled by default**. To disable them:
```yaml
# Platform-level values.yaml
dynamo-operator:
webhook:
enabled: false
```
**When to disable webhooks:**
- During development/testing when rapid iteration is needed
- In environments where admission webhooks are not supported
- When troubleshooting validation issues
**Note:** When webhooks are disabled, controllers perform validation during reconciliation (defense in depth).
1. **Remove `webhook.enabled`** from any custom values files. Helm will ignore the unknown key, but it should be cleaned up to avoid confusion.
2. **Ensure port 9443 is reachable** from the Kubernetes API server to the operator pod. If you have `NetworkPolicy` rules or firewall configurations restricting traffic, add an ingress rule allowing the API server to reach the webhook server on port 9443.
3. **Ensure webhook TLS certificates are available.** By default, Helm hooks generate self-signed certificates automatically during `helm upgrade` — no action needed. If you use cert-manager or externally managed certificates, verify your configuration is in place before upgrading.
---
## Configuration
### Certificate Management Options
The operator supports three certificate management modes:
......@@ -123,9 +127,6 @@ The operator supports three certificate management modes:
```yaml
dynamo-operator:
webhook:
# Enable/disable validation webhooks
enabled: true
# Certificate management
certManager:
enabled: false
......@@ -455,7 +456,7 @@ If the namespace-restricted operator is deleted or becomes unhealthy:
**Checks:**
1. **Verify webhook is enabled**:
1. **Verify webhook configuration exists**:
```bash
kubectl get validatingwebhookconfiguration | grep dynamo
```
......@@ -477,7 +478,7 @@ kubectl get service -n <namespace> | grep webhook
4. **Check operator logs for webhook startup**:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep webhook
# Should see: "Webhooks are enabled - webhooks will validate, controllers will skip validation"
# Should see: "Registering validation webhooks"
# Should see: "Starting webhook server"
```
......@@ -636,15 +637,15 @@ kubectl describe <resource-type> <name> -n <namespace>
# Look for events mentioning webhook errors
```
2. **Temporarily disable webhook**:
2. **Temporarily work around the webhook**:
```bash
# Option 1: Delete ValidatingWebhookConfiguration
kubectl delete validatingwebhookconfiguration <name>
# Option 2: Set failurePolicy to Ignore
# Option 1: Set failurePolicy to Ignore
kubectl patch validatingwebhookconfiguration <name> \
--type='json' \
-p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
# Option 2 (last resort): Delete ValidatingWebhookConfiguration
kubectl delete validatingwebhookconfiguration <name>
```
3. **Delete resource again**:
......@@ -652,7 +653,7 @@ kubectl patch validatingwebhookconfiguration <name> \
kubectl delete <resource-type> <name> -n <namespace>
```
4. **Re-enable webhook**:
4. **Restore webhook configuration**:
```bash
helm upgrade <release> dynamo-platform -n <namespace>
```
......@@ -663,17 +664,15 @@ helm upgrade <release> dynamo-platform -n <namespace>
### Production Deployments
1.**Keep webhooks enabled** (default) for real-time validation
2.**Use `failurePolicy: Fail`** (default) to ensure validation is enforced
3.**Monitor webhook latency** - Validation adds ~10-50ms per resource operation
4.**Use cert-manager** for automated certificate lifecycle in large deployments
5.**Test webhook configuration** in staging before production
1.**Use `failurePolicy: Fail`** (default) to ensure validation is enforced
2.**Monitor webhook latency** - Validation adds ~10-50ms per resource operation
3.**Use cert-manager** for automated certificate lifecycle in large deployments
4.**Test webhook configuration** in staging before production
### Development Deployments
1.**Disable webhooks** for rapid iteration if needed
2.**Use `failurePolicy: Ignore`** if webhook availability is problematic
3.**Keep automatic certificates** (simpler than cert-manager for dev)
1.**Use `failurePolicy: Ignore`** if webhook availability is problematic during development
2.**Keep automatic certificates** (simpler than cert-manager for dev)
### Multi-Tenant Deployments
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment