feat: Add Tiltfile for operator local development with live-reload (#6971)

Signed-off-by: Jont828 <jt572@cornell.edu>

Unverified commit 6e56bad6, authored Mar 20, 2026 by Jonathan Tong; committed by GitHub Mar 20, 2026
6 changed files
--- a/deploy/operator/.gitignore
+++ b/deploy/operator/.gitignore
@@ -35,5 +35,7 @@ tilt-settings.local.yaml
 *.swo
 *~
+# Tilt per-developer settings (defaults are defined in the Tiltfile)
+tilt-settings.yaml
 !env*
\ No newline at end of file
--- a/deploy/operator/README.md
+++ b/deploy/operator/README.md
@@ -31,6 +31,76 @@ Built with [Kubebuilder](https://book.kubebuilder.io/), it follows Kubernetes be
 make
 ```
+### Local development with Tilt
+[Tilt](https://docs.tilt.dev/install.html) provides a live-reload development loop for the operator. It compiles the Go binary locally, builds a minimal Docker image, renders the production Helm chart, and deploys everything to your cluster. On code changes, Tilt recompiles and live-updates the binary without a full image rebuild — giving fast iteration on controller logic against a real cluster.
+#### Prerequisites
+The following tools must be installed and available in your `PATH` before running `tilt up`:
+| Tool | Version | Purpose | Install |
+|------|---------|---------|---------|
+| [Go](https://go.dev/doc/install) | ≥ 1.25 | Compiles the manager binary locally | [go.dev/doc/install](https://go.dev/doc/install) |
+| [Tilt](https://docs.tilt.dev/install.html) | latest | Live-reload dev loop orchestrator | [docs.tilt.dev/install](https://docs.tilt.dev/install.html) |
+| [Helm](https://helm.sh/docs/intro/install/) | v3 | Renders the platform Helm chart | [helm.sh/docs/intro/install](https://helm.sh/docs/intro/install/) |
+| [kubectl](https://kubernetes.io/docs/tasks/tools/) | ≥ 1.29 | Applies CRDs and creates the namespace | [kubernetes.io/docs/tasks/tools](https://kubernetes.io/docs/tasks/tools/) |
+| [Docker](https://docs.docker.com/get-docker/) | latest | Builds the live-update container image | [docs.docker.com/get-docker](https://docs.docker.com/get-docker/) |
+**Conditional prerequisites** (only needed when `skip_codegen: false`, the default):
+| Tool | Version | Purpose | Install |
+|------|---------|---------|---------|
+| [yq](https://github.com/mikefarah/yq) | v4+ | Post-processes generated CRD YAML | `make ensure-yq` or [github.com/mikefarah/yq](https://github.com/mikefarah/yq) |
+| [Python 3](https://www.python.org/) + [pydantic](https://docs.pydantic.dev/) | 3.x | Generates Pydantic models from Go types (`make generate`) | `pip install pydantic` |
+> **Tip:** Set `skip_codegen: true` in `tilt-settings.yaml` to skip CRD/code generation on every reload. This removes the yq/Python requirement and speeds up iteration when you haven't changed API types.
+**Cluster:** You need a Kubernetes cluster (kind, minikube, GKE, EKS, bare-metal, etc.) with a kubeconfig context that Tilt can reach. If your cluster has GPUs and you want to test DGD/DGDR workloads end-to-end, the [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html) should be installed on the cluster.
+#### Setup
+1. **Create `tilt-settings.yaml`** in `deploy/operator/` with this minimal config:
+   ```yaml
+   allowed_contexts:
+     - h100                 # Change to your Kubernetes context
+   registry: docker.io/myuser  # Change to your Docker registry
+   ```
+2. **Run Tilt**:
+   ```bash
+   cd deploy/operator
+   tilt up
+   ```
+   The Tilt UI will open at http://localhost:10350 showing resource status and logs.
+#### Features
+- **Fast iteration**: On code changes, Tilt recompiles the manager binary and live-updates it into the running container — no full image rebuild needed
+- **Real cluster testing**: Reconciles against your actual Kubernetes cluster (kind, minikube, GKE, AKS, etc.)
+- **CRD + Helm rendering**: Automatically applies CRDs and renders the platform Helm chart with your configuration
+- **Infrastructure toggles**: Control NATS, etcd, KAI scheduler, and Grove via `tilt-settings.yaml`
+#### Optional configuration
+Additional settings available in `tilt-settings.yaml`:
+```yaml
+# Infrastructure toggles (control which components are deployed)
+enable_nats: true              # Enable NATS messaging (default: true, required for DGD/DGDR)
+enable_etcd: false             # Enable etcd service discovery (default: false)
+enable_kai_scheduler: false    # Enable KAI GPU-aware scheduler (default: false)
+enable_grove: false            # Enable Grove orchestrator (default: false)
+# Other settings
+namespace: dynamo-system       # Kubernetes namespace for operator deployment
+skip_codegen: false            # Skip code generation for faster reloads if API unchanged
+image_pull_secret: ""          # Name of Secret for private Docker registries
+helm_values: {}                # Extra Helm value overrides for platform chart
+operator_version: "0.0.0-dev"  # Override operator version (default: from Chart.yaml)
+```
 ### Install
 See [Dynamo Kubernetes Platform Installation Guide](/docs/kubernetes/installation-guide.md) for installation instructions.
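As an illustration outside the diff itself, the "defaults are defined in the Tiltfile, `tilt-settings.yaml` overrides them" behavior described in the README hunk above can be sketched in plain Python. This is a minimal sketch, not the Tiltfile's code: `DEFAULTS` copies the keys shown in this change, while `merge_settings` and `user_file` are hypothetical names, and the merge is assumed to be shallow (a top-level key in the file replaces the default wholesale).

```python
# Hypothetical mirror of the defaults-then-override logic; the real logic
# lives in the Tiltfile (Starlark), this is only an illustrative sketch.
DEFAULTS = {
    "namespace": "dynamo-system",
    "enable_nats": True,
    "enable_etcd": False,
    "enable_kai_scheduler": False,
    "enable_grove": False,
    "skip_codegen": False,
    "image_pull_secret": "",
    "helm_values": {},
}


def merge_settings(overrides):
    """Return a copy of the defaults updated with the user's overrides."""
    merged = dict(DEFAULTS)
    if overrides:  # a missing or empty settings file leaves the defaults as-is
        merged.update(overrides)
    return merged


# Stand-in for the parsed contents of tilt-settings.yaml (assumed values).
user_file = {
    "allowed_contexts": ["h100"],
    "registry": "docker.io/myuser",
    "skip_codegen": True,
}
settings = merge_settings(user_file)
print(settings["skip_codegen"], settings["enable_nats"])
```

Keys that have no default, such as `allowed_contexts` and `registry`, simply appear in the merged result when the file provides them.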
--- a/deploy/operator/Tiltfile
+++ b/deploy/operator/Tiltfile
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Tiltfile for developing the Dynamo Kubernetes Operator.
+#
+# Usage:
+#   cd deploy/operator
+#   # edit tilt-settings.yaml as needed
+#   tilt up
+#
+# What it does:
+#   1. Compiles the Go manager binary locally (fast, native).
+#   2. Builds a minimal Docker image containing only the binary.
+#   3. Renders the production Helm chart (deploy/helm/charts/platform) with
+#      `helm template`, applies CRDs via kubectl, and deploys all rendered
+#      resources via k8s_yaml().
+#   4. On code change, Tilt recompiles the binary and live-updates it into the
+#      running container — no full image rebuild needed.
+#
+# Prerequisites (must be in PATH):
+#   - Go >= 1.25        — compiles the manager binary locally
+#   - tilt              — live-reload orchestrator (https://docs.tilt.dev/install.html)
+#   - helm v3           — renders the platform Helm chart
+#   - kubectl >= 1.29   — applies CRDs and creates the namespace
+#   - docker            — builds the live-update container image
+#   - A Kubernetes cluster reachable via your current kubeconfig context
+#
+# Conditional (only when skip_codegen is false, the default):
+#   - yq v4+            — post-processes generated CRD YAML (run `make ensure-yq`)
+#   - python3 + pydantic — generates Pydantic models from Go types
+#
+# The tilt restart_process extension is auto-fetched on first `tilt up`.
+load('ext://restart_process', 'docker_build_with_restart')
+# ---------------------------------------------------------------------------
+# Settings — defaults are defined here; tilt-settings.yaml overrides them.
+# ---------------------------------------------------------------------------
+settings = {
+    'namespace':            'dynamo-system',
+    'enable_nats':          True,       # required for DGD/DGDR workloads
+    'enable_etcd':          False,      # only if discoveryBackend is "etcd"
+    'enable_kai_scheduler': False,      # GPU-aware scheduling for multi-node
+    'enable_grove':         False,      # PodClique-based multi-node orchestration
+    'skip_codegen':         False,      # skip make generate/manifests for faster iteration
+    'image_pull_secret':    '',         # name of docker-registry Secret for private registries
+    'helm_values':          {},         # extra --set overrides passed to helm template
+}
+if os.path.exists('tilt-settings.yaml'):
+    data = read_yaml('tilt-settings.yaml', default={})
+    if data:
+        settings.update(data)
+if 'allowed_contexts' in settings:
+    allow_k8s_contexts(settings['allowed_contexts'])
+# Registry — resolved from (highest priority wins):
+#   1. REGISTRY env var          (e.g. REGISTRY=docker.io/myuser tilt up)
+#   2. "registry" in tilt-settings.yaml
+# When set, the operator image is pushed as <registry>/controller:tilt-dev.
+REGISTRY = os.getenv('REGISTRY', settings.get('registry', ''))
+if REGISTRY:
+    REGISTRY = REGISTRY.rstrip('/')
+NAMESPACE            = settings['namespace']
+HELM_VALUES          = settings['helm_values']
+ENABLE_NATS          = settings['enable_nats']
+ENABLE_ETCD          = settings['enable_etcd']
+ENABLE_KAI_SCHEDULER = settings['enable_kai_scheduler']
+ENABLE_GROVE         = settings['enable_grove']
+IMAGE_PULL_SECRET    = settings['image_pull_secret']
+# ---------------------------------------------------------------------------
+# Operator version — passed as --operator-version to the manager binary.
+# The Helm chart uses .Chart.AppVersion; for Tilt dev we read it from the
+# operator subchart's Chart.yaml so it stays in sync automatically.
+# Override via tilt-settings.yaml if needed:
+#
+#   tilt-settings.yaml:
+#     operator_version: "1.2.3"
+# ---------------------------------------------------------------------------
+def _read_chart_app_version():
+    """Read appVersion from the operator subchart's Chart.yaml."""
+    chart_path = os.path.join(
+        os.getcwd(), '..', 'helm', 'charts', 'platform',
+        'components', 'operator', 'Chart.yaml')
+    if os.path.exists(chart_path):
+        chart = read_yaml(chart_path, default={})
+        if chart and 'appVersion' in chart:
+            return str(chart['appVersion'])
+    return '0.0.0-dev'
+OPERATOR_VERSION = settings.get('operator_version', _read_chart_app_version())
+# ---------------------------------------------------------------------------
+# Paths (relative to this Tiltfile, i.e. deploy/operator/)
+# ---------------------------------------------------------------------------
+OPERATOR_DIR = os.getcwd()                                     # deploy/operator
+HELM_CHART   = os.path.join(OPERATOR_DIR, '..', 'helm', 'charts', 'platform')  # deploy/helm/charts/platform
+CRD_DIR      = os.path.join(HELM_CHART, 'components', 'operator', 'crds')
+IMG_NAME = 'controller'
+IMG_TAG  = 'tilt-dev'
+IMG      = (REGISTRY + '/' + IMG_NAME) if REGISTRY else IMG_NAME
+IMG_REF  = IMG + ':' + IMG_TAG
+# ---------------------------------------------------------------------------
+# Compile the manager binary locally (much faster than building in Docker)
+# ---------------------------------------------------------------------------
+def compile_manager():
+    return 'CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o tilt_bin/manager ./cmd/main.go'
+local_resource(
+    'manager-build',
+    compile_manager(),
+    deps=[
+        'api/',
+        'cmd/',
+        'internal/',
+        'go.mod',
+        'go.sum',
+    ],
+    ignore=['**/zz_generated.deepcopy.go'],
+    labels=['operator'],
+)
+# ---------------------------------------------------------------------------
+# CRDs — regenerate & apply via server-side apply on change
+# ---------------------------------------------------------------------------
+SKIP_CODEGEN = settings['skip_codegen']
+_crd_cmd = 'kubectl apply --server-side --force-conflicts -f ' + CRD_DIR
+if not SKIP_CODEGEN:
+    _crd_cmd = 'make generate && make manifests && ' + _crd_cmd
+local_resource(
+    'crds',
+    _crd_cmd,
+    deps=['api/'],
+    ignore=['**/zz_generated.deepcopy.go'],
+    labels=['operator'],
+)
+# ---------------------------------------------------------------------------
+# Helm template → k8s_yaml
+#
+# Renders the production Helm chart (deploy/helm/charts/platform) with the
+# operator and required infrastructure (NATS by default). This gives you a
+# fully working cluster where you can apply DGDR/DGD resources and have them
+# reconcile into real workloads on your GPU nodes — while live-reloading the
+# controller binary on every code change.
+#
+# The chart has no Helm hooks — webhook certificates, CA bundle injection,
+# and MPI SSH key generation are all handled by the operator binary at
+# runtime (auto mode).
+# ---------------------------------------------------------------------------
+def render_helm():
+    """Render the platform Helm chart with the operator and any toggled infrastructure subcharts."""
+    helm_cmd = [
+        'helm', 'template', 'dynamo', HELM_CHART,
+        '--namespace', NAMESPACE,
+        '--set', 'dynamo-operator.enabled=true',
+        # Subcharts — NATS is on by default (workers need it)
+        '--set', 'nats.enabled=%s' % str(ENABLE_NATS).lower(),
+        '--set', 'dynamo-operator.nats.enabled=%s' % str(ENABLE_NATS).lower(),
+        '--set', 'global.etcd.install=%s' % str(ENABLE_ETCD).lower(),
+        '--set', 'global.kai-scheduler.install=%s' % str(ENABLE_KAI_SCHEDULER).lower(),
+        '--set', 'global.grove.install=%s' % str(ENABLE_GROVE).lower(),
+        # Point image at our Tilt-managed image
+        '--set', 'dynamo-operator.controllerManager.manager.image.repository=' + IMG,
+        '--set', 'dynamo-operator.controllerManager.manager.image.tag=' + IMG_TAG,
+        '--set', 'dynamo-operator.controllerManager.manager.image.pullPolicy=IfNotPresent',
+        # We apply CRDs ourselves in the local_resource above
+        '--set', 'dynamo-operator.upgradeCRD=false',
+        '--skip-crds',
+    ]
+    # Wire in imagePullSecrets when a pull secret is configured
+    if IMAGE_PULL_SECRET:
+        helm_cmd += ['--set', 'dynamo-operator.imagePullSecrets[0].name=' + IMAGE_PULL_SECRET]
+    # Append user-provided Helm overrides from tilt-settings
+    for k, v in HELM_VALUES.items():
+        helm_cmd += ['--set', '%s=%s' % (k, v)]
+    data = local(helm_cmd, quiet=True)
+    # Decode the YAML stream so we can patch individual documents
+    decoded = decode_yaml_stream(data)
+    patched = []
+    for doc in decoded:
+        if doc == None:
+            continue
+        # Ensure namespaced resources land in the target namespace.
+        # Cluster-scoped kinds must not have a namespace set.
+        _cluster_scoped_kinds = [
+            'ClusterRole', 'ClusterRoleBinding',
+            'ValidatingWebhookConfiguration', 'MutatingWebhookConfiguration',
+            'CustomResourceDefinition', 'Namespace',
+            'PriorityClass', 'StorageClass', 'IngressClass',
+        ]
+        kind = doc.get('kind', '')
+        if 'metadata' in doc and 'namespace' not in doc['metadata'] and kind not in _cluster_scoped_kinds:
+            doc['metadata']['namespace'] = NAMESPACE
+        # Strip securityContext so Tilt's live_update (writing into the
+        # container as root) doesn't get blocked by non-root restrictions.
+        if doc.get('kind') == 'Deployment':
+            spec = doc.get('spec', {}).get('template', {}).get('spec', {})
+            spec.pop('securityContext', None)
+            for c in spec.get('containers', []):
+                c.pop('securityContext', None)
+        patched.append(doc)
+    return encode_yaml_stream(patched)
+# Create the namespace before applying anything else
+local('kubectl create namespace %s || true' % NAMESPACE, quiet=True)
+k8s_yaml(render_helm())
+# ---------------------------------------------------------------------------
+# Docker image — minimal container with just the compiled binary
+# ---------------------------------------------------------------------------
+DOCKERFILE = '''
+FROM alpine:3.20 AS base
+RUN apk add --no-cache ca-certificates
+FROM base
+WORKDIR /
+COPY ./tilt_bin/manager /manager
+COPY ./tilt_bin/manager /workspace/manager
+ENTRYPOINT ["/manager"]
+'''
+docker_build_with_restart(
+    IMG_REF,
+    context='.',
+    dockerfile_contents=DOCKERFILE,
+    entrypoint=['/manager', '--config=/etc/dynamo-operator/config.yaml',
+                '--operator-version=' + OPERATOR_VERSION],
+    only=['./tilt_bin/manager'],
+    live_update=[
+        sync('./tilt_bin/manager', '/manager'),
+    ],
+)
+if not REGISTRY:
+    print('WARNING: no registry configured — image will only be available locally.')
+    print('  Set "registry" in tilt-settings.yaml or pass REGISTRY env var.')
+# ---------------------------------------------------------------------------
+# Resource grouping — keep the Tilt UI tidy
+# ---------------------------------------------------------------------------
+k8s_resource(
+    workload='dynamo-dynamo-operator-controller-manager',
+    new_name='operator',
+    labels=['operator'],
+    port_forwards=['8081:8081'],  # health endpoint
+    resource_deps=['crds', 'manager-build'],
+)
+# Group subchart workloads in the Tilt UI
+if ENABLE_NATS:
+    k8s_resource(
+        workload='dynamo-nats',
+        labels=['infrastructure'],
+    )
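As an illustration outside the diff itself, the per-document patching that `render_helm()` performs in the Tiltfile above (default the namespace on namespaced kinds, strip `securityContext` from Deployments) can be exercised in plain Python. This is a sketch under assumptions: `patch_documents` and the sample `docs` list are hypothetical, and the documents are assumed to be already decoded into dictionaries.

```python
# Kinds that must not receive a namespace, copied from the Tiltfile above.
CLUSTER_SCOPED_KINDS = {
    "ClusterRole", "ClusterRoleBinding",
    "ValidatingWebhookConfiguration", "MutatingWebhookConfiguration",
    "CustomResourceDefinition", "Namespace",
    "PriorityClass", "StorageClass", "IngressClass",
}


def patch_documents(docs, namespace):
    """Default the namespace and strip securityContext from Deployments."""
    patched = []
    for doc in docs:
        if doc is None:  # empty YAML documents decode to None
            continue
        kind = doc.get("kind", "")
        meta = doc.get("metadata")
        if meta is not None and "namespace" not in meta and kind not in CLUSTER_SCOPED_KINDS:
            meta["namespace"] = namespace
        if kind == "Deployment":
            pod_spec = doc.get("spec", {}).get("template", {}).get("spec", {})
            pod_spec.pop("securityContext", None)  # pod-level context
            for container in pod_spec.get("containers", []):
                container.pop("securityContext", None)  # container-level context
        patched.append(doc)
    return patched


# Hypothetical rendered output: one empty document, one cluster-scoped
# resource, and one Deployment with security contexts set.
docs = [
    None,
    {"kind": "ClusterRole", "metadata": {"name": "manager-role"}},
    {
        "kind": "Deployment",
        "metadata": {"name": "operator"},
        "spec": {"template": {"spec": {
            "securityContext": {"runAsNonRoot": True},
            "containers": [{"name": "manager", "securityContext": {"runAsUser": 65532}}],
        }}},
    },
]
result = patch_documents(docs, "dynamo-system")
```

In this sample the `ClusterRole` keeps its metadata unchanged, while the Deployment gains `namespace: dynamo-system` and loses both security contexts, which is what lets `live_update` write the new binary into the container.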
--- a/docs/assets/img/tilt-ui.png
+++ b/docs/assets/img/tilt-ui.png
--- a/docs/index.yml
+++ b/docs/index.yml
@@ -69,6 +69,8 @@ navigation:
            path: kubernetes/autoscaling.md
          - page: Rolling Update
            path: kubernetes/rolling-update.md
+          - page: Developing with Tilt
+            path: kubernetes/tilt-dev-setup.md
          - page: Inference Gateway (GAIE)
            path: kubernetes/inference-gateway.md
          - page: Snapshot

--- a/docs/kubernetes/tilt-dev-setup.md
+++ b/docs/kubernetes/tilt-dev-setup.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Developing the Operator with Tilt
+subtitle: Fast, live-reload development loop for the Dynamo Kubernetes operator
+---
+## Overview
+[Tilt](https://tilt.dev) provides a live-reload development environment for the
+Dynamo Kubernetes operator. Instead of manually building images, pushing to a
+registry, and redeploying on every change, Tilt watches your source files and
+automatically recompiles the Go binary, syncs it into the running container, and
+restarts the process — all in seconds.
+Under the hood, the Tiltfile:
+1. **Compiles** the Go manager binary locally (`CGO_ENABLED=0`).
+2. **Builds** a minimal Docker image containing only the binary.
+3. **Renders** the production Helm chart (`deploy/helm/charts/platform`) with
+   `helm template`, applies CRDs via `kubectl`, and deploys all rendered
+   resources.
+4. **Live-updates** the binary inside the running container on every code
+   change — no full image rebuild required.
+This gives you a fully working cluster where you can apply `DynamoGraphDeployment`
+and `DynamoGraphDeploymentRequest` resources and have them reconcile into real
+workloads, while iterating on controller logic with feedback in seconds.
+## Prerequisites
+| Tool | Version | Purpose |
+|------|---------|---------|
+| [Tilt](https://docs.tilt.dev/install.html) | v0.33+ | Development orchestration |
+| [Helm](https://helm.sh/docs/intro/install/) | v3 | Chart rendering |
+| [Go](https://go.dev/dl/) | 1.25+ | Compiling the operator |
+| [kubectl](https://kubernetes.io/docs/tasks/tools/) | — | Cluster access |
+| [Docker](https://docs.docker.com/get-docker/) | — | Building the operator image |
+| A Kubernetes cluster | — | kind, minikube, or remote cluster |
+You also need a **container registry** that is accessible to your cluster's
+nodes, so they can pull the operator image Tilt builds. If you use a local
+cluster like kind with a local registry, Tilt can push there directly.
+## Quick Start
+```bash
+cd deploy/operator
+# Create your personal settings file (gitignored)
+cat > tilt-settings.yaml <<EOF
+allowed_contexts:
+  - my-cluster-context
+registry: docker.io/myuser
+skip_codegen: true
+EOF
+# Launch
+tilt up
+```
+Tilt opens a terminal UI and a web dashboard at <http://localhost:10350>.
+The dashboard shows resource status, build logs, and port-forwards.
+Press **Space** in the terminal to open the web UI. Press **Ctrl-C** to
+stop Tilt (resources remain deployed; run `tilt down` to tear
+them down).
+![Tilt web UI showing the operator, CRDs, and infrastructure resources](../assets/img/tilt-ui.png)
+## Configuration
+All configuration is optional. The Tiltfile defines sensible defaults for every
+setting, and `tilt-settings.yaml` is gitignored so your personal values
+(cluster context, registry, etc.) never leak into the repo.
+Create `deploy/operator/tilt-settings.yaml` with any of the settings below:
+```yaml
+# Kubernetes contexts Tilt is allowed to connect to.
+# Safety guard: prevents accidental deployments to production clusters.
+allowed_contexts:
+  - my-cluster-context
+# Container registry for the operator image.
+# Can also be set via the REGISTRY env var (env var takes precedence).
+registry: docker.io/myuser
+# Skip running `make generate && make manifests` before applying CRDs.
+# Set to true when you haven't changed API types (faster iteration).
+skip_codegen: true
+# Target namespace for the operator and related resources.
+# namespace: dynamo-system
+# Subchart toggles
+# enable_nats: true            # Required for DGD/DGDR workloads (default: true)
+# enable_etcd: false           # Only if discoveryBackend is "etcd"
+# enable_kai_scheduler: false  # GPU-aware scheduling for multi-node
+# enable_grove: false          # PodClique-based multi-node orchestration
+# Extra Helm value overrides (applied on top of subchart toggles)
+# helm_values:
+#   dynamo-operator.discoveryBackend: kubernetes
+#   dynamo-operator.natsAddr: "nats://external-nats:4222"
+```
+### Settings Reference
+| Key | Type | Default | Description |
+|-----|------|---------|-------------|
+| `allowed_contexts` | list | *(none)* | Kubernetes contexts Tilt may connect to. Prevents accidental production deploys. |
+| `registry` | string | `""` | Container registry prefix (e.g. `docker.io/myuser`). Also settable via `REGISTRY` env var, which takes precedence. |
+| `namespace` | string | `dynamo-system` | Namespace for the operator Deployment and related resources. |
+| `skip_codegen` | bool | `false` | Skip `make generate && make manifests` before applying CRDs. Set to `true` when you haven't changed API types. |
+| `enable_nats` | bool | `true` | Deploy NATS subchart. Required for DGD/DGDR workloads (workers use it for communication). |
+| `enable_etcd` | bool | `false` | Deploy etcd subchart. Only needed when `discoveryBackend` is `etcd`. |
+| `enable_kai_scheduler` | bool | `false` | Deploy kai-scheduler for GPU-aware scheduling in multi-node setups. |
+| `enable_grove` | bool | `false` | Deploy Grove for PodClique-based multi-node orchestration. |
+| `image_pull_secret` | string | `""` | Name of a `docker-registry` Secret for pulling images from private registries. |
+| `helm_values` | map | `{}` | Arbitrary `--set` overrides passed to `helm template`. |
+| `operator_version` | string | *(from Chart.yaml)* | Operator `--operator-version` flag. Defaults to `appVersion` from the operator subchart. |
+### Registry Configuration
+The operator image needs to be pullable by your cluster's nodes. The registry is resolved in priority order:
+1. **`REGISTRY` env var** — `REGISTRY=docker.io/myuser tilt up`
+2. **`registry` in `tilt-settings.yaml`**
+The image is pushed as `<registry>/controller:tilt-dev`.
+<Warning>
+If no registry is configured, the image is only available locally. This works
+with kind using a local registry but will fail on remote clusters.
+</Warning>
+## How It Works
+When you run `tilt up`, Tilt creates the following resources; `operator` waits for both `manager-build` and `crds`:
+```
+manager-build     Compile Go binary locally
+        │
+        ├───── crds       Apply CRDs via server-side apply
+        │
+    operator              Deploy operator pod (live-updated)
+```
+The operator handles webhook certificate generation, CA bundle injection, and
+MPI SSH key provisioning at runtime — no external setup needed.
+### What Each Resource Does
+**manager-build** — Runs `CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build` to
+compile the operator binary. Re-runs on changes to `api/`, `cmd/`, `internal/`,
+`go.mod`, or `go.sum`.
+**crds** — Applies CRDs from the Helm chart via `kubectl apply --server-side`.
+When `skip_codegen` is `false`, runs `make generate && make manifests` first.
+**operator** — The operator Deployment itself. Tilt watches the compiled binary
+and uses `live_update` to sync it into the running container and restart the
+process — no image rebuild needed. On startup, the operator's built-in cert
+controller generates a self-signed TLS certificate, injects the CA bundle into
+webhook configurations, and creates the MPI SSH secret — matching production
+behavior exactly.
+### Live Update Cycle
+The inner development loop looks like this:
+1. You edit Go source files under `deploy/operator/`.
+2. Tilt detects the change and recompiles the binary (~2-5 seconds).
+3. The new binary is synced into the running container via `live_update`.
+4. The process restarts automatically.
+5. Your controller changes are live — test by applying a DGD/DGDR.
+No `docker build`, no `docker push`, no `kubectl rollout restart`.
+## Webhook Certificates
+The operator handles webhook TLS certificates automatically at runtime using a
+built-in cert controller (based on OPA cert-controller). On startup it:
+1. Creates a self-signed CA and webhook serving certificate.
+2. Stores them in the `webhook-server-cert` Secret.
+3. Injects the CA bundle into `ValidatingWebhookConfiguration` and
+   `MutatingWebhookConfiguration` resources.
+This matches production behavior and requires no external tooling. For
+alternative certificate management (cert-manager or external certs), see the
+[webhook documentation](../kubernetes/webhooks.md) and configure via
+`helm_values` in `tilt-settings.yaml`.
+## Typical Workflows
+### Iterating on Controller Logic
+The most common workflow — you're modifying reconciliation logic and want fast
+feedback:
+```yaml
+# tilt-settings.yaml
+allowed_contexts: [my-cluster]
+registry: docker.io/myuser
+skip_codegen: true
+```
+```bash
+tilt up
+# Edit files under internal/controller/
+# Tilt auto-recompiles and live-updates
+# Apply test resources:
+kubectl apply -f examples/backends/vllm/deploy/agg.yaml
+```
+### Changing API Types (CRDs)
+When you modify files under `api/`, you need codegen to run:
+```yaml
+# tilt-settings.yaml
+skip_codegen: false   # or omit — false is the default
+```
+Tilt will run `make generate && make manifests` and re-apply CRDs whenever
+`api/` files change.
+### Testing Multi-Node Features
+Enable the necessary subcharts:
+```yaml
+# tilt-settings.yaml
+enable_grove: true
+enable_kai_scheduler: true
+```
+### Using Environment Variables
+You can override the registry without editing the settings file:
+```bash
+REGISTRY=ghcr.io/myorg tilt up
+```
+## Tilt UI
+The web UI at <http://localhost:10350> shows:
+- **Resource status** — green/red/pending for each resource
+- **Build logs** — compilation output and errors
+- **Runtime logs** — operator logs streamed in real time
+- **Port forwards** — the health endpoint is forwarded to `localhost:8081`
+Resources are grouped by label (`operator` and `infrastructure`) to keep the
+UI organized.
+## Cleanup
+```bash
+# Stop Tilt and leave resources deployed
+# (Ctrl-C in the terminal)
+# Stop Tilt and tear down all resources
+tilt down
+```
+## Troubleshooting
+### Image Pull Errors
+If pods show `ImagePullBackOff`:
+- Verify `registry` is set in `tilt-settings.yaml` or via `REGISTRY` env var.
+- Ensure your cluster nodes can pull from that registry.
+- For kind with a local registry, follow the
+  [kind local registry guide](https://kind.sigs.k8s.io/docs/user/local-registry/).
+### Webhook TLS Errors
+If applying a DGD/DGDR fails with `x509: certificate signed by unknown authority`:
+- Check the operator logs in the Tilt UI — the cert controller logs its
+  progress on startup.
+- Verify the `webhook-server-cert` Secret exists and has been populated:
+  ```bash
+  kubectl -n dynamo-system get secret webhook-server-cert
+  ```
+- The operator may need a few seconds after startup to generate certs and
+  inject the CA bundle. Wait for the `cert-controller` log messages before
+  applying resources.
+### CRD Codegen Failures
+If `crds` fails with codegen errors:
+- Ensure `controller-gen` is installed: `make controller-gen`
+- Try running codegen manually: `make generate && make manifests`
+- Set `skip_codegen: true` temporarily to bypass if you haven't changed API types.
+### Context Safety Guard
+If Tilt refuses to start with a context error, add your cluster context to
+`allowed_contexts` in `tilt-settings.yaml`:
+```yaml
+allowed_contexts:
+  - my-cluster-context
+```
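As an illustration outside the diff itself, the registry precedence described under "Registry Configuration" above (the `REGISTRY` env var wins over `registry` in `tilt-settings.yaml`, and the image is named `<registry>/controller:tilt-dev`) can be sketched in plain Python. The function `resolve_image_ref` is a hypothetical name, and `env` is passed in as a dictionary (an assumption made so the sketch does not depend on the real process environment).

```python
IMG_NAME = "controller"
IMG_TAG = "tilt-dev"


def resolve_image_ref(env, settings):
    """Return the image reference, preferring env REGISTRY over the settings file."""
    registry = env.get("REGISTRY", settings.get("registry", ""))
    registry = registry.rstrip("/")  # tolerate a trailing slash
    repo = registry + "/" + IMG_NAME if registry else IMG_NAME
    return repo + ":" + IMG_TAG


# Env var set: it takes precedence over the settings file.
print(resolve_image_ref({"REGISTRY": "ghcr.io/myorg/"}, {"registry": "docker.io/myuser"}))
# Only the settings file is set.
print(resolve_image_ref({}, {"registry": "docker.io/myuser"}))
# Nothing configured: the image name has no registry prefix (local only).
print(resolve_image_ref({}, {}))
```

The last case corresponds to the warning in the docs: without a registry prefix the image exists only on the machine that built it, so remote clusters cannot pull it.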