feat: add DGDR test suite (#7343)

Signed-off-by: ashnamehrotra <ashnamehrotra@gmail.com>

Unverified Commit 597b7249 authored Apr 14, 2026 by Ashna Mehrotra Committed by GitHub Apr 14, 2026
5 changed files
--- a/tests/dgdr/README.md
+++ b/tests/dgdr/README.md
+# DGDR v1beta1 End-to-End Test Suite
+This directory contains the end-to-end test suite for **DynamoGraphDeploymentRequest
+(DGDR) v1beta1** — the high-level, SLA-driven Kubernetes API for deploying
+inference models with Dynamo.
+## What's tested
+| Test group | Marker(s) | GPU req | Mocker OK? | What it covers |
+|---|---|---|---|---|
+| `TestDGDRValidation` | `gpu_0`, `nightly`, `integration`, `k8s` | None | ✅ | Webhook validation: rejected/accepted specs, value enforcement, storage version, shortname |
+| `TestDGDRVersionConversion` | `gpu_0`, `nightly`, `integration`, `k8s` | None | ✅ | v1alpha1 → v1beta1 conversion webhook |
+| `TestDGDRMinimalDeployment` | `gpu_1`, `pre_merge`, `e2e` | 1+ | ⚠️ see note | Full Pending → Profiling → Ready → Deploying → Deployed lifecycle |
+| `TestDGDRBackendSelection` | `gpu_1`, `nightly`, `e2e` | 1+ | ⚠️ vllm+trtllm only | vllm and trtllm pass; sglang **skipped** (no AIC silicon data for sglang on the mocker GPU SKU) |
+| `TestDGDRSearchStrategies` | `gpu_1`/`gpu_8`, `e2e` | 1 or 8 | ⚠️ rapid only | `rapid` uses AIC and works; `thorough` requires real GPU sweeps |
+| `TestDGDRSLATargets` | `gpu_1`, `nightly`, `e2e` | 1+ | ✅ | ttft+itl, e2eLatency, optimizationType (latency/throughput) |
+| `TestDGDRWorkloadPickingModes` | `gpu_1`, `nightly`, `e2e` | 1+ | ✅ | requestRate, concurrency, isl/osl |
+| `TestDGDRFeatures` | `gpu_1`, `nightly`, `e2e` | 1+ | ⚠️ see note | planner (rapid/none sweep), mocker |
+| `TestDGDRModelCache` | `gpu_1`, `nightly`, `e2e` | 1+ | ✅ | PVC-backed model cache, cache propagated to DGD |
+| `TestDGDRHardwareOverride` | `gpu_1`, `pre_merge`, `e2e` | 1+ | ✅ | Manual gpuSku/numGpusPerNode/totalGpus/vramMb |
+| `TestDGDRAutoApply` | `gpu_1`, `pre_merge`, `e2e` | 1+ | ⚠️ see note | autoApply=true **skipped** in mocker (operator race); autoApply=false keeps Ready |
+| `TestDGDROverrides` | `gpu_1`, `nightly`, `e2e` | 1+ | ✅ | Profiling job tolerations; DGD metadata label merging **xfail** (operator gap) |
+| `TestDGDRStatusAndConditions` | `gpu_1`, `pre_merge`, `e2e` | 1+ | ⚠️ see note | All conditions set correctly, sub-phases tracked, Pareto configs; all-conditions **xfail** in mocker; pareto **skipped** in mocker |
+| `TestDGDRImmutability` | mixed | 0–1 | ⚠️ see note | Spec rejected in Profiling/Deployed, metadata always allowed |
+| `TestDGDRCleanup` | `gpu_1`, `pre_merge`, `e2e` | 1+ | ⚠️ see note | Job deleted with DGDR; DGD preserved; ConfigMap cleanup **xfail** (operator gap); DGD-persistence test **skipped** in mocker |
+| `TestDGDRMoEModels` | `gpu_8`, `nightly`, `e2e` | 8 | ❌ | DeepSeek-R1 MoE on SGLang — requires real 8-GPU node |
+## Prerequisites
+1. A running Kubernetes cluster with GPU nodes (or see [GPU-free mode](#gpu-free-mode-default) below)
+2. The Dynamo operator installed (including CRDs and webhooks)
+3. `kubectl` configured and pointing at the cluster
+4. Python 3.10+ with `pytest`, `pyyaml`, and `kubernetes_asyncio` installed (the test modules import all three):
+   ```bash
+   pip install pytest pyyaml kubernetes_asyncio
+   # or, from the repo root:
+   pip install -e ".[test]"
+   ```
+## One-time cluster setup
+Before running any tests, ensure the following are in place in your cluster.
+These are required even for GPU-free (mocker) mode.
+### 1. Install the Dynamo operator
+```bash
+cd deploy/operator
+helm install dynamo-operator helm/dynamo-operator -n dynamo-system --create-namespace
+```
+### 2. Deploy NATS
+Mocker workers (and real workers) connect to NATS for inter-component messaging.
+The operator expects NATS at `nats://dynamo-operator-nats.dynamo-system.svc.cluster.local:4222`.
+```bash
+helm repo add nats https://nats-io.github.io/k8s/helm/charts/
+helm repo update
+helm install dynamo-operator-nats nats/nats -n dynamo-system --create-namespace
+```
+### 3. Create the HuggingFace token secret
+The profiling job reads the HF token from a secret named `hf-token-secret` using the
+key `HF_TOKEN` (not `HUGGING_FACE_HUB_TOKEN`).
+```bash
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN=<your-hf-token> \
+  -n default
+# If running in a non-default namespace, adjust -n accordingly
+```
+> **Important:** The key must be `HF_TOKEN`.  The secret name must be `hf-token-secret`.
+> Using a different key name will cause the profiling job to fail silently.
+## Running the tests
+There are two main ways to run the suite depending on whether you have GPU hardware.
+---
+### GPU-free (mocker mode) — recommended for local development and CI
+No GPU nodes required.  Uses AIC simulation for profiling and mock inference workers
+for deployment.  Covers all `gpu_0` and `gpu_1` tests (~45 tests); `gpu_8` tests are
+excluded because they require a real 8-GPU node even in mocker mode.
+```bash
+python3 -m pytest tests/dgdr/ -m "gpu_0 or gpu_1" -v \
+  --dgdr-namespace=default \
+  --dgdr-image=<your-image>
+```
+Expect: 37 passed, 6 skipped (2 model-cache PVC; sglang backend; pareto in mocker; DGD-persistence in mocker; auto-apply-true in mocker), 4 xfail (DGD label merging; all-conditions requires Deployed; dry-run immutability requires Deployed; ConfigMap cleanup on deletion).
+`test_backend[sglang]` is one of the 6 skips (no AIC silicon data for sglang in mocker mode).
+---
+### Full suite with real GPUs — for production/nightly validation
+Requires a Kubernetes cluster with GPU nodes.  Set `--dgdr-no-mocker` to disable mocker
+injection and run against real hardware.  `gpu_8` tests additionally require an 8-GPU node.
+```bash
+# gpu_0 + gpu_1 tests on real GPUs (single-GPU node sufficient)
+python3 -m pytest tests/dgdr/ -m "gpu_0 or gpu_1" -v \
+  --dgdr-namespace=dynamo-test \
+  --dgdr-image=<your-image> \
+  --dgdr-no-mocker \
+  --dgdr-profiling-timeout=3600 \
+  --dgdr-deploy-timeout=1800
+# Full nightly suite including 8-GPU tests
+python3 -m pytest tests/dgdr/ -v \
+  --dgdr-namespace=dynamo-test \
+  --dgdr-image=<your-image> \
+  --dgdr-no-mocker \
+  --dgdr-pvc-name=model-cache \
+  --dgdr-profiling-timeout=14400 \
+  --dgdr-deploy-timeout=3600
+```
+Expect (gpu_0 + gpu_1, with `--dgdr-pvc-name`): **~43 passed, 0 skipped, 2 xfail** (DGD label-merging operator gap; ConfigMap cleanup operator gap).
+Without `--dgdr-pvc-name`: 2 additional skips for the model-cache tests.
+> **Note:** Two xfails are **permanent operator gaps** that persist in both mocker and GPU mode:
+> - `test_dgd_override_injects_custom_labels` — the operator does not yet merge `spec.overrides.dgd.metadata.labels` onto the created DGD.
+> - `test_deletion_removes_output_configmap` — the operator's `FinalizeResource` is a no-op and does not delete the output ConfigMap on DGDR deletion.
+> All other mocker-mode xfails/skips disappear in GPU mode and are expected to pass.
+---
+### Other useful invocations
+```bash
+# Validation + conversion tests only (needs the operator CRDs and webhooks, but no GPUs or profiling)
+python3 -m pytest tests/dgdr/ -m "gpu_0" -v \
+  --dgdr-namespace=default \
+  --dgdr-image=<your-image>
+# Pre-merge gate (GPU-free)
+python3 -m pytest tests/dgdr/ -m "pre_merge" -v \
+  --dgdr-namespace=default \
+  --dgdr-image=<your-image>
+# Single test class
+python3 -m pytest tests/dgdr/test_dgdr_v1beta1.py::TestDGDRAutoApply -v \
+  --dgdr-namespace=default \
+  --dgdr-image=<your-image>
+```
+## CLI options
+| Option | Default | Description |
+|---|---|---|
+| `--dgdr-namespace` | _(required)_ | Kubernetes namespace for test resources |
+| `--dgdr-image` | _(required)_ | Container image for profiling and inference workers |
+| `--dgdr-model` | `Qwen/Qwen3-0.6B` | HuggingFace model ID used by most tests |
+| `--dgdr-backend` | `vllm` | Default backend for DGDR tests |
+| `--dgdr-pvc-name` | _(empty)_ | PVC name holding pre-downloaded model weights (PVC tests are skipped if unset) |
+| `--dgdr-profiling-timeout` | `3600` | Seconds to wait for profiling to complete |
+| `--dgdr-deploy-timeout` | `600` | Seconds to wait for DGD to reach Deployed phase |
+| `--dgdr-no-mocker` | `false` | Disable mocker mode (require real GPU nodes) |
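Reviewer sketch: `conftest.py` is part of this commit, but its body is not shown here. The following is a minimal, assumed illustration of how the options in the table above could be registered through pytest's `pytest_addoption` hook. Only the option names and defaults come from the table; the group name, the help strings, and the `require_option` helper are hypothetical.

```python
# Hypothetical sketch of option registration for tests/dgdr/conftest.py.
# Option names and defaults come from the README table; the rest is assumed.

DGDR_OPTIONS = [
    # (flag, argparse-style keyword arguments)
    ("--dgdr-namespace", {"default": None, "help": "Kubernetes namespace for test resources (required)"}),
    ("--dgdr-image", {"default": None, "help": "Container image for profiling and inference workers (required)"}),
    ("--dgdr-model", {"default": "Qwen/Qwen3-0.6B", "help": "HuggingFace model ID"}),
    ("--dgdr-backend", {"default": "vllm", "help": "Default backend"}),
    ("--dgdr-pvc-name", {"default": "", "help": "PVC with pre-downloaded weights"}),
    ("--dgdr-profiling-timeout", {"type": int, "default": 3600, "help": "Seconds to wait for profiling"}),
    ("--dgdr-deploy-timeout", {"type": int, "default": 600, "help": "Seconds to wait for Deployed"}),
    ("--dgdr-no-mocker", {"action": "store_true", "default": False, "help": "Require real GPU nodes"}),
]


def pytest_addoption(parser):
    """pytest hook: register every DGDR option in one named group."""
    group = parser.getgroup("dgdr", "DGDR end-to-end tests")  # group name is assumed
    for flag, kwargs in DGDR_OPTIONS:
        group.addoption(flag, **kwargs)


def require_option(value, flag):
    """Fail fast when a required option (e.g. --dgdr-namespace) was not passed."""
    if not value:
        raise ValueError(f"{flag} is required for the DGDR test suite")
    return value
```

Validating required options lazily (inside a fixture, for example) rather than with `required=True` keeps unrelated pytest invocations that load the same `conftest.py` from failing; whether the actual suite does this is not visible in this diff.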
+## DGDR v1beta1 feature coverage matrix
+The following spec fields are exercised by at least one test:
+| Field | Tests that exercise it |
+|---|---|
+| `spec.model` | All tests |
+| `spec.backend` (auto/vllm/sglang/trtllm) | `TestDGDRBackendSelection`, `TestDGDRValidation` |
+| `spec.image` | All tests |
+| `spec.searchStrategy` (rapid/thorough) | `TestDGDRSearchStrategies` |
+| `spec.sla.ttft` + `spec.sla.itl` | `TestDGDRSLATargets::test_sla_ttft_and_itl` |
+| `spec.sla.e2eLatency` | `TestDGDRSLATargets::test_sla_e2e_latency` |
+| `spec.sla.optimizationType` | `TestDGDRSLATargets::test_sla_optimization_type_*` |
+| `spec.workload.isl` + `spec.workload.osl` | `TestDGDRWorkloadPickingModes` |
+| `spec.workload.requestRate` | `TestDGDRWorkloadPickingModes::test_request_rate_picking` |
+| `spec.workload.concurrency` | `TestDGDRWorkloadPickingModes::test_concurrency_picking` |
+| `spec.features.planner` (opaque config) | `TestDGDRFeatures::test_planner_enabled_*` |
+| `spec.features.mocker.enabled` | `TestDGDRFeatures::test_mocker_enabled` |
+| `spec.modelCache.pvcName` | `TestDGDRModelCache` |
+| `spec.hardware.gpuSku` | `TestDGDRHardwareOverride::test_hardware_manual_override` |
+| `spec.hardware.numGpusPerNode` | `TestDGDRHardwareOverride` |
+| `spec.hardware.totalGpus` / `spec.hardware.vramMb` | `TestDGDRHardwareOverride::test_hardware_total_gpus_and_vram` |
+| `spec.autoApply` | `TestDGDRAutoApply` |
+| `spec.overrides.profilingJob` | `TestDGDROverrides::test_profiling_job_toleration_override` |
+| `spec.overrides.dgd` | `TestDGDROverrides::test_dgd_override_injects_custom_labels` |
+| `status.phase` | All lifecycle tests |
+| `status.profilingPhase` | `TestDGDRStatusAndConditions::test_profiling_sub_phase_tracked` |
+| `status.profilingJobName` | `TestDGDRStatusAndConditions::test_profiling_job_name_populated` |
+| `status.dgdName` | `TestDGDRAutoApply`, `TestDGDRMinimalDeployment` |
+| `status.profilingResults.selectedConfig` | Multiple |
+| `status.profilingResults.pareto` | `TestDGDRStatusAndConditions::test_pareto_configs_in_profiling_results` |
+| `status.deploymentInfo` | `TestDGDRMinimalDeployment` |
+| `status.conditions` (all types) | `TestDGDRStatusAndConditions` |
+| `status.observedGeneration` | `TestDGDRStatusAndConditions::test_observed_generation_tracks_spec` |
+## GPU-free mode (default)
+By default, the test suite runs the full DGDR lifecycle **without any GPU nodes**
+by combining two simulation features:
+| Feature | How it's enabled | Which phase it affects |
+|---|---|---|
+| **AIC (AI Configurator)** | `searchStrategy: rapid` (the default) | **Profiling** — profiler runs CPU-only simulation instead of online GPU sweep |
+| **Mocker** | Enabled by default (disable with `--dgdr-no-mocker`) | **Deployment** — DGD uses mock inference workers (no GPU resources requested) |
+**How it works:**
+- `searchStrategy: rapid` is the default for v1beta1 DGDRs.  The profiler automatically
+  uses AI Configurator (AIC) simulation when rapid is set — no additional config needed.
+- Mocker mode is **enabled by default**.  The `dgdr_factory` fixture automatically injects
+  `spec.features.mocker.enabled: true` and a default `spec.hardware` config into every DGDR.
+- AIC profiling creates a Kubernetes Job that runs CPU-only (job prefix: `profile-aic-`).
+  The profiling pod does not request GPU resources.
+- Mocker deployment selects the profiler's `mocker_config_with_planner.yaml` output
+  instead of the real deployment config, resulting in DGD pods that don't request GPUs.
+- Pass `--dgdr-no-mocker` to disable mocker mode and run against real GPU hardware.
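Reviewer sketch: the injection step described in the bullets above could look roughly like the following. This is an assumed illustration, not the fixture from `conftest.py`. The helper name `inject_mocker` and the `numGpusPerNode` default are hypothetical; `gpuSku: a100_sxm` is the value the README gives in its AIC section.

```python
import copy

# Assumed defaults. Only gpuSku "a100_sxm" is stated in the README;
# the numGpusPerNode value is illustrative.
DEFAULT_MOCKER_HARDWARE = {"gpuSku": "a100_sxm", "numGpusPerNode": 8}


def inject_mocker(manifest: dict, use_mocker: bool = True) -> dict:
    """Return a copy of a DGDR manifest with mocker mode switched on.

    Hypothetical helper: with use_mocker=False (the --dgdr-no-mocker case)
    the manifest is returned unchanged apart from being copied.
    """
    result = copy.deepcopy(manifest)
    if not use_mocker:
        return result

    spec = result.setdefault("spec", {})
    # Force spec.features.mocker.enabled to true, overriding any prior value.
    spec.setdefault("features", {}).setdefault("mocker", {})["enabled"] = True

    # Keep hardware values the test set explicitly and fill in the rest.
    hardware = spec.setdefault("hardware", {})
    for key, value in DEFAULT_MOCKER_HARDWARE.items():
        hardware.setdefault(key, value)
    return result
```

Working on a deep copy keeps the caller's manifest untouched, so a test can still assert on what it originally submitted.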
+> **Note:** Some test assertions (e.g., status.deploymentInfo.gpuCount, pareto configs)
+> may produce different values under mocker than under real GPU profiling.
+> The tests are written to validate structure and phase transitions, not exact
+> profiling output values, so they work correctly in both modes.
+> **Note:** `searchStrategy: thorough` requires online (GPU) profiling even with mocker,
+> since thorough performs real benchmark measurements.  Use rapid for GPU-free testing.
+> **Note:** `TestDGDRFeatures::test_planner_enabled_with_rapid_sweep` runs with
+> `auto_apply=False` in mocker mode (same root cause as the note below — the operator
+> pre-sets `Status.DGDName` from the profiling output and then immediately fires
+> `handleDGDDeleted` when the DGD cannot be found).  In mocker mode the test only
+> validates that spec generation succeeds (waits for `PHASE_READY` and checks `dgdName`
+> + `selectedConfig`).  Full deployment with rapid sweeping is verified outside mocker
+> mode.  `test_planner_enabled_no_pre_deployment_sweep` and `test_mocker_enabled` are
+> likewise restricted to `PHASE_READY` in mocker mode.
+> **Note:** `auto_apply=True` consistently hits `handleDGDDeleted` in mocker mode. The
+> operator's `generateDGDSpec` pre-populates `Status.DGDName` from the profiling output
+> (e.g. `mocker-disagg`) _before_ the DGD is actually created.  When `handleDeployingPhase`
+> then runs it checks `DGDName != ""` and immediately tries to GET that DGD; since it does
+> not exist yet it fires `handleDGDDeleted` and the DGDR transitions to Failed.
+> All tests that would enter the Deploying phase in mocker mode therefore use
+> `auto_apply=False`/`PHASE_READY` instead (minimal lifecycle, backend selection,
+> mocker feature, planner-no-sweep, planner-rapid-sweep, DGD label override).
+> Tests whose sole purpose is to verify `auto_apply=True` DGD creation are skipped in
+> mocker mode (`test_auto_apply_true_creates_dgd_automatically`,
+> `test_deletion_does_not_remove_created_dgd`).
+> Non-mocker mode (real GPU cluster) is unaffected.
+> **Note:** `TestDGDRImmutability::test_spec_immutable_in_deployed_via_dry_run` is **xfail**
+> in mocker mode.  The test relies on the session `deployed_dgdr` fixture which, in mocker
+> mode, stops at `PHASE_READY` instead of `PHASE_DEPLOYED`.  The webhook's
+> `ValidateUpdate` immutability enforcement only activates when the DGDR is in `Deployed`
+> phase, so the server-dry-run mutation is accepted rather than rejected.
+> **Note:** `gpu_8` tests cannot be run with mocker and require a real 8-GPU node.
+> `TestDGDRSearchStrategies::test_thorough_strategy_completes` uses `searchStrategy: thorough`
+> which performs real GPU benchmark sweeps.  `TestDGDRMoEModels` (DeepSeek-R1) requires 8 GPUs
+> for the real inference workload.  Exclude them from GPU-free runs with `-m "gpu_0 or gpu_1"`.
+### AIC silicon data availability
+AIC operates in **silicon mode**: it looks up pre-recorded per-op performance data
+files shipped inside the `aiconfigurator` Python package.  These files are organised
+by `{gpu_sku}/{backend}/{backend_version}/`.  The mocker fixture injects
+`gpuSku: a100_sxm` into every DGDR, and the package ships vllm and trtllm data for that SKU but no sglang data:
+| Backend | a100_sxm data? | Mocker result |
+|---|---|---|
+| `vllm` | ✅ present | Profiling succeeds |
+| `trtllm` | ✅ present | Profiling succeeds |
+| `sglang` | ❌ missing | Test **skipped** automatically (no `sglang/0.5.8` perf data for `a100_sxm`) |
+To test sglang, run against a real GPU cluster (`--dgdr-no-mocker`) where AIC
+can use a GPU SKU for which those data files are present.
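Reviewer sketch: given the `{gpu_sku}/{backend}/{backend_version}/` layout described above, a skip decision could be based on a simple directory check. This is an assumption for illustration only. The function name and the `data_root` argument are hypothetical, and where the `aiconfigurator` package actually keeps its data files is not shown in this diff.

```python
from pathlib import Path


def has_silicon_data(data_root: Path, gpu_sku: str, backend: str) -> bool:
    """Return True if any {gpu_sku}/{backend}/{backend_version}/ directory exists.

    data_root is a hypothetical path to the directory that holds the
    per-SKU performance data.
    """
    backend_dir = data_root / gpu_sku / backend
    if not backend_dir.is_dir():
        return False
    # Any sub-directory counts as one recorded backend version.
    return any(child.is_dir() for child in backend_dir.iterdir())
```

A test could call this with `gpu_sku="a100_sxm"` and skip itself (for example via `pytest.skip`) when the result is false for the requested backend.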
+## Cleanup
+Tests clean up their own DGDRs via the `dgdr_factory` fixture.  If a test is
+interrupted, resources can be cleaned up manually:
+```bash
+# Delete all DGDRs created by the test suite (they are labelled automatically)
+kubectl delete dgdr -n default -l "test.dynamo/managed=true"
+# If you used a custom namespace:
+kubectl delete dgdr -n <namespace> -l "test.dynamo/managed=true"
+```
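Reviewer sketch: the same manual cleanup can be scripted from Python, for example in a CI teardown step. The function names are hypothetical; the label selector and the `kubectl delete dgdr` invocation are the ones shown above.

```python
import subprocess

# Label that the README says is applied to every DGDR the suite creates.
TEST_LABEL_SELECTOR = "test.dynamo/managed=true"


def build_cleanup_command(namespace: str = "default") -> list[str]:
    """Build the kubectl command that deletes all test-managed DGDRs."""
    return ["kubectl", "delete", "dgdr", "-n", namespace, "-l", TEST_LABEL_SELECTOR]


def cleanup_test_dgdrs(namespace: str = "default") -> int:
    """Run the cleanup command and return kubectl's exit code."""
    completed = subprocess.run(
        build_cleanup_command(namespace),
        capture_output=True,
        text=True,
        check=False,  # the caller decides how to treat a non-zero exit code
    )
    return completed.returncode
```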
+## Architecture notes
+- Tests interact with the cluster via `kubectl` subprocess calls and the async
+  `ManagedDGDR` helper (`kubernetes_asyncio`), which follows the same patterns as `ManagedDeployment`.
+- The `dgdr_factory` fixture ensures DGDR cleanup via `yield` regardless of test
+  outcome.
+- Tests that require an optional PVC (`--dgdr-pvc-name`) skip automatically when the
+  option is not provided.
+- Timeout values are configurable to accommodate clusters with varying profiling speeds.
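Reviewer sketch: the yield-based cleanup mentioned above can be pictured as follows. This is an assumed shape, not the `dgdr_factory` fixture from `conftest.py`. The class name and the injected `create`/`delete` callables are hypothetical; `register_for_cleanup` is the method name the test file calls.

```python
from typing import Callable, Iterator


class DGDRFactory:
    """Callable that creates DGDRs and remembers them for teardown (sketch)."""

    def __init__(self, create: Callable[[dict], None], delete: Callable[[str], None]):
        self._create = create
        self._delete = delete
        self._names: list[str] = []

    def __call__(self, manifest: dict) -> str:
        name = manifest["metadata"]["name"]
        # Record the name first so cleanup is attempted even if creation
        # fails part-way through.
        self._names.append(name)
        self._create(manifest)
        return name

    def register_for_cleanup(self, name: str) -> None:
        """Track a resource that was created outside the factory."""
        self._names.append(name)

    def cleanup(self) -> None:
        # Best effort: one failed delete must not block the remaining ones.
        for name in reversed(self._names):
            try:
                self._delete(name)
            except Exception:
                pass
        self._names.clear()


def dgdr_factory_fixture(create, delete) -> Iterator[DGDRFactory]:
    """Generator body; in conftest.py it would carry @pytest.fixture."""
    factory = DGDRFactory(create, delete)
    try:
        yield factory
    finally:
        factory.cleanup()
```

Because teardown sits after the `yield`, pytest runs it whether the test passed, failed, or raised.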
--- a/tests/dgdr/__init__.py
+++ b/tests/dgdr/__init__.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
--- a/tests/dgdr/conftest.py
+++ b/tests/dgdr/conftest.py
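Reviewer sketch: the body of `conftest.py` is not shown here, yet the test module below imports `build_dgdr_manifest`, `unique_dgdr_name`, `DGDR_API_VERSION`, and `DGDR_SHORT_NAME` from it. The following is a guess at their shape, inferred only from how the tests call them and from the spec field names in the README. The real implementations may differ, for example in name format or in which defaults they set.

```python
import uuid
from typing import Optional

DGDR_API_VERSION = "nvidia.com/v1beta1"  # group/version used by ManagedDGDR
DGDR_SHORT_NAME = "dgdr"


def unique_dgdr_name(prefix: str) -> str:
    """Return a collision-resistant, DNS-safe resource name (assumed format)."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


def build_dgdr_manifest(
    name: str,
    *,
    model: str,
    image: str,
    backend: Optional[str] = None,
    search_strategy: Optional[str] = None,
    sla: Optional[dict] = None,
    workload: Optional[dict] = None,
    features: Optional[dict] = None,
    hardware: Optional[dict] = None,
    auto_apply: Optional[bool] = None,
) -> dict:
    """Build a v1beta1 DGDR manifest, omitting optional fields left as None."""
    spec: dict = {"model": model, "image": image}
    optional = {
        "backend": backend,
        "searchStrategy": search_strategy,
        "sla": sla,
        "workload": workload,
        "features": features,
        "hardware": hardware,
        "autoApply": auto_apply,
    }
    spec.update({key: value for key, value in optional.items() if value is not None})
    return {
        "apiVersion": DGDR_API_VERSION,
        "kind": "DynamoGraphDeploymentRequest",
        # The README says test DGDRs carry this label for manual cleanup.
        "metadata": {"name": name, "labels": {"test.dynamo/managed": "true"}},
        "spec": spec,
    }
```

Leaving unset optional fields out of `spec` lets the API server apply its own defaults, which is what `test_valid_minimal_dgdr_accepted` relies on.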
--- a/tests/dgdr/test_dgdr_validation.py
+++ b/tests/dgdr/test_dgdr_validation.py
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""
+Webhook validation and version conversion tests for DGDR v1beta1.
+These tests verify that:
+- The admission webhook correctly accepts/rejects DGDR specs (TestDGDRValidation)
+- v1alpha1 resources are transparently converted to v1beta1 (TestDGDRVersionConversion)
+No GPU or cluster profiling is required (gpu_0 only).  The only prerequisite is a
+running Kubernetes cluster with the Dynamo operator CRDs and webhooks installed.
+Run:
+  pytest tests/dgdr/test_dgdr_validation.py -m gpu_0 -v --dgdr-namespace=default --dgdr-image=<image>
+Test markers:
+  gpu_0       No GPU required
+  nightly     Requires live K8s cluster (not run in general pre-merge CI)
+  integration Integration-level (uses live webhook)
+  k8s         Applied to both test classes in this module
+"""
+from __future__ import annotations
+import json
+import logging
+import pytest
+import yaml
+from kubernetes_asyncio.client import exceptions as k8s_exceptions
+from tests.dgdr.conftest import (
+    DGDR_API_VERSION,
+    DGDR_SHORT_NAME,
+    _run_kubectl,
+    build_dgdr_manifest,
+    unique_dgdr_name,
+)
+from tests.utils.managed_deployment import ManagedDGDR
+logger = logging.getLogger(__name__)
+# ---------------------------------------------------------------------------
+# ── Group 1: Webhook Validation (gpu_0, no profiling required) ──────────────
+# ---------------------------------------------------------------------------
+@pytest.mark.gpu_0
+@pytest.mark.nightly
+@pytest.mark.integration
+@pytest.mark.k8s
+class TestDGDRValidation:
+    """
+    Tests that verify the admission webhook correctly validates DGDR specs
+    before they are persisted.  These tests use server-side dry-run so no
+    resources are actually created.
+    """
+    def test_missing_model_rejected(
+        self, managed_dgdr: ManagedDGDR, dgdr_image: str
+    ) -> None:
+        """
+        A DGDR without spec.model must be rejected by the webhook.
+        The model field is the only hard-required spec field in v1beta1.
+        """
+        manifest = build_dgdr_manifest(
+            unique_dgdr_name("no-model"),
+            model="",  # intentionally empty
+            image=dgdr_image,
+        )
+        # Clear model so the field is absent
+        del manifest["spec"]["model"]
+        with pytest.raises(k8s_exceptions.ApiException):
+            managed_dgdr.run(managed_dgdr.server_dry_run(manifest))
+    def test_thorough_with_auto_backend_rejected(
+        self, managed_dgdr: ManagedDGDR, dgdr_image: str, dgdr_model: str
+    ) -> None:
+        """
+        searchStrategy: thorough + backend: auto must be rejected.
+        'thorough' sweeps real GPU engines and requires a concrete backend.
+        """
+        manifest = build_dgdr_manifest(
+            unique_dgdr_name("thorough-auto"),
+            model=dgdr_model,
+            image=dgdr_image,
+            backend="auto",
+            search_strategy="thorough",
+        )
+        with pytest.raises(k8s_exceptions.ApiException) as exc_info:
+            managed_dgdr.run(managed_dgdr.server_dry_run(manifest))
+        error_body = str(exc_info.value)
+        assert (
+            "auto" in error_body.lower()
+            or "backend" in error_body.lower()
+            or "thorough" in error_body.lower()
+        ), f"Error message should mention backend/thorough incompatibility. Got: {error_body}"
+    def test_invalid_backend_rejected(
+        self, managed_dgdr: ManagedDGDR, dgdr_image: str, dgdr_model: str
+    ) -> None:
+        """
+        An unknown backend value must be rejected by the admission webhook.
+        Valid values: auto, vllm, sglang, trtllm.
+        """
+        manifest = build_dgdr_manifest(
+            unique_dgdr_name("bad-backend"),
+            model=dgdr_model,
+            image=dgdr_image,
+            backend="unknown_backend",
+        )
+        with pytest.raises(k8s_exceptions.ApiException):
+            managed_dgdr.run(managed_dgdr.server_dry_run(manifest))
+    def test_invalid_search_strategy_rejected(
+        self, managed_dgdr: ManagedDGDR, dgdr_image: str, dgdr_model: str
+    ) -> None:
+        """
+        An unknown searchStrategy value must be rejected by the admission webhook.
+        """
+        manifest = build_dgdr_manifest(
+            unique_dgdr_name("bad-strategy"),
+            model=dgdr_model,
+            image=dgdr_image,
+            search_strategy="superfast",  # not a valid strategy
+        )
+        with pytest.raises(k8s_exceptions.ApiException):
+            managed_dgdr.run(managed_dgdr.server_dry_run(manifest))
+    def test_invalid_optimization_type_rejected(
+        self, managed_dgdr: ManagedDGDR, dgdr_image: str, dgdr_model: str
+    ) -> None:
+        """
+        An invalid sla.optimizationType value must be rejected by the
+        admission webhook. Valid values: latency, throughput.
+        """
+        manifest = build_dgdr_manifest(
+            unique_dgdr_name("bad-opt-type"),
+            model=dgdr_model,
+            image=dgdr_image,
+            sla={"optimizationType": "cost"},  # not valid
+        )
+        with pytest.raises(k8s_exceptions.ApiException):
+            managed_dgdr.run(managed_dgdr.server_dry_run(manifest))
+    def test_valid_minimal_dgdr_accepted(
+        self, managed_dgdr: ManagedDGDR, dgdr_image: str, dgdr_model: str
+    ) -> None:
+        """
+        A DGDR with only the required fields (model + image) must pass validation.
+        All other fields have defaults and are optional.
+        """
+        manifest = build_dgdr_manifest(
+            unique_dgdr_name("valid-minimal"),
+            model=dgdr_model,
+            image=dgdr_image,
+        )
+        # Should not raise — accepted by the webhook
+        managed_dgdr.run(managed_dgdr.server_dry_run(manifest))
+    def test_valid_full_spec_accepted(
+        self, managed_dgdr: ManagedDGDR, dgdr_image: str, dgdr_model: str
+    ) -> None:
+        """
+        A fully-specified v1beta1 DGDR should pass webhook validation.
+        Exercises every top-level optional field.
+        """
+        manifest = build_dgdr_manifest(
+            unique_dgdr_name("valid-full"),
+            model=dgdr_model,
+            image=dgdr_image,
+            backend="vllm",
+            search_strategy="rapid",
+            sla={"ttft": 200.0, "itl": 20.0},
+            workload={"isl": 3000, "osl": 150},
+            features={
+                "planner": {"plannerPreDeploymentSweeping": "rapid"},
+                "mocker": {"enabled": False},
+            },
+            hardware={"numGpusPerNode": 8},
+            auto_apply=True,
+        )
+        # Should not raise — accepted by the webhook
+        managed_dgdr.run(managed_dgdr.server_dry_run(manifest))
+    def test_v1beta1_is_storage_version(self, dgdr_namespace: str) -> None:
+        """
+        The CRD's storage version must be v1beta1 (it is the conversion hub).
+        """
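+        # .spec.versions[].storage marks the single current storage version;
+        # .status.storedVersions also lists versions that were stored in the past.
+        result = _run_kubectl(
+            [
+                "get",
+                "crd",
+                "dynamographdeploymentrequests.nvidia.com",
+                "-o",
+                "jsonpath={.spec.versions[?(@.storage==true)].name}",
+            ],
+            check=False,
+        )
+        assert result.returncode == 0, f"Failed to get CRD: {result.stderr}"
+        assert (
+            result.stdout.strip() == "v1beta1"
+        ), f"v1beta1 should be the storage version. Got: {result.stdout}"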
+    def test_kubectl_shortname_dgdr_works(self, dgdr_namespace: str) -> None:
+        """
+        kubectl get dgdr must work (tests the shortName 'dgdr' in the CRD).
+        """
+        result = _run_kubectl(
+            ["get", DGDR_SHORT_NAME, "-n", dgdr_namespace, "--ignore-not-found"],
+            check=False,
+        )
+        assert (
+            result.returncode == 0
+        ), f"kubectl get dgdr failed (shortname may not be registered). stderr: {result.stderr}"
+    def test_kubectl_get_columns_schema(
+        self, dgdr_namespace: str, dgdr_image: str, dgdr_model: str, dgdr_factory
+    ) -> None:
+        """
+        kubectl get dgdr should output the columns defined in the CRD:
+        NAME, MODEL, BACKEND, PHASE, PROFILING, DGD, AGE.
+        """
+        name = unique_dgdr_name("col-test")
+        manifest = build_dgdr_manifest(name, model=dgdr_model, image=dgdr_image)
+        dgdr_factory(manifest)
+        result = _run_kubectl(
+            ["get", DGDR_SHORT_NAME, name, "-n", dgdr_namespace],
+            check=False,
+        )
+        assert result.returncode == 0, f"kubectl get dgdr failed: {result.stderr}"
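+        lines = result.stdout.splitlines()
+        # Split into column names so "NAME" cannot match inside another header.
+        header_columns = set(lines[0].upper().split()) if lines else set()
+        # Only a subset of the CRD's printer columns is asserted here.
+        expected_columns = {"NAME", "MODEL", "BACKEND", "PHASE"}
+        for col in sorted(expected_columns):
+            assert (
+                col in header_columns
+            ), f"Expected column {col!r} in kubectl output header. Got: {header_columns}"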
+# ---------------------------------------------------------------------------
+# ── Group 2: v1alpha1 → v1beta1 Version Conversion ─────────────────────────
+# ---------------------------------------------------------------------------
+@pytest.mark.gpu_0
+@pytest.mark.nightly
+@pytest.mark.integration
+@pytest.mark.k8s
+class TestDGDRVersionConversion:
+    """
+    Tests that v1alpha1 DGDR resources can be submitted and are stored
+    transparently as v1beta1 (conversion hub).  No profiling required.
+    """
+    def test_v1alpha1_dgdr_can_be_applied(
+        self, dgdr_namespace: str, dgdr_image: str, dgdr_model: str, dgdr_factory
+    ) -> None:
+        """
+        A v1alpha1 DynamoGraphDeploymentRequest should be accepted and
+        automatically converted to v1beta1 storage by the conversion webhook.
+        Note: v1alpha1 manifests use a different spec shape (profilingConfig
+        instead of image) so we must use kubectl here rather than the
+        v1beta1-only ManagedDGDR client.
+        """
+        name = unique_dgdr_name("v1a1")
+        v1alpha1_manifest = {
+            "apiVersion": "nvidia.com/v1alpha1",
+            "kind": "DynamoGraphDeploymentRequest",
+            "metadata": {"name": name},
+            "spec": {
+                "model": dgdr_model,
+                "backend": "vllm",
+                "profilingConfig": {
+                    "profilerImage": dgdr_image,
+                },
+            },
+        }
+        yaml_str = yaml.dump(v1alpha1_manifest)
+        result = _run_kubectl(
+            ["apply", "-n", dgdr_namespace, "-f", "-"], input=yaml_str, check=False
+        )
+        if result.returncode == 0:
+            # Register for cleanup without re-creating (resource already exists)
+            dgdr_factory.register_for_cleanup(name)
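+        # Either accepted (0) or rejected (1).  kubectl generally exits 1 for any
+        # API-side error, so this check cannot tell a validation rejection from a 500;
+        # it only rules out other exit codes.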
+        assert result.returncode in (
+            0,
+            1,
+        ), f"Unexpected error applying v1alpha1 DGDR: {result.stderr}"
+    def test_v1beta1_get_on_v1alpha1_object(
+        self,
+        managed_dgdr: ManagedDGDR,
+        dgdr_namespace: str,
+        dgdr_image: str,
+        dgdr_model: str,
+        dgdr_factory,
+    ) -> None:
+        """
+        A resource stored as v1beta1 must be retrievable as v1alpha1 via conversion.
+        """
+        name = unique_dgdr_name("conv-get")
+        manifest = build_dgdr_manifest(name, model=dgdr_model, image=dgdr_image)
+        dgdr_factory(manifest)
+        # Retrieve as v1beta1 (storage version) via ManagedDGDR
+        obj_v1beta1 = managed_dgdr.run(managed_dgdr.get(name))
+        assert obj_v1beta1 is not None
+        assert obj_v1beta1["apiVersion"] == DGDR_API_VERSION
+        # Retrieve as v1alpha1 (should trigger conversion webhook).
+        # Must use kubectl here since ManagedDGDR targets v1beta1 only.
+        result = _run_kubectl(
+            [
+                "get",
+                "dynamographdeploymentrequests.v1alpha1.nvidia.com",
+                name,
+                "-n",
+                dgdr_namespace,
+                "-o",
+                "json",
+            ],
+            check=False,
+        )
+        # If the conversion webhook is working, we get a 200 with v1alpha1 resource.
+        # If not registered, we may get a 404 - that is also acceptable here as
+        # some cluster configs only register v1beta1.
+        assert result.returncode in (
+            0,
+            1,
+        ), f"Unexpected failure getting v1alpha1 DGDR: {result.stderr}"
+        if result.returncode == 0:
+            obj_v1alpha1 = json.loads(result.stdout)
+            assert (
+                obj_v1alpha1["apiVersion"] == "nvidia.com/v1alpha1"
+            ), "Retrieved object should have v1alpha1 apiVersion"
--- a/tests/utils/managed_deployment.py
+++ b/tests/utils/managed_deployment.py
@@ -1179,6 +1179,287 @@ class ManagedDeployment:
        await self._cleanup()
+class ManagedDGDR:
+    """Async helper for managing DynamoGraphDeploymentRequest custom resources.
+    Provides CRUD operations and phase-polling against the DGDR CRD using the
+    ``kubernetes_asyncio`` client, following the same patterns as
+    ``ManagedDeployment`` (shared kubeconfig initialisation, timeout logic,
+    structured error messages).
+    Typical usage from a pytest fixture::
+        dgdr = ManagedDGDR(namespace="default")
+        await dgdr.init()
+        await dgdr.create(manifest)
+        phase = await dgdr.wait_for_phase(name, "Ready", timeout=600)
+        await dgdr.delete(name)
+        await dgdr.close()
+    """
+    # CRD coordinates for DGDR
+    DGDR_GROUP = "nvidia.com"
+    DGDR_VERSION = "v1beta1"
+    DGDR_PLURAL = "dynamographdeploymentrequests"
+    # CRD coordinates for DGD (for mocker cleanup)
+    DGD_PLURAL = "dynamographdeployments"
+    DEFAULT_POLL_INTERVAL = 10  # seconds
+    def __init__(
+        self,
+        namespace: str = "default",
+        loop: Optional[asyncio.AbstractEventLoop] = None,
+    ):
+        self.namespace = namespace
+        self._custom_api: Optional[client.CustomObjectsApi] = None
+        self._api_client: Optional[client.ApiClient] = None
+        self._logger = logging.getLogger(self.__class__.__name__)
+        self._loop = loop
+    def run(self, coro):
+        """Run an async coroutine synchronously using the stored event loop.
+        Convenience for callers that are not themselves async (e.g. pytest
+        fixtures and synchronous test methods).
+        """
+        if self._loop is None:
+            raise RuntimeError(
+                "No event loop set on ManagedDGDR; pass loop= at construction"
+            )
+        return self._loop.run_until_complete(coro)
+    async def init(self) -> None:
+        """Initialise the kubernetes_asyncio client.
+        Priority: KUBECONFIG env → in-cluster → ~/.kube/config  (same as
+        ManagedDeployment._init_kubernetes).
+        """
+        kubeconfig_path = os.environ.get("KUBECONFIG")
+        if kubeconfig_path and os.path.exists(kubeconfig_path):
+            self._logger.info("Loading kubeconfig from KUBECONFIG: %s", kubeconfig_path)
+            await config.load_kube_config(config_file=kubeconfig_path)
+        else:
+            try:
+                self._logger.info("Attempting in-cluster kubernetes config")
+                config.load_incluster_config()
+            except Exception as e:
+                self._logger.warning(
+                    "In-cluster config failed (%s: %s), falling back to default kubeconfig",
+                    type(e).__name__,
+                    e,
+                )
+                await config.load_kube_config()
+        self._api_client = client.ApiClient()
+        self._custom_api = client.CustomObjectsApi(self._api_client)
+    async def close(self) -> None:
+        """Close the underlying API client."""
+        if self._api_client:
+            await self._api_client.close()
+            self._api_client = None
+            self._custom_api = None
+    # ----- CRUD -----
+    async def create(self, manifest: dict) -> str:
+        """Create a DGDR custom resource.  Returns the resource name."""
+        assert self._custom_api is not None, "call init() first"
+        name = manifest["metadata"]["name"]
+        await self._custom_api.create_namespaced_custom_object(
+            group=self.DGDR_GROUP,
+            version=self.DGDR_VERSION,
+            namespace=self.namespace,
+            plural=self.DGDR_PLURAL,
+            body=manifest,
+        )
+        self._logger.info("Created DGDR %s/%s", self.namespace, name)
+        return name
+    async def get(self, name: str) -> Optional[dict]:
+        """Get a DGDR as a dict, or ``None`` if not found."""
+        assert self._custom_api is not None, "call init() first"
+        try:
+            return await self._custom_api.get_namespaced_custom_object(
+                group=self.DGDR_GROUP,
+                version=self.DGDR_VERSION,
+                namespace=self.namespace,
+                plural=self.DGDR_PLURAL,
+                name=name,
+            )
+        except exceptions.ApiException as e:
+            if e.status == 404:
+                return None
+            raise
+    async def delete(self, name: str, ignore_not_found: bool = True) -> None:
+        """Delete a DGDR."""
+        assert self._custom_api is not None, "call init() first"
+        try:
+            await self._custom_api.delete_namespaced_custom_object(
+                group=self.DGDR_GROUP,
+                version=self.DGDR_VERSION,
+                namespace=self.namespace,
+                plural=self.DGDR_PLURAL,
+                name=name,
+            )
+            self._logger.info("Deleted DGDR %s/%s", self.namespace, name)
+        except exceptions.ApiException as e:
+            if e.status == 404 and ignore_not_found:
+                return
+            raise
+    async def list(self, label_selector: str = "") -> List[dict]:
+        """List DGDRs, optionally filtered by label selector.  Returns items."""
+        assert self._custom_api is not None, "call init() first"
+        resp = await self._custom_api.list_namespaced_custom_object(
+            group=self.DGDR_GROUP,
+            version=self.DGDR_VERSION,
+            namespace=self.namespace,
+            plural=self.DGDR_PLURAL,
+            label_selector=label_selector,
+        )
+        return resp.get("items", [])
+    async def server_dry_run(self, manifest: dict) -> dict:
+        """Create with server-side dry-run to validate admission webhooks.
+
+        Sends the create request with ``dryRun=All``, so admission runs but
+        nothing is persisted.  Returns the API response dict.  Raises
+        ``ApiException`` on rejection.
+        """
+        assert self._custom_api is not None, "call init() first"
+        return await self._custom_api.create_namespaced_custom_object(
+            group=self.DGDR_GROUP,
+            version=self.DGDR_VERSION,
+            namespace=self.namespace,
+            plural=self.DGDR_PLURAL,
+            body=manifest,
+            dry_run="All",
+        )
+    # ----- Phase helpers -----
+    async def get_phase(self, name: str) -> Optional[str]:
+        """Return ``status.phase`` of the named DGDR.
+
+        Returns ``None`` if the DGDR does not exist or has no phase set yet.
+        """
+        obj = await self.get(name)
+        if obj is None:
+            return None
+        return obj.get("status", {}).get("phase")
+    async def get_condition(self, name: str, condition_type: str) -> Optional[dict]:
+        """Return the named condition dict from ``status.conditions``.
+
+        Returns ``None`` if the DGDR or the condition is not found.
+        """
+        obj = await self.get(name)
+        if obj is None:
+            return None
+        for c in obj.get("status", {}).get("conditions", []):
+            if c.get("type") == condition_type:
+                return c
+        return None
+    async def wait_for_phase(
+        self,
+        name: str,
+        target_phase: str,
+        timeout: int = 3600,
+        fail_fast_phases: Optional[List[str]] = None,
+        poll_interval: int = DEFAULT_POLL_INTERVAL,
+    ) -> str:
+        """Poll until the DGDR reaches *target_phase* or times out.
+
+        Returns the final observed phase, which is always *target_phase*.
+        Raises ``AssertionError`` if the DGDR enters one of *fail_fast_phases*
+        (default ``["Failed"]``) and ``TimeoutError`` on timeout.
+        """
+        if fail_fast_phases is None:
+            fail_fast_phases = ["Failed"]
+        deadline = time.monotonic() + timeout
+        last_phase: Optional[str] = None
+        while time.monotonic() < deadline:
+            current = await self.get_phase(name)
+            if current != last_phase:
+                self._logger.info("DGDR %s/%s phase: %s", self.namespace, name, current)
+                last_phase = current
+            if current == target_phase:
+                return current
+            if current in fail_fast_phases:
+                obj = await self.get(name)
+                conditions = obj.get("status", {}).get("conditions", []) if obj else []
+                raise AssertionError(
+                    f"DGDR {self.namespace}/{name} reached fail-fast phase {current!r} "
+                    f"while waiting for {target_phase!r}. conditions={conditions}"
+                )
+            await asyncio.sleep(poll_interval)
+        raise TimeoutError(
+            f"Timed out after {timeout}s waiting for DGDR {self.namespace}/{name} "
+            f"to reach phase {target_phase!r}. Last phase: {last_phase!r}"
+        )
+    async def wait_for_any_phase(
+        self,
+        name: str,
+        target_phases: List[str],
+        timeout: int = 3600,
+        poll_interval: int = DEFAULT_POLL_INTERVAL,
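+    ) -> str:
+        """Poll until the DGDR reaches any of *target_phases*.
+
+        Returns the matched phase.  Raises ``TimeoutError`` on timeout.  There
+        is no fail-fast here, so a DGDR stuck in a phase outside
+        *target_phases* is polled until the timeout expires.
+        """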
+        deadline = time.monotonic() + timeout
+        last_phase: Optional[str] = None
+        while time.monotonic() < deadline:
+            current = await self.get_phase(name)
+            if current != last_phase:
+                self._logger.info("DGDR %s/%s phase: %s", self.namespace, name, current)
+                last_phase = current
+            if current in target_phases:
+                return current
+            await asyncio.sleep(poll_interval)
+        raise TimeoutError(
+            f"Timed out after {timeout}s waiting for DGDR {self.namespace}/{name} "
+            f"to reach any of {target_phases!r}. Last phase: {last_phase!r}"
+        )
+    # ----- DGD helpers (for mocker cleanup) -----
+    async def delete_dgd(self, name: str, ignore_not_found: bool = True) -> None:
+        """Delete a DynamoGraphDeployment resource."""
+        assert self._custom_api is not None, "call init() first"
+        try:
+            await self._custom_api.delete_namespaced_custom_object(
+                group=self.DGDR_GROUP,
+                version="v1alpha1",
+                namespace=self.namespace,
+                plural=self.DGD_PLURAL,
+                name=name,
+            )
+            self._logger.info("Deleted DGD %s/%s", self.namespace, name)
+        except exceptions.ApiException as e:
+            if e.status == 404 and ignore_not_found:
+                return
+            raise
+    async def get_dgd(self, name: str) -> Optional[dict]:
+        """Get a DynamoGraphDeployment, or ``None`` if not found."""
+        assert self._custom_api is not None, "call init() first"
+        try:
+            return await self._custom_api.get_namespaced_custom_object(
+                group=self.DGDR_GROUP,
+                version="v1alpha1",
+                namespace=self.namespace,
+                plural=self.DGD_PLURAL,
+                name=name,
+            )
+        except exceptions.ApiException as e:
+            if e.status == 404:
+                return None
+            raise
 async def main():
    LOG_FORMAT = "[TEST] %(asctime)s %(levelname)s %(name)s: %(message)s"
    DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"