Commit 597b7249 authored by Ashna Mehrotra, committed by GitHub

feat: add DGDR test suite (#7343)


Signed-off-by: ashnamehrotra <ashnamehrotra@gmail.com>
parent 8dfed173
# DGDR v1beta1 End-to-End Test Suite
This directory contains the end-to-end test suite for **DynamoGraphDeploymentRequest
(DGDR) v1beta1** — the high-level, SLA-driven Kubernetes API for deploying
inference models with Dynamo.
## What's tested
| Test group | Marker(s) | GPU req | Mocker OK? | What it covers |
|---|---|---|---|---|
| `TestDGDRValidation` | `gpu_0`, `nightly`, `integration`, `k8s` | None | ✅ | Webhook validation: rejected/accepted specs, value enforcement, storage version, shortname |
| `TestDGDRVersionConversion` | `gpu_0`, `nightly`, `integration`, `k8s` | None | ✅ | v1alpha1 → v1beta1 conversion webhook |
| `TestDGDRMinimalDeployment` | `gpu_1`, `pre_merge`, `e2e` | 1+ | ⚠️ see note | Full Pending → Profiling → Ready → Deploying → Deployed lifecycle |
| `TestDGDRBackendSelection` | `gpu_1`, `nightly`, `e2e` | 1+ | ⚠️ vllm+trtllm only | vllm and trtllm pass; sglang **skipped** (no AIC silicon data for sglang on the mocker GPU SKU) |
| `TestDGDRSearchStrategies` | `gpu_1`/`gpu_8`, `e2e` | 1 or 8 | ⚠️ rapid only | `rapid` uses AIC and works; `thorough` requires real GPU sweeps |
| `TestDGDRSLATargets` | `gpu_1`, `nightly`, `e2e` | 1+ | ✅ | ttft+itl, e2eLatency, optimizationType (latency/throughput) |
| `TestDGDRWorkloadPickingModes` | `gpu_1`, `nightly`, `e2e` | 1+ | ✅ | requestRate, concurrency, isl/osl |
| `TestDGDRFeatures` | `gpu_1`, `nightly`, `e2e` | 1+ | ⚠️ see note | planner (rapid/none sweep), mocker |
| `TestDGDRModelCache` | `gpu_1`, `nightly`, `e2e` | 1+ | ✅ | PVC-backed model cache, cache propagated to DGD |
| `TestDGDRHardwareOverride` | `gpu_1`, `pre_merge`, `e2e` | 1+ | ✅ | Manual gpuSku/numGpusPerNode/totalGpus/vramMb |
| `TestDGDRAutoApply` | `gpu_1`, `pre_merge`, `e2e` | 1+ | ⚠️ see note | autoApply=true **skipped** in mocker (operator race); autoApply=false keeps Ready |
| `TestDGDROverrides` | `gpu_1`, `nightly`, `e2e` | 1+ | ✅ | Profiling job tolerations; DGD metadata label merging **xfail** (operator gap) |
| `TestDGDRStatusAndConditions` | `gpu_1`, `pre_merge`, `e2e` | 1+ | ⚠️ see note | All conditions set correctly, sub-phases tracked, Pareto configs; all-conditions **xfail** in mocker; pareto **skipped** in mocker |
| `TestDGDRImmutability` | mixed | 0–1 | ⚠️ see note | Spec rejected in Profiling/Deployed, metadata always allowed |
| `TestDGDRCleanup` | `gpu_1`, `pre_merge`, `e2e` | 1+ | ⚠️ see note | Job deleted with DGDR; DGD preserved; ConfigMap cleanup **xfail** (operator gap); DGD-persistence test **skipped** in mocker |
| `TestDGDRMoEModels` | `gpu_8`, `nightly`, `e2e` | 8 | ❌ | DeepSeek-R1 MoE on SGLang — requires real 8-GPU node |
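
The lifecycle tests assert that a DGDR moves forward through the phases named in the table. As a rough illustration only (the helper name `is_valid_progression` is hypothetical and not part of the suite), a forward-only check over observed phases could look like this:

```python
# Phase order taken from the lifecycle row in the table above.
# The helper itself is an illustrative sketch, not code from the suite.
LIFECYCLE = ["Pending", "Profiling", "Ready", "Deploying", "Deployed"]


def is_valid_progression(observed: list[str]) -> bool:
    """Return True if the observed phases only ever move forward.

    Phases may repeat or be skipped (polling can miss a short-lived phase),
    but they must never go backwards. Unknown phases such as "Failed"
    are rejected.
    """
    last_index = -1
    for phase in observed:
        if phase not in LIFECYCLE:
            return False
        index = LIFECYCLE.index(phase)
        if index < last_index:
            return False
        last_index = index
    return True


forward = is_valid_progression(["Pending", "Profiling", "Ready"])
backward = is_valid_progression(["Ready", "Profiling"])
print(forward, backward)
```

Allowing skipped phases is a deliberate choice here: with a poll interval of several seconds, a test may never observe a phase that the operator passes through quickly.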
## Prerequisites
1. A running Kubernetes cluster with GPU nodes (or see [GPU-free mode](#gpu-free-mode-default) below)
2. The Dynamo operator installed (including CRDs and webhooks)
3. `kubectl` configured and pointing at the cluster
4. Python 3.10+ with `pytest` and `pyyaml` installed:
```bash
pip install pytest pyyaml
# or, from the repo root:
pip install -e ".[test]"
```
## One-time cluster setup
Before running any tests, ensure the following are in place in your cluster.
These are required even for GPU-free (mocker) mode.
### 1. Install the Dynamo operator
```bash
cd deploy/operator
helm install dynamo-operator helm/dynamo-operator -n dynamo-system --create-namespace
```
### 2. Deploy NATS
Mocker workers (and real workers) connect to NATS for inter-component messaging.
The operator expects NATS at `nats://dynamo-operator-nats.dynamo-system.svc.cluster.local:4222`.
```bash
helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm repo update
helm install dynamo-operator-nats nats/nats -n dynamo-system --create-namespace
```
### 3. Create the HuggingFace token secret
The profiling job reads the HF token from a secret named `hf-token-secret` using the
key `HF_TOKEN` (not `HUGGING_FACE_HUB_TOKEN`).
```bash
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=<your-hf-token> \
-n default
# If running in a non-default namespace, adjust -n accordingly
```
> **Important:** The key must be `HF_TOKEN`. The secret name must be `hf-token-secret`.
> Using a different key name will cause the profiling job to fail silently.
## Running the tests
There are two main ways to run the suite depending on whether you have GPU hardware.
---
### GPU-free (mocker mode) — recommended for local development and CI
No GPU nodes required. Uses AIC simulation for profiling and mock inference workers
for deployment. Covers all `gpu_0` and `gpu_1` tests (~45 tests); `gpu_8` tests are
excluded because they require a real 8-GPU node even in mocker mode.
```bash
python3 -m pytest tests/dgdr/ -m "gpu_0 or gpu_1" -v \
--dgdr-namespace=default \
--dgdr-image=<your-image>
```
Expect: 37 passed, 6 skipped (2 model-cache PVC; sglang backend; pareto in mocker; DGD-persistence in mocker; auto-apply-true in mocker), 4 xfail (DGD label merging; all-conditions requires Deployed; dry-run immutability requires Deployed; ConfigMap cleanup on deletion).
`test_backend[sglang]` is one of the 6 skips (no AIC silicon data for sglang in mocker mode).
---
### Full suite with real GPUs — for production/nightly validation
Requires a Kubernetes cluster with GPU nodes. Set `--dgdr-no-mocker` to disable mocker
injection and run against real hardware. `gpu_8` tests additionally require an 8-GPU node.
```bash
# gpu_0 + gpu_1 tests on real GPUs (single-GPU node sufficient)
python3 -m pytest tests/dgdr/ -m "gpu_0 or gpu_1" -v \
--dgdr-namespace=dynamo-test \
--dgdr-image=<your-image> \
--dgdr-no-mocker \
--dgdr-profiling-timeout=3600 \
--dgdr-deploy-timeout=1800
# Full nightly suite including 8-GPU tests
python3 -m pytest tests/dgdr/ -v \
--dgdr-namespace=dynamo-test \
--dgdr-image=<your-image> \
--dgdr-no-mocker \
--dgdr-pvc-name=model-cache \
--dgdr-profiling-timeout=14400 \
--dgdr-deploy-timeout=3600
```
Expect (gpu_0 + gpu_1, with `--dgdr-pvc-name`): **~43 passed, 0 skipped, 2 xfail** (DGD label-merging operator gap; ConfigMap cleanup operator gap).
Without `--dgdr-pvc-name`: 2 additional skips for the model-cache tests.
> **Note:** Two xfails are **permanent operator gaps** that persist in both mocker and GPU mode:
>
> - `test_dgd_override_injects_custom_labels`: the operator does not yet merge `spec.overrides.dgd.metadata.labels` onto the created DGD.
> - `test_deletion_removes_output_configmap`: the operator's `FinalizeResource` is a no-op and does not delete the output ConfigMap on DGDR deletion.
>
> All other mocker-mode xfails/skips disappear in GPU mode and are expected to pass.
---
### Other useful invocations
```bash
# Validation + conversion tests only (requires only the operator CRDs and webhooks)
python3 -m pytest tests/dgdr/ -m "gpu_0" -v \
--dgdr-namespace=default \
--dgdr-image=<your-image>
# Pre-merge gate (GPU-free)
python3 -m pytest tests/dgdr/ -m "pre_merge" -v \
--dgdr-namespace=default \
--dgdr-image=<your-image>
# Single test class
python3 -m pytest tests/dgdr/test_dgdr_v1beta1.py::TestDGDRAutoApply -v \
--dgdr-namespace=default \
--dgdr-image=<your-image>
```
## CLI options
| Option | Default | Description |
|---|---|---|
| `--dgdr-namespace` | _(required)_ | Kubernetes namespace for test resources |
| `--dgdr-image` | _(required)_ | Container image for profiling and inference workers |
| `--dgdr-model` | `Qwen/Qwen3-0.6B` | HuggingFace model ID used by most tests |
| `--dgdr-backend` | `vllm` | Default backend for DGDR tests |
| `--dgdr-pvc-name` | _(empty)_ | PVC name holding pre-downloaded model weights (PVC tests are skipped if unset) |
| `--dgdr-profiling-timeout` | `3600` | Seconds to wait for profiling to complete |
| `--dgdr-deploy-timeout` | `600` | Seconds to wait for DGD to reach Deployed phase |
| `--dgdr-no-mocker` | `false` | Disable mocker mode (require real GPU nodes) |
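
For orientation, options like these are normally registered through pytest's `pytest_addoption` hook in `conftest.py`. The sketch below mirrors the names and defaults in the table, but it is an assumption about how the suite could wire them up, not a copy of the real `conftest.py` (which is collapsed in this diff). The helper `register_dgdr_options` is hypothetical, and whether the real suite marks the first two options with `required=True` or validates them in a fixture is not known from this page.

```python
import argparse


def register_dgdr_options(add_option) -> None:
    """Register the DGDR options through any argparse-style callable.

    `add_option` can be pytest's `parser.addoption` or
    `argparse.ArgumentParser.add_argument`; both accept these keywords.
    """
    add_option("--dgdr-namespace", required=True,
               help="Kubernetes namespace for test resources")
    add_option("--dgdr-image", required=True,
               help="Container image for profiling and inference workers")
    add_option("--dgdr-model", default="Qwen/Qwen3-0.6B",
               help="HuggingFace model ID used by most tests")
    add_option("--dgdr-backend", default="vllm",
               help="Default backend for DGDR tests")
    add_option("--dgdr-pvc-name", default="",
               help="PVC holding pre-downloaded model weights")
    add_option("--dgdr-profiling-timeout", type=int, default=3600,
               help="Seconds to wait for profiling to complete")
    add_option("--dgdr-deploy-timeout", type=int, default=600,
               help="Seconds to wait for the DGD to reach Deployed")
    add_option("--dgdr-no-mocker", action="store_true", default=False,
               help="Disable mocker mode (require real GPU nodes)")


def pytest_addoption(parser) -> None:
    """pytest hook: forward to the shared registration helper."""
    register_dgdr_options(parser.addoption)


# Stand-alone demonstration using argparse instead of pytest.
demo = argparse.ArgumentParser()
register_dgdr_options(demo.add_argument)
args = demo.parse_args(
    ["--dgdr-namespace", "default", "--dgdr-image", "registry.example/dynamo:dev"]
)
print(args.dgdr_model, args.dgdr_deploy_timeout, args.dgdr_no_mocker)
```

The image reference `registry.example/dynamo:dev` is a placeholder. Keeping registration in a helper that takes a callable makes the defaults easy to check without starting a pytest session.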
## DGDR v1beta1 feature coverage matrix
The following spec fields are exercised by at least one test:
| Field | Tests that exercise it |
|---|---|
| `spec.model` | All tests |
| `spec.backend` (auto/vllm/sglang/trtllm) | `TestDGDRBackendSelection`, `TestDGDRValidation` |
| `spec.image` | All tests |
| `spec.searchStrategy` (rapid/thorough) | `TestDGDRSearchStrategies` |
| `spec.sla.ttft` + `spec.sla.itl` | `TestDGDRSLATargets::test_sla_ttft_and_itl` |
| `spec.sla.e2eLatency` | `TestDGDRSLATargets::test_sla_e2e_latency` |
| `spec.sla.optimizationType` | `TestDGDRSLATargets::test_sla_optimization_type_*` |
| `spec.workload.isl` + `spec.workload.osl` | `TestDGDRWorkloadPickingModes` |
| `spec.workload.requestRate` | `TestDGDRWorkloadPickingModes::test_request_rate_picking` |
| `spec.workload.concurrency` | `TestDGDRWorkloadPickingModes::test_concurrency_picking` |
| `spec.features.planner` (opaque config) | `TestDGDRFeatures::test_planner_enabled_*` |
| `spec.features.mocker.enabled` | `TestDGDRFeatures::test_mocker_enabled` |
| `spec.modelCache.pvcName` | `TestDGDRModelCache` |
| `spec.hardware.gpuSku` | `TestDGDRHardwareOverride::test_hardware_manual_override` |
| `spec.hardware.numGpusPerNode` | `TestDGDRHardwareOverride` |
| `spec.hardware.totalGpus` / `spec.hardware.vramMb` | `TestDGDRHardwareOverride::test_hardware_total_gpus_and_vram` |
| `spec.autoApply` | `TestDGDRAutoApply` |
| `spec.overrides.profilingJob` | `TestDGDROverrides::test_profiling_job_toleration_override` |
| `spec.overrides.dgd` | `TestDGDROverrides::test_dgd_override_injects_custom_labels` |
| `status.phase` | All lifecycle tests |
| `status.profilingPhase` | `TestDGDRStatusAndConditions::test_profiling_sub_phase_tracked` |
| `status.profilingJobName` | `TestDGDRStatusAndConditions::test_profiling_job_name_populated` |
| `status.dgdName` | `TestDGDRAutoApply`, `TestDGDRMinimalDeployment` |
| `status.profilingResults.selectedConfig` | Multiple |
| `status.profilingResults.pareto` | `TestDGDRStatusAndConditions::test_pareto_configs_in_profiling_results` |
| `status.deploymentInfo` | `TestDGDRMinimalDeployment` |
| `status.conditions` (all types) | `TestDGDRStatusAndConditions` |
| `status.observedGeneration` | `TestDGDRStatusAndConditions::test_observed_generation_tracks_spec` |
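
To make the field names concrete, the following sketch assembles a v1beta1 manifest as a plain dictionary using the same spec values as `test_valid_full_spec_accepted`. The helper `sketch_dgdr_manifest` is hypothetical (the suite's own `build_dgdr_manifest` lives in the collapsed `conftest.py` and may differ), and the image reference is a placeholder.

```python
import json


def sketch_dgdr_manifest(name: str, model: str, image: str, **spec_fields) -> dict:
    """Build a minimal v1beta1 DGDR manifest; extra keyword args go into spec."""
    manifest = {
        "apiVersion": "nvidia.com/v1beta1",
        "kind": "DynamoGraphDeploymentRequest",
        "metadata": {"name": name},
        "spec": {"model": model, "image": image},
    }
    manifest["spec"].update(spec_fields)
    return manifest


manifest = sketch_dgdr_manifest(
    "example-dgdr",
    model="Qwen/Qwen3-0.6B",
    image="registry.example/dynamo:dev",  # placeholder image reference
    backend="vllm",
    searchStrategy="rapid",
    sla={"ttft": 200.0, "itl": 20.0},
    workload={"isl": 3000, "osl": 150},
    features={
        "planner": {"plannerPreDeploymentSweeping": "rapid"},
        "mocker": {"enabled": False},
    },
    hardware={"numGpusPerNode": 8},
    autoApply=True,
)
print(json.dumps(manifest, indent=2))
```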
## GPU-free mode (default)
By default, the test suite runs the full DGDR lifecycle **without any GPU nodes**
by combining two simulation features:
| Feature | How it's enabled | Which phase it affects |
|---|---|---|
| **AIC (AI Configurator)** | `searchStrategy: rapid` (the default) | **Profiling** — profiler runs CPU-only simulation instead of online GPU sweep |
| **Mocker** | Enabled by default (disable with `--dgdr-no-mocker`) | **Deployment** — DGD uses mock inference workers (no GPU resources requested) |
**How it works:**
- `searchStrategy: rapid` is the default for v1beta1 DGDRs. The profiler automatically
uses AI Configurator (AIC) simulation when rapid is set — no additional config needed.
- Mocker mode is **enabled by default**. The `dgdr_factory` fixture automatically injects
`spec.features.mocker.enabled: true` and a default `spec.hardware` config into every DGDR.
- AIC profiling creates a Kubernetes Job that runs CPU-only (job prefix: `profile-aic-`).
The profiling pod does not request GPU resources.
- Mocker deployment selects the profiler's `mocker_config_with_planner.yaml` output
instead of the real deployment config, resulting in DGD pods that don't request GPUs.
- Pass `--dgdr-no-mocker` to disable mocker mode and run against real GPU hardware.
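
The injection step described above can be pictured roughly as follows. This is a sketch under stated assumptions: the function name `inject_mocker_defaults` is hypothetical, `gpuSku: a100_sxm` is the value this README says the fixture injects, and the `numGpusPerNode` default is purely illustrative. Whether the real fixture preserves hardware values set explicitly by a test is also an assumption.

```python
import copy

# gpuSku comes from this README; numGpusPerNode is an illustrative guess.
ASSUMED_DEFAULT_HARDWARE = {"gpuSku": "a100_sxm", "numGpusPerNode": 8}


def inject_mocker_defaults(manifest: dict, use_mocker: bool = True) -> dict:
    """Return a copy of the manifest with mocker mode and default hardware set.

    With use_mocker=False (the --dgdr-no-mocker case) the manifest is
    returned unchanged apart from being copied.
    """
    result = copy.deepcopy(manifest)
    if not use_mocker:
        return result
    spec = result.setdefault("spec", {})
    spec.setdefault("features", {}).setdefault("mocker", {})["enabled"] = True
    hardware = spec.setdefault("hardware", {})
    for key, value in ASSUMED_DEFAULT_HARDWARE.items():
        # Keep any hardware value the test set explicitly.
        hardware.setdefault(key, value)
    return result


original = {"spec": {"model": "Qwen/Qwen3-0.6B", "hardware": {"numGpusPerNode": 4}}}
injected = inject_mocker_defaults(original)
untouched = inject_mocker_defaults(original, use_mocker=False)
print(injected["spec"]["features"], injected["spec"]["hardware"])
```

Working on a deep copy keeps the caller's manifest intact, which matters when one manifest is reused across parametrized tests.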
> **Note:** Some test assertions (e.g., status.deploymentInfo.gpuCount, pareto configs)
> may produce different values under mocker than under real GPU profiling.
> The tests are written to validate structure and phase transitions, not exact
> profiling output values, so they work correctly in both modes.
>
> **Note:** `searchStrategy: thorough` requires online (GPU) profiling even with mocker,
> since thorough performs real benchmark measurements. Use rapid for GPU-free testing.
>
> **Note:** `TestDGDRFeatures::test_planner_enabled_with_rapid_sweep` runs with
> `auto_apply=False` in mocker mode. The root cause is the same as in the next note: the
> operator pre-sets `Status.DGDName` from the profiling output and then immediately fires
> `handleDGDDeleted` when the DGD cannot be found. In mocker mode the test only
> validates that spec generation succeeds (waits for `PHASE_READY` and checks `dgdName`
> + `selectedConfig`). Full deployment with rapid sweeping is verified outside mocker
> mode. `test_planner_enabled_no_pre_deployment_sweep` and `test_mocker_enabled` are
> likewise restricted to `PHASE_READY` in mocker mode.
>
> **Note:** `auto_apply=True` consistently hits `handleDGDDeleted` in mocker mode. The
> operator's `generateDGDSpec` pre-populates `Status.DGDName` from the profiling output
> (e.g. `mocker-disagg`) _before_ the DGD is actually created. When `handleDeployingPhase`
> then runs it checks `DGDName != ""` and immediately tries to GET that DGD; since it does
> not exist yet it fires `handleDGDDeleted` and the DGDR transitions to Failed.
> All tests that would enter the Deploying phase in mocker mode therefore use
> `auto_apply=False`/`PHASE_READY` instead (minimal lifecycle, backend selection,
> mocker feature, planner-no-sweep, planner-rapid-sweep, DGD label override).
> Tests whose sole purpose is to verify `auto_apply=True` DGD creation are skipped in
> mocker mode (`test_auto_apply_true_creates_dgd_automatically`,
> `test_deletion_does_not_remove_created_dgd`).
> Non-mocker mode (real GPU cluster) is unaffected.
>
> **Note:** `TestDGDRImmutability::test_spec_immutable_in_deployed_via_dry_run` is **xfail**
> in mocker mode. The test relies on the session `deployed_dgdr` fixture which, in mocker
> mode, stops at `PHASE_READY` instead of `PHASE_DEPLOYED`. The webhook's
> `ValidateUpdate` immutability enforcement only activates when the DGDR is in `Deployed`
> phase, so the server-dry-run mutation is accepted rather than rejected.
>
> **Note:** `gpu_8` tests cannot be run with mocker and require a real 8-GPU node.
> `TestDGDRSearchStrategies::test_thorough_strategy_completes` uses `searchStrategy: thorough`
> which performs real GPU benchmark sweeps. `TestDGDRMoEModels` (DeepSeek-R1) requires 8 GPUs
> for the real inference workload. Exclude them from GPU-free runs with `-m "gpu_0 or gpu_1"`.
### AIC silicon data availability
AIC operates in **silicon mode**: it looks up pre-recorded per-op performance data
files shipped inside the `aiconfigurator` Python package. These files are organised
by `{gpu_sku}/{backend}/{backend_version}/`. The mocker fixture injects
`gpuSku: a100_sxm` into every DGDR, and the package ships vllm and trtllm data for that SKU but no sglang data:
| Backend | a100_sxm data? | Mocker result |
|---|---|---|
| `vllm` | ✅ present | Profiling succeeds |
| `trtllm` | ✅ present | Profiling succeeds |
| `sglang` | ❌ missing | Test **skipped** automatically (no `sglang/0.5.8` perf data for `a100_sxm`) |
To test sglang, run against a real GPU cluster (`--dgdr-no-mocker`) where AIC
can use a GPU SKU for which those data files are present.
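
The skip decision follows directly from that directory layout. As a hedged sketch (the function `has_silicon_data` and the data root are hypothetical, this is not the `aiconfigurator` API, and the vllm version string `x.y.z` is a placeholder), a presence check could be written as:

```python
import tempfile
from pathlib import Path


def has_silicon_data(data_root, gpu_sku: str, backend: str, backend_version: str) -> bool:
    """Return True if {gpu_sku}/{backend}/{backend_version}/ exists and is non-empty."""
    perf_dir = Path(data_root) / gpu_sku / backend / backend_version
    return perf_dir.is_dir() and any(perf_dir.iterdir())


# Build a throwaway tree that has vllm data for a100_sxm but no sglang data.
with tempfile.TemporaryDirectory() as root:
    vllm_dir = Path(root) / "a100_sxm" / "vllm" / "x.y.z"  # placeholder version
    vllm_dir.mkdir(parents=True)
    (vllm_dir / "perf.csv").write_text("placeholder\n")

    vllm_ok = has_silicon_data(root, "a100_sxm", "vllm", "x.y.z")
    sglang_ok = has_silicon_data(root, "a100_sxm", "sglang", "0.5.8")

print(vllm_ok, sglang_ok)
```

A test could call such a check up front and skip when it returns `False`, rather than waiting for the profiling job to fail.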
## Cleanup
Tests clean up their own DGDRs via the `dgdr_factory` fixture. If a test is
interrupted, resources can be cleaned up manually:
```bash
# Delete all DGDRs created by the test suite (they are labelled automatically)
kubectl delete dgdr -n default -l "test.dynamo/managed=true"
# If you used a custom namespace:
kubectl delete dgdr -n <namespace> -l "test.dynamo/managed=true"
```
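
If the manual cleanup needs to be scripted, the same command can be built in Python. This is only a convenience sketch; the helper name `cleanup_command` is hypothetical, while the label selector is the one shown above.

```python
import shlex


def cleanup_command(namespace: str, label: str = "test.dynamo/managed=true") -> list[str]:
    """Build the kubectl argv that deletes all test-managed DGDRs in a namespace."""
    return ["kubectl", "delete", "dgdr", "-n", namespace, "-l", label]


argv = cleanup_command("default")
print(shlex.join(argv))
# To actually run it against a cluster:
#   import subprocess
#   subprocess.run(argv, check=True)
```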
## Architecture notes
- Tests interact with the cluster through the async `ManagedDGDR` client (built on
`kubernetes_asyncio`) and fall back to `kubectl` subprocess calls where the client does
not apply, such as CRD inspection and v1alpha1 resources.
- The `dgdr_factory` fixture ensures DGDR cleanup via `yield` regardless of test
outcome.
- Tests that require an optional PVC (`--dgdr-pvc-name`) skip automatically when the
option is not provided.
- Timeout values are configurable to accommodate clusters with varying profiling speeds.
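
The cleanup-on-yield pattern mentioned above can be sketched without pytest. Everything here is illustrative: `FakeClient` stands in for a real cluster client and `dgdr_factory_sketch` is a hypothetical name. In a real pytest fixture the code after `yield` runs as teardown whatever the test outcome; the `try`/`finally` below is what gives a plain context manager the same guarantee.

```python
from contextlib import contextmanager


class FakeClient:
    """In-memory stand-in for a cluster client (illustration only)."""

    def __init__(self) -> None:
        self.live: set[str] = set()

    def create(self, manifest: dict) -> str:
        name = manifest["metadata"]["name"]
        self.live.add(name)
        return name

    def delete(self, name: str) -> None:
        self.live.discard(name)


@contextmanager
def dgdr_factory_sketch(client: FakeClient):
    """Yield a factory that creates resources and deletes them all on exit."""
    created: list[str] = []

    def factory(manifest: dict) -> str:
        name = client.create(manifest)
        created.append(name)
        return name

    try:
        yield factory
    finally:
        # Delete in reverse creation order, even if the body raised.
        for name in reversed(created):
            client.delete(name)


client = FakeClient()
try:
    with dgdr_factory_sketch(client) as factory:
        factory({"metadata": {"name": "dgdr-a"}})
        during = set(client.live)
        raise RuntimeError("simulated test failure")
except RuntimeError:
    pass
after = set(client.live)
print(during, after)
```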
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""
Webhook validation and version conversion tests for DGDR v1beta1.
These tests verify that:
- The admission webhook correctly accepts/rejects DGDR specs (TestDGDRValidation)
- v1alpha1 resources are transparently converted to v1beta1 (TestDGDRVersionConversion)
No GPU or cluster profiling is required (gpu_0 only). The only prerequisite is a
running Kubernetes cluster with the Dynamo operator CRDs and webhooks installed.
Run:
pytest tests/dgdr/test_dgdr_validation.py -m gpu_0 -v --dgdr-namespace=default --dgdr-image=<image>
Test markers:
gpu_0 No GPU required
nightly Requires live K8s cluster (not run in general pre-merge CI)
integration Integration-level (uses live webhook)
"""
from __future__ import annotations
import json
import logging
import pytest
import yaml
from kubernetes_asyncio.client import exceptions as k8s_exceptions
from tests.dgdr.conftest import (
DGDR_API_VERSION,
DGDR_SHORT_NAME,
_run_kubectl,
build_dgdr_manifest,
unique_dgdr_name,
)
from tests.utils.managed_deployment import ManagedDGDR
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# ── Group 1: Webhook Validation (gpu_0, no profiling required) ──────────────
# ---------------------------------------------------------------------------


@pytest.mark.gpu_0
@pytest.mark.nightly
@pytest.mark.integration
@pytest.mark.k8s
class TestDGDRValidation:
    """
    Tests that verify the admission webhook correctly validates DGDR specs
    before they are persisted. Most of these tests use server-side dry-run,
    so no resources are actually created; the column-schema test creates a
    real DGDR through dgdr_factory.
    """

    def test_missing_model_rejected(
        self, managed_dgdr: ManagedDGDR, dgdr_image: str
    ) -> None:
        """
        A DGDR without spec.model must be rejected by the webhook.
        The model field is the only hard-required spec field in v1beta1.
        """
        manifest = build_dgdr_manifest(
            unique_dgdr_name("no-model"),
            model="",  # intentionally empty
            image=dgdr_image,
        )
        # Remove the key so the field is absent; pop() tolerates a builder
        # that already omitted the empty value.
        manifest["spec"].pop("model", None)
        with pytest.raises(k8s_exceptions.ApiException):
            managed_dgdr.run(managed_dgdr.server_dry_run(manifest))

    def test_thorough_with_auto_backend_rejected(
        self, managed_dgdr: ManagedDGDR, dgdr_image: str, dgdr_model: str
    ) -> None:
        """
        searchStrategy: thorough + backend: auto must be rejected.
        'thorough' sweeps real GPU engines and requires a concrete backend.
        """
        manifest = build_dgdr_manifest(
            unique_dgdr_name("thorough-auto"),
            model=dgdr_model,
            image=dgdr_image,
            backend="auto",
            search_strategy="thorough",
        )
        with pytest.raises(k8s_exceptions.ApiException) as exc_info:
            managed_dgdr.run(managed_dgdr.server_dry_run(manifest))
        error_body = str(exc_info.value)
        assert (
            "auto" in error_body.lower()
            or "backend" in error_body.lower()
            or "thorough" in error_body.lower()
        ), f"Error message should mention backend/thorough incompatibility. Got: {error_body}"

    def test_invalid_backend_rejected(
        self, managed_dgdr: ManagedDGDR, dgdr_image: str, dgdr_model: str
    ) -> None:
        """
        An unknown backend value must be rejected by the admission webhook.
        Valid values: auto, vllm, sglang, trtllm.
        """
        manifest = build_dgdr_manifest(
            unique_dgdr_name("bad-backend"),
            model=dgdr_model,
            image=dgdr_image,
            backend="unknown_backend",
        )
        with pytest.raises(k8s_exceptions.ApiException):
            managed_dgdr.run(managed_dgdr.server_dry_run(manifest))

    def test_invalid_search_strategy_rejected(
        self, managed_dgdr: ManagedDGDR, dgdr_image: str, dgdr_model: str
    ) -> None:
        """
        An unknown searchStrategy value must be rejected by the admission webhook.
        """
        manifest = build_dgdr_manifest(
            unique_dgdr_name("bad-strategy"),
            model=dgdr_model,
            image=dgdr_image,
            search_strategy="superfast",  # not a valid strategy
        )
        with pytest.raises(k8s_exceptions.ApiException):
            managed_dgdr.run(managed_dgdr.server_dry_run(manifest))

    def test_invalid_optimization_type_rejected(
        self, managed_dgdr: ManagedDGDR, dgdr_image: str, dgdr_model: str
    ) -> None:
        """
        An invalid sla.optimizationType value must be rejected by the
        admission webhook. Valid values: latency, throughput.
        """
        manifest = build_dgdr_manifest(
            unique_dgdr_name("bad-opt-type"),
            model=dgdr_model,
            image=dgdr_image,
            sla={"optimizationType": "cost"},  # not valid
        )
        with pytest.raises(k8s_exceptions.ApiException):
            managed_dgdr.run(managed_dgdr.server_dry_run(manifest))

    def test_valid_minimal_dgdr_accepted(
        self, managed_dgdr: ManagedDGDR, dgdr_image: str, dgdr_model: str
    ) -> None:
        """
        A DGDR with only the required fields (model + image) must pass validation.
        All other fields have defaults and are optional.
        """
        manifest = build_dgdr_manifest(
            unique_dgdr_name("valid-minimal"),
            model=dgdr_model,
            image=dgdr_image,
        )
        # Should not raise: accepted by the webhook
        managed_dgdr.run(managed_dgdr.server_dry_run(manifest))

    def test_valid_full_spec_accepted(
        self, managed_dgdr: ManagedDGDR, dgdr_image: str, dgdr_model: str
    ) -> None:
        """
        A fully-specified v1beta1 DGDR should pass webhook validation.
        Exercises every top-level optional field.
        """
        manifest = build_dgdr_manifest(
            unique_dgdr_name("valid-full"),
            model=dgdr_model,
            image=dgdr_image,
            backend="vllm",
            search_strategy="rapid",
            sla={"ttft": 200.0, "itl": 20.0},
            workload={"isl": 3000, "osl": 150},
            features={
                "planner": {"plannerPreDeploymentSweeping": "rapid"},
                "mocker": {"enabled": False},
            },
            hardware={"numGpusPerNode": 8},
            auto_apply=True,
        )
        # Should not raise: accepted by the webhook
        managed_dgdr.run(managed_dgdr.server_dry_run(manifest))

    def test_v1beta1_is_storage_version(self, dgdr_namespace: str) -> None:
        """
        The CRD's storage version must be v1beta1 (it is the conversion hub).
        """
        # Note: status.storedVersions lists every version that has ever been
        # persisted, so this asserts that v1beta1 is among them rather than
        # that it is the only storage version.
        result = _run_kubectl(
            [
                "get",
                "crd",
                "dynamographdeploymentrequests.nvidia.com",
                "-o",
                "jsonpath={.status.storedVersions}",
            ],
            check=False,
        )
        assert result.returncode == 0, f"Failed to get CRD: {result.stderr}"
        assert (
            "v1beta1" in result.stdout
        ), f"v1beta1 should be the storage version. Got: {result.stdout}"

    def test_kubectl_shortname_dgdr_works(self, dgdr_namespace: str) -> None:
        """
        kubectl get dgdr must work (tests the shortName 'dgdr' in the CRD).
        """
        result = _run_kubectl(
            ["get", DGDR_SHORT_NAME, "-n", dgdr_namespace, "--ignore-not-found"],
            check=False,
        )
        assert (
            result.returncode == 0
        ), f"kubectl get dgdr failed (shortname may not be registered). stderr: {result.stderr}"

    def test_kubectl_get_columns_schema(
        self, dgdr_namespace: str, dgdr_image: str, dgdr_model: str, dgdr_factory
    ) -> None:
        """
        kubectl get dgdr should output the columns defined in the CRD:
        NAME, MODEL, BACKEND, PHASE, PROFILING, DGD, AGE.
        """
        name = unique_dgdr_name("col-test")
        manifest = build_dgdr_manifest(name, model=dgdr_model, image=dgdr_image)
        dgdr_factory(manifest)
        result = _run_kubectl(
            ["get", DGDR_SHORT_NAME, name, "-n", dgdr_namespace],
            check=False,
        )
        assert result.returncode == 0, f"kubectl get dgdr failed: {result.stderr}"
        header = (
            result.stdout.splitlines()[0].upper() if result.stdout.splitlines() else ""
        )
        expected_columns = {"NAME", "MODEL", "BACKEND", "PHASE"}
        for col in expected_columns:
            assert (
                col in header
            ), f"Expected column {col!r} in kubectl output header. Got: {header}"


# ---------------------------------------------------------------------------
# ── Group 2: v1alpha1 → v1beta1 Version Conversion ─────────────────────────
# ---------------------------------------------------------------------------


@pytest.mark.gpu_0
@pytest.mark.nightly
@pytest.mark.integration
@pytest.mark.k8s
class TestDGDRVersionConversion:
    """
    Tests that v1alpha1 DGDR resources can be submitted and are stored
    transparently as v1beta1 (conversion hub). No profiling required.
    """

    def test_v1alpha1_dgdr_can_be_applied(
        self, dgdr_namespace: str, dgdr_image: str, dgdr_model: str, dgdr_factory
    ) -> None:
        """
        A v1alpha1 DynamoGraphDeploymentRequest should be accepted and
        automatically converted to v1beta1 storage by the conversion webhook.

        Note: v1alpha1 manifests use a different spec shape (profilingConfig
        instead of image) so we must use kubectl here rather than the
        v1beta1-only ManagedDGDR client.
        """
        name = unique_dgdr_name("v1a1")
        v1alpha1_manifest = {
            "apiVersion": "nvidia.com/v1alpha1",
            "kind": "DynamoGraphDeploymentRequest",
            "metadata": {"name": name},
            "spec": {
                "model": dgdr_model,
                "backend": "vllm",
                "profilingConfig": {
                    "profilerImage": dgdr_image,
                },
            },
        }
        yaml_str = yaml.dump(v1alpha1_manifest)
        result = _run_kubectl(
            ["apply", "-n", dgdr_namespace, "-f", "-"], input=yaml_str, check=False
        )
        if result.returncode == 0:
            # Register for cleanup without re-creating (resource already exists)
            dgdr_factory.register_for_cleanup(name)
        # Either accepted (0) or rejected for a known conversion reason; just not a 500
        assert result.returncode in (
            0,
            1,
        ), f"Unexpected error applying v1alpha1 DGDR: {result.stderr}"

    def test_v1beta1_get_on_v1alpha1_object(
        self,
        managed_dgdr: ManagedDGDR,
        dgdr_namespace: str,
        dgdr_image: str,
        dgdr_model: str,
        dgdr_factory,
    ) -> None:
        """
        A resource stored as v1beta1 must be retrievable as v1alpha1 via conversion.
        """
        name = unique_dgdr_name("conv-get")
        manifest = build_dgdr_manifest(name, model=dgdr_model, image=dgdr_image)
        dgdr_factory(manifest)
        # Retrieve as v1beta1 (storage version) via ManagedDGDR
        obj_v1beta1 = managed_dgdr.run(managed_dgdr.get(name))
        assert obj_v1beta1 is not None
        assert obj_v1beta1["apiVersion"] == DGDR_API_VERSION
        # Retrieve as v1alpha1 (should trigger conversion webhook).
        # Must use kubectl here since ManagedDGDR targets v1beta1 only.
        result = _run_kubectl(
            [
                "get",
                "dynamographdeploymentrequests.v1alpha1.nvidia.com",
                name,
                "-n",
                dgdr_namespace,
                "-o",
                "json",
            ],
            check=False,
        )
        # If the conversion webhook is working, we get a 200 with v1alpha1 resource.
        # If not registered, we may get a 404 - that is also acceptable here as
        # some cluster configs only register v1beta1.
        assert result.returncode in (
            0,
            1,
        ), f"Unexpected failure getting v1alpha1 DGDR: {result.stderr}"
        if result.returncode == 0:
            obj_v1alpha1 = json.loads(result.stdout)
            assert (
                obj_v1alpha1["apiVersion"] == "nvidia.com/v1alpha1"
            ), "Retrieved object should have v1alpha1 apiVersion"

@@ -1179,6 +1179,287 @@ class ManagedDeployment:
        await self._cleanup()


class ManagedDGDR:
    """Async helper for managing DynamoGraphDeploymentRequest custom resources.

    Provides CRUD operations and phase-polling against the DGDR CRD using the
    ``kubernetes_asyncio`` client, following the same patterns as
    ``ManagedDeployment`` (shared kubeconfig initialisation, timeout logic,
    structured error messages).

    Typical usage from a pytest fixture::

        dgdr = ManagedDGDR(namespace="default")
        await dgdr.init()
        await dgdr.create(manifest)
        phase = await dgdr.wait_for_phase(name, "Ready", timeout=600)
        await dgdr.delete(name)
        await dgdr.close()
    """

    # CRD coordinates for DGDR
    DGDR_GROUP = "nvidia.com"
    DGDR_VERSION = "v1beta1"
    DGDR_PLURAL = "dynamographdeploymentrequests"
    # CRD coordinates for DGD (for mocker cleanup)
    DGD_PLURAL = "dynamographdeployments"
    DEFAULT_POLL_INTERVAL = 10  # seconds

    def __init__(
        self,
        namespace: str = "default",
        loop: Optional[asyncio.AbstractEventLoop] = None,
    ):
        self.namespace = namespace
        self._custom_api: Optional[client.CustomObjectsApi] = None
        self._api_client: Optional[client.ApiClient] = None
        self._logger = logging.getLogger(self.__class__.__name__)
        self._loop = loop

    def run(self, coro):
        """Run an async coroutine synchronously using the stored event loop.

        Convenience for callers that are not themselves async (e.g. pytest
        fixtures and synchronous test methods).
        """
        if self._loop is None:
            # init() does not set the loop, so the only fix is to pass loop=.
            raise RuntimeError(
                "No event loop set on ManagedDGDR; pass loop= at construction"
            )
        return self._loop.run_until_complete(coro)

    async def init(self) -> None:
        """Initialise the kubernetes_asyncio client.

        Priority: KUBECONFIG env → in-cluster → ~/.kube/config (same as
        ManagedDeployment._init_kubernetes).
        """
        kubeconfig_path = os.environ.get("KUBECONFIG")
        if kubeconfig_path and os.path.exists(kubeconfig_path):
            self._logger.info("Loading kubeconfig from KUBECONFIG: %s", kubeconfig_path)
            await config.load_kube_config(config_file=kubeconfig_path)
        else:
            try:
                self._logger.info("Attempting in-cluster kubernetes config")
                config.load_incluster_config()
            except Exception as e:
                self._logger.warning(
                    "In-cluster config failed (%s: %s), falling back to default kubeconfig",
                    type(e).__name__,
                    e,
                )
                await config.load_kube_config()
        self._api_client = client.ApiClient()
        self._custom_api = client.CustomObjectsApi(self._api_client)

    async def close(self) -> None:
        """Close the underlying API client."""
        if self._api_client:
            await self._api_client.close()
            self._api_client = None
            self._custom_api = None

    # ----- CRUD -----

    async def create(self, manifest: dict) -> str:
        """Create a DGDR custom resource. Returns the resource name."""
        assert self._custom_api is not None, "call init() first"
        name = manifest["metadata"]["name"]
        await self._custom_api.create_namespaced_custom_object(
            group=self.DGDR_GROUP,
            version=self.DGDR_VERSION,
            namespace=self.namespace,
            plural=self.DGDR_PLURAL,
            body=manifest,
        )
        self._logger.info("Created DGDR %s/%s", self.namespace, name)
        return name

    async def get(self, name: str) -> Optional[dict]:
        """Get a DGDR as a dict, or ``None`` if not found."""
        assert self._custom_api is not None, "call init() first"
        try:
            return await self._custom_api.get_namespaced_custom_object(
                group=self.DGDR_GROUP,
                version=self.DGDR_VERSION,
                namespace=self.namespace,
                plural=self.DGDR_PLURAL,
                name=name,
            )
        except exceptions.ApiException as e:
            if e.status == 404:
                return None
            raise

    async def delete(self, name: str, ignore_not_found: bool = True) -> None:
        """Delete a DGDR."""
        assert self._custom_api is not None, "call init() first"
        try:
            await self._custom_api.delete_namespaced_custom_object(
                group=self.DGDR_GROUP,
                version=self.DGDR_VERSION,
                namespace=self.namespace,
                plural=self.DGDR_PLURAL,
                name=name,
            )
            self._logger.info("Deleted DGDR %s/%s", self.namespace, name)
        except exceptions.ApiException as e:
            if e.status == 404 and ignore_not_found:
                return
            raise

    async def list(self, label_selector: str = "") -> List[dict]:
        """List DGDRs, optionally filtered by label selector. Returns items."""
        assert self._custom_api is not None, "call init() first"
        resp = await self._custom_api.list_namespaced_custom_object(
            group=self.DGDR_GROUP,
            version=self.DGDR_VERSION,
            namespace=self.namespace,
            plural=self.DGDR_PLURAL,
            label_selector=label_selector,
        )
        return resp.get("items", [])

    async def server_dry_run(self, manifest: dict) -> dict:
        """Apply with server-side dry-run to validate admission webhooks.

        Returns the API response dict. Raises ``ApiException`` on rejection.
        """
        assert self._custom_api is not None, "call init() first"
        return await self._custom_api.create_namespaced_custom_object(
            group=self.DGDR_GROUP,
            version=self.DGDR_VERSION,
            namespace=self.namespace,
            plural=self.DGDR_PLURAL,
            body=manifest,
            dry_run="All",
        )

    # ----- Phase helpers -----

    async def get_phase(self, name: str) -> Optional[str]:
        """Return ``status.phase`` of the named DGDR, or ``None``."""
        obj = await self.get(name)
        if obj is None:
            return None
        return obj.get("status", {}).get("phase")

    async def get_condition(self, name: str, condition_type: str) -> Optional[dict]:
        """Return the named condition dict from ``status.conditions``."""
        obj = await self.get(name)
        if obj is None:
            return None
        for c in obj.get("status", {}).get("conditions", []):
            if c.get("type") == condition_type:
                return c
        return None

    async def wait_for_phase(
        self,
        name: str,
        target_phase: str,
        timeout: int = 3600,
        fail_fast_phases: Optional[List[str]] = None,
        poll_interval: int = DEFAULT_POLL_INTERVAL,
    ) -> str:
        """Poll until the DGDR reaches *target_phase* or times out.

        Returns the final observed phase. Raises ``AssertionError`` on
        fail-fast and ``TimeoutError`` on timeout.
        """
        if fail_fast_phases is None:
            fail_fast_phases = ["Failed"]
        deadline = time.monotonic() + timeout
        last_phase: Optional[str] = None
        while time.monotonic() < deadline:
            current = await self.get_phase(name)
            if current != last_phase:
                self._logger.info("DGDR %s/%s phase: %s", self.namespace, name, current)
                last_phase = current
            if current == target_phase:
                return current
            if current in fail_fast_phases:
                obj = await self.get(name)
                conditions = obj.get("status", {}).get("conditions", []) if obj else []
                raise AssertionError(
                    f"DGDR {self.namespace}/{name} reached fail-fast phase {current!r} "
                    f"while waiting for {target_phase!r}. conditions={conditions}"
                )
            await asyncio.sleep(poll_interval)
        raise TimeoutError(
            f"Timed out after {timeout}s waiting for DGDR {self.namespace}/{name} "
            f"to reach phase {target_phase!r}. Last phase: {last_phase!r}"
        )

    async def wait_for_any_phase(
        self,
        name: str,
        target_phases: List[str],
        timeout: int = 3600,
        poll_interval: int = DEFAULT_POLL_INTERVAL,
    ) -> str:
        """Poll until the DGDR reaches any of *target_phases*. Returns matched phase."""
        deadline = time.monotonic() + timeout
        last_phase: Optional[str] = None
        while time.monotonic() < deadline:
            current = await self.get_phase(name)
            if current != last_phase:
                self._logger.info("DGDR %s/%s phase: %s", self.namespace, name, current)
                last_phase = current
            if current in target_phases:
                return current
            await asyncio.sleep(poll_interval)
        raise TimeoutError(
            f"Timed out after {timeout}s waiting for DGDR {self.namespace}/{name} "
            f"to reach any of {target_phases!r}. Last phase: {last_phase!r}"
        )

    # ----- DGD helpers (for mocker cleanup) -----

    async def delete_dgd(self, name: str, ignore_not_found: bool = True) -> None:
        """Delete a DynamoGraphDeployment resource."""
        assert self._custom_api is not None, "call init() first"
        try:
            await self._custom_api.delete_namespaced_custom_object(
                group=self.DGDR_GROUP,
                version="v1alpha1",
                namespace=self.namespace,
                plural=self.DGD_PLURAL,
                name=name,
            )
            self._logger.info("Deleted DGD %s/%s", self.namespace, name)
        except exceptions.ApiException as e:
            if e.status == 404 and ignore_not_found:
                return
            raise

    async def get_dgd(self, name: str) -> Optional[dict]:
        """Get a DynamoGraphDeployment, or ``None`` if not found."""
        assert self._custom_api is not None, "call init() first"
        try:
            return await self._custom_api.get_namespaced_custom_object(
                group=self.DGDR_GROUP,
                version="v1alpha1",
                namespace=self.namespace,
                plural=self.DGD_PLURAL,
                name=name,
            )
        except exceptions.ApiException as e:
            if e.status == 404:
                return None
            raise


async def main():
    LOG_FORMAT = "[TEST] %(asctime)s %(levelname)s %(name)s: %(message)s"
    DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"
......