planner-examples.md

---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Planner Examples
---

Practical examples for deploying the Planner with throughput-based scaling. All examples below use the DGDR workflow with pre-deployment profiling. For deployment concepts, see the [Planner Guide](planner-guide.md). For a quick overview, see the [Planner README](README.md).

## Basic Examples

### Minimal DGDR with AIC (Fastest)

The simplest way to deploy with the Planner. Uses AI Configurator for offline profiling (20-30 seconds instead of hours):

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-aic
spec:
  model: Qwen/Qwen3-32B
  backend: vllm
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:my-tag"
```

Deploy:
```bash
export NAMESPACE=your-namespace
kubectl apply -f components/src/dynamo/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```

### Online Profiling (Real Measurements)

Standard online profiling runs real GPU measurements for more accurate results. Takes 2-4 hours:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-online
spec:
  model: meta-llama/Llama-3.3-70B-Instruct
  backend: vllm
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:my-tag"
```

Deploy:
```bash
kubectl apply -f components/src/dynamo/profiler/deploy/profile_sla_dgdr.yaml -n $NAMESPACE
```

Available sample DGDRs in `components/src/dynamo/profiler/deploy/`:
- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models
- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator
- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang)

> **Note**: Starting with Dynamo 1.0.0 (DGDR API version v1beta1), DGDR fields use structured spec fields (e.g., `spec.workload`, `spec.sla`, `spec.hardware`) instead of the nested `profilingConfig.config` blob used in v1alpha1.

## Kubernetes Examples

### MoE Models (SGLang)

For Mixture-of-Experts models like DeepSeek-R1, use SGLang backend:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-moe
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:my-tag"
```

Deploy:
```bash
kubectl apply -f components/src/dynamo/profiler/deploy/profile_sla_moe_dgdr.yaml -n $NAMESPACE
```

### Using Existing DGD Configs (Custom Setups)

Reference an existing DynamoGraphDeployment config via ConfigMap:

**Step 1: Create ConfigMap from your DGD config:**

```bash
kubectl create configmap deepseek-r1-config \
  --from-file=disagg.yaml=/path/to/your/disagg.yaml \
  --namespace $NAMESPACE \
  --dry-run=client -o yaml | kubectl apply -f -
```

**Step 2: Reference it in your DGDR:**

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:my-tag"
```

The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` and `spec.backend` into the final configuration.

### Inline Configuration (Simple Use Cases)

For simple use cases without a custom DGD config, provide the configuration directly in the v1beta1 DGDR spec fields. The profiler auto-generates a basic DGD configuration:

```yaml
spec:
  workload:
    isl: 8000
    osl: 200

  sla:
    ttft: 200.0
    itl: 10.0

  hardware:
    gpuSku: h200_sxm

  searchStrategy: rapid
```

### Mocker Deployment (Testing)

Deploy a mocker backend that simulates GPU timing behavior without real GPUs. Useful for:
- Large-scale experiments without GPU resources
- Testing planner behavior and infrastructure
- Validating deployment configurations

```yaml
spec:
  model: <model-name>
  backend: trtllm  # Real backend for profiling
  features:
    mocker:
      enabled: true  # Deploy mocker instead of real backend

  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:my-tag"
```

Profiling runs against the real backend (via GPUs or AIC). The mocker deployment then uses profiling data to simulate realistic timing.

### Model Cache PVC (0.8.1+)

For large models, use a pre-populated PVC instead of downloading from HuggingFace:

See [SLA-Driven Profiling](../profiler/profiler-guide.md) for configuration details.

## Advanced Examples

### Custom Load Predictors

#### Warm-starting with Trace Data

Pre-load predictors with historical request patterns before live traffic:

```yaml
# In planner arguments
args:
  - --load-predictor arima
  - --load-predictor-warmup-trace /data/trace.jsonl
  - --load-predictor-log1p
```

The trace file should be in mooncake-style JSONL format with request-count, ISL, and OSL samples.

#### Kalman Filter Tuning

For workloads with rapid changes, tune the Kalman filter:

```yaml
args:
  - --load-predictor kalman
  - --kalman-q-level 2.0      # Higher = more responsive to level changes
  - --kalman-q-trend 0.5      # Higher = trend changes faster
  - --kalman-r 5.0            # Lower = trusts new measurements more
  - --kalman-min-points 3     # Fewer points before forecasting starts
  - --load-predictor-log1p    # Often helps with request-rate series
```

#### Prophet for Seasonal Workloads

For workloads with daily/weekly patterns:

```yaml
args:
  - --load-predictor prophet
  - --prophet-window-size 100   # Larger window for seasonal detection
  - --load-predictor-log1p
```

### Virtual Connector

For non-Kubernetes environments, use the VirtualConnector to communicate scaling decisions:

```python
from dynamo._core import DistributedRuntime, VirtualConnectorClient

# Initialize client
client = VirtualConnectorClient(distributed_runtime, namespace)

# Main loop: watch for planner decisions and execute them
while True:
    # Block until the planner makes a new scaling decision
    await client.wait()

    # Read the decision
    decision = await client.get()
    print(f"Scale to: prefill={decision.num_prefill_workers}, "
          f"decode={decision.num_decode_workers}, "
          f"id={decision.decision_id}")

    # Execute scaling in your environment
    scale_prefill_workers(decision.num_prefill_workers)
    scale_decode_workers(decision.num_decode_workers)

    # Report completion
    await client.complete(decision)
```

See `components/planner/test/test_virtual_connector.py` for a full working example.

### Planner Configuration Passthrough

Pass planner-specific settings through the DGDR:

```yaml
features:
  planner:
    plannerMinEndpoint: 2
```

### Review Before Deploy (autoApply: false)

Disable auto-deployment to inspect the generated DGD:

```yaml
spec:
  autoApply: false
```

After profiling completes:

```bash
# Extract and review generated DGD
kubectl get dgdr sla-aic -n $NAMESPACE \
  -o jsonpath='{.status.profilingResults.selectedConfig}' > my-dgd.yaml

# Review and modify as needed
vi my-dgd.yaml

# Deploy manually
kubectl apply -f my-dgd.yaml -n $NAMESPACE
```

### Profiling Artifacts with PVC

Save detailed profiling artifacts (plots, logs, raw data) to a PVC:

```yaml
spec:
  workload:
    isl: 3000
    osl: 150

  sla:
    ttft: 200
    itl: 20
```

Setup:
```bash
export NAMESPACE=your-namespace
deploy/utils/setup_benchmarking_resources.sh
```

Access results:
```bash
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
kubectl delete pod pvc-access-pod -n $NAMESPACE
```

## Related Documentation

- [Planner README](README.md) -- Overview and quick start
- [Planner Guide](planner-guide.md) -- Deployment, configuration, integration
- [Planner Design](../../design-docs/planner-design.md) -- Architecture deep-dive
- [DGDR Configuration Reference](../profiler/profiler-guide.md#dgdr-configuration-structure)
- [SLA-Driven Profiling](../profiler/profiler-guide.md)