docs: planner 3-tier documentation restructure (#5876)

Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: dagil-nvidia <dagil@nvidia.com>
Co-authored-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Unverified Commit 7752ce21 authored Feb 05, 2026 by Anish Committed by GitHub Feb 05, 2026
7 changed files
--- a/components/src/dynamo/planner/README.md
+++ b/components/src/dynamo/planner/README.md
@@ -15,4 +15,9 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->
-Please refer to [planner docs](../../../../docs/planner/planner_intro.rst) for planner documentation.
+# Planner
+SLA-driven autoscaling controller for Dynamo inference graphs.
+- **User docs**: [docs/planner/](/docs/planner/) (deployment, configuration, examples)
+- **Design docs**: [docs/design_docs/planner_design.md](/docs/design_docs/planner_design.md) (architecture, algorithms)
--- a/docs/design_docs/planner_design.md
+++ b/docs/design_docs/planner_design.md
+# Planner Design
+> **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/planner/](/docs/planner/).
+## Overview
+The Planner is Dynamo's autoscaling controller. It observes system metrics, predicts future load, and adjusts prefill/decode worker replica counts to proactively meet SLA targets. This document covers the internal architecture, algorithms, and design trade-offs.
+## Architecture
+```text
+┌──────────────────────────────────────────────────────────┐
+│                    Planner Component                     │
+│                                                          │
+│  ┌───────────────┐ ┌───────────────┐ ┌────────────────┐  │
+│  │    Metric     │ │     Load      │ │  Performance   │  │
+│  │   Collector   │ │   Predictor   │ │  Interpolator  │  │
+│  │  (Prometheus) │ │ (ARIMA/etc.)  │ │  (JSON data)   │  │
+│  └───────┬───────┘ └───────┬───────┘ └───────┬────────┘  │
+│          │                 │                  │          │
+│          ▼                 ▼                  ▼          │
+│  ┌───────────────────────────────────────────────────┐   │
+│  │              Scaling Algorithm                    │   │
+│  └───────────────────────┬───────────────────────────┘   │
+│                          │                               │
+│  ┌───────────────────────▼───────────────────────────┐   │
+│  │               Connector Layer                     │   │
+│  │  ┌───────────────────┐  ┌───────────────────────┐ │   │
+│  │  │ KubernetesConn.   │  │   VirtualConn.        │ │   │
+│  │  │ (PATCH DGD)       │  │   (Runtime bridge)    │ │   │
+│  │  └───────────────────┘  └───────────────────────┘ │   │
+│  └───────────────────────────────────────────────────┘   │
+└──────────────────────────────────────────────────────────┘
+```
+## Scaling Algorithm
+### Step 1: Metric Collection
+Every `adjustment_interval` seconds, the planner queries Prometheus for:
+- Average TTFT and ITL over the interval
+- Total request count
+- Average input sequence length (ISL) and output sequence length (OSL)
+The Prometheus query targets the Frontend's `/metrics` endpoint, which exposes histograms and counters.
+### Step 2: Correction Factor Calculation
+The planner maintains correction factors that adapt profiling-based predictions to real-world behavior:
+```text
+prefill_correction = actual_ttft / expected_ttft
+decode_correction  = actual_itl  / expected_itl
+```
+These corrections absorb effects that are hard to model from profiling data alone, such as:
+- **Request queueing**: Bursty traffic causes higher TTFT than profiled steady-state
+- **Prefix cache hits**: KV reuse reduces effective prefill tokens, lowering actual TTFT
+- **Chunked prefill in decode**: Small prefills processed in decode engine affect ITL
+- **Metric variance**: Average ISL/OSL may not represent the actual distribution
+The correction factors are applied as multipliers to the next scaling decision. Setting `--no-correction` disables this for debugging or when cold-start artifacts dominate.
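As an illustration of the ratios above, the following is a minimal sketch and not the planner's actual code. The helper name `correction_factor` and the sample latencies are assumptions chosen for the example.

```python
def correction_factor(actual_ms: float, expected_ms: float) -> float:
    """Ratio of observed latency to the latency predicted from profiling data."""
    if expected_ms <= 0:
        raise ValueError("expected latency must be positive")
    return actual_ms / expected_ms


# Hypothetical observations for one adjustment interval.
prefill_correction = correction_factor(actual_ms=600.0, expected_ms=400.0)  # TTFT
decode_correction = correction_factor(actual_ms=18.0, expected_ms=20.0)     # ITL

print(f"prefill_correction={prefill_correction:.2f}")
print(f"decode_correction={decode_correction:.2f}")
```

A factor above 1 means the system is slower than profiling predicted (for example, because of queueing); a factor below 1 means it is faster (for example, because of prefix cache hits).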
+### Step 3: Load Prediction
+The planner forecasts three values for the next interval:
+- `next_num_req`: Number of requests
+- `next_isl`: Average input sequence length
+- `next_osl`: Average output sequence length
+Four predictor implementations are available:
+| Predictor    | Algorithm                                | Best For                         |
+| ------------ | ---------------------------------------- | -------------------------------- |
+| **Constant** | `next = current`                         | Stable workloads, long intervals |
+| **ARIMA**    | Auto-ARIMA with optional log1p transform | Trending/seasonal patterns       |
+| **Kalman**   | Local linear trend Kalman filter         | Bursty traffic                   |
+| **Prophet**  | Facebook Prophet time-series model       | Complex seasonality              |
+All predictors support warm-starting from trace files (`--load-predictor-warmup-trace`).
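The simplest entry in the table, the constant predictor, can be sketched as follows. The class and method names here are hypothetical and do not claim to match `load_predictor.py`; the warm-up loop only illustrates the idea of seeding a predictor with historical samples before live traffic arrives.

```python
class ConstantPredictor:
    """Predicts that the next interval looks like the most recent one."""

    def __init__(self) -> None:
        self._last = None

    def add_data_point(self, value: float) -> None:
        # Only the latest observation matters for `next = current`.
        self._last = value

    def predict_next(self) -> float:
        if self._last is None:
            raise ValueError("no observations yet")
        return self._last


# Hypothetical warm-up: replay historical request counts, then forecast.
predictor = ConstantPredictor()
for historical_num_req in [820, 860, 900]:
    predictor.add_data_point(historical_num_req)

next_num_req = predictor.predict_next()
print(next_num_req)
```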
+### Step 4: Replica Calculation
+**Prefill replicas:**
+```python
+predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
+prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
+```
+The prefill correction factor has a linear effect on throughput because prefill is single-batched.
+**Decode replicas:**
+```python
+# Apply correction to the ITL SLA target
+corrected_itl = target_itl / decode_correction
+# Find best throughput/GPU that achieves corrected ITL at predicted context length
+throughput_per_gpu = decode_interpolator.find_best_throughput_per_gpu(
+    itl=corrected_itl,
+    context_length=next_isl + next_osl / 2
+)
+# Calculate required replicas
+decode_replicas = ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)
+```
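A worked example of the two formulas above, under stated assumptions: the throughput values stand in for what the interpolators would return and are invented for illustration, as are the function names.

```python
from math import ceil


def prefill_replicas(next_num_req, next_isl, interval, prefill_correction,
                     interpolated_throughput, gpus_per_engine):
    # Prefill tokens per second expected in the next interval.
    predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
    return ceil(predicted_load / interpolated_throughput / gpus_per_engine)


def decode_replicas(next_num_req, next_osl, interval,
                    throughput_per_gpu, gpus_per_engine):
    # Decode tokens per second expected in the next interval.
    return ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)


# Forecast for the next 180 s interval (illustrative numbers).
num_prefill = prefill_replicas(
    next_num_req=900, next_isl=3000, interval=180, prefill_correction=1.5,
    interpolated_throughput=4000,  # assumed tokens/s/GPU from the prefill interpolator
    gpus_per_engine=2,
)
num_decode = decode_replicas(
    next_num_req=900, next_osl=150, interval=180,
    throughput_per_gpu=200,        # assumed tokens/s/GPU at the corrected ITL
    gpus_per_engine=1,
)
print(num_prefill, num_decode)
```

With these inputs the prefill load is 15,000 tokens/s, which needs 1.875 two-GPU engines and rounds up to 2 replicas; the decode load is 750 tokens/s, which needs 3.75 engines and rounds up to 4 replicas.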
+### Step 5: Scaling Execution
+The planner calls `connector.set_component_replicas()` with the calculated targets. Scaling is non-blocking by default: the planner continues monitoring while replicas are adjusting.
+## Connector Design
+### Interface
+```python
+class PlannerConnector(ABC):
+    async def add_component(self, component_name)
+    async def remove_component(self, component_name)
+    # Extended interface (not on ABC, but implemented by both connectors):
+    async def set_component_replicas(self, targets, blocking)
+    async def validate_deployment(self, ...)
+    async def wait_for_deployment_ready(self)
+```
+### KubernetesConnector
+Directly PATCHes the DGD resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.
+**Design decisions:**
+- Uses `DYN_PARENT_DGD_K8S_NAME` to find its parent DGD (injected by operator)
+- Resolves services by `subComponentType` field (prefill/decode), with fallback to legacy component names
+- Validates deployment structure on startup: checks that prefill and decode services exist and model names match
+### VirtualConnector
+For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via `VirtualConnectorCoordinator` (Rust binding). External systems use `VirtualConnectorClient` to poll decisions and report completion.
+**Scaling decision flow:**
+1. Planner writes `(num_prefill, num_decode, decision_id)` to runtime
+2. External system reads decision via `client.wait()`
+3. External system executes scaling
+4. External system reports completion via `client.complete(decision)`
+5. Planner sees `scaled_decision_id >= decision_id` and proceeds
+**Timeout**: If scaling isn't acknowledged within 1800s (configurable), the planner proceeds with new decisions anyway.
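The handshake above can be simulated in memory. This is a sketch only: `InMemoryRuntime` and the helper functions are hypothetical stand-ins for the distributed runtime and the Rust-backed coordinator/client, and the starting decision ID of 0 is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    num_prefill: int
    num_decode: int
    decision_id: int


class InMemoryRuntime:
    """Hypothetical stand-in for the state shared through the distributed runtime."""

    def __init__(self) -> None:
        self.decision: Optional[Decision] = None
        self.scaled_decision_id = -1  # nothing acknowledged yet


def planner_write(runtime: InMemoryRuntime, num_prefill: int, num_decode: int) -> int:
    """Step 1: the planner publishes a decision with a monotonically increasing ID."""
    next_id = 0 if runtime.decision is None else runtime.decision.decision_id + 1
    runtime.decision = Decision(num_prefill, num_decode, next_id)
    return next_id


def external_complete(runtime: InMemoryRuntime, decision: Decision) -> None:
    """Step 4: the external system acknowledges the decision it executed."""
    runtime.scaled_decision_id = decision.decision_id


def planner_can_proceed(runtime: InMemoryRuntime, decision_id: int,
                        elapsed_s: float = 0.0, timeout_s: float = 1800.0) -> bool:
    """Step 5: proceed once acknowledged, or once the timeout has passed."""
    return runtime.scaled_decision_id >= decision_id or elapsed_s >= timeout_s


runtime = InMemoryRuntime()
decision_id = planner_write(runtime, num_prefill=2, num_decode=4)
before_ack = planner_can_proceed(runtime, decision_id)
external_complete(runtime, runtime.decision)  # steps 2-4 collapsed for brevity
after_ack = planner_can_proceed(runtime, decision_id)
print(before_ack, after_ack)
```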
+## Performance Interpolation
+The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).
+Two interpolators are maintained:
+- **Prefill interpolator**: Maps (throughput_per_gpu, ISL) -> TTFT
+- **Decode interpolator**: Maps (throughput_per_gpu, context_length) -> ITL
+Interpolation precision is determined by the granularity of the profiling sweep. A finer granularity requires more profiling samples but yields more accurate interpolation.
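To make the idea concrete, here is a simplified one-dimensional sketch: piecewise-linear interpolation of TTFT over ISL at a single, fixed throughput. The real interpolators work over two inputs and load NPZ data; the function name and sample points below are assumptions for illustration.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation; clamps outside the profiled range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)          # first index with xs[i] >= x
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical profiling samples: ISL (tokens) -> TTFT (ms).
profiled_isl = [1000, 2000, 4000]
profiled_ttft_ms = [100.0, 180.0, 400.0]

expected_ttft = interpolate(profiled_isl, profiled_ttft_ms, 3000)
print(expected_ttft)
```

With only three samples, the estimate at ISL 3000 is a straight line between the 2000 and 4000 points; adding a profiled sample near 3000 would capture any curvature in between, which is the accuracy/cost trade-off described above.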
+## Initialization
+The planner starts with a 30-second delay (`INIT_PLANNER_START_DELAY`) to allow other components (frontend, workers) to register and stabilize. This is a known workaround (marked TODO in code) that should be replaced with a proper readiness check.
+After the delay:
+1. Initialize the connector (K8s or Virtual based on `--environment`)
+2. Validate deployment structure
+3. Load profiling results
+4. Build interpolators
+5. Initialize load predictor
+6. Enter main scaling loop
+## Performance Considerations
+- **Adjustment interval sizing**: The interval must be long enough for scaling operations to complete. If `adjustment_interval` is shorter than the time to add or remove a worker (pod scheduling, model loading, and registration), scaling decisions will overlap. The default of 180s is conservative; workloads with fast model loading can use shorter intervals.
+- **Correction factor stability**: Correction factors are recalculated each interval. During traffic transitions (e.g., ramp-up), they can oscillate. The `--no-correction` flag disables correction for scenarios where cold-start artifacts dominate and distort the factor.
+- **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
+- **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
+## Known Limitations
+1. **30-second startup delay**: Hardcoded wait for component registration. It should be replaced with runtime readiness probing.
+2. **Adjustment interval vs scaling latency**: If `adjustment_interval` < time to scale, scaling decisions can pile up. The planner logs warnings but doesn't queue.
+3. **Average-based interpolation**: The planner uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well.
+4. **Single DGD scope**: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported.
+5. **Load-based planner deprecated**: The load-based code path exists but is non-functional with current backends (no prefill queue metrics).
+## Future Work
+- Support aggregated (non-disaggregated) scaling mode for single-worker deployments
+- Multi-DGD coordination for shared-cluster scenarios
+- Distribution-aware interpolation (beyond mean ISL/OSL)
+- Adaptive adjustment interval based on observed scaling latency
+## File Map
+| File                         | Size | Purpose                                               |
+| ---------------------------- | ---- | ----------------------------------------------------- |
+| `planner_core.py`            | 36k  | Main scaling loop, algorithm implementation           |
+| `perf_interpolation.py`      | 13k  | NPZ data loading and throughput/latency interpolation |
+| `load_predictor.py`          | 16k  | ARIMA, Prophet, Kalman, Constant predictors           |
+| `pre_swept_results_utils.py` | 12k  | Pre-computed H100/H200 profiling data loader          |
+| `kubernetes_connector.py`    | 11k  | K8s API integration for DGD scaling                   |
+| `kube.py`                    | 7.4k | Low-level K8s client wrapper                          |
+| `exceptions.py`              | 7.2k | Custom exception hierarchy                            |
+| `prometheus.py`              | 7.3k | Prometheus query builder and client                   |
+| `defaults.py`                | 8.1k | Default configs, backend name mappings                |
+| `planner_argparse.py`        | 6.2k | CLI argument definitions                              |
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -88,3 +88,4 @@ Quickstart
   Distributed Runtime <design_docs/distributed_runtime.md>
   Request Plane <design_docs/request_plane.md>
   Event Plane <design_docs/event_plane.md>
+   Planner Design <design_docs/planner_design.md>
--- a/docs/planner/README.md
+++ b/docs/planner/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Planner
+The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
+> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](sla_planner_quickstart.md) for a complete workflow including profiling and deployment.
+## Feature Matrix
+| Category | Feature | Status |
+|----------|---------|--------|
+| **Backend** | Local (bare metal) | Deprecated |
+| | Kubernetes | Supported |
+| **LLM Framework** | vLLM | Supported |
+| | TensorRT-LLM | Supported |
+| | SGLang | Supported |
+| **Serving Type** | Aggregated | Unsupported |
+| | Disaggregated | Supported |
+| **Scaling Mode** | SLA-based (TTFT/ITL targets) | Supported (primary) |
+| | Load-based (KV cache/queue thresholds) | Deprecated |
+| **Load Predictors** | ARIMA | Supported |
+| | Prophet | Supported |
+| | Kalman filter | Supported |
+| | Constant (current = next) | Supported |
+| **Connectors** | KubernetesConnector (native DGD scaling) | Supported |
+| | VirtualConnector (external environments) | Supported |
+## Quick Start
+### Prerequisites
+- Dynamo platform installed on Kubernetes ([Installation Guide](/docs/kubernetes/installation_guide.md))
+- kube-prometheus-stack installed ([Metrics Setup](/docs/kubernetes/observability/metrics.md))
+- Pre-deployment profiling completed ([Profiling Guide](/docs/benchmarks/sla_driven_profiling.md))
+### Deploy with DGDR (Recommended)
+The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest:
+```bash
+kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
+```
+This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Quick Start](sla_planner_quickstart.md) for the full workflow.
+### Deploy with DGD (Manual)
+For manual control, use the disaggregated planner templates:
+```bash
+# After profiling is complete
+kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
+```
+## Documentation
+| Document | Description |
+|----------|-------------|
+| [Planner Guide](planner_guide.md) | Deployment, configuration, integration, troubleshooting |
+| [Planner Examples](planner_examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
+| [SLA Planner Quick Start](sla_planner_quickstart.md) | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
+| [SLA-based Planner](sla_planner.md) | Scaling algorithm, correction factors, load prediction details |
+| [Load-based Planner](load_planner.md) | Legacy load-based scaling (deprecated) |
+| [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md) | Pre-deployment profiling process and configuration |
+| [Planner Design](/docs/design_docs/planner_design.md) | Architecture deep-dive for contributors |
+## Configuration Reference
+### Key Arguments
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
+| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) |
+| `--environment` | `kubernetes` | Deployment environment |
+| `--adjustment-interval` | `180` | Seconds between scaling decisions |
+| `--ttft` | `500.0` | Target Time To First Token (ms) |
+| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
+| `--isl` | `3000` | Expected average input sequence length |
+| `--osl` | `150` | Expected average output sequence length |
+| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
+| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
+| `--min-endpoint` | `1` | Minimum replicas per worker type |
+| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
+| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
+| `--no-operation` | `false` | Observation mode (no actual scaling) |
+| `--no-correction` | `false` | Disable correction factors |
+| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
+### Environment Variables
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
+| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
+| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
+| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for planner's own Prometheus metrics |
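The defaults in the table above could be resolved as in the following sketch. The function name and the returned dictionary keys are hypothetical; only the variable names and default values come from the table, and treating a missing `DYN_PARENT_DGD_K8S_NAME` as an error is an assumption based on its "(required)" marking.

```python
import os

DEFAULT_PROMETHEUS_ENDPOINT = (
    "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
)


def resolve_planner_env(env=None):
    """Read planner settings from the environment, applying documented defaults."""
    env = os.environ if env is None else env
    parent_dgd = env.get("DYN_PARENT_DGD_K8S_NAME")
    if not parent_dgd:
        raise KeyError("DYN_PARENT_DGD_K8S_NAME is required")
    return {
        "namespace": env.get("DYN_NAMESPACE", "dynamo"),
        "parent_dgd": parent_dgd,
        "prometheus_endpoint": env.get("PROMETHEUS_ENDPOINT", DEFAULT_PROMETHEUS_ENDPOINT),
        # 0 means the planner's own metrics endpoint is disabled.
        "metrics_port": int(env.get("PLANNER_PROMETHEUS_PORT", "0")),
    }


# Example with an explicit mapping instead of the real process environment.
settings = resolve_planner_env({"DYN_PARENT_DGD_K8S_NAME": "my-dgd"})
print(settings["namespace"], settings["metrics_port"])
```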
+## Monitoring
+### Grafana Dashboard
+Deploy the planner dashboard:
+```bash
+kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
+```
+The dashboard shows:
+- Worker counts and GPU usage over time
+- Observed TTFT, ITL, request rate, sequence lengths
+- Predicted load and recommended replica counts
+- Correction factors (actual vs. expected performance)
+### Prometheus Metrics
+The planner queries the frontend's `/metrics` endpoint via Prometheus. Required metrics:
+- Request count and duration
+- TTFT and ITL distributions
+- Input/output sequence lengths
--- a/docs/planner/planner_examples.md
+++ b/docs/planner/planner_examples.md
+# Planner Examples
+Practical examples for deploying the SLA Planner with different configurations. For deployment concepts, see the [Planner Guide](planner_guide.md). For a quick overview, see the [Planner README](README.md).
+## Basic Examples
+### Minimal DGDR with AIC (Fastest)
+The simplest way to deploy with the SLA planner. Uses AI Configurator for offline profiling (20-30 seconds instead of hours):
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: sla-aic
+spec:
+  model: Qwen/Qwen3-32B
+  backend: vllm
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
+    config:
+      sla:
+        isl: 3000
+        osl: 150
+        ttft: 200
+        itl: 20
+      sweep:
+        useAiConfigurator: true
+        aicSystem: h200_sxm
+        aicHfId: Qwen/Qwen3-32B
+        aicBackendVersion: "0.20.0"
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
+  autoApply: true
+```
+Deploy:
+```bash
+export NAMESPACE=your-namespace
+kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
+```
+### Online Profiling (Real Measurements)
+Standard online profiling runs real GPU measurements for more accurate results. It takes 2-4 hours:
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: sla-online
+spec:
+  model: meta-llama/Llama-3.3-70B-Instruct
+  backend: vllm
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
+    config:
+      sla:
+        isl: 3000
+        osl: 150
+        ttft: 200
+        itl: 20
+      sweep:
+        useAiConfigurator: false
+        prefillInterpolationGranularity: 16
+        decodeInterpolationGranularity: 6
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
+  autoApply: true
+```
+Deploy:
+```bash
+kubectl apply -f benchmarks/profiler/deploy/profile_sla_dgdr.yaml -n $NAMESPACE
+```
+Available sample DGDRs in `benchmarks/profiler/deploy/`:
+- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models
+- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator
+- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang)
+> **Profiling Config Cases**: Before 0.8.1, fields under `profilingConfig.config` use snake_case. From 0.8.1 onward, fields use camelCase. snake_case is still accepted for backward compatibility, but the example DGDRs use camelCase.
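If you maintain older snake_case configs and want to migrate them, a conversion along these lines may help. This is a hypothetical helper for illustration, not part of Dynamo and not how the operator handles backward compatibility.

```python
def snake_to_camel(name: str) -> str:
    """Convert `use_ai_configurator` to `useAiConfigurator`; camelCase passes through."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def camelize_keys(value):
    """Recursively rename dictionary keys, leaving values untouched."""
    if isinstance(value, dict):
        return {snake_to_camel(k): camelize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [camelize_keys(v) for v in value]
    return value


# A pre-0.8.1 style config fragment (illustrative).
legacy_config = {
    "sla": {"isl": 3000, "osl": 150},
    "sweep": {"use_ai_configurator": True, "prefill_interpolation_granularity": 16},
}
converted = camelize_keys(legacy_config)
print(converted)
```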
+## Kubernetes Examples
+### MoE Models (SGLang)
+For Mixture-of-Experts models like DeepSeek-R1, use SGLang backend:
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: sla-moe
+spec:
+  model: deepseek-ai/DeepSeek-R1
+  backend: sglang
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
+    config:
+      sla:
+        isl: 4000
+        osl: 500
+        ttft: 300
+        itl: 10
+      sweep:
+        useAiConfigurator: false
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
+  autoApply: true
+```
+Deploy:
+```bash
+kubectl apply -f benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml -n $NAMESPACE
+```
+### Using Existing DGD Configs (Custom Setups)
+Reference an existing DynamoGraphDeployment config via ConfigMap:
+**Step 1: Create ConfigMap from your DGD config:**
+```bash
+kubectl create configmap deepseek-r1-config \
+  --from-file=disagg.yaml=/path/to/your/disagg.yaml \
+  --namespace $NAMESPACE \
+  --dry-run=client -o yaml | kubectl apply -f -
+```
+**Step 2: Reference it in your DGDR:**
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: deepseek-r1
+spec:
+  model: deepseek-ai/DeepSeek-R1
+  backend: sglang
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
+    configMapRef:
+      name: deepseek-r1-config
+      key: disagg.yaml  # Must match the key used in --from-file
+    config:
+      sla:
+        isl: 4000
+        osl: 500
+        ttft: 300
+        itl: 10
+      sweep:
+        useAiConfigurator: true
+        aicSystem: h200_sxm
+        aicHfId: deepseek-ai/DeepSeek-V3
+        aicBackendVersion: "0.20.0"
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
+  autoApply: true
+```
+The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` and `spec.backend` into the final configuration.
+### Inline Configuration (Simple Use Cases)
+For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler auto-generates a basic DGD configuration:
+```yaml
+profilingConfig:
+  config:
+    sla:
+      isl: 8000
+      osl: 200
+      ttft: 200.0
+      itl: 10.0
+    hardware:
+      minNumGpusPerEngine: 2
+      maxNumGpusPerEngine: 8
+      gpuType: h200_sxm
+    sweep:
+      prefillInterpolationGranularity: 16
+      decodeInterpolationGranularity: 6
+```
+### Mocker Deployment (Testing)
+Deploy a mocker backend that simulates GPU timing behavior without real GPUs. Useful for:
+- Large-scale experiments without GPU resources
+- Testing planner behavior and infrastructure
+- Validating deployment configurations
+```yaml
+spec:
+  model: <model-name>
+  backend: trtllm  # Real backend for profiling
+  useMocker: true  # Deploy mocker instead of real backend
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/dynamo/trtllm-runtime:<image-tag>"
+    config:
+      sla:
+        isl: 3000
+        osl: 150
+        ttft: 200
+        itl: 20
+      sweep:
+        useAiConfigurator: true
+        aicSystem: h100_sxm
+  autoApply: true
+```
+Profiling runs against the real backend (via GPUs or AIC). The mocker deployment then uses profiling data to simulate realistic timing.
+### Model Cache PVC (0.8.1+)
+For large models, use a pre-populated PVC instead of downloading from HuggingFace.
+See [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md) for configuration details.
+## Advanced Examples
+### Custom Load Predictors
+#### Warm-starting with Trace Data
+Pre-load predictors with historical request patterns before live traffic:
+```yaml
+# In planner arguments
+args:
+  - --load-predictor arima
+  - --load-predictor-warmup-trace /data/trace.jsonl
+  - --load-predictor-log1p
+```
+The trace file should be in mooncake-style JSONL format with request-count, ISL, and OSL samples.
+#### Kalman Filter Tuning
+For workloads with rapid changes, tune the Kalman filter:
+```yaml
+args:
+  - --load-predictor kalman
+  - --kalman-q-level 2.0      # Higher = more responsive to level changes
+  - --kalman-q-trend 0.5      # Higher = trend changes faster
+  - --kalman-r 5.0            # Lower = trusts new measurements more
+  - --kalman-min-points 3     # Fewer points before forecasting starts
+  - --load-predictor-log1p    # Often helps with request-rate series
+```
+#### Prophet for Seasonal Workloads
+For workloads with daily/weekly patterns:
+```yaml
+args:
+  - --load-predictor prophet
+  - --prophet-window-size 100   # Larger window for seasonal detection
+  - --load-predictor-log1p
+```
+### Virtual Connector
+For non-Kubernetes environments, use the VirtualConnector to communicate scaling decisions:
+```python
+from dynamo._core import DistributedRuntime, VirtualConnectorClient
+# scale_prefill_workers / scale_decode_workers are placeholders for your
+# environment's own scaling hooks; they are not provided by Dynamo.
+async def run_scaling_loop(distributed_runtime: DistributedRuntime, namespace: str) -> None:
+    # Initialize client
+    client = VirtualConnectorClient(distributed_runtime, namespace)
+    # Main loop: watch for planner decisions and execute them
+    while True:
+        # Block until the planner makes a new scaling decision
+        await client.wait()
+        # Read the decision
+        decision = await client.get()
+        print(f"Scale to: prefill={decision.num_prefill_workers}, "
+              f"decode={decision.num_decode_workers}, "
+              f"id={decision.decision_id}")
+        # Execute scaling in your environment
+        scale_prefill_workers(decision.num_prefill_workers)
+        scale_decode_workers(decision.num_decode_workers)
+        # Report completion so the planner can proceed with new decisions
+        await client.complete(decision)
+See `components/planner/test/test_virtual_connector.py` for a full working example.
+### Planner Configuration Passthrough
+Pass planner-specific settings through the DGDR:
+```yaml
+profilingConfig:
+  config:
+    planner:
+      plannerMinEndpoint: 2
+```
+### Review Before Deploy (autoApply: false)
+Disable auto-deployment to inspect the generated DGD:
+```yaml
+spec:
+  autoApply: false
+```
+After profiling completes:
+```bash
+# Extract and review generated DGD
+kubectl get dgdr sla-aic -n $NAMESPACE \
+  -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
+# Review and modify as needed
+vi my-dgd.yaml
+# Deploy manually
+kubectl apply -f my-dgd.yaml -n $NAMESPACE
+```
+### Profiling Artifacts with PVC
+Save detailed profiling artifacts (plots, logs, raw data) to a PVC:
+```yaml
+spec:
+  profilingConfig:
+    outputPVC: "dynamo-pvc"
+    config:
+      sla:
+        isl: 3000
+        osl: 150
+        ttft: 200
+        itl: 20
+```
+Setup:
+```bash
+export NAMESPACE=your-namespace
+deploy/utils/setup_benchmarking_resources.sh
+```
+Access results:
+```bash
+kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
+kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
+kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
+kubectl delete pod pvc-access-pod -n $NAMESPACE
+```
+## Related Documentation
+- [Planner README](README.md) -- Overview and quick start
+- [Planner Guide](planner_guide.md) -- Deployment, configuration, integration
+- [Planner Design](/docs/design_docs/planner_design.md) -- Architecture deep-dive
+- [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference)
+- [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md)
--- a/docs/planner/planner_guide.md
+++ b/docs/planner/planner_guide.md
+# Planner Guide
+Deployment, configuration, and integration guide for the Dynamo SLA Planner. For a quick overview, see the [Planner README](README.md). For architecture internals, see [Planner Design](/docs/design_docs/planner_design.md).
+## Deployment
+### Prerequisites
+Before deploying the planner, ensure:
+- **Dynamo platform installed** with the operator running (see [Installation Guide](/docs/kubernetes/installation_guide.md))
+- **[kube-prometheus-stack](/docs/kubernetes/observability/metrics.md) installed and running** (required for SLA planner metric collection)
+- **Image pull secrets configured** if using private registries (typically `nvcr-imagepullsecret` for NVIDIA images)
+- **Sufficient GPU resources** available in your cluster for profiling
+- **Runtime images available** that contain both profiler and runtime components
+### Container Images
+Each DGDR requires container images for the profiling and deployment process:
+**profilingConfig.profilerImage** (Required):
+The container image used for the profiling job. Must contain the profiler code and dependencies for SLA-based profiling.
+**deploymentOverrides.workersImage** (Optional):
+The container image used for DGD worker components (frontend, workers, planner). Used for:
+- Temporary DGDs created during online profiling (for performance measurements)
+- The final DGD deployed after profiling completes
+If `workersImage` is omitted, the image from the base config file (e.g., `disagg.yaml`) is used. Public images are available from 0.6.1 onward.
+```yaml
+spec:
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"  # Optional
+```
+### What is a DynamoGraphDeploymentRequest (DGDR)?
+A **DGDR** is a Kubernetes Custom Resource that serves as the primary interface for deploying models with specific performance and resource constraints. It specifies:
+- **What** model to deploy (`model`)
+- **How** it should perform (SLA targets: `ttft`, `itl`)
+- **Where** it should run (optional GPU preferences)
+- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
+- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
+The Dynamo Operator watches for DGDRs and automatically:
+1. Discovers available GPU resources in your cluster
+2. Runs profiling (online or offline) to find optimal configurations
+3. Generates an optimized DynamoGraphDeployment (DGD) configuration
+4. Deploys the DGD to your cluster
+**Key Benefits:**
+- **Declarative**: Specify what you want, not how to achieve it
+- **Automated**: No manual profiling job setup or result processing
+- **SLA-Driven**: Ensures deployments meet your performance requirements
+- **Integrated**: Works seamlessly with the Dynamo Operator
+### DGDR Workflow
+The DGDR workflow automates the entire process from SLA specification to deployment:
+1. **Define SLAs**: Specify performance requirements (TTFT, ITL) and model information
+2. **Automatic Profiling**: The operator profiles your model to find optimal configurations
+3. **Auto-Deploy**: With `autoApply: true`, the operator deploys the configuration that profiling selected for your SLAs
+```mermaid
+flowchart TD
+    A[Create DGDR] --> B[DGDR Controller]
+    B --> C{Profiling Method}
+    C -->|Online| D[Run Profiling Job<br/>2-4 hours]
+    C -->|Offline/AIC| E[AI Configurator<br/>20-30 seconds]
+    D --> F[Generate DGD Config]
+    E --> F
+    F --> G[Auto-Deploy DGD]
+    G --> H[Monitor & Scale]
+    style A fill:#e1f5fe
+    style D fill:#fff3e0
+    style E fill:#e8f5e8
+    style G fill:#f3e5f5
+    style H fill:#fff8e1
+```
+### Monitoring Progress
+Watch DGDR status (the commands in this guide assume a DGDR named `sla-aic`):
+```bash
+# View status
+kubectl get dgdr -n $NAMESPACE
+# Detailed status
+kubectl describe dgdr sla-aic -n $NAMESPACE
+# Watch profiling job logs
+kubectl logs -f job/profile-sla-aic -n $NAMESPACE
+```
+**DGDR Status States:**
+- `Pending`: Initial state, preparing to profile
+- `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
+- `Deploying`: Generating and applying DGD configuration
+- `Ready`: DGD successfully deployed and running
+- `Failed`: Error occurred (check events for details)
+### Relationship to DGD
+- **DGDR**: High-level "intent" -- what you want deployed
+- **DGD**: Low-level "implementation" -- how it's deployed
+The DGDR controller generates a DGD that:
+- Uses optimal TP configurations from profiling
+- Includes the SLA planner for autoscaling
+- Has deployment and engine settings tuned for your SLAs
+The generated DGD is tracked via labels:
+```yaml
+metadata:
+  labels:
+    dgdr.nvidia.com/name: sla-aic
+    dgdr.nvidia.com/namespace: your-namespace
+```
+## Configuration
+### DGDR Configuration
+#### Required Fields
+| Field | Type | Description |
+|-------|------|-------------|
+| `spec.model` | string | Model identifier (e.g., `meta-llama/Llama-3-70b`) |
+| `spec.backend` | enum | Inference backend: `vllm`, `sglang`, or `trtllm` |
+| `spec.profilingConfig.profilerImage` | string | Container image for profiling job |
+| `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) |
+#### Optional Fields
+| Field | Type | Description |
+|-------|------|-------------|
+| `spec.deploymentOverrides.workersImage` | string | Container image for DGD workers. If omitted, uses image from base config. |
+| `spec.autoApply` | boolean | Automatically deploy DGD after profiling (default: false) |
+| `spec.useMocker` | boolean | Deploy mocker instead of real backend (default: false) |
+| `spec.deploymentOverrides` | object | Customize metadata and image for auto-created DGD |
+#### SLA Configuration
+```yaml
+sla:
+  isl: 3000      # Average input sequence length (tokens)
+  osl: 150       # Average output sequence length (tokens)
+  ttft: 200      # Target Time To First Token (milliseconds, float)
+  itl: 20        # Target Inter-Token Latency (milliseconds, float)
+```
+**Choosing SLA Values:**
+- **ISL/OSL**: Based on your expected traffic patterns
+- **TTFT**: First token latency target (lower = more GPUs needed)
+- **ITL**: Token generation latency target (lower = more GPUs needed)
+- **Trade-offs**: Tighter SLAs require more GPU resources
+For comprehensive documentation of all configuration options, see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
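+The required fields, the SLA block, and the optional overrides can be combined into one manifest. The following is a minimal sketch rather than a verified example: the `apiVersion` value and the `metadata.name` are assumptions (the name matches the `sla-aic` used in the `kubectl` commands in this guide), and the model and image values are reused from other snippets in this document.
+```yaml
+# Sketch only: apiVersion and metadata.name are assumptions, not taken from this guide
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: sla-aic
+spec:
+  model: Qwen/Qwen3-32B            # required: model identifier
+  backend: vllm                    # required: vllm, sglang, or trtllm
+  autoApply: true                  # optional: deploy the generated DGD automatically
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"  # required
+    config:
+      sla:                         # required: SLA targets
+        isl: 3000
+        osl: 150
+        ttft: 200
+        itl: 20
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"   # optional
+```
+Leaving `autoApply` at its default of `false` keeps the same manifest valid but stops after profiling, so the generated DGD can be reviewed first.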
+### Profiling Methods
+Choose between **online profiling** (real measurements, 2-4 hours) or **offline profiling** with AI Configurator (estimated, 20-30 seconds):
+```yaml
+# Online Profiling (Default)
+sweep:
+  useAiConfigurator: false
+# Offline Profiling (AI Configurator)
+sweep:
+  useAiConfigurator: true
+  aicSystem: h200_sxm
+  aicHfId: Qwen/Qwen3-32B
+  aicBackendVersion: "0.20.0"
+```
+For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-methods).
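+The `sweep` fragments above do not show where the block sits in a DGDR. The sketch below assumes it is nested under `spec.profilingConfig.config`, next to `sla`; that placement is an assumption and should be checked against the DGDR Configuration Reference.
+```yaml
+# Sketch only: the nesting of `sweep` under profilingConfig.config is an assumption
+spec:
+  profilingConfig:
+    config:
+      sla:
+        isl: 3000
+        osl: 150
+        ttft: 200
+        itl: 20
+      sweep:
+        useAiConfigurator: true    # offline profiling with AI Configurator
+        aicSystem: h200_sxm
+        aicHfId: Qwen/Qwen3-32B
+        aicBackendVersion: "0.20.0"
+```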
+### Load Predictors
+The SLA planner forecasts the number of requests, ISL, and OSL in the next adjustment interval. Four prediction models are supported:
+#### Constant Predictor
+- **Use case**: Stable workloads with long prediction intervals
+- **Behavior**: Assumes next load equals current load
+- **Configuration**: `load-predictor: "constant"`
+#### ARIMA Predictor
+- **Use case**: Time-series data with trends and seasonality
+- **Behavior**: Uses auto-ARIMA to fit optimal model parameters
+- **Configuration**: `load-predictor: "arima"`
+- **Tunable parameters**:
+  - `--load-predictor-log1p`: model `log1p(y)` instead of `y`. If not set, ARIMA starts in raw space, and if it collapses to `(0,d,0)`, it falls back to `log1p` automatically.
+#### Kalman Predictor
+- **Use case**: Low-latency online forecasting (observe 1 -> predict 1) with smooth adaptation
+- **Behavior**: Local linear trend Kalman filter (fast online updates; good default when ARIMA collapses to mean-only)
+- **Configuration**: `load-predictor: "kalman"`
+- **Tunable parameters**:
+  - `--kalman-q-level`: process noise for level (higher = more responsive)
+  - `--kalman-q-trend`: process noise for trend (higher = trend changes faster)
+  - `--kalman-r`: measurement noise (lower = trusts new measurements more)
+  - `--kalman-min-points`: minimum points before forecasting
+  - `--load-predictor-log1p`: model `log1p(y)` instead of `y` (often helps request-rate/count series)
+#### Prophet Predictor
+- **Use case**: Complex seasonal patterns and trend changes
+- **Behavior**: Facebook's [Prophet](https://facebook.github.io/prophet/) model for time-series forecasting
+- **Configuration**: `load-predictor: "prophet"`
+- **Tunable parameters**:
+  - `--prophet-window-size`: bounds internal history to control refit cost
+  - `--load-predictor-log1p`: model `log1p(y)` instead of `y`
+#### Warm-starting Load Predictors (Optional)
+You can warm-start load predictors with a Mooncake-style JSONL trace file:
+- **CLI argument**: `--load-predictor-warmup-trace <path/to/trace.jsonl>`
+- **Effect**: preloads predictors with historical request-count / ISL / OSL samples extracted from the trace
+### Planner Scaling Parameters
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--adjustment-interval` | `180` | Seconds between scaling decisions |
+| `--ttft` | `500.0` | Target Time To First Token (ms) |
+| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
+| `--isl` | `3000` | Expected average input sequence length |
+| `--osl` | `150` | Expected average output sequence length |
+| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
+| `--min-endpoint` | `1` | Minimum replicas per worker type |
+| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
+| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
+| `--no-operation` | `false` | Observation mode (no actual scaling) |
+| `--no-correction` | `false` | Disable correction factors |
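+As a worked example of how the GPU arguments interact, the total GPU count can be checked against the budget. This sketch assumes that `--max-gpu-budget` bounds the sum, over both worker types, of replicas times GPUs per engine. Here $n_p$ and $n_d$ are the prefill and decode replica counts, $g_p$ and $g_d$ are `--prefill-engine-num-gpu` and `--decode-engine-num-gpu`, and $B$ is `--max-gpu-budget`; the numeric values are illustrative.
+$$
+n_p \cdot g_p + n_d \cdot g_d \le B
+$$
+With $g_p = 2$, $g_d = 1$, and $B = 8$, a target of $n_p = 2$ and $n_d = 4$ uses $2 \cdot 2 + 4 \cdot 1 = 8$ GPUs and fits the budget, while $n_d = 5$ would need $9$ GPUs and exceed it.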
+#### Planner Configuration Passthrough
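+Pass planner-specific settings through the DGDR under `profilingConfig.config.planner`. The example below sets the planner's minimum endpoint count to 2: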
+```yaml
+profilingConfig:
+  config:
+    planner:
+      plannerMinEndpoint: 2
+```
+## Integration
+### Prometheus Setup
+The planner queries Prometheus to collect frontend request metrics. The components interact as follows:
+```mermaid
+flowchart LR
+  Frontend --"/metrics"--> Prometheus
+  Planner --"query API"--> Prometheus
+  Planner --"scaling decisions"--> Workers
+  Frontend -.->|"requests"| Workers
+```
+**Components:**
+- **Frontend**: Serves requests and exposes `/metrics`
+- **Prometheus**: Scrapes frontend metrics every 5s (configurable in the PodMonitor manifest)
+- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval
+- **Workers**: Prefill and decode (backend) workers handle inference
+The planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with request count, ISL, OSL, TTFT, and ITL in the correct format. The Dynamo frontend provides these metrics automatically.
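+The scrape interval mentioned above lives in a Prometheus Operator `PodMonitor`. The following sketch shows the general shape of such a manifest; the resource name, the pod label, the port name, and the namespaces are hypothetical placeholders, not the values shipped with Dynamo.
+```yaml
+# Sketch only: name, label, port, and namespaces are hypothetical placeholders
+apiVersion: monitoring.coreos.com/v1
+kind: PodMonitor
+metadata:
+  name: dynamo-frontend
+  namespace: monitoring
+spec:
+  namespaceSelector:
+    matchNames:
+      - your-namespace             # namespace where the frontend pods run
+  selector:
+    matchLabels:
+      app: dynamo-frontend         # must match the labels on your frontend pods
+  podMetricsEndpoints:
+    - port: http                   # named container port that serves /metrics
+      path: /metrics
+      interval: 5s                 # scrape interval
+```
+A scrape interval much shorter than the planner's adjustment interval gives each scaling decision many samples to aggregate.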
+**Prometheus endpoint configuration:**
+| Variable | Default |
+|----------|---------|
+| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` |
+If you see errors like "Failed to resolve prometheus service", ensure `PROMETHEUS_ENDPOINT` points to your Prometheus service.
+### Virtual Deployment
+The SLA planner supports virtual deployment mode for customized environments (e.g., custom orchestrators) through the `VirtualConnector`. This connector enables the planner to communicate scaling decisions without directly managing Kubernetes resources.
+The `VirtualConnector` acts as a bridge between the SLA planner and external deployment environments. Instead of PATCHing DGD resources, it writes scaling decisions and waits for the external environment to acknowledge completion.
+#### Scaling Decision Flow
+1. **Decision Generation**: The planner calculates optimal worker counts
+2. **Change Detection**: Skips scaling if target counts match current counts, logging: `"No scaling needed (prefill=X, decode=Y)"`
+3. **Readiness Check**: Verifies previous scaling operations completed by checking `scaled_decision_id >= decision_id`
+4. **Timeout Handling**: If not acknowledged within 30 minutes (1800 seconds), proceeds with new decisions
+5. **Completion Tracking**: Optionally waits for scaling completion confirmation (blocking mode)
+#### Configuration
+To use virtual deployment mode:
+```yaml
+environment: "virtual"
+backend: "vllm"  # or "sglang"
+```
+#### Deployment Environment Requirements
+The external deployment environment must use `VirtualConnectorClient`:
+```python
+from dynamo._core import DistributedRuntime, VirtualConnectorClient
+# `distributed_runtime` (a DistributedRuntime) and `namespace` are supplied by your environment
+client = VirtualConnectorClient(distributed_runtime, namespace)
+```
+With the client created, the environment is expected to run the following loop:
+1. **Monitor Planner**: Continuously watch for scaling decisions: `await client.wait()` (blocks until change)
+2. **Parse Decisions**: Read values: `decision = await client.get()`
+3. **Execute Scaling**: Apply the scaling decisions to your infrastructure
+4. **Acknowledge Completion**: Mark done: `await client.complete(decision)`
+A scaling decision (returned by `client.get()`) contains:
+- `num_prefill_workers`: Target number of prefill workers (-1 if not set)
+- `num_decode_workers`: Target number of decode workers (-1 if not set)
+- `decision_id`: Incremental ID for each scaling decision
+See `components/planner/test/test_virtual_connector.py` for a full example.
+### Grafana Dashboard
+Deploy the planner Grafana dashboard:
+```bash
+kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
+```
+Follow [Dynamo Metrics Collection on Kubernetes](/docs/kubernetes/observability/metrics.md) to access the Grafana UI and select the **Dynamo Planner Dashboard**.
+The dashboard displays:
+- **Worker Counts & GPU Usage**: Current prefill/decode worker counts and cumulative GPU hours
+- **Observed Metrics**: Real-time TTFT, ITL, request rate, and sequence lengths from Prometheus
+- **Predicted Metrics**: Planner's load predictions and recommended replica counts
+- **Correction Factors**: How the planner adjusts predictions based on observed vs expected performance
+> Use the **Namespace** dropdown at the top of the dashboard to filter metrics for your deployment namespace.
+## DGDR Immutability
+DGDRs are **immutable**. To update SLAs or configuration:
+1. Delete the existing DGDR: `kubectl delete dgdr sla-aic -n $NAMESPACE`
+2. Create a new DGDR with updated specifications
+## Manual Deployment Control
+### Option 1: Use DGDR-Generated Configuration (Recommended)
+Disable auto-deployment to review the generated DGD before applying:
+```yaml
+spec:
+  autoApply: false
+```
+Then manually extract and apply:
+```bash
+# Extract generated DGD from DGDR status
+kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' | kubectl apply -n $NAMESPACE -f -
+# Or save to file first for review/modification
+kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
+vi my-dgd.yaml
+kubectl apply -f my-dgd.yaml -n $NAMESPACE
+```
+### Option 2: Use Standalone Planner Templates (Advanced)
+For advanced use cases, use the standalone planner templates in `examples/backends/*/deploy/disagg_planner.yaml`:
+```bash
+# After profiling completes, profiling data is stored in ConfigMaps
+kubectl get configmap dgdr-output-<dgdr-name> -n $NAMESPACE -o yaml
+kubectl get configmap planner-profile-data -n $NAMESPACE -o yaml
+# Update PROMETHEUS_ENDPOINT in the template, then deploy
+kubectl apply -f examples/backends/<backend>/deploy/disagg_planner.yaml -n $NAMESPACE
+```
+## Accessing Profiling Artifacts
+By default, profiling jobs save essential data to ConfigMaps. For detailed artifacts, configure the DGDR to use the `dynamo-pvc` PersistentVolumeClaim. Each location stores the following:
+**ConfigMaps (always created):**
+- Generated DGD configuration
+- Profiling data for Planner (`.json` files)
+**PVC (optional):**
+- Performance plots (PNGs)
+- DGD configuration and logs for each profiled deployment
+- AIPerf profiling artifacts
+- Raw profiling data (`.npz` files)
+- Profiler log
+```bash
+# Setup PVC
+deploy/utils/setup_benchmarking_resources.sh
+# Access results after profiling
+kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
+kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
+kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
+kubectl delete pod pvc-access-pod -n $NAMESPACE
+```
+## Troubleshooting
+### Quick Diagnostics
+```bash
+# Check DGDR status and events
+kubectl describe dgdr sla-aic -n $NAMESPACE
+# Check operator logs
+kubectl logs -n $NAMESPACE -l app.kubernetes.io/name=dynamo-operator --tail=100
+# Check profiling job logs
+kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE
+```
+### Common Issues
+| Issue | Quick Fix |
+|-------|-----------|
+| **DGDR stuck in Pending** | Check GPU availability: `kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'` |
+| **Image pull errors** | Verify secret exists: `kubectl get secret nvcr-imagepullsecret -n $NAMESPACE` |
+| **Profiling fails** | Check job logs: `kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE` |
+| **SLA cannot be met** | Relax TTFT/ITL targets or add more GPUs |
+| **DGD not deployed** | Verify `autoApply: true` in DGDR spec |
+| **Prometheus errors** | Ensure `PROMETHEUS_ENDPOINT` env var points to your Prometheus service |
+For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](/docs/benchmarks/sla_driven_profiling.md#troubleshooting).
+## Related Documentation
+- [Planner README](README.md) -- Overview and quick start
+- [Planner Examples](planner_examples.md) -- DGDR YAML examples and sample configurations
+- [Planner Design](/docs/design_docs/planner_design.md) -- Architecture deep-dive for contributors
+- [DGDR API Reference](/docs/kubernetes/api_reference.md)
+- [Pre-Deployment Profiling](/docs/benchmarks/sla_driven_profiling.md)
+- [Dynamo Operator Guide](/docs/kubernetes/dynamo_operator.md)
--- a/docs/planner/planner_intro.rst
+++ b/docs/planner/planner_intro.rst
@@ -77,6 +77,9 @@ Key features include:
   :hidden:
   Overview <self>
+   Planner README <README>
+   Planner Guide <planner_guide>
+   Planner Examples <planner_examples>
   SLA Planner Quick Start <sla_planner_quickstart>
   SLA-Driven Profiling <../benchmarks/sla_driven_profiling.md>
   SLA-based Planner <sla_planner.md>