docs: improve Planner documentation clarity and fix incorrect defaults (#7303)

Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Commit 197f6595 (unverified), authored Mar 12, 2026 by Anish; committed by GitHub Mar 12, 2026
4 changed files
--- a/docs/components/planner/README.md
+++ b/docs/components/planner/README.md
@@ -4,18 +4,31 @@
 title: Planner
 ---
-The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
+## Why LLM Inference Needs a Different Autoscaler
-The SLA Planner supports two scaling modes:
+Scaling a traditional web service is straightforward: watch CPU or request rate, add replicas when load is high, remove them when it's low. Tools like HPA and KEDA work well for this because the relationship between load and latency is roughly linear — twice the requests means roughly twice the CPU, so a simple threshold policy keeps response times stable.
-- **Throughput-based scaling**: Uses pre-deployment profiling data and traffic prediction to compute the number of replicas needed to meet TTFT and ITL SLA targets. This is the primary scaling mode for production deployments.
+LLM inference breaks these assumptions:
-- **Load-based scaling (Experimental)**: Uses real-time per-worker load metrics (active prefill tokens, active KV blocks) from the router to make SLA-aware scaling decisions via online linear regression. Does not require profiling data. Responds quickly to traffic bursts.
-When both modes are enabled, throughput-based scaling provides a lower bound on replicas (long-term capacity planning) while load-based scaling handles real-time adjustments (burst response).
+- **Latency depends on request content, not just request count.** A single request with a 32K-token prompt consumes orders of magnitude more compute than a short one. Two requests per second can mean completely different GPU loads depending on input/output sequence lengths.
+- **Prefill and decode have different scaling characteristics.** In disaggregated serving, prefill is compute-bound (scales with input length) while decode is memory-bound (scales with concurrent sequences and KV cache usage). A single replica count doesn't capture both.
+- **The metrics that matter aren't standard.** The SLAs users care about — Time to First Token (TTFT) and Inter-Token Latency (ITL) — don't map cleanly to CPU utilization or request throughput. HPA can't target "keep P95 TTFT under 500ms" because that requires understanding the relationship between sequence lengths, GPU memory pressure, and latency.
+- **Scaling decisions are expensive.** Spinning up a GPU worker takes minutes, not seconds. Overscaling wastes GPU-hours at cloud prices; underscaling violates SLAs. The autoscaler needs to predict demand, not just react to it.
-> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](planner-guide.md) for a complete workflow including profiling and deployment.
+The Dynamo **Planner** is an autoscaler purpose-built for these constraints. It understands engine profiling data, tracks per-worker GPU utilization, predicts traffic patterns, and makes scaling decisions that directly target TTFT and ITL SLAs — not proxy metrics.
-> **Need multi-DGD coordination?** See [Global Planner Deployment Guide](global-planner.md). It covers both shared-policy coordination across multiple DGDs and the one-endpoint multi-pool pattern.
+> **New to the Planner?** Start with the [Planner Guide](planner-guide.md) for a complete workflow including profiling and deployment.
+> **Need multi-DGD coordination?** See the [Global Planner Guide](global-planner.md) for shared-policy coordination across multiple DGDs and single-endpoint multi-pool deployments.
+## Scaling Modes
+The Planner supports two scaling modes that can run independently or together:
+- **Throughput-based scaling**: Uses pre-deployment profiling data and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
+- **Load-based scaling (Experimental)**: Uses real-time per-worker load metrics (active prefill tokens, active KV blocks) from the router and fits an online linear regression to make scaling decisions. No profiling data required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
+When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor.
 ## Feature Matrix
@@ -30,6 +43,9 @@ When both modes are enabled, throughput-based scaling provides a lower bound on
 | vLLM | Supported | Supported |
 | **Requires Profiling Data** | Yes | No |
 | **Load Predictors** | ARIMA, Prophet, Kalman, Constant | N/A |
+| **Router** | | |
+| Any (round-robin, random, etc.) | Supported | Not supported |
+| KV Router | Supported | Supported |
 | **Connectors** | | |
 | KubernetesConnector | Supported | Supported |
 | VirtualConnector | Supported | Supported |
@@ -81,15 +97,28 @@ For manual control with throughput-based scaling, use the disaggregated planner
 kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
 ```
+## Current Limitations
+### Load-based scaling (Experimental)
+Load-based scaling is experimental and has the following known limitations. These are actively being addressed as part of the metrics refactor work. Throughput-based scaling is not affected by any of these.
+**Requires the KV Router.** Load-based scaling relies on per-worker engine metrics (active prefill tokens, active KV blocks) published by the [KV Router](../router/README.md). Other routing strategies (round-robin, random) do not emit these metrics, so load-based scaling cannot operate without the KV Router.
+**Scale-down with idle workers.** If a worker receives no requests (for example, because the router is not distributing traffic evenly), the router does not publish metrics for that worker. Without metrics, the Planner cannot evaluate whether the worker is underutilized, which can prevent scale-down decisions. **Workaround:** Ensure traffic distribution reaches all workers. If you observe workers stuck at zero load, check your router configuration.
+### General
+**In-flight requests during scale-down.** When the Planner scales down a worker, the worker is terminated without waiting for in-flight requests to complete. Requests that were mid-prefill on the terminated worker will fail. In disaggregated deployments, this can also affect decode workers that were waiting on KV cache transfers from the terminated prefill worker. **Workaround:** Set `--min-endpoint` to a value that avoids scaling below your steady-state traffic floor, and use a lower `--loadbased-scaling-down-sensitivity` value to reduce the frequency of scale-down events.
 ## Documentation
 | Document | Description |
 |----------|-------------|
-| [Planner Guide](planner-guide.md) | Deployment, configuration, integration, troubleshooting |
+| [Planner Guide](planner-guide.md) | Deployment, configuration, integration |
-| [Global Planner Deployment Guide](global-planner.md) | When to use `GlobalPlanner`, including multi-model coordination and single-endpoint multi-pool deployments |
+| [Planner Design](../../design-docs/planner-design.md) | Architecture and algorithm internals |
 | [Planner Examples](planner-examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
-| [SLA-Driven Profiling](../profiler/profiler-guide.md) | Pre-deployment profiling process and configuration |
+| [Global Planner Guide](global-planner.md) | Multi-DGD coordination, shared GPU budgets, single-endpoint multi-pool deployments |
-| [Planner Design](../../design-docs/planner-design.md) | Architecture deep-dive for contributors |
 ## Configuration Reference
@@ -114,7 +143,7 @@ kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
 | `--adjustment-interval` | `180` | Seconds between throughput-based scaling decisions |
 | `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
 | `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
-| `--no-correction` | `false` | Disable correction factors |
+| `--no-correction` | `true` | Disable correction factors (auto-disabled when load-based scaling is on) |
 | **Load-based scaling (Experimental)** | | |
 | `--enable-loadbased-scaling` | `false` | Enable load-based scaling |
 | `--disable-throughput-scaling` | `false` | Disable throughput-based scaling (required for `agg` mode) |
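
The README change above describes throughput-based scaling as a capacity floor, with load-based scaling adjusting above it. A minimal Python sketch of that combination follows. `ReplicaPlan` and `combine_plans` are hypothetical names introduced for illustration, and the per-role `max` is an assumption about how a floor could be applied, not the Planner's confirmed implementation. `min_endpoint` mirrors the `--min-endpoint` flag mentioned in the workaround text.

```python
from dataclasses import dataclass


@dataclass
class ReplicaPlan:
    """Hypothetical container for a proposed replica count per worker role."""
    prefill: int
    decode: int


def combine_plans(
    throughput_floor: ReplicaPlan,
    load_based: ReplicaPlan,
    min_endpoint: int = 1,
) -> ReplicaPlan:
    """Assumed policy: per role, take the larger proposal, never below min_endpoint."""
    return ReplicaPlan(
        prefill=max(min_endpoint, throughput_floor.prefill, load_based.prefill),
        decode=max(min_endpoint, throughput_floor.decode, load_based.decode),
    )


# The long-interval plan sets the floor; the short-interval plan reacts to a prefill burst.
floor = ReplicaPlan(prefill=2, decode=3)
burst = ReplicaPlan(prefill=4, decode=2)
print(combine_plans(floor, burst))  # ReplicaPlan(prefill=4, decode=3)
```

In this sketch the load-based proposal can raise a role above the floor but cannot push it below, which matches the "adjustments above that floor" wording.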

--- a/docs/components/planner/global-planner.md
+++ b/docs/components/planner/global-planner.md
@@ -6,6 +6,8 @@ title: Global Planner Deployment Guide
 This guide explains how to deploy `GlobalPlanner` and when to use it. `GlobalPlanner` is the centralized scaling execution layer for deployments where multiple DGDs should delegate scaling through one component, whether those DGDs expose separate endpoints or sit behind one shared endpoint.
+> **New to Planner?** We recommend starting with a single-DGD deployment using either throughput-based or load-based scaling before adopting GlobalPlanner. See the [Planner overview](README.md) and [Planner Guide](planner-guide.md) to get started.
 ## Why Global Planner?
 Without `GlobalPlanner`, each DGD's local planner scales only its own deployment directly. That is fine for isolated deployments, but it becomes awkward when you want one place to:
@@ -18,8 +20,8 @@ Without `GlobalPlanner`, each DGD's local planner scales only its own deployment
 ## Terminology
-- **SLA Planner**: The normal `dynamo.planner` component that computes desired replica counts to maintain SLAs.
+- **Planner**: The `dynamo.planner` component that computes desired replica counts to maintain latency SLAs. See the [Planner overview](README.md).
-- **Local Planner**: A pool-local instance of a SLA planner inside one DGD.
+- **Local Planner**: A pool-local instance of the Planner running inside a single DGD.
 - **Global Planner**: The centralized execution and policy layer that receives scale requests from local planners.
 - **Single-endpoint multi-pool deployment**: One model endpoint backed by multiple DGDs for the same model. This pattern uses both `GlobalRouter` and `GlobalPlanner`.

--- a/docs/components/planner/planner-examples.md
+++ b/docs/components/planner/planner-examples.md
@@ -4,13 +4,13 @@
 title: Planner Examples
 ---
-Practical examples for deploying the SLA Planner with throughput-based scaling. All examples below use the DGDR workflow with pre-deployment profiling. For deployment concepts, see the [Planner Guide](planner-guide.md). For a quick overview, see the [Planner README](README.md).
+Practical examples for deploying the Planner with throughput-based scaling. All examples below use the DGDR workflow with pre-deployment profiling. For deployment concepts, see the [Planner Guide](planner-guide.md). For a quick overview, see the [Planner README](README.md).
 ## Basic Examples
 ### Minimal DGDR with AIC (Fastest)
-The simplest way to deploy with the SLA planner. Uses AI Configurator for offline profiling (20-30 seconds instead of hours):
+The simplest way to deploy with the Planner. Uses AI Configurator for offline profiling (20-30 seconds instead of hours):
 ```yaml
 apiVersion: nvidia.com/v1beta1

--- a/docs/components/planner/planner-guide.md
+++ b/docs/components/planner/planner-guide.md
@@ -4,16 +4,16 @@
 title: Planner Guide
 ---
-The Dynamo SLA Planner is an autoscaling controller that adjusts prefill and decode engine replica counts at runtime to meet latency SLAs. It reads traffic signals (Prometheus metrics or load predictor output) and engine performance profiles to decide when to scale up or down.
+The Dynamo Planner is an autoscaling controller that adjusts prefill and decode engine replica counts at runtime to meet latency SLAs. It reads traffic signals (Prometheus metrics or load predictor output) and engine performance profiles to decide when to scale up or down.
-For a quick overview, see the [Planner README](README.md). For architecture internals, see [Planner Design](../../design-docs/planner-design.md).
+For a quick overview, see the [Planner overview](README.md). For architecture internals, see [Planner Design](../../design-docs/planner-design.md).
 ## Scaling Modes
 The planner supports two scaling modes that can be used independently or together:
 - **Throughput-based scaling** (`enable_throughput_scaling: true`): Uses pre-deployment engine interpolation data and traffic prediction to plan capacity. Best for stable, predictable workloads. Requires profiling data generated by the [Profiler](../profiler/profiler-guide.md).
-- **Load-based scaling** (`enable_load_scaling: true`): Uses real-time per-worker engine metrics and online regression. Best for bursty or unpredictable traffic. Does not require profiling data.
+- **Load-based scaling** (`enable_load_scaling: true`): Uses real-time per-worker engine metrics and online regression. Best for bursty or unpredictable traffic. Does not require profiling data. Requires the [KV Router](../router/README.md) — see [Current Limitations](README.md#current-limitations).
 **When to use which:**
@@ -40,7 +40,7 @@ features:
 | Field | Type | Default | Description |
 |-------|------|---------|-------------|
 | `enable_throughput_scaling` | bool | `true` | Enable throughput-based scaling (requires pre-deployment profiling data). |
-| `enable_load_scaling` | bool | `true` | Enable load-based scaling (no pre-deployment profiling data required). |
+| `enable_load_scaling` | bool | `false` | Enable load-based scaling (no pre-deployment profiling data required). |
 At least one scaling mode must be enabled.
@@ -56,23 +56,23 @@ When throughput-based scaling is enabled, the planner needs interpolation curves
 | Field | Type | Default | Description |
 |-------|------|---------|-------------|
-| `throughput_adjustment_interval` | int | `60` | Seconds between throughput-based scaling decisions. |
+| `throughput_adjustment_interval` | int | `180` | Seconds between throughput-based scaling decisions. |
 | `min_endpoint` | int | `1` | Minimum number of engine endpoints to maintain. |
-| `max_gpu_budget` | int | `128` | Maximum total GPUs the planner may allocate. |
+| `max_gpu_budget` | int | `8` | Maximum total GPUs the planner may allocate. |
-| `ttft` | float | `2000.0` | TTFT SLA target (ms) for scaling decisions. |
+| `ttft` | float | `500.0` | TTFT SLA target (ms) for scaling decisions. |
-| `itl` | float | `30.0` | ITL SLA target (ms) for scaling decisions. |
+| `itl` | float | `50.0` | ITL SLA target (ms) for scaling decisions. |
-| `no_correction` | bool | `false` | Disable latency correction factor. Auto-disabled when load-based scaling is on. |
+| `no_correction` | bool | `true` | Disable latency correction factor. Auto-disabled when load-based scaling is on. |
 ### Load-Based Scaling Settings
 | Field | Type | Default | Description |
 |-------|------|---------|-------------|
-| `load_adjustment_interval` | int | `10` | Seconds between load-based scaling decisions. Must be shorter than `throughput_adjustment_interval`. |
+| `load_adjustment_interval` | int | `5` | Seconds between load-based scaling decisions. Must be shorter than `throughput_adjustment_interval`. |
-| `load_learning_window` | int | `120` | Seconds of history used for online regression. |
+| `load_learning_window` | int | `50` | Sliding window size for regression model. |
-| `load_scaling_down_sensitivity` | int | `3` | Number of consecutive underutilized intervals before scaling down. |
+| `load_scaling_down_sensitivity` | int | `80` | Scale-down sensitivity 0–100 (0=never, 100=aggressive). |
 | `load_metric_samples` | int | `10` | Number of metric samples to collect per decision. |
 | `load_min_observations` | int | `5` | Minimum observations before making scaling decisions. |
-| `load_router_metrics_url` | string | `null` | Router metrics endpoint. Required outside Kubernetes mode. |
+| `load_router_metrics_url` | string | `null` | Router metrics endpoint. Auto-discovered in Kubernetes mode. |
 ### General Settings
@@ -87,19 +87,19 @@ When throughput-based scaling is enabled, the planner needs interpolation curves
 | Field | Type | Default | Description |
 |-------|------|---------|-------------|
-| `load_predictor` | string | `linear` | Prediction method: `linear`, `kalman`, or `prophet`. |
+| `load_predictor` | string | `arima` | Prediction method: `constant`, `arima`, `kalman`, or `prophet`. |
-| `load_predictor_log1p` | bool | `true` | Apply log1p transform to load data before prediction. |
+| `load_predictor_log1p` | bool | `false` | Apply log1p transform to load data before prediction. |
-| `prophet_window_size` | int | `300` | Window size (seconds) for Prophet predictor. |
+| `prophet_window_size` | int | `50` | Window size (seconds) for Prophet predictor. |
 | `load_predictor_warmup_trace` | string | `null` | Path to a warmup trace file for bootstrapping predictions. |
 ### Kalman Filter Settings
 | Field | Type | Default | Description |
 |-------|------|---------|-------------|
-| `kalman_q_level` | float | `0.1` | Process noise for level component. |
+| `kalman_q_level` | float | `1.0` | Process noise for level component. |
-| `kalman_q_trend` | float | `0.01` | Process noise for trend component. |
+| `kalman_q_trend` | float | `0.1` | Process noise for trend component. |
-| `kalman_r` | float | `1.0` | Measurement noise. |
+| `kalman_r` | float | `10.0` | Measurement noise. |
-| `kalman_min_points` | int | `10` | Minimum data points before Kalman predictions activate. |
+| `kalman_min_points` | int | `5` | Minimum data points before Kalman predictions activate. |
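
The Kalman settings above name a level component, a trend component, a measurement noise, and a minimum number of points. One common model with exactly those parameters is a local linear trend filter, sketched below in plain Python. This is an assumption about the model family and is not taken from the Planner source. The class name, the initial covariance, and the fallback to the last observation before `min_points` is reached are all hypothetical choices made for illustration.

```python
class LocalLinearTrendKalman:
    """Illustrative level+trend Kalman filter; not the Planner's actual code."""

    def __init__(self, q_level=1.0, q_trend=0.1, r=10.0, min_points=5):
        # Defaults mirror kalman_q_level, kalman_q_trend, kalman_r, kalman_min_points.
        self.q_level, self.q_trend, self.r = q_level, q_trend, r
        self.min_points = min_points
        self.n = 0          # number of observations seen
        self.last = 0.0     # most recent raw observation
        self.level = 0.0    # estimated current load
        self.trend = 0.0    # estimated load change per step
        # State covariance P stored as four scalars (assumed identity at start).
        self.p00, self.p01, self.p10, self.p11 = 1.0, 0.0, 0.0, 1.0

    def observe(self, z: float) -> None:
        self.n += 1
        self.last = z
        if self.n == 1:
            self.level = z  # initialise the level from the first sample
            return
        # Predict step: x' = F x and P' = F P F^T + Q, with F = [[1, 1], [0, 1]].
        level_p = self.level + self.trend
        pp00 = self.p00 + self.p01 + self.p10 + self.p11 + self.q_level
        pp01 = self.p01 + self.p11
        pp10 = self.p10 + self.p11
        pp11 = self.p11 + self.q_trend
        # Update step with observation matrix H = [1, 0].
        innovation = z - level_p
        s = pp00 + self.r
        k0, k1 = pp00 / s, pp10 / s
        self.level = level_p + k0 * innovation
        self.trend = self.trend + k1 * innovation
        self.p00 = (1 - k0) * pp00
        self.p01 = (1 - k0) * pp01
        self.p10 = pp10 - k1 * pp00
        self.p11 = pp11 - k1 * pp01

    def predict_next(self) -> float:
        # Assumed warm-up behaviour: echo the last sample until enough points exist.
        if self.n < self.min_points:
            return self.last
        return self.level + self.trend


kf = LocalLinearTrendKalman()
for load in [100, 110, 120, 130, 140, 150]:
    kf.observe(load)
print(round(kf.predict_next(), 1))
```

In this model, larger `q_level` and `q_trend` values make the filter follow recent samples more closely, while a larger `r` makes it smooth more heavily.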
 ## Integration with Profiler
@@ -122,11 +122,12 @@ If you want one public endpoint for a model but multiple private DGDs optimized
 - one or more prefill pool DGDs
 - one or more decode pool DGDs
-In the current workflow, run profiling independently for each intended pool, then compose the final control DGD plus pool DGDs manually. See [Global Planner Deployment Guide](global-planner.md).
+In the current workflow, run profiling independently for each intended pool, then compose the final control DGD plus pool DGDs manually. See the [Global Planner Guide](global-planner.md).
 ## See Also
-- [Planner README](README.md) — Quick overview
+- [Planner overview](README.md) — Why LLM inference needs a different autoscaler
-- [Global Planner Deployment Guide](global-planner.md) — `GlobalPlanner` deployment patterns and single-endpoint multi-pool workflow
+- [Planner Design](../../design-docs/planner-design.md) — Architecture and algorithm internals
-- [Planner Design](../../design-docs/planner-design.md) — Architecture internals
+- [Planner Examples](planner-examples.md) — DGDR YAML examples, sample configurations, advanced patterns
+- [Global Planner Guide](global-planner.md) — Multi-DGD coordination, shared GPU budgets, single-endpoint multi-pool deployments
 - [Profiler Guide](../profiler/profiler-guide.md) — How profiling data is generated
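
The planner-guide.md tables in this diff state two constraints alongside the corrected defaults: at least one scaling mode must be enabled, and `load_adjustment_interval` must be shorter than `throughput_adjustment_interval`. The sketch below expresses those two rules in Python. `PlannerScalingConfig` and its `validate` method are hypothetical and exist only to illustrate the documented rules; the field names and default values are copied from the tables.

```python
from dataclasses import dataclass


@dataclass
class PlannerScalingConfig:
    """Hypothetical holder for a few planner fields; defaults follow the corrected tables."""
    enable_throughput_scaling: bool = True
    enable_load_scaling: bool = False
    throughput_adjustment_interval: int = 180  # seconds
    load_adjustment_interval: int = 5          # seconds

    def validate(self) -> None:
        # Rule 1: at least one scaling mode must be enabled.
        if not (self.enable_throughput_scaling or self.enable_load_scaling):
            raise ValueError("At least one scaling mode must be enabled.")
        # Rule 2: the load-based interval must be the shorter of the two.
        if self.load_adjustment_interval >= self.throughput_adjustment_interval:
            raise ValueError(
                "load_adjustment_interval must be shorter than "
                "throughput_adjustment_interval."
            )


PlannerScalingConfig().validate()  # documented defaults satisfy both rules

try:
    PlannerScalingConfig(enable_throughput_scaling=False).validate()
except ValueError as err:
    print(f"rejected: {err}")
```

Whether the real planner applies the interval rule when only one mode is enabled is not stated in the diff; this sketch checks it unconditionally, as the table text reads.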