refactor: load planner using new forwardpass metric and many improvements (#7351)

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Commit a58bcc31 (unverified), authored Mar 25, 2026 by Hongkuan Zhou; committed by GitHub on Mar 25, 2026
9 changed files
--- a/docs/components/planner/README.md
+++ b/docs/components/planner/README.md
@@ -26,13 +26,13 @@ The Dynamo **Planner** is an autoscaler purpose-built for these constraints. It
 The Planner supports two scaling modes that can run independently or together:
 - **Throughput-based scaling**: Uses pre-deployment profiling data and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
- **Load-based scaling (Experimental)**: Uses real-time per-worker load metrics (active prefill tokens, active KV blocks) from the router and fits an online linear regression to make scaling decisions. No profiling data required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
+- **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No profiling data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
 When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor.
 ## Feature Matrix
-| Feature | Throughput-Based | Load-Based (Experimental) |
+| Feature | Throughput-Based | Load-Based |
 |---------|:----------------:|:-------------------------:|
 | **Deployment** | | |
 | Disaggregated | Supported | Supported |
@@ -99,13 +99,11 @@ kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
 ## Current Limitations
-### Load-based scaling (Experimental)
+### Load-based scaling
-Load-based scaling is experimental and has the following known limitations. These are actively being addressed as part of the metrics refactor work. Throughput-based scaling is not affected by any of these.
+Load-based scaling has the following known limitations. Throughput-based scaling is not affected by any of these.
-**Requires the KV Router.** Load-based scaling relies on per-worker engine metrics (active prefill tokens, active KV blocks) published by the [KV Router](../router/README.md). Other routing strategies (round-robin, random) do not emit these metrics, so load-based scaling cannot operate without the KV Router.
+**Requires ForwardPassMetrics (FPM).** Load-based scaling uses per-engine per-iteration metrics delivered via the Dynamo event plane (ForwardPassMetrics). FPM is currently only available for vllm and is automatically enabled when the engine uses `InstrumentedScheduler` and `DYN_FORWARDPASS_METRIC_PORT` is set. The KV Router is **not** required for load-based scaling.
-**Scale-down with idle workers.** If a worker receives no requests (for example, because the router is not distributing traffic evenly), the router does not publish metrics for that worker. Without metrics, the Planner cannot evaluate whether the worker is underutilized, which can prevent scale-down decisions. **Workaround:** Ensure traffic distribution reaches all workers. If you observe workers stuck at zero load, check your router configuration.
 ### General
@@ -144,7 +142,7 @@ Load-based scaling is experimental and has the following known limitations. Thes
 | `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
 | `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
 | `--no-correction` | `true` | Disable correction factors (auto-disabled when load-based scaling is on) |
-| **Load-based scaling (Experimental)** | | |
+| **Load-based scaling** | | |
 | `--enable-loadbased-scaling` | `false` | Enable load-based scaling |
 | `--disable-throughput-scaling` | `false` | Disable throughput-based scaling (required for `agg` mode) |
 | `--loadbased-router-metrics-url` | auto-discovered | URL to router's `/metrics` endpoint |
@@ -186,7 +184,7 @@ The dashboard shows:
 - TTFT and ITL distributions
 - Input/output sequence lengths
-**Load-based scaling** pulls per-engine status directly from the frontend's `/metrics` endpoint:
+**Load-based scaling** uses ForwardPassMetrics (FPM) from the Dynamo event plane:
- Active prefill tokens per worker
+- Per-iteration wall time, scheduled prefill/decode tokens, and queued request status
- Active decode blocks per worker
+- Delivered via `FpmEventSubscriber` with automatic engine discovery and lifecycle tracking
- Last observed TTFT, ITL, and ISL per worker
+- No router `/metrics` scraping required
--- a/docs/components/planner/planner-guide.md
+++ b/docs/components/planner/planner-guide.md
@@ -72,7 +72,6 @@ When throughput-based scaling is enabled, the planner needs interpolation curves
 | `load_scaling_down_sensitivity` | int | `80` | Scale-down sensitivity 0–100 (0=never, 100=aggressive). |
 | `load_metric_samples` | int | `10` | Number of metric samples to collect per decision. |
 | `load_min_observations` | int | `5` | Minimum observations before making scaling decisions. |
-| `load_router_metrics_url` | string | `null` | Router metrics endpoint. Auto-discovered in Kubernetes mode. |
 ### General Settings

--- a/docs/design-docs/planner-design.md
+++ b/docs/design-docs/planner-design.md
@@ -165,30 +165,31 @@ After the delay:
 - **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
 - **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
-## Load-Based Scaling (Experimental)
+## Load-Based Scaling
-The load-based mode uses real-time per-worker metrics from the router to make SLA-aware scaling decisions without requiring profiling data.
+The load-based mode uses ForwardPassMetrics (FPM) from the Dynamo event plane to make SLA-aware scaling decisions without requiring profiling data or the KV Router.
 ### Metrics
-The planner pulls per-worker load metrics directly from the frontend's `/metrics` endpoint:
+Each engine emits per-iteration `ForwardPassMetrics` via ZMQ -> FpmEventRelay -> event plane. The planner subscribes via `FpmEventSubscriber` with automatic engine discovery and MDC-based lifecycle tracking. Key fields used:
- **Active prefill tokens**: pending prefill tokens per worker
+- **wall_time**: per-iteration execution time (regression target)
- **Active decode blocks**: active KV blocks per worker
+- **scheduled_requests.sum_prefill_tokens**: prefill regression input
- **Last TTFT, ITL, ISL**: most recent observed latencies per worker
+- **scheduled_requests.sum_decode_kv_tokens**: decode regression input
+- **queued_requests**: queued prefill/decode load for TTFT/ITL simulation
+- Idle heartbeats (wall_time=0) are skipped
-### Regression Model
+### Regression Models
-A sliding-window linear regression maps load to latency:
+Three specialized regression models (`fpm_regression.py`):
- Prefill: `(active_prefill_tokens + ISL)` -> `TTFT`
+- **PrefillRegressionModel**: 1D regression `sum_prefill_tokens -> wall_time`. Estimates TTFT by simulating chunked prefill scheduling (chunks of `max_num_batched_tokens`).
- Decode: `active_decode_blocks` -> `ITL`
+- **DecodeRegressionModel**: 1D regression `sum_decode_kv_tokens -> wall_time`. Estimates ITL for total decode load (scheduled + queued + avg decode length).
+- **AggRegressionModel**: 2D regression `(sum_prefill_tokens, sum_decode_kv_tokens) -> wall_time`. Estimates both TTFT (simulated prefill with piggybacked decode) and ITL (decode with average piggybacked prefill).
-Given a TTFT/ITL SLA target, the model reverse-solves for the maximum load that satisfies the SLA.
 ### Scaling Decisions
- **Scale up**: if ALL workers' recent load exceeds the regression-derived target
+- **Prefill/Decode**: Scale up if ALL engines' estimated TTFT/ITL > SLA; scale down if ALL < SLA * sensitivity
- **Scale down**: if ALL workers' recent load is below the target adjusted by `(num_workers - 1) / num_workers * sensitivity / 100`
+- **Agg**: Scale up if (ALL TTFT > SLA) OR (ALL ITL > SLA); scale down if (ALL TTFT < SLA * sensitivity) AND (ALL ITL < SLA * sensitivity)
- Only scales by +/-1 per interval (blocking)
+- Only scales by +/-1 per interval (non-blocking with pending-desired guard: metrics continue to be observed while scaling is in progress, but no new scaling action is issued until the previous one completes)
 ### Co-existence with Throughput-Based Scaling
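
Reviewer sketch for the "Regression Models" and "Scaling Decisions" hunks above. This is a minimal illustration, not the code in `fpm_regression.py`: it fits a 1D least-squares line `sum_prefill_tokens -> wall_time`, estimates TTFT by summing the predicted wall time of each prefill chunk of at most `max_num_batched_tokens`, and then applies the ALL-engines scale-up/scale-down rules. The function names (`fit_line`, `estimate_ttft`, `scaling_delta`, `agg_scaling_delta`) are hypothetical, and interpreting the sensitivity as a percentage (`SLA * sensitivity / 100`) is an assumption based on the 0-100 range documented for `load_scaling_down_sensitivity`.

```python
from typing import Sequence


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("all x values are identical")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def estimate_ttft(pending_prefill_tokens: int, max_num_batched_tokens: int,
                  slope: float, intercept: float) -> float:
    """Simulate chunked prefill: add the predicted wall time of every chunk."""
    if max_num_batched_tokens <= 0:
        raise ValueError("max_num_batched_tokens must be positive")
    total = 0.0
    remaining = pending_prefill_tokens
    while remaining > 0:
        chunk = min(remaining, max_num_batched_tokens)
        total += slope * chunk + intercept
        remaining -= chunk
    return total


def scaling_delta(estimates: Sequence[float], sla: float,
                  sensitivity_pct: int) -> int:
    """Return +1, -1, or 0 replicas for a prefill-only or decode-only pool."""
    if not estimates:
        return 0
    if all(e > sla for e in estimates):
        return 1
    # Assumption: sensitivity is a percentage applied to the SLA.
    if all(e < sla * sensitivity_pct / 100 for e in estimates):
        return -1
    return 0


def agg_scaling_delta(ttfts: Sequence[float], itls: Sequence[float],
                      ttft_sla: float, itl_sla: float,
                      sensitivity_pct: int) -> int:
    """Agg mode: scale up if either metric is violated on ALL engines,
    scale down only if both metrics are comfortably below SLA on ALL engines."""
    ttft_delta = scaling_delta(ttfts, ttft_sla, sensitivity_pct)
    itl_delta = scaling_delta(itls, itl_sla, sensitivity_pct)
    if ttft_delta == 1 or itl_delta == 1:
        return 1
    if ttft_delta == -1 and itl_delta == -1:
        return -1
    return 0
```

The sketch always returns a step of at most one replica, matching the +/-1 per interval rule. It does not model the pending-desired guard, queued requests, or the skipping of idle heartbeats, which the real planner handles separately.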

--- a/lib/bindings/python/Cargo.lock
+++ b/lib/bindings/python/Cargo.lock
@@ -1763,6 +1763,7 @@ dependencies = [
 "anyhow",
 "async-trait",
 "clap",
+ "dashmap 6.1.0",
 "dynamo-kv-router",
 "dynamo-llm",
 "dynamo-mocker",
@@ -1774,6 +1775,7 @@ dependencies = [
 "pyo3",
 "pyo3-async-runtimes",
 "pythonize",
+ "rmp",
 "serde",
 "serde_json",
 "thiserror 2.0.18",

--- a/lib/bindings/python/Cargo.toml
+++ b/lib/bindings/python/Cargo.toml
@@ -36,7 +36,9 @@ dynamo-parsers = { path = "../../parsers" }
 anyhow = { version = "1" }
 async-trait = { version = "0.1" }
+dashmap = { version = "6.1" }
 futures = { version = "0.3" }
+rmp = { version = "0.8" }
 once_cell = { version = "1.20.3" }
 parking_lot = { version = "0.12.4" }
 serde = { version = "1" }

--- a/lib/bindings/python/rust/llm/fpm.rs
+++ b/lib/bindings/python/rust/llm/fpm.rs
--- a/lib/bindings/python/src/dynamo/_core.pyi
+++ b/lib/bindings/python/src/dynamo/_core.pyi
@@ -841,12 +841,23 @@ class FpmEventSubscriber:
    """
    Subscriber for ForwardPassMetrics from the Dynamo event plane.
    Auto-discovers engine publishers via the discovery plane.
+    Two mutually exclusive usage modes:
+    1. **recv mode** (default): call ``recv()`` to pull individual messages.
+    2. **tracking mode**: call ``start_tracking()`` once, then poll
+       ``get_recent_stats()`` to retrieve the latest FPM bytes keyed by
+       ``(worker_id, dp_rank)``.  Stale entries are cleaned up when
+       workers are removed (via discovery watch).
    """
    def __init__(self, endpoint: Endpoint) -> None:
        """
        Create a subscriber that auto-discovers FPM publishers.
+        No background tasks are started until ``recv()`` or
+        ``start_tracking()`` is called.
        Args:
            endpoint: Dynamo component endpoint (provides runtime + discovery).
        """
@@ -857,13 +868,48 @@ class FpmEventSubscriber:
        Blocking receive of the next message (raw msgspec bytes).
        Releases the GIL while waiting.
+        On the first call a background subscriber task is spawned (recv mode).
+        Cannot be used after ``start_tracking()``.
        Returns:
            Raw msgspec payload, or None if the stream is closed.
        """
        ...
+    def start_tracking(self) -> None:
+        """
+        Start background tracking of the latest FPM per (worker_id, dp_rank).
+        Spawns two background tasks:
+        1. Event consumption: subscribes to FPM events, extracts the composite
+           key (worker_id, dp_rank) from the msgpack payload, stores latest
+           raw bytes in an internal map.
+        2. MDC discovery watch: monitors ComponentModels for the target
+           component.  When a model is removed, all entries whose
+           worker_id matches the removed instance_id are purged.
+        After calling this, ``recv()`` will raise RuntimeError.
+        """
+        ...
+    def get_recent_stats(self) -> dict[tuple[str, int], bytes]:
+        """
+        Return the latest FPM bytes for every tracked (worker_id, dp_rank).
+        Cleanup of removed engines is handled by the MDC discovery watch
+        task spawned by ``start_tracking()``.
+        Raises RuntimeError if ``start_tracking()`` has not been called.
+        Returns:
+            dict mapping ``(worker_id, dp_rank)`` to raw msgspec bytes.
+            Decode each value with ``forward_pass_metrics.decode(data)``.
+        """
+        ...
    def shutdown(self) -> None:
-        """Shut down the subscriber."""
+        """Shut down the subscriber (all background tasks)."""
        ...
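
Reviewer sketch of the tracking mode described in the stub above. It relies only on the three methods shown in the `.pyi` (`start_tracking()`, `get_recent_stats()`, `shutdown()`), expressed as a `Protocol` so the helper runs without the Dynamo runtime. `poll_latest_metrics`, `TrackingSubscriber`, and `FakeSubscriber` are hypothetical names, and `json.loads` stands in for the real decoder (`forward_pass_metrics.decode`), since the actual payload is msgspec-encoded rather than JSON.

```python
import json
import time
from typing import Any, Callable, Protocol


class TrackingSubscriber(Protocol):
    """Structural type for the tracking-mode methods shown in the stub."""

    def start_tracking(self) -> None: ...
    def get_recent_stats(self) -> dict[tuple[str, int], bytes]: ...
    def shutdown(self) -> None: ...


def poll_latest_metrics(
    subscriber: TrackingSubscriber,
    decode: Callable[[bytes], Any],
    polls: int = 1,
    interval_s: float = 0.0,
) -> list[dict[tuple[str, int], Any]]:
    """Start tracking, take `polls` snapshots, and always shut down."""
    subscriber.start_tracking()
    snapshots: list[dict[tuple[str, int], Any]] = []
    try:
        for i in range(polls):
            raw = subscriber.get_recent_stats()
            # Keys are (worker_id, dp_rank); values are raw encoded bytes.
            snapshots.append({key: decode(data) for key, data in raw.items()})
            if i + 1 < polls:
                time.sleep(interval_s)
    finally:
        subscriber.shutdown()
    return snapshots


class FakeSubscriber:
    """Hypothetical in-memory stand-in used only to exercise the helper."""

    def __init__(self, stats: dict[tuple[str, int], bytes]) -> None:
        self._stats = stats
        self.tracking = False
        self.closed = False

    def start_tracking(self) -> None:
        self.tracking = True

    def get_recent_stats(self) -> dict[tuple[str, int], bytes]:
        # Mirrors the documented contract: tracking must be started first.
        if not self.tracking:
            raise RuntimeError("start_tracking() has not been called")
        return dict(self._stats)

    def shutdown(self) -> None:
        self.closed = True


fake = FakeSubscriber({("worker-a", 0): b'{"wall_time": 0.02}'})
snapshots = poll_latest_metrics(fake, decode=json.loads)
```

With the real binding, the same helper would presumably take an `FpmEventSubscriber(endpoint)` and the real decoder. Because the two usage modes are mutually exclusive, a subscriber passed to this helper should not also be used with `recv()`.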

--- a/tests/planner/test_replica_calculation.py
+++ b/tests/planner/test_replica_calculation.py
@@ -23,6 +23,7 @@ from dynamo.planner.utils.planner_core import (
 )
 from dynamo.planner.utils.prefill_planner import PrefillPlanner
 from dynamo.planner.utils.prometheus import Metrics
+from dynamo.planner.worker_info import WorkerInfo
 pytestmark = [pytest.mark.pre_merge, pytest.mark.gpu_0]
@@ -56,12 +57,12 @@ class PlannerHarness:
        target_replicas = [
            {
                "sub_component_type": "prefill",
-                "component_name": self.prefill_planner.prefill_component_name,
+                "component_name": self.prefill_planner.prefill_worker_info.k8s_name,
                "desired_replicas": next_num_p,
            },
            {
                "sub_component_type": "decode",
-                "component_name": self.prefill_planner.decode_component_name,
+                "component_name": self.prefill_planner.decode_worker_info.k8s_name,
                "desired_replicas": next_num_d,
            },
        ]
@@ -83,12 +84,12 @@ class PlannerHarness:
        }
        prefill_attrs = {
            "prefill_interpolator",
-            "prefill_component_name",
+            "prefill_worker_info",
            "p_correction_factor",
        }
        decode_attrs = {
            "decode_interpolator",
-            "decode_component_name",
+            "decode_worker_info",
            "d_correction_factor",
        }
        if name == "last_metrics":
@@ -185,6 +186,20 @@ def planner():
        decode_planner = DecodePlanner(mock_runtime, config, shared_state=shared_state)
        planner = PlannerHarness(prefill_planner, decode_planner, shared_state)
+        # Set up WorkerInfo for both planners
+        prefill_planner.prefill_worker_info = WorkerInfo(
+            k8s_name="VllmPrefillWorker",
+            component_name="prefill",
+            endpoint="generate",
+        )
+        prefill_planner.decode_worker_info = WorkerInfo(
+            k8s_name="VllmDecodeWorker",
+            component_name="backend",
+            endpoint="generate",
+        )
+        decode_planner.prefill_worker_info = prefill_planner.prefill_worker_info
+        decode_planner.decode_worker_info = prefill_planner.decode_worker_info
        # Mock the interpolators to return fixed values for testing
        planner.prefill_interpolator = Mock()
        planner.decode_interpolator = Mock()

--- a/tests/planner/unit/test_load_based_scaling.py
+++ b/tests/planner/unit/test_load_based_scaling.py