Unverified commit a58bcc31, authored by Hongkuan Zhou, committed by GitHub

refactor: load planner using new forwardpass metric and many improvements (#7351)


Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
parent db14d63f
@@ -26,13 +26,13 @@ The Dynamo **Planner** is an autoscaler purpose-built for these constraints. It
The Planner supports two scaling modes that can run independently or together: The Planner supports two scaling modes that can run independently or together:
- **Throughput-based scaling**: Uses pre-deployment profiling data and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments. - **Throughput-based scaling**: Uses pre-deployment profiling data and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
- **Load-based scaling (Experimental)**: Uses real-time per-worker load metrics (active prefill tokens, active KV blocks) from the router and fits an online linear regression to make scaling decisions. No profiling data required. Adjusts on a short interval (default 5s) to respond quickly to bursts. - **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No profiling data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor. When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor.
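The "capacity floor" relationship can be pictured with a minimal sketch. The function name `combine_replicas` and the use of a plain `max` are assumptions made for illustration; the Planner's actual combination logic is not shown in this diff.

```python
def combine_replicas(throughput_floor: int, load_based_desired: int) -> int:
    """Illustrative only: pick the replica count when both modes are enabled.

    throughput_floor: replicas computed by throughput-based scaling (long-term plan).
    load_based_desired: replicas requested by load-based scaling (real-time).
    """
    # Assumption: the throughput-based result is a lower bound, and
    # load-based scaling may only move the count at or above that bound.
    return max(throughput_floor, load_based_desired)
```

With a floor of 3, a load-based request for 2 replicas would stay at 3, while a request for 5 would be honored.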
## Feature Matrix ## Feature Matrix
| Feature | Throughput-Based | Load-Based (Experimental) | | Feature | Throughput-Based | Load-Based |
|---------|:----------------:|:-------------------------:| |---------|:----------------:|:-------------------------:|
| **Deployment** | | | | **Deployment** | | |
| Disaggregated | Supported | Supported | | Disaggregated | Supported | Supported |
@@ -99,13 +99,11 @@ kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
## Current Limitations ## Current Limitations
### Load-based scaling (Experimental) ### Load-based scaling
Load-based scaling is experimental and has the following known limitations. These are actively being addressed as part of the metrics refactor work. Throughput-based scaling is not affected by any of these. Load-based scaling has the following known limitations. Throughput-based scaling is not affected by any of these.
**Requires the KV Router.** Load-based scaling relies on per-worker engine metrics (active prefill tokens, active KV blocks) published by the [KV Router](../router/README.md). Other routing strategies (round-robin, random) do not emit these metrics, so load-based scaling cannot operate without the KV Router. **Requires ForwardPassMetrics (FPM).** Load-based scaling uses per-engine per-iteration metrics delivered via the Dynamo event plane (ForwardPassMetrics). FPM is currently only available for vllm and is automatically enabled when the engine uses `InstrumentedScheduler` and `DYN_FORWARDPASS_METRIC_PORT` is set. The KV Router is **not** required for load-based scaling.
**Scale-down with idle workers.** If a worker receives no requests (for example, because the router is not distributing traffic evenly), the router does not publish metrics for that worker. Without metrics, the Planner cannot evaluate whether the worker is underutilized, which can prevent scale-down decisions. **Workaround:** Ensure traffic distribution reaches all workers. If you observe workers stuck at zero load, check your router configuration.
### General ### General
@@ -144,7 +142,7 @@ Load-based scaling is experimental and has the following known limitations. Thes
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) | | `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) | | `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
| `--no-correction` | `true` | Disable correction factors (auto-disabled when load-based scaling is on) | | `--no-correction` | `true` | Disable correction factors (auto-disabled when load-based scaling is on) |
| **Load-based scaling (Experimental)** | | | | **Load-based scaling** | | |
| `--enable-loadbased-scaling` | `false` | Enable load-based scaling | | `--enable-loadbased-scaling` | `false` | Enable load-based scaling |
| `--disable-throughput-scaling` | `false` | Disable throughput-based scaling (required for `agg` mode) | | `--disable-throughput-scaling` | `false` | Disable throughput-based scaling (required for `agg` mode) |
| `--loadbased-router-metrics-url` | auto-discovered | URL to router's `/metrics` endpoint | | `--loadbased-router-metrics-url` | auto-discovered | URL to router's `/metrics` endpoint |
@@ -186,7 +184,7 @@ The dashboard shows:
- TTFT and ITL distributions - TTFT and ITL distributions
- Input/output sequence lengths - Input/output sequence lengths
**Load-based scaling** pulls per-engine status directly from the frontend's `/metrics` endpoint: **Load-based scaling** uses ForwardPassMetrics (FPM) from the Dynamo event plane:
- Active prefill tokens per worker - Per-iteration wall time, scheduled prefill/decode tokens, and queued request status
- Active decode blocks per worker - Delivered via `FpmEventSubscriber` with automatic engine discovery and lifecycle tracking
- Last observed TTFT, ITL, and ISL per worker - No router `/metrics` scraping required
@@ -72,7 +72,6 @@ When throughput-based scaling is enabled, the planner needs interpolation curves
| `load_scaling_down_sensitivity` | int | `80` | Scale-down sensitivity 0–100 (0=never, 100=aggressive). | | `load_scaling_down_sensitivity` | int | `80` | Scale-down sensitivity 0–100 (0=never, 100=aggressive). |
| `load_metric_samples` | int | `10` | Number of metric samples to collect per decision. | | `load_metric_samples` | int | `10` | Number of metric samples to collect per decision. |
| `load_min_observations` | int | `5` | Minimum observations before making scaling decisions. | | `load_min_observations` | int | `5` | Minimum observations before making scaling decisions. |
| `load_router_metrics_url` | string | `null` | Router metrics endpoint. Auto-discovered in Kubernetes mode. |
### General Settings ### General Settings
......
@@ -165,30 +165,31 @@ After the delay:
- **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration. - **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
- **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback. - **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
## Load-Based Scaling (Experimental) ## Load-Based Scaling
The load-based mode uses real-time per-worker metrics from the router to make SLA-aware scaling decisions without requiring profiling data. The load-based mode uses ForwardPassMetrics (FPM) from the Dynamo event plane to make SLA-aware scaling decisions without requiring profiling data or the KV Router.
### Metrics ### Metrics
The planner pulls per-worker load metrics directly from the frontend's `/metrics` endpoint: Each engine emits per-iteration `ForwardPassMetrics` via ZMQ -> FpmEventRelay -> event plane. The planner subscribes via `FpmEventSubscriber` with automatic engine discovery and MDC-based lifecycle tracking. Key fields used:
- **Active prefill tokens**: pending prefill tokens per worker - **wall_time**: per-iteration execution time (regression target)
- **Active decode blocks**: active KV blocks per worker - **scheduled_requests.sum_prefill_tokens**: prefill regression input
- **Last TTFT, ITL, ISL**: most recent observed latencies per worker - **scheduled_requests.sum_decode_kv_tokens**: decode regression input
- **queued_requests**: queued prefill/decode load for TTFT/ITL simulation
- Idle heartbeats (wall_time=0) are skipped
### Regression Model ### Regression Models
A sliding-window linear regression maps load to latency: Three specialized regression models (`fpm_regression.py`):
- Prefill: `(active_prefill_tokens + ISL)` -> `TTFT` - **PrefillRegressionModel**: 1D regression `sum_prefill_tokens -> wall_time`. Estimates TTFT by simulating chunked prefill scheduling (chunks of `max_num_batched_tokens`).
- Decode: `active_decode_blocks` -> `ITL` - **DecodeRegressionModel**: 1D regression `sum_decode_kv_tokens -> wall_time`. Estimates ITL for total decode load (scheduled + queued + avg decode length).
- **AggRegressionModel**: 2D regression `(sum_prefill_tokens, sum_decode_kv_tokens) -> wall_time`. Estimates both TTFT (simulated prefill with piggybacked decode) and ITL (decode with average piggybacked prefill).
Given a TTFT/ITL SLA target, the model reverse-solves for the maximum load that satisfies the SLA.
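As a rough illustration of the prefill case, the sketch below fits a 1D least-squares line from prefill tokens to wall time and then estimates TTFT by summing predicted iteration times over chunks of at most `max_num_batched_tokens`. The names `LinearFit`, `fit_1d`, and `estimate_ttft` are hypothetical. The real `fpm_regression.py` is not shown in this diff and may differ, for example in its windowing and in how queued requests are handled.

```python
from dataclasses import dataclass


@dataclass
class LinearFit:
    """Line wall_time = slope * tokens + intercept."""

    slope: float
    intercept: float

    def predict(self, x: float) -> float:
        return self.slope * x + self.intercept


def fit_1d(xs: list[float], ys: list[float]) -> LinearFit:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # All samples share one x value: fall back to a flat line.
        return LinearFit(0.0, mean_y)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return LinearFit(slope, mean_y - slope * mean_x)


def estimate_ttft(
    fit: LinearFit, pending_prefill_tokens: int, max_num_batched_tokens: int
) -> float:
    """Sum predicted iteration times over successive prefill chunks."""
    total = 0.0
    remaining = pending_prefill_tokens
    while remaining > 0:
        chunk = min(remaining, max_num_batched_tokens)
        total += fit.predict(chunk)
        remaining -= chunk
    return total


# Synthetic observations (not measured data): (sum_prefill_tokens, wall_time in s).
fit = fit_1d([1000.0, 2000.0, 3000.0], [0.011, 0.021, 0.031])
ttft = estimate_ttft(fit, pending_prefill_tokens=5000, max_num_batched_tokens=2048)
```

Each chunk pays the fixed per-iteration intercept once, so a smaller `max_num_batched_tokens` raises the estimated TTFT for the same pending load.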
### Scaling Decisions ### Scaling Decisions
- **Scale up**: if ALL workers' recent load exceeds the regression-derived target - **Prefill/Decode**: Scale up if ALL engines' estimated TTFT/ITL > SLA; scale down if ALL < SLA * sensitivity
- **Scale down**: if ALL workers' recent load is below the target adjusted by `(num_workers - 1) / num_workers * sensitivity / 100` - **Agg**: Scale up if (ALL TTFT > SLA) OR (ALL ITL > SLA); scale down if (ALL TTFT < SLA * sensitivity) AND (ALL ITL < SLA * sensitivity)
- Only scales by +/-1 per interval (blocking) - Only scales by +/-1 per interval (non-blocking with pending-desired guard: metrics continue to be observed while scaling is in progress, but no new scaling action is issued until the previous one completes)
### Co-existence with Throughput-Based Scaling ### Co-existence with Throughput-Based Scaling
......
@@ -1763,6 +1763,7 @@ dependencies = [
"anyhow", "anyhow",
"async-trait", "async-trait",
"clap", "clap",
"dashmap 6.1.0",
"dynamo-kv-router", "dynamo-kv-router",
"dynamo-llm", "dynamo-llm",
"dynamo-mocker", "dynamo-mocker",
@@ -1774,6 +1775,7 @@ dependencies = [
"pyo3", "pyo3",
"pyo3-async-runtimes", "pyo3-async-runtimes",
"pythonize", "pythonize",
"rmp",
"serde", "serde",
"serde_json", "serde_json",
"thiserror 2.0.18", "thiserror 2.0.18",
......
@@ -36,7 +36,9 @@ dynamo-parsers = { path = "../../parsers" }
anyhow = { version = "1" } anyhow = { version = "1" }
async-trait = { version = "0.1" } async-trait = { version = "0.1" }
dashmap = { version = "6.1" }
futures = { version = "0.3" } futures = { version = "0.3" }
rmp = { version = "0.8" }
once_cell = { version = "1.20.3" } once_cell = { version = "1.20.3" }
parking_lot = { version = "0.12.4" } parking_lot = { version = "0.12.4" }
serde = { version = "1" } serde = { version = "1" }
......
@@ -841,12 +841,23 @@ class FpmEventSubscriber:
""" """
Subscriber for ForwardPassMetrics from the Dynamo event plane. Subscriber for ForwardPassMetrics from the Dynamo event plane.
Auto-discovers engine publishers via the discovery plane. Auto-discovers engine publishers via the discovery plane.
Two mutually exclusive usage modes:
1. **recv mode** (default): call ``recv()`` to pull individual messages.
2. **tracking mode**: call ``start_tracking()`` once, then poll
``get_recent_stats()`` to retrieve the latest FPM bytes keyed by
``(worker_id, dp_rank)``. Stale entries are cleaned up when
workers are removed (via discovery watch).
""" """
def __init__(self, endpoint: Endpoint) -> None: def __init__(self, endpoint: Endpoint) -> None:
""" """
Create a subscriber that auto-discovers FPM publishers. Create a subscriber that auto-discovers FPM publishers.
No background tasks are started until ``recv()`` or
``start_tracking()`` is called.
Args: Args:
endpoint: Dynamo component endpoint (provides runtime + discovery). endpoint: Dynamo component endpoint (provides runtime + discovery).
""" """
@@ -857,13 +868,48 @@ class FpmEventSubscriber:
Blocking receive of the next message (raw msgspec bytes). Blocking receive of the next message (raw msgspec bytes).
Releases the GIL while waiting. Releases the GIL while waiting.
On the first call a background subscriber task is spawned (recv mode).
Cannot be used after ``start_tracking()``.
Returns: Returns:
Raw msgspec payload, or None if the stream is closed. Raw msgspec payload, or None if the stream is closed.
""" """
... ...
def start_tracking(self) -> None:
"""
Start background tracking of the latest FPM per (worker_id, dp_rank).
Spawns two background tasks:
1. Event consumption: subscribes to FPM events, extracts the composite
key (worker_id, dp_rank) from the msgpack payload, stores latest
raw bytes in an internal map.
2. MDC discovery watch: monitors ComponentModels for the target
component. When a model is removed, all entries whose
worker_id matches the removed instance_id are purged.
After calling this, ``recv()`` will raise RuntimeError.
"""
...
def get_recent_stats(self) -> dict[tuple[str, int], bytes]:
"""
Return the latest FPM bytes for every tracked (worker_id, dp_rank).
Cleanup of removed engines is handled by the MDC discovery watch
task spawned by ``start_tracking()``.
Raises RuntimeError if ``start_tracking()`` has not been called.
Returns:
dict mapping ``(worker_id, dp_rank)`` to raw msgspec bytes.
Decode each value with ``forward_pass_metrics.decode(data)``.
"""
...
def shutdown(self) -> None: def shutdown(self) -> None:
"""Shut down the subscriber.""" """Shut down the subscriber (all background tasks)."""
... ...
......
@@ -23,6 +23,7 @@ from dynamo.planner.utils.planner_core import (
) )
from dynamo.planner.utils.prefill_planner import PrefillPlanner from dynamo.planner.utils.prefill_planner import PrefillPlanner
from dynamo.planner.utils.prometheus import Metrics from dynamo.planner.utils.prometheus import Metrics
from dynamo.planner.worker_info import WorkerInfo
pytestmark = [pytest.mark.pre_merge, pytest.mark.gpu_0] pytestmark = [pytest.mark.pre_merge, pytest.mark.gpu_0]
@@ -56,12 +57,12 @@ class PlannerHarness:
target_replicas = [ target_replicas = [
{ {
"sub_component_type": "prefill", "sub_component_type": "prefill",
"component_name": self.prefill_planner.prefill_component_name, "component_name": self.prefill_planner.prefill_worker_info.k8s_name,
"desired_replicas": next_num_p, "desired_replicas": next_num_p,
}, },
{ {
"sub_component_type": "decode", "sub_component_type": "decode",
"component_name": self.prefill_planner.decode_component_name, "component_name": self.prefill_planner.decode_worker_info.k8s_name,
"desired_replicas": next_num_d, "desired_replicas": next_num_d,
}, },
] ]
@@ -83,12 +84,12 @@ class PlannerHarness:
} }
prefill_attrs = { prefill_attrs = {
"prefill_interpolator", "prefill_interpolator",
"prefill_component_name", "prefill_worker_info",
"p_correction_factor", "p_correction_factor",
} }
decode_attrs = { decode_attrs = {
"decode_interpolator", "decode_interpolator",
"decode_component_name", "decode_worker_info",
"d_correction_factor", "d_correction_factor",
} }
if name == "last_metrics": if name == "last_metrics":
@@ -185,6 +186,20 @@ def planner():
decode_planner = DecodePlanner(mock_runtime, config, shared_state=shared_state) decode_planner = DecodePlanner(mock_runtime, config, shared_state=shared_state)
planner = PlannerHarness(prefill_planner, decode_planner, shared_state) planner = PlannerHarness(prefill_planner, decode_planner, shared_state)
# Set up WorkerInfo for both planners
prefill_planner.prefill_worker_info = WorkerInfo(
k8s_name="VllmPrefillWorker",
component_name="prefill",
endpoint="generate",
)
prefill_planner.decode_worker_info = WorkerInfo(
k8s_name="VllmDecodeWorker",
component_name="backend",
endpoint="generate",
)
decode_planner.prefill_worker_info = prefill_planner.prefill_worker_info
decode_planner.decode_worker_info = prefill_planner.decode_worker_info
# Mock the interpolators to return fixed values for testing # Mock the interpolators to return fixed values for testing
planner.prefill_interpolator = Mock() planner.prefill_interpolator = Mock()
planner.decode_interpolator = Mock() planner.decode_interpolator = Mock()
......