@@ -26,13 +26,13 @@ The Dynamo **Planner** is an autoscaler purpose-built for these constraints. It
The Planner supports two scaling modes that can run independently or together:
- **Throughput-based scaling**: Uses pre-deployment profiling data and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
- **Load-based scaling (Experimental)**: Uses real-time per-worker load metrics (active prefill tokens, active KV blocks) from the router and fits an online linear regression to make scaling decisions. No profiling data required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
- **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No profiling data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor.
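One way to picture the floor behaviour is the minimal sketch below. The function name `effective_replicas` and both arguments are hypothetical and not part of the Planner's API; the sketch only assumes that the load-based target is never allowed to drop below the throughput-based replica count.

```python
def effective_replicas(throughput_floor: int, load_based_target: int) -> int:
    """Combine both modes: the throughput-based count acts as a lower bound.

    Hypothetical helper for illustration only; the real Planner may apply
    additional limits (for example a maximum replica budget).
    """
    return max(throughput_floor, load_based_target)


# A burst pushes the load-based target above the floor: the burst wins.
burst = effective_replicas(throughput_floor=4, load_based_target=6)

# Load drops: the load-based mode may want fewer replicas, but the floor holds.
quiet = effective_replicas(throughput_floor=4, load_based_target=2)
```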
Load-based scaling is experimental and has the following known limitations. These are actively being addressed as part of the metrics refactor work. Throughput-based scaling is not affected by any of these.
Load-based scaling has the following known limitations. Throughput-based scaling is not affected by any of these.
**Requires the KV Router.** Load-based scaling relies on per-worker engine metrics (active prefill tokens, active KV blocks) published by the [KV Router](../router/README.md). Other routing strategies (round-robin, random) do not emit these metrics, so load-based scaling cannot operate without the KV Router.
**Scale-down with idle workers.** If a worker receives no requests (for example, because the router is not distributing traffic evenly), the router does not publish metrics for that worker. Without metrics, the Planner cannot evaluate whether the worker is underutilized, which can prevent scale-down decisions. **Workaround:** Ensure traffic distribution reaches all workers. If you observe workers stuck at zero load, check your router configuration.
**Requires ForwardPassMetrics (FPM).** Load-based scaling uses per-engine, per-iteration metrics delivered via the Dynamo event plane (ForwardPassMetrics). FPM is currently available only for vLLM and is enabled automatically when the engine uses `InstrumentedScheduler` and `DYN_FORWARDPASS_METRIC_PORT` is set. The KV Router is **not** required for load-based scaling.
### General
...
...
@@ -144,7 +142,7 @@ Load-based scaling is experimental and has the following known limitations. Thes
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
- **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
- **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
## Load-Based Scaling (Experimental)
## Load-Based Scaling
The load-based mode uses real-time per-worker metrics from the router to make SLA-aware scaling decisions without requiring profiling data.
The load-based mode uses ForwardPassMetrics (FPM) from the Dynamo event plane to make SLA-aware scaling decisions without requiring profiling data or the KV Router.
### Metrics
The planner pulls per-worker load metrics directly from the frontend's `/metrics` endpoint:
- **Active prefill tokens**: pending prefill tokens per worker
- **Active decode blocks**: active KV blocks per worker
- **Last TTFT, ITL, ISL**: most recent observed latencies per worker
Each engine emits per-iteration `ForwardPassMetrics` via ZMQ -> FpmEventRelay -> event plane. The planner subscribes via `FpmEventSubscriber` with automatic engine discovery and MDC-based lifecycle tracking. Key fields used:
- **wall_time**: per-iteration execution time (regression target)
Given a TTFT/ITL SLA target, the model reverse-solves for the maximum load that satisfies the SLA.
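The reverse-solve step can be sketched as follows, assuming a 1D linear fit `latency = slope * load + intercept`, so that the largest load meeting the SLA is `(sla - intercept) / slope`. The helper names `fit_line` and `max_load_for_sla` and the sample data are illustrative assumptions, and the fit here is a plain batch least-squares fit rather than the Planner's online regression.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct load values to fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def max_load_for_sla(slope, intercept, sla):
    """Largest load whose predicted latency stays within the SLA."""
    if slope <= 0:
        # A flat or decreasing fit puts no finite upper bound on load.
        return float("inf")
    return max(0.0, (sla - intercept) / slope)


# Hypothetical observations: load in tokens, latency in seconds.
loads = [1000, 2000, 4000, 8000]
latencies = [0.012, 0.022, 0.042, 0.082]

slope, intercept = fit_line(loads, latencies)
limit = max_load_for_sla(slope, intercept, sla=0.050)
```

With these made-up samples the fit is exactly linear, so `limit` comes out at about 4800 tokens.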
Three specialized regression models (`fpm_regression.py`):
- **PrefillRegressionModel**: 1D regression `sum_prefill_tokens -> wall_time`. Estimates TTFT by simulating chunked prefill scheduling (chunks of `max_num_batched_tokens`).
- **DecodeRegressionModel**: 1D regression `sum_decode_kv_tokens -> wall_time`. Estimates ITL for total decode load (scheduled + queued + avg decode length).
- **AggRegressionModel**: 2D regression `(sum_prefill_tokens, sum_decode_kv_tokens) -> wall_time`. Estimates both TTFT (simulated prefill with piggybacked decode) and ITL (decode with average piggybacked prefill).
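To make the chunked-prefill simulation more concrete, here is a minimal sketch of how a TTFT estimate could be assembled from a 1D `tokens -> wall_time` fit. It assumes each iteration processes at most `max_num_batched_tokens` prefill tokens and that TTFT is the sum of the predicted per-iteration times. The function `estimate_ttft` and its parameters are hypothetical; the actual logic in `fpm_regression.py` may differ.

```python
def estimate_ttft(pending_prefill_tokens, max_num_batched_tokens, slope, intercept):
    """Sum predicted wall_time over the chunks needed to drain the prefill queue.

    slope and intercept come from an assumed linear fit
    wall_time = slope * tokens_in_iteration + intercept.
    """
    if max_num_batched_tokens <= 0:
        raise ValueError("max_num_batched_tokens must be positive")
    ttft = 0.0
    remaining = pending_prefill_tokens
    while remaining > 0:
        chunk = min(remaining, max_num_batched_tokens)
        ttft += slope * chunk + intercept  # predicted time for this iteration
        remaining -= chunk
    return ttft


# 5000 pending tokens with 2048-token chunks -> iterations of 2048, 2048, 904.
ttft = estimate_ttft(5000, 2048, slope=1e-5, intercept=0.002)
```

Because every chunk pays the fixed per-iteration intercept, the estimate grows with the number of chunks as well as with the total token count.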
### Scaling Decisions
- **Scale up**: if ALL workers' recent load exceeds the regression-derived target
- **Scale down**: if ALL workers' recent load is below the target adjusted by `(num_workers - 1) / num_workers * sensitivity / 100`
- Only scales by +/-1 per interval (blocking)
- **Prefill/Decode**: Scale up if ALL engines' estimated TTFT/ITL > SLA; scale down if ALL < SLA * sensitivity
- **Agg**: Scale up if (ALL TTFT > SLA) OR (ALL ITL > SLA); scale down if (ALL TTFT < SLA * sensitivity) AND (ALL ITL < SLA * sensitivity)
- Only scales by +/-1 per interval (non-blocking with pending-desired guard: metrics continue to be observed while scaling is in progress, but no new scaling action is issued until the previous one completes)
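The FPM-based rules above can be sketched as two small decision functions that return `+1`, `-1`, or `0`. This is an illustration under stated assumptions, not the Planner's implementation: the function names are hypothetical, `sensitivity` is treated as a fraction between 0 and 1, and the pending-desired guard is modelled as a simple boolean flag.

```python
def decide_disagg(estimates, sla, sensitivity, scaling_in_progress=False):
    """Prefill or decode pool: estimates are per-engine TTFT (or ITL) values."""
    if scaling_in_progress or not estimates:
        return 0  # guard: never stack a new action on an unfinished one
    if all(e > sla for e in estimates):
        return 1
    if all(e < sla * sensitivity for e in estimates):
        return -1
    return 0


def decide_agg(ttfts, itls, ttft_sla, itl_sla, sensitivity, scaling_in_progress=False):
    """Aggregated pool: either metric can trigger scale-up, both must allow scale-down."""
    if scaling_in_progress or not ttfts or not itls:
        return 0
    if all(t > ttft_sla for t in ttfts) or all(i > itl_sla for i in itls):
        return 1
    ttft_ok = all(t < ttft_sla * sensitivity for t in ttfts)
    itl_ok = all(i < itl_sla * sensitivity for i in itls)
    return -1 if (ttft_ok and itl_ok) else 0


# Hypothetical per-engine estimates in seconds.
up = decide_disagg([0.6, 0.7], sla=0.5, sensitivity=0.8)
hold = decide_disagg([0.3, 0.6], sla=0.5, sensitivity=0.8)
down = decide_disagg([0.30, 0.35], sla=0.5, sensitivity=0.8)
```

Requiring ALL engines to agree keeps a single outlier from triggering a scale event, and the asymmetric scale-down threshold (`SLA * sensitivity`) leaves headroom so the pool does not oscillate around the target.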