The SLA Planner supports two scaling modes that can be used independently or together:
### Throughput-Based Scaling
Uses pre-deployment profiling data and traffic prediction to compute the number of prefill/decode replicas needed to meet TTFT and ITL SLA targets. Requires profiling data from the Dynamo profiler.
### Load-Based Scaling (Experimental)
Uses real-time per-worker load metrics (active prefill tokens, active KV blocks) from the router to make SLA-aware scaling decisions via online linear regression. Does not require profiling data. Responds quickly to traffic bursts.
When both modes are enabled, throughput-based scaling provides a lower bound on replicas while load-based scaling handles real-time adjustments.
### Support Matrix
| Deployment Type | Throughput-Based | Load-Based (Experimental) |
The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
The SLA Planner supports two scaling modes:
- **Throughput-based scaling**: Uses pre-deployment profiling data and traffic prediction to compute the number of replicas needed to meet TTFT and ITL SLA targets. This is the primary scaling mode for production deployments.
- **Load-based scaling (Experimental)**: Uses real-time per-worker load metrics (active prefill tokens, active KV blocks) from the router to make SLA-aware scaling decisions via online linear regression. Does not require profiling data. Responds quickly to traffic bursts.
When both modes are enabled, throughput-based scaling provides a lower bound on replicas (long-term capacity planning) while load-based scaling handles real-time adjustments (burst response).
> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](planner-guide.md) for a complete workflow including profiling and deployment.
- **Throughput-based scaling** should be enabled whenever engine profiling data is available (through pre-deployment profiling). It provides stable, prediction-based capacity planning.
- **Load-based scaling** should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring profiling data.
- **Both modes together**: For the best of both worlds, enable both. Throughput-based scaling provides a lower bound (long-term capacity), while load-based scaling handles bursts above that floor. When both are enabled, use a longer `--adjustment-interval` for throughput-based scaling.
## Quick Start
...
...
@@ -36,21 +46,35 @@ The Planner monitors system performance and automatically scales prefill/decode
- Dynamo platform installed on Kubernetes ([Installation Guide](../../kubernetes/installation-guide.md))
This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Guide](planner-guide.md) for the full workflow.
See [Planner Guide](planner-guide.md) for the full workflow.
### Load-Based Scaling (without profiling)
To deploy with load-based scaling only (no profiling required), add these arguments to the planner service in your DGD:
```yaml
args:
  - --enable-loadbased-scaling
  - --disable-throughput-scaling
  - --loadbased-adjustment-interval=5
```
The planner will auto-discover the frontend metrics endpoint from the DGD. See [disagg_planner_load.yaml](../../../../tests/planner/scaling/disagg_planner_load.yaml) for a complete example.
### Deploy with DGD (Manual)
### Manual DGD Deployment
For manual control, use the disaggregated planner templates:
For manual control with throughput-based scaling, use the disaggregated planner templates:
Practical examples for deploying the SLA Planner with different configurations. For deployment concepts, see the [Planner Guide](planner-guide.md). For a quick overview, see the [Planner README](README.md).
Practical examples for deploying the SLA Planner with throughput-based scaling. All examples below use the DGDR workflow with pre-deployment profiling. For deployment concepts, see the [Planner Guide](planner-guide.md). For a quick overview, see the [Planner README](README.md).
Deployment, configuration, and integration guide for the Dynamo SLA Planner. For a quick overview, see the [Planner README](README.md). For architecture internals, see [Planner Design](../../design-docs/planner-design.md).
## Scaling Modes
The SLA Planner supports two scaling modes:
- **Throughput-based scaling**: Uses pre-deployment profiling data and traffic prediction. Best for stable, predictable workloads where profiling data is available.
- **Load-based scaling (Experimental)**: Uses real-time per-worker engine metrics and online regression. Best for bursty or unpredictable traffic. Does not require profiling data.
**When to use which mode:**
- Enable **throughput-based scaling** whenever engine profiling data is available. It provides stable, prediction-based capacity planning.
- Enable **load-based scaling** when traffic is bursty or hard to predict. It reacts quickly to real-time load changes.
- Enable **both modes together** for the best of both worlds: throughput-based scaling provides a lower bound (long-term capacity), while load-based scaling handles bursts above that floor. When both are enabled, use a longer `--adjustment-interval` for throughput-based scaling.
**DGDR and scaling modes:** Deploying via DGDR automatically triggers profiling and enables throughput-based scaling. To additionally enable load-based scaling, pass the planner arguments through the DGDR's planner config section:
```yaml
profilingConfig:
  config:
    planner:
      plannerEnableLoadbasedScaling: true
      plannerLoadbasedAdjustmentInterval: 5
```
## Deployment
### Prerequisites
...
...
@@ -191,7 +213,7 @@ For detailed comparison, supported configurations, and limitations, see [SLA-Dri
### Load Predictors
The SLA planner forecasts the number of requests, ISL, and OSL in the next adjustment interval. Four prediction models are supported:
The throughput-based scaling mode forecasts the number of requests, ISL, and OSL in the next adjustment interval. Four prediction models are supported:
#### Constant Predictor
- **Use case**: Stable workloads with long prediction intervals
...
...
@@ -231,15 +253,13 @@ You can warm-start load predictors with a mooncake-style JSONL trace file:
The Planner is Dynamo's autoscaling controller. It observes system metrics, predicts future load, and adjusts prefill/decode worker replica counts to proactively meet SLA targets. This document covers the internal architecture, algorithms, and design trade-offs.
The Planner is Dynamo's autoscaling controller. It supports two scaling modes: **throughput-based** (using profiling data and traffic prediction) and **load-based** (using real-time engine metrics and online regression). This document covers the internal architecture, algorithms, and design trade-offs for both modes.
## Architecture
## Throughput-Based Scaling

...
...
@@ -167,17 +167,48 @@ After the delay:
- **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
- **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
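The warm-up fallback described above can be sketched as follows. This is an illustration only, not the planner's code: `forecast_next`, its parameters, and the reading of "constant predictor" as "repeat the most recent observation" are assumptions made for the example.

```python
from typing import Callable, Sequence


def forecast_next(
    history: Sequence[float],
    predictor: Callable[[Sequence[float]], float],
    min_points: int,
) -> float:
    """Forecast the next value, falling back to a constant predictor during warm-up.

    Hypothetical helper: `predictor` stands in for any model that needs history
    (ARIMA, Prophet, Kalman); `min_points` stands in for its warm-up threshold.
    """
    if len(history) < min_points:
        # Warm-up: assume the next interval looks like the last observed one.
        return history[-1] if history else 0.0
    return predictor(history)


def mean_predictor(history: Sequence[float]) -> float:
    # Toy stand-in for a real forecasting model.
    return sum(history) / len(history)
```

With `min_points=5`, a two-sample history is served by the fallback, while a five-sample history is handed to the real predictor.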
## Load-Based Scaling (Experimental)
The load-based mode uses real-time per-worker metrics from the router to make SLA-aware scaling decisions without requiring profiling data.
### Metrics
The planner pulls per-worker load metrics directly from the frontend's `/metrics` endpoint:
- **Active prefill tokens**: pending prefill tokens per worker
- **Active decode blocks**: active KV blocks per worker
- **Last TTFT, ITL, ISL**: most recent observed latencies per worker
### Regression Model
A sliding-window linear regression maps load to latency:
Given a TTFT/ITL SLA target, the model reverse-solves for the maximum load that satisfies the SLA.
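A minimal sketch of the idea, assuming a one-dimensional ordinary least-squares fit over a fixed-size window. The class name, the window size, and the choice of a single load feature are illustrative assumptions; the planner's actual features and fitting code may differ.

```python
from collections import deque
from typing import Optional, Tuple


class SlidingWindowRegression:
    """Fit latency = slope * load + intercept over the most recent samples."""

    def __init__(self, window_size: int = 50):
        # deque(maxlen=...) drops the oldest sample once the window is full.
        self.samples = deque(maxlen=window_size)

    def add(self, load: float, latency: float) -> None:
        self.samples.append((load, latency))

    def fit(self) -> Optional[Tuple[float, float]]:
        n = len(self.samples)
        if n < 2:
            return None  # not enough data for a line
        mean_x = sum(x for x, _ in self.samples) / n
        mean_y = sum(y for _, y in self.samples) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in self.samples)
        if var_x == 0:
            return None  # all loads identical; slope is undefined
        cov = sum((x - mean_x) * (y - mean_y) for x, y in self.samples)
        slope = cov / var_x
        return slope, mean_y - slope * mean_x

    def max_load_for(self, sla_latency: float) -> Optional[float]:
        """Reverse-solve for the largest load whose predicted latency meets the SLA."""
        model = self.fit()
        if model is None:
            return None
        slope, intercept = model
        if slope <= 0:
            return None  # latency does not grow with load; no finite bound
        return max(0.0, (sla_latency - intercept) / slope)
```

For example, if observed latency follows `0.01 * load + 20` and the SLA is 100, the reverse solve gives a maximum load of about 8000.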
### Scaling Decisions
- **Scale up**: if ALL workers' recent load exceeds the regression-derived target
- **Scale down**: if ALL workers' recent load is below the target adjusted by `(num_workers - 1) / num_workers * sensitivity / 100`
- Only scales by +/-1 per interval (blocking)
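The rules above can be sketched as a single decision function. The function name and signature are hypothetical, and two readings are assumed: "adjusted by" means the target is multiplied by the factor, and a single remaining worker is never scaled down.

```python
from typing import Sequence


def decide_replica_delta(
    worker_loads: Sequence[float],
    target_load: float,
    sensitivity: float = 100.0,
) -> int:
    """Return +1, -1, or 0 for one adjustment interval (hypothetical helper)."""
    n = len(worker_loads)
    if n == 0:
        return 0
    # Scale up only when every worker is above the regression-derived target.
    if all(load > target_load for load in worker_loads):
        return 1
    # Scale down only when every worker would still be under target after
    # losing one replica, scaled by the sensitivity percentage.
    scale_down_threshold = target_load * (n - 1) / n * sensitivity / 100
    if n > 1 and all(load < scale_down_threshold for load in worker_loads):
        return -1
    return 0
```

The `(n - 1) / n` factor reflects that removing one worker spreads the same load over fewer replicas, so each worker needs headroom before a scale-down is safe.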
### Co-existence with Throughput-Based Scaling
When both modes are enabled, throughput-based scaling (longer interval) sets a lower bound on replicas while load-based scaling (shorter interval) handles real-time adjustments above that floor.
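One way to picture the interaction is a small arbiter that tracks the floor set by the slow loop and clamps the fast loop's +/-1 steps to it. `ReplicaArbiter` and its methods are hypothetical names for this sketch; it also assumes that a lowered floor does not itself remove replicas and leaves scale-down to the load-based loop.

```python
class ReplicaArbiter:
    """Combine a throughput-based floor with load-based +/-1 adjustments."""

    def __init__(self, floor: int = 1):
        self.floor = floor
        self.current = floor

    def on_throughput_decision(self, replicas: int) -> int:
        # Slow loop: update the lower bound; raise the count if it is below it.
        self.floor = replicas
        self.current = max(self.current, self.floor)
        return self.current

    def on_load_decision(self, delta: int) -> int:
        # Fast loop: apply the step, but never drop below the floor.
        self.current = max(self.floor, self.current + delta)
        return self.current
```

In this sketch a burst can push the replica count above the floor, and once the burst passes the count steps back down until it reaches the floor again.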
### Aggregated Mode
In aggregated mode (`--mode agg`), engines handle both prefill and decode via chunked prefill. The planner maintains both TTFT and ITL regression models but uses per-worker time-averaged metrics (not instantaneous) for regression training to smooth out chunked prefill noise. Scale up if either prefill or decode signals overload; scale down only if both signal underload.
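The asymmetric rule in the last sentence can be written down directly. This sketch assumes each regression model has already produced a signal of +1 (overload), -1 (underload), or 0 (neither); the function name and the signal encoding are illustrative.

```python
def aggregated_decision(prefill_signal: int, decode_signal: int) -> int:
    """Combine TTFT-side and ITL-side signals for an aggregated worker pool.

    Signals are assumed to be +1 (overload), -1 (underload), or 0 (in range).
    """
    # Either model reporting overload is enough to add a replica.
    if prefill_signal > 0 or decode_signal > 0:
        return 1
    # Removing a replica requires both models to agree on underload.
    if prefill_signal < 0 and decode_signal < 0:
        return -1
    return 0
```

The asymmetry biases the planner toward protecting the SLA: one stressed phase triggers scale-up, while scale-down needs both phases to have headroom.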
## Known Limitations
1. **30-second startup delay**: Hardcoded wait for component registration. It should be replaced with runtime readiness probing.
2. **Adjustment interval vs scaling latency**: If `adjustment_interval` < time to scale, scaling decisions can pile up. The planner logs warnings but doesn't queue.
3. **Average-based interpolation**: The planner uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well.
3. **Average-based interpolation**: Throughput-based scaling uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well.
4. **Single DGD scope**: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported.
5. **Load-based planner deprecated**: The load-based code path exists but is non-functional with current backends (no prefill queue metrics).
## Future Work
- Support aggregated (non-disaggregated) scaling mode for single-worker deployments
- Multi-DGD coordination for shared-cluster scenarios
- Distribution-aware interpolation (beyond mean ISL/OSL)
- Adaptive adjustment interval based on observed scaling latency