---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Planner Guide
---

# Planner Guide

The Dynamo SLA Planner is an autoscaling controller that adjusts prefill and decode engine replica counts at runtime to meet latency SLAs. It reads traffic signals (Prometheus metrics or load predictor output) and engine performance profiles to decide when to scale up or down.

For a quick overview, see the [Planner README](README.md). For architecture internals, see [Planner Design](../../design-docs/planner-design.md).

## Scaling Modes

The planner supports two scaling modes that can be used independently or together:

- **Throughput-based scaling** (`enable_throughput_scaling: true`): Uses pre-deployment engine interpolation data and traffic prediction to plan capacity. Best for stable, predictable workloads. Requires profiling data generated by the [Profiler](../profiler/profiler-guide.md).
- **Load-based scaling** (`enable_load_scaling: true`): Uses real-time per-worker engine metrics and online regression. Best for bursty or unpredictable traffic. Does not require profiling data.

**When to use which:**

- Enable **throughput-based scaling** whenever profiling data is available. It provides stable, prediction-based capacity planning.
- Enable **load-based scaling** when traffic is bursty. It reacts quickly to real-time load changes.
- Enable **both** to combine the two: throughput-based scaling sets a capacity floor, and load-based scaling handles bursts above it. When both are enabled, use a longer `throughput_adjustment_interval`; it must remain longer than `load_adjustment_interval`.
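
One way to picture the combined mode is as a floor plus a burst term. The sketch below is illustrative only: the function name and the use of `max` are assumptions about how the two targets could be merged, not the planner's actual code.

```python
def combine_replica_targets(throughput_target: int, load_target: int) -> int:
    """Merge the two scaling decisions into one replica count.

    Assumption: the throughput-based target acts as a floor, and the
    load-based target may only raise the count above that floor.
    """
    return max(throughput_target, load_target)


# Steady traffic: the prediction-based floor wins.
print(combine_replica_targets(throughput_target=4, load_target=2))
# Burst: load-based scaling raises the count above the floor.
print(combine_replica_targets(throughput_target=4, load_target=7))
```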

## PlannerConfig Reference

The planner is configured via a `PlannerConfig` JSON/YAML object. When using the profiler, this is placed under the `features.planner` section of the DGDR spec:

```yaml
features:
  planner:
    enable_throughput_scaling: true
    enable_load_scaling: false
    pre_deployment_sweeping_mode: rapid
    mode: disagg
    backend: vllm
```
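
Because `PlannerConfig` is a plain JSON/YAML object, the same fragment can also be expressed as JSON. A minimal sketch using only the Python standard library; the variable names are illustrative and not part of any Dynamo API.

```python
import json

# Same fields as the YAML fragment above, written as a Python dict.
dgdr_fragment = {
    "features": {
        "planner": {
            "enable_throughput_scaling": True,
            "enable_load_scaling": False,
            "pre_deployment_sweeping_mode": "rapid",
            "mode": "disagg",
            "backend": "vllm",
        }
    }
}

# The PlannerConfig is the object nested under features.planner.
planner_config = dgdr_fragment["features"]["planner"]
print(json.dumps(planner_config, indent=2))
```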

### Scaling Mode Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enable_throughput_scaling` | bool | `true` | Enable throughput-based scaling (requires pre-deployment profiling data). |
| `enable_load_scaling` | bool | `true` | Enable load-based scaling (no pre-deployment profiling data required). |

At least one scaling mode must be enabled.
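
This rule is easy to check before deploying. The helper below is a hypothetical pre-flight check written for this guide, not a function exported by the planner; it treats a missing key as the documented default of `true`.

```python
def check_scaling_modes(config: dict) -> None:
    """Raise ValueError if both scaling modes are disabled.

    Missing keys fall back to the documented default (true).
    """
    throughput = config.get("enable_throughput_scaling", True)
    load = config.get("enable_load_scaling", True)
    if not (throughput or load):
        raise ValueError(
            "at least one of enable_throughput_scaling or "
            "enable_load_scaling must be true"
        )


# Passes: throughput-based scaling is still enabled.
check_scaling_modes({"enable_throughput_scaling": True, "enable_load_scaling": False})
```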

### Pre-Deployment Sweeping

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `pre_deployment_sweeping_mode` | string | `rapid` | How to generate engine interpolation data: `rapid` (AIC simulation, ~30s), `thorough` (real GPUs, 2-4h), or `none` (skip). |

When throughput-based scaling is enabled, the planner needs interpolation curves that map ISL to TTFT (prefill) and KV-cache utilization to ITL (decode). The profiler generates this data based on the `pre_deployment_sweeping_mode` setting. See the [Profiler Guide](../profiler/profiler-guide.md) for details on how this data is produced.
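
As a rough illustration of how such a curve can be queried, the sketch below linearly interpolates TTFT from ISL. The sample points are invented for the example and the helper `interpolate` is hypothetical; real curves come from the profiler, and the planner may interpolate differently.

```python
from bisect import bisect_right

# Hypothetical profiling points: (ISL in tokens, TTFT in ms), sorted by ISL.
PREFILL_CURVE = [(512, 40.0), (1024, 80.0), (2048, 180.0), (4096, 420.0)]


def interpolate(curve: list[tuple[float, float]], x: float) -> float:
    """Piecewise-linear lookup, clamped to the first and last points."""
    xs = [p[0] for p in curve]
    ys = [p[1] for p in curve]
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)  # index of the first point strictly right of x
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# An ISL of 1536 sits halfway between the 1024 and 2048 sample points.
print(interpolate(PREFILL_CURVE, 1536))
```

The same lookup would apply to the decode curve, with KV-cache utilization on the x-axis and ITL on the y-axis.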

### Throughput-Based Scaling Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `throughput_adjustment_interval` | int | `60` | Seconds between throughput-based scaling decisions. |
| `min_endpoint` | int | `1` | Minimum number of engine endpoints to maintain. |
| `max_gpu_budget` | int | `128` | Maximum total GPUs the planner may allocate. |
| `ttft` | float | `2000.0` | TTFT SLA target (ms) for scaling decisions. |
| `itl` | float | `30.0` | ITL SLA target (ms) for scaling decisions. |
| `no_correction` | bool | `false` | Disable the latency correction factor. The correction is turned off automatically when load-based scaling is enabled. |
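
The interaction between `min_endpoint` and `max_gpu_budget` can be pictured as a clamp on the planned replica count. The sketch below is a simplification under stated assumptions: `capacity_per_replica` and `gpus_per_replica` are hypothetical inputs, it handles a single engine pool, and it is not the planner's actual algorithm.

```python
import math


def plan_replicas(
    predicted_load: float,
    capacity_per_replica: float,  # hypothetical: load one replica serves within the SLA
    gpus_per_replica: int,        # hypothetical: GPUs consumed by one replica
    min_endpoint: int = 1,
    max_gpu_budget: int = 128,
) -> int:
    """Clamp the replica count between min_endpoint and the GPU budget."""
    needed = math.ceil(predicted_load / capacity_per_replica)
    max_replicas = max_gpu_budget // gpus_per_replica
    # If the two bounds conflict, min_endpoint wins (an arbitrary choice here).
    return max(min_endpoint, min(needed, max_replicas))


print(plan_replicas(9000, 2000, gpus_per_replica=4))       # demand-driven
print(plan_replicas(0, 2000, gpus_per_replica=4))          # held at min_endpoint
print(plan_replicas(1_000_000, 2000, gpus_per_replica=4))  # capped by the GPU budget
```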

### Load-Based Scaling Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `load_adjustment_interval` | int | `10` | Seconds between load-based scaling decisions. Must be shorter than `throughput_adjustment_interval`. |
| `load_learning_window` | int | `120` | Seconds of history used for online regression. |
| `load_scaling_down_sensitivity` | int | `3` | Number of consecutive underutilized intervals before scaling down. |
| `load_metric_samples` | int | `10` | Number of metric samples to collect per decision. |
| `load_min_observations` | int | `5` | Minimum observations before making scaling decisions. |
| `load_router_metrics_url` | string | `null` | Router metrics endpoint. Required outside Kubernetes mode. |
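
Two of these settings describe behavior that is easy to sketch: the interval constraint and the scale-down streak. The class and function below are hypothetical illustrations of those descriptions, assuming the streak resets on any non-underutilized interval; the planner's own bookkeeping may differ.

```python
def check_intervals(load_adjustment_interval: int,
                    throughput_adjustment_interval: int) -> None:
    """Enforce that load-based decisions run more often than throughput-based ones."""
    if load_adjustment_interval >= throughput_adjustment_interval:
        raise ValueError(
            "load_adjustment_interval must be shorter than "
            "throughput_adjustment_interval"
        )


class ScaleDownGate:
    """Allow a scale-down only after N consecutive underutilized intervals."""

    def __init__(self, sensitivity: int = 3) -> None:
        self.sensitivity = sensitivity
        self.streak = 0

    def observe(self, underutilized: bool) -> bool:
        """Record one interval; return True when a scale-down may proceed."""
        self.streak = self.streak + 1 if underutilized else 0
        if self.streak >= self.sensitivity:
            self.streak = 0  # start counting again after acting
            return True
        return False


check_intervals(load_adjustment_interval=10, throughput_adjustment_interval=60)
gate = ScaleDownGate(sensitivity=3)
# A busy interval in the middle resets the streak, delaying the scale-down.
print([gate.observe(u) for u in [True, True, False, True, True, True]])
```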

### General Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `mode` | string | `disagg` | Planner mode: `disagg`, `prefill`, `decode`, or `agg`. |
| `backend` | string | `vllm` | Backend: `vllm`, `sglang`, `trtllm`, or `mocker`. |
| `environment` | string | `kubernetes` | Runtime environment: `kubernetes`, `virtual`, or `global-planner`. |
| `namespace` | string | env `DYN_NAMESPACE` | Kubernetes namespace for the deployment. |
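
The allowed values in this table lend themselves to a small pre-flight check. The sketch below is a hypothetical helper, not part of the planner; it only encodes the values and defaults listed above and reads `DYN_NAMESPACE` when `namespace` is not set.

```python
import os

# Allowed values and defaults copied from the table above.
ALLOWED = {
    "mode": {"disagg", "prefill", "decode", "agg"},
    "backend": {"vllm", "sglang", "trtllm", "mocker"},
    "environment": {"kubernetes", "virtual", "global-planner"},
}
DEFAULTS = {"mode": "disagg", "backend": "vllm", "environment": "kubernetes"}


def resolve_general_settings(config: dict) -> dict:
    """Apply defaults and reject values outside the documented sets."""
    resolved = {}
    for field, allowed in ALLOWED.items():
        value = config.get(field, DEFAULTS[field])
        if value not in allowed:
            raise ValueError(f"{field}={value!r} is not one of {sorted(allowed)}")
        resolved[field] = value
    # namespace falls back to the DYN_NAMESPACE environment variable (may be None).
    resolved["namespace"] = config.get("namespace", os.environ.get("DYN_NAMESPACE"))
    return resolved


print(resolve_general_settings({"backend": "sglang", "namespace": "demo"}))
```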

### Traffic Prediction Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `load_predictor` | string | `linear` | Prediction method: `linear`, `kalman`, or `prophet`. |
| `load_predictor_log1p` | bool | `true` | Apply log1p transform to load data before prediction. |
| `prophet_window_size` | int | `300` | Window size (seconds) for Prophet predictor. |
| `load_predictor_warmup_trace` | string | `null` | Path to a warmup trace file for bootstrapping predictions. |
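
To show what a log1p transform does in a prediction loop, the sketch below fits a least-squares line to recent load samples and extrapolates one step ahead. This is an assumed reading of a `linear` predictor written for illustration; the function `predict_next` is hypothetical and the planner's implementation may differ.

```python
import math


def predict_next(history: list[float], use_log1p: bool = True) -> float:
    """Fit a least-squares line to the history and extrapolate one step."""
    if len(history) < 2:
        return history[-1] if history else 0.0
    # log1p(x) = log(1 + x) compresses spikes and is defined at zero load.
    ys = [math.log1p(v) for v in history] if use_log1p else list(history)
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    y_next = mean_y + slope * (n - mean_x)
    # expm1 inverts log1p; clamp because load cannot be negative.
    value = math.expm1(y_next) if use_log1p else y_next
    return max(0.0, value)


print(predict_next([10, 20, 30, 40], use_log1p=False))
print(predict_next([10, 20, 30, 40], use_log1p=True))
```

With the transform enabled, the fit is linear in log space, so the extrapolation follows a multiplicative trend rather than an additive one.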

### Kalman Filter Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `kalman_q_level` | float | `0.1` | Process noise for level component. |
| `kalman_q_trend` | float | `0.01` | Process noise for trend component. |
| `kalman_r` | float | `1.0` | Measurement noise. |
| `kalman_min_points` | int | `10` | Minimum data points before Kalman predictions activate. |
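
These four parameters match the shape of a standard level-plus-trend (local linear trend) Kalman filter. The sketch below is a textbook version of that filter, assuming a state of `[level, trend]`, an identity initial covariance, and a fallback to the last observation before `min_points` is reached; all of these are assumptions, and the class is not the planner's own predictor.

```python
class LevelTrendKalman:
    """Local linear trend filter with state [level, trend]."""

    def __init__(self, q_level: float = 0.1, q_trend: float = 0.01,
                 r: float = 1.0, min_points: int = 10) -> None:
        self.q_level, self.q_trend, self.r = q_level, q_trend, r
        self.min_points = min_points
        self.n = 0
        self.last = 0.0
        self.level = 0.0
        self.trend = 0.0
        # Symmetric 2x2 covariance stored as three entries (assumed identity start).
        self.p00, self.p01, self.p11 = 1.0, 0.0, 1.0

    def update(self, y: float) -> None:
        self.last = y
        self.n += 1
        if self.n == 1:
            self.level = y  # seed the level with the first observation
            return
        # Predict: x = F x and P = F P F^T + Q, with F = [[1, 1], [0, 1]].
        level = self.level + self.trend
        p00 = self.p00 + 2 * self.p01 + self.p11 + self.q_level
        p01 = self.p01 + self.p11
        p11 = self.p11 + self.q_trend
        # Update: only the level is observed, so H = [1, 0].
        s = p00 + self.r
        k0, k1 = p00 / s, p01 / s
        residual = y - level
        self.level = level + k0 * residual
        self.trend = self.trend + k1 * residual
        self.p00 = (1 - k0) * p00
        self.p01 = (1 - k0) * p01
        self.p11 = p11 - k1 * p01

    def predict(self) -> float:
        """One-step-ahead forecast; last observation until warmed up."""
        if self.n < self.min_points:
            return self.last
        return self.level + self.trend


kf = LevelTrendKalman()
for t in range(60):
    kf.update(float(t))  # noise-free ramp
print(kf.predict())      # close to the next ramp value, 60
```

Larger `q_level` and `q_trend` values make the filter follow recent changes more closely, while a larger `r` makes it smooth more aggressively.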

## Integration with Profiler

When the profiler runs with the planner enabled, it:

1. Selects the best prefill and decode engine configurations
2. Generates interpolation curves (TTFT vs ISL, ITL vs KV-cache utilization)
3. Saves the `PlannerConfig` and profiling data into separate Kubernetes ConfigMaps
4. Adds the planner service to the generated DGD, configured to read from those ConfigMaps

The planner receives its config via `--config /path/to/planner_config.json`, which is mounted from the `planner-config-XXXX` ConfigMap. Profiling data is mounted from the `planner-profile-data-XXXX` ConfigMap.
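
For local experiments it can help to see what consuming such a flag looks like. The sketch below is a hypothetical stand-in that uses only the standard library; it mirrors the `--config` flag named above but is not the planner's real entry point.

```python
import argparse
import json


def load_planner_config(argv: list[str]) -> dict:
    """Parse --config and return the JSON document it points to."""
    parser = argparse.ArgumentParser(description="Illustrative planner config loader")
    parser.add_argument("--config", required=True,
                        help="Path to planner_config.json")
    args = parser.parse_args(argv)
    with open(args.config, encoding="utf-8") as f:
        return json.load(f)
```

Calling `load_planner_config(["--config", "/path/to/planner_config.json"])` returns the parsed config as a dict, which can then be passed to whatever validation is in place.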

See the [Profiler Guide](../profiler/profiler-guide.md) for the full profiling workflow and how to configure pre-deployment sweeping.

## See Also

- [Planner README](README.md) — Quick overview
- [Planner Design](../../design-docs/planner-design.md) — Architecture internals
- [Profiler Guide](../profiler/profiler-guide.md) — How profiling data is generated