<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Planner

The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.
Currently, the planner can scale the number of vLLM workers up and down based on the KV cache load and the prefill queue size:

|                     |   | Feature                                                             |
| :------------------ | - | :------------------------------------------------------------------ |
| **Backend**         | ✅ | Local                                                               |
|                     | ✅ | Kubernetes                                                          |
| **LLM Framework**   | ✅ | vLLM                                                                |
|                     | ❌ | TensorRT-LLM                                                        |
|                     | ❌ | SGLang                                                              |
|                     | ❌ | llama.cpp                                                           |
| **Serving Type**    | ✅ | Aggregated                                                          |
|                     | ✅ | Disaggregated                                                       |
| **Planner Actions** | ✅ | Load-based scaling up/down prefill/decode workers                   |
|                     | ✅ | SLA-based scaling up/down prefill/decode workers **<sup>[1]</sup>** |
|                     | ✅ | Adjusting engine knobs                                              |

**<sup>[1]</sup>** Supported with some limitations.


## Load-based Scaling Up/Down Prefill/Decode Workers

To adjust the number of prefill/decode workers, the planner monitors the following metrics:

* Prefill worker: the planner monitors the number of requests pending in the prefill queue to estimate the prefill workload.
* Decode/aggregated worker: the planner monitors the average KV cache utilization rate to estimate the decode/aggregated workload.

Every `metric-pulling-interval`, the planner gathers these metrics.
Every `adjustment-interval`, the planner compares the metrics aggregated over that interval with preset thresholds and decides whether to scale prefill/decode workers up or down.
To avoid over-compensation, the planner changes the number of workers by at most 1 per adjustment interval.
In addition, while the number of workers is being adjusted, the planner blocks metric pulling and adjustment.
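
The threshold comparison can be illustrated with a minimal sketch. It assumes the metrics pulled during one adjustment interval are aggregated with a plain average, which this document does not specify; `decide_adjustment` is a hypothetical helper, not part of the planner code. The thresholds in the example are the documented defaults for `decode-kv-scale-up-threshold` and `decode-kv-scale-down-threshold`.

```python
from statistics import mean


def decide_adjustment(samples, scale_up_threshold, scale_down_threshold):
    """Return +1, -1, or 0 for one adjustment interval.

    `samples` holds the metric values pulled during the interval,
    for example one KV cache utilization reading per metric pull.
    Averaging the samples is an assumption made for this sketch.
    """
    if not samples:
        return 0  # nothing observed, so leave the worker count alone
    avg = mean(samples)
    if avg > scale_up_threshold:
        return 1  # add exactly one worker
    if avg < scale_down_threshold:
        return -1  # remove exactly one worker
    return 0


# Illustrative KV cache utilization readings, not measured data.
kv_samples = [0.92, 0.95, 0.97]
print(decide_adjustment(kv_samples, scale_up_threshold=0.9, scale_down_threshold=0.5))
```

Returning only -1, 0, or +1 mirrors the rule that the worker count changes by at most 1 per adjustment interval.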

To scale up a prefill/decode worker, the planner only needs to launch the worker in the correct namespace.
The auto-discovery mechanism picks up the new worker and adds it to the routers.
To scale down a prefill worker, the planner sends a SIGTERM signal to the prefill worker.
The prefill worker stores the signal and exits once it finishes the current request pulled from the prefill queue.
This ensures that no remote prefill request is dropped.
To scale down a decode worker, the planner revokes the etcd lease of the decode worker.
When the etcd lease is revoked, the corresponding decode worker is immediately removed from the router and receives no new requests.
The decode worker then finishes all of its in-flight requests in their original streams and exits gracefully.
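
The deferred-SIGTERM behavior of the prefill worker can be sketched as follows. This is an illustration only: `GracefulWorker` and its methods are hypothetical names, and a plain Python list stands in for the prefill queue.

```python
import signal


class GracefulWorker:
    """Hypothetical worker that defers SIGTERM until the current request is done."""

    def __init__(self):
        self.shutdown_requested = False
        self.processed = []

    def handle_sigterm(self, signum, frame):
        # Store the signal instead of exiting immediately.
        self.shutdown_requested = True

    def install(self):
        # Must be called from the main thread of the worker process.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def process(self, request):
        # Placeholder for the actual prefill work.
        self.processed.append(request)

    def run(self, queue):
        # The flag is checked only between requests, so a request that has
        # already been pulled from the queue always runs to completion.
        while queue and not self.shutdown_requested:
            request = queue.pop(0)
            self.process(request)
        return self.processed
```

In a real worker process, `install()` would be called once at startup. Because the handler only sets a flag, a request that is already being processed is never interrupted.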

There are two additional rules set by the planner to prevent over-compensation:

1. After a new decode worker is added, it needs time to populate the KV cache, so the planner does not scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals.

2. The planner does not scale up prefill workers if, following the trend observed in the current adjustment interval, the prefill queue size is estimated to drop below `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals.
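
The second rule can be illustrated with a small sketch. It assumes the trend is extrapolated linearly from the queue size at the start and end of the current adjustment interval. The document does not specify the estimation method, so both the extrapolation and the helper name `should_scale_up_prefill` are assumptions.

```python
NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD = 3  # value quoted in the rule above


def should_scale_up_prefill(queue_start, queue_end, scale_up_threshold,
                            buffer_period=NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD):
    """Return True if a prefill worker should be added.

    `queue_start` and `queue_end` are the queue sizes observed at the start
    and end of the current adjustment interval. Linear extrapolation of the
    trend is an assumption made for this sketch.
    """
    if queue_end <= scale_up_threshold:
        return False  # already at or below the threshold
    trend = queue_end - queue_start  # change over one adjustment interval
    predicted = queue_end + trend * buffer_period
    # Scale up only if the queue is still expected to exceed the threshold.
    return predicted > scale_up_threshold


# Queue is above the threshold but draining quickly: no scale-up.
print(should_scale_up_prefill(1.2, 0.8, scale_up_threshold=0.5))
# Queue is above the threshold and growing: scale up.
print(should_scale_up_prefill(0.6, 0.8, scale_up_threshold=0.5))
```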

For benchmarking recommendations, see the [Planner benchmark example](../../docs/guides/planner_benchmark/benchmark_planner.md).


## Comply with SLA

To help `dynamo serve` comply with the SLA, we provide a pre-deployment script that profiles model performance under different parallelization mappings and recommends a parallelization mapping for prefill and decode workers, along with planner configurations.
To use this script, provide the target ISL (input sequence length), OSL (output sequence length), TTFT (time to first token) SLA, and ITL (inter-token latency) SLA.

> [!Note]
> The script considers a fixed ISL/OSL without KV cache reuse.
> If the real ISL/OSL has a large variance or a significant amount of KV cache can be reused, the result might be inaccurate.
>
> We assume there are no piggybacked prefill requests in the decode engine.
> A few short piggybacked prefill requests in the decode engine should not affect the ITL in most cases.
> However, if there are too many piggybacked prefill requests, the estimated ITL might be inaccurate.

```bash
python -m utils.profile_sla \
  --config <path-to-dynamo-config-file> \
  --output-dir <path-to-profile-results-dir> \
  --isl <target-isl> \
  --osl <target-osl> \
  --ttft <target-ttft-(ms)> \
  --itl <target-itl-(ms)>
```

The script first detects the number of available GPUs on the current node (multi-node engines are not supported yet).
Then, it profiles the prefill and decode performance with different TP sizes.
For prefill, since there is no in-flight batching (assuming the ISL is long enough to saturate the GPU), the script directly measures the TTFT for a request with the given ISL without KV reuse.
For decode, since the ITL (or iteration time) depends on how many requests are in flight, the script measures the ITL under different numbers of in-flight requests.
The number of in-flight requests ranges from 1 to the maximum number of requests that the engine's KV cache can hold.
To measure the ITL without interference from piggybacked prefill requests, the script enables KV reuse and warms up the engine by issuing the same prompts before measuring the ITL.
Because the KV cache is large enough for all the requests, it can hold the KV cache of the pre-computed prompts and skip the prefill phase when measuring the ITL.

After the profiling finishes, two plots are generated in the `output-dir`.
For example, here are the profiling results for `examples/llm/configs/disagg.yaml`:

![Prefill Performance](../images/h100_prefill_performance.png)
![Decode Performance](../images/h100_decode_performance.png)

For the prefill performance, the script plots the TTFT for different TP sizes and selects the TP size that meets the target TTFT SLA and delivers the best throughput per GPU.
Based on how close the TTFT of the selected TP size is to the SLA, the script also recommends the upper and lower bounds of the prefill queue size to be used in the planner.

For the decode performance, the script plots the ITL for different TP sizes and different numbers of in-flight requests.
Similarly, it selects the point that satisfies the ITL SLA and delivers the best throughput per GPU, and recommends the upper and lower bounds of the KV cache utilization rate to be used in the planner.
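
The selection step can be sketched as a filter followed by a maximum. The sketch below covers only the prefill case. `pick_best_tp` is a hypothetical helper, and the numbers in the example are made up for illustration, not profiling results.

```python
def pick_best_tp(results, ttft_sla_ms):
    """Pick the TP size that meets the TTFT SLA with the best throughput per GPU.

    `results` maps TP size -> (ttft_ms, throughput_per_gpu).
    Returns None if no TP size meets the SLA.
    """
    # Keep only the TP sizes whose TTFT is within the SLA.
    feasible = {tp: r for tp, r in results.items() if r[0] <= ttft_sla_ms}
    if not feasible:
        return None
    # Among the feasible ones, maximize throughput per GPU.
    return max(feasible, key=lambda tp: feasible[tp][1])


# Made-up numbers for illustration only.
profile = {
    1: (180.0, 9000.0),
    2: (95.0, 12000.0),
    4: (48.0, 15000.0),
    8: (30.0, 11000.0),
}
print(pick_best_tp(profile, ttft_sla_ms=100.0))
```

The decode case would apply the same idea over pairs of TP size and number of in-flight requests, with the ITL SLA as the filter.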

The following information is printed in the terminal:

```text
2025-05-16 15:20:24 - __main__ - INFO - Analyzing results and generate recommendations...
2025-05-16 15:20:24 - __main__ - INFO - Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for prefill queue size: 0.24/0.10
2025-05-16 15:20:24 - __main__ - INFO - Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for decode kv cache utilization: 0.20/0.10
```

After finding the best TP size for prefill and decode, the script interpolates TTFT as a function of ISL, and ITL as a function of active KV cache and decode context length.
This provides a more accurate performance estimate when the ISL and OSL change.
The results are saved to `<output_dir>/<decode/prefill>_tp<best_tp>_interpolation`.
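
To illustrate what interpolating TTFT over ISL means, here is a minimal sketch using piecewise-linear interpolation over a few profiled points. The interpolation method the script actually uses is not stated here, so treat the method, the helper name `interpolate`, and the sample numbers as assumptions. The ITL case depends on two variables and is not covered.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation, clamped to the profiled range.

    `xs` must be sorted in increasing order and have the same length as `ys`.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # index of the first profiled point >= x
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Made-up profiling points: ISL in tokens, TTFT in milliseconds.
isl_points = [1000, 2000, 4000]
ttft_points = [20.0, 40.0, 100.0]
print(interpolate(isl_points, ttft_points, 3000))
```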

## Usage

`dynamo serve` automatically starts the planner.
Configure it through YAML files or command-line arguments:

Usage:
```bash
# YAML configuration
dynamo serve graphs.disagg:Frontend -f disagg.yaml

# disagg.yaml
Planner:
  environment: local
  no-operation: false
  log-dir: log/planner

# Configure the planner through CLI arguments
dynamo serve graphs.disagg:Frontend \
  -f disagg.yaml \
  --Planner.environment=local \
  --Planner.no-operation=false \
  --Planner.log-dir=log/planner
```

The planner accepts the following options:

- `namespace` (str, default: "dynamo"):
   Namespace that the planner monitors

- `environment` (str, default: "local"):
   Environment to run the planner in (local, kubernetes)

- `no-operation` (bool, default: false):
   Do not make any adjustments; only observe the metrics and log them to TensorBoard

- `log-dir` (str, default: None):
   Tensorboard logging directory

- `adjustment-interval` (int, default: 30):
   Interval in seconds between scaling adjustments

- `metric-pulling-interval` (int, default: 1):
   Interval in seconds between metric pulls

- `max-gpu-budget` (int, default: 8):
   Maximum number of GPUs to use. The planner will not scale up beyond this number of GPUs for prefill plus decode workers

- `min-gpu-budget` (int, default: 1):
   Minimum number of GPUs to use. The planner will not scale down below this number of GPUs for prefill or decode workers

- `decode-kv-scale-up-threshold` (float, default: 0.9):
   KV cache utilization threshold to scale up decode workers

- `decode-kv-scale-down-threshold` (float, default: 0.5):
   KV cache utilization threshold to scale down decode workers

- `prefill-queue-scale-up-threshold` (float, default: 0.5):
   Queue utilization threshold to scale up prefill workers

- `prefill-queue-scale-down-threshold` (float, default: 0.2):
   Queue utilization threshold to scale down prefill workers

- `decode-engine-num-gpu` (int, default: 1):
   Number of GPUs per decode engine

- `prefill-engine-num-gpu` (int, default: 1):
   Number of GPUs per prefill engine
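
As an illustration of how `max-gpu-budget` interacts with `prefill-engine-num-gpu` and `decode-engine-num-gpu`, the following sketch checks whether a proposed worker count fits in the budget. `within_gpu_budget` is a hypothetical helper written for this example. It assumes the budget is compared against the total number of GPUs across prefill and decode workers, as the option description suggests.

```python
def within_gpu_budget(num_prefill, num_decode,
                      prefill_engine_num_gpu=1, decode_engine_num_gpu=1,
                      max_gpu_budget=8):
    """Return True if the proposed worker counts fit in the GPU budget."""
    total_gpus = (num_prefill * prefill_engine_num_gpu
                  + num_decode * decode_engine_num_gpu)
    return total_gpus <= max_gpu_budget


# Two prefill engines with 2 GPUs each plus one decode engine with 4 GPUs
# use 8 GPUs in total, which fits the default budget of 8.
print(within_gpu_budget(2, 1, prefill_engine_num_gpu=2, decode_engine_num_gpu=4))
# Adding a third prefill engine would need 10 GPUs and exceed the budget.
print(within_gpu_budget(3, 1, prefill_engine_num_gpu=2, decode_engine_num_gpu=4))
```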

Run the planner as a standalone process:
```bash
PYTHONPATH=/workspace/examples/llm python components/planner.py \
  --namespace=dynamo \
  --served-model-name=vllm \
  --no-operation \
  --log-dir=log/planner
```

### TensorBoard

The planner logs metrics and scaling actions to TensorBoard for visualization.
Start TensorBoard with the following command:
```bash
tensorboard --logdir=<path-to-tensorboard-log-dir>
```

## Backends

The planner supports local and Kubernetes backends for worker management.

### Local Backend

The local backend uses Circus to control worker processes. A watcher tracks each `serve_dynamo.py` process.
The planner adds or removes watchers to scale workers.

Note: Circus's `increment` feature doesn't support GPU scheduling variables, so we create separate watchers per process.

#### State Management

The planner maintains state in a JSON file at `~/.dynamo/state/{namespace}.json`. This file:

- Tracks worker names as `{namespace}_{component_name}`.
- Records GPU allocations from the allocator.
- Updates after each planner action.
- Cleans up automatically when the arbiter exits.

Example state file evolution:
```none
# Initial decode worker
{
  "dynamo_VllmWorker": {..., resources={...}}
}

# After adding worker
{
  "dynamo_VllmWorker": {..., resources={...}},
  "dynamo_VllmWorker_1": {..., resources={...}}
}

# After removing worker
{
  "dynamo_VllmWorker": {..., resources={...}}
}

# After removing last worker
{
  "dynamo_VllmWorker": {...}
}
```
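
The evolution above can be reproduced with a small sketch that keeps the state in a dictionary and writes it out as JSON. The helper names are hypothetical, and the `{"gpus": [...]}` resource format is an assumption, since the real entry contents are elided in the example.

```python
import json
import tempfile
from pathlib import Path


def add_worker(state, base_name, resources):
    """Add an entry, suffixing _1, _2, ... while the name is in use."""
    name, i = base_name, 0
    while name in state and "resources" in state[name]:
        i += 1
        name = f"{base_name}_{i}"
    state.setdefault(name, {})["resources"] = resources
    return name


def remove_worker(state, base_name):
    """Remove the highest-numbered replica, or clear the base entry's resources."""
    prefix = base_name + "_"
    replicas = sorted(
        (k for k in state if k.startswith(prefix) and k[len(prefix):].isdigit()),
        key=lambda k: int(k[len(prefix):]),
    )
    if replicas:
        del state[replicas[-1]]
    elif base_name in state:
        # Last worker: keep the entry but drop its resource allocation.
        state[base_name].pop("resources", None)


def save_state(state, path):
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2))


state = {}
add_worker(state, "dynamo_VllmWorker", {"gpus": [0]})  # hypothetical resource format
add_worker(state, "dynamo_VllmWorker", {"gpus": [1]})
remove_worker(state, "dynamo_VllmWorker")
with tempfile.TemporaryDirectory() as tmp:
    # A temporary directory stands in for ~/.dynamo/state in this sketch.
    save_state(state, Path(tmp) / "state" / "dynamo.json")
print(sorted(state))
```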

> [!Note]
> Start with one replica per worker.
> Multiple initial replicas currently share a single watcher.

### Kubernetes Backend

The Kubernetes backend scales workers by updating DynamoGraphDeployment replica counts.
When scaling needs change, the planner:

1. Updates the deployment's replica count.
2. Lets the Kubernetes operator create or remove pods.
3. Maintains seamless scaling without manual intervention.