<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Planner

The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.
Currently, the planner can scale the number of vLLM workers up and down based on the KV cache load and prefill queue size. The table below summarizes current support:

|                     |   | Feature                                                             |
| :------------------ | - | :------------------------------------------------------------------ |
| **Backend**         | ✅ | Local                                                               |
|                     | ✅ | Kubernetes                                                          |
| **LLM Framework**   | ✅ | vLLM                                                                |
|                     | ❌ | TensorRT-LLM                                                        |
|                     | ❌ | SGLang                                                              |
|                     | ❌ | llama.cpp                                                           |
| **Serving Type**    | ✅ | Aggregated                                                          |
|                     | ✅ | Disaggregated                                                       |
| **Planner Actions** | ✅ | Load-based scaling up/down prefill/decode workers                   |
|                     | ✅ | SLA-based scaling up/down prefill/decode workers **<sup>[1]</sup>** |
|                     | ✅ | Adjusting engine knobs                                              |

**<sup>[1]</sup>** Supported with some limitations.

We currently provide two reference planner designs:
1. Load-based planner: [Load-based planner docs](load_planner.md)
2. SLA-based planner: [SLA-based planner docs](sla_planner.md)
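
As a rough illustration of the load-based idea, the sketch below maps the two signals mentioned at the top of this page (KV cache load and prefill queue size) to scaling decisions. It is not the planner's actual logic: the `Metrics` class, the `decide` function, and every threshold value are hypothetical and chosen only for the example. See the linked planner docs for the real behavior.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical snapshot of the two signals the planner watches."""
    kv_cache_load: float      # fraction of KV cache in use, 0.0 to 1.0
    prefill_queue_size: int   # number of requests waiting for prefill


def decide(
    metrics: Metrics,
    *,
    kv_up: float = 0.9,     # assumed threshold: scale decode up above this load
    kv_down: float = 0.5,   # assumed threshold: scale decode down below this load
    queue_up: int = 10,     # assumed threshold: scale prefill up above this size
    queue_down: int = 2,    # assumed threshold: scale prefill down below this size
) -> dict:
    """Return a replica delta (+1, 0, or -1) for decode and prefill workers."""
    if metrics.kv_cache_load > kv_up:
        decode = 1
    elif metrics.kv_cache_load < kv_down:
        decode = -1
    else:
        decode = 0

    if metrics.prefill_queue_size > queue_up:
        prefill = 1
    elif metrics.prefill_queue_size < queue_down:
        prefill = -1
    else:
        prefill = 0

    return {"decode": decode, "prefill": prefill}
```

Using separate up and down thresholds leaves a dead band between them, which keeps a metric hovering near one value from triggering repeated scale up/down cycles.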

## Backends

The planner supports local and Kubernetes backends for worker management.

### Local Backend

The local backend uses Circus to control worker processes. A Watcher tracks each `serve_dynamo.py` process.
The planner adds or removes watchers to scale workers.

Note: Circus's `increment` feature doesn't support GPU scheduling variables, so we create separate watchers per process.

#### State Management

The planner maintains state in a JSON file at `~/.dynamo/state/{namespace}.json`. This file:

- Tracks worker names as `{namespace}_{component_name}`.
- Records GPU allocations from the allocator.
- Updates after each planner action.
- Cleans up automatically when the arbiter exits.

Example state file evolution:

```none
# Initial decode worker
{
  "dynamo_VllmWorker": {..., resources={...}}
}

# After adding worker
{
  "dynamo_VllmWorker": {..., resources={...}},
  "dynamo_VllmWorker_1": {..., resources={...}}
}

# After removing worker
{
  "dynamo_VllmWorker": {..., resources={...}}
}

# After removing last worker
{
  "dynamo_VllmWorker": {...}
}
```
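
The evolution above can be mimicked with a small Python sketch. It is an illustration only, not the planner's implementation: the helper names (`worker_key`, `add_worker`, `remove_worker`, `save_state`) and the `gpus` field inside `resources` are hypothetical. The sketch assumes, based on the example, that extra replicas get a numeric suffix and that the base entry is kept without `resources` once the last worker is removed.

```python
import json
import tempfile
from pathlib import Path


def worker_key(namespace: str, component: str, index: int = 0) -> str:
    """Build a state key; replicas after the first get a numeric suffix."""
    base = f"{namespace}_{component}"
    return base if index == 0 else f"{base}_{index}"


def add_worker(state: dict, namespace: str, component: str, resources: dict) -> str:
    """Record a worker under the first free key and return that key."""
    index = 0
    while True:
        key = worker_key(namespace, component, index)
        entry = state.get(key)
        # A missing entry, or a base entry without resources, is free to use.
        if entry is None or "resources" not in entry:
            break
        index += 1
    state.setdefault(key, {})["resources"] = resources
    return key


def remove_worker(state: dict, namespace: str, component: str) -> str:
    """Remove the highest-numbered worker and return its key."""
    index = 0
    while worker_key(namespace, component, index + 1) in state:
        index += 1
    key = worker_key(namespace, component, index)
    if index == 0:
        # Last worker: keep the entry but drop its resource allocation.
        state.get(key, {}).pop("resources", None)
    else:
        del state[key]
    return key


def save_state(state: dict, path: Path) -> None:
    """Write the state to disk as JSON, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2))


if __name__ == "__main__":
    state: dict = {}
    add_worker(state, "dynamo", "VllmWorker", {"gpus": [0]})
    add_worker(state, "dynamo", "VllmWorker", {"gpus": [1]})
    remove_worker(state, "dynamo", "VllmWorker")
    # Write to a temporary directory rather than the real ~/.dynamo/state path.
    with tempfile.TemporaryDirectory() as tmp:
        save_state(state, Path(tmp) / "dynamo.json")
```

Calling `save_state` after every `add_worker` or `remove_worker` would correspond to the "updates after each planner action" behavior listed earlier.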

> [!Note]
> Start with one replica per worker.
> Multiple initial replicas currently share a single watcher.

### Kubernetes Backend

The Kubernetes backend scales workers by updating DynamoGraphDeployment replica counts.
When scaling needs change, the planner:

1. Updates the deployment's replica count
2. Lets the Kubernetes operator create/remove pods
3. Maintains seamless scaling without manual intervention
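
The first step above can be sketched as building a patch body that changes only one replica count. This is a minimal sketch under stated assumptions: the field path `spec.services.<service>.replicas`, the helper names, and the replica bounds are hypothetical and not confirmed by this document. Check the DynamoGraphDeployment schema before relying on them.

```python
import json


def next_replicas(current: int, delta: int, min_replicas: int = 0, max_replicas: int = 8) -> int:
    """Apply a scaling delta and clamp the result to assumed bounds."""
    return max(min_replicas, min(max_replicas, current + delta))


def build_replica_patch(service_name: str, replicas: int) -> dict:
    """Build a merge-patch style body that touches a single replica field.

    The nesting under spec/services is an assumption made for illustration.
    """
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return {"spec": {"services": {service_name: {"replicas": replicas}}}}


if __name__ == "__main__":
    target = next_replicas(current=2, delta=1)
    patch = build_replica_patch("VllmWorker", target)
    # The serialized body could be sent to the cluster with a Kubernetes client.
    print(json.dumps(patch, indent=2))
```

Patching only the replica field leaves the rest of the deployment spec untouched, so the operator can reconcile pods from the new count alone.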