Commit c9a60278, authored by Hongkuan Zhou, committed by GitHub

fix: include load and sla planner doc (#1803)

parent b1509ea7
......@@ -55,7 +55,7 @@ The following diagram outlines Dynamo's high-level architecture. To enable large
- [Dynamo Disaggregated Serving](disagg_serving.md)
- [Dynamo Smart Router](kv_cache_routing.md)
- [Dynamo KV Cache Block Manager](kvbm_intro.rst)
- [Planner](planner.md)
- [Planner](planner_intro.rst)
- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Planner
The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.
Currently, the planner can scale the number of vLLM workers up and down based on the KV cache load and the prefill queue size:
| | | Feature |
| :------------------ | - | :------------------------------------------------------------------ |
| **Backend** | ✅ | Local |
| | ✅ | Kubernetes |
| **LLM Framework** | ✅ | vLLM |
| | ❌ | TensorRT-LLM |
| | ❌ | SGLang |
| | ❌ | llama.cpp |
| **Serving Type** | ✅ | Aggregated |
| | ✅ | Disaggregated |
| **Planner Actions** | ✅ | Load-based scaling up/down prefill/decode workers |
| | ✅ | SLA-based scaling up/down prefill/decode workers **<sup>[1]</sup>** |
| | ✅ | Adjusting engine knobs |
**<sup>[1]</sup>** Supported with some limitations.
We currently provide two reference planner designs:
1. Load-based planner: [Load-based planner docs](load_planner.md)
2. SLA-based planner: [SLA-based planner docs](sla_planner.md)
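As a rough illustration of load-based scaling, the sketch below turns the two signals named above (KV cache load and prefill queue size) into a scaling step per worker type. The threshold values, the `Thresholds` and `decide` names, and the mapping of prefill queue size to prefill workers and KV cache load to decode workers are assumptions made for this example, not the planner's actual implementation; see the load-based planner docs for the real behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All values are hypothetical and chosen only for illustration.
    kv_scale_up: float = 0.9     # KV cache load above this: add a decode worker
    kv_scale_down: float = 0.5   # KV cache load below this: remove a decode worker
    queue_scale_up: int = 5      # prefill queue longer than this: add a prefill worker
    queue_scale_down: int = 1    # prefill queue shorter than this: remove a prefill worker


def _step(value, up, down):
    """Return +1 to scale up, -1 to scale down, 0 to hold."""
    if value > up:
        return 1
    if value < down:
        return -1
    return 0


def decide(kv_load: float, prefill_queue: int, t: Thresholds = Thresholds()) -> dict:
    """Map the two observed metrics to a per-worker-type scaling step."""
    return {
        "prefill": _step(prefill_queue, t.queue_scale_up, t.queue_scale_down),
        "decode": _step(kv_load, t.kv_scale_up, t.kv_scale_down),
    }


if __name__ == "__main__":
    print(decide(kv_load=0.95, prefill_queue=0))
```

Using a gap between the up and down thresholds keeps the sketch from flapping when a metric hovers near a single cutoff.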
## Backends
The planner supports local and Kubernetes backends for worker management.
### Local Backend
The local backend uses Circus to control worker processes. A Watcher tracks each `serve_dynamo.py` process.
The planner adds or removes watchers to scale workers.
Note: Circus's `increment` feature doesn't support GPU scheduling variables, so we create separate watchers per process.
#### State Management
The planner maintains state in a JSON file at `~/.dynamo/state/{namespace}.json`. This file:
- Tracks worker names as `{namespace}_{component_name}`.
- Records GPU allocations from the allocator.
- Updates after each planner action.
- Cleans up automatically when the arbiter exits.
Example state file evolution:
```none
# Initial decode worker
{
"dynamo_VllmWorker": {..., resources={...}}
}
# After adding worker
{
"dynamo_VllmWorker": {..., resources={...}},
"dynamo_VllmWorker_1": {..., resources={...}}
}
# After removing worker
{
"dynamo_VllmWorker": {..., resources={...}}
}
# After removing last worker
{
"dynamo_VllmWorker": {...}
}
```
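The state file evolution above can be mimicked with a small in-memory sketch. The helper names (`state_path`, `add_worker`, `remove_worker`) and the contents of `resources` are hypothetical; only the file location, the `{namespace}_{component_name}` key pattern, and the `_1` suffix come from the text and example above.

```python
from pathlib import Path


def state_path(namespace: str, home: Path) -> Path:
    """Location of the planner state file for a namespace."""
    return home / ".dynamo" / "state" / f"{namespace}.json"


def add_worker(state: dict, namespace: str, component: str, resources: dict) -> str:
    """Add an entry under the next free key: base name first, then _1, _2, ..."""
    base = f"{namespace}_{component}"
    name = base
    suffix = 0
    while name in state:
        suffix += 1
        name = f"{base}_{suffix}"
    state[name] = {"resources": resources}
    return name


def remove_worker(state: dict, namespace: str, component: str) -> None:
    """Remove the highest-numbered replica; for the last worker, keep the
    base entry but drop its resources, as in the example above."""
    base = f"{namespace}_{component}"
    prefix = base + "_"
    suffixes = [
        int(key[len(prefix):])
        for key in state
        if key.startswith(prefix) and key[len(prefix):].isdigit()
    ]
    if suffixes:
        del state[f"{prefix}{max(suffixes)}"]
    elif base in state:
        state[base].pop("resources", None)


if __name__ == "__main__":
    state = {}
    add_worker(state, "dynamo", "VllmWorker", {"gpu": [0]})  # hypothetical resources
    add_worker(state, "dynamo", "VllmWorker", {"gpu": [1]})
    remove_worker(state, "dynamo", "VllmWorker")
    print(state)
```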
> [!Note]
> Start with one replica per worker.
> Multiple initial replicas currently share a single watcher.
### Kubernetes Backend
The Kubernetes backend scales workers by updating DynamoGraphDeployment replica counts.
When scaling needs change, the planner:
1. Updates the deployment's replica count
2. Lets the Kubernetes operator create/remove pods
3. Maintains seamless scaling without manual intervention
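Step 1 amounts to sending a small patch to the DynamoGraphDeployment resource. The sketch below only builds such a patch body; the `spec.services.<name>.replicas` field path and the `replica_patch` helper are assumptions for illustration and should be checked against the DynamoGraphDeployment CRD before use.

```python
import json


def replica_patch(service_name: str, replicas: int) -> dict:
    """Build a merge-patch body that sets the replica count of one service.

    The field layout is an assumed shape, not confirmed by this document.
    """
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return {"spec": {"services": {service_name: {"replicas": replicas}}}}


if __name__ == "__main__":
    # The planner would submit a body like this; the operator then
    # creates or removes pods to match the new count.
    print(json.dumps(replica_patch("VllmWorker", 3), indent=2))
```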
..
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Planner
=======
The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.
Currently, the planner can scale the number of vLLM workers up and down based on the KV cache load and the prefill queue size.
Key features include:
* **Load-based scaling** that monitors KV cache utilization and prefill queue size to make scaling decisions
* **SLA-based scaling** that uses predictive modeling and performance interpolation to proactively meet TTFT and ITL targets
* **Multi-backend support** for both local (Circus) and Kubernetes environments
* **Graceful scaling** that ensures no requests are dropped during scale-down operations
.. list-table::
:widths: 20 5 75
:header-rows: 1
* -
-
- Feature
* - **Backend**
- ✅
- Local
* -
- ✅
- Kubernetes
* - **LLM Framework**
- ✅
- vLLM
* -
- ❌
- TensorRT-LLM
* -
- ❌
- SGLang
* -
- ❌
- llama.cpp
* - **Serving Type**
- ✅
- Aggregated
* -
- ✅
- Disaggregated
* - **Planner Actions**
- ✅
- Load-based scaling up/down prefill/decode workers
* -
- ✅
- SLA-based scaling up/down prefill/decode workers [1]_
* -
- ❌
- Adjusting engine knobs
.. [1] Supported with some limitations.
.. toctree::
:hidden:
Load-based Planner <load_planner.md>
SLA-based Planner <sla_planner.md>
\ No newline at end of file
......@@ -147,7 +147,7 @@ This figure shows an overview of the major components to deploy:
```
```{note}
The planner component is enabled by default for all deployment architectures but is set to no-op mode. This means the planner observes metrics but doesn't take scaling actions. To enable active scaling, you can add `--Planner.no-operation=false` to your `dynamo serve` command. For more details, see [Planner](../architecture/planner.md).
The planner component is enabled by default for all deployment architectures but is set to no-op mode. This means the planner observes metrics but doesn't take scaling actions. To enable active scaling, you can add `--Planner.no-operation=false` to your `dynamo serve` command. For more details, see [Planner](../architecture/planner_intro.rst).
```
### Example architectures
......
......@@ -79,7 +79,7 @@ Dive in: Examples
Disaggregated Serving <architecture/disagg_serving.md>
KV Block Manager <architecture/kvbm_intro.rst>
KV Cache Routing <architecture/kv_cache_routing.md>
Planner <architecture/planner.md>
Planner <architecture/planner_intro.rst>
.. toctree::
:hidden:
......