fix: include load and sla planner doc (#1803)

Unverified Commit c9a60278 authored Jul 07, 2025 by Hongkuan Zhou Committed by GitHub Jul 07, 2025
5 changed files
--- a/docs/architecture/architecture.md
+++ b/docs/architecture/architecture.md
@@ -55,7 +55,7 @@ The following diagram outlines Dynamo's high-level architecture. To enable large
 - [Dynamo Disaggregated Serving](disagg_serving.md)
 - [Dynamo Smart Router](kv_cache_routing.md)
 - [Dynamo KV Cache Block Manager](kvbm_intro.rst)
-- [Planner](planner.md)
+- [Planner](planner_intro.rst)
 - [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
 Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.

--- a/docs/architecture/planner.md
+++ b/docs/architecture/planner.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-https://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
--->
-# Planner
-The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.
-Currently, the planner can scale the number of vllm workers up and down based on the kv cache load and prefill queue size:
-|                     |   | Feature                                                             |
-| :------------------ | - | :------------------------------------------------------------------ |
-| **Backend**         | ✅ | Local                                                               |
-|                     | ✅ | Kubernetes                                                          |
-| **LLM Framework**   | ✅ | vLLM                                                                |
-|                     | ❌ | TensorRT-LLM                                                        |
-|                     | ❌ | SGLang                                                              |
-|                     | ❌ | llama.cpp                                                           |
-| **Serving Type**    | ✅ | Aggregated                                                          |
-|                     | ✅ | Disaggregated                                                       |
-| **Planner Actions** | ✅ | Load-based scaling up/down prefill/decode workers                   |
-|                     | ✅ | SLA-based scaling up/down prefill/decode workers **<sup>[1]</sup>** |
-|                     | ✅ | Adjusting engine knobs                                              |
-**<sup>[1]</sup>** Supported with some limitations.
-We currently provide two reference planner designs:
-1. Load-based planner: [Load-based planner docs](load_planner.md)
-2. SLA-based planner: [SLA-based planner docs](sla_planner.md)
-## Backends
-The planner supports local and kubernetes backends for worker management.
-### Local Backend
-The local backend uses Circus to control worker processes. A Watcher tracks each `serve_dynamo.py` process.
-The planner adds or removes watchers to scale workers.
-Note: Circus's `increment` feature doesn't support GPU scheduling variables, so we create separate watchers per process.
-#### State Management
-The planner maintains state in a JSON file at `~/.dynamo/state/{namespace}.json`. This file:
-- Tracks worker names as `{namespace}_{component_name}`.
-- Records GPU allocations from the allocator.
-- Updates after each planner action.
-- Cleans up automatically when the arbiter exits.
-Example state file evolution:
-```none
-# Initial decode worker
-{
-  "dynamo_VllmWorker": {..., resources={...}}
-}
-# After adding worker
-{
-  "dynamo_VllmWorker": {..., resources={...}},
-  "dynamo_VllmWorker_1": {..., resources={...}}
-}
-# After removing worker
-{
-  "dynamo_VllmWorker": {..., resources={...}}
-}
-# After removing last worker
-{
-  "dynamo_VllmWorker": {...}
-}
-```
-> [!Note]
-> Start with one replica per worker.
-> Multiple initial replicas currently share a single watcher.
-### Kubernetes Backend
-The Kubernetes backend scales workers by updating DynamoGraphDeployment replica counts.
-When scaling needs change, the planner:
-1. Updates the deployment's replica count
-2. Lets the Kubernetes operator create/remove pods
-3. Maintains seamless scaling without manual intervention
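
The State Management section removed from `planner.md` above describes a per-namespace JSON file whose entries are added and dropped as workers scale. The bookkeeping can be sketched roughly as follows. This is a minimal illustration, not the planner's actual code: the function names, the `root` argument, and the shape of the `resources` value are assumptions made for the example. Only the `{namespace}_{component_name}` naming, the `_1` suffix, and the retained base entry come from the removed text.

```python
import json
from pathlib import Path
from typing import Optional


def state_file(namespace: str, root: Path) -> Path:
    """Path of the per-namespace state file: <root>/<namespace>.json."""
    return root / f"{namespace}.json"


def add_worker(state: dict, namespace: str, component: str, resources: dict) -> str:
    """Record a new worker under the first free name and return that name."""
    base = f"{namespace}_{component}"
    name, suffix = base, 0
    # Skip names that already hold resources; reuse an entry left without them.
    while state.get(name, {}).get("resources"):
        suffix += 1
        name = f"{base}_{suffix}"
    state.setdefault(name, {})["resources"] = resources
    return name


def remove_worker(state: dict, namespace: str, component: str) -> Optional[str]:
    """Drop the highest-numbered replica, or clear the base entry's resources."""
    base = f"{namespace}_{component}"
    prefix = base + "_"
    replicas = [n for n in state if n.startswith(prefix) and n[len(prefix):].isdigit()]
    if replicas:
        name = max(replicas, key=lambda n: int(n[len(prefix):]))
        del state[name]
        return name
    if base in state:
        # The base entry stays in the file, but without resources.
        state[base].pop("resources", None)
        return base
    return None


def save_state(state: dict, namespace: str, root: Path) -> Path:
    """Write the state to disk (hypothetical helper; called after each action)."""
    path = state_file(namespace, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2))
    return path
```

Adding two `VllmWorker` entries in the `dynamo` namespace and then removing both walks through the same four states as the removed example: one entry, two entries, one entry, and finally the base entry with no resources.
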
--- a/docs/architecture/planner_intro.rst
+++ b/docs/architecture/planner_intro.rst
+..
+    SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    SPDX-License-Identifier: Apache-2.0
+    Licensed under the Apache License, Version 2.0 (the "License");
+    you may not use this file except in compliance with the License.
+    You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+Planner
+=======
+The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.
+Currently, the planner can scale the number of vllm workers up and down based on the kv cache load and prefill queue size:
+Key features include:
+* **Load-based scaling** that monitors KV cache utilization and prefill queue size to make scaling decisions
+* **SLA-based scaling** that uses predictive modeling and performance interpolation to proactively meet TTFT and ITL targets
+* **Multi-backend support** for both local (Circus) and Kubernetes environments
+* **Graceful scaling** that ensures no requests are dropped during scale-down operations
+.. list-table::
+   :widths: 20 5 75
+   :header-rows: 1
+   * -
+     -
+     - Feature
+   * - **Backend**
+     - ✅
+     - Local
+   * -
+     - ✅
+     - Kubernetes
+   * - **LLM Framework**
+     - ✅
+     - vLLM
+   * -
+     - ❌
+     - TensorRT-LLM
+   * -
+     - ❌
+     - SGLang
+   * -
+     - ❌
+     - llama.cpp
+   * - **Serving Type**
+     - ✅
+     - Aggregated
+   * -
+     - ✅
+     - Disaggregated
+   * - **Planner Actions**
+     - ✅
+     - Load-based scaling up/down prefill/decode workers
+   * -
+     - ✅
+     - SLA-based scaling up/down prefill/decode workers [1]_
+   * -
+     - ❌
+     - Adjusting engine knobs
+.. [1] Supported with some limitations.
+.. toctree::
+   :hidden:
+   Load-based Planner <load_planner.md>
+   SLA-based Planner <sla_planner.md>
\ No newline at end of file
--- a/docs/examples/llm_deployment.md
+++ b/docs/examples/llm_deployment.md
@@ -147,7 +147,7 @@ This figure shows an overview of the major components to deploy:
 ```
 ```{note}
-The planner component is enabled by default for all deployment architectures but is set to no-op mode. This means the planner observes metrics but doesn't take scaling actions. To enable active scaling, you can add `--Planner.no-operation=false` to your `dynamo serve` command. For more details, see [PLanner](../architecture/planner.md).
+The planner component is enabled by default for all deployment architectures but is set to no-op mode. This means the planner observes metrics but doesn't take scaling actions. To enable active scaling, you can add `--Planner.no-operation=false` to your `dynamo serve` command. For more details, see [PLanner](../architecture/planner_intro.rst).
 ```
 ### Example architectures

--- a/docs/index.rst
+++ b/docs/index.rst
@@ -79,7 +79,7 @@ Dive in: Examples
   Disaggregated Serving <architecture/disagg_serving.md>
   KV Block Manager <architecture/kvbm_intro.rst>
   KV Cache Routing <architecture/kv_cache_routing.md>
-   Planner <architecture/planner.md>
+   Planner <architecture/planner_intro.rst>
 .. toctree::
   :hidden: