"vscode:/vscode.git/clone" did not exist on "f124b56786212f86caf7ce0a66e3d3c4c0720a62"
Unverified Commit 8d636ebd authored by Suman Tatiraju's avatar Suman Tatiraju Committed by GitHub
Browse files
parent 6d46288c
...@@ -15,18 +15,9 @@ See the License for the specific language governing permissions and ...@@ -15,18 +15,9 @@ See the License for the specific language governing permissions and
limitations under the License. limitations under the License.
--> -->
# Using `dynamo serve` to deploy inference graphs locally # Serving Inference Graphs (`dynamo serve`)
This guide explains how to create, configure, and deploy inference graphs for large language models using the `dynamo serve` command. This guide explains how to create, configure, and deploy inference graphs locally for large language models using the `dynamo serve` command.
## Table of Contents
- [What are inference graphs?](#what-are-inference-graphs)
- [Creating an inference graph](#creating-an-inference-graph)
- [Serving the inference graph](#deploying-the-inference-graph)
- [Guided Example](#guided-example)
## What are inference graphs?
Inference graphs are compositions of service components that work together to handle LLM inference. A typical graph might include: Inference graphs are compositions of service components that work together to handle LLM inference. A typical graph might include:
...@@ -37,9 +28,13 @@ Inference graphs are compositions of service components that work together to ha ...@@ -37,9 +28,13 @@ Inference graphs are compositions of service components that work together to ha
## Creating an inference graph ## Creating an inference graph
Once you've written your various Dynamo services (docs on how to write these can be found [here](../../deploy/sdk/docs/sdk/README.md)), you can create an inference graph by composing these services together using the following two mechanisms: Once you've written Dynamo services ([see the SDK](https://github.com/ai-dynamo/dynamo/blob/main/deploy/dynamo/sdk/docs/sdk/README.md)), create an inference graph by composing them together using the following mechanisms:
1. Dependencies with `depends()`
2. Dynamic composition with `.link()`
See the following sections for more details.
### 1. Dependencies with `depends()` ### Dependencies with `depends()`
```python ```python
from components.worker import VllmWorker from components.worker import VllmWorker
...@@ -58,7 +53,7 @@ Benefits of `depends()`: ...@@ -58,7 +53,7 @@ Benefits of `depends()`:
- Creates type-safe client connections between services - Creates type-safe client connections between services
- Allows calling dependent service methods directly - Allows calling dependent service methods directly
### 2. Dynamic composition with `.link()` ### Dynamic composition with `.link()`
```python ```python
# From examples/llm/graphs/agg.py # From examples/llm/graphs/agg.py
...@@ -82,15 +77,24 @@ The `.link()` method is useful for: ...@@ -82,15 +77,24 @@ The `.link()` method is useful for:
## Deploying the inference graph ## Deploying the inference graph
Once you've defined your inference graph and its configuration, you can deploy it locally using the `dynamo serve` command! We recommend running the `--dry-run` command so you can see what arguments will be pasesd into your final graph. And then Once you've defined your inference graph and its configuration, deploy it locally using the `dynamo serve` command. We recommend running the `--dry-run` command to see what arguments will be pasesd into your final graph.
Lets walk through an example. Consider the following example.
## Guided Example ### Guided Example
The files referenced here can be found [here](../../examples/llm/components/). You will need 1 GPU minimum to run this example. This example can be run from the `examples/llm` directory The files referenced in this example can be found [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components). You need 1 GPU minimum to run this example. This example can be run from the `examples/llm` directory.
### 1. Define your components This example walks through:
1. [Defining your components](#define-your-components)
2. [Defining your graph](#define-your-graph)
3. [Defining your configuration](#define-your-configuration)
4. [Serving your graph](#serve-your-graph)
See the following sections for details.
#### Define your components
In this example we'll be deploying an aggregated serving graph. Our components include: In this example we'll be deploying an aggregated serving graph. Our components include:
...@@ -125,9 +129,9 @@ class VllmWorker: ...@@ -125,9 +129,9 @@ class VllmWorker:
... ...
``` ```
Note that our prebuilt components have the maximal set of dependancies needed to run the component. This allows you to plug in different components to the same graph to create different architectures. When you write your own components, you can be as flexible as you'd like. Note that our prebuilt components have the maximal set of dependancies needed to run the component, which allows you to plug different components into the same graph to create different architectures. When writing your own components, you can be as flexible as you like.
### 2. Define your graph #### Define your graph
```python ```python
# graphs/agg.py # graphs/agg.py
...@@ -138,18 +142,17 @@ from components.worker import VllmWorker ...@@ -138,18 +142,17 @@ from components.worker import VllmWorker
Frontend.link(Processor).link(VllmWorker) Frontend.link(Processor).link(VllmWorker)
``` ```
### 3. Define your configuration #### Define your configuration
We've provided a set of basic configurations for this example [here](../../examples/llm/configs/agg.yaml). All of these can be changed and also be overridden by passing in CLI flags to serve! We provide [basic configurations](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/configs/agg.yaml) that you can change; you can also override them by passing in CLI flags to `dynamo serve`.
### 4. Serve your graph #### Serve your graph
As a prerequisite, ensure you have NATS and etcd running by running the docker compose in the deploy directory. You can find it [here](../../deploy/metrics/docker-compose.yml). Before serving your graph, ensure that NATS and etcd are running using the [docker compose file](https://github.com/ai-dynamo/dynamo/blob/main/deploy/metrics/docker-compose.yml) file in the deploy directory.
```bash ```bash
docker compose up -d docker compose up -d
``` ```
Note that the we point toward the first node in our graph. In this case, it's the `Frontend` service. Note that the we point toward the first node in our graph. In this case, it's the `Frontend` service.
```bash ```bash
...@@ -157,7 +160,7 @@ Note that the we point toward the first node in our graph. In this case, it's th ...@@ -157,7 +160,7 @@ Note that the we point toward the first node in our graph. In this case, it's th
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --dry-run dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --dry-run
``` ```
This will print out something like This returns output like:
```bash ```bash
Service Configuration: Service Configuration:
...@@ -200,7 +203,7 @@ You can override any of these configuration options by passing in CLI flags to s ...@@ -200,7 +203,7 @@ You can override any of these configuration options by passing in CLI flags to s
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --Processor.router=random --dry-run dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --Processor.router=random --dry-run
``` ```
Which will print out something like Which prints out output like:
```bash ```bash
#... #...
...@@ -237,8 +240,8 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" ...@@ -237,8 +240,8 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json"
## Close deployment ## Close deployment
> [!IMPORTANT] ```{important}
> We are aware of an issue where vLLM subprocesses might not be killed when `ctrl-c` is pressed. We are aware of an issue where vLLM subprocesses might not be killed when `ctrl-c` is pressed.
> We are working on addressing this. Relevant vLLM issues can be found [here](https://github.com/vllm-project/vllm/pull/8492) and [here](https://github.com/vllm-project/vllm/issues/6219#issuecomment-2439257824). We are working on addressing this. Relevant vLLM issues can be found [here](https://github.com/vllm-project/vllm/pull/8492) and [here](https://github.com/vllm-project/vllm/issues/6219#issuecomment-2439257824).
To stop the serve, you can press `ctrl-c` which will kill the different components. In order to kill the remaining vLLM subprocesses you can run `nvidia-smi` and `kill -9` the remaining processes or run `pkill python3` from inside of the container. To stop the serve, you can press `ctrl-c` which kills the components. In order to kill the remaining vLLM subprocesses you can run `nvidia-smi` and `kill -9` the remaining processes or run `pkill python3` from inside of the container.
...@@ -17,7 +17,7 @@ limitations under the License. ...@@ -17,7 +17,7 @@ limitations under the License.
# Planner # Planner
The planner is a component that monitors the state of the system and makes adjustments to workers to ensure that the system is running efficiently. Currently, planner can scale up and down the number of vllm workers based on the kv cache load and prefill queue size: The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently. Currently, the planner can scale the number of vllm workers up and down based on the kv cache load and prefill queue size:
* Backend: * Backend:
* local ✅ * local ✅
* kubernetes ✅ * kubernetes ✅
...@@ -40,12 +40,12 @@ To adjust the number of prefill/decode workers, planner monitors the following m ...@@ -40,12 +40,12 @@ To adjust the number of prefill/decode workers, planner monitors the following m
* Prefill worker: planner monitors the number of requests pending in the prefill queue to estimate the prefill workload. * Prefill worker: planner monitors the number of requests pending in the prefill queue to estimate the prefill workload.
* Decode/aggregated worker: planner monitors the average KV cache utilization rate to estimate the decode/aggregated workload. * Decode/aggregated worker: planner monitors the average KV cache utilization rate to estimate the decode/aggregated workload.
Every `metric-pulling-interval`, planner will gather the aforementioned metrics. Every `adjustment-interval`, planner compares the aggregated metrics in this interval with pre-set thresholds and decide to scale up/down prefill/decode workers. To avoid over-compensation, planner only changes the number of workers by 1 in one adjustment interval. In addition, when the number of workers is being adjusted, the planner will block the metric pulling and adjustment. Every `metric-pulling-interval`, planner gathers the aforementioned metrics. Every `adjustment-interval`, planner compares the aggregated metrics in this interval with pre-set thresholds and decide to scale up/down prefill/decode workers. To avoid over-compensation, planner only changes the number of workers by 1 in one adjustment interval. In addition, when the number of workers is being adjusted, the planner blocks the metric pulling and adjustment.
To scale up a prefill/decode worker, planner just need to launch the worker in the correct namespace. The auto-discovery mechanism will pick up the workers and add them to the routers. To scale down a prefill worker, planner send a SIGTERM signal to the prefill worker. The prefill worker store the signal and exit when it finishes the current request pulled from the prefill queue. This ensures that no remote prefill request is dropped. To scale down a decode worker, currently, planner revoke the etcd lease of the decode worker. When the etcd lease is revoked, the corresponding decode worker will be immediately removed from the router and will not get any new requests. The decode worker will then finish all the current requests in their original stream and exit gracefully. To scale up a prefill/decode worker, planner just need to launch the worker in the correct namespace. The auto-discovery mechanism picks up the workers and add them to the routers. To scale down a prefill worker, planner send a SIGTERM signal to the prefill worker. The prefill worker store the signal and exit when it finishes the current request pulled from the prefill queue. This ensures that no remote prefill request is dropped. To scale down a decode worker, planner revokes the etcd lease of the decode worker. When the etcd lease is revoked, the corresponding decode worker is immediately removed from the router and won't get any new requests. The decode worker then finishes all the current requests in their original stream and exits gracefully.
There are two additional rules set by planner to prevent over-compensation: There are two additional rules set by planner to prevent over-compensation:
1. After a new decode worker is added, since it needs time to populate the kv cache, planner will not scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals. 1. After a new decode worker is added, since it needs time to populate the kv cache, planner doesn't scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals.
1. We do not scale up prefill worker if the prefill queue size is estimated to reduce below the `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals following the trend observed in the current adjustment interval. 1. We do not scale up prefill worker if the prefill queue size is estimated to reduce below the `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals following the trend observed in the current adjustment interval.
## Comply with SLA ## Comply with SLA
...@@ -144,10 +144,11 @@ We currently support two backends: ...@@ -144,10 +144,11 @@ We currently support two backends:
### Local Backend ### Local Backend
Circus is a Python program which can be used to monitor and control processes and sockets. Dynamo serve uses circus to start each node in a graph and monitors each subprocesses. We leverage a core feature to do this called `Watcher`. A `Watcher` is the target program that you would like to run (which in our case is `serve_dynamo.py`). When planner decides to scale up or down, it will either add or remove a watcher from the existing `circus`. Circus is a Python program that can be used to monitor and control processes and sockets. Dynamo serve uses circus to start each node in a graph and monitors each subprocesses. We leverage a core feature to do this called `Watcher`. A `Watcher` is the target program that you would like to run (which in our case is `serve_dynamo.py`). When planner decides to scale up or down, it either adds or removes a watcher from the existing `circus`.
> [!NOTE] ``` {note}
> Although circus allows you to `increment` an existing watcher, it was not designed to allow variables to be passed in which does not allow us to schedule on a GPU. So instead we start a new watcher per process. When planner decides to add or remove a worker, we have logic to handle this adding/removing and incrementing/decrementing the workers. Although circus allows you to `increment` an existing watcher, it was not designed to allow variables to be passed in which does not allow us to schedule on a GPU. So instead we start a new watcher per process. When planner decides to add or remove a worker, we have logic to handle this adding/removing and incrementing/decrementing the workers.
```
#### Statefile #### Statefile
...@@ -155,7 +156,7 @@ The statefile is a json file created when initially running `dynamo serve` and i ...@@ -155,7 +156,7 @@ The statefile is a json file created when initially running `dynamo serve` and i
When one Decode worker is spun up, the statefile looks like: When one Decode worker is spun up, the statefile looks like:
```json ```none
{ {
"dynamo_VllmWorker": {..., resources={...}}, "dynamo_VllmWorker": {..., resources={...}},
} }
...@@ -163,7 +164,7 @@ When one Decode worker is spun up, the statefile looks like: ...@@ -163,7 +164,7 @@ When one Decode worker is spun up, the statefile looks like:
Now another decode worker is added: Now another decode worker is added:
```json ```none
{ {
"dynamo_VllmWorker": {..., resources={...}}, "dynamo_VllmWorker": {..., resources={...}},
"dynamo_VllmWorker_1": {..., resources={...}}, "dynamo_VllmWorker_1": {..., resources={...}},
...@@ -172,7 +173,7 @@ Now another decode worker is added: ...@@ -172,7 +173,7 @@ Now another decode worker is added:
Then one decode worker is removed: Then one decode worker is removed:
```json ```none
{ {
"dynamo_VllmWorker": {..., resources={...}}, "dynamo_VllmWorker": {..., resources={...}},
} }
...@@ -180,17 +181,18 @@ Then one decode worker is removed: ...@@ -180,17 +181,18 @@ Then one decode worker is removed:
If the last decode worker is removed, the statefile looks like: If the last decode worker is removed, the statefile looks like:
```json ```none
{ {
"dynamo_VllmWorker": {...}, "dynamo_VllmWorker": {...},
} }
``` ```
Note that we keep the initial non-suffix entry in order to know what cmd we will need to spin up another worker. This is the same for prefill workers as well. We keep the initial non-suffix entry in order to know what cmd we'll need to spin up another worker. This is the same for prefill workers as well.
> [!NOTE] ``` {note}
> At the moment - planner work best if your initial replicas per worker are 1. This is because if you specify replicas > 1 when you initially start `dynamo serve`, the current implementation in `serving.py` starts each process in the same watcher. At the moment - planner work best if your initial replicas per worker are 1. This is because if you specify replicas > 1 when you initially start `dynamo serve`, the current implementation in `serving.py` starts each process in the same watcher.
```
### Kubernetes Backend ### Kubernetes Backend
The Kubernetes backend works by updating the replicas count of the DynamoGraphDeployment custom resource. When the planner detects the need to scale up or down a specific worker type, it uses the Kubernetes API to patch the DynamoGraphDeployment resource, modifying the replicas count for the appropriate component. The Kubernetes operator then reconciles this change by creating or removing the necessary pods. This provides a seamless scaling experience in Kubernetes environments without requiring manual intervention. The Kubernetes backend works by updating the replicas count of the DynamoGraphDeployment custom resource. When the planner detects the need to scale up or down a specific worker type, it uses the Kubernetes API to patch the DynamoGraphDeployment resource, modifying the replicas count for the appropriate component. The Kubernetes operator then reconciles this change by creating or removing the necessary pods. This provides a seamless scaling experience in Kubernetes environments without requiring manual intervention.
\ No newline at end of file
...@@ -35,7 +35,7 @@ python sin_synth.py \ ...@@ -35,7 +35,7 @@ python sin_synth.py \
--osl2 150 --osl2 150
``` ```
This will generate a [mooncake style trace](https://github.com/kvcache-ai/Mooncake) with This generates a [mooncake style trace](https://github.com/kvcache-ai/Mooncake) with
* duration = 600 seconds * duration = 600 seconds
* isl/osl = 3000/150 * isl/osl = 3000/150
* request rate varies sinusoidally from 0.75 to 3 requests with a period of 150 seconds * request rate varies sinusoidally from 0.75 to 3 requests with a period of 150 seconds
...@@ -76,7 +76,7 @@ and open `http://localhost:6006` in your browser. The following metrics are avai ...@@ -76,7 +76,7 @@ and open `http://localhost:6006` in your browser. The following metrics are avai
* `num_decode_workers`: the number of decode workers * `num_decode_workers`: the number of decode workers
* `num_gpu`: the total number of GPUs used * `num_gpu`: the total number of GPUs used
The benchmark results will be printed out in terminal 3 that runs the `genai-perf` command. The benchmark results are printed out in terminal 3 that runs the `genai-perf` command.
In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--no-operation` flag to watch and log the metrics without making any adjustments: In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--no-operation` flag to watch and log the metrics without making any adjustments:
...@@ -92,7 +92,7 @@ genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deeps ...@@ -92,7 +92,7 @@ genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deeps
The below two figures show the performance comparison between planner and the baseline 2p2d deployment. Planner achieves 1.5x speedup while using 7.4% less GPU resources. The below two figures show the performance comparison between planner and the baseline 2p2d deployment. Planner achieves 1.5x speedup while using 7.4% less GPU resources.
![Planner Performance Comparison](./images/planner_perf.png) ![Two bar charts comparing 2P2D and Planner. Planner shows lower GPU usage and lower average sequence latency.](../../images/planner_perf.png)
![Planner Tensorboard](./images/planner_tensorboard.png) ![Planner Tensorboard; four line graphs comparing two runs: 2p2d_rr5-20_2 and planner_rr5-20.](../../images/planner_tensorboard.png)
..
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
.. This hidden toctree includes readmes etc that aren't meant to be in the main table of contents but should be accounted for in the sphinx project structure
.. toctree::
:maxdepth: 2
:hidden:
guides/README.md
runtime/README.md
examples/disagg_skeleton.md
\ No newline at end of file
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
This diff is collapsed.
# Dynamo Distributed KV Cache Manager
Calculating LLM KV values for user requests is resource-intensive and thus expensive. Leveraging KV cache to minimize the need for its recomputation is common practice. However, as AI demand increases, solely relying on GPU memory for KV cache would not be sustainable to meet SLA under fixed budget. It poses a significant demand to a more effective KV cache reuse management mechanism.
The Dynamo KV Cache Manager feature addresses this challenge by enabling the offloading of older or less frequently accessed KV cache blocks to more cost-effective memory and storage solutions, such as CPU memory, local storage or networked object or file storage. This capability enables organizations to store up to petabytes of KV cache data at a fraction of the cost of keeping it in GPU memory. By offloading KV cache to alternative memory hierarchies, developers can free up valuable GPU resources while still retaining and reusing historical KV cache to reduce inference computation costs.
<figure>
<img src='images/kv_cache_mgr.png' alt='missing' />
<p>Figure 1. Dynamo Distributed KV Cache Manager offloads less frequently accessed KV cache to more economical memory hierarchies </p>
</figure>
The Dynamo KV Cache Manager uses advanced caching policies that prioritize placing frequently accessed data in GPU memory, while less accessed data is moved to shared CPU memory, SSDs, or networked object storage. It incorporates eviction policies that strike a balance between over-caching (which can introduce lookup latencies) and under-caching (which leads to missed lookups and KV cache re-computation).
Additionally, this feature can manage KV cache across multiple GPU nodes, supporting both distributed and disaggregated inference serving, and offers hierarchical caching capabilities, creating offloading strategies at the GPU, node, and cluster levels.
The Dynamo KV Cache Manager is designed to be framework-agnostic to support various backends, including TensorRT-TLLM, vLLM, and SGLang, and to facilitate the scaling of KV cache storage across large, distributed clusters using NVLink, NVIDIA Quantum switches, and NVIDIA Spectrum switches. It integrates with [NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md) to enable data transfers across different worker instances and storage backends.
## Design
- Separation of Mechanism and Policy
- Mechanism: Manages memory allocation, caching hierarchy, and data flow.
- Policy: Determines caching strategies, including the choice of data structures (e.g., radix tree, distributed hash tables) and eviction algorithms.
This separation ensures that the underlying infrastructure can evolve without disrupting the caching logic. This design decision was created to enable each customer to come up with their own policies and mechanisms to manage memory that fits their access pattern.
- Hierarchical caching
- A radix tree provides a clean, structured approach for organizing KV storage in distributed inference. A local tree can be built per node, with a global tree at the cluster level, ensuring an efficient abstraction.
- The hierarchy spans HBM, local node KV stores, and external storage, with each layer caching data for the next to optimize lookups. Data movements across the tiers are handled using NIXL APIs for seamless communication. The data flow is fully asynchronous and is transparent to worker instances.
- Multiple backends are supported as long as they are compatible with KV manager APIs.
- RDMA transfers are preferred for optimal performance.
- Registration with runtimes
- Distributed KV manager registers with inference engine runtimes to enable KV offloading to the pool.
- Registration creates a two-way communication queue between the runtime and the pool.
- Management and transfer granularities
- KV blocks are managed in block level (group of tokens) however transfer of KV states can be performed at layer level.
- If multiple tokens are needed to be fetched, then these layer transfers are parallelized to ensure maximum throughput from the KV pool.
## V1 Implementation
Dynamo Distributed KV Manager has two implementations: V1 and V2. V1 serves as a proof-of-concept design, providing a lightweight KV offloading framework with simple, asynchronous APIs — GET() and PUT(), allowing inference engines to offload KV caches efficiently. These APIs are designed to be fully asynchronous, enabling seamless overlap with inference computation.
<figure>
<img src='images/kv_cache_mgr_design.png' alt='missing' />
<p>Figure 2. Design of Dynamo KV manager V1 </p>
</figure>
The left section of Figure 2 illustrates the execution timeline and data movement sequence in the V1 architecture. Inference engines like vLLM can initiate asynchronous operations with flexible access granularity, enabling various overlapping strategies to optimize execution based on whether the priority is throughput or latency.
The right section of Figure 2 depicts data flow within the runtime. At present, we do not allocate any portion of the GPU's high-bandwidth memory (HBM) beyond what is required by the inference engine, ensuring its exclusive utilization for inference tasks. Within the inference runtime, GPU device memory can either be fully dedicated to key-value (KV) storage or partially allocated for prefix KV caches, which are dynamically managed by the inference engine—similar to vLLM.
When the inference engine determines that some entries in the KV cache should be evicted from GPU memory, it invokes the put_async() API to offload them by the KV manager, which updates its index and transfers the data to the appropriate storage tier (CPU memory or a combination of CPU and SSD). Conversely, if the inference engine fails to locate a required KV entry in its self-managed prefix cache, it issues a get_async() request to the KV manager. If the KV entry already exists, retrieval via get_async() will significantly reduces recomputation overhead, ensuring efficient KV management, optimized memory utilization, and improved inference performance.
In the V1 implementation, CPU memory functions as a cache layer for SSD storage. If a required KV entry resides in CPU memory, the system bypasses SSD access, reducing transfer latency. Asynchronous APIs like get_async() or put_async() also enable transfers such that it does not impact system performance.
A key aspect of our implementation is the introduction of multiple parallel queues (or pipelines) for critical operations, including:
- Index matching, updates, and block allocation/free operations
- Data transfers between GPU and CPU
- Data transfers between CPU and SSD
This multi-queue design is crucial because it:
- Enables true asynchronous execution by decoupling blocking operations.
- Maximizes parallelism, allowing multiple requests to be processed concurrently.
- Fully utilizes different hardware resources, such as CPU, GPU, and storage, avoid bottlenecks.
- Decouples slow operations (e.g., SSD writes) from the critical path of responding to user queries, to improve responsiveness.
- Ensures the correctness of index updates and data transfers, even under high-throughput, concurrent workloads.
Looking ahead, V1 architecture will integrate with NIXL to enable KV reuse across multiple nodes. Additionally, we will add GPUDirect Storage capabilities to reduce the get_async() latency and minimize the CPU overhead while facilitating direct data transfers between GPU memory and SSD. These enhancements will be made available post-GTC.
V1 architecture is an excellent design for quick enablement and execution. However, it does not offer much finer control on memory management and interactions with the NVIDIA Dynamo ecosystem. To address this, we are parallelly implementing V2 architecture providing a notion of distributed KV pool across workers and storage. V2 architecture will be released in coming weeks.
## V2 Implementation
The V2 implementation introduces a distributed KV pool across worker instances and storage, incorporating all features outlined in the design. Development is still in progress, and we welcome collaborators to share their feedback. This documentation aims to offer a high-level overview of the V2 implementation and gather input.
The V2 BlockManager changes the ownership patterns to RAII objects. The primary object will be a KvBlock object which defines the contents of the tokens in the block and the unique sequence hash associated with that block. In Rust, the KvBlock is a generic KvBlock<S: BlockStorage>. This means each KvBlock is strongly typed to the storage type (S) which must conform to the behavior defined in the BlockStorage trait.
KvBlocks are allocated and ownership is transferred to a ReusePool object. The ReusePool object is used to provide free blocks in user defined priority. This specialized Pool is a compound collective, so it can also lookup and extract blocks from the pool by matching on the sequence hash.
When acquired from the ReusePool, the object is a PoolItem<KvBlock<BST>>. PoolItem is the object that is the RAII object that when it goes out of scope (Drop), it will be returned to the pool.
A PoolItem<KvBlock<BST> which is typedef’ed as `UniqueBlock<BST>` is a uniquely owned and mutable block. In order to make the block shareable and discoverable, the UniqueBlock<BST> must be registered with the ReservedBlockRegistry. Upon registration, a RegisteredBlock<BST> is returned – this block is shared, immutable and discoverable.
- Immutable - should only provide a const pointer to the storage
- Shared - internally atomically referenced counted object and therefore Cloneable.
- When refcount → 0, the block is unregistered and the backing UniqueBlock<BST> is returned to the ReusePool.
- Discoverable/Reserved
- Incoming requests can be matched to blocks by sequence hash which will return a list of RegisteredBlock<BST> clones for the matching blocks.
- Registered block state changes are emitted as events allowing the KV Aware Router to add/remove the block from the radix tree.
All data movement requires either Shared or Unique block ownership that is owned for the scope of the TransferEngines operation.
For example,
```bash
pub async fn copy_blocks<D, S>(dst: &[KvBlock<D>], src: &[KvBlock<S>]) -> Result<()>;
```
This allows us to specialize implementations on D and S which will be compiler matched. For python, this will be dynamically dispatched and mismatched types will be raised as exceptions.
The underlying Storage is layer-aware. This allows for us to expose a layer-wise trigger.
```bash
pub async fn copy_blocks_by_layer<D, S>(dst: &[KvBlock<D>], src: &[KvBlock<S>, layers: &[usize]) -> Result<()>;
```
To coordinate layer-wise chaining of transfers, say from GPU -> CPU -> Storage we will provide TransferCoordinator can pipeline layer transfers from to the next storage. Example, the moment a layer or set of layers arrives in CPU memory from GPU, we can trigger those layers begin CPU -> Storage. This allows the secondary transfers to have layer-wise overlap with the primary transfers.
This diff is collapsed.
This diff is collapsed.
../../docs/examples/hello_world.md
\ No newline at end of file
...@@ -248,4 +248,4 @@ curl localhost:8000/v1/chat/completions \ ...@@ -248,4 +248,4 @@ curl localhost:8000/v1/chat/completions \
}' }'
``` ```
For more details on managing deployments, testing, and troubleshooting, please refer to the [Operator Deployment Guide](../../docs/guides/dynamo_deploy/operator_deployment.md). For more details on managing deployments, testing, and troubleshooting, please refer to the [Operator Deployment Guide](../../docs/guides/dynamo_deploy/operator_deployment.md).
\ No newline at end of file
This diff is collapsed.
../../docs/examples/multinode.md
\ No newline at end of file
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment