Unverified Commit 4c3eba2a authored by akshatha-k's avatar akshatha-k Committed by GitHub
Browse files

docs: restructure KVBM documentation into three-tier format (#5905)


Signed-off-by: default avatarakshatha-k <akshutk@gmail.com>
Signed-off-by: default avatarDan Gil <dagil@nvidia.com>
Signed-off-by: default avatardagil-nvidia <dagil@nvidia.com>
Co-authored-by: default avatardagil-nvidia <dagil@nvidia.com>
Co-authored-by: default avatarCursor <cursoragent@cursor.com>
Co-authored-by: default avatarKyle McGill <101670481+nv-kmcgill53@users.noreply.github.com>
parent 87449e1c
......@@ -54,7 +54,7 @@ Built in Rust for performance and Python for extensibility, Dynamo is fully open
| [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**KV-Aware Routing**](docs/router/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
| [**KVBM**](docs/kvbm/kvbm_architecture.md) | 🚧 | ✅ | ✅ |
| [**KVBM**](docs/kvbm/README.md) | 🚧 | ✅ | ✅ |
| [**Multimodal**](docs/multimodal/index.md) | ✅ | ✅ | ✅ |
| [**Tool Calling**](docs/agents/tool-calling.md) | ✅ | ✅ | ✅ |
......@@ -390,7 +390,7 @@ See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LL
[disagg]: docs/design_docs/disagg_serving.md
[kv-routing]: docs/router/kv_cache_routing.md
[planner]: docs/planner/sla_planner.md
[kvbm]: docs/kvbm/kvbm_architecture.md
[kvbm]: docs/kvbm/README.md
[mm]: examples/multimodal/
[migration]: docs/fault_tolerance/request_migration.md
[lora]: examples/backends/vllm/deploy/lora/README.md
......
......@@ -39,7 +39,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| [**KV-Aware Routing**](../../router/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ | |
| [**Multimodal Support**](../../multimodal/sglang.md) | ✅ | |
| [**KVBM**](../../kvbm/kvbm_architecture.md) | ❌ | Planned |
| [**KVBM**](../../kvbm/README.md) | ❌ | Planned |
## Dynamo SGLang Integration
......
......@@ -57,7 +57,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | |
| [**KVBM**](../../../docs/kvbm/README.md) | ✅ | |
### Large Scale P/D and WideEP Features
......@@ -297,7 +297,7 @@ For detailed instructions on running comprehensive performance sweeps across bot
Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/trtllm-setup.md) .
Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .
## Known Issues and Mitigations
......
......@@ -40,8 +40,8 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/kvbm/kvbm_architecture.md) | ✅ | |
| [**LMCache**](./LMCache_Integration.md) | ✅ | |
| [**KVBM**](../../../docs/kvbm/README.md) | ✅ | |
| [**LMCache**](../../integrations/lmcache_integration.md) | ✅ | |
| [**Prompt Embeddings**](./prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag |
### Large Scale P/D and WideEP Features
......
......@@ -11,7 +11,7 @@ When running vLLM through Dynamo, vLLM engine metrics are automatically passed t
**For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).
**For LMCache metrics and integration**, see the [LMCache Integration Guide](LMCache_Integration.md).
**For LMCache metrics and integration**, see the [LMCache Integration Guide](../../integrations/lmcache_integration.md).
**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
......@@ -133,10 +133,10 @@ curl -s localhost:8081/metrics | grep "^lmcache:"
Troubleshooting LMCache-related metrics and logs (including `PrometheusLogger instance already created with different metadata` and `PROMETHEUS_MULTIPROC_DIR` warnings) is documented in:
- [LMCache Integration Guide](LMCache_Integration.md#troubleshooting)
- [LMCache Integration Guide](../../integrations/lmcache_integration.md#troubleshooting)
**For complete LMCache configuration and metric details**, see:
- [LMCache Integration Guide](LMCache_Integration.md) - Setup and configuration
- [LMCache Integration Guide](../../integrations/lmcache_integration.md) - Setup and configuration
- [LMCache Observability Documentation](https://docs.lmcache.ai/production/observability/vllm_endpoint.html) - Complete metrics reference
## Implementation Details
......
......@@ -37,8 +37,6 @@
kubernetes/README.md
reference/cli.md
observability/metrics.md
kvbm/vllm-setup.md
kvbm/trtllm-setup.md
agents/tool-calling.md
development/jail_stream.md
......@@ -77,7 +75,6 @@
backends/vllm/deepseek-r1.md
backends/vllm/gpt-oss.md
backends/vllm/LMCache_Integration.md
backends/vllm/multi-node.md
backends/vllm/prometheus.md
backends/vllm/prompt-embeddings.md
......
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
......@@ -58,6 +58,7 @@ Quickstart
:hidden:
:caption: User Guides
KV Cache Offloading <kvbm/kvbm_guide.md>
Tool Calling <agents/tool-calling.md>
Multimodality Support <features/multimodal/README.md>
LoRA Adapters <features/lora/README.md>
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# FlexKV Integration in Dynamo
## Introduction
[FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed runtime for KV cache offloading developed by Tencent Cloud's TACO team in collaboration with the community. It acts as a unified KV caching layer for inference engines like vLLM, TensorRT-LLM, and SGLang.
### Key Features
- **Multi-level caching**: CPU memory, local SSD, and scalable storage (cloud storage) for KV cache offloading
- **Distributed KV cache reuse**: Share KV cache across multiple nodes using distributed RadixTree
- **High-performance I/O**: Supports io_uring and GPU Direct Storage (GDS) for accelerated data transfer
- **Asynchronous operations**: Get and put operations can overlap with computation through prefetching
## Prerequisites
1. **Dynamo installed** with vLLM support
2. **Infrastructure services running**:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
3. **FlexKV dependencies** (for SSD offloading):
```bash
apt install liburing-dev libxxhash-dev
```
## Quick Start
### Enable FlexKV
Set the `DYNAMO_USE_FLEXKV` environment variable and use the `--connector flexkv` flag:
```bash
export DYNAMO_USE_FLEXKV=1
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector flexkv
```
## Aggregated Serving
### Basic Setup
```bash
# Terminal 1: Start frontend
python -m dynamo.frontend &
# Terminal 2: Start vLLM worker with FlexKV
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector flexkv
```
### With KV-Aware Routing
For multi-worker deployments with KV-aware routing to maximize cache reuse:
```bash
# Terminal 1: Start frontend with KV router
python -m dynamo.frontend \
--router-mode kv \
--router-reset-states &
# Terminal 2: Worker 1
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_0" \
CUDA_VISIBLE_DEVICES=0 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--connector flexkv \
--gpu-memory-utilization 0.2 \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080","enable_kv_cache_events":true}' &
# Terminal 3: Worker 2
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_1" \
CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--connector flexkv \
--gpu-memory-utilization 0.2 \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
```
## Disaggregated Serving
FlexKV can be used with disaggregated prefill/decode serving. The prefill worker uses FlexKV for KV cache offloading, while NIXL handles KV transfer between prefill and decode workers.
```bash
# Terminal 1: Start frontend
python -m dynamo.frontend &
# Terminal 2: Decode worker (without FlexKV)
CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector nixl &
# Terminal 3: Prefill worker (with FlexKV)
DYN_VLLM_KV_EVENT_PORT=20081 \
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--is-prefill-worker \
--connector nixl flexkv
```
## Configuration
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `DYNAMO_USE_FLEXKV` | Enable FlexKV integration | `0` (disabled) |
| `FLEXKV_CPU_CACHE_GB` | CPU memory cache size in GB | Required |
| `FLEXKV_CONFIG_PATH` | Path to FlexKV YAML config file | Not set |
| `FLEXKV_SERVER_RECV_PORT` | IPC port for FlexKV server | Auto |
### CPU-Only Offloading
For simple CPU memory offloading:
```bash
unset FLEXKV_CONFIG_PATH
export FLEXKV_CPU_CACHE_GB=32
```
### CPU + SSD Tiered Offloading
For multi-tier offloading with SSD storage, create a configuration file:
```bash
cat > ./flexkv_config.yml <<EOF
cpu_cache_gb: 32
ssd_cache_gb: 1024
ssd_cache_dir: /data0/flexkv_ssd/;/data1/flexkv_ssd/
enable_gds: false
EOF
export FLEXKV_CONFIG_PATH="./flexkv_config.yml"
```
### Configuration Options
| Option | Description |
|--------|-------------|
| `cpu_cache_gb` | CPU memory cache size in GB |
| `ssd_cache_gb` | SSD cache size in GB |
| `ssd_cache_dir` | SSD cache directories (semicolon-separated for multiple SSDs) |
| `enable_gds` | Enable GPU Direct Storage for SSD I/O |
> **Note:** For full configuration options, see the [FlexKV Configuration Reference](https://github.com/taco-project/FlexKV/blob/main/docs/flexkv_config_reference/README_en.md).
## Distributed KV Cache Reuse
FlexKV supports distributed KV cache reuse to share cache across multiple nodes. This enables:
- **Distributed RadixTree**: Each node maintains a local snapshot of the global index
- **Lease Mechanism**: Ensures data validity during cross-node transfers
- **RDMA-based Transfer**: Uses Mooncake Transfer Engine for high-performance KV cache transfer
For setup instructions, see the [FlexKV Distributed Reuse Guide](https://github.com/taco-project/FlexKV/blob/main/docs/dist_reuse/README_en.md).
## Architecture
FlexKV consists of three core modules:
### StorageEngine
Initializes the three-level cache (GPU → CPU → SSD/Cloud). It groups multiple tokens into blocks and stores KV cache at the block level, maintaining the same KV shape as in GPU memory.
### GlobalCacheEngine
The control plane that determines data transfer direction and identifies source/destination block IDs. Includes:
- RadixTree for prefix matching
- Memory pool to track space usage and trigger eviction
### TransferEngine
The data plane that executes data transfers:
- Multi-threading for parallel transfers
- High-performance I/O (io_uring, GDS)
- Asynchronous operations overlapping with computation
## Verify Deployment
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false,
"max_tokens": 30
}'
```
## See Also
- [FlexKV GitHub Repository](https://github.com/taco-project/FlexKV)
- [FlexKV vLLM Adapter Documentation](https://github.com/taco-project/FlexKV/blob/main/docs/vllm_adapter/README_en.md)
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# LMCache Integration in Dynamo
## Introduction
......@@ -6,18 +23,12 @@ LMCache is a high-performance KV cache layer that supercharges LLM serving by en
This document describes how LMCache is integrated into Dynamo's vLLM backend to provide enhanced performance and memory efficiency.
### Key Benefits
- **Reduced Time to First Token (TTFT)**: Eliminates redundant prefill computations
- **Memory Offloading**: Intelligent KV cache placement across CPU/GPU/storage tiers
- **Improved Throughput**: Reduced GPU memory pressure enables higher batch sizes
## Platform Support
**Important Note**: LMCache integration currently only supports x86 architecture. ARM64 is not supported at this time.
## Aggregated Serving
### Configuration
LMCache is enabled using the `--connector lmcache` flag:
......@@ -31,6 +42,7 @@ python -m dynamo.vllm --model <model_name> --connector lmcache
LMCache configuration can be customized via environment variables listed [here](https://docs.lmcache.ai/api_reference/configurations.html).
For advanced configurations, LMCache supports multiple [storage backends](https://docs.lmcache.ai/index.html):
- **CPU RAM**: Fast local memory offloading
- **Local Storage**: Disk-based persistence
- **Redis**: Distributed cache sharing
......@@ -46,12 +58,13 @@ Use the provided launch script for quick setup:
```
This will:
1. Start the dynamo frontend
1. Start the Dynamo frontend
2. Launch a single vLLM worker with LMCache enabled
### Architecture for Aggregated Mode
In aggregated mode, the system uses:
- **KV Connector**: `LMCacheConnectorV1`
- **KV Role**: `kv_both` (handles both reading and writing)
......@@ -61,14 +74,14 @@ Disaggregated serving separates prefill and decode operations into dedicated wor
### Deployment
Use the provided disaggregated launch script(the script requires at least 2 GPUs):
Use the provided disaggregated launch script (requires at least 2 GPUs):
```bash
./examples/backends/vllm/launch/disagg_lmcache.sh
```
This will:
1. Start the dynamo frontend
1. Start the Dynamo frontend
2. Launch a decode worker on GPU 0
3. Wait for initialization
4. Launch a prefill worker on GPU 1 with LMCache enabled
......@@ -76,14 +89,16 @@ This will:
### Worker Roles
#### Decode Worker
- **Purpose**: Handles token generation (decode phase)
- **GPU Assignment**: CUDA_VISIBLE_DEVICES=0
- **LMCache Config**: Uses `NixlConnector` only for kv transfer between prefill and decode workers
- **LMCache Config**: Uses `NixlConnector` only for KV transfer between prefill and decode workers
#### Prefill Worker
- **Purpose**: Handles prompt processing (prefill phase)
- **GPU Assignment**: CUDA_VISIBLE_DEVICES=1
- **LMCache Config**: Uses `MultiConnector` with both LMCache and NIXL connectors. This enables prefill worker to use LMCache for kv offloading and use NIXL for kv transfer between prefill and decode workers.
- **LMCache Config**: Uses `MultiConnector` with both LMCache and NIXL connectors. This enables prefill worker to use LMCache for KV offloading and use NIXL for KV transfer between prefill and decode workers.
- **Flag**: `--is-prefill-worker`
## Architecture
......@@ -93,6 +108,7 @@ This will:
The system automatically configures KV transfer based on the deployment mode and worker type:
#### Prefill Worker (Disaggregated Mode)
```python
kv_transfer_config = KVTransferConfig(
kv_connector="PdConnector",
......@@ -107,6 +123,7 @@ kv_transfer_config = KVTransferConfig(
```
#### Decode Worker or Aggregated Mode
```python
kv_transfer_config = KVTransferConfig(
kv_connector="LMCacheConnectorV1",
......@@ -115,6 +132,7 @@ kv_transfer_config = KVTransferConfig(
```
#### Fallback (No LMCache)
```python
kv_transfer_config = KVTransferConfig(
kv_connector="NixlConnector",
......@@ -133,7 +151,6 @@ kv_transfer_config = KVTransferConfig(
- Creates vLLM engine with proper KV transfer config
- Handles both aggregated and disaggregated modes
### Best Practices
1. **Chunk Size Tuning**: Adjust `LMCACHE_CHUNK_SIZE` based on your use case:
......@@ -154,15 +171,16 @@ kv_transfer_config = KVTransferConfig(
When LMCache is enabled with `--connector lmcache` and `DYN_SYSTEM_PORT` is set, LMCache metrics are automatically exposed via Dynamo's `/metrics` endpoint alongside vLLM and Dynamo metrics.
**Requirements to access LMCache metrics:**
- `--connector lmcache` - Enables LMCache
- `DYN_SYSTEM_PORT=8081` - Enables metrics HTTP endpoint
- `PROMETHEUS_MULTIPROC_DIR` (optional) - If not set, Dynamo manages it internally. Only set explicitly if you need control over the metrics directory.
- `PROMETHEUS_MULTIPROC_DIR` (optional) - If not set, Dynamo manages it internally
For detailed information on LMCache metrics, including the complete list of available metrics and how to access them, see the **[LMCache Metrics section](prometheus.md#lmcache-metrics)** in the vLLM Prometheus Metrics Guide.
For detailed information on LMCache metrics, including the complete list of available metrics and how to access them, see the **[LMCache Metrics section](../backends/vllm/prometheus.md#lmcache-metrics)** in the vLLM Prometheus Metrics Guide.
### Troubleshooting
## Troubleshooting
#### LMCache log: `PrometheusLogger instance already created with different metadata`
### LMCache log: `PrometheusLogger instance already created with different metadata`
You may see an error like:
......@@ -192,7 +210,7 @@ vllm serve Qwen/Qwen3-0.6B \
- **Mitigation (silence)**: set `LMCACHE_LOG_LEVEL=CRITICAL`.
- **Upstream issue**: [vLLM issue #30996](https://github.com/vllm-project/vllm/issues/30996).
#### vLLM log: `Found PROMETHEUS_MULTIPROC_DIR was set by user`
### vLLM log: `Found PROMETHEUS_MULTIPROC_DIR was set by user`
vLLM v1 uses `prometheus_client.multiprocess` and stores intermediate metric values in `PROMETHEUS_MULTIPROC_DIR`.
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Enable SGLang Hierarchical Cache (HiCache)
This guide shows how to enable SGLang's Hierarchical Cache (HiCache) inside Dynamo.
## 1) Start the SGLang worker with HiCache enabled
```bash
python -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--host 0.0.0.0 --port 8000 \
--page-size 64 \
--enable-hierarchical-cache \
--hicache-ratio 2 \
--hicache-write-policy write_through \
--hicache-storage-backend nixl \
--log-level debug \
--skip-tokenizer-init
```
- **--enable-hierarchical-cache**: Enables hierarchical KV cache/offload
- **--hicache-ratio**: The ratio of the size of host KV cache memory pool to the size of device pool. Lower this number if your machine has less CPU memory.
- **--hicache-write-policy**: Write policy (e.g., `write_through` for synchronous host writes)
- **--hicache-storage-backend**: Host storage backend for HiCache (e.g., `nixl`). NIXL selects the concrete store automatically; see [PR #8488](https://github.com/sgl-project/sglang/pull/8488)
Then, start the frontend:
```bash
python -m dynamo.frontend --http-port 8000
```
## 2) Send a single request
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
}
],
"stream": false,
"max_tokens": 30
}'
```
## 3) (Optional) Benchmarking
Run the perf script:
```bash
bash -x $DYNAMO_ROOT/benchmarks/llm/perf.sh \
--model Qwen/Qwen3-0.6B \
--tensor-parallelism 1 \
--data-parallelism 1 \
--concurrency "2,4,8" \
--input-sequence-length 2048 \
--output-sequence-length 256
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# KV Block Manager (KVBM)
The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM and TensorRT-LLM.
KVBM offers:
- A **unified memory API** spanning GPU memory, pinned host memory, remote RDMA-accessible memory, local/distributed SSDs, and remote file/object/cloud storage systems
- Support for **block lifecycles** (allocate → register → match) with event-based state transitions
- Integration with **[NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)**, a dynamic memory exchange layer for remote registration, sharing, and access of memory blocks
> **Get started:** See the [KVBM Guide](kvbm_guide.md) for installation and deployment instructions.
## When to Use KV Cache Offloading
KV Cache offloading avoids expensive KV Cache recomputation, resulting in faster response times and better user experience. Providers benefit from higher throughput and lower cost per token, making inference services more scalable and efficient.
Offloading KV cache to CPU or storage is most effective when KV Cache exceeds GPU memory and cache reuse outweighs the overhead of transferring data. It is especially valuable in:
| Scenario | Benefit |
|----------|---------|
| **Long sessions and multi-turn conversations** | Preserves large prompt prefixes, avoids recomputation, improves first-token latency and throughput |
| **High concurrency** | Idle or partial conversations can be moved out of GPU memory, allowing active requests to proceed without hitting memory limits |
| **Shared or repeated content** | Reuse across users or sessions (system prompts, templates) increases cache hits, especially with remote or cross-instance sharing |
| **Memory- or cost-constrained deployments** | Offloading to RAM or SSD reduces GPU demand, allowing longer prompts or more users without adding hardware |
## Feature Support Matrix
| | Feature | Support |
|--|---------|---------|
| **Backend** | Local | ✅ |
| | Kubernetes | ✅ |
| **LLM Framework** | vLLM | ✅ |
| | TensorRT-LLM | ✅ |
| | SGLang | ❌ |
| **Serving Type** | Aggregated | ✅ |
| | Disaggregated | ✅ |
## Architecture
![KVBM Architecture](../images/kvbm-architecture.png)
*High-level layered architecture view of Dynamo KV Block Manager and how it interfaces with different components of the LLM inference ecosystem*
KVBM has three primary logical layers:
**LLM Inference Runtime Layer** — The top layer includes inference runtimes (TensorRT-LLM, vLLM) that integrate through dedicated connector modules to the Dynamo KVBM. These connectors act as translation layers, mapping runtime-specific operations and events into KVBM's block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and memory tiering.
**KVBM Logic Layer** — The middle layer encapsulates core KV block manager logic and serves as the runtime substrate for managing block memory. The KVBM adapter normalizes representations and data layout for incoming requests across runtimes and forwards them to the core memory manager. This layer implements table lookups, memory allocation, block layout management, lifecycle state transitions, and block reuse/eviction policies.
**NIXL Layer** — The bottom layer provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, dynamic block registration and metadata exchange, and provides a plugin interface for storage backends including block memory (GPU HBM, Host DRAM, Remote DRAM, Local SSD), local/remote filesystems, object stores, and cloud storage.
> **Learn more:** See the [KVBM Design Document](kvbm_design.md) for detailed architecture, components, and data flows.
## Next Steps
- **[KVBM Guide](kvbm_guide.md)** — Installation, configuration, and deployment instructions
- **[KVBM Design](kvbm_design.md)** — Architecture deep dive, components, and data flows
- **[LMCache Integration](../integrations/lmcache_integration.md)** — Use LMCache with Dynamo vLLM backend
- **[FlexKV Integration](../integrations/flexkv_integration.md)** — Use FlexKV for KV cache management
- **[SGLang HiCache](../integrations/sglang_hicache.md)** — Enable SGLang's hierarchical cache with NIXL
- **[NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** — NIXL communication library details
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# KVBM Architecture
The KVBM serves as a critical infrastructure component for scaling LLM inference workloads efficiently. By cleanly separating runtime logic from memory management, and by enabling distributed block sharing, KVBM lays the foundation for high-throughput, multi-node, and memory-disaggregated AI systems.
![A block diagram showing a layered architecture view of Dynamo KV Block manager.](../images/kvbm-architecture.png)
**High level layered architecture view of Dynamo KV Block manager and how it interfaces with different components of LLM inference ecosystem**
The KVBM has three primary logical layers. The top layer-the LLM inference runtimes (TRTLLM, vLLM and SGLang)-integrates through a dedicated connector module to the Dynamo KVBM module. These connectors act as translation layers, mapping runtime-specific operations and events into the KVBM’s block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and providing memory tiering.
The middle layer, the KVBM layer, encapsulates the core logic of the KV block manager and serves as the runtime substrate for managing block memory. The KVBM adapter layer normalizes the representations and data layout for the incoming requests across runtimes and forwards them to the core memory manager. The KVBM and the core modules implement required internal functionality, such as table lookups, memory allocation, block layout management, lifecycle, and state transitions and block reuse or eviction was on policies. The KVBM layer also has required abstractions for external components to override or augment its behavior.
The last layer, the NIXL layer, provides unified support for enabling all data and storage transactions. NIXL enables P2P GPU transfers, enables RDMA and NVLINK remote memory sharing, dynamic block registration and metadata exchange and provides a plugin interface for storage backends.
NIXL integrates with several backends:
- Block memory (Eg. GPU HBM, Host DRAM, Remote DRAM, Local SSD when exposed as block device)
- Local file system (for example, POSIX)
- Remote file system (for example, NFS)
- Object stores (for example, S3-compatible)
- Cloud storage (for example, blob storage APIs)
**[NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** abstracts away the registration and integration complexity for each backends via custom optimizable plugin architecture and enables memory blocks to be published, serialized, and accessed remotely, allowing the disaggregation of compute and memory across nodes. Combined with the Dynamo KV Block Manager (KVBM), storage providers no longer need to retrofit or optimize individual LLM inference engines. Instead, they can focus on tuning their own stack, providing optimized endpoints, knowing that integration is smooth, standardized, and efficient. And for those who *do* want to go further, Dynamo KVBM offers a clean separation of concerns, making custom optimization not only possible, but simple.
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Understanding KVBM components
KVBM design takes inspiration from the KV block managers used in vLLM and SGLang, with an added influence from historical memory tiering strategies common in general GPU programming. For more details, [See KVBM Reading](kvbm_reading.md). The figure below illustrates the internal components of KVBM.
![Internal Components of Dynamo KVBM. ](../images/kvbm-components.png)
**Internal Components of Dynamo KVBM**
## KVBM Components
### Core
- **KvBlockManager**: Public facade. Constructs/owns the internal state and exposes the pools and onboarding APIs.
- **Scheduler**: Gates transfer execution relative to model progress (iteration/layer completion) when integrated with a framework connector (e.g., vLLM V1).
- **Config (config.rs)**: Describes model dims, page size, layout choices, and runtime flags used to build pools and layouts.
- **KvBlockManagerState**: Central object wiring together layouts, storage backends, and pools; owns the OffloadManager, metrics, and events hooks.
- **Events/Metrics**: Observability components emitting counters/gauges and event hooks for integration/testing.
### Layouts and Blocks
- **LayoutConfig & LayoutType**: Translate tensor shapes into KV cache layouts (layer-separated or fully-contiguous), including block counts and geometry.
- **Blocks & Metadata**: Typed block handles (mutable/immutable), metadata (e.g., priority), and views by layer/outer dims; used to allocate, register, and match by `sequence_hash`.
### Transfer Manager
- **TransferManager**: Asynchronous transfer orchestrator with per-path queues (Device→Host, Host→Disk, Host→Device, Disk→Device).
### Storage & Pools
- **Device Pool(G1)**: GPU-resident KV block pool. Allocates mutable GPU blocks, registers completed blocks (immutable), serves lookups by sequence hash, and is the target for onboarding (Host→Device, Disk→Device).
- **Host Pool(G2)**: CPU pinned-memory KV block pool. Receives Device offloads (Device→Host), can onboard to Device (Host→Device), and offloads to Disk. Uses pinned (page-locked) memory for efficient CUDA transfers and NIXL I/O.
- **Disk Pool(G3)**: Local SSD NVMe-backed KV block pool. Receives Host offloads (Host→Disk) and provides blocks for onboarding to Device (Disk→Device). NIXL descriptors expose file offsets/regions for zero-copy I/O and optional GDS.
## KVBM DataFlows
![KVBM Data Flows. ](../images/kvbm-data-flows.png)
**KVBM Data Flows from device to other memory hierarchies**
**Device → Host (Offload)**
* Triggered explicitly requested to offload by the connector scheduler.
* Worker allocates a Host block and performs CUDA D2H/Custom Kernel copy.
* Host pool registers the new immutable block (dedup by sequence hash).
**Host → Disk (Offload)**
* Local Disk: NIXL Write via POSIX; GDS when available.
* Remote Disk (Network FS like NFS/Lustre/GPFS): NIXL Write via POSIX to the mounted FS; batching/concurrency identical.
* Triggered on registered host blocks or explicit offload requests.
* Worker allocates a Disk block and performs NIXL Write (Host→Disk).
* Disk pool registers the new immutable block (dedup by sequence hash).
**Host → Device (Onboard)**
* Called to bring a host block into GPU memory.
* Worker uses provided Device targets and performs CUDA H2D/Custom Kernel copy.
* Device pool registers the new immutable block.
**Disk → Device (Onboard)**
* Called to bring a disk block directly into GPU memory.
* Worker uses provided Device targets and performs NIXL Read (Disk→Device), possibly via GDS.
* Device pool registers the new immutable block.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# KVBM Design
This document provides an in-depth look at the architecture, components, framework integrations via the connector API, and the detailed workings of the Dynamo KV Block Manager (KVBM). The design of KVBM takes inspiration from the KV block managers used in vLLM and SGLang, with added influence from historical memory tiering strategies common in general GPU programming. For more details, see [Further Reading](#further-reading).
## KVBM Components
![Internal Components of Dynamo KVBM](../images/kvbm-components.png)
*Internal Components of Dynamo KVBM*
### Core
- **KvBlockManager**: Public facade. Constructs/owns the internal state and exposes the pools and onboarding APIs.
- **Scheduler**: Gates transfer execution relative to model progress (iteration/layer completion) when integrated with a framework connector (e.g., vLLM V1).
- **Config (config.rs)**: Describes model dims, page size, layout choices, and runtime flags used to build pools and layouts.
- **KvBlockManagerState**: Central object wiring together layouts, storage backends, and pools; owns the OffloadManager, metrics, and events hooks.
- **Events/Metrics**: Observability components emitting counters/gauges and event hooks for integration/testing.
### Layouts and Blocks
- **LayoutConfig & LayoutType**: Translate tensor shapes into KV cache layouts (layer-separated or fully-contiguous), including block counts and geometry.
- **Blocks & Metadata**: Typed block handles (mutable/immutable), metadata (e.g., priority), and views by layer/outer dims; used to allocate, register, and match by `sequence_hash`.
### Transfer Manager
- **TransferManager**: Asynchronous transfer orchestrator with per-path queues (Device→Host, Host→Disk, Host→Device, Disk→Device).
### Storage & Pools
- **Device Pool (G1)**: GPU-resident KV block pool. Allocates mutable GPU blocks, registers completed blocks (immutable), serves lookups by sequence hash, and is the target for onboarding (Host→Device, Disk→Device).
- **Host Pool (G2)**: CPU pinned-memory KV block pool. Receives Device offloads (Device→Host), can onboard to Device (Host→Device), and offloads to Disk. Uses pinned (page-locked) memory for efficient CUDA transfers and NIXL I/O.
- **Disk Pool (G3)**: Local SSD NVMe-backed KV block pool. Receives Host offloads (Host→Disk) and provides blocks for onboarding to Device (Disk→Device). NIXL descriptors expose file offsets/regions for zero-copy I/O and optional GDS.
- **Remote Storage (G4)**: Remote or cloud-backed KV block storage. KVBM treats G4 as an opaque blob store accessed through NIXL, unaware of internal layout optimizations.
## KVBM Data Flows
![KVBM Data Flows](../images/kvbm-data-flows.png)
*KVBM Data Flows from device to other memory hierarchies*
### Device → Host (Offload)
- Triggered when explicitly requested to offload by the connector scheduler
- Worker allocates a Host block and performs CUDA D2H/Custom Kernel copy
- Host pool registers the new immutable block (dedup by sequence hash)
### Host → Disk (Offload)
- **Local Disk (G3)**: NIXL Write via POSIX; GDS when available
- **Remote Disk (G4)** (Network FS like NFS/Lustre/GPFS): NIXL Write via POSIX to the mounted FS; batching/concurrency identical
- Triggered on registered host blocks or explicit offload requests
- Worker allocates a Disk block and performs NIXL Write (Host→Disk)
- Disk pool registers the new immutable block (dedup by sequence hash)
### Host → Device (Onboard)
- Called to bring a host block into GPU memory
- Worker uses provided Device targets and performs CUDA H2D/Custom Kernel copy
- Device pool registers the new immutable block
### Disk → Device (Onboard)
- Called to bring a disk block directly into GPU memory
- Worker uses provided Device targets and performs NIXL Read (Disk→Device), possibly via GDS
- Device pool registers the new immutable block
## Internal Architecture Deep Dive
![Internal architecture and key modules in the Dynamo KVBM](../images/kvbm-internal-arch.png)
*Internal architecture and key modules in the Dynamo KVBM*
### KvBlockManager as Orchestration Layer
The `KvBlockManager<H, D>` acts as a coordinator across memory tiers—host (CPU), device (GPU), and remote—by managing per-backend block pools and exposing consistent block lifecycle APIs. It tracks KV block locations across device memory (G1), CPU memory within and across nodes (G2), local/pooled SSDs (G3), and remote storage (G4). G1-G4 are key tiers enabled by KVBM. Note that KVBM treats G4 storage as an opaque blob store, unaware of internal layout optimizations.
`KvBlockManager<H, D>` owns:
- A device-side `BlockPool<Device>`
- A host-side `BlockPool<Host>`
- A remote NIXL agent that supports communication and memory sharing across nodes
- A block set registry for remote lookup and import/export of block metadata
Implementation-wise, `KvBlockManagerState` holds the logic: it's initialized by `KvBlockManagerConfig`, which merges runtime, model, and layout configurations. `NixlOptions` injects remote awareness.
### Block Layout and Memory Mapping
Each block is a 2D array `[num_layers][page_size × inner_dim]`. The `BlockLayout` trait abstracts the memory layout. The default implementation, `FullyContiguous`, stores all layers for all blocks in one region with alignment-aware stride computation:
```text
block_stride_in_bytes = align_up(num_layers × layer_stride, alignment);
```
Both CPU and GPU pools share this memory layout, but they use storage-specific backends:
- `DeviceStorage` → CUDA device buffer
- `PinnedStorage` → page-locked host memory
- `SystemStorage` → CPU heap memory (fallback/test)
- `NixlStorage` → remote memory through NIXL RDMA handles (includes storage)
Each layout is constructed using a `LayoutConfig`, and storage is either passed directly or allocated using a `StorageAllocator`.
### BlockPool and Memory Pools (Active and Inactive)
Each `BlockPool<T>` (where `T` is `DeviceStorage`, `PinnedStorage`, etc.) tracks two sub-pools:
- **ActivePool**: Contains blocks currently in use by sequences
- **InactivePool**: Recycled blocks ready for allocation (free list)
When a token block is requested (e.g., `get_mutable_block()`), the allocator pops from `InactivePool`, transitions its state, and returns a writable handle. On sequence commit or eviction, the system resets blocks and returns them to the inactive pool.
### Block State Machine
The state machine (`BlockState`) tracks block lifecycle transitions:
| State | Description | Ownership | Valid Actions/Transitions |
|-------|-------------|-----------|---------------------------|
| Reset | Block hasn't been initialized or was reset. No associated sequence. | Held in InactivePool, reusable | `init_sequence(salt_hash)` → Partial |
| Partial | Block is being filled with tokens for a new sequence. In-progress. | Owned by the sequence creator | `add_token()` / `add_tokens()` (accumulate), `commit()` → Complete, `reset()` → Reset |
| Complete | Block is fully filled with token data but not yet visible to others. | Still owned by creator thread | `register()` → Registered, `reset()` → Reset |
| Registered | Block is finalized and visible for reuse. Available in the deduplication cache. | Shared ownership (global registry) | Auto `drop()` → triggers Remove event and transitions to Reset |
#### Valid State Transitions
| From → To | Trigger | Validation |
|-----------|---------|------------|
| Reset → Partial | `init_sequence(salt_hash)` | Must not be in use |
| Partial → Complete | `commit()` | Must be full |
| Complete → Registered | `register()` | Must be finalized |
| Registered → Reset | Drop of `RegistrationHandle` | Automatic |
| Partial → Reset | Aborted sequence | Explicit or drop |
| Complete → Reset | Invalidated | Explicit or drop |
#### Example Block Lifecycle
A sequence requests a new KV block:
1. Allocator pops from InactivePool → Block is in Reset
2. `init_sequence()` → Transitions to Partial
3. Tokens are appended → State remains Partial
4. On full → `commit()` → State becomes Complete
5. `register()` → Block is hashed and moved to Registered. Blocks can now be used for lookup.
6. On eviction or end-of-life → `drop()` of RAII handle returns block to Reset
### Lifecycle Management using RAII and Event Plane
The system uses RAII for memory lifecycle management. Every block holds metadata and registration state, and registration is coupled with an `EventManager`. On registration and drop:
- `PublishHandle` triggers Register events
- Dropping it triggers Remove events
This pattern ensures consistency for shared memory tracking across workers without requiring explicit deallocation logic. The events are propagated in the Dynamo Events plane. Any Dynamo component subscribed to the events plane can listen to these changes. Note that even the storage provider can subscribe to the events plane and create an internal prefix tree representation that is tailored and optimized for the specific platform.
### Remote Memory Integration using NIXL
The NIXL agent exposes remote memory buffers using `NixlBlockSet`, `RemoteBlocks`, and layout descriptors. Key operations include:
- `nixl_register()`: Registers memory region with NIXL runtime
- `serialize() / deserialize()`: Converts layout and memory into transferable descriptors
- `import_remote_blockset()`: Loads remote node's block layouts into the manager
- `get_remote_blocks_mutable()`: Fetches transferable memory views from another node
`RemoteBlocks` is a lightweight abstraction over shared memory for cross-node block usage (through UCX or other backends).
#### Remote Memory Registration Protocol
The following describes a bidirectional remote memory registration and layout synchronization protocol between workers (e.g., Worker 1 and Worker 2) using NIXL:
**1. Agent Creation & Memory Registration**
Each worker independently sets up a NixlAgent:
- Registers its memory regions (i.e., device memory) through `nixl_register()`
- These regions correspond to blocks managed in the local BlockPool
Once the worker registers the memory, NIXL creates remote-accessible descriptors, which it binds to the memory layout.
**2. Metadata Exchange**
After memory registration, workers exchange serialized layout metadata, encapsulated in a `SerializedNixlBlockLayout`.
Why is this step critical?
- LLM inference workloads often differ in *tensor parallel (TP)* configurations:
- Worker 1 might have TP=4, while Worker 2 has TP=8
- Even if both systems use similar `FullyContiguous` layouts, their internal slicing and alignment assumptions differ
- The metadata exchange bridges this semantic mismatch by sharing:
- LayoutConfig (num_layers, page_size, inner_dim, dtype)
- BlockSetID
- Base address + stride information (including alignment)
- Device ID + memory type (host/device)
- Once workers share metadata, each can reconstruct the layout on its side using `deserialize()`
This enables NIXL to:
- Understand where each layer/block resides
- Perform correct gather-scatter operations during RDMA-like transfers
Without this step, remote fetches would result in data corruption or misaligned tokens.
**3. Serialization & Deserialization: Making Layouts Portable**
In the serialization stage, KVBM exports and `FullyContiguous::serialize()` encodes:
- FullyContiguousConfig
- base_offset
- Physical memory descriptors (NixlStorage), including:
- Memory type (VRAM, DRAM)
- Address & size
- Device ID
The system sends this using NIXL transfer and then injects it into a KVBM scheduler state.
In the deserialization stage, `SerializedNixlBlockLayout::deserialize()` rehydrates this into:
- A fully reconstructed memory layout view
- Local representation of a remote memory slice with correct offsets and size semantics
It also enables direct access to remote memory with consistent logical semantics. This guarantees that even across different system configurations (hardware or LLM shape), both parties agree on the memory view for each KV block.
**4. Ownership Handles and Lifetime Tracking**
Memory ownership in NIXL is tightly coupled with RAII-based handles:
- When a block is registered, it returns a `PublishHandle` which wraps a `RegistrationHandle`
- On drop of this handle, an automatic Remove event is published, which:
- Deregisters the block from the NIXL layer
- Removes it from the remote block registry
- This ensures that once the block is evicted from the cache or no longer used in inference, all references are invalidated cleanly across nodes
This mechanism avoids:
- Stale memory access
- Dangling pointers on GPU or host
- Manual deregistration bugs
The system can batch and publish registration events using a Publisher, optimizing performance under high concurrency.
### Storage Backends and Pluggability
You can integrate KVBM with a storage backend by extending or wrapping `NixlEnabledStorage` to support cross-node RDMA registration. All layouts and block pools are generic over these backends, allowing for fine-grained control over memory tiers.
```mermaid
---
title: Example KVBM System Architecture
---
flowchart TD
A["Distributed Inference Engine"] --> B["Dynamo KV Block Manager"]
B --> C["NIXL Storage Agent<br/>- Volume registration<br/>- get()/put() abstraction"]
B --> D["Event Plane<br/>- NATS-based Pub/Sub<br/>- StoreEvent / RemoveEvent"]
C --> E["G4 Storage Infrastructure<br/>(SSD, Object store, etc.)<br/>- Store KV blocks"]
D --> F["Storage Provider Subscriber<br/>- Parse Events<br/>- Build fast tree/index<br/>- Optimize G4 tiering"]
```
#### NIXL Storage Interface (for Backend Integration)
The NIXL interface abstracts volume interaction and decouples it from mounting, metadata tracking, or direct system I/O. It provides:
- `registerVolume(descriptor)`: Register a logical volume for KV cache data
- `unregisterVolume()`: Cleanly deregister and release volume mappings
- `get() / put()`: Block-level APIs used by KVBM to fetch and store token blocks
These abstractions allow backends to be integrated without tying into the host's file system stack, enabling safe interaction with block devices, local filesystems, and RDMA-capable volumes. Note that these APIs are still being finalized.
#### Dynamo Event Plane (Pub/Sub Coordination Layer)
To support external storage optimizations without modifying KVBM logic, we provide an **event plane** built on NATS.io that emits lifecycle events for all block operations:
- **StoreEvent**: Emitted when a KV block is registered
- **RemoveEvent**: Emitted when a KV block is released or evicted
Each KVEvent (~100 bytes) contains:
| Field | Description |
|-------|-------------|
| `sequence_hash` | Unique identifier of the KV block |
| `prefix_hash` | Prefix grouping for query-level aggregation |
| `block_size` | Size in bytes |
| `storage_location` | Logical volume identifier |
| `event_type` | Store or Remove |
| `extra_metadata` | Reserved fields for partner-specific optimization |
For scalability, the system batches and publishes these events periodically (e.g., every ~10s, or dynamically based on system load).
#### Conceptual Design of a Storage Advisor
This section provides an overview for storage providers interested in integrating as a custom backend to KVBM. **This is optional for KVBM integration with a backend.**
External storage systems are not tightly coupled with Dynamo's execution pipeline. Instead, they passively observe KV block lifecycle events through a subscription model:
1. Storage volumes are pre-provisioned and mounted by the storage provider
2. These volumes are registered with Dynamo through the NIXL Storage Agent using `registerVolume()` APIs
3. Dynamo KV Block Manager interacts only with logical block-level APIs (`get()` and `put()`)
4. The Event Plane asynchronously broadcasts KV lifecycle events using a NATS-based pub/sub channel
5. Storage vendors implement a lightweight subscriber process that listens to these events
To enable fast lookup and dynamic tiering, storage vendors may build internal data structures using the received event stream:
- On receiving a **StoreEvent**: Insert a record into an internal prefix tree, hash map, or LRU index with `prefix_hash`, `sequence_hash`, and associated metadata
- On receiving a **RemoveEvent**: Delete or prune the corresponding record, optionally triggering cleanup or tier migration workflows
With real-time visibility into KV block usage patterns, the storage system can implement smart tiering policies:
- **Hot block promotion**: Frequently accessed KV blocks can be migrated to fast SSD volumes
- **Cold block demotion**: Infrequently used blocks can be demoted to slower storage (HDDs, cloud object storage)
- **Proactive compaction**: If block sizes or prefix patterns indicate fragmentation, the storage backend can coalesce or rewrite blocks
This design ensures that performance, resilience, and extensibility scale independently across the KV layer and the storage backend layer.
## Framework Integrations
KVBM integrates with inference frameworks (vLLM, TensorRT-LLM, SGLang) via Connector APIs to influence KV caching behavior, scheduling, and forward pass execution.
### Connector Architecture
There are two components of the interface:
- **Scheduler (Leader)**: Responsible for orchestration of KV block offload/onboard, builds metadata specifying transfer data to the workers. It also maintains hooks for handling asynchronous transfer completion.
- **Worker**: Responsible for reading metadata built by the scheduler (leader), performs async onboarding/offloading at the end of the forward pass.
![vLLM KVBM Integration](../images/kvbm-integrations.png)
*Typical integration of KVBM with inference frameworks (vLLM shown as example)*
### Onboarding Operations
![Onboarding blocks from Host to Device](../images/kvbm-onboard-host2device.png)
*Onboarding blocks from Host to Device*
![Onboarding blocks from Disk to Device](../images/kvbm-onboard-disk2device.png)
*Onboarding blocks from Disk to Device*
### Offloading Operations
![Offloading blocks from Device to Host & Disk](../images/kvbm-offload.png)
*Offloading blocks from Device to Host & Disk*
## Further Reading
- [vLLM Automatic Prefix Caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html)
- [SGLang HiCache Benchmarks](https://github.com/sgl-project/sglang/tree/main/benchmark/hicache)
- [EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal](https://arxiv.org/abs/2006.06890)
## See Also
- [KVBM Overview](README.md)
- [KVBM Guide](kvbm_guide.md)
- [NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# KVBM components
The design of the KVBM is inspired from vLLM and SGLang KV block managers but with a twist from historical memory tiering design aspired in general GPU programming. [See KVBM Reading](kvbm_reading.md). The following figure shows the internal architecture of KVBM and how it works across workers using NIXL.
![Internal architecture and key modules in the Dynamo KVBM. ](../images/kvbm-internal-arch.png)
**Internal architecture and key modules in the Dynamo KVBM**
## KvBlockManager as Orchestration Layer
The `KvBlockManager <H, D>` acts as a coordinator across memory tiers—host (CPU), device (GPU), and remote—by managing per-backend block pools and exposing consistent block lifecycle APIs. It tracks KV block locations across device memory (G1), CPU memory within and across nodes (G2), local/pooled SSDs (G3), and remote storage (G4). G1-G4 are key tiers enabled by KVBM. Critical to note that KVBM treats G4 storage as an opaque blob store, unaware of internal layout optimizations.
`KvBlockManager<H, D>` owns:
* A device-side `BlockPool<Device>`
* A host-side `BlockPool<Host>`
* A remote NIXL agent that supports communication and memory sharing across nodes
* A block set registry for remote lookup and import/export of block metadata
Implementation-wise, `KvBlockManagerState` holds the logic: it's initialized by `KvBlockManagerConfig`, which merges runtime, model, and layout configurations. `NixlOptions` injects remote awareness.
## Block Layout and Memory Mapping
Each block is a 2D array `[num_layers][page_size × inner_dim]`. `BlockLayouttrait` abstracts the memory layout. The default implementation,`FullyContiguous`, stores all layers for all blocks in one region with alignment-aware stride computation:
```none
block_stride_in_bytes = align_up(num_layers × layer_stride, alignment);
```
Both CPU and GPU pools share this memory layout, but they use storage-specific backends:
* `DeviceStorage` → CUDA device buffer
* `PinnedStorage` → page-locked host memory
* `SystemStorage` → CPU heap memory (fallback/test)
* `NixlStorage` → remote memory through NIXL RDMA handles (includes storage)
Each layout is constructed using a `LayoutConfig`, and storage is either passed directly or allocated using a StorageAllocator.
## BlockPool and Memory Pools (Active and Inactive)
Each `BlockPool<T>` (where `T` is `DeviceStorage`, `PinnedStorage`, and so forth) tracks two sub-pools:
* `ActivePool`: Contains blocks currently in use by sequences
* `InactivePool`: Recycled blocks ready for allocation; think free list
When a token block is requested (for example, `get_mutable_block()`), the allocator pops from `InactivePool`, transitions its state, and returns a writable handle. On sequence commit or eviction, the system resets blocks and returns them to the inactive pool.
The state machine (`BlockState`) that tracks the block lifecycle transitions includes:
| State | Description | Ownership | Valid Actions/Transitions |
| ----- | ----- | ----- | ----- |
| Reset | Block hasn't been initialized or was reset. No associated sequence. | Held in InactivePool, reusable | init_sequence(salt_hash) → Partial |
| Partial | Block is being filled with tokens for a new sequence. In-progress. | Owned by the sequence creator | add_token() / add_tokens() (accumulate)- commit() → Complete- reset() → Reset |
| Complete | Block is fully filled with token data but not yet visible to others. | Still owned by creator thread | register() → Registered- reset() → Reset |
| Registered | Block is finalized and visible for reuse. Available in the deduplication cache. Can use block for lookups | Shared ownership (global registry) | Auto drop() → triggers Remove event and transitions to Reset |
This table lists the valid KVBM transitions:
| From → To | Trigger | Validation |
| ----- | ----- | ----- |
| Reset → Partial | initsequence(salt_hash) | Must not be in use |
| Partial → Complete | commit() | Must be full |
| Complete → Registered | register() | Must be finalized |
| Registered → Reset | Drop of RegistrationHandle | Automatic |
| Partial → Reset | Aborted sequence | Explicit or drop |
| Complete → Reset | Invalidated | Explicit or drop |
Consider this example lifecycle of a block in the KVBM; in it, a sequence requests a new KV block:
1. Allocator pops from InactivePool → Block is in Reset
2. `init_sequence()` → Transitions to Partial
3. Tokens are appended → State remains Partial
4. On full → `commit()` → State becomes Complete
5. `register()` → Block is hashed and moved to Registered. Blocks can now be used to lookup.
6. On eviction or end-of-life → `drop()` of RAII handle returns block to Reset
## Lifecycle Management using RAII and Event Plane
The system uses RAII for memory lifecycle management. Every block holds metadata and registration state, and registration is coupled with an `EventManager`. On registration and drop:
* `PublishHandle` triggers Register events
* Dropping it triggers Remove events
This pattern ensures consistency for shared memory tracking across workers without requiring explicit deallocation logic. The events are propagated in the Dynamo Events plane. Any Dynamo component subscribed to the events plane can listen to these changes. Note that even the storage provider can subscribe to the events plane and create an internal prefix tree representation that is tailored and optimized for the specific platform.
## Remote Memory Integration using NIXL
The NIXL agent exposes remote memory buffers using `NixlBlockSet`, `RemoteBlocks`, and layout descriptors. Key operations include:
* `nixl_register()`: Registers memory region with NIXL runtime
* `serialize() / deserialize()`: Converts layout and memory into transferable descriptors
* `import_remote_blockset()`: Loads remote node's block layouts into the manager
* `get_remote_blocks_mutable()`: Fetches transferable memory views from another node
`RemoteBlocks` is a lightweight abstraction over shared memory for cross-node block usage (through UCX or other backends).
The left side of the figure in [Understanding KVBM Components](#understanding-kvbm-components) illustrates a bidirectional remote memory registration and layout synchronization protocol between workers (for example, Worker 1 and Worker 2) using NIXL. The following steps break down the process:
1. *Agent Creation & Memory Registration:*
Each worker independently sets up a NixlAgent:
* Registers its memory regions (that is, device memory) through `nixl_register()`.
* These regions correspond to blocks managed in the local BlockPool.
Once the worker registers the memory, NIXL creates remote-accessible descriptors, which it binds to the memory layout.
2. *Metadata exchange:*
After memory registration, workers exchange serialized layout metadata, encapsulated in a `SerializedNixlBlockLayout`.
Why is this step critical?
* LLM inference workloads often differ in *tensor parallel (TP)* configurations:
* Worker 1 might have TP=4, while Worker 2 has TP=8.
* Thus, even if both systems use similar `FullyContiguous` layouts, their internal slicing and alignment assumptions differ.
* The metadata exchange bridges this semantic mismatch by sharing:
* LayoutConfig (num_layers, page_size, inner_dim, dtype)
* BlockSetID
* Base address + stride information (including alignment)
* Device ID + memory type (host/device)
* Once the workers share metadata, each can reconstruct the layout on its side using deserialize().
This enables NIXL to:
* Understand where each layer/block resides
* Perform correct gather-scatter operations during RDMA-like transfers
Without this step, remote fetches would result in data corruption or misaligned tokens.
3. *Serialization & Deserialization: Making Layouts Portable*
In the serialization stage, KVBM exports and `FullyContiguous::serialize()` encodes:
* FullyContiguousConfig
* base_offset
* Physical memory descriptors (NixlStorage), including:
* Memory type (VRAM, DRAM)
* Address & size
* Device ID
The system sends this using NIXL transfer and then injects it into a KVBM scheduler state. In the deserialization stage, `SerializedNixlBlockLayout::deserialize()` rehydrates this into:
* A fully reconstructed memory layout view
* Local representation of a remote memory slice with correct offsets and size semantics
It also enables direct access to remote memory with consistent logical semantics
This guarantees that even across different system configurations (hardware or LLM shape), both parties agree on the memory view for each KV block.
4. *Ownership handles and lifetime tracking*
Memory ownership in NIXL is tightly coupled with RAII-based handles:
* When a block is registered, it returns a `PublishHandle` which wraps a `RegistrationHandle`
* On drop of this handle, an automatic Remove event is published, which:
* Deregisters the block from the NIXL layer
* Removes it from the remote block registry
* This ensures that, once the block is evicted from the cache or no longer used in inference, all references are invalidated cleanly across nodes
This mechanism avoids:
* Stale memory access
* Dangling pointers on GPU or host
* Manual deregistration bugs
The system can batch and publish registration events using a Publisher, optimizing performance under high concurrency
## Storage backends and pluggability
You can integrate KVBM with a storage backend by extending or wrapping `NixlEnabledStorage` to support cross-node RDMA registration. All layouts and block pools are generic over these backends, allowing for fine-grained control over memory tiers. We defer detailed integration guidance, since we collaborate with storage partners to simplify and standardize these integration paths.
```mermaid
---
title: Example KVBM System Architecture
---
flowchart TD
A["Distributed Inference Engine"] --> B["Dynamo KV Block Manager"]
B --> C["NIXL Storage Agent<br/>- Volume registration<br/>- get()/put() abstraction"]
B --> D["Event Plane<br/>- NATS-based Pub/Sub<br/>- StoreEvent / RemoveEvent"]
C --> E["G4 Storage Infrastructure<br/>(SSD, Object store, etc.)<br/>- Store KV blocks"]
D --> F["Storage Provider Subscriber<br/>- Parse Events<br/>- Build fast tree/index<br/>- Optimize G4 tiering"]
```
For now, the following breakdown provides a high-level understanding of how KVBM interacts with external storage using the NIXL storage interface and the Dynamo Event Plane:
### NIXL Storage Interface (for Backend Integration)
The NIXL interface abstracts volume interaction and decouples it from mounting, metadata tracking, or direct system I/O. It provides:
* registerVolume(descriptor): Register a logical volume for KV cache data.
* unregisterVolume(): Cleanly deregister and release volume mappings.
* get() / put(): Block-level APIs used by KVBM to fetch and store token blocks.
These abstractions allow backends to be integrated without tying into the host's file system stack, enabling safe interaction with block devices, local filesystems, and RDMA-capable volumes. Please note that these APIs are still being finalized.
### Dynamo Event Plane (Pub/Sub Coordination Layer)
To support external storage optimizations without modifying KVBM logic, we provide an **event plane** built on NATS.io that emits lifecycle events for all block operations. Particularly there are two events emitted.
* StoreEvent: Emitted when a KV block is registered.
* RemoveEvent: Emitted when a KV block is released or evicted.
Each KVEvent (\~100 bytes) contains:
* sequence_hash: Unique identifier of the KV block
* prefix_hash: Prefix grouping for query-level aggregation
* block_size: Size in bytes
* storage_location: Logical volume identifier
* event_type: Store or Remove
* extra_metadata: Reserved fields for partner-specific optimization
For scalability, the system batches and publishes these events periodically (for example, every \~10s, or dynamically based on system load).
### A conceptual design of a storage advisor
This section provides an overview for the storage provider who is interested in integrating as a custom backend to KVBM and providing optimized performance. **Please note, this is optional for KVBM integration with a backend.**
External storage systems are not tightly coupled with Dynamo's execution pipeline. Instead, they passively observe KV block lifecycle events through a subscription model:
* Storage volumes are pre-provisioned and mounted by the storage provider.
* These volumes are then registered with Dynamo through the NIXL Storage Agent using registerVolume() APIs. Dynamo itself does not manage mounts or provisioning.
* The Dynamo KV Block Manager interacts only with logical block-level APIs (that is, get() and put()).
* In parallel, the Event Plane asynchronously broadcasts KV lifecycle events using a NATS-based pub/sub channel.
* Storage vendors implement a lightweight subscriber process that listens to these events without interfering with the KV Manager's runtime behavior.
* This decoupling ensures that external storage systems can optimize block placement and lifecycle tracking without modifying or instrumenting the core Dynamo codebase.
Now, to enable fast lookup and dynamic tiering, storage vendors may build internal data structures using the received event stream. Here is a high level conceptual design:
* On receiving a StoreEvent, the storage system:
* Inserts a record into an internal prefix tree, hash map, or LRU index.
* This record includes the prefix_hash and sequence_hash, which logically identify the token block and its grouping.
* Associated metadata (for example, block_size, storage_location) is also captured.
* On receiving a RemoveEvent, the system:
* Deletes or prunes the corresponding record from its index.
* Optionally triggers cleanup or tier migration workflows.
This event-driven indexing allows the storage system to track which KV blocks are live and where they belong—enabling low-latency lookup, efficient space reclamation, and multi-tier coordination. With real-time visibility into KV block usage patterns, the storage system can implement smart tiering policies, such as:
* Hot block promotion: Frequently accessed KV blocks can be migrated to fast SSD volumes.
* Cold block demotion: Infrequently used blocks can be demoted to slower storage (for example, HDDs, cloud object storage).
* Proactive compaction: If block sizes or prefix patterns indicate fragmentation, the storage backend can coalesce or rewrite blocks.
These optimizations are performed entirely outside of Dynamo, with the assumption that storage providers adhere to SLA guarantees and volume availability.
Critically, this entire system is designed to be non-intrusive:
* The Dynamo KV Block Manager remains agnostic to how data is stored or optimized.
* The Event Plane doesn't block or intercept any critical path of inference.
* Storage vendors are given the freedom to innovate and optimize without requiring changes to the inference runtime.
This design ensures that performance, resilience, and extensibility scale independently across the KV layer and the storage backend layer.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# KVBM Guide
KVBM is modular and can be used standalone via `pip install kvbm` or as the memory management component in the full Dynamo stack. This guide covers installation, configuration, and deployment of the Dynamo KV Block Manager (KVBM) and other KV cache management systems.
## Table of Contents
- [Quick Start](#quick-start)
- [Run KVBM Standalone](#run-kvbm-standalone)
- [Run KVBM in Dynamo with vLLM](#run-kvbm-in-dynamo-with-vllm)
- [Run KVBM in Dynamo with TensorRT-LLM](#run-kvbm-in-dynamo-with-tensorrt-llm)
- [Run Dynamo with SGLang HiCache](#run-dynamo-with-sglang-hicache)
- [Disaggregated Serving with KVBM](#disaggregated-serving-with-kvbm)
- [Configuration](#configuration)
- [Enable and View KVBM Metrics](#enable-and-view-kvbm-metrics)
- [Benchmarking KVBM](#benchmarking-kvbm)
- [Troubleshooting](#troubleshooting)
- [Developing Locally](#developing-locally)
## Quick Start
## Run KVBM Standalone
KVBM can be used independently without using the rest of the Dynamo stack:
```bash
pip install kvbm
```
See the [support matrix](../reference/support-matrix.md) for version compatibility.
### Build from Source
To build KVBM from source, see the detailed instructions in the [KVBM bindings README](../../lib/bindings/kvbm/README.md#build-from-source).
## Run KVBM in Dynamo with vLLM
### Docker Setup
```bash
# Start up etcd for KVBM leader/worker registration and discovery
docker compose -f deploy/docker-compose.yml up -d
# Build a dynamo vLLM container (KVBM is built in by default)
./container/build.sh --framework vllm
# Launch the container
./container/run.sh --framework vllm -it --mount-workspace --use-nixl-gds
```
### Aggregated Serving
```bash
cd $DYNAMO_HOME/examples/backends/vllm
./launch/agg_kvbm.sh
```
#### Verify Deployment
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"stream": false,
"max_tokens": 10
}'
```
#### Alternative: Using Direct vllm serve
You can also use `vllm serve` directly with KVBM:
```bash
vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "kvbm.vllm_integration.connector"}' Qwen/Qwen3-0.6B
```
## Run KVBM in Dynamo with TensorRT-LLM
> [!NOTE]
> **Prerequisites:**
> - Ensure `etcd` and `nats` are running before starting
> - KVBM only supports TensorRT-LLM's PyTorch backend
> - Disable partial reuse (`enable_partial_reuse: false`) to increase offloading cache hits
> - KVBM requires TensorRT-LLM v1.2.0rc2 or newer
### Docker Setup
```bash
# Start up etcd for KVBM leader/worker registration and discovery
docker compose -f deploy/docker-compose.yml up -d
# Build a dynamo TRTLLM container (KVBM is built in by default)
./container/build.sh --framework trtllm
# Launch the container
./container/run.sh --framework trtllm -it --mount-workspace --use-nixl-gds
```
### Aggregated Serving
```bash
# Write the LLM API config
cat > "/tmp/kvbm_llm_api_config.yaml" <<EOF
backend: pytorch
cuda_graph_config: null
kv_cache_config:
enable_partial_reuse: false
free_gpu_memory_fraction: 0.80
kv_connector_config:
connector_module: kvbm.trtllm_integration.connector
connector_scheduler_class: DynamoKVBMConnectorLeader
connector_worker_class: DynamoKVBMConnectorWorker
EOF
# Start dynamo frontend
python3 -m dynamo.frontend --http-port 8000 &
# Serve the model with KVBM
python3 -m dynamo.trtllm \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--extra-engine-args /tmp/kvbm_llm_api_config.yaml &
```
#### Verify Deployment
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"stream": false,
"max_tokens": 30
}'
```
#### Alternative: Using trtllm-serve
```bash
trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml
```
## Run Dynamo with SGLang HiCache
SGLang's Hierarchical Cache (HiCache) extends KV cache storage beyond GPU memory to include host CPU memory. When using NIXL as the storage backend, HiCache integrates with Dynamo's memory infrastructure.
### Quick Start
```bash
# Start SGLang worker with HiCache enabled
python -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--host 0.0.0.0 --port 8000 \
--enable-hierarchical-cache \
--hicache-ratio 2 \
--hicache-write-policy write_through \
--hicache-storage-backend nixl
# In a separate terminal, start the frontend
python -m dynamo.frontend --http-port 8000
# Send a test request
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false,
"max_tokens": 30
}'
```
> **Learn more:** See the [SGLang HiCache Integration Guide](../integrations/sglang_hicache.md) for detailed configuration, deployment examples, and troubleshooting.
## Disaggregated Serving with KVBM
KVBM supports disaggregated serving where prefill and decode operations run on separate workers. KVBM is enabled on the prefill worker to offload KV cache.
### Disaggregated Serving with vLLM
```bash
# 1P1D - one prefill worker and one decode worker
# NOTE: requires at least 2 GPUs
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm.sh
# 2P2D - two prefill workers and two decode workers
# NOTE: requires at least 4 GPUs
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm_2p2d.sh
```
### Disaggregated Serving with TRT-LLM
```bash
# Launch prefill worker with KVBM
python3 -m dynamo.trtllm \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--extra-engine-args /tmp/kvbm_llm_api_config.yaml \
--disaggregation-mode prefill &
```
## Configuration
### Cache Tier Configuration
Configure KVBM cache tiers using environment variables:
```bash
# Option 1: CPU cache only (GPU -> CPU offloading)
export DYN_KVBM_CPU_CACHE_GB=4 # 4GB of pinned CPU memory
# Option 2: Both CPU and Disk cache (GPU -> CPU -> Disk tiered offloading)
export DYN_KVBM_CPU_CACHE_GB=4
export DYN_KVBM_DISK_CACHE_GB=8 # 8GB of disk
# [Experimental] Option 3: Disk cache only (GPU -> Disk direct offloading)
# NOTE: Experimental, may not provide optimal performance
# NOTE: Disk offload filtering not supported with this option
export DYN_KVBM_DISK_CACHE_GB=8
```
You can also specify exact block counts instead of GB:
- `DYN_KVBM_CPU_CACHE_OVERRIDE_NUM_BLOCKS`
- `DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS`
### SSD Lifespan Protection
When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy only offloads KV blocks from CPU to disk if the blocks have frequency ≥ 2. Frequency doubles on cache hit (initialized at 1) and decrements by 1 on each time decay step.
To disable disk offload filtering:
```bash
export DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER=true
```
## Enable and View KVBM Metrics
### Setup Monitoring Stack
```bash
# Start basic services (etcd & natsd), along with Prometheus and Grafana
docker compose -f deploy/docker-observability.yml up -d
```
### Enable Metrics for vLLM
```bash
DYN_KVBM_METRICS=true \
DYN_KVBM_CPU_CACHE_GB=20 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--enforce-eager \
--connector kvbm
```
### Enable Metrics for TensorRT-LLM
```bash
DYN_KVBM_METRICS=true \
DYN_KVBM_CPU_CACHE_GB=20 \
python3 -m dynamo.trtllm \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--extra-engine-args /tmp/kvbm_llm_api_config.yaml &
```
### Firewall Configuration (Optional)
```bash
# If firewall blocks KVBM metrics ports
sudo ufw allow 6880/tcp
```
### View Metrics
Access Grafana at http://localhost:3000 (default login: `dynamo`/`dynamo`) and look for the **KVBM Dashboard**.
### Available Metrics
| Metric | Description |
|--------|-------------|
| `kvbm_matched_tokens` | Number of matched tokens |
| `kvbm_offload_blocks_d2h` | Offload blocks from device to host |
| `kvbm_offload_blocks_h2d` | Offload blocks from host to disk |
| `kvbm_offload_blocks_d2d` | Offload blocks from device to disk (bypassing host) |
| `kvbm_onboard_blocks_d2d` | Onboard blocks from disk to device |
| `kvbm_onboard_blocks_h2d` | Onboard blocks from host to device |
| `kvbm_host_cache_hit_rate` | Host cache hit rate (0.0-1.0) |
| `kvbm_disk_cache_hit_rate` | Disk cache hit rate (0.0-1.0) |
## Benchmarking KVBM
Use [LMBenchmark](https://github.com/LMCache/LMBenchmark) to evaluate KVBM performance.
### Setup
```bash
git clone https://github.com/LMCache/LMBenchmark.git
cd LMBenchmark/synthetic-multi-round-qa
```
### Run Benchmark
```bash
# Synthetic multi-turn chat dataset
# Arguments: model, endpoint, output prefix, qps
./long_input_short_output_run.sh \
"Qwen/Qwen3-0.6B" \
"http://localhost:8000" \
"benchmark_kvbm" \
1
```
Average TTFT and other performance numbers will be in the output.
> **TIP:** If metrics are enabled, observe KV offloading and onboarding in the Grafana dashboard.
### Baseline Comparison
#### vLLM Baseline (without KVBM)
```bash
vllm serve Qwen/Qwen3-0.6B
```
#### TensorRT-LLM Baseline (without KVBM)
```bash
# Create config without kv_connector_config
cat > "/tmp/llm_api_config.yaml" <<EOF
backend: pytorch
cuda_graph_config: null
kv_cache_config:
enable_partial_reuse: false
free_gpu_memory_fraction: 0.80
EOF
trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/llm_api_config.yaml
```
## Troubleshooting
### No TTFT Performance Gain
**Symptom:** Enabling KVBM does not show TTFT improvement or causes performance degradation.
**Cause:** Not enough prefix cache hits on KVBM to reuse offloaded KV blocks.
**Solution:** Enable KVBM metrics and check the Grafana dashboard for `Onboard Blocks - Host to Device` and `Onboard Blocks - Disk to Device`. Large numbers of onboarded KV blocks indicate good cache reuse:
![Grafana Example](../images/kvbm_metrics_grafana.png)
### KVBM Worker Initialization Timeout
**Symptom:** KVBM fails to start when allocating large memory or disk storage.
**Solution:** Increase the leader-worker initialization timeout (default: 1800 seconds):
```bash
export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600 # 1 hour
```
### Disk Offload Fails to Start
**Symptom:** KVBM fails to start when disk offloading is enabled.
**Cause:** `fallocate()` is not supported on the filesystem (e.g., Lustre, certain network filesystems).
**Solution:** Enable disk zerofill fallback:
```bash
export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true
```
If you encounter "write all error" or EINVAL (errno 22), also try:
```bash
export DYN_KVBM_DISK_DISABLE_O_DIRECT=true
```
## Developing Locally
Inside the Dynamo container, after changing KVBM-related code (Rust and/or Python):
```bash
cd /workspace/lib/bindings/kvbm
uv pip install maturin[patchelf]
maturin build --release --out /workspace/dist
uv pip install --upgrade --force-reinstall --no-deps /workspace/dist/kvbm*.whl
```
## See Also
- [KVBM Overview](README.md)
- [KVBM Design](kvbm_design.md)
- [LMCache Integration](../integrations/lmcache_integration.md)
- [FlexKV Integration](../integrations/flexkv_integration.md)
- [SGLang HiCache](../integrations/sglang_hicache.md)
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# KVBM Integrations
KVBM Integrates with Inference frameworks (vLLM, TRTLLM, SGLang) via Connector APIs to influence KV caching behaviour, scheduling, and forward pass execution.
There are two components of the interface, Scheduler and Worker. Scheduler(leader) is responsible for the orchestration of KV block offload/onboard, builds metadata specifying transfer data to the workers. It also maintains hooks for handling asynchronous transfer completion. Worker is responsible for reading metadata built by the scheduler(leader), does async onboarding/ offloading at the end of the forward pass.
## Typical KVBM Integrations
The following figure shows the typical integration of KVBM with inference frameworks (vLLM used as an example)
![vLLM KVBM Integration ](../images/kvbm-integrations.png)
**vLLM KVBM Integration**
## How to run KVBM with Frameworks
* Instructions to [run KVBM in vLLM](vllm-setup.md)
* Instructions to [run KVBM with TRTLLM](trtllm-setup.md)
## Onboarding
![Onboarding blocks from Host to Device](../images/kvbm-onboard-host2device.png)
**Onboarding blocks from Host to Device**
![Onboarding blocks from Disk to Device](../images/kvbm-onboard-disk2device.png)
**Onboarding blocks from Disk to Device**
## Offloading
![Offloading blocks from Device to Host&Disk](../images/kvbm-offload.png)
**Offloading blocks from Device to Host&Disk**
......@@ -20,7 +20,7 @@ The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to h
It offers:
* A **unified memory API** that spans GPU memory(in future) , pinned host memory, remote RDMA-accessible memory, local or distributed pool of SSDs and remote file/object/cloud storage systems.
* A **unified memory API** that spans GPU memory (future), pinned host memory, remote RDMA-accessible memory, local or distributed pool of SSDs and remote file/object/cloud storage systems.
* Support for evolving **block lifecycles** (allocate → register → match) with event-based state transitions that storage can subscribe to.
* Integration with **NIXL**, a dynamic memory exchange layer used for remote registration, sharing, and access of memory blocks over RDMA/NVLink.
......@@ -59,12 +59,9 @@ The Dynamo KV Block Manager serves as a reference implementation that emphasizes
:hidden:
Overview <self>
Motivation <kvbm_motivation.md>
Architecture <kvbm_architecture.md>
Components <kvbm_components.md>
Design Deep Dive <kvbm_design_deepdive.md>
Integrations <kvbm_integrations.md>
KVBM in vLLM <vllm-setup.md>
KVBM in TRTLLM <trtllm-setup.md>
LMCache Integration <../backends/vllm/LMCache_Integration>
Further Reading <kvbm_reading.md>
\ No newline at end of file
Quick Start <README.md>
User Guide <kvbm_guide.md>
Design <kvbm_design.md>
LMCache Integration <../integrations/lmcache_integration.md>
FlexKV Integration <../integrations/flexkv_integration.md>
SGLang HiCache <../integrations/sglang_hicache.md>
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Motivation behind KVBM
Large language models (LLMs) and other AI workloads increasingly rely on KV caches that extend beyond GPU and local CPU memory into remote storage tiers. However, efficiently managing the lifecycle of KV blocks in remote storage presents challenges:
* Tailored for GenAI use-cases
* Lack of visibility into real-time block usage patterns.
* Need for lightweight, ownership-driven memory management over complex object stores with unneeded overheads.
* Modular and need simplified UX and to be memory safe.
* Inability to differentiate between hot (frequently accessed) and cold (infrequently accessed) blocks across the stack without intrusive application-level changes.
* Difficulty in optimizing storage placement across heterogeneous storage tiers (for example, SSDs, object storage, and cloud storage).
Conventional systems either lack dynamic feedback mechanisms or require deep integration into core storage paths, which both increases complexity and reduces portability.
## Benefits of KV Cache offloading
KV Cache offloading avoids expensive KV Cache recomputation, resulting in faster response times and a better user experience. In the end, providers benefit from higher throughput and lower cost per token, making their inference services more scalable and efficient.
## When to offload KV Cache for reuse
Offloading KV cache to CPU or storage is most effective when KV Cache exceeds GPU memory and cache reuse outweighs the overhead of transferring data. It is especially valuable in long-context, high-concurrency, or resource-constrained inference environments such as:
* **Long sessions and multi-turn conversations:** Offloading preserves large prompt prefixes, avoids recomputation, and improves first-token latency and throughput.
* **High concurrency (future):** Idle or partial conversations can be moved out of GPU memory, allowing active requests to proceed without hitting memory limits.
* **Shared or repeated content (future):** Reuse across users or sessions (for example, system prompts and templates) increases cache hits, especially with remote or cross-instance sharing.
* **Memory- or cost-constrained deployments:** Offloading to RAM or SSD reduces GPU demand, allowing longer prompts or more users without adding hardware.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment