# Dynamo KVBM

The Dynamo KVBM is a distributed KV-cache block management system designed for scalable LLM inference. It cleanly separates memory management from inference runtimes (vLLM, TensorRT-LLM, and SGLang), enabling GPU↔CPU↔Disk/Remote tiering, asynchronous block offload/onboard, and efficient block reuse.

![A block diagram showing a layered architecture view of Dynamo KV Block manager.](../../../docs/images/kvbm-architecture.png)

## Feature Highlights

- **Distributed KV-Cache Management:** Unified GPU↔CPU↔Disk↔Remote tiering for scalable LLM inference.
- **Async Offload & Reuse:** Seamlessly move KV blocks between memory tiers using GDS-accelerated transfers powered by NIXL, without recomputation.
- **Runtime-Agnostic:** Works out-of-the-box with vLLM, TensorRT-LLM, and SGLang via lightweight connectors.
- **Memory-Safe & Modular:** RAII lifecycle and pluggable design for reliability, portability, and backend extensibility.

## Build and Installation

The pip wheel is built through a Docker build process:

```bash
# Build the Docker image with KVBM enabled (from the dynamo repo root)
./container/build.sh --framework none --enable-kvbm --tag local-kvbm
```

Once built, you can either:

**Option 1: Run and use the container directly**

```bash
./container/run.sh --framework none -it
```

**Option 2: Extract the wheel file to your local filesystem**

```bash
# Create a temporary container from the built image
docker create --name temp-kvbm-container local-kvbm:latest

# Copy the KVBM wheel to your current directory
docker cp temp-kvbm-container:/opt/dynamo/wheelhouse/ ./dynamo_wheelhouse

# Clean up the temporary container
docker rm temp-kvbm-container

# Install the wheel locally
pip install ./dynamo_wheelhouse/kvbm*.whl
```

Note that the default pip wheel built is not compatible with CUDA 13 at the moment.

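
When extracting the wheel (Option 2), the `kvbm*.whl` glob passed to `pip install` can match nothing, or more than one file if the wheelhouse holds several builds. The following is a minimal sketch of a guard for that step. The helper name `find_kvbm_wheel` is hypothetical and not part of the Dynamo tooling; it only assumes the `./dynamo_wheelhouse` directory and the `kvbm*.whl` pattern shown above.

```bash
#!/usr/bin/env bash
# Sketch only: `find_kvbm_wheel` is a hypothetical helper, not shipped with Dynamo.
# It prints the path of the single kvbm*.whl file in a directory, or fails
# with a distinct exit code when there is no match (1) or several (2).
find_kvbm_wheel() {
    local dir="${1:-./dynamo_wheelhouse}"
    local wheels=()
    local f
    # Collect files matching the same pattern used by the pip install step.
    for f in "$dir"/kvbm*.whl; do
        [ -e "$f" ] && wheels+=("$f")
    done
    if [ "${#wheels[@]}" -eq 0 ]; then
        echo "no kvbm wheel found in $dir" >&2
        return 1
    fi
    if [ "${#wheels[@]}" -gt 1 ]; then
        echo "multiple kvbm wheels found in $dir; pick one explicitly" >&2
        return 2
    fi
    printf '%s\n' "${wheels[0]}"
}
```

One possible use is `pip install "$(find_kvbm_wheel ./dynamo_wheelhouse)"`, which installs exactly one wheel and fails instead of passing an unmatched glob to pip.
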
## Integrations

### Environment Variables

| Variable | Description | Default |
|-----------|--------------|----------|
| `DYN_KVBM_CPU_CACHE_GB` | CPU pinned memory cache size (GB) | required |
| `DYN_KVBM_DISK_CACHE_GB` | SSD disk/storage system cache size (GB) | optional |
| `DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS` | Timeout (in seconds) for the KVBM leader and worker to synchronize and allocate the required memory and storage. Increase this value when allocating large amounts of memory or storage. | `120` |
| `DYN_KVBM_METRICS` | Enable the metrics endpoint | `false` |
| `DYN_KVBM_METRICS_PORT` | Metrics port | `6880` |
| `DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER` | Disable disk offload filtering, which removes SSD lifespan protection | `false` |

### vLLM

```bash
DYN_KVBM_CPU_CACHE_GB=100 vllm serve \
  --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both","kv_connector_module_path":"kvbm.vllm_integration.connector"}' \
  Qwen/Qwen3-8B
```

For details on integration with Dynamo, disaggregated serving support, and benchmarking, see [vllm-setup](../../../docs/kvbm/vllm-setup.md).

### TensorRT-LLM

```bash
cat >/tmp/kvbm_llm_api_config.yaml <