# Running KVBM in TensorRT-LLM

This guide explains how to use KVBM (KV Block Manager) to manage the KV cache and perform KV offloading in TensorRT-LLM (trtllm).

To learn what KVBM is, see the [KVBM introduction](https://docs.nvidia.com/dynamo/latest/architecture/kvbm_intro.html).

> [!Note]
> - Ensure that `etcd` and `nats` are running before starting.
> - KVBM does not currently support CUDA graphs in TensorRT-LLM.
> - KVBM only supports TensorRT-LLM’s PyTorch backend.
> - To enable disk cache offloading, you must first enable CPU memory cache offloading.
> - Disable partial reuse (`enable_partial_reuse: false`) in the LLM API config’s `kv_connector_config` to increase offloading cache hits.
> - KVBM requires TensorRT-LLM at commit ce580ce4f52af3ad0043a800b3f9469e1f1109f6 or newer.
> - Enabling KVBM metrics with TensorRT-LLM is still a work in progress.

## Quick Start

To use KVBM in TensorRT-LLM, follow the steps below:

```bash
# Start etcd for KVBM leader/worker registration and discovery.
docker compose -f deploy/docker-compose.yml up -d

# Build a container that includes TensorRT-LLM and KVBM.
# Note: KVBM integration is only available in TensorRT-LLM commit
# ce580ce4f52af3ad0043a800b3f9469e1f1109f6 or newer.
./container/build.sh --framework trtllm --tensorrtllm-commit ce580ce4f52af3ad0043a800b3f9469e1f1109f6 --enable-kvbm

# Launch the container.
./container/run.sh --framework trtllm -it --mount-workspace --use-nixl-gds

# Enable KV offloading to CPU memory.
# 60 means 60 GB of pinned CPU memory will be used.
export DYN_KVBM_CPU_CACHE_GB=60

# Enable KV offloading to disk.
# Note: disk cache offloading requires CPU memory cache offloading to be enabled first.
# 20 means 20 GB of disk will be used.
export DYN_KVBM_DISK_CACHE_GB=20

# Allocating memory and disk storage can take some time.
# We recommend setting a higher timeout for leader–worker initialization.
# 1200 means a timeout of 1200 seconds.
export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=1200
```

```bash
# Write an example LLM API config.
# Note: Disable partial reuse ("enable_partial_reuse: false") in the LLM API config’s
# "kv_connector_config" to increase offloading cache hits.
cat > "/tmp/kvbm_llm_api_config.yaml" <