# Running KVBM in TensorRT-LLM

This guide explains how to leverage KVBM (KV Block Manager) to manage KV cache and do KV offloading in TensorRT-LLM (trtllm).

To learn what KVBM is, please check [here](https://docs.nvidia.com/dynamo/latest/architecture/kvbm_intro.html).

> [!Note]
> - Ensure that `etcd` and `nats` are running before starting.
> - KVBM does not currently support CUDA graphs in TensorRT-LLM.
> - KVBM only supports TensorRT-LLM’s PyTorch backend.
> - To enable disk cache offloading, you must first enable CPU memory cache offloading.
> - Disable partial reuse (`enable_partial_reuse: false`) in the LLM API config’s `kv_connector_config` to increase offloading cache hits.
> - KVBM requires TensorRT-LLM v1.1.0rc5 or newer.
> - Enabling KVBM metrics with TensorRT-LLM is still a work in progress.

## Quick Start

To use KVBM in TensorRT-LLM, follow the steps below:

```bash
# Start etcd for KVBM leader/worker registration and discovery.
docker compose -f deploy/docker-compose.yml up -d

# Build a container that includes TensorRT-LLM and KVBM.
./container/build.sh --framework trtllm --enable-kvbm

# Launch the container.
./container/run.sh --framework trtllm -it --mount-workspace --use-nixl-gds

# Enable KV offloading to CPU memory.
# 4 means 4 GB of pinned CPU memory is used.
export DYN_KVBM_CPU_CACHE_GB=4

# Enable KV offloading to disk.
# Note: To enable disk cache offloading, you must first enable CPU memory cache offloading.
# 8 means 8 GB of disk space is used.
export DYN_KVBM_DISK_CACHE_GB=8

# Allocating memory and disk storage can take some time.
# We recommend setting a higher timeout for leader–worker initialization.
# 1200 means a timeout of 1200 seconds.
export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=1200
```

```bash
# Write an example LLM API config.
# Note: Disable partial reuse ("enable_partial_reuse: false") in the LLM API config’s "kv_connector_config" to increase offloading cache hits.
cat > "/tmp/kvbm_llm_api_config.yaml" <