docs: vLLM README container instructions and KV offloading page (#6793)

Signed-off-by: alec-flowers <aflowers@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

docs: vLLM README container instructions and KV offloading page (#6793)
Signed-off-by: alec-flowers <aflowers@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
9fe03dd8 · Alec · GitHub · 4d7c9845 · 9fe03dd8 · 4d7c9845
Unverified Commit 9fe03dd8 authored Mar 02, 2026 by Alec Committed by GitHub Mar 03, 2026
7 changed files
--- a/docs/backends/vllm/README.md
+++ b/docs/backends/vllm/README.md
@@ -21,15 +21,18 @@ uv pip install "ai-dynamo[vllm]"

 This installs Dynamo with the compatible vLLM version.

-### Development Setup
+---

-For development, use the [devcontainer](https://github.com/ai-dynamo/dynamo/tree/main/.devcontainer) which has all dependencies pre-installed.
+### Container

---
+We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts):

-<Accordion title="Build and run container">
+```bash
+docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:<version>
+./container/run.sh -it --framework VLLM --image nvcr.io/nvidia/ai-dynamo/vllm-runtime:<version>
+```

-We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
+<Accordion title="Build from source">

 ```bash
 python container/render.py --framework vllm --output-short-filename
@@ -42,6 +45,10 @@ docker build -f container/rendered.Dockerfile -t dynamo:latest-vllm .

 </Accordion>

+### Development Setup
+
+For development, use the [devcontainer](https://github.com/ai-dynamo/dynamo/tree/main/.devcontainer) which has all dependencies pre-installed.
+
 ## Feature Support Matrix

 | Feature | Status | Notes |
@@ -78,6 +85,7 @@ bash launch/agg.sh

 - **[Reference Guide](vllm-reference-guide.md)**: Configuration, arguments, and operational details
 - **[Examples](vllm-examples.md)**: All deployment patterns with launch scripts
+- **[KV Cache Offloading](vllm-kv-offloading.md)**: KVBM, LMCache, and FlexKV integrations
 - **[Observability](vllm-observability.md)**: Metrics and monitoring
 - **[vLLM-Omni](vllm-omni.md)**: Multimodal model serving
 - **[Kubernetes Deployment](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)**: Kubernetes deployment guide

--- a/docs/backends/vllm/flexkv_integration.md
+++ b/docs/backends/vllm/flexkv_integration.md
-# FlexKV Integration in Dynamo
-
-## Introduction
-
-[FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed runtime for KV cache offloading. It acts as a unified KV caching layer for inference engines like vLLM, TensorRT-LLM and SGLang.
-
-## Usage
-
-Enable FlexKV by setting the `DYNAMO_USE_FLEXKV` environment variable:
-
-```bash
-export DYNAMO_USE_FLEXKV=1
-```
-
-### Aggregated Serving
-
-Use FlexKV with the `--kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'` flag:
-
-```bash
-python -m dynamo.vllm --model $YOUR_MODEL --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
-```
-
-Refer to [`agg_flexkv.sh`](../../../examples/backends/vllm/launch/agg_flexkv.sh) for quick setup.
-
-### Aggregated Serving with Peer Node KV Cache Reuse
-
-Refer to our project [README](https://github.com/taco-project/FlexKV/blob/main/docs/dist_reuse/README_en.md) for instructions on setting up peer KV cache reuse.
-
-### Disaggregated Serving
-
-Refer to [`disagg_flexkv.sh`](../../../examples/backends/vllm/launch/disagg_flexkv.sh) for quick setup.
--- a/docs/backends/vllm/vllm-kv-offloading.md
+++ b/docs/backends/vllm/vllm-kv-offloading.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: KV Cache Offloading
+subtitle: CPU and disk offloading integrations for vLLM in Dynamo
+---
+
+# KV Cache Offloading
+
+Dynamo supports multiple KV cache offloading backends for vLLM, allowing you to extend effective KV cache capacity beyond GPU memory using CPU RAM and disk storage. Each backend integrates through vLLM's connector interface and works with both aggregated and disaggregated serving.
+
+
+| Backend                 | Source                                           |
+| ----------------------- | ------------------------------------------------ |
+| **[KVBM](#kvbm)**       | [Dynamo](../../components/kvbm/README.md)        |
+| **[LMCache](#lmcache)** | [GitHub](https://github.com/LMCache/LMCache)     |
+| **[FlexKV](#flexkv)**   | [GitHub](https://github.com/taco-project/FlexKV) |
+
+
+## KVBM
+
+[KVBM](../../components/kvbm/README.md) (KV Block Manager) is Dynamo's built-in KV cache offloading system. It provides a three-layer architecture (LLM runtime, logical block management, NIXL transport) with support for CPU and disk cache tiers, and integrates natively with Dynamo's KV-aware routing and disaggregated serving.
+
+
+| Deployment                 | Launch Script                                                                           |
+| -------------------------- | --------------------------------------------------------------------------------------- |
+| Aggregated                 | [`agg_kvbm.sh`](../../../examples/backends/vllm/launch/agg_kvbm.sh)                     |
+| Aggregated + KV routing    | [`agg_kvbm_router.sh`](../../../examples/backends/vllm/launch/agg_kvbm_router.sh)       |
+| Disaggregated (1P1D)       | [`disagg_kvbm.sh`](../../../examples/backends/vllm/launch/disagg_kvbm.sh)               |
+| Disaggregated (2P2D)       | [`disagg_kvbm_2p2d.sh`](../../../examples/backends/vllm/launch/disagg_kvbm_2p2d.sh)     |
+| Disaggregated + KV routing | [`disagg_kvbm_router.sh`](../../../examples/backends/vllm/launch/disagg_kvbm_router.sh) |
+
+
+For configuration details, see the [KVBM Guide](../../components/kvbm/kvbm-guide.md).
+
+## LMCache
+
+[LMCache](https://github.com/LMCache/LMCache) is an open-source KV cache engine that provides prefill-once, reuse-everywhere caching with multi-level storage backends (CPU RAM, local storage, Redis, GDS, InfiniStore/Mooncake).
+
+
+| Deployment                        | Launch Script                                                                                 |
+| --------------------------------- | --------------------------------------------------------------------------------------------- |
+| Aggregated                        | [`agg_lmcache.sh`](../../../examples/backends/vllm/launch/agg_lmcache.sh)                     |
+| Aggregated (multiprocess metrics) | [`agg_lmcache_multiproc.sh`](../../../examples/backends/vllm/launch/agg_lmcache_multiproc.sh) |
+| Disaggregated                     | [`disagg_lmcache.sh`](../../../examples/backends/vllm/launch/disagg_lmcache.sh)               |
+
+
+For configuration details, see the [LMCache Integration Guide](../../integrations/lmcache-integration.md).
+
+## FlexKV
+
+[FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed KV cache runtime developed by Tencent Cloud's TACO team. It supports multi-level caching (GPU, CPU, SSD), distributed KV cache reuse across nodes, and high-performance I/O via io_uring and GPUDirect Storage.
+
+
+| Deployment              | Launch Script                                                                         |
+| ----------------------- | ------------------------------------------------------------------------------------- |
+| Aggregated              | [`agg_flexkv.sh`](../../../examples/backends/vllm/launch/agg_flexkv.sh)               |
+| Aggregated + KV routing | [`agg_flexkv_router.sh`](../../../examples/backends/vllm/launch/agg_flexkv_router.sh) |
+| Disaggregated           | [`disagg_flexkv.sh`](../../../examples/backends/vllm/launch/disagg_flexkv.sh)         |
+
+
+For configuration details, see the [FlexKV Integration Guide](../../integrations/flexkv-integration.md).
+
+## See Also
+
+- **[KVBM Design](../../design-docs/kvbm-design.md)**: Architecture and design of Dynamo's built-in KV cache offloading
+- **[KV-Aware Routing](../../components/router/router-guide.md)**: Routing requests based on KV cache state
+- **[Disaggregated Serving](../../design-docs/disagg-serving.md)**: Prefill/decode separation architecture
+
--- a/docs/index.yml
+++ b/docs/index.yml
@@ -306,6 +306,8 @@ navigation:
            path: backends/vllm/vllm-reference-guide.md
          - page: Examples
            path: backends/vllm/vllm-examples.md
+          - page: KV Cache Offloading
+            path: backends/vllm/vllm-kv-offloading.md
          - page: Observability
            path: backends/vllm/vllm-observability.md
          - page: vLLM-Omni

--- a/examples/backends/vllm/launch/agg_flexkv.sh
+++ b/examples/backends/vllm/launch/agg_flexkv.sh
@@ -4,10 +4,32 @@
 set -e
 trap 'echo Cleaning up...; kill 0' EXIT

+MODEL="Qwen/Qwen3-0.6B"
+HTTP_PORT="${DYN_HTTP_PORT:-8000}"
+echo "=========================================="
+echo "Launching Aggregated Serving + FlexKV (1 GPU)"
+echo "=========================================="
+echo "Model:       $MODEL"
+echo "Frontend:    http://localhost:$HTTP_PORT"
+echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"model\": \"${MODEL}\","
+echo "      \"messages\": [{\"role\": \"user\", \"content\": \"Explain why Roger Federer is considered one of the greatest tennis players of all time\"}],"
+echo "      \"max_tokens\": 32"
+echo "    }'"
+echo ""
+echo "=========================================="
+
 # Run frontend
+# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
 python -m dynamo.frontend &

 # Run worker with FlexKV
 DYNAMO_USE_FLEXKV=1 \
 FLEXKV_CPU_CACHE_GB=32 \
-  python -m dynamo.vllm --model Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
+  python -m dynamo.vllm --model "$MODEL" --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
--- a/examples/backends/vllm/launch/agg_flexkv_router.sh
+++ b/examples/backends/vllm/launch/agg_flexkv_router.sh
@@ -5,8 +5,28 @@ set -e
 trap 'echo Cleaning up...; kill 0' EXIT

 MODEL="Qwen/Qwen3-0.6B"
+HTTP_PORT="${DYN_HTTP_PORT:-8000}"
+echo "=========================================="
+echo "Launching Aggregated Serving + FlexKV + KV Routing (2 GPUs)"
+echo "=========================================="
+echo "Model:       $MODEL"
+echo "Frontend:    http://localhost:$HTTP_PORT"
+echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"model\": \"${MODEL}\","
+echo "      \"messages\": [{\"role\": \"user\", \"content\": \"Explain why Roger Federer is considered one of the greatest tennis players of all time\"}],"
+echo "      \"max_tokens\": 32"
+echo "    }'"
+echo ""
+echo "=========================================="

 # Run frontend and KV router
+# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
 python -m dynamo.frontend \
    --router-mode kv \
    --router-reset-states &

--- a/examples/backends/vllm/launch/disagg_flexkv.sh
+++ b/examples/backends/vllm/launch/disagg_flexkv.sh
@@ -5,8 +5,28 @@ set -e
 trap 'echo Cleaning up...; kill 0' EXIT

 MODEL="Qwen/Qwen3-0.6B"
+HTTP_PORT="${DYN_HTTP_PORT:-8000}"
+echo "=========================================="
+echo "Launching Disaggregated Serving + FlexKV (2 GPUs)"
+echo "=========================================="
+echo "Model:       $MODEL"
+echo "Frontend:    http://localhost:$HTTP_PORT"
+echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"model\": \"${MODEL}\","
+echo "      \"messages\": [{\"role\": \"user\", \"content\": \"Explain why Roger Federer is considered one of the greatest tennis players of all time\"}],"
+echo "      \"max_tokens\": 32"
+echo "    }'"
+echo ""
+echo "=========================================="

 # Run frontend
+# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
 python -m dynamo.frontend &

 # Run decode worker without FlexKV