"lib/llm/vscode:/vscode.git/clone" did not exist on "14af074ecd91dc17612adca16e66dd26e1f5e969"
Commit 9fe03dd8, authored by Alec, committed by GitHub

docs: vLLM README container instructions and KV offloading page (#6793)


Signed-off-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
parent 4d7c9845
@@ -21,15 +21,18 @@ uv pip install "ai-dynamo[vllm]"
This installs Dynamo with the compatible vLLM version.
### Development Setup
---
For development, use the [devcontainer](https://github.com/ai-dynamo/dynamo/tree/main/.devcontainer), which has all dependencies pre-installed.
### Container
---
We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts):
<Accordion title="Build and run container">
```bash
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:<version>
./container/run.sh -it --framework VLLM --image nvcr.io/nvidia/ai-dynamo/vllm-runtime:<version>
```
We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
<Accordion title="Build from source">
```bash
python container/render.py --framework vllm --output-short-filename
@@ -42,6 +45,10 @@ docker build -f container/rendered.Dockerfile -t dynamo:latest-vllm .
</Accordion>
### Development Setup
For development, use the [devcontainer](https://github.com/ai-dynamo/dynamo/tree/main/.devcontainer), which has all dependencies pre-installed.
## Feature Support Matrix
| Feature | Status | Notes |
@@ -78,6 +85,7 @@ bash launch/agg.sh
- **[Reference Guide](vllm-reference-guide.md)**: Configuration, arguments, and operational details
- **[Examples](vllm-examples.md)**: All deployment patterns with launch scripts
- **[KV Cache Offloading](vllm-kv-offloading.md)**: KVBM, LMCache, and FlexKV integrations
- **[Observability](vllm-observability.md)**: Metrics and monitoring
- **[vLLM-Omni](vllm-omni.md)**: Multimodal model serving
- **[Kubernetes Deployment](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)**: Kubernetes deployment guide
......
# FlexKV Integration in Dynamo
## Introduction
[FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed runtime for KV cache offloading. It acts as a unified KV caching layer for inference engines like vLLM, TensorRT-LLM and SGLang.
## Usage
Enable FlexKV by setting the `DYNAMO_USE_FLEXKV` environment variable:
```bash
export DYNAMO_USE_FLEXKV=1
```
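The aggregated launch script changed in this commit sets `DYNAMO_USE_FLEXKV=1` inline on the worker command rather than exporting it. The sketch below contrasts the two scopes. It uses a plain `sh -c` child as a stand-in for the worker process (an assumption for illustration; no Dynamo process is started).

```bash
#!/usr/bin/env bash
# Sketch only: `sh -c` stands in for the worker process; nothing from Dynamo is launched.

# Start from a clean slate so the comparison is deterministic.
unset DYNAMO_USE_FLEXKV

# Inline assignment: the variable exists only in the environment of this one child.
child_value="$(DYNAMO_USE_FLEXKV=1 sh -c 'printf "%s" "${DYNAMO_USE_FLEXKV:-unset}"')"

# The calling shell itself is unaffected by the inline assignment.
parent_value="${DYNAMO_USE_FLEXKV:-unset}"

echo "child sees:  ${child_value}"
echo "parent sees: ${parent_value}"
```

Use `export` when every later command in the shell should see the variable, and the inline form when only a single worker should run with FlexKV enabled.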
### Aggregated Serving
Pass the FlexKV connector configuration to the worker with the `--kv-transfer-config` flag:
```bash
python -m dynamo.vllm --model $YOUR_MODEL --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
```
Refer to [`agg_flexkv.sh`](../../../examples/backends/vllm/launch/agg_flexkv.sh) for quick setup.
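Because the `--kv-transfer-config` value is a JSON string, quoting mistakes are easy to make when it is typed by hand. As a minimal sketch, the payload can be built once and reused. The helper name `kv_transfer_config` is hypothetical and not part of Dynamo or FlexKV; only the connector name and role come from the command above.

```bash
#!/usr/bin/env bash
# Sketch only: kv_transfer_config is a hypothetical helper, not a Dynamo or FlexKV API.
# It prints the JSON payload shown in the command above.
kv_transfer_config() {
    local connector="${1:-FlexKVConnectorV1}"  # connector name from the docs above
    local role="${2:-kv_both}"                 # role used in the aggregated example
    printf '{"kv_connector":"%s","kv_role":"%s"}' "$connector" "$role"
}

KV_CONFIG="$(kv_transfer_config)"
echo "$KV_CONFIG"
```

The result can then be passed as `--kv-transfer-config "$KV_CONFIG"`; the double quotes keep the JSON as a single argument.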
### Aggregated Serving with Peer Node KV Cache Reuse
Refer to the FlexKV [README](https://github.com/taco-project/FlexKV/blob/main/docs/dist_reuse/README_en.md) for instructions on setting up peer KV cache reuse.
### Disaggregated Serving
Refer to [`disagg_flexkv.sh`](../../../examples/backends/vllm/launch/disagg_flexkv.sh) for quick setup.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: KV Cache Offloading
subtitle: CPU and disk offloading integrations for vLLM in Dynamo
---
# KV Cache Offloading
Dynamo supports multiple KV cache offloading backends for vLLM, allowing you to extend effective KV cache capacity beyond GPU memory using CPU RAM and disk storage. Each backend integrates through vLLM's connector interface and works with both aggregated and disaggregated serving.
| Backend | Source |
| ----------------------- | ------------------------------------------------ |
| **[KVBM](#kvbm)** | [Dynamo](../../components/kvbm/README.md) |
| **[LMCache](#lmcache)** | [GitHub](https://github.com/LMCache/LMCache) |
| **[FlexKV](#flexkv)** | [GitHub](https://github.com/taco-project/FlexKV) |
## KVBM
[KVBM](../../components/kvbm/README.md) (KV Block Manager) is Dynamo's built-in KV cache offloading system. It provides a three-layer architecture (LLM runtime, logical block management, NIXL transport) with support for CPU and disk cache tiers, and integrates natively with Dynamo's KV-aware routing and disaggregated serving.
| Deployment | Launch Script |
| -------------------------- | --------------------------------------------------------------------------------------- |
| Aggregated | [`agg_kvbm.sh`](../../../examples/backends/vllm/launch/agg_kvbm.sh) |
| Aggregated + KV routing | [`agg_kvbm_router.sh`](../../../examples/backends/vllm/launch/agg_kvbm_router.sh) |
| Disaggregated (1P1D) | [`disagg_kvbm.sh`](../../../examples/backends/vllm/launch/disagg_kvbm.sh) |
| Disaggregated (2P2D) | [`disagg_kvbm_2p2d.sh`](../../../examples/backends/vllm/launch/disagg_kvbm_2p2d.sh) |
| Disaggregated + KV routing | [`disagg_kvbm_router.sh`](../../../examples/backends/vllm/launch/disagg_kvbm_router.sh) |
For configuration details, see the [KVBM Guide](../../components/kvbm/kvbm-guide.md).
## LMCache
[LMCache](https://github.com/LMCache/LMCache) is an open-source KV cache engine that provides prefill-once, reuse-everywhere caching with multi-level storage backends (CPU RAM, local storage, Redis, GDS, InfiniStore/Mooncake).
| Deployment | Launch Script |
| --------------------------------- | --------------------------------------------------------------------------------------------- |
| Aggregated | [`agg_lmcache.sh`](../../../examples/backends/vllm/launch/agg_lmcache.sh) |
| Aggregated (multiprocess metrics) | [`agg_lmcache_multiproc.sh`](../../../examples/backends/vllm/launch/agg_lmcache_multiproc.sh) |
| Disaggregated | [`disagg_lmcache.sh`](../../../examples/backends/vllm/launch/disagg_lmcache.sh) |
For configuration details, see the [LMCache Integration Guide](../../integrations/lmcache-integration.md).
## FlexKV
[FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed KV cache runtime developed by Tencent Cloud's TACO team. It supports multi-level caching (GPU, CPU, SSD), distributed KV cache reuse across nodes, and high-performance I/O via io_uring and GPUDirect Storage.
| Deployment | Launch Script |
| ----------------------- | ------------------------------------------------------------------------------------- |
| Aggregated | [`agg_flexkv.sh`](../../../examples/backends/vllm/launch/agg_flexkv.sh) |
| Aggregated + KV routing | [`agg_flexkv_router.sh`](../../../examples/backends/vllm/launch/agg_flexkv_router.sh) |
| Disaggregated | [`disagg_flexkv.sh`](../../../examples/backends/vllm/launch/disagg_flexkv.sh) |
For configuration details, see the [FlexKV Integration Guide](../../integrations/flexkv-integration.md).
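The launch scripts for all three backends follow the same `<mode>_<backend>.sh` naming pattern. As a rough sketch, the plain aggregated and disaggregated rows of the tables above can be resolved with a small helper. The function `launch_script` is hypothetical and not shipped with Dynamo, and the repository-root-relative path is an assumption derived from the table links.

```bash
#!/usr/bin/env bash
# Sketch only: launch_script is a hypothetical helper, not shipped with Dynamo.
# It covers just the plain aggregated and disaggregated rows of the tables above.
launch_script() {
    local backend="$1"
    local mode="${2:-agg}"

    case "$backend" in
        kvbm|lmcache|flexkv) ;;
        *) echo "unknown backend: ${backend}" >&2; return 1 ;;
    esac
    case "$mode" in
        agg|disagg) ;;
        *) echo "unknown mode: ${mode}" >&2; return 1 ;;
    esac

    # Path is relative to the repository root (assumed from the links in the tables).
    printf 'examples/backends/vllm/launch/%s_%s.sh\n' "$mode" "$backend"
}

launch_script flexkv disagg
```

Variants such as KV routing, 2P2D, and multiprocess metrics have their own scripts and are deliberately left out of this sketch.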
## See Also
- **[KVBM Design](../../design-docs/kvbm-design.md)**: Architecture and design of Dynamo's built-in KV cache offloading
- **[KV-Aware Routing](../../components/router/router-guide.md)**: Routing requests based on KV cache state
- **[Disaggregated Serving](../../design-docs/disagg-serving.md)**: Prefill/decode separation architecture
@@ -306,6 +306,8 @@ navigation:
path: backends/vllm/vllm-reference-guide.md
- page: Examples
path: backends/vllm/vllm-examples.md
- page: KV Cache Offloading
path: backends/vllm/vllm-kv-offloading.md
- page: Observability
path: backends/vllm/vllm-observability.md
- page: vLLM-Omni
......
@@ -4,10 +4,32 @@
set -e
trap 'echo Cleaning up...; kill 0' EXIT
MODEL="Qwen/Qwen3-0.6B"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Aggregated Serving + FlexKV (1 GPU)"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Explain why Roger Federer is considered one of the greatest tennis players of all time\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
echo "=========================================="
# Run frontend
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
python -m dynamo.frontend &
# Run worker with FlexKV
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
python -m dynamo.vllm --model "$MODEL" --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
@@ -5,8 +5,28 @@ set -e
trap 'echo Cleaning up...; kill 0' EXIT
MODEL="Qwen/Qwen3-0.6B"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Aggregated Serving + FlexKV + KV Routing (2 GPUs)"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Explain why Roger Federer is considered one of the greatest tennis players of all time\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
echo "=========================================="
# Run frontend and KV router
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
python -m dynamo.frontend \
--router-mode kv \
--router-reset-states &
......
@@ -5,8 +5,28 @@ set -e
trap 'echo Cleaning up...; kill 0' EXIT
MODEL="Qwen/Qwen3-0.6B"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
echo "=========================================="
echo "Launching Disaggregated Serving + FlexKV (2 GPUs)"
echo "=========================================="
echo "Model: $MODEL"
echo "Frontend: http://localhost:$HTTP_PORT"
echo "=========================================="
echo ""
echo "Example test command:"
echo ""
echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Explain why Roger Federer is considered one of the greatest tennis players of all time\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
echo "=========================================="
# Run frontend
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
python -m dynamo.frontend &
# Run decode worker without FlexKV
......