README.md 3.6 KB
Newer Older
1
2
3
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
4
title: vLLM
5
6
---

7
# LLM Deployment using vLLM
8

9
Dynamo vLLM integrates [vLLM](https://github.com/vllm-project/vllm) engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation while maintaining full compatibility with vLLM's native engine arguments. Dynamo leverages vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
10

11
## Installation
12

13
### Install Latest Release
14

15
We recommend using [uv](https://github.com/astral-sh/uv) to install:
16

17
18
19
20
```bash
uv venv --python 3.12 --seed
uv pip install "ai-dynamo[vllm]"
```
21

22
This installs Dynamo with the compatible vLLM version.
23

24
---
25

26
### Container
27

28
We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts):
29

30
31
32
33
```bash
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:<version>
./container/run.sh -it --framework VLLM --image nvcr.io/nvidia/ai-dynamo/vllm-runtime:<version>
```
34

35
<Accordion title="Build from source">
36
37

```bash
38
python container/render.py --framework vllm --output-short-filename
39
docker build -f container/rendered.Dockerfile -t dynamo:latest-vllm .
40
41
42
43
44
45
```

```bash
./container/run.sh -it --framework VLLM [--mount-workspace]
```

46
</Accordion>
47

48
49
50
51
### Development Setup

For development, use the [devcontainer](https://github.com/ai-dynamo/dynamo/tree/main/.devcontainer) which has all dependencies pre-installed.

52
## Feature Support Matrix
53

54
55
56
57
58
59
60
| Feature | Status | Notes |
|---------|--------|-------|
| [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ | Prefill/decode separation with NIXL KV transfer |
| [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../components/planner/planner-guide.md) | ✅ | |
| [**KVBM**](../../components/kvbm/README.md) | ✅ | |
| [**LMCache**](../../integrations/lmcache-integration.md) | ✅ | |
61
| [**FlexKV**](../../integrations/flexkv-integration.md) | ✅ | |
62
63
64
65
66
67
| [**Multimodal Support**](vllm-omni.md) | ✅ | Via vLLM-Omni integration |
| [**Observability**](vllm-observability.md) | ✅ | Metrics and monitoring |
| **WideEP** | ✅ | Support for DeepEP |
| **DP Rank Routing** | ✅ | [Hybrid load balancing](https://docs.vllm.ai/en/stable/serving/data_parallel_deployment/?h=external+dp#hybrid-load-balancing) via external DP rank control |
| [**LoRA**](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch/lora/README.md) | ✅ | Dynamic loading/unloading from S3-compatible storage |
| **GB200 Support** | ✅ | Container functional on main |
68

69
## Quick Start
70

71
Start infrastructure services for local development:
72
73

```bash
74
docker compose -f deploy/docker-compose.yml up -d
75
76
```

77
Launch an aggregated serving deployment:
78
79

```bash
80
81
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg.sh
82
83
```

84
## Next Steps
85

86
87
- **[Reference Guide](vllm-reference-guide.md)**: Configuration, arguments, and operational details
- **[Examples](vllm-examples.md)**: All deployment patterns with launch scripts
88
- **[KV Cache Offloading](vllm-kv-offloading.md)**: KVBM, LMCache, and FlexKV integrations
89
90
91
92
- **[Observability](vllm-observability.md)**: Metrics and monitoring
- **[vLLM-Omni](vllm-omni.md)**: Multimodal model serving
- **[Kubernetes Deployment](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)**: Kubernetes deployment guide
- **[vLLM Documentation](https://docs.vllm.ai/en/stable/)**: Upstream vLLM serve arguments