README.md 8.72 KB
Newer Older
1
<!--
2
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3
4
5
SPDX-License-Identifier: Apache-2.0
-->

Alec's avatar
Alec committed
6
# LLM Deployment using vLLM
7

8
This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
9

Anish's avatar
Anish committed
10
## Use the Latest Release
11

Anish's avatar
Anish committed
12
We recommend using the latest stable release of Dynamo to avoid breaking changes:
13

Anish's avatar
Anish committed
14
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
15

Anish's avatar
Anish committed
16
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
17

Anish's avatar
Anish committed
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

---

## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#run-single-node-examples)
- [Advanced Examples](#advanced-examples)
- [Deploy on Kubernetes](#kubernetes-deployment)
- [Configuration](#configuration)

## Feature Support Matrix

### Core Dynamo Features

| Feature | vLLM | Notes |
|---------|------|-------|
38
39
| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ |  |
| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
40
41
42
43
| [**KV-Aware Routing**](../../components/router/README.md) | ✅ |  |
| [**SLA-Based Planner**](../../../docs/components/planner/planner_guide.md) | ✅ |  |
| [**Load Based Planner**](../../../docs/components/planner/README.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/components/kvbm/README.md) | ✅ |  |
44
| [**LMCache**](../../integrations/lmcache_integration.md) | ✅ |  |
45
| [**Prompt Embeddings**](./prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag |
Anish's avatar
Anish committed
46
47
48
49
50
51
52
53
54

### Large Scale P/D and WideEP Features

| Feature            | vLLM | Notes                                                                 |
|--------------------|------|-----------------------------------------------------------------------|
| **WideEP**         | ✅   | Support for PPLX / DeepEP not verified                                           |
| **DP Rank Routing**| ✅   | Supported via external control of DP ranks |
| **GB200 Support**  | 🚧   | Container functional on main |

55
## vLLM Quick Start
Anish's avatar
Anish committed
56
57
58

Below we provide a guide that lets you run all of our the common deployment patterns on a single node.

59
### Start Infrastructure Services (Local Development Only)
Anish's avatar
Anish committed
60

61
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):
62

63
```bash
64
docker compose -f deploy/docker-compose.yml up -d
65
66
```

67
68
69
70
71
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)

Anish's avatar
Anish committed
72
73
74
### Pull or build container

We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
75
76

```bash
Alec's avatar
Alec committed
77
./container/build.sh --framework VLLM
78
79
```

Anish's avatar
Anish committed
80
81
### Run container

82
```bash
Alec's avatar
Alec committed
83
./container/run.sh -it --framework VLLM [--mount-workspace]
84
85
```

Alec's avatar
Alec committed
86
This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks.
87

Anish's avatar
Anish committed
88
89
90
91
## Run Single Node Examples

> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
92

Anish's avatar
Anish committed
93
### Aggregated Serving
94
95

```bash
Alec's avatar
Alec committed
96
# requires one gpu
97
cd examples/backends/vllm
Alec's avatar
Alec committed
98
bash launch/agg.sh
99
100
```

Anish's avatar
Anish committed
101
### Aggregated Serving with KV Routing
102
103

```bash
Alec's avatar
Alec committed
104
# requires two gpus
105
cd examples/backends/vllm
Alec's avatar
Alec committed
106
bash launch/agg_router.sh
107
108
```

Anish's avatar
Anish committed
109
### Disaggregated Serving
110
111

```bash
Alec's avatar
Alec committed
112
# requires two gpus
113
cd examples/backends/vllm
Alec's avatar
Alec committed
114
bash launch/disagg.sh
115
116
```

Anish's avatar
Anish committed
117
### Disaggregated Serving with KV Routing
118
119

```bash
Alec's avatar
Alec committed
120
# requires three gpus
121
cd examples/backends/vllm
Alec's avatar
Alec committed
122
bash launch/disagg_router.sh
123
124
```

Anish's avatar
Anish committed
125
### Single Node Data Parallel Attention / Expert Parallelism
Alec's avatar
Alec committed
126

Anish's avatar
Anish committed
127
This example is not meant to be performant but showcases Dynamo routing to data parallel workers
128
129

```bash
Alec's avatar
Alec committed
130
# requires four gpus
131
cd examples/backends/vllm
Alec's avatar
Alec committed
132
bash launch/dep.sh
133
134
```

Alec's avatar
Alec committed
135
136
> [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
137

Anish's avatar
Anish committed
138
139
140
141
## Advanced Examples

Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!

142
143
144
145
146
### Speculative Decoding with Aggregated Serving (Meta-Llama-3.1-8B-Instruct + Eagle3)

Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node.
This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the **VLLM aggregated serving framework** for faster inference while maintaining accuracy.

147
**Guide:** [Speculative Decoding Quickstart](../../features/speculative_decoding/speculative_decoding_vllm.md)
148

149
150
> **See also:** [Speculative Decoding Feature Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.

151
152
### Kubernetes Deployment

153
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](../../../examples/backends/vllm/deploy/README.md)
Alec's avatar
Alec committed
154
155
156
157
158
159
160
161

## Configuration

vLLM workers are configured through command-line arguments. Key parameters include:

- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
162
- `--connector`: Specify which kv_transfer_config you want vllm to use `[nixl, lmcache, kvbm, none]`. This is a helper flag which overwrites the engines KVTransferConfig.
163
164
165
166
- `--enable-prompt-embeds`: **Enable prompt embeddings feature** (opt-in, default: disabled)
  - **Required for:** Accepting pre-computed prompt embeddings via API
  - **Default behavior:** Prompt embeddings DISABLED - requests with `prompt_embeds` will fail
  - **Error without flag:** `ValueError: You must set --enable-prompt-embeds to input prompt_embeds`
Alec's avatar
Alec committed
167
168
169
170

See `args.py` for the full list of configuration options and their defaults.

The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the vLLM CLI args points to running 'vllm serve --help' to see what CLI args can be added. We use the same argument parser as vLLM.
171

172
173
174
175
176
177
178
179
180
181
### Hashing Consistency for KV Events

When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:

- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's builtin hashing for prefix caching.
- If your vLLM version supports it, configure a deterministic prefix caching algorithm, for example:

```bash
vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
```
182
See the high-level notes in [Router Design](../../design_docs/router_design.md#deterministic-event-ids) on deterministic event IDs.
183

184
185
## Request Migration

186
You can enable [request migration](../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
187
188
189
190
191

```bash
python3 -m dynamo.vllm ... --migration-limit=3
```

192
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/fault_tolerance/request_migration.md) documentation for details on how this works.
193
194
195
196
197
198
199
200
201
202
203
204

## Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.

### Cancellation Support Matrix

| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |

205
For more details, see the [Request Cancellation Architecture](../../../docs/fault_tolerance/request_cancellation.md) documentation.