README.md 7.46 KB
Newer Older
1
2
3
4
5
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

Alec's avatar
Alec committed
6
# LLM Deployment using vLLM
7

Alec's avatar
Alec committed
8
This directory contains a Dynamo vllm engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
9

Anish's avatar
Anish committed
10
## Use the Latest Release
11

Anish's avatar
Anish committed
12
We recommend using the latest stable release of Dynamo to avoid breaking changes:
13

Anish's avatar
Anish committed
14
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
15

Anish's avatar
Anish committed
16
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
17

Anish's avatar
Anish committed
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

---

## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#run-single-node-examples)
- [Advanced Examples](#advanced-examples)
- [Deploy on Kubernetes](#kubernetes-deployment)
- [Configuration](#configuration)

## Feature Support Matrix

### Core Dynamo Features

| Feature | vLLM | Notes |
|---------|------|-------|
38
39
40
41
42
43
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ |  |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ |  |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ |  |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | WIP |
44
| [**LMCache**](./LMCache_Integration.md) | ✅ |  |
Anish's avatar
Anish committed
45
46
47
48
49
50
51
52
53

### Large Scale P/D and WideEP Features

| Feature            | vLLM | Notes                                                                 |
|--------------------|------|-----------------------------------------------------------------------|
| **WideEP**         | ✅   | Support for PPLX / DeepEP not verified                                           |
| **DP Rank Routing**| ✅   | Supported via external control of DP ranks |
| **GB200 Support**  | 🚧   | Container functional on main |

54
## vLLM Quick Start
Anish's avatar
Anish committed
55
56
57
58
59
60

Below we provide a guide that lets you run all of our the common deployment patterns on a single node.

### Start NATS and ETCD in the background

Start using [Docker Compose](../../../deploy/docker-compose.yml)
61

62
```bash
63
docker compose -f deploy/docker-compose.yml up -d
64
65
```

Anish's avatar
Anish committed
66
67
68
### Pull or build container

We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
69
70

```bash
Alec's avatar
Alec committed
71
./container/build.sh --framework VLLM
72
73
```

Anish's avatar
Anish committed
74
75
### Run container

76
```bash
Alec's avatar
Alec committed
77
./container/run.sh -it --framework VLLM [--mount-workspace]
78
79
```

Alec's avatar
Alec committed
80
This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks.
81

Anish's avatar
Anish committed
82
83
84
85
## Run Single Node Examples

> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
86

Alec's avatar
Alec committed
87
This figure shows an overview of the major components to deploy:
88

Alec's avatar
Alec committed
89
90
91
92
93
94
95
96
97
98
99
100
101
```
+------+      +-----------+      +------------------+             +---------------+
| HTTP |----->| dynamo    |----->|   vLLM Worker    |------------>|  vLLM Prefill |
|      |<-----| ingress   |<-----|                  |<------------|    Worker     |
+------+      +-----------+      +------------------+             +---------------+
                  |    ^                  |
       query best |    | return           | publish kv events
           worker |    | worker_id        v
                  |    |         +------------------+
                  |    +---------|     kv-router    |
                  +------------->|                  |
                                 +------------------+
```
102

Alec's avatar
Alec committed
103
Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern.
104

Anish's avatar
Anish committed
105
### Aggregated Serving
106
107

```bash
Alec's avatar
Alec committed
108
# requires one gpu
Alec's avatar
Alec committed
109
cd components/backends/vllm
Alec's avatar
Alec committed
110
bash launch/agg.sh
111
112
```

Anish's avatar
Anish committed
113
### Aggregated Serving with KV Routing
114
115

```bash
Alec's avatar
Alec committed
116
# requires two gpus
Alec's avatar
Alec committed
117
cd components/backends/vllm
Alec's avatar
Alec committed
118
bash launch/agg_router.sh
119
120
```

Anish's avatar
Anish committed
121
### Disaggregated Serving
122
123

```bash
Alec's avatar
Alec committed
124
# requires two gpus
Alec's avatar
Alec committed
125
cd components/backends/vllm
Alec's avatar
Alec committed
126
bash launch/disagg.sh
127
128
```

Anish's avatar
Anish committed
129
### Disaggregated Serving with KV Routing
130
131

```bash
Alec's avatar
Alec committed
132
# requires three gpus
Alec's avatar
Alec committed
133
cd components/backends/vllm
Alec's avatar
Alec committed
134
bash launch/disagg_router.sh
135
136
```

Anish's avatar
Anish committed
137
### Single Node Data Parallel Attention / Expert Parallelism
Alec's avatar
Alec committed
138

Anish's avatar
Anish committed
139
This example is not meant to be performant but showcases Dynamo routing to data parallel workers
140
141

```bash
Alec's avatar
Alec committed
142
# requires four gpus
Alec's avatar
Alec committed
143
cd components/backends/vllm
Alec's avatar
Alec committed
144
bash launch/dep.sh
145
146
```

Alec's avatar
Alec committed
147
148
> [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
149

Anish's avatar
Anish committed
150
151
152
153
## Advanced Examples

Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!

154
155
### Kubernetes Deployment

156
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](deploy/README.md)
Alec's avatar
Alec committed
157
158
159
160
161
162
163
164

## Configuration

vLLM workers are configured through command-line arguments. Key parameters include:

- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
165
- `--connector`: Specify which kv_transfer_config you want vllm to use `[nixl, lmcache, kvbm, none]`. This is a helper flag which overwrites the engines KVTransferConfig.
Alec's avatar
Alec committed
166
167
168
169

See `args.py` for the full list of configuration options and their defaults.

The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the vLLM CLI args points to running 'vllm serve --help' to see what CLI args can be added. We use the same argument parser as vLLM.
170

171
172
173
174
175
176
177
178
179
180
181
182
### Hashing Consistency for KV Events

When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:

- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's builtin hashing for prefix caching.
- If your vLLM version supports it, configure a deterministic prefix caching algorithm, for example:

```bash
vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
```
See the high-level notes in [KV Cache Routing](../../../docs/architecture/kv_cache_routing.md) on deterministic event IDs.

183
184
## Request Migration

185
You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
186
187
188
189
190

```bash
python3 -m dynamo.vllm ... --migration-limit=3
```

191
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.