<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# LLM Deployment using vLLM

This directory contains a Dynamo vLLM engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and prefill/decode (P/D) disaggregation.

## Use the Latest Release

We recommend using the latest stable release of Dynamo to avoid breaking changes:

[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

---

## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#run-single-node-examples)
- [Advanced Examples](#advanced-examples)
- [Deploy on Kubernetes](#kubernetes-deployment)
- [Configuration](#configuration)

## Feature Support Matrix

### Core Dynamo Features

| Feature | vLLM | Notes |
|---------|------|-------|
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ |  |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ |  |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ |  |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | WIP |

### Large Scale P/D and WideEP Features

| Feature            | vLLM | Notes                                                                 |
|--------------------|------|-----------------------------------------------------------------------|
| **WideEP**         | ✅   | Support for PPLX / DeepEP not verified                                           |
| **DP Rank Routing**| ✅   | Supported via external control of DP ranks |
| **GB200 Support**  | 🚧   | Container functional on main |

## Quick Start

Below we provide a guide that lets you run all of the common deployment patterns on a single node.

### Start NATS and ETCD in the background

Start both services with [Docker Compose](../../../deploy/docker-compose.yml):

```bash
docker compose -f deploy/docker-compose.yml up -d
```

### Pull or build container

We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:

```bash
./container/build.sh --framework VLLM
```

### Run container

```bash
./container/run.sh -it --framework VLLM [--mount-workspace]
```

The container includes the specific commit from [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790), which enables external control of the DP ranks.

## Run Single Node Examples

> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.

This figure shows an overview of the major components to deploy:

```
+------+      +-----------+      +------------------+             +---------------+
| HTTP |----->| dynamo    |----->|   vLLM Worker    |------------>|  vLLM Prefill |
|      |<-----| ingress   |<-----|                  |<------------|    Worker     |
+------+      +-----------+      +------------------+             +---------------+
                  |    ^                  |
       query best |    | return           | publish kv events
           worker |    | worker_id        v
                  |    |         +------------------+
                  |    +---------|     kv-router    |
                  +------------->|                  |
                                 +------------------+
```

Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern.

### Aggregated Serving

```bash
# requires one gpu
cd components/backends/vllm
bash launch/agg.sh
```

### Aggregated Serving with KV Routing

```bash
# requires two gpus
cd components/backends/vllm
bash launch/agg_router.sh
```

### Disaggregated Serving

```bash
# requires two gpus
cd components/backends/vllm
bash launch/disagg.sh
```

### Disaggregated Serving with KV Routing

```bash
# requires three gpus
cd components/backends/vllm
bash launch/disagg_router.sh
```

### Single Node Data Parallel Attention / Expert Parallelism

This example is not meant to be performant, but it showcases Dynamo routing to data-parallel workers.

```bash
# requires four gpus
cd components/backends/vllm
bash launch/dep.sh
```

> [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
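
One way to try this is sketched below. It assumes an extra prefill worker can be started with only the `--model` and `--is-prefill-worker` flags listed under Configuration; the launch scripts may pass additional flags, so check `launch/disagg.sh` before copying. The helper name `start_prefill_worker`, the `DRY_RUN` switch, and the GPU index are illustrative and not part of Dynamo.

```bash
# Print, or run, the command for one additional prefill worker.
# Usage: start_prefill_worker <gpu_index> <model>
# Hypothetical helper: the flag set is an assumption based on this README.
start_prefill_worker() {
  local gpu="$1" model="$2"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Dry run: only show what would be executed.
    echo "CUDA_VISIBLE_DEVICES=$gpu python3 -m dynamo.vllm --model $model --is-prefill-worker"
  else
    # Pin the worker to one GPU and start it in prefill-only mode.
    CUDA_VISIBLE_DEVICES="$gpu" python3 -m dynamo.vllm --model "$model" --is-prefill-worker
  fi
}

# Show the command for a prefill worker on GPU 2 without starting it.
DRY_RUN=1 start_prefill_worker 2 Qwen/Qwen3-0.6B
```

Drop `DRY_RUN=1` to actually start the worker, and use the same model as the deployment that is already running.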

## Advanced Examples

Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!

### Kubernetes Deployment

For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:

- `agg.yaml` - Aggregated serving
- `agg_router.yaml` - Aggregated serving with KV routing
- `disagg.yaml` - Disaggregated serving
- `disagg_router.yaml` - Disaggregated serving with KV routing
- `disagg_planner.yaml` - Disaggregated serving with [SLA Planner](../../../docs/architecture/sla_planner.md). See [SLA Planner Deployment Guide](../../../docs/guides/dynamo_deploy/sla_planner_deployment.md) for more details.

#### Prerequisites

- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.

- **Container Images**: We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd prefer to use your own registry, build and push your own image:
  ```bash
  ./container/build.sh --framework VLLM
  # Tag and push to your container registry
  # Update the image references in the YAML files
  ```

- **Pre-Deployment Profiling (if Using SLA Planner)**: Follow the [pre-deployment profiling guide](../../../docs/architecture/pre_deployment_profiling.md) to run pre-deployment profiling. The results will be saved to the `profiling-pvc` PVC and queried by the SLA Planner.

- **Port Forwarding**: After deployment, forward the frontend service to access the API:
  ```bash
  kubectl port-forward deployment/vllm-v1-disagg-frontend-<pod-uuid-info> 8080:8000
  ```

#### Deploy to Kubernetes

Example with disagg: export the `NAMESPACE` you used in your Dynamo Cloud installation, then apply the manifest.

```bash
cd dynamo
cd components/backends/vllm/deploy
kubectl apply -f disagg.yaml -n $NAMESPACE
```

To change the `DYN_LOG` level, edit the YAML file by adding:

```yaml
...
spec:
  envs:
    - name: DYN_LOG
      value: "debug" # or other log levels
  ...
```

### Testing the Deployment

Send a test request to verify your deployment:

```bash
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
      }
    ],
    "stream": false,
    "max_tokens": 30
  }'
```
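
The request can fail if it is sent before the workers have finished loading the model. A small retry helper, sketched here as a generic shell function rather than anything Dynamo ships, can wait for the frontend first. The readiness probe in the trailing comment assumes the frontend exposes an OpenAI-style `/v1/models` route; substitute any request you know succeeds once the deployment is up.

```bash
# Retry a command until it succeeds or the attempt budget is used up.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (not executed here): poll every 2 seconds, up to 30 times,
# before sending the test request. The /v1/models route is an assumption.
# retry 30 2 curl -sf -o /dev/null localhost:8080/v1/models
```

The helper returns a non-zero status when the budget is exhausted, so it can gate the test request with `retry ... && curl ...`.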

## Configuration

vLLM workers are configured through command-line arguments. Key parameters include:

- `--endpoint`: Dynamo endpoint in format `dyn://namespace.component.endpoint`
- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo

See `args.py` for the full list of configuration options and their defaults.

The vLLM [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for CLI arguments points to running `vllm serve --help` to see which arguments can be added. We use the same argument parser as vLLM.
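
To make the `--endpoint` format concrete, the sketch below splits a `dyn://namespace.component.endpoint` string into its three parts using plain shell parameter expansion. `parse_dyn_endpoint` is a hypothetical helper written for this README, and `dyn://dynamo.backend.generate` is only an example value.

```bash
# Split dyn://namespace.component.endpoint into "namespace component endpoint".
# Hypothetical helper for illustration; not part of the Dynamo CLI.
parse_dyn_endpoint() {
  local rest namespace component endpoint
  case "$1" in
    dyn://*.*.*) ;;
    *) echo "invalid endpoint: $1" >&2; return 1 ;;
  esac
  rest="${1#dyn://}"          # drop the scheme
  namespace="${rest%%.*}"     # text before the first dot
  rest="${rest#*.}"
  component="${rest%%.*}"     # text before the second dot
  endpoint="${rest#*.}"       # everything after the second dot
  if [ -z "$namespace" ] || [ -z "$component" ] || [ -z "$endpoint" ]; then
    echo "invalid endpoint: $1" >&2
    return 1
  fi
  echo "$namespace $component $endpoint"
}

parse_dyn_endpoint "dyn://dynamo.backend.generate"
# prints: dynamo backend generate
```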

## Request Migration

You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

```bash
python3 -m dynamo.vllm ... --migration-limit=3
```

This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.
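
The retry budget can be pictured with a toy model. The function below is not Dynamo's implementation: each argument after the limit is a shell command standing in for one worker, a non-zero exit status plays the role of a worker failure, and the request moves to the next worker until the migration limit is spent.

```bash
# Toy model of a migration budget (illustrative only, not Dynamo code).
# Usage: run_with_migration <migration_limit> <worker_cmd>...
run_with_migration() {
  local limit="$1" attempts=0 worker
  shift
  if [ "$#" -lt 1 ]; then
    echo "usage: run_with_migration <limit> <worker_cmd>..." >&2
    return 2
  fi
  for worker in "$@"; do
    # The first attempt is free; every further attempt is one migration.
    if [ "$attempts" -gt "$limit" ]; then
      break
    fi
    attempts=$((attempts + 1))
    if $worker; then
      echo "served by '$worker' after $((attempts - 1)) migration(s)"
      return 0
    fi
  done
  echo "request failed after $((attempts - 1)) migration(s)" >&2
  return 1
}

# Three workers fail and the fourth succeeds, so a limit of 3 is just enough.
run_with_migration 3 false false false true
# prints: served by 'true' after 3 migration(s)
```

With a limit of 2 the same sequence of workers fails, because the third migration is not allowed.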