<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# LLM Deployment Examples using vLLM

This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and prefill/decode (P/D) disaggregation.

## Deployment Architectures

See [deployment architectures](../llm/README.md#deployment-architectures) for an overview of the general architecture. vLLM supports aggregated, disaggregated, and KV-routed serving patterns.

## Getting Started

### Prerequisites

Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml):

```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
```
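Before moving on, it can help to confirm that both services accept connections. The following is a minimal sketch and not part of this repository: `retry` and `port_open` are hypothetical helper names, and the ports mentioned in the usage note (2379 for etcd, 4222 for NATS) are those services' default client ports, assumed here to be the ones the Compose file publishes.

```bash
#!/usr/bin/env bash

# Run a command repeatedly until it succeeds.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Return success if a TCP connection to host:port can be opened.
# Relies on bash's built-in /dev/tcp redirection, so it needs bash.
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}
```

With these defined, `retry 30 1 port_open localhost 2379 && retry 30 1 port_open localhost 4222` would wait up to roughly 30 seconds for each service, under the port assumptions stated above.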

### Build and Run Docker

```bash
./container/build.sh --framework VLLM_V1
```

```bash
./container/run.sh -it --framework VLLM_V1
```

This build includes the change from [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790), which enables external control of the data parallel (DP) ranks.

## Run Deployment

This figure shows an overview of the major components to deploy:

```
+------+      +-----------+      +------------------+             +---------------+
| HTTP |----->| dynamo    |----->|   vLLM Worker    |------------>|  vLLM Prefill |
|      |<-----| ingress   |<-----|                  |<------------|    Worker     |
+------+      +-----------+      +------------------+             +---------------+
                  |    ^                  |
       query best |    | return           | publish kv events
           worker |    | worker_id        v
                  |    |         +------------------+
                  |    +---------|     kv-router    |
                  +------------->|                  |
                                 +------------------+
```

Note: The architecture above shows all components. Which components are actually spawned depends on the chosen deployment pattern.

### Example Architectures

> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `dynamo run` to start the ingress and uses `python3 main.py` to start the vLLM workers. You can also run each command in a separate terminal for better log visibility.

#### Aggregated Serving

```bash
# requires one gpu
cd examples/vllm_v1
bash launch/agg.sh
```

#### Aggregated Serving with KV Routing

```bash
# requires two gpus
cd examples/vllm_v1
bash launch/agg_router.sh
```

#### Disaggregated Serving

```bash
# requires two gpus
cd examples/vllm_v1
bash launch/disagg.sh
```

#### Disaggregated Serving with KV Routing

```bash
# requires three gpus
cd examples/vllm_v1
bash launch/disagg_router.sh
```

#### Single Node Data Parallel Attention / Expert Parallelism

This example is not meant to be performant; it showcases Dynamo routing to data parallel workers.

```bash
# requires four gpus
cd examples/vllm_v1
bash launch/dep.sh
```
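Each launch script above states how many GPUs it needs. As a rough sketch, the check below compares that requirement with the GPUs visible on the machine before launching. The function names are hypothetical and not part of this repository; the GPU counts are taken from the comments in the examples above, and `nvidia-smi -L` is assumed to be available and to print one line per GPU.

```bash
#!/usr/bin/env bash

# GPUs required by each launch script, per the comments in the examples above.
required_gpus() {
  case "$1" in
    agg) echo 1 ;;
    agg_router | disagg) echo 2 ;;
    disagg_router) echo 3 ;;
    dep) echo 4 ;;
    *) echo "unknown launch script: $1" >&2; return 1 ;;
  esac
}

# Count visible GPUs; reports 0 when nvidia-smi is not installed.
available_gpus() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L | grep -c '^GPU '
  else
    echo 0
  fi
}

# Succeed only if enough GPUs are visible for the named launch script.
can_launch() {
  local need have
  need="$(required_gpus "$1")" || return 1
  have="$(available_gpus)"
  [ "$have" -ge "$need" ]
}
```

For example, `can_launch disagg_router && bash launch/disagg_router.sh` would skip the launch on a machine with fewer than three visible GPUs.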


> [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.

### Kubernetes Deployment

For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:

- `agg.yaml` - Aggregated serving
- `agg_router.yaml` - Aggregated serving with KV routing
- `disagg.yaml` - Disaggregated serving
- `disagg_router.yaml` - Disaggregated serving with KV routing

#### Prerequisites

- **Dynamo Cloud**: Follow the [Quickstart Guide](../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.

- **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/vllm_v1-runtime`. If you don't have access, build and push your own image:
  ```bash
  ./container/build.sh --framework VLLM_V1
  # Tag and push to your container registry
  # Update the image references in the YAML files
  ```

- **Port Forwarding**: After deployment, forward the frontend service to access the API:
  ```bash
  kubectl port-forward deployment/vllm-v1-disagg-frontend-<pod-uuid-info> 8080:8000
  ```

#### Deploy to Kubernetes

Example with disagg:

```bash
cd ~/dynamo/examples/vllm/deploy
kubectl apply -f disagg.yaml
```

### Testing the Deployment

Send a test request to verify your deployment:

```bash
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
    {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
    }
    ],
    "stream": false,
    "max_tokens": 30
  }'
```
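For repeated testing with different prompts, the request above can be wrapped in small helper functions. This is a sketch only: `build_chat_payload`, `send_chat_request`, and the `DYNAMO_URL` variable are hypothetical names, and the prompt is inserted into the JSON verbatim, so it must not contain double quotes or backslashes.

```bash
#!/usr/bin/env bash

# Build a minimal chat-completions request body.
# Usage: build_chat_payload <model> <prompt> [max_tokens]
# The prompt is not JSON-escaped; keep it free of double quotes and backslashes.
build_chat_payload() {
  local model="$1" prompt="$2" max_tokens="${3:-30}"
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "stream": false, "max_tokens": %s}' \
    "$model" "$prompt" "$max_tokens"
}

# Send the request to the frontend. The default URL matches the curl example above;
# DYNAMO_URL is a hypothetical override for other hosts or ports.
send_chat_request() {
  local url="${DYNAMO_URL:-http://localhost:8080}"
  curl --silent --show-error --fail "$url/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(build_chat_payload "$@")"
}

# Print the payload only; nothing is sent at this point.
build_chat_payload "Qwen/Qwen3-0.6B" "Hello" 30
```

Once the deployment is running, `send_chat_request "Qwen/Qwen3-0.6B" "Hello" 30` sends the request; `--fail` makes curl exit with a non-zero status on HTTP errors, which is convenient in scripts.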

## Configuration

vLLM workers are configured through command-line arguments. Key parameters include:

- `--endpoint`: Dynamo endpoint in format `dyn://namespace.component.endpoint`
- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo

See `args.py` for the full list of configuration options and their defaults.
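To illustrate the `--endpoint` format described above, the sketch below splits a `dyn://namespace.component.endpoint` string into its three parts and rejects malformed values. The function name and the sample endpoint value are hypothetical; the actual validation performed by the workers may differ.

```bash
#!/usr/bin/env bash

# Split dyn://namespace.component.endpoint into "namespace component endpoint".
# Returns non-zero if the scheme is wrong or there are not exactly three
# non-empty, dot-separated fields.
parse_dyn_endpoint() {
  local uri="$1" path rest
  case "$uri" in
    dyn://*) path="${uri#dyn://}" ;;
    *) return 1 ;;
  esac
  case "$path" in
    *.*.*.* | .* | *. | *..*) return 1 ;;  # too many fields or an empty field
    *.*.*) ;;                              # exactly three fields
    *) return 1 ;;                         # too few fields
  esac
  rest="${path#*.}"
  printf '%s %s %s\n' "${path%%.*}" "${rest%%.*}" "${rest#*.}"
}

# Hypothetical endpoint value, used only for illustration.
parse_dyn_endpoint "dyn://dynamo.backend.generate"
# prints: dynamo backend generate
```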

The vLLM [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for CLI arguments recommends running `vllm serve --help` to list the arguments that can be added. We use the same argument parser as vLLM.