<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# LLM Deployment using vLLM

This directory contains a Dynamo vLLM engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and prefill/decode (P/D) disaggregation.

## Deployment Architectures

See [deployment architectures](../llm/README.md#deployment-architectures) for an overview of the general architecture. vLLM supports aggregated, disaggregated, and KV-routed serving patterns.

## Getting Started

### Prerequisites

Start required services (etcd and NATS) using [Docker Compose](../../../deploy/docker-compose.yml):

```bash
docker compose -f deploy/docker-compose.yml up -d
```

### Build and Run Docker

```bash
./container/build.sh --framework VLLM
```

```bash
./container/run.sh -it --framework VLLM [--mount-workspace]
```

This build includes the vLLM commit from [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790), which enables external control of the DP ranks.

## Run Deployment

This figure shows an overview of the major components to deploy:

```
+------+      +-----------+      +------------------+             +---------------+
| HTTP |----->| dynamo    |----->|   vLLM Worker    |------------>|  vLLM Prefill |
|      |<-----| ingress   |<-----|                  |<------------|    Worker     |
+------+      +-----------+      +------------------+             +---------------+
                  |    ^                  |
       query best |    | return           | publish kv events
           worker |    | worker_id        v
                  |    |         +------------------+
                  |    +---------|     kv-router    |
                  +------------->|                  |
                                 +------------------+
```

Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern.
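
The "query best worker / return worker_id" exchange in the figure can be illustrated with a toy sketch. This is not Dynamo's router: the names `matched_prefix_blocks`, `pick_worker`, and `kv_index` are hypothetical, and the scoring rule (longest cached prefix wins) is a simplifying assumption. A production router would typically also weigh worker load.

```python
from typing import Dict, List, Set


def matched_prefix_blocks(request_blocks: List[int], cached_blocks: Set[int]) -> int:
    """Count how many leading blocks of the request are already cached."""
    count = 0
    for block in request_blocks:
        if block not in cached_blocks:
            break  # only a contiguous prefix can be reused
        count += 1
    return count


def pick_worker(request_blocks: List[int], kv_index: Dict[str, Set[int]]) -> str:
    """Return the worker_id with the longest cached prefix.

    kv_index maps worker_id -> block hashes learned from published KV events
    (hypothetical structure). Ties go to the lexicographically smallest id.
    Raises ValueError if kv_index is empty.
    """
    return max(
        sorted(kv_index),
        key=lambda worker_id: matched_prefix_blocks(request_blocks, kv_index[worker_id]),
    )
```

For example, with `kv_index = {"worker-a": {101, 102}, "worker-b": {101, 102, 103}}`, a request whose blocks are `[101, 102, 103, 104]` is routed to `worker-b`, because three of its leading blocks are already cached there.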

### Example Architectures

> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `dynamo run` to start the ingress and uses `python3 main.py` to start the vLLM workers. You can run each command in separate terminals for better log visibility.

#### Aggregated Serving

```bash
# requires one gpu
cd components/backends/vllm
bash launch/agg.sh
```

#### Aggregated Serving with KV Routing

```bash
# requires two gpus
cd components/backends/vllm
bash launch/agg_router.sh
```

#### Disaggregated Serving

```bash
# requires two gpus
cd components/backends/vllm
bash launch/disagg.sh
```

#### Disaggregated Serving with KV Routing

```bash
# requires three gpus
cd components/backends/vllm
bash launch/disagg_router.sh
```

#### Single Node Data Parallel Attention / Expert Parallelism

This example is not meant to be performant; it showcases Dynamo routing requests to data-parallel workers.

```bash
# requires four gpus
cd components/backends/vllm
bash launch/dep.sh
```


> [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.

### Kubernetes Deployment

For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:

- `agg.yaml` - Aggregated serving
- `agg_router.yaml` - Aggregated serving with KV routing
- `disagg.yaml` - Disaggregated serving
- `disagg_router.yaml` - Disaggregated serving with KV routing
- `disagg_planner.yaml` - Disaggregated serving with [SLA Planner](../../../docs/architecture/sla_planner.md). See [SLA Planner Deployment Guide](../../../docs/guides/dynamo_deploy/sla_planner_deployment.md) for more details.

#### Prerequisites

- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.

- **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/vllm-runtime`. If you don't have access, build and push your own image:
  ```bash
  ./container/build.sh --framework VLLM
  # Tag and push to your container registry
  # Update the image references in the YAML files
  ```

- **Pre-Deployment Profiling (if Using SLA Planner)**: Follow the [pre-deployment profiling guide](../../../docs/architecture/pre_deployment_profiling.md) to run pre-deployment profiling. The results will be saved to the `profiling-pvc` PVC and queried by the SLA Planner.

- **Port Forwarding**: After deployment, forward the frontend service to access the API:
  ```bash
  kubectl port-forward deployment/vllm-v1-disagg-frontend-<pod-uuid-info> 8080:8000
  ```

#### Deploy to Kubernetes

Example with disagg. Make sure `NAMESPACE` is exported with the namespace you used in your Dynamo Cloud installation, then apply the manifest:

```bash
cd dynamo
cd components/backends/vllm/deploy
kubectl apply -f disagg.yaml -n $NAMESPACE
```

To change the `DYN_LOG` level, edit the YAML file by adding:

```yaml
...
spec:
  envs:
    - name: DYN_LOG
      value: "debug" # or other log levels
  ...
```

### Testing the Deployment

Send a test request to verify your deployment:

```bash
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
    {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
    }
    ],
    "stream": false,
    "max_tokens": 30
  }'
```
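
The same request can be sent from Python using only the standard library. This is a minimal sketch: the helper names `build_chat_request` and `send_chat_request` are hypothetical, the base URL assumes the frontend is reachable on `localhost:8080` as in the `curl` command, and the response is simply decoded as JSON without assuming any particular fields.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen/Qwen3-0.6B",
                       base_url="http://localhost:8080", max_tokens=30):
    """Build a POST request mirroring the curl example (helper name is hypothetical)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60.0):
    """Send the request and return the decoded JSON body; needs a running deployment."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a deployment running, `send_chat_request(build_chat_request("Hello"))` returns the parsed response. Building the request separately from sending it makes the payload easy to inspect without a live server.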

## Configuration

vLLM workers are configured through command-line arguments. Key parameters include:

- `--endpoint`: Dynamo endpoint in format `dyn://namespace.component.endpoint`
- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo

See `args.py` for the full list of configuration options and their defaults.
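
To make the `dyn://namespace.component.endpoint` format concrete, the sketch below splits such a string into its three parts. It is an illustration only and not the parser used in `args.py`: the `DynEndpoint` type and `parse_dyn_endpoint` function are hypothetical names, and the example value `dyn://dynamo.backend.generate` is made up for demonstration.

```python
from typing import NamedTuple


class DynEndpoint(NamedTuple):
    """The three dot-separated parts of a dyn:// endpoint string."""
    namespace: str
    component: str
    endpoint: str


def parse_dyn_endpoint(value: str) -> DynEndpoint:
    """Split 'dyn://namespace.component.endpoint' into its parts (hypothetical helper)."""
    prefix = "dyn://"
    if not value.startswith(prefix):
        raise ValueError(f"expected a value starting with {prefix!r}, got {value!r}")
    parts = value[len(prefix):].split(".")
    # Exactly three non-empty parts are required.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {value!r}")
    return DynEndpoint(*parts)
```

For instance, `parse_dyn_endpoint("dyn://dynamo.backend.generate")` yields a tuple whose `namespace` is `dynamo`, `component` is `backend`, and `endpoint` is `generate`.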

We use the same argument parser as vLLM, so vLLM's own CLI arguments can also be passed. The vLLM [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for CLI args points to running `vllm serve --help` to see which arguments are available.