<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

> **NOTE**: This example is based on an internal NVIDIA library that will soon be publicly released. The example won't work until the official release.

## Prerequisites

Start required services (etcd and NATS):

   Option A: Using [Docker Compose](/deploy/docker-compose.yml) (Recommended)
   ```bash
   docker compose -f deploy/docker-compose.yml up -d
   ```

   Option B: Manual Setup

    - [NATS.io](https://docs.nats.io/running-a-nats-service/introduction/installation) server with [Jetstream](https://docs.nats.io/nats-concepts/jetstream)
        - example: `nats-server -js --trace`
    - [etcd](https://etcd.io) server
        - follow instructions in [etcd installation](https://etcd.io/docs/v3.5/install/) to start an `etcd-server` locally
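
To sanity-check that both services are up, you can probe them directly (a minimal sketch, assuming the default ports: 2379 for etcd, 4222 for NATS):

```python
import socket
import urllib.request

# etcd exposes a JSON health endpoint on its client port (2379 by default).
with urllib.request.urlopen("http://localhost:2379/health", timeout=2) as resp:
    print("etcd:", resp.read().decode())

# NATS accepts client connections on port 4222 by default; a successful TCP
# connect is a sufficient liveness check here.
with socket.create_connection(("localhost", 4222), timeout=2):
    print("NATS: reachable on port 4222")
```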

## Build container

```
./container/build.sh --framework VLLM_NIXL --target dev --build-context nixl=<path to downloaded nixl repo @ c53bb19a6a114e9093071bd1f2904f996ae1839b>
```

## Run container

```
./container/run.sh --framework VLLM_NIXL --target dev -it
```

All of the commands below are run inside the same container.

## Run deployment

Add the model to Dynamo and start the HTTP server:

```
TRT_LOG=DEBUG http --port 8181
```

### Router-less Deployment

A router-less deployment runs without the KV router and the disaggregated router.

For a router-less deployment, the client should directly hit the `vllm.generate` endpoint:
```
llmctl http add chat-models deepseek-ai/DeepSeek-R1-Distill-Llama-8B dynamo-init.vllm.generate
```

#### Monolithic

```
cd /workspace/examples/python_rs/llm/vllm_nixl
CUDA_VISIBLE_DEVICES=0 python3 routerless/worker.py \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --enforce-eager
```

#### Disaggregated

In a disaggregated router-less deployment, the decode worker sends requests directly to a random prefill worker; all requests are sent to the prefill worker(s) for remote prefill.
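
Conceptually, the prefill-worker selection is as simple as this sketch (illustrative only, not the actual implementation):

```python
import random

def pick_prefill_worker(prefill_workers: list[str]) -> str:
    # Router-less mode: no load or cache awareness, just a uniform random pick.
    return random.choice(prefill_workers)
```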

In terminal 1:

```
cd /workspace/examples/python_rs/llm/vllm_nixl
CUDA_VISIBLE_DEVICES=0 python routerless/prefill_worker.py \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --enforce-eager \
    --block-size 64 \
    --kv-transfer-config \
    '{"kv_connector":"DynamoNixlConnector"}'
```

In terminal 2:
```
cd /workspace/examples/python_rs/llm/vllm_nixl
CUDA_VISIBLE_DEVICES=1,2 python3 routerless/worker.py \
    --remote-prefill \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --enforce-eager \
    --block-size 64 \
    --tensor-parallel-size 2 \
    --kv-transfer-config \
    '{"kv_connector":"DynamoNixlConnector"}'
```

### Router-based Deployment

A router-based deployment uses the KV router to schedule each request to the best decode worker, and the disaggregated router to decide whether to prefill locally or remotely. Remote prefill requests are sent to a global prefill queue to balance the prefill load.

For a router-based deployment, the client should hit the processor's endpoint:
```
llmctl http add chat-models deepseek-ai/DeepSeek-R1-Distill-Llama-8B dynamo-init.process.chat/completions
```

A disaggregated vLLM deployment has four major components:
1. Processor
2. KV Router
3. Disaggregated Router
4. Prefill and Decode Workers

#### Processor

```
# Processor must take the same args as the worker
# This is temporary until we communicate the ModelDeploymentCard over etcd
# Currently only block-size=64 is supported
cd /workspace/examples/python_rs/llm/vllm_nixl
RUST_LOG=info python3 router/processor.py \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --enable-prefix-caching \
    --block-size 64 \
    --max-model-len 16384
```

#### KV Router

The KV Router aggregates KV events from all the workers and maintains a prefix tree of the cached tokens. It decides which worker to route each request to based on the length of the prefix match and the load on the workers.
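
The routing decision can be pictured with this sketch (illustrative only; the names are hypothetical and not the actual router API):

```python
from dataclasses import dataclass

@dataclass
class WorkerState:
    worker_id: str
    prefix_match_len: int   # tokens of the request already cached on this worker
    active_requests: int    # current load on the worker

def pick_worker(workers: list[WorkerState]) -> str:
    # Prefer the longest prefix match; break ties by picking the least-loaded worker.
    best = max(workers, key=lambda w: (w.prefix_match_len, -w.active_requests))
    return best.worker_id
```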

To launch the KV Router, run the following command:
```
RUST_LOG=info python3 router/kv_router.py \
    --routing-strategy prefix \
    --model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --min-workers 1
```

There is also a custom router that uses a cost function defined in Python to make routing decisions. To launch the custom router, run the following command:
```
RUST_LOG=info python3 router/kv_router.py \
    --routing-strategy prefix \
    --model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --custom-router \
    --min-workers 1
```
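
Such a cost function might look like this sketch (hypothetical names and signature; the actual interface expected by `router/kv_router.py` may differ):

```python
def route_cost(prefix_match_len: int, prompt_len: int, active_requests: int) -> float:
    # Lower cost is better: reward prefix-cache overlap, penalize current load.
    overlap = prefix_match_len / max(prompt_len, 1)
    return active_requests - 2.0 * overlap
```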

Only the `prefix` routing strategy is available for now:
- `prefix`: Route requests to the worker that has the longest prefix match.


#### Disaggregated Router

The disaggregated router determines whether a request should be sent to a
remote or a local prefill engine, based on the prefill length. When prefilling
locally, the vLLM scheduler prioritizes the prefill request and pauses any
ongoing decode requests.

There are two types of disaggregated router implementations:
* Rust native: provides a simple heuristic that routes to the prefill engine
  if the prefill length (including the prefix cache hit) is greater than a threshold.
  This threshold can be dynamically adjusted at runtime through etcd.

  To check the current threshold (this will print out all kv pairs in etcd):
  ```
  curl -s -L http://localhost:2379/v3/kv/range -X POST -d '{"key":"AA==", "range_end":"AA=="}' | jq -r '.kvs[] | "KEY: \(.key | @base64d)\nVALUE: \(.value | @base64d)\n---"'
  ```

  To update the threshold:
  ```
  ETCDCTL_API=3 etcdctl --endpoints=http://localhost:2379 put 'public/components/disagg_router/models/chat/<vllm.served_model_name(default to "vllm")>' '{"max_local_prefill_length": <new_threshold>}'
  ```

* Python customized: provides a Python implementation that can be easily customized;
  however, it does not support dynamic threshold adjustment through etcd.
  It is recommended to use the custom disaggregated router together with the custom
  KV router, as the Rust KV router does not report the KV cache hit ratio.
  To use the Python disaggregated router, add the following flags when launching
  the decode worker (a sketch of this policy follows the list):
  ```
  python worker.py \
    --custom-disagg-router \
    --max-local-prefill-length <length> \
    --max-remote-prefill-cache-hit-ratio <ratio>
  ```
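
As a sketch of the Python-customized policy (illustrative only; the actual implementation may differ), a request is prefilled remotely only when it is long enough and not already mostly covered by the local prefix cache:

```python
def should_prefill_remotely(prefill_len: int, cache_hit_ratio: float,
                            max_local_prefill_length: int,
                            max_remote_prefill_cache_hit_ratio: float) -> bool:
    # Send long prefills to the remote prefill engine, but keep them local
    # when the prefix cache already covers most of the work.
    return (prefill_len > max_local_prefill_length
            and cache_hit_ratio <= max_remote_prefill_cache_hit_ratio)
```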

#### Workers

```
# start prefill worker in Terminal 1
# Note: prefix caching is not supported in the prefill worker for now
cd /workspace/examples/python_rs/llm/vllm_nixl
CUDA_VISIBLE_DEVICES=0 python3 router/prefill_worker.py \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --enforce-eager \
    --kv-transfer-config '{"kv_connector":"DynamoNixlConnector"}' \
    --block-size 64 \
    --max-num-batched-tokens 16384 \
    --max-model-len 16384

# start decode worker in Terminal 2
cd /workspace/examples/python_rs/llm/vllm_nixl
CUDA_VISIBLE_DEVICES=1 python3 router/worker.py \
    --remote-prefill \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --enforce-eager \
    --tensor-parallel-size 1 \
    --kv-transfer-config '{"kv_connector":"DynamoNixlConnector"}' \
    --enable-prefix-caching \
    --block-size 64 \
    --max-num-batched-tokens 16384 \
    --max-model-len 16384
```

Alternatively, we also provide a script that launches all workers in one go (with the Python-customized router):
```
# TODO: change to dynamo-deploy functionality
./start_single_node.sh
# Usage [--model <model>] [--p_tensor_parallel_size <size>] [--d_tensor_parallel_size <size>] [--max_model_len <len>] [--max_num_batched_tokens <tokens>] [--max_num_seqs <seqs>] [--gpu_memory_utilization <utilization>] [--enable_chunked_prefill <True/False>] [--num_p <p>] [--num_d <d>]
```

### Common Issues

If the torch GLOO backend complains about the file name being too long, set:
```
export GLOO_SOCKET_IFNAME=lo
```

## Client

In another terminal:
```
# this test request has an input sequence length (ISL) of around 200 tokens
curl localhost:8181/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
    {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
    }
    ],
    "stream": false,
    "max_tokens": 30
  }'
```
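
Since the endpoint is OpenAI-compatible, the same request can also be sent with the OpenAI Python client (a sketch, assuming `pip install openai`; the API key is unused but required by the client):

```python
from openai import OpenAI

# Point the client at the local Dynamo HTTP server started above.
client = OpenAI(base_url="http://localhost:8181/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "Tell me a short story about the lost city of Aeloria."}],
    stream=False,
    max_tokens=30,
)
print(resp.choices[0].message.content)
```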

## Run genai-perf

`genai-perf` is a tool for profiling and benchmarking LLM servers. It is already installed in the container. For more details, please refer to the [genai-perf README](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html).

```
genai-perf profile \
  -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --url localhost:8181 \
  --endpoint-type chat \
  --streaming \
  --service-kind openai \
  --endpoint v1/chat/completions \
  --warmup-request-count 10 \
  --random-seed 123 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-stddev 0 \
  --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --synthetic-input-tokens-mean 3000 \
  --output-tokens-mean 150 \
  --extra-inputs min_tokens:150 \
  --extra-inputs max_tokens:150 \
  --profile-export-file my_profile_export.json \
  --artifact-dir artifacts/ \
  --concurrency 10 \
  --request-count 40 \
  -- -v \
  --async
```

## Close deployment

Kill all python processes and clean up metadata files:

```
pkill -9 -f python
```

## TODOs, limitations, known issues

- [ ] Add etcd for discovery
- [ ] Multi-node deployment support
- [ ] Enable chunked prefill
- [ ] Support mixed tp
- [ ] Process many remote prefill in one iteration
- [ ] Support recompute preemption
- [ ] Make sure decode does not preempt blocks before xfer finishes
- [ ] Layer wise transfer
- [ ] Non blocking send in prefill (cache manager should check xfer status)
- [ ] Test under load
- [ ] Support pp > 1
- [ ] Check why adding extra seed input is crashing vllm with remote prefill
- [ ] Unified worker for both prefill and decode
- [x] Require sending two parallel requests to start decode for the first time
- [x] Concurrency > 2 is not working
- [x] Parse cmdline args
- [x] Manual nixl example with tp1
- [x] Zero copy
- [x] Conditional remote prefill
- [x] Manual example with tp > 1
- [x] Run on dynamo distributed runtime
- [x] Add OAI HTTP endpoint
- [x] Sample only on decode, do not return remote prefill response
- [x] Check if all transfers finished before moving to decode
- [x] Enable async output processing - could be working