"git@developer.sourcefind.cn:OpenDAS/openpcdet.git" did not exist on "a1bd2d7bc3c093dc8a934f43b5f7a2993068d8e4"
Unverified Commit e06bfd55 authored by GuanLuo's avatar GuanLuo Committed by GitHub
Browse files

docs: R1 disaggregation guide (#720)

parent cce0c0f0
`configs/mutinode_disagg_r1.yaml`:

```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Common:
  model: deepseek-ai/DeepSeek-R1
  block-size: 64
  max-model-len: 16384
  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
  tensor-parallel-size: 16

Frontend:
  served_model_name: deepseek-ai/DeepSeek-R1
  endpoint: dynamo.Processor.chat/completions
  port: 8000

Processor:
  router: round-robin
  common-configs: [model, block-size]

VllmWorker:
  remote-prefill: true
  conditional-disagg: false
  ServiceArgs:
    workers: 1
    resources:
      gpu: 16
  common-configs: [model, block-size, max-model-len, kv-transfer-config, tensor-parallel-size]

PrefillWorker:
  max-num-batched-tokens: 16384
  ServiceArgs:
    workers: 1
    resources:
      gpu: 16
  common-configs: [model, block-size, max-model-len, kv-transfer-config, tensor-parallel-size]
```
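The `common-configs` key lets each component inherit selected settings from the `Common` block instead of repeating them. The following is an illustrative sketch of that inheritance (this is NOT Dynamo's actual config loader; the `resolve` helper is hypothetical):

```python
# Hypothetical sketch: each component copies only the keys listed in its
# `common-configs` from Common, and its own keys take precedence.
common = {
    "model": "deepseek-ai/DeepSeek-R1",
    "block-size": 64,
    "max-model-len": 16384,
    "kv-transfer-config": '{"kv_connector":"DynamoNixlConnector"}',
    "tensor-parallel-size": 16,
}

def resolve(component: dict, common: dict) -> dict:
    """Merge the keys named in `common-configs` into the component config."""
    merged = {k: common[k] for k in component.get("common-configs", [])}
    merged.update({k: v for k, v in component.items() if k != "common-configs"})
    return merged

vllm_worker = {
    "remote-prefill": True,
    "conditional-disagg": False,
    "common-configs": ["model", "block-size", "max-model-len",
                       "kv-transfer-config", "tensor-parallel-size"],
}

print(resolve(vllm_worker, common)["tensor-parallel-size"])  # prints: 16
```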
##### Disaggregated Deployment
In this example, we will deploy the model as two workers (one prefill worker
and one decode worker). We will use 4 H100x8 nodes and group every two of them
into one Ray cluster, in the same way as described in the aggregated deployment.
However, the etcd and NATS servers will run on only one node; consider that
node to be the head node of the whole deployment.
Note that if you are starting the etcd server directly instead of using `docker compose`,
you should add additional arguments so it is discoverable from the other nodes.
```bash
etcd --advertise-client-urls http://<head-node-ip>:2379 --listen-client-urls http://<head-node-ip>:2379,http://127.0.0.1:2379
```
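Similarly, the NATS server must listen on an address reachable from the other nodes. A minimal sketch, assuming a stock `nats-server` binary with JetStream enabled (which Dynamo deployments typically rely on):

```shell
# Bind on all interfaces so worker nodes can reach the server;
# -js enables JetStream persistence.
nats-server -js --addr 0.0.0.0 --port 4222
```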
**Step 1**: On every two nodes, set up a Ray cluster as described in
[aggregated deployment](#aggregated-deployment). After that, you should have
two independent Ray clusters, each with access to 16 GPUs.
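For reference, the per-cluster setup typically looks like the following. This is a sketch of standard Ray CLI usage; the port and IP are placeholders, and your aggregated-deployment setup may differ:

```shell
# On the first node of each pair, start the Ray head:
ray start --head --port=6379

# On the second node of the pair, join that head:
ray start --address='<ray-head-ip>:6379'
```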
**Step 2**: Start the deployment by running a different flavor of `dynamo serve`
on one node of each Ray cluster, using the configuration file
`configs/mutinode_disagg_r1.yaml`.
For decode, the command below is used, and the node it runs on becomes the
entry point of the whole deployment; in other words, requests should be sent
to this node's IP.
```bash
# if not head node
export NATS_SERVER='nats://<nats-server-ip>:4222'
export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg:Frontend -f ./configs/mutinode_disagg_r1.yaml
```
For prefill:
```bash
# if not head node
export NATS_SERVER='nats://<nats-server-ip>:4222'
export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f ./configs/mutinode_disagg_r1.yaml
```
### Client
In another terminal, you can send the same curl request as described in
[aggregated deployment](#aggregated-deployment), addressed to the IP of
the decode node.
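For convenience, a request of that shape looks like the following. The message content is illustrative; `stream` and `max_tokens` mirror the aggregated example, and the model name matches `served_model_name` in the config above:

```shell
curl <decode-node-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [
      {"role": "user", "content": "Explain KV-cache transfer in one paragraph."}
    ],
    "stream": true,
    "max_tokens": 300
  }'
```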