"git@developer.sourcefind.cn:OpenDAS/openpcdet.git" did not exist on "a1bd2d7bc3c093dc8a934f43b5f7a2993068d8e4"
Unverified Commit e06bfd55 authored by GuanLuo's avatar GuanLuo Committed by GitHub
Browse files

docs: R1 disaggregation guide (#720)

parent cce0c0f0
`configs/mutinode_disagg_r1.yaml`:

```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Common:
  model: deepseek-ai/DeepSeek-R1
  block-size: 64
  max-model-len: 16384
  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
  tensor-parallel-size: 16

Frontend:
  served_model_name: deepseek-ai/DeepSeek-R1
  endpoint: dynamo.Processor.chat/completions
  port: 8000

Processor:
  router: round-robin
  common-configs: [model, block-size]

VllmWorker:
  remote-prefill: true
  conditional-disagg: false
  ServiceArgs:
    workers: 1
    resources:
      gpu: 16
  common-configs: [model, block-size, max-model-len, kv-transfer-config, tensor-parallel-size]

PrefillWorker:
  max-num-batched-tokens: 16384
  ServiceArgs:
    workers: 1
    resources:
      gpu: 16
  common-configs: [model, block-size, max-model-len, kv-transfer-config, tensor-parallel-size]
```
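The `common-configs` key lets each component inherit selected settings from the `Common` block instead of repeating them. The following is an illustrative sketch of that inheritance (this is NOT Dynamo's actual config loader; the `resolve` helper is hypothetical):

```python
# Hypothetical sketch: each component copies only the keys listed in its
# `common-configs` from Common, and its own keys take precedence.
common = {
    "model": "deepseek-ai/DeepSeek-R1",
    "block-size": 64,
    "max-model-len": 16384,
    "kv-transfer-config": '{"kv_connector":"DynamoNixlConnector"}',
    "tensor-parallel-size": 16,
}

def resolve(component: dict, common: dict) -> dict:
    """Merge the keys named in `common-configs` into the component config."""
    merged = {k: common[k] for k in component.get("common-configs", [])}
    merged.update({k: v for k, v in component.items() if k != "common-configs"})
    return merged

vllm_worker = {
    "remote-prefill": True,
    "conditional-disagg": False,
    "common-configs": ["model", "block-size", "max-model-len",
                       "kv-transfer-config", "tensor-parallel-size"],
}

print(resolve(vllm_worker, common)["tensor-parallel-size"])  # prints: 16
```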
##### Disaggregated Deployment
In this example, we will deploy the model as two workers (one prefill worker
and one decode worker). We will use 4 H100x8 nodes and group every two of them
into one Ray cluster, in the same way as described in the aggregated deployment.
However, the etcd and NATS servers will run on only one node; consider that
node to be the head node of the whole deployment.
Note that if you are starting the etcd server directly instead of using `docker compose`,
you should add additional arguments so it is discoverable from the other nodes.
```bash
etcd --advertise-client-urls http://<head-node-ip>:2379 --listen-client-urls http://<head-node-ip>:2379,http://127.0.0.1:2379
```
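Similarly, the NATS server must listen on an address reachable from the other nodes. A minimal sketch, assuming a stock `nats-server` binary with JetStream enabled (which Dynamo deployments typically rely on):

```shell
# Bind on all interfaces so worker nodes can reach the server;
# -js enables JetStream persistence.
nats-server -js --addr 0.0.0.0 --port 4222
```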
**Step 1**: On every two nodes, set up a Ray cluster as described in
[aggregated deployment](#aggregated-deployment). After that, you should have
two independent Ray clusters, each with access to 16 GPUs.
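For reference, the per-cluster setup typically looks like the following. This is a sketch of standard Ray CLI usage; the port and IP are placeholders, and your aggregated-deployment setup may differ:

```shell
# On the first node of each pair, start the Ray head:
ray start --head --port=6379

# On the second node of the pair, join that head:
ray start --address='<ray-head-ip>:6379'
```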
**Step 2**: Start the deployment by running a different flavor of `dynamo serve`
on one node of each Ray cluster, using the configuration file
`configs/mutinode_disagg_r1.yaml`.
For decode, the command below is used, and the node it runs on becomes the
entry point of the whole deployment; in other words, requests should be sent
to this node's IP.
```bash
# if not head node
export NATS_SERVER='nats://<nats-server-ip>:4222'
export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg:Frontend -f ./configs/mutinode_disagg_r1.yaml
```
For prefill:
```bash
# if not head node
export NATS_SERVER='nats://<nats-server-ip>:4222'
export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f ./configs/mutinode_disagg_r1.yaml
```
### Client
In another terminal, you can send the same curl request as described in
[aggregated deployment](#aggregated-deployment), addressed to the IP of
the decode node.
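For convenience, a request of that shape looks like the following. The message content is illustrative; `stream` and `max_tokens` mirror the aggregated example, and the model name matches `served_model_name` in the config above:

```shell
curl <decode-node-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [
      {"role": "user", "content": "Explain KV-cache transfer in one paragraph."}
    ],
    "stream": true,
    "max_tokens": 300
  }'
```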