Unverified commit 22cacbb1 authored by GuanLuo, committed by GitHub

docs: add aggregated deployment guide for multi-node sized model (#713)

parent 63026b6c
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Common:
  model: deepseek-ai/DeepSeek-R1
  block-size: 64
  max-model-len: 16384

Frontend:
  served_model_name: deepseek-ai/DeepSeek-R1
  endpoint: dynamo.Processor.chat/completions
  port: 8000

Processor:
  router: round-robin
  common-configs: [model, block-size, max-model-len]

VllmWorker:
  enforce-eager: true
  max-num-batched-tokens: 16384
  enable-prefix-caching: true
  router: random
  tensor-parallel-size: 16
  ServiceArgs:
    workers: 1
    resources:
      gpu: 1
  common-configs: [model, block-size, max-model-len]
Table of Contents
- [Single node sized models](#single-node-sized-models)
- [Multi-node sized models](#multi-node-sized-models)
## Single node sized models
You can deploy Dynamo on multiple nodes via NATS/etcd-based discovery and communication. Here's an example of deploying disaggregated serving on 3 nodes using `nvidia/Llama-3.1-405B-Instruct-FP8`. Each node must be properly configured with InfiniBand and/or RoCE for communication between decode and prefill workers.
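The discovery layer means every node needs to know where NATS and etcd live. A minimal sketch, assuming the `NATS_SERVER` and `ETCD_ENDPOINTS` environment variables and placeholder addresses (substitute your own servers):

```shell
# Set on every node so components can discover each other.
# The addresses below are placeholders, not real servers.
export NATS_SERVER='nats://10.0.0.1:4222'      # must use the nats:// scheme
export ETCD_ENDPOINTS='http://10.0.0.1:2379'   # etcd endpoint(s)
```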
**Step 3**: Create a configuration file for this node. We've provided a sample one for you in `configs/multinode-405b.yaml` for the 405B model. Note that we still include the `PrefillWorker` component in the configuration file even though we are not using it on node 1. This is because we can reuse the same configuration file on all nodes and just spin up individual workers on the other ones.
**Step 4**: Start the frontend, processor, router, and VllmWorker on node 1.
```bash
# node 1
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg_router:Frontend -f ./configs/multinode-405b.yaml
```
**Step 5**: Start the first prefill worker on node 2.
Since we only want to start the `PrefillWorker` on node 2, you can run the PrefillWorker component directly with the same configuration file from before.
```bash
cd $DYNAMO_HOME/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f ./configs/multinode-405b.yaml
```
**Step 6**: Start the second prefill worker on node 3.
```bash
# node 3
export NATS_SERVER='<your-nats-server-address>' # note: this should start with nats://
## Multi-node sized models
Multinode model support is coming soon. You can track progress [here](https://github.com/ai-dynamo/dynamo/issues/513)!
### Aggregated Deployment
The steps for aggregated deployment of multi-node sized models are similar to
those for single-node sized models, except that you must first interconnect the nodes
according to the framework's multi-node deployment guide.
In the example below, vLLM serves the `DeepSeek-R1` model with tensor parallelism
of 16 across two nodes of 8 H100 GPUs each.
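A quick sanity check of the topology before deploying: `tensor-parallel-size` must equal the total GPU count the cluster aggregates (here, the assumed 2 nodes x 8 GPUs):

```shell
# Sanity check (assumed topology): tensor-parallel-size must match the
# total number of GPUs available across the cluster.
NODES=2
GPUS_PER_NODE=8
TP_SIZE=16
TOTAL_GPUS=$((NODES * GPUS_PER_NODE))
[ "$TOTAL_GPUS" -eq "$TP_SIZE" ] && echo "tensor-parallel-size matches the ${TOTAL_GPUS} available GPUs"
```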
**Step 1**: On each node, set up a Ray cluster so that vLLM can access the GPUs of both
nodes collectively:
```bash
# head node
ray start --head --port=6379
# example output; take note of the head node's IP address
# Local node IP: <head-node-address>
# set vLLM env arg
export VLLM_HOST_IP=<head-node-address>
# other node
ray start --address=<head-node-address>:6379
export VLLM_HOST_IP=<current-node-address>
# verify the accessibility by checking aggregated GPU count shown in ray status
ray status
# Expected/Sample output for 2 nodes:
# ======== Autoscaler status: 2025-04-16 15:35:42.751688 ========
# Node status
# ---------------------------------------------------------------
# Active:
# 1 node_<hash_1>
# 1 node_<hash_2>
# Pending:
# (no pending nodes)
# Recent failures:
# (no failures)
# Resources
# ---------------------------------------------------------------
# Usage:
# XXX CPU
# XXX GPU
# XXX memory
# XXX object_store_memory
# Demands:
# (no resource demands)
```
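If you want to script the verification rather than eyeball `ray status`, the GPU usage line can be parsed with plain shell. A sketch, where the sample string stands in for one line of real `ray status` output (on a live cluster you would pipe `ray status` through the same extraction):

```shell
# Extract the aggregated GPU count from a `ray status`-style usage line.
# The sample line below is illustrative, not captured output.
usage_line='0.0/16.0 GPU'
gpus=${usage_line#*/}     # strip everything up to and including the '/'
gpus=${gpus%% *}          # keep only the number before ' GPU'
echo "aggregated GPUs: $gpus"
```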
**Step 2**: On the head node, follow the [LLM Deployment Guide](./README.md#getting-started) to
set up the Dynamo deployment for aggregated serving, using the configuration file
`configs/multinode_agg_r1.yaml` for DeepSeek-R1:
```bash
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg:Frontend -f ./configs/multinode_agg_r1.yaml
```
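Once the components report ready, you can confirm the frontend is reachable before sending inference traffic. A sketch, assuming an OpenAI-compatible `/v1/models` endpoint, `localhost` as a placeholder for the head node's address, and the port 8000 from the `Frontend` config above:

```shell
# Build the probe URL from the frontend's configured port (8000 above).
HEAD_NODE_IP=localhost   # placeholder; use the head node's real address
MODELS_URL="http://${HEAD_NODE_IP}:8000/v1/models"
echo "probing ${MODELS_URL}"
# On the head node, uncomment to list the served models:
# curl -s "${MODELS_URL}"
```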
### Client
In another terminal, you can send the same curl request as described above, but
with `"model": "deepseek-ai/DeepSeek-R1"`:
```bash
# this test request has an input sequence length (ISL) of roughly 200 tokens
curl <node1-ip>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": true,
"max_tokens": 300
}'
```
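With `"stream": true`, the response arrives as server-sent events, where each `data:` line carries one JSON chunk. A minimal sketch of pulling the generated text out of one such chunk (the sample line is illustrative, not captured output):

```shell
# Parse one SSE chunk of a streamed chat completion.
# The sample line stands in for a real "data: {...}" line from the server.
chunk='data: {"choices":[{"delta":{"content":"Hello"}}]}'
content=$(printf '%s\n' "$chunk" \
  | sed -n 's/^data: //p' \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["choices"][0]["delta"]["content"])')
echo "$content"
```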