Commit 538b4630 authored by Jacky, committed by GitHub

docs: Guide for multi-node benchmarking (#561)


Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Meenakshi Sharma <163925564+nvda-mesharma@users.noreply.github.com>
parent 441846de
## Disaggregated Single Node Benchmarking
*One H100 80GB x8 node is required for this setup.*
In the following setup we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. These were chosen to optimize
for Output Token Throughput (per sec) when both are performing under similar Inter Token Latency (ms).
For more details on your use case please see the [Performance Tuning Guide](/docs/guides/disagg_perf_tuning.md).
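As a rough mental model (an illustration, not part of the benchmark), each concurrent stream decodes about `1000 / ITL_ms` tokens per second, so aggregate output token throughput is bounded by roughly `streams * 1000 / ITL_ms`:

```shell
# Back-of-envelope bound (illustrative only): with 64 concurrent streams
# each decoding one token every 20 ms, aggregate output token throughput
# cannot exceed 64 * 1000 / 20 = 3200 tokens/sec.
itl_ms=20
streams=64
echo "$(( streams * 1000 / itl_ms )) tok/s upper bound"
```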
In this setup, we will be using 4 prefill workers and 1 decode worker.
Each prefill worker will use tensor parallel 1 and the decode worker will use tensor parallel 4.
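As a sanity check (simple arithmetic, not a required step), this layout accounts for all eight GPUs on the node:

```shell
# 4 prefill workers * TP1 + 1 decode worker * TP4 = 8 GPUs,
# matching one H100 80GB x8 node.
prefill_workers=4; prefill_tp=1
decode_workers=1;  decode_tp=4
echo "GPUs required: $(( prefill_workers*prefill_tp + decode_workers*decode_tp ))"
```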
With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:
1\. Run benchmarking container
```bash
./container/run.sh --mount-workspace
```
Note: The Hugging Face cache mount can be changed by passing `--hf-cache`, e.g. `--hf-cache ~/.cache/huggingface`.
2\. Start disaggregated services
```bash
cd /workspace/examples/llm
dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
```
Note: Check the `disagg.log` to make sure the service is fully started before collecting performance numbers.
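Rather than eyeballing the log, a small helper can poll it until a readiness marker appears (the marker string below is an assumption; check your log for the actual text):

```shell
# Hypothetical helper: block until a pattern shows up in a log file,
# retrying once per second up to a limit.
wait_for_log() {
    local file="$1" pattern="$2" tries="${3:-60}"
    local i=0
    while [ "$i" -lt "$tries" ]; do
        grep -q "$pattern" "$file" 2>/dev/null && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}
# Usage (marker text is a guess): wait_for_log disagg.log "Uvicorn running"
```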
Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
## Disaggregated Multi Node Benchmarking
*Two H100 80GB x8 nodes are required for this setup.*
> [!Note]
> Nodes used for benchmarking were part of a cluster connected via InfiniBand
> NDR with 8 connections for compute and 2 for storage. Both fabrics were on
> their own fat tree non-blocking topology.
In the following steps we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. These were chosen to optimize
for Output Token Throughput (per sec) when both are performing under similar Inter Token Latency (ms).
For more details on your use case please see the [Performance Tuning Guide](/docs/guides/disagg_perf_tuning.md).
In this setup, we will be using 8 prefill workers and 1 decode worker.
Each prefill worker will use tensor parallel 1 and the decode worker will use tensor parallel 8.
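The per-node GPU math (illustrative arithmetic, assuming the decode worker runs on node 0 and the prefill workers on node 1, as in the steps below) works out to eight GPUs on each node:

```shell
# Node 0: 1 decode worker * TP8 = 8 GPUs (frontend + decode).
# Node 1: 8 prefill workers * TP1 = 8 GPUs.
echo "node0: $(( 1 * 8 )) GPUs, node1: $(( 8 * 1 )) GPUs"
```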
With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started on node 0**, perform the following steps:
1\. Run benchmarking container (node 0 & 1)
```bash
./container/run.sh --mount-workspace
```
Note: The Hugging Face cache mount can be changed by passing `--hf-cache`, e.g. `--hf-cache ~/.cache/huggingface`.
2\. Configure NATS and ETCD (node 1)
```bash
export NATS_SERVER="nats://<node_0_ip_addr>"
export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
```
Note: Node 1 must be able to reach Node 0 over the network for the above services.
3\. Start workers (node 0)
```bash
cd /workspace/examples/llm
dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
```
Note: Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.
4\. Start workers (node 1)
```bash
cd /workspace/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
```
Note: Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.
Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
## vLLM Aggregated Baseline Benchmarking
One (or two) H100 80GB x8 nodes are required for this setup.
With the Dynamo repository and the benchmarking image available, perform the following steps:
1\. Run benchmarking container
```bash
./container/run.sh --mount-workspace
```
Note: The Hugging Face cache mount can be changed by passing `--hf-cache`, e.g. `--hf-cache ~/.cache/huggingface`.
2\. Start vLLM serve
```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
```
Notes:
* Check the `vllm_0.log` and `vllm_1.log` to make sure the service is fully started before collecting performance numbers.
* The `vllm serve` configuration should closely match the corresponding disaggregated benchmarking configuration.
* If benchmarking over 2 nodes, use `--tensor-parallel-size 8` and run only one `vllm serve` instance per node.
3\. Use NGINX as load balancer
```bash
apt update && apt install -y nginx
cp /workspace/examples/llm/benchmarks/nginx.conf /etc/nginx/nginx.conf
service nginx restart
```
Note: If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to link to the `vllm serve` on the second node.
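For reference, a two-node `upstream` block might look like the sketch below (the upstream name, port, and address are assumptions; match them to the provided `nginx.conf` and your deployment):

```nginx
# Hypothetical two-node upstream; adjust names, ports, and addresses
# to match the shipped nginx.conf and your cluster.
upstream vllm {
    least_conn;
    server 127.0.0.1:8001;         # vllm serve on node 0
    server node1.example.com:8001; # vllm serve on node 1
}
```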
Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
bash -x /workspace/examples/llm/benchmarks/perf.sh
## Future Roadmap
* Results Interpretation
VllmWorker:
# Number of tokens in a batch for more efficient chunked transfers to GPUs.
block-size: 128
max-model-len: 3500
# Enable prefill at different workers.
remote-prefill: true
# Disable local prefill so only disaggregated prefill is used.
conditional-disagg: false
tensor-parallel-size: 4
gpu-memory-utilization: 0.95
disable-log-requests: true
PrefillWorker:
resources:
gpu: 1
# Automatic prefix caching is disabled by default, since all requests are expected to be unique.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from components.frontend import Frontend
from components.kv_router import Router
from components.processor import Processor
from components.worker import VllmWorker
Frontend.link(Processor).link(Router).link(VllmWorker)
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Frontend:
served_model_name: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
endpoint: dynamo.Processor.chat/completions
port: 8000
Processor:
model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
block-size: 128
max-model-len: 3500
# Routing policy determines how remote workers are selected for processing
# prefill requests
# 1. random: randomly select workers for prefill requests
# 2. round-robin: different prefill requests take similar time to complete so
# selecting workers in round-robin maximizes the chance of
# selecting the least busy worker for a request
# 3. kv: finding prefill workers by KV cache is not beneficial when caching is
#    disabled in this setup
router: round-robin
Router:
model-name: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
min-workers: 1
VllmWorker:
model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
block-size: 128
max-model-len: 3500
# Enable prefill at different workers.
remote-prefill: true
# Disable local prefill so only disaggregated prefill is used.
conditional-disagg: false
# TP size is doubled from single node setup
tensor-parallel-size: 8
gpu-memory-utilization: 0.95
disable-log-requests: true
router: round-robin
ServiceArgs:
workers: 1
resources:
gpu: 8
PrefillWorker:
model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
block-size: 128
max-model-len: 3500
max-num-batched-tokens: 3500
tensor-parallel-size: 1
gpu-memory-utilization: 0.95
disable-log-requests: true
ServiceArgs:
# DP size is doubled from single node setup
workers: 8
resources:
gpu: 1
# Automatic prefix caching is disabled by default, since all requests are expected to be unique.
model=neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
# Input Sequence Length (isl) 3000 and Output Sequence Length (osl) 150 are
# selected for chat use case. Note that for other use cases, the results and
# tuning would vary.
isl=3000
osl=150
# Concurrency levels to test
for concurrency in 1 2 4 8 16 32 64 128 256; do
genai-perf profile \
......