Commit 538b4630 authored by Jacky, committed by GitHub

docs: Guide for multi-node benchmarking (#561)


Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Meenakshi Sharma <163925564+nvda-mesharma@users.noreply.github.com>
parent 441846de
## Disaggregated Single Node Benchmarking
*One H100 80GB x8 node is required for this setup.*
In the following setup we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. These were chosen to optimize
for Output Token Throughput (per sec) when both are performing under similar Inter Token Latency (ms).
For more details on your use case please see the [Performance Tuning Guide](/docs/guides/disagg_perf_tuning.md).
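As a rough mental model (an illustration, not part of the benchmark), each concurrent stream decodes about `1000 / ITL_ms` tokens per second, so aggregate output token throughput is bounded by roughly `streams * 1000 / ITL_ms`:

```shell
# Back-of-envelope bound (illustrative only): with 64 concurrent streams
# each decoding one token every 20 ms, aggregate output token throughput
# cannot exceed 64 * 1000 / 20 = 3200 tokens/sec.
itl_ms=20
streams=64
echo "$(( streams * 1000 / itl_ms )) tok/s upper bound"
```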
In this setup, we will be using 4 prefill workers and 1 decode worker.
Each prefill worker will use tensor parallel 1 and the decode worker will use tensor parallel 4.
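As a sanity check (simple arithmetic, not a required step), this layout accounts for all eight GPUs on the node:

```shell
# 4 prefill workers * TP1 + 1 decode worker * TP4 = 8 GPUs,
# matching one H100 80GB x8 node.
prefill_workers=4; prefill_tp=1
decode_workers=1;  decode_tp=4
echo "GPUs required: $(( prefill_workers*prefill_tp + decode_workers*decode_tp ))"
```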
With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:
1\. Run benchmarking container
```bash
./container/run.sh --mount-workspace
```
Note: The Hugging Face cache mount can be changed by passing `--hf-cache`, e.g. `--hf-cache ~/.cache/huggingface`.
2\. Start disaggregated services
```bash
cd /workspace/examples/llm
dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
```
Note: Check the `disagg.log` to make sure the service is fully started before collecting performance numbers.
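Rather than eyeballing the log, a small helper can poll it until a readiness marker appears (the marker string below is an assumption; check your log for the actual text):

```shell
# Hypothetical helper: block until a pattern shows up in a log file,
# retrying once per second up to a limit.
wait_for_log() {
    local file="$1" pattern="$2" tries="${3:-60}"
    local i=0
    while [ "$i" -lt "$tries" ]; do
        grep -q "$pattern" "$file" 2>/dev/null && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}
# Usage (marker text is a guess): wait_for_log disagg.log "Uvicorn running"
```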
Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
## Disaggregated Multi Node Benchmarking
*Two H100 80GB x8 nodes are required for this setup.*
> [!Note]
> Nodes used for benchmarking were part of a cluster connected via InfiniBand
> NDR with 8 connections for compute and 2 for storage. Both fabrics were on
> their own fat tree non-blocking topology.
In the following steps we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. These were chosen to optimize
for Output Token Throughput (per sec) when both are performing under similar Inter Token Latency (ms).
For more details on your use case please see the [Performance Tuning Guide](/docs/guides/disagg_perf_tuning.md).
In this setup, we will be using 8 prefill workers and 1 decode worker.
Each prefill worker will use tensor parallel 1 and the decode worker will use tensor parallel 8.
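The per-node GPU math (illustrative arithmetic, assuming the decode worker runs on node 0 and the prefill workers on node 1, as in the steps below) works out to eight GPUs on each node:

```shell
# Node 0: 1 decode worker * TP8 = 8 GPUs (frontend + decode).
# Node 1: 8 prefill workers * TP1 = 8 GPUs.
echo "node0: $(( 1 * 8 )) GPUs, node1: $(( 8 * 1 )) GPUs"
```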
With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started on node 0**, perform the following steps:
1\. Run benchmarking container (node 0 & 1)
```bash
./container/run.sh --mount-workspace
```
Note: The Hugging Face cache mount can be changed by passing `--hf-cache`, e.g. `--hf-cache ~/.cache/huggingface`.
2\. Configure NATS and ETCD (node 1)
```bash
export NATS_SERVER="nats://<node_0_ip_addr>"
export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
```
Note: Node 1 must be able to reach Node 0 over the network for the above services.
3\. Start workers (node 0)
```bash
cd /workspace/examples/llm
dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
```
Note: Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.
4\. Start workers (node 1)
```bash
cd /workspace/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
```
Note: Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.
Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
## vLLM Aggregated Baseline Benchmarking
One (or two) H100 80GB x8 nodes are required for this setup.
With the Dynamo repository and the benchmarking image available, perform the following steps:
1\. Run benchmarking container
```bash
./container/run.sh --mount-workspace
```
Note: The Hugging Face cache mount can be changed by passing `--hf-cache`, e.g. `--hf-cache ~/.cache/huggingface`.
2\. Start vLLM serve
```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
```
Notes:
* Check the `vllm_0.log` and `vllm_1.log` to make sure the service is fully started before collecting performance numbers.
* The `vllm serve` configuration should closely match the corresponding disaggregated benchmarking configuration.
* If benchmarking over 2 nodes, use `--tensor-parallel-size 8` and run only one `vllm serve` instance per node.
3\. Use NGINX as load balancer
```bash
apt update && apt install -y nginx
cp /workspace/examples/llm/benchmarks/nginx.conf /etc/nginx/nginx.conf
service nginx restart
```
Note: If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to link to the `vllm serve` on the second node.
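For reference, a two-node `upstream` block might look like the sketch below (the upstream name, port, and address are assumptions; match them to the provided `nginx.conf` and your deployment):

```nginx
# Hypothetical two-node upstream; adjust names, ports, and addresses
# to match the shipped nginx.conf and your cluster.
upstream vllm {
    least_conn;
    server 127.0.0.1:8001;         # vllm serve on node 0
    server node1.example.com:8001; # vllm serve on node 1
}
```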
Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
bash -x /workspace/examples/llm/benchmarks/perf.sh
## Future Roadmap
* Results Interpretation
VllmWorker:
# Number of tokens in a batch for more efficient chunked transfers to GPUs.
block-size: 128
max-model-len: 3500
# Enable prefill at different workers.
remote-prefill: true
# Disable local prefill so only disaggregated prefill is used.
conditional-disagg: false
tensor-parallel-size: 4
gpu-memory-utilization: 0.95
disable-log-requests: true
PrefillWorker:
resources:
gpu: 1
# Automatic prefix caching is disabled by default, since all requests are expected to be unique.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from components.frontend import Frontend
from components.kv_router import Router
from components.processor import Processor
from components.worker import VllmWorker
Frontend.link(Processor).link(Router).link(VllmWorker)
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Frontend:
served_model_name: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
endpoint: dynamo.Processor.chat/completions
port: 8000
Processor:
model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
block-size: 128
max-model-len: 3500
# Routing policy determines how remote workers are selected for processing
# prefill requests
# 1. random: randomly select workers for prefill requests
# 2. round-robin: different prefill requests take similar time to complete so
# selecting workers in round-robin maximizes the chance of
# selecting the least busy worker for a request
# 3. kv: finding prefill workers by KV cache is not beneficial when caching is
#    disabled in this setup
router: round-robin
Router:
model-name: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
min-workers: 1
VllmWorker:
model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
block-size: 128
max-model-len: 3500
# Enable prefill at different workers.
remote-prefill: true
# Disable local prefill so only disaggregated prefill is used.
conditional-disagg: false
# TP size is doubled from single node setup
tensor-parallel-size: 8
gpu-memory-utilization: 0.95
disable-log-requests: true
router: round-robin
ServiceArgs:
workers: 1
resources:
gpu: 8
PrefillWorker:
model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
block-size: 128
max-model-len: 3500
max-num-batched-tokens: 3500
tensor-parallel-size: 1
gpu-memory-utilization: 0.95
disable-log-requests: true
ServiceArgs:
# DP size is doubled from single node setup
workers: 8
resources:
gpu: 1
# Automatic prefix caching is disabled by default, since all requests are expected to be unique.
model=neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
# Input Sequence Length (isl) 3000 and Output Sequence Length (osl) 150 are
# selected for chat use case. Note that for other use cases, the results and
# tuning would vary.
isl=3000
osl=150
# Concurrency levels to test
for concurrency in 1 2 4 8 16 32 64 128 256; do
genai-perf profile \
......