Commit 594f8c04 authored by Jacky, committed by GitHub

docs: Guides for single node benchmarking (#509)


Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
Co-authored-by: Piotr Marcinkiewicz <piotrm@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
parent 9ce49c2c
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# LLM Deployment Benchmarking Guide
This guide provides detailed steps on benchmarking Large Language Models (LLMs) in single and multi-node configurations.
> [!NOTE]
> We recommend trying out the [LLM Deployment Examples](./README.md) before benchmarking.
## Prerequisites
One or more H100 80GB x8 nodes are required for benchmarking.
1\. Build benchmarking image
```bash
./container/build.sh
```
2\. Download model
```bash
huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
```
3\. Start NATS and ETCD
```bash
docker compose -f deploy/docker_compose.yml up -d
```
## Disaggregated Single Node Benchmarking
In the following steps, we compare the performance of Dynamo's disaggregated vLLM serving on a single node against a
[native vLLM aggregated baseline](#vllm-aggregated-baseline-benchmarking). Both configurations were chosen to maximize
Output Token Throughput (tokens/sec) while operating under similar Inter-Token Latency (ms).
For guidance on tuning for your own use case, see the [Performance Tuning Guide](/docs/guides/disagg_perf_tuning.md).
One H100 80GB x8 node is required for this setup.
With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:
1\. Run benchmarking container
```bash
./container/run.sh -it \
-v <huggingface_hub>:/root/.cache/huggingface/hub \
-v <dynamo_repo>:/workspace
```
2\. Start disaggregated services
```bash
cd /workspace/examples/llm
dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
```
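The `benchmarks.disagg` graph links the components into a request pipeline: Frontend → Processor → VllmWorker → PrefillWorker. The fluent `link` pattern can be illustrated with a toy sketch (this is only an illustration of the chaining style, not the Dynamo SDK itself):

```python
class Component:
    """Minimal stand-in for a linkable service component."""

    def __init__(self, name):
        self.name = name
        self.downstream = []

    def link(self, other):
        # Record the edge and return the far end so calls can chain,
        # mirroring Frontend.link(Processor).link(VllmWorker)...
        self.downstream.append(other)
        return other


frontend, processor = Component("Frontend"), Component("Processor")
worker, prefill = Component("VllmWorker"), Component("PrefillWorker")
frontend.link(processor).link(worker).link(prefill)

# Walk the chain to show the resulting pipeline order.
order, node = [], frontend
while node:
    order.append(node.name)
    node = node.downstream[0] if node.downstream else None
print(" -> ".join(order))  # Frontend -> Processor -> VllmWorker -> PrefillWorker
```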
Note: Check `disagg.log` to make sure the service has fully started before collecting performance numbers.
Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
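As a sanity check on the disaggregated layout, the `ServiceArgs` in `benchmarks/disagg.yaml` allocate one TP=4 decode worker and four TP=1 prefill workers, which together exactly fill one H100 x8 node. A quick sketch of that arithmetic, with values copied from the YAML:

```python
# (workers, GPUs per worker) as declared in benchmarks/disagg.yaml
allocations = {
    "VllmWorker": (1, 4),     # one TP=4 decode worker
    "PrefillWorker": (4, 1),  # four TP=1 prefill workers
}

total = sum(workers * gpus for workers, gpus in allocations.values())
print("total GPUs:", total)  # total GPUs: 8
assert total == 8  # fits a single H100 80GB x8 node
```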
## vLLM Aggregated Baseline Benchmarking
One H100 80GB x8 node is required for this setup.
With the Dynamo repository and the benchmarking image available, perform the following steps:
1\. Run benchmarking container
```bash
./container/run.sh -it \
-v <huggingface_hub>:/root/.cache/huggingface/hub \
-v <dynamo_repo>:/workspace
```
2\. Start vLLM serve
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
--block-size 128 \
--max-model-len 3500 \
--max-num-batched-tokens 3500 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.95 \
--disable-log-requests \
--port 8001 1> vllm_0.log 2>&1 &
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
--block-size 128 \
--max-model-len 3500 \
--max-num-batched-tokens 3500 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.95 \
--disable-log-requests \
--port 8002 1> vllm_1.log 2>&1 &
```
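The two `vllm serve` instances pin disjoint GPU groups via `CUDA_VISIBLE_DEVICES`, so each TP=4 replica owns half the node. A quick sanity check of that layout:

```python
tensor_parallel_size = 4  # from the vllm serve flags above

# GPU groups as set by CUDA_VISIBLE_DEVICES for each instance
groups = [[0, 1, 2, 3], [4, 5, 6, 7]]

assert all(len(g) == tensor_parallel_size for g in groups)

# Groups must be disjoint and together cover the whole H100 x8 node.
flat = [gpu for group in groups for gpu in group]
assert len(flat) == len(set(flat)) == 8
assert set(flat) == set(range(8))
print("GPU layout OK:", groups)
```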
Notes:
* Check `vllm_0.log` and `vllm_1.log` to make sure both services have fully started before collecting performance numbers.
* The `vllm serve` configuration should closely match the corresponding disaggregated benchmarking configuration.
3\. Use NGINX as a load balancer
```bash
apt update && apt install -y nginx
cp /workspace/examples/llm/benchmarks/nginx.conf /etc/nginx/nginx.conf
service nginx restart
```
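The `least_conn` directive in the provided `nginx.conf` routes each request to the upstream with the fewest active connections, which suits long-running streaming LLM requests better than plain round-robin. A toy sketch of that policy in Python (illustrative only, not nginx internals; connection counts are made up):

```python
def least_conn(active):
    """Pick the upstream with the fewest active connections.

    `active` maps upstream address -> current connection count.
    Ties are broken by insertion order, like a stable min.
    """
    return min(active, key=active.get)


# Two local vLLM upstreams, as in nginx.conf.
active = {"127.0.0.1:8001": 3, "127.0.0.1:8002": 1}
print(least_conn(active))  # 127.0.0.1:8002
```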
Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
## Collecting Performance Numbers
Run the benchmarking script
```bash
bash -x /workspace/examples/llm/benchmarks/perf.sh
```
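`perf.sh` scales its genai-perf request counts with the concurrency level, and its sequence lengths are chosen to fit the serving configuration's context window. The arithmetic can be sketched as follows (multipliers and lengths taken from the script and configs):

```python
def request_plan(concurrency):
    """Mirror perf.sh's per-concurrency genai-perf arguments."""
    return {
        "request_count": concurrency * 10,
        "warmup_request_count": concurrency * 2,
        "num_dataset_entries": concurrency * 12,
    }


for c in (1, 64, 256):
    print(c, request_plan(c))

# The input/output lengths must fit the max-model-len used in both setups.
isl, osl = 3000, 150   # from perf.sh
max_model_len = 3500   # from the vllm serve flags and disagg.yaml
assert isl + osl <= max_model_len, "requests would exceed the context window"
```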
## Future Roadmap
* Disaggregated Multi Node Benchmarking
* Results Interpretation
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from components.frontend import Frontend
from components.prefill_worker import PrefillWorker
from components.processor import Processor
from components.worker import VllmWorker
Frontend.link(Processor).link(VllmWorker).link(PrefillWorker)
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Frontend:
  # This model was chosen for its 70B size and FP8 precision: the TP and DP
  # configurations below were tuned for its size, and FP8 reduces model and
  # KV cache memory usage while easing remote KV cache transfer.
  served_model_name: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
  endpoint: dynamo.Processor.chat/completions
  port: 8000

Processor:
  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
  router: round-robin

# One process with 4 GPUs generating output tokens (the "decode" phase).
VllmWorker:
  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
  # Number of tokens per KV cache block; larger blocks allow more efficient
  # chunked transfers between GPUs.
  block-size: 128
  max-model-len: 3500
  # Enable KV cache transfer from prefill to decode workers.
  remote-prefill: true
  tensor-parallel-size: 4
  gpu-memory-utilization: 0.95
  disable-log-requests: true
  ServiceArgs:
    workers: 1
    resources:
      gpu: 4

# Four processes, each with 1 GPU, handling the initial prefill (context
# encoding) phase.
PrefillWorker:
  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
  block-size: 128
  max-model-len: 3500
  max-num-batched-tokens: 3500
  tensor-parallel-size: 1
  gpu-memory-utilization: 0.95
  disable-log-requests: true
  ServiceArgs:
    workers: 4
    resources:
      gpu: 1

# Note: No prefix caching is used, since all requests are expected to be unique.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
worker_processes 1;
worker_rlimit_nofile 4096;
events {
    worker_connections 2048;
    multi_accept on;
    use epoll;
}

http {
    upstream backend_servers {
        least_conn;
        # Select all the upstream vLLM servers to load balance across,
        # including those at a different node if applicable.
        server 127.0.0.1:8001 max_fails=3 fail_timeout=10000s;
        server 127.0.0.1:8002 max_fails=3 fail_timeout=10000s;
    }

    server {
        listen 8000;

        location / {
            proxy_pass http://backend_servers;
            proxy_http_version 1.1;
            proxy_read_timeout 240s;
        }
    }
}
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model=neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
# Input sequence length.
isl=3000
# Output sequence length.
osl=150
# Concurrency levels to test.
for concurrency in 1 2 4 8 16 32 64 128 256; do
  genai-perf profile \
    --model ${model} \
    --tokenizer ${model} \
    --service-kind openai \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url http://localhost:8000 \
    --synthetic-input-tokens-mean ${isl} \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean ${osl} \
    --output-tokens-stddev 0 \
    --extra-inputs max_tokens:${osl} \
    --extra-inputs min_tokens:${osl} \
    --extra-inputs ignore_eos:true \
    --concurrency ${concurrency} \
    --request-count $(($concurrency*10)) \
    --warmup-request-count $(($concurrency*2)) \
    --num-dataset-entries $(($concurrency*12)) \
    --random-seed 100 \
    -- \
    -v \
    --max-threads 256 \
    -H 'Authorization: Bearer NOT USED' \
    -H 'Accept: text/event-stream'
done