chore: Move Benchmarking to Top Level (#1461)

Signed-off-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com>

Commit 2ae9ab9f, authored Jun 11, 2025 by J Wyman, committed by GitHub Jun 11, 2025
4 changed files
--- a/benchmarks/llm/README.md
+++ b/benchmarks/llm/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)
--- a/examples/llm/benchmarks/nginx.conf
+++ b/examples/llm/benchmarks/nginx.conf
--- a/examples/llm/benchmarks/perf.sh
+++ b/examples/llm/benchmarks/perf.sh
--- a/examples/llm/benchmarks/README.md
+++ b/examples/llm/benchmarks/README.md
@@ -22,37 +22,52 @@ This guide provides detailed steps on benchmarking Large Language Models (LLMs)
 > [!NOTE]
 > We recommend trying out the [LLM Deployment Examples](./README.md) before benchmarking.
 ## Prerequisites
-H100 80GB x8 node(s) are required for benchmarking.
+> [!Important]
+> At least one 8xH100-80GB node is required for the following instructions.
+ 1. Build benchmarking image
+    ```bash
+    ./container/build.sh
+    ```
+ 2. Download model
+    ```bash
+    huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+    ```
+ 3. Start NATS and ETCD
+    ```bash
+    docker compose -f deploy/docker_compose.yml up -d
+    ```
 > [!NOTE]
 > This guide was tested on node(s) with the following hardware configuration:
-> * **GPUs**: 8xH100 80GB HBM3 (GPU Memory Bandwidth 3.2 TBs)
+>
-> * **CPU**: 2x Intel Saphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (Base), 3.8 Ghz (Max boost), PCIe Gen5
+> * **GPUs**:
-> * **NVLink**: NVLink 4th Generation, 900 GB/s (GPU to GPU NVLink bidirectional bandwidth), 18 Links per GPU
+>   8xH100-80GB-HBM3 (GPU Memory Bandwidth 3.2 TB/s)
-> * **InfiniBand**: 8X400Gbit/s (Compute Links), 2X400Gbit/s (Storage Links)
+>
+> * **CPU**:
+>   2 x Intel Sapphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (Base), 3.8 GHz (Max boost), PCIe Gen5
+>
+> * **NVLink**:
+>   NVLink 4th Generation, 900 GB/s (GPU to GPU NVLink bidirectional bandwidth), 18 Links per GPU
+>
+> * **InfiniBand**:
+>   8x400Gbit/s (Compute Links), 2x400Gbit/s (Storage Links)
 >
 > Benchmarking with a different hardware configuration may yield suboptimal results.
-1\. Build benchmarking image
-```bash
-./container/build.sh
-```
-2\. Download model
-```bash
-huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
-```
-3\. Start NATS and ETCD
-```bash
-docker compose -f deploy/docker_compose.yml up -d
-```
 ## Disaggregated Single Node Benchmarking
-One H100 80GB x8 node is required for this setup.
+> [!Important]
+> One 8xH100-80GB node is required for the following instructions.
 In the following setup we compare Dynamo disaggregated vLLM performance to
 [native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. These were chosen to optimize
@@ -64,24 +79,32 @@ Each prefill worker will use tensor parallel 1 and the decode worker will use te
 With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:
-1\. Run benchmarking container
+ 1. Run benchmarking container
-```bash
-./container/run.sh --mount-workspace
-```
-Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
-2\. Start disaggregated services
+    ```bash
-```bash
+    ./container/run.sh --mount-workspace
-cd /workspace/examples/llm
+    ```
-dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
-```
+    > [!Tip]
-Note: Check the `disagg.log` to make sure the service is fully started before collecting performance numbers.
+    > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+ 2. Start disaggregated services
+    ```bash
+    cd /workspace/examples/llm
+    dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
+    ```
-Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
+    > [!Tip]
+    > Check the `disagg.log` to make sure the service is fully started before collecting performance numbers.
-## Disaggregated Multi Node Benchmarking
+ 3. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
-Two H100 80GB x8 nodes are required for this setup.
+## Disaggregated Multinode Benchmarking
+> [!Important]
+> Two 8xH100-80GB nodes are required for the following instructions.
 In the following steps we compare Dynamo disaggregated vLLM performance to
 [native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. These were chosen to optimize
@@ -93,50 +116,68 @@ Each prefill worker will use tensor parallel 1 and the decode worker will use te
 With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started on node 0**, perform the following steps:
-1\. Run benchmarking container (node 0 & 1)
+ 1. Run benchmarking container (nodes 0 & 1)
-```bash
-./container/run.sh --mount-workspace
-```
-Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
-2\. Config NATS and ETCD (node 1)
+    ```bash
-```bash
+    ./container/run.sh --mount-workspace
-export NATS_SERVER="nats://<node_0_ip_addr>"
+    ```
-export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
-```
-Note: Node 1 must be able to reach Node 0 over the network for the above services.
-3\. Start workers (node 0)
+    > [!Tip]
-```bash
+    > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
-cd /workspace/examples/llm
-dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
-```
-Note: Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.
-4\. Start workers (node 1)
+ 2. Configure NATS and ETCD (node 1)
-```bash
-cd /workspace/examples/llm
+    ```bash
-dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
+    export NATS_SERVER="nats://<node_0_ip_addr>"
-```
+    export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
-Note: Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.
+    ```
+    > [!Important]
+    > Node 1 must be able to reach Node 0 over the network for the above services.
+ 3. Start workers (node 0)
+    ```bash
+    cd /workspace/examples/llm
+    dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
+    ```
+    > [!Tip]
+    > Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.
+ 4. Start workers (node 1)
+    ```bash
+    cd /workspace/examples/llm
+    dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
+    ```
+    > [!Tip]
+    > Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.
+ 5. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
-Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section above.
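The node 1 configuration in step 2 above can be wrapped in a small helper so the same IP address is used for both services. This is a minimal sketch, not part of the commit: the function name `node0_exports` is hypothetical, and it only reproduces the two `export` lines shown in the guide (NATS without an explicit port, ETCD on port 2379).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the repository): print the exports that node 1
# needs in order to reach NATS and ETCD running on node 0.
node0_exports() {
  local ip="${1:-}"
  if [ -z "$ip" ]; then
    echo "usage: node0_exports <node_0_ip_addr>" >&2
    return 1
  fi
  # Same values as in the guide, with the placeholder replaced by the argument.
  printf 'export NATS_SERVER="nats://%s"\n' "$ip"
  printf 'export ETCD_ENDPOINTS="%s:2379"\n' "$ip"
}
```

On node 1, `eval "$(node0_exports <node_0_ip_addr>)"` would set both variables in the current shell before starting the prefill worker.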
 ## vLLM Aggregated Baseline Benchmarking
-One (or two) H100 80GB x8 nodes are required for this setup.
+> [!Important]
+> One (or two) 8xH100-80GB nodes are required for the following instructions.
 With the Dynamo repository and the benchmarking image available, perform the following steps:
-1\. Run benchmarking container
+ 1. Run benchmarking container
-```bash
-./container/run.sh --mount-workspace
-```
-Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
-2\. Start vLLM serve
+    ```bash
-```bash
+    ./container/run.sh --mount-workspace
-CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+    ```
+    > [!Tip]
+    > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
+ 2. Start vLLM serve
+    ```bash
+    CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
      --block-size 128 \
      --max-model-len 3500 \
      --max-num-batched-tokens 3500 \
@@ -144,7 +185,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70
      --gpu-memory-utilization 0.95 \
      --disable-log-requests \
      --port 8001 1> vllm_0.log 2>&1 &
-CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+    CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
      --block-size 128 \
      --max-model-len 3500 \
      --max-num-batched-tokens 3500 \
@@ -152,28 +193,59 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70
      --gpu-memory-utilization 0.95 \
      --disable-log-requests \
      --port 8002 1> vllm_1.log 2>&1 &
-```
+    ```
-Notes:
-* Check the `vllm_0.log` and `vllm_1.log` to make sure the service is fully started before collecting performance numbers.
-* If benchmarking over 2 nodes, `--tensor-parallel-size 8` should be used and only run one `vllm serve` instance per node.
-3\. Use NGINX as load balancer
+    > [!Tip]
-```bash
+    > Check the `vllm_0.log` and `vllm_1.log` to make sure the service is fully started before collecting performance numbers.
-apt update && apt install -y nginx
+    >
-cp /workspace/examples/llm/benchmarks/nginx.conf /etc/nginx/nginx.conf
+    > If benchmarking with two or more nodes, use `--tensor-parallel-size 8` and run only one `vllm serve` instance per node.
-service nginx restart
-```
+ 3. Use NGINX as load balancer
-Note: If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to link to the `vllm serve` on the second node.
+    ```bash
+    apt update && apt install -y nginx
+    cp /workspace/benchmarks/llm/nginx.conf /etc/nginx/nginx.conf
+    service nginx restart
+    ```
+    > [!Note]
+    > If benchmarking across two nodes, update the `upstream` configuration to point to the `vllm serve` instance on the second node.
+ 4. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
-Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
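The tips above ask you to confirm that each service has fully started before measuring. Rather than reading the logs by hand, one option is a generic retry helper that repeats a probe command until it succeeds or a timeout expires. This is a sketch under assumptions: `wait_until` is a hypothetical name, and the probe in the usage note assumes the vLLM OpenAI-compatible server answers on its `/health` route at the ports used above (8001 and 8002).

```bash
#!/usr/bin/env bash
# Hypothetical helper: wait_until <timeout_seconds> <command...>
# Re-runs the command about once per second until it exits 0 (returns 0)
# or the timeout elapses (returns 1). Command output is discarded.
wait_until() {
  local timeout="$1"
  shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}
```

For the two baseline instances this could be used as `wait_until 600 curl -sf http://localhost:8001/health && wait_until 600 curl -sf http://localhost:8002/health`. The guide does not state what a fully started `dynamo serve` log looks like, so checking `disagg.log` still requires inspecting the file.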
 ## Collecting Performance Numbers
 Run the benchmarking script
 ```bash
-bash -x /workspace/examples/llm/benchmarks/perf.sh
+bash -x /workspace/benchmarks/llm/perf.sh
 ```
-## Future Roadmap
+> [!Tip]
+> See the [GenAI-Perf tutorial](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/docs/tutorial.md)
+> on [GitHub](https://github.com/triton-inference-server/perf_analyzer) for additional information about how to run GenAI-Perf
+> and how to interpret results.
+## Supporting Additional Models
+The instructions above can be adapted to nearly any model.
+More complex setup instructions might be required for certain models.
+The instructions above for ETCD, NATS, NGINX, `dynamo serve`, and GenAI-Perf still apply and can be reused.
+The specifics of deploying with different hardware, in a unique environment, or using another model framework can be adapted using the links below.
+Regardless of the deployment mechanism, the GenAI-Perf tool will report the same metrics and measurements so long as an accessible endpoint is available for it to interact with. Use the provided [perf.sh](../../../benchmarks/llm/perf.sh) script to automate the measurement of model throughput and latency at multiple request concurrency levels.
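To make the idea of measuring at multiple concurrency levels concrete, here is a minimal sketch of such a sweep. It is not the contents of `perf.sh`, which this page does not show. The function names, the doubling schedule, the default maximum of 256, and the `DRY_RUN` switch are all assumptions for illustration. The endpoint URL and model name are passed in by the caller.

```bash
#!/usr/bin/env bash
# Hypothetical: print concurrency levels 1, 2, 4, ... up to a maximum.
concurrency_levels() {
  local max="${1:-256}"
  local c=1
  while [ "$c" -le "$max" ]; do
    echo "$c"
    c=$(( c * 2 ))
  done
}

# Hypothetical: run one GenAI-Perf profile per concurrency level.
# With DRY_RUN set to a non-empty value, the commands are printed, not executed.
run_sweep() {
  local url="$1" model="$2" max="${3:-256}"
  local c
  for c in $(concurrency_levels "$max"); do
    ${DRY_RUN:+echo} genai-perf profile \
      -m "$model" \
      --endpoint-type chat \
      --streaming \
      --url "$url" \
      --concurrency "$c"
  done
}
```

Running `DRY_RUN=1 run_sweep <endpoint_url> <model_name> 8` prints the four commands that would be executed, which is a cheap way to check the sweep before spending GPU time on it.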
+### Deployment Examples
+- [Dynamo Multinode Deployments](../../../docs/examples/multinode.md)
+- [Dynamo TensorRT LLM Deployments](../../../docs/examples/trtllm.md)
+    - [Aggregated Deployment of Very Large Models](../../../docs/examples/multinode.md#aggregated-deployment)
+- [Dynamo vLLM Deployments](../../../docs/examples/llm_deployment.md)
+## Metrics and Visualization
-* Results Interpretation
+For instructions on how to acquire per-worker metrics and visualize them using Grafana,
+see the [Visualization with Prometheus and Grafana](../../../deploy/metrics/README.md) guide.