Commit 2ae9ab9f authored by J Wyman, committed by GitHub

chore: Move Benchmarking to Top Level (#1461)


Signed-off-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com>
parent 08355da6
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)
...@@ -22,37 +22,52 @@ This guide provides detailed steps on benchmarking Large Language Models (LLMs) ...@@ -22,37 +22,52 @@ This guide provides detailed steps on benchmarking Large Language Models (LLMs)
> [!NOTE] > [!NOTE]
> We recommend trying out the [LLM Deployment Examples](./README.md) before benchmarking. > We recommend trying out the [LLM Deployment Examples](./README.md) before benchmarking.
## Prerequisites ## Prerequisites
H100 80GB x8 node(s) are required for benchmarking. > [!Important]
> At least one 8xH100-80GB node is required for the following instructions.
1. Build benchmarking image
```bash
./container/build.sh
```
2. Download model
```bash
huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
```
3. Start NATS and ETCD
```bash
docker compose -f deploy/docker_compose.yml up -d
```
> [!NOTE] > [!NOTE]
> This guide was tested on node(s) with the following hardware configuration: > This guide was tested on node(s) with the following hardware configuration:
> * **GPUs**: 8xH100 80GB HBM3 (GPU Memory Bandwidth 3.2 TBs) >
> * **CPU**: 2x Intel Saphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (Base), 3.8 Ghz (Max boost), PCIe Gen5 > * **GPUs**:
> * **NVLink**: NVLink 4th Generation, 900 GB/s (GPU to GPU NVLink bidirectional bandwidth), 18 Links per GPU > 8xH100-80GB-HBM3 (GPU Memory Bandwidth 3.2 TB/s)
> * **InfiniBand**: 8X400Gbit/s (Compute Links), 2X400Gbit/s (Storage Links) >
> * **CPU**:
> 2 x Intel Sapphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (Base), 3.8 GHz (Max boost), PCIe Gen5
>
> * **NVLink**:
> NVLink 4th Generation, 900 GB/s (GPU to GPU NVLink bidirectional bandwidth), 18 Links per GPU
>
> * **InfiniBand**:
> 8x400Gbit/s (Compute Links), 2x400Gbit/s (Storage Links)
> >
> Benchmarking with a different hardware configuration may yield suboptimal results. > Benchmarking with a different hardware configuration may yield suboptimal results.
1\. Build benchmarking image
```bash
./container/build.sh
```
2\. Download model
```bash
huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
```
3\. Start NATS and ETCD
```bash
docker compose -f deploy/docker_compose.yml up -d
```
## Disaggregated Single Node Benchmarking ## Disaggregated Single Node Benchmarking
One H100 80GB x8 node is required for this setup. > [!Important]
> One 8xH100-80GB node is required for the following instructions.
In the following setup we compare Dynamo disaggregated vLLM performance to In the following setup we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. These were chosen to optimize [native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. These were chosen to optimize
...@@ -64,24 +79,32 @@ Each prefill worker will use tensor parallel 1 and the decode worker will use te ...@@ -64,24 +79,32 @@ Each prefill worker will use tensor parallel 1 and the decode worker will use te
With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps: With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:
1\. Run benchmarking container 1. Run benchmarking container
```bash
./container/run.sh --mount-workspace
```
Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
2\. Start disaggregated services ```bash
```bash ./container/run.sh --mount-workspace
cd /workspace/examples/llm ```
dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
``` > [!Tip]
Note: Check the `disagg.log` to make sure the service is fully started before collecting performance numbers. > The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
2. Start disaggregated services
```bash
cd /workspace/examples/llm
dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
```
Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section below. > [!Tip]
> Check the `disagg.log` to make sure the service is fully started before collecting performance numbers.
## Disaggregated Multi Node Benchmarking 3. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
Two H100 80GB x8 nodes are required for this setup.
## Disaggregated Multinode Benchmarking
> [!Important]
> Two 8xH100-80GB nodes are required for the following instructions.
In the following steps we compare Dynamo disaggregated vLLM performance to In the following steps we compare Dynamo disaggregated vLLM performance to
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. These were chosen to optimize [native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. These were chosen to optimize
...@@ -93,50 +116,68 @@ Each prefill worker will use tensor parallel 1 and the decode worker will use te ...@@ -93,50 +116,68 @@ Each prefill worker will use tensor parallel 1 and the decode worker will use te
With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started on node 0**, perform the following steps: With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started on node 0**, perform the following steps:
1\. Run benchmarking container (node 0 & 1) 1. Run benchmarking container (nodes 0 & 1)
```bash
./container/run.sh --mount-workspace
```
Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
2\. Config NATS and ETCD (node 1) ```bash
```bash ./container/run.sh --mount-workspace
export NATS_SERVER="nats://<node_0_ip_addr>" ```
export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
```
Note: Node 1 must be able to reach Node 0 over the network for the above services.
3\. Start workers (node 0) > [!Tip]
```bash > The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
cd /workspace/examples/llm
dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
```
Note: Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.
4\. Start workers (node 1) 2. Config NATS and ETCD (node 1)
```bash
cd /workspace/examples/llm ```bash
dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 & export NATS_SERVER="nats://<node_0_ip_addr>"
``` export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
Note: Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers. ```
> [!Important]
> Node 1 must be able to reach Node 0 over the network for the above services.
3. Start workers (node 0)
```bash
cd /workspace/examples/llm
dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
```
> [!Tip]
> Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.
4. Start workers (node 1)
```bash
cd /workspace/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
```
> [!Tip]
> Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.
5. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section above.
## vLLM Aggregated Baseline Benchmarking ## vLLM Aggregated Baseline Benchmarking
One (or two) H100 80GB x8 nodes are required for this setup. > [!Important]
> One (or two) 8xH100-80GB nodes are required for the following instructions.
With the Dynamo repository and the benchmarking image available, perform the following steps: With the Dynamo repository and the benchmarking image available, perform the following steps:
1\. Run benchmarking container 1. Run benchmarking container
```bash
./container/run.sh --mount-workspace
```
Note: The huggingface home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
2\. Start vLLM serve ```bash
```bash ./container/run.sh --mount-workspace
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \ ```
> [!Tip]
> The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.
2. Start vLLM serve
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
--block-size 128 \ --block-size 128 \
--max-model-len 3500 \ --max-model-len 3500 \
--max-num-batched-tokens 3500 \ --max-num-batched-tokens 3500 \
...@@ -144,7 +185,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70 ...@@ -144,7 +185,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70
--gpu-memory-utilization 0.95 \ --gpu-memory-utilization 0.95 \
--disable-log-requests \ --disable-log-requests \
--port 8001 1> vllm_0.log 2>&1 & --port 8001 1> vllm_0.log 2>&1 &
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \ CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
--block-size 128 \ --block-size 128 \
--max-model-len 3500 \ --max-model-len 3500 \
--max-num-batched-tokens 3500 \ --max-num-batched-tokens 3500 \
...@@ -152,28 +193,59 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70 ...@@ -152,28 +193,59 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70
--gpu-memory-utilization 0.95 \ --gpu-memory-utilization 0.95 \
--disable-log-requests \ --disable-log-requests \
--port 8002 1> vllm_1.log 2>&1 & --port 8002 1> vllm_1.log 2>&1 &
``` ```
Notes:
* Check the `vllm_0.log` and `vllm_1.log` to make sure the service is fully started before collecting performance numbers.
* If benchmarking over 2 nodes, `--tensor-parallel-size 8` should be used and only run one `vllm serve` instance per node.
3\. Use NGINX as load balancer > [!Tip]
```bash > Check the `vllm_0.log` and `vllm_1.log` to make sure the service is fully started before collecting performance numbers.
apt update && apt install -y nginx >
cp /workspace/examples/llm/benchmarks/nginx.conf /etc/nginx/nginx.conf > If benchmarking with two or more nodes, `--tensor-parallel-size 8` should be used and only run one `vllm serve` instance per node.
service nginx restart
``` 3. Use NGINX as load balancer
Note: If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to link to the `vllm serve` on the second node.
```bash
apt update && apt install -y nginx
cp /workspace/benchmarks/llm/nginx.conf /etc/nginx/nginx.conf
service nginx restart
```
> [!Note]
> If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to link to the `vllm serve` on the second node.
4. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
Collect the performance numbers as shown on the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
## Collecting Performance Numbers ## Collecting Performance Numbers
Run the benchmarking script Run the benchmarking script
```bash ```bash
bash -x /workspace/examples/llm/benchmarks/perf.sh bash -x /workspace/benchmarks/llm/perf.sh
``` ```
## Future Roadmap > [!Tip]
> See the [GenAI-Perf tutorial](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/docs/tutorial.md)
> in the [perf_analyzer repository](https://github.com/triton-inference-server/perf_analyzer) on GitHub for additional information about how to run GenAI-Perf
> and how to interpret results.
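
The tips above ask you to confirm that each service is fully started before collecting numbers. For the vLLM baseline, one way to script that check is to poll each server until it responds. The following is a minimal sketch, not part of the repository. It assumes `curl` is available in the container and that the two `vllm serve` instances answer on vLLM's `/health` route at ports 8001 and 8002. The `wait_for_ready` helper and the `PROBE` variable are hypothetical names introduced for this illustration.

```bash
# Poll a URL until it responds successfully or the timeout (seconds) expires.
# PROBE can be overridden (for example in tests); by default curl is used.
wait_for_ready() {
  local url="$1" timeout="${2:-600}" interval="${3:-5}"
  local deadline=$(( SECONDS + timeout ))
  until ${PROBE:-curl -sf -o /dev/null} "$url"; do
    if (( SECONDS >= deadline )); then
      echo "timed out waiting for $url" >&2
      return 1
    fi
    sleep "$interval"
  done
  echo "ready: $url"
}

# Assumed usage for the two baseline instances (ports taken from the commands above):
#   wait_for_ready http://localhost:8001/health
#   wait_for_ready http://localhost:8002/health
```

For the Dynamo services the port and route depend on the deployment configuration, so inspecting the log files as described above remains the reference check.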
## Supporting Additional Models
The instructions above can be used for nearly any model.
More complex setup instructions might be required for certain models.
The instructions above regarding ETCD, NATS, NGINX, `dynamo serve`, and GenAI-Perf still apply and can be reused.
The specifics of deploying with different hardware, in a unique environment, or using another model framework can be adapted using the links below.
Regardless of the deployment mechanism, the GenAI-Perf tool will report the same metrics and measurements so long as an accessible endpoint is available for it to interact with. Use the provided [perf.sh](../../../benchmarks/llm/perf.sh) script to automate the measurement of model throughput and latency across multiple request concurrency levels.
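
To illustrate what measuring across several concurrency levels can look like, here is a hedged sketch of a sweep loop. It is not the contents of `perf.sh`. The endpoint `http://localhost:8000`, the list of concurrency levels, and the names `sweep` and `RUN` are assumptions made for this example. By default the loop only prints the `genai-perf` commands it would run.

```bash
#!/usr/bin/env bash
# Hypothetical concurrency sweep; adjust every value to your deployment.
MODEL="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"
URL="${URL:-http://localhost:8000}"                       # assumed endpoint
CONCURRENCIES="${CONCURRENCIES:-1 2 4 8 16 32 64 128 256}"  # assumed levels
RUN="${RUN-echo}"   # prints commands by default; export RUN="" to execute them

sweep() {
  local c
  for c in $CONCURRENCIES; do
    # One GenAI-Perf run per concurrency level against the same endpoint.
    $RUN genai-perf profile \
      -m "$MODEL" \
      --endpoint-type chat \
      --streaming \
      --url "$URL" \
      --concurrency "$c"
  done
}
```

Calling `sweep` prints one command per concurrency level. Additional options such as input and output token lengths are omitted here on purpose; consult `perf.sh` and the GenAI-Perf documentation for the settings actually used.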
### Deployment Examples
- [Dynamo Multinode Deployments](../../../docs/examples/multinode.md)
- [Dynamo TensorRT LLM Deployments](../../../docs/examples/trtllm.md)
- [Aggregated Deployment of Very Large Models](../../../docs/examples/multinode.md#aggregated-deployment)
- [Dynamo vLLM Deployments](../../../docs/examples/llm_deployment.md)
## Metrics and Visualization
* Results Interpretation For instructions on how to acquire per-worker metrics and visualize them using Grafana,
please see [Visualization with Prometheus and Grafana](../../../deploy/metrics/README.md).