Commit 594f8c04 authored by Jacky, committed by GitHub

docs: Guides for single node benchmarking (#509)


Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
Co-authored-by: Piotr Marcinkiewicz <piotrm@nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
parent 9ce49c2c
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# LLM Deployment Benchmarking Guide
This guide provides detailed steps on benchmarking Large Language Models (LLMs) in single and multi-node configurations.
> [!NOTE]
> We recommend trying out the [LLM Deployment Examples](./README.md) before benchmarking.
## Prerequisites
One or more H100 80GB x8 nodes are required for benchmarking.
1\. Build benchmarking image
```bash
./container/build.sh
```
2\. Download model
```bash
huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
```
3\. Start NATS and ETCD
```bash
docker compose -f deploy/docker_compose.yml up -d
```
## Disaggregated Single Node Benchmarking
In the following steps, we compare the performance of Dynamo's disaggregated vLLM serving on a single node against a
[native vLLM aggregated baseline](#vllm-aggregated-baseline-benchmarking). Both configurations were chosen to maximize
Output Token Throughput (tokens/sec) while operating under similar Inter-Token Latency (ms).
For guidance on tuning for your own use case, see the [Performance Tuning Guide](/docs/guides/disagg_perf_tuning.md).
One H100 80GB x8 node is required for this setup.
With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:
1\. Run benchmarking container
```bash
./container/run.sh -it \
-v <huggingface_hub>:/root/.cache/huggingface/hub \
-v <dynamo_repo>:/workspace
```
2\. Start disaggregated services
```bash
cd /workspace/examples/llm
dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
```
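The `benchmarks.disagg` graph links the components into a request pipeline: Frontend → Processor → VllmWorker → PrefillWorker. The fluent `link` pattern can be illustrated with a toy sketch (this is only an illustration of the chaining style, not the Dynamo SDK itself):

```python
class Component:
    """Minimal stand-in for a linkable service component."""

    def __init__(self, name):
        self.name = name
        self.downstream = []

    def link(self, other):
        # Record the edge and return the far end so calls can chain,
        # mirroring Frontend.link(Processor).link(VllmWorker)...
        self.downstream.append(other)
        return other


frontend, processor = Component("Frontend"), Component("Processor")
worker, prefill = Component("VllmWorker"), Component("PrefillWorker")
frontend.link(processor).link(worker).link(prefill)

# Walk the chain to show the resulting pipeline order.
order, node = [], frontend
while node:
    order.append(node.name)
    node = node.downstream[0] if node.downstream else None
print(" -> ".join(order))  # Frontend -> Processor -> VllmWorker -> PrefillWorker
```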
Note: Check `disagg.log` to make sure the service has fully started before collecting performance numbers.
Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
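As a sanity check on the disaggregated layout, the `ServiceArgs` in `benchmarks/disagg.yaml` allocate one TP=4 decode worker and four TP=1 prefill workers, which together exactly fill one H100 x8 node. A quick sketch of that arithmetic, with values copied from the YAML:

```python
# (workers, GPUs per worker) as declared in benchmarks/disagg.yaml
allocations = {
    "VllmWorker": (1, 4),     # one TP=4 decode worker
    "PrefillWorker": (4, 1),  # four TP=1 prefill workers
}

total = sum(workers * gpus for workers, gpus in allocations.values())
print("total GPUs:", total)  # total GPUs: 8
assert total == 8  # fits a single H100 80GB x8 node
```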
## vLLM Aggregated Baseline Benchmarking
One H100 80GB x8 node is required for this setup.
With the Dynamo repository and the benchmarking image available, perform the following steps:
1\. Run benchmarking container
```bash
./container/run.sh -it \
-v <huggingface_hub>:/root/.cache/huggingface/hub \
-v <dynamo_repo>:/workspace
```
2\. Start vLLM serve
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
--block-size 128 \
--max-model-len 3500 \
--max-num-batched-tokens 3500 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.95 \
--disable-log-requests \
--port 8001 1> vllm_0.log 2>&1 &
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
--block-size 128 \
--max-model-len 3500 \
--max-num-batched-tokens 3500 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.95 \
--disable-log-requests \
--port 8002 1> vllm_1.log 2>&1 &
```
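The two `vllm serve` instances pin disjoint GPU groups via `CUDA_VISIBLE_DEVICES`, so each TP=4 replica owns half the node. A quick sanity check of that layout:

```python
tensor_parallel_size = 4  # from the vllm serve flags above

# GPU groups as set by CUDA_VISIBLE_DEVICES for each instance
groups = [[0, 1, 2, 3], [4, 5, 6, 7]]

assert all(len(g) == tensor_parallel_size for g in groups)

# Groups must be disjoint and together cover the whole H100 x8 node.
flat = [gpu for group in groups for gpu in group]
assert len(flat) == len(set(flat)) == 8
assert set(flat) == set(range(8))
print("GPU layout OK:", groups)
```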
Notes:
* Check `vllm_0.log` and `vllm_1.log` to make sure both services have fully started before collecting performance numbers.
* The `vllm serve` configuration should closely match the corresponding disaggregated benchmarking configuration.
3\. Use NGINX as a load balancer
```bash
apt update && apt install -y nginx
cp /workspace/examples/llm/benchmarks/nginx.conf /etc/nginx/nginx.conf
service nginx restart
```
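The `least_conn` directive in the provided `nginx.conf` routes each request to the upstream with the fewest active connections, which suits long-running streaming LLM requests better than plain round-robin. A toy sketch of that policy in Python (illustrative only, not nginx internals; connection counts are made up):

```python
def least_conn(active):
    """Pick the upstream with the fewest active connections.

    `active` maps upstream address -> current connection count.
    Ties are broken by insertion order, like a stable min.
    """
    return min(active, key=active.get)


# Two local vLLM upstreams, as in nginx.conf.
active = {"127.0.0.1:8001": 3, "127.0.0.1:8002": 1}
print(least_conn(active))  # 127.0.0.1:8002
```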
Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
## Collecting Performance Numbers
Run the benchmarking script
```bash
bash -x /workspace/examples/llm/benchmarks/perf.sh
```
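`perf.sh` scales its genai-perf request counts with the concurrency level, and its sequence lengths are chosen to fit the serving configuration's context window. The arithmetic can be sketched as follows (multipliers and lengths taken from the script and configs):

```python
def request_plan(concurrency):
    """Mirror perf.sh's per-concurrency genai-perf arguments."""
    return {
        "request_count": concurrency * 10,
        "warmup_request_count": concurrency * 2,
        "num_dataset_entries": concurrency * 12,
    }


for c in (1, 64, 256):
    print(c, request_plan(c))

# The input/output lengths must fit the max-model-len used in both setups.
isl, osl = 3000, 150   # from perf.sh
max_model_len = 3500   # from the vllm serve flags and disagg.yaml
assert isl + osl <= max_model_len, "requests would exceed the context window"
```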
## Future Roadmap
* Disaggregated Multi Node Benchmarking
* Results Interpretation
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from components.frontend import Frontend
from components.prefill_worker import PrefillWorker
from components.processor import Processor
from components.worker import VllmWorker
Frontend.link(Processor).link(VllmWorker).link(PrefillWorker)
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Frontend:
  # This model was chosen for its 70B size and FP8 precision: the TP and DP
  # configurations below were tuned for its size, and FP8 reduces model and
  # KV cache memory usage while easing remote KV cache transfer.
  served_model_name: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
  endpoint: dynamo.Processor.chat/completions
  port: 8000

Processor:
  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
  router: round-robin

# One process with 4 GPUs generating output tokens (the "decode" phase).
VllmWorker:
  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
  # Number of tokens per KV cache block; larger blocks allow more efficient
  # chunked transfers between GPUs.
  block-size: 128
  max-model-len: 3500
  # Enable KV cache transfer from prefill to decode workers.
  remote-prefill: true
  tensor-parallel-size: 4
  gpu-memory-utilization: 0.95
  disable-log-requests: true
  ServiceArgs:
    workers: 1
    resources:
      gpu: 4

# Four processes, each with 1 GPU, handling the initial prefill (context
# encoding) phase.
PrefillWorker:
  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
  block-size: 128
  max-model-len: 3500
  max-num-batched-tokens: 3500
  tensor-parallel-size: 1
  gpu-memory-utilization: 0.95
  disable-log-requests: true
  ServiceArgs:
    workers: 4
    resources:
      gpu: 1

# Note: No prefix caching is used, since all requests are expected to be unique.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
worker_processes 1;
worker_rlimit_nofile 4096;
events {
    worker_connections 2048;
    multi_accept on;
    use epoll;
}

http {
    upstream backend_servers {
        least_conn;
        # Select all the upstream vLLM servers to load balance across,
        # including those at a different node if applicable.
        server 127.0.0.1:8001 max_fails=3 fail_timeout=10000s;
        server 127.0.0.1:8002 max_fails=3 fail_timeout=10000s;
    }

    server {
        listen 8000;

        location / {
            proxy_pass http://backend_servers;
            proxy_http_version 1.1;
            proxy_read_timeout 240s;
        }
    }
}
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model=neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
# Input sequence length.
isl=3000
# Output sequence length.
osl=150
# Concurrency levels to test.
for concurrency in 1 2 4 8 16 32 64 128 256; do
  genai-perf profile \
    --model ${model} \
    --tokenizer ${model} \
    --service-kind openai \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url http://localhost:8000 \
    --synthetic-input-tokens-mean ${isl} \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean ${osl} \
    --output-tokens-stddev 0 \
    --extra-inputs max_tokens:${osl} \
    --extra-inputs min_tokens:${osl} \
    --extra-inputs ignore_eos:true \
    --concurrency ${concurrency} \
    --request-count $(($concurrency*10)) \
    --warmup-request-count $(($concurrency*2)) \
    --num-dataset-entries $(($concurrency*12)) \
    --random-seed 100 \
    -- \
    -v \
    --max-threads 256 \
    -H 'Authorization: Bearer NOT USED' \
    -H 'Accept: text/event-stream'
done