<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# LLM Deployment Benchmarking Guide

This guide provides detailed steps on benchmarking Large Language Models (LLMs) in single and multi-node configurations.

> [!NOTE]
> We recommend trying out the [LLM Deployment Examples](./README.md) before benchmarking.

## Prerequisites

> [!Important]
> At least one 8xH100-80GB node is required for the following instructions.

 1. Build benchmarking image

    ```bash
    ./container/build.sh
    ```

 2. Download model

    ```bash
    huggingface-cli download neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
    ```

 3. Start NATS and ETCD

    ```bash
    docker compose -f deploy/docker_compose.yml up -d
    ```

> [!NOTE]
> This guide was tested on node(s) with the following hardware configuration:
>
> * **GPUs**:
>   8xH100-80GB-HBM3 (GPU Memory Bandwidth 3.2 TB/s)
>
> * **CPU**:
>   2 x Intel Sapphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (Base), 3.8 GHz (Max boost), PCIe Gen5
>
> * **NVLink**:
>   NVLink 4th Generation, 900 GB/s (GPU to GPU NVLink bidirectional bandwidth), 18 Links per GPU
>
> * **InfiniBand**:
>   8x400Gbit/s (Compute Links), 2x400Gbit/s (Storage Links)
>
> Benchmarking with a different hardware configuration may yield suboptimal results.
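
Before moving on, it can help to confirm that the NATS and ETCD services started in step 3 are accepting connections. The following is a minimal sketch, not part of the Dynamo tooling: `port_open` and `check_services` are hypothetical helper names, port 2379 is the ETCD port used later in this guide, and port 4222 is the NATS default, assumed here because the contents of `deploy/docker_compose.yml` are not shown. It relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a quick TCP reachability check.

# port_open HOST PORT: succeeds if a TCP connection can be opened within 2 seconds.
port_open() {
  local host="$1" port="$2"
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# check_services [HOST]: report whether NATS and ETCD look reachable on HOST.
check_services() {
  local host="${1:-localhost}"
  local port
  # 4222 = assumed NATS default port, 2379 = ETCD port used in this guide.
  for port in 4222 2379; do
    if port_open "$host" "$port"; then
      echo "reachable:     ${host}:${port}"
    else
      echo "NOT reachable: ${host}:${port}"
    fi
  done
}

# Example (run on the node where the services were started):
#   check_services localhost
```

The same helper can be pointed at another host, which is useful later when a second node needs to reach these services over the network.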

## Disaggregated Single Node Benchmarking

> [!Important]
> One 8xH100-80GB node is required for the following instructions.

In the following setup, we compare Dynamo disaggregated vLLM performance to the
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. Both configurations were chosen to optimize
Output Token Throughput (per sec) while operating at a similar Inter Token Latency (ms).
To tune for your own use case, see the [Performance Tuning Guide](/docs/guides/disagg_perf_tuning.md).

This setup uses 4 prefill workers and 1 decode worker.
Each prefill worker uses a tensor parallel size of 1, and the decode worker uses a tensor parallel size of 4.

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started**, perform the following steps:

 1. Run benchmarking container

    ```bash
    ./container/run.sh --mount-workspace
    ```

    > [!Tip]
    > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

 2. Start disaggregated services

    ```bash
    cd /workspace/examples/llm
    dynamo serve benchmarks.disagg:Frontend -f benchmarks/disagg.yaml 1> disagg.log 2>&1 &
    ```

    > [!Tip]
    > Check the `disagg.log` to make sure the service is fully started before collecting performance numbers.

 3. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.

## Disaggregated Multinode Benchmarking

> [!Important]
> Two 8xH100-80GB nodes are required for the following instructions.

In the following steps, we compare Dynamo disaggregated vLLM performance to the
[native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. Both configurations were chosen to optimize
Output Token Throughput (per sec) while operating at a similar Inter Token Latency (ms).
To tune for your own use case, see the [Performance Tuning Guide](/docs/guides/disagg_perf_tuning.md).

This setup uses 8 prefill workers and 1 decode worker.
Each prefill worker uses a tensor parallel size of 1, and the decode worker uses a tensor parallel size of 8.
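
The GPU budget for these worker layouts is simple arithmetic, assuming each worker occupies as many GPUs as its tensor parallel size and no other form of parallelism is in play. A small sketch (the `gpus_needed` helper is hypothetical and only for illustration) shows that the single-node layout fills one 8-GPU node and the two-node layout fills both:

```bash
#!/usr/bin/env bash
# gpus_needed PREFILL_WORKERS PREFILL_TP DECODE_WORKERS DECODE_TP
# Prints the total number of GPUs occupied by the given worker layout.
gpus_needed() {
  echo $(( $1 * $2 + $3 * $4 ))
}

gpus_needed 4 1 1 4   # single-node layout: prints 8
gpus_needed 8 1 1 8   # two-node layout:    prints 16
```

If you change the worker counts or tensor parallel sizes, the total should still match the number of GPUs you have available.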

With the Dynamo repository, benchmarking image and model available, and **NATS and ETCD started on node 0**, perform the following steps:

 1. Run benchmarking container (nodes 0 & 1)

    ```bash
    ./container/run.sh --mount-workspace
    ```

    > [!Tip]
    > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

 2. Configure NATS and ETCD (node 1)

    ```bash
    export NATS_SERVER="nats://<node_0_ip_addr>"
    export ETCD_ENDPOINTS="<node_0_ip_addr>:2379"
    ```

    > [!Important]
    > Node 1 must be able to reach Node 0 over the network for the above services.

 3. Start workers (node 0)

    ```bash
    cd /workspace/examples/llm
    dynamo serve benchmarks.disagg_multinode:Frontend -f benchmarks/disagg_multinode.yaml 1> disagg_multinode.log 2>&1 &
    ```

    > [!Tip]
    > Check the `disagg_multinode.log` to make sure the service is fully started before collecting performance numbers.

 4. Start workers (node 1)

    ```bash
    cd /workspace/examples/llm
    dynamo serve components.prefill_worker:PrefillWorker -f benchmarks/disagg_multinode.yaml 1> prefill_multinode.log 2>&1 &
    ```

    > [!Tip]
    > Check the `prefill_multinode.log` to make sure the service is fully started before collecting performance numbers.

 5. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
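
Because both environment variables in step 2 are derived from the same address, they can be set together from a single value. The sketch below is one hypothetical way to do that on node 1: `configure_node1` is an illustrative helper name, not part of Dynamo, and it only reproduces the two `export` lines from step 2 with an empty-argument guard.

```bash
#!/usr/bin/env bash
# Hypothetical helper: export the node 1 connection settings from node 0's IP address.
configure_node1() {
  local node0_ip="$1"
  if [ -z "$node0_ip" ]; then
    echo "usage: configure_node1 <node_0_ip_addr>" >&2
    return 1
  fi
  # Same values as in step 2, with the placeholder filled in.
  export NATS_SERVER="nats://${node0_ip}"
  export ETCD_ENDPOINTS="${node0_ip}:2379"
}

# Example (replace the placeholder with node 0's address):
#   configure_node1 "<node_0_ip_addr>"
```

Define and call the function in the same shell that later runs `dynamo serve` on node 1, so that the exported variables are visible to it.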


## vLLM Aggregated Baseline Benchmarking

> [!Important]
> One (or two) 8xH100-80GB nodes are required for the following instructions.

With the Dynamo repository and the benchmarking image available, perform the following steps:

 1. Run benchmarking container

    ```bash
    ./container/run.sh --mount-workspace
    ```

    > [!Tip]
    > The Hugging Face home source mount can be changed by setting `--hf-cache ~/.cache/huggingface`.

 2. Start vLLM serve

    ```bash
    CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
      --block-size 128 \
      --max-model-len 3500 \
      --max-num-batched-tokens 3500 \
      --tensor-parallel-size 4 \
      --gpu-memory-utilization 0.95 \
      --disable-log-requests \
      --port 8001 1> vllm_0.log 2>&1 &
    CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
      --block-size 128 \
      --max-model-len 3500 \
      --max-num-batched-tokens 3500 \
      --tensor-parallel-size 4 \
      --gpu-memory-utilization 0.95 \
      --disable-log-requests \
      --port 8002 1> vllm_1.log 2>&1 &
    ```

    > [!Tip]
    > Check the `vllm_0.log` and `vllm_1.log` to make sure the service is fully started before collecting performance numbers.
    >
    > If benchmarking with two or more nodes, use `--tensor-parallel-size 8` and run only one `vllm serve` instance per node.

 3. Use NGINX as load balancer

    ```bash
    apt update && apt install -y nginx
    cp /workspace/benchmarks/llm/nginx.conf /etc/nginx/nginx.conf
    service nginx restart
    ```

    > [!Note]
    > If benchmarking over 2 nodes, the `upstream` configuration will need to be updated to link to the `vllm serve` on the second node.

 4. Collect the performance numbers as shown in the [Collecting Performance Numbers](#collecting-performance-numbers) section below.
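
As an alternative to reading `vllm_0.log` and `vllm_1.log` by hand, readiness can be polled over HTTP. The following is a sketch under stated assumptions: `wait_for_http` is a hypothetical helper, it requires `curl`, and it assumes the two `vllm serve` instances expose a `/health` endpoint on ports 8001 and 8002 as started in step 2. If your vLLM version behaves differently, substitute any URL that only responds once the server is up.

```bash
#!/usr/bin/env bash
# wait_for_http URL [RETRIES] [DELAY_SECONDS]
# Polls URL until it returns a successful HTTP status or the retries run out.
wait_for_http() {
  local url="$1" retries="${2:-60}" delay="${3:-5}"
  local i
  for (( i = 1; i <= retries; i++ )); do
    # -f: fail on HTTP errors, -s: silent, -o /dev/null: discard the body.
    if curl -fs -o /dev/null --max-time 5 "$url"; then
      echo "ready: $url"
      return 0
    fi
    sleep "$delay"
  done
  echo "timed out waiting for $url" >&2
  return 1
}

# Example (assumed endpoints for the two instances from step 2):
#   wait_for_http http://localhost:8001/health && \
#   wait_for_http http://localhost:8002/health
```

The default of 60 retries at 5-second intervals is an arbitrary choice; loading a 70B model can take several minutes, so adjust it to your environment.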


## Collecting Performance Numbers

Run the benchmarking script:

```bash
bash -x /workspace/benchmarks/llm/perf.sh
```

> [!Tip]
> See the [GenAI-Perf tutorial](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/docs/tutorial.md)
> on [GitHub](https://github.com/triton-inference-server/perf_analyzer) for additional information about how to run GenAI-Perf
> and how to interpret results.


## Supporting Additional Models

The instructions above can be used for nearly any model.
Certain models might require more complex setup instructions.
The instructions above regarding ETCD, NATS, NGINX, `dynamo serve`, and GenAI-Perf still apply and can be reused.
The specifics of deploying with different hardware, in a unique environment, or using another model framework can be adapted using the links below.

Regardless of the deployment mechanism, the GenAI-Perf tool reports the same metrics and measurements as long as it can reach an accessible endpoint. Use the provided [perf.sh](../../../benchmarks/llm/perf.sh) script to automate measuring model throughput and latency across multiple request concurrency levels.
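
To illustrate what a concurrency sweep looks like, here is a rough sketch. It is not the contents of `perf.sh`: the `sweep` function, the `ENDPOINT_URL` default, and the concurrency levels are assumptions made for illustration, and only a handful of `genai-perf profile` options are shown. By default it prints the commands instead of running them; set `DRY_RUN=0` to execute them against a live endpoint.

```bash
#!/usr/bin/env bash
# Hypothetical concurrency sweep; NOT the contents of perf.sh.
MODEL="neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic"
ENDPOINT_URL="${ENDPOINT_URL:-http://localhost:8000}"  # assumed endpoint; adjust to your deployment
DRY_RUN="${DRY_RUN:-1}"                                # 1 = print commands only, 0 = run them

# sweep CONCURRENCY...: build one genai-perf invocation per concurrency level.
sweep() {
  local c
  for c in "$@"; do
    local cmd=(genai-perf profile -m "$MODEL"
               --endpoint-type chat --streaming
               --url "$ENDPOINT_URL" --concurrency "$c")
    if [ "$DRY_RUN" = "1" ]; then
      printf '%s\n' "${cmd[*]}"
    else
      "${cmd[@]}"
    fi
  done
}

# Illustrative concurrency levels, not the ones used by perf.sh.
sweep 1 2 4 8
```

Each level produces its own set of GenAI-Perf results, which can then be compared to see how throughput and latency change as concurrency grows.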

### Deployment Examples

- [Dynamo Multinode Deployments](../../../docs/examples/multinode.md)
- [Dynamo TensorRT LLM Deployments](../../../docs/examples/trtllm.md)
    - [Aggregated Deployment of Very Large Models](../../../docs/examples/multinode.md#aggregated-deployment)
- [Dynamo vLLM Deployments](../../../docs/examples/llm_deployment.md)


## Metrics and Visualization

For instructions on how to collect per-worker metrics and visualize them using Grafana,
see [Visualization with Prometheus and Grafana](../../../deploy/metrics/README.md).