---
id: micro-benchmarks
---

# Micro Benchmarks

## Computation Benchmarks

### `kernel-launch`

#### Introduction

Measure GPU kernel launch performance from multiple perspectives, including end-to-end latency,
host-side dispatch overhead, steady-state launch throughput, and device-side launch time.

#### Metrics

| Name                                | Unit               | Description                                                      |
|-------------------------------------|--------------------|------------------------------------------------------------------|
| kernel-launch/e2e_latency_us        | time (us)          | Single-shot end-to-end latency measured in CPU time.             |
| kernel-launch/host_dispatch_us      | time (us)          | Host-side dispatch overhead per kernel measured in CPU time.     |
| kernel-launch/launch_throughput_mkps| throughput (MKPS)  | Steady-state kernel launch throughput.                           |
| kernel-launch/device_launch_us      | time (us)          | Device-side average launch time per kernel measured by events.   |

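As a rough illustration of how the per-kernel time and throughput metrics relate, the sketch below converts one batch timing into both units. It assumes that MKPS stands for millions of kernels per second, which the table does not spell out; the function names are hypothetical and are not part of the benchmark.

```python
def avg_launch_time_us(num_launches: int, elapsed_us: float) -> float:
    """Average time per kernel launch, in microseconds."""
    if num_launches <= 0:
        raise ValueError("num_launches must be positive")
    return elapsed_us / num_launches


def launch_throughput_mkps(num_launches: int, elapsed_us: float) -> float:
    """Launch throughput, assuming MKPS means million kernels per second.

    One launch per microsecond equals one million launches per second,
    so launches divided by microseconds is already expressed in MKPS.
    """
    if elapsed_us <= 0:
        raise ValueError("elapsed_us must be positive")
    return num_launches / elapsed_us


if __name__ == "__main__":
    # Hypothetical measurement: 10,000 back-to-back launches in 25,000 us.
    print(avg_launch_time_us(10_000, 25_000.0))
    print(launch_throughput_mkps(10_000, 25_000.0))
```

Under this assumption, steady-state throughput and average per-kernel time are reciprocals of each other, which is why the single-shot latency metric is reported separately.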
### `gemm-flops`

#### Introduction

Measure the GPU GEMM FLOPS for different float and int data types, with or without Tensor Core (XDLOPS),
performed by NVIDIA [cutlass](https://github.com/NVIDIA/cutlass/tree/ccb697bac77fcc898e9c897b2c90aa5b60ac72fb)
or AMD [rocblas-bench](https://github.com/ROCmSoftwarePlatform/rocBLAS/tree/develop/clients/benchmarks).
The benchmark supports one or more GEMM shapes in `m,n,k` format.

#### Metrics

| Name                                           | Unit           | Description                                             |
|------------------------------------------------|----------------|---------------------------------------------------------|
| gemm-flops/fp64_m${m}_n${n}_k${k}_flops        | FLOPS (GFLOPS) | GEMM float64 peak FLOPS.                                |
| gemm-flops/fp32_m${m}_n${n}_k${k}_flops        | FLOPS (GFLOPS) | GEMM float32 peak FLOPS.                                |
| gemm-flops/fp16_m${m}_n${n}_k${k}_flops        | FLOPS (GFLOPS) | GEMM float16 peak FLOPS.                                |
| gemm-flops/fp64_tc_m${m}_n${n}_k${k}_flops     | FLOPS (GFLOPS) | GEMM float64 peak FLOPS with NVIDIA Tensor Core.        |
| gemm-flops/tf32_tc_m${m}_n${n}_k${k}_flops     | FLOPS (GFLOPS) | GEMM tensor-float32 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/fp16_tc_m${m}_n${n}_k${k}_flops     | FLOPS (GFLOPS) | GEMM float16 peak FLOPS with NVIDIA Tensor Core.        |
| gemm-flops/bf16_tc_m${m}_n${n}_k${k}_flops     | FLOPS (GFLOPS) | GEMM bfloat16 peak FLOPS with NVIDIA Tensor Core.       |
| gemm-flops/int8_tc_m${m}_n${n}_k${k}_iops      | IOPS (GIOPS)   | GEMM int8 peak IOPS with NVIDIA Tensor Core.            |
| gemm-flops/int4_tc_m${m}_n${n}_k${k}_iops      | IOPS (GIOPS)   | GEMM int4 peak IOPS with NVIDIA Tensor Core.            |
| gemm-flops/fp32_xdlops_m${m}_n${n}_k${k}_flops | FLOPS (GFLOPS) | GEMM float32 peak FLOPS with AMD XDLOPS.                |
| gemm-flops/fp16_xdlops_m${m}_n${n}_k${k}_flops | FLOPS (GFLOPS) | GEMM float16 peak FLOPS with AMD XDLOPS.                |
| gemm-flops/bf16_xdlops_m${m}_n${n}_k${k}_flops | FLOPS (GFLOPS) | GEMM bfloat16 peak FLOPS with AMD XDLOPS.               |
| gemm-flops/int8_xdlops_m${m}_n${n}_k${k}_iops  | IOPS (GIOPS)   | GEMM int8 peak IOPS with AMD XDLOPS.                    |

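The sketch below shows how a metric name in the table can be assembled from an `m,n,k` shape, and how a GFLOPS figure is commonly derived from a measured time. It assumes the conventional count of `2 * m * n * k` floating-point operations per GEMM; the underlying cutlass or rocblas-bench tools may count differently, and both helper names are hypothetical.

```python
def gemm_metric_name(dtype: str, m: int, n: int, k: int, unit: str = "flops") -> str:
    """Build a metric name such as gemm-flops/fp16_tc_m8192_n8192_k8192_flops."""
    return f"gemm-flops/{dtype}_m{m}_n{n}_k{k}_{unit}"


def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """GFLOPS for one GEMM, assuming 2*m*n*k operations (multiply plus add)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical shape and timing, for illustration only.
    print(gemm_metric_name("fp16_tc", 8192, 8192, 8192))
    print(gemm_gflops(1000, 1000, 1000, 0.002))
```

For integer data types the same name layout applies with the `iops` suffix, for example `gemm_metric_name("int8_tc", m, n, k, unit="iops")`.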
### `matmul`

#### Introduction

Measure the performance of a large-scale matrix multiplication using `torch.matmul` on a single GPU.

#### Metrics

| Name                           | Unit      | Description                    |
|--------------------------------|-----------|--------------------------------|
| pytorch-matmul/nosharding_time | time (ms) | Time of pure matmul operation. |

### `cublaslt-gemm` / `hipblaslt-gemm`

#### Introduction

Measure the GEMM performance of [`cublasLtMatmul`](https://docs.nvidia.com/cuda/cublas/#cublasltmatmul) or [`hipblasLt-bench`](https://github.com/ROCm/hipBLASLt/blob/develop/clients/benchmarks/README.md).

#### Metrics

| Name                                                      | Unit           | Description                     |
|-----------------------------------------------------------|----------------|---------------------------------|
| cublaslt-gemm/${dtype}\_${batch}\_${m}\_${n}\_${k}_flops  | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |
| hipblaslt-gemm/${dtype}\_${batch}\_${m}\_${n}\_${k}_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |

### `cublas-function`

#### Introduction

Measure the performance of the most common NVIDIA cuBLAS functions, using parameters taken from model training workloads including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.

The supported functions for cuBLAS are as follows:
 - cublasSgemm
 - cublasSgemmStridedBatched
 - cublasGemmStridedBatchedEx
 - cublasGemmEx
 - cublasCgemm3mStridedBatched
 - cublasCgemm

#### Metrics

| Name                                                              | Unit      | Description                                                                                                                                  |
|-------------------------------------------------------------------|-----------|----------------------------------------------------------------------------------------------------------------------------------------------|
| cublas-function/name\_${function_name}\_${parameters}_time        | time (us) | The mean time to execute the cublas function with the parameters.                                                                            |
| cublas-function/name\_${function_name}\_${parameters}_correctness |           | Whether the results of the cublas function with the given parameters pass the correctness check, if the correctness check is enabled.        |
| cublas-function/name\_${function_name}\_${parameters}_error       |           | The error ratio of the results of the cublas function with the given parameters, if the correctness check is enabled.                        |

### `cudnn-function`

#### Introduction

Measure the performance of the most common NVIDIA cuDNN functions, using parameters taken from model training workloads including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.

The supported functions for cuDNN are as follows:
 - cudnnConvolutionBackwardFilter
 - cudnnConvolutionBackwardData
 - cudnnConvolutionForward

#### Metrics

| Name                                                      | Unit      | Description                                                      |
|-----------------------------------------------------------|-----------|------------------------------------------------------------------|
| cudnn-function/name\_${function_name}\_${parameters}_time | time (us) | The mean time to execute the cudnn function with the parameters. |

### `tensorrt-inference`

#### Introduction

Measure the inference performance of PyTorch/ONNX models on NVIDIA GPUs with [TensorRT](https://developer.nvidia.com/tensorrt).

Currently the following models are supported:

> alexnet, densenet121, densenet169, densenet201, densenet161, googlenet, inception_v3, mnasnet0_5,
> mnasnet1_0, mobilenet_v2, resnet18, resnet34, resnet50, resnet101, resnet152, resnext50_32x4d,
> resnext101_32x8d, wide_resnet50_2, wide_resnet101_2, shufflenet_v2_x0_5, shufflenet_v2_x1_0,
> squeezenet1_0, squeezenet1_1, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19
> lstm, bert-base, bert-large, gpt2-small

> Large models such as `gpt2-large` are not currently supported because models larger than 2 GB (the maximum protobuf size) cannot be exported to a single ONNX file.

#### Metrics

| Name                                             | Unit      | Description                                                                                              |
|--------------------------------------------------|-----------|----------------------------------------------------------------------------------------------------------|
| tensorrt-inference/${model}_gpu_time_mean        | time (ms) | The mean GPU latency to execute the kernels for a query.                                                 |
| tensorrt-inference/${model}_gpu_time_99          | time (ms) | The 99th percentile GPU latency to execute the kernels for a query.                                      |
| tensorrt-inference/${model}_host_time_mean       | time (ms) | The mean H2D, GPU, and D2H latency to execute the kernels for a query.                                   |
| tensorrt-inference/${model}_host_time_99         | time (ms) | The 99th percentile H2D, GPU, and D2H latency to execute the kernels for a query.                        |
| tensorrt-inference/${model}_end_to_end_time_mean | time (ms) | The mean duration from when the H2D of a query is called to when the D2H of the same query is completed. |
| tensorrt-inference/${model}_end_to_end_time_99   | time (ms) | The P99 duration from when the H2D of a query is called to when the D2H of the same query is completed.  |

### `ort-inference`

#### Introduction

Measure the inference performance of torchvision models using ONNX Runtime. Currently the following models are supported:

> alexnet, densenet121, densenet169, densenet201, densenet161, googlenet, inception_v3, mnasnet0_5,
> mnasnet1_0, mobilenet_v2, resnet18, resnet34, resnet50, resnet101, resnet152, resnext50_32x4d,
> resnext101_32x8d, wide_resnet50_2, wide_resnet101_2, shufflenet_v2_x0_5, shufflenet_v2_x1_0,
> squeezenet1_0, squeezenet1_1, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19

The supported percentiles are 50, 90, 95, 99, and 99.9.

#### Parameters

| Parameter              | Default | Description                                                                 |
|------------------------|---------|-----------------------------------------------------------------------------|
| `--pytorch_models`     | See above | Torchvision models to export to ONNX and run with ONNX Runtime.           |
| `--precision`          | `float16` | Inference precision: `float32`, `float16`, or `int8`.                     |
| `--graph_opt_level`    | `3`     | ONNX Runtime graph optimization level: `0`, `1`, `2`, or `3`.              |
| `--batch_size`         | `32`    | Batch size of the generated input tensor.                                  |
| `--num_warmup`         | `64`    | Number of warmup inference iterations excluded from metrics.               |
| `--num_steps`          | `256`   | Number of measured inference iterations.                                   |
| `--execution_provider` | `auto`  | ONNX Runtime execution provider: `auto`, `cuda`, `rocm`, `migraphx`, `cpu`, or a full provider name. |
| `--pretrained`         | `false` | Use pretrained torchvision weights when exporting ONNX models.             |

#### Metrics

| Name                                                | Unit      | Description                                                              |
|-----------------------------------------------------|-----------|--------------------------------------------------------------------------|
| ort-inference/{precision}_{model}_time              | time (ms) | The mean latency to execute one batch of inference.                      |
| ort-inference/{precision}_{model}_time_{percentile} | time (ms) | The {percentile}th percentile latency to execute one batch of inference. |

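To make the relationship between the measured iterations and the reported metrics concrete, the sketch below summarizes a list of per-batch latencies into the mean and the supported percentiles. The nearest-rank percentile method and the way the percentile value is rendered in the key (for example `99.9`) are assumptions, since the text does not state how the benchmark computes or formats them; `summarize` and `nearest_rank` are hypothetical helpers.

```python
import math
from fractions import Fraction

# Percentiles listed in the introduction above.
PERCENTILES = (50, 90, 95, 99, 99.9)


def nearest_rank(sorted_values, percentile):
    """Nearest-rank percentile (an assumed method) of an ascending list."""
    # Fraction avoids floating-point rounding when computing the rank.
    rank = math.ceil(Fraction(str(percentile)) * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(precision, model, latencies_ms):
    """Map measured per-batch latencies (ms) to ort-inference style metrics."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    ordered = sorted(latencies_ms)
    prefix = f"ort-inference/{precision}_{model}_time"
    metrics = {prefix: sum(ordered) / len(ordered)}
    for p in PERCENTILES:
        metrics[f"{prefix}_{p}"] = nearest_rank(ordered, p)
    return metrics


if __name__ == "__main__":
    # Synthetic latencies of 1..100 ms stand in for measured iterations.
    for name, value in summarize("float16", "resnet50", list(range(1, 101))).items():
        print(name, value)
```

Warmup iterations would be dropped before calling `summarize`, matching the description of `--num_warmup` as excluded from metrics.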
### `gpu-burn`

#### Introduction

Multi-GPU CUDA stress test for GPU compute and memory utilization, performed by [gpu-burn](https://github.com/wilicc/gpu-burn).
It can optionally use double-precision data types and Tensor Cores.

#### Metrics

| Name                    | Unit     | Description                                                                        |
|-------------------------|----------|------------------------------------------------------------------------------------|
| gpu-burn/time           | time (s) | The runtime for gpu-burn test.                                                     |
| gpu-burn/gpu_[0-9]_pass | yes/no   | The result of the gpu-burn test for each GPU (1: yes, 0: no).                      |
| gpu-burn/abort          | yes/no   | Whether or not GPU-burn test aborted before returning GPU results (1: yes, 0: no). |

### `cpu-hpl`

#### Introduction

HPL (High-Performance Linpack) evaluates compute throughput by solving a dense linear system in double-precision arithmetic.
It is performed by the [High-Performance Linpack Benchmark for Distributed-Memory Computers](https://netlib.org/benchmark/hpl/).

#### Metrics

| Name               | Unit               | Description                                                               |
|--------------------|--------------------|---------------------------------------------------------------------------|
| cpu-hpl/tests_pass |                    | HPL completed running and correctness test has passed (1: pass, 0: fail). |
| cpu-hpl/throughput | bandwidth (GFlops) | Compute bandwidth.                                                        |
| cpu-hpl/time       | time (s)           | Time elapsed during HPL run.                                              |

### `gpu-hpl`

#### Introduction

Measure GPU HPL performance for dense linear algebra workloads.
Performed by [rocHPL](https://github.com/ROCm/rocHPL).

#### Parameters

`gpu-hpl` always generates an HPL input `.dat` file from the command-line parameters.
The generated file name and output file name are derived from the same workload prefix used in metric keys.

| Parameter              | Default | Description                                                                 |
|------------------------|---------|-----------------------------------------------------------------------------|
| `--p`                  | `1`     | Number of rows in the MPI process grid.                                     |
| `--q`                  | `1`     | Number of columns in the MPI process grid.                                  |
| `--local-p`            |         | Optional number of rows in the node-local MPI process grid.                 |
| `--local-q`            |         | Optional number of columns in the node-local MPI process grid.              |
| `--n`                  | `45312` | Global matrix size.                                                         |
| `--nb`                 | `384`   | Panel/block size.                                                           |
| `--warmup`             | `0`     | Number of warmup HPL runs to exclude from result aggregation.               |
| `--iterations`         | `1`     | Number of measured HPL runs to include in result aggregation.               |
| `--reduce-op`          | `max`   | Reduce operator for measured runs by FLOPS: `mean`, `median`, `max`, `min`. |
| `--pmap`               | `1`     | Process mapping: `0` for row-major, `1` for column-major.                   |
| `--bcast`              | `0`     | Broadcast topology: `0` for 1rg, `1` for 1rM, `2` for 2rg, `3` for 2rM, `4` for Lng, `5` for LnM. |
| `--threshold`          | `16.0`  | Residual check threshold.                                                   |
| `--pfact`              | `2`     | Panel factorization: `0` for left, `1` for Crout, `2` for right.            |
| `--nbmin`              | `32`    | Recursive stopping criterion.                                               |
| `--ndiv`               | `2`     | Number of panels in recursion.                                              |
| `--rfact`              | `2`     | Recursive panel factorization: `0` for left, `1` for Crout, `2` for right.  |
| `--depth`              | `1`     | Lookahead depth.                                                            |
| `--swap`               | `1`     | Swapping algorithm: `0` for binary exchange, `1` for long, `2` for mix.     |
| `--swapping-threshold` | `64`    | Swapping threshold.                                                         |
| `--l1`                 | `0`     | L1 storage form: `0` for transposed, `1` for non-transposed.                |
| `--u`                  | `0`     | U storage form: `0` for transposed, `1` for non-transposed.                 |
| `--equilibration`      | `0`     | Equilibration: `0` for no, `1` for yes.                                     |
| `--memory-alignment`   | `8`     | Memory alignment in doubles.                                                |

`--warmup` runs are excluded from result aggregation. `--reduce-op` is applied to the measured FLOPS values.
The reported `_time` metric is reduced in the same performance direction by applying `--reduce-op` to `1 / time`
and then converting the result back to seconds.

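The aggregation rule above can be sketched as follows. This is a minimal illustration, not the benchmark's own code: it assumes the warmup runs are the first entries of each list, and `reduce_runs` is a hypothetical helper.

```python
import statistics

REDUCERS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "max": max,
    "min": min,
}


def reduce_runs(flops, times_s, reduce_op="max", warmup=0):
    """Reduce per-run FLOPS and times in the same performance direction.

    Assumes the warmup runs come first in both lists (illustrative only).
    """
    measured_flops = flops[warmup:]
    measured_times = times_s[warmup:]
    if not measured_flops or len(measured_flops) != len(measured_times):
        raise ValueError("need matching, non-empty measured runs")
    reducer = REDUCERS[reduce_op]
    reduced_flops = reducer(measured_flops)
    # Reduce the rates (1 / time) so that "max" selects the shortest time,
    # then convert the reduced rate back to seconds.
    reduced_time = 1.0 / reducer([1.0 / t for t in measured_times])
    return reduced_flops, reduced_time


if __name__ == "__main__":
    # Hypothetical results: one warmup run followed by two measured runs.
    print(reduce_runs([900.0, 1000.0, 1100.0], [12.0, 10.0, 8.0], "max", warmup=1))
```

With `max`, both metrics describe the fastest measured run. With `mean`, the reported time is the harmonic mean of the run times rather than their arithmetic mean, which is a direct consequence of reducing `1 / time`.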
#### Metrics

rocHPL reports performance, time, and correctness metrics.
The metric key includes the configured HPL variant, process grid, matrix size, and block size:
`${tv}_P${P}_Q${Q}_N${N}_NB${NB}`.
The `tv` field is based on the rocHPL `T/V` value and includes an extended suffix for `L1`, `U`,
`Equilibration`, and `memory-alignment`. For example, `WC11R2R32_TTN8` uses transposed `L1`,
transposed `U`, no equilibration, and memory alignment `8`.

| Name                                                  | Unit           | Description                                      |
|-------------------------------------------------------|----------------|--------------------------------------------------|
| `gpu-hpl/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_flops`      | FLOPS (GFLOPS) | Throughput for the specified rocHPL run.         |
| `gpu-hpl/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_time`       | time (s)       | Time elapsed during the specified HPL run.       |
| `gpu-hpl/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_tests_pass` |                | Whether residual checks passed (1: pass, 0: fail). |

### `gpu-hpl-mxp`

#### Introduction

Measure GPU HPL-MxP performance for mixed-precision dense linear algebra workloads.
Performed by [rocHPL-MxP](https://github.com/ROCm/rocHPL-MxP).

#### Parameters

`gpu-hpl-mxp` always generates an HPL-MxP input `.dat` file from the command-line parameters.
The generated file name and output file name are derived from the same workload prefix used in metric keys.

| Parameter      | Default | Description                                                                 |
|----------------|---------|-----------------------------------------------------------------------------|
| `--p`          | `1`     | Number of rows in the MPI process grid.                                     |
| `--q`          | `1`     | Number of columns in the MPI process grid.                                  |
| `--local-p`    |         | Optional number of rows in the node-local MPI process grid.                 |
| `--local-q`    |         | Optional number of columns in the node-local MPI process grid.              |
| `--n`          | `61440` | Global matrix size.                                                         |
| `--nb`         | `2560`  | Panel/block size.                                                           |
| `--warmup`     | `0`     | Number of warmup HPL-MxP runs to exclude from result aggregation.           |
| `--iterations` | `1`     | Number of measured HPL-MxP runs to include in result aggregation.           |
| `--reduce-op`  | `max`   | Reduce operator for measured runs by FLOPS: `mean`, `median`, `max`, `min`. |
| `--pmap`       | `1`     | Process mapping: `0` for row-major, `1` for column-major.                   |
| `--bcast`      | `0`     | Broadcast topology: `0` for 1rg, `1` for 1rM, `2` for 2rg, `3` for 2rM, `4` for Lng, `5` for LnM. |
| `--threshold`  | `16.0`  | Residual check threshold.                                                   |

`--warmup` runs are excluded from result aggregation. `--reduce-op` is applied to the measured FLOPS values.
The reported `_time` metric is reduced in the same performance direction by applying `--reduce-op` to `1 / time`
and then converting the result back to seconds.

#### Metrics

rocHPL-MxP reports performance, time, and correctness metrics.
The metric key includes the configured HPL-MxP variant, process grid, matrix size, and block size:
`${tv}_P${P}_Q${Q}_N${N}_NB${NB}`.
The `tv` field is based on the rocHPL-MxP `T/V` value, for example `WC1`.

| Name                                                      | Unit           | Description                                      |
|-----------------------------------------------------------|----------------|--------------------------------------------------|
| `gpu-hpl-mxp/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_flops`      | FLOPS (GFLOPS) | Throughput for the specified rocHPL-MxP run.     |
| `gpu-hpl-mxp/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_time`       | time (s)       | Time elapsed during the specified HPL-MxP run.   |
| `gpu-hpl-mxp/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_tests_pass` |                | Whether residual checks passed (1: pass, 0: fail). |

### `gpu-hpcg`

#### Introduction

Measure GPU HPCG performance for sparse linear algebra and multigrid-style workloads.
Performed by [rocHPCG](https://github.com/ROCm/rocHPCG).

#### Parameters

| Parameter | Default | Description                                                                 |
|-----------|---------|-----------------------------------------------------------------------------|
| `--npx`   | `1`     | Number of MPI processes in the x dimension.                                 |
| `--npy`   | `1`     | Number of MPI processes in the y dimension.                                 |
| `--npz`   | `1`     | Number of MPI processes in the z dimension.                                 |
| `--nx`    | `560`   | Local problem size in the x dimension.                                      |
| `--ny`    | `280`   | Local problem size in the y dimension.                                      |
| `--nz`    | `280`   | Local problem size in the z dimension.                                      |
| `--rt`    | `60`    | Benchmark runtime in seconds.                                               |
| `--tol`   | `1.0`   | Verification control: `0` runs reference verification; non-zero skips it.   |
| `--pz`    | `0`     | Partition boundary in the z process dimension.                              |
| `--zl`    | `--nz`  | Local `nz` value for processes with z rank lower than `--pz`.               |
| `--zu`    | `--nz`  | Local `nz` value for processes with z rank greater than or equal to `--pz`. |

#### Metrics

rocHPCG reports performance and time metrics.
Performance metrics are reported for `final`, `ddot`, `waxpby`, `spmv`, `mg`, and `total`.
The metric key includes the configured process domain and local problem size:
`p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}`.

| Name                                                                                             | Unit             | Description                                             |
|--------------------------------------------------------------------------------------------------|------------------|---------------------------------------------------------|
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_flops`                          | FLOPS (GFLOPS)   | Throughput for the specified rocHPCG operation.         |
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_bandwidth`                      | bandwidth (GB/s) | Bandwidth for the specified rocHPCG operation.          |
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_flops_per_process`              | FLOPS (GFLOPS)   | Per-process throughput for the specified operation.     |
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_bandwidth_per_process`          | bandwidth (GB/s) | Per-process bandwidth for the specified operation.      |
| `gpu-hpcg/setup_time_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}`                                  | time (s)         | Setup phase duration.                                   |
| `gpu-hpcg/optimization_time_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}`                           | time (s)         | Optimization phase duration.                            |
| `gpu-hpcg/total_time_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}`                                  | time (s)         | Total runtime.                                          |

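As an illustration of the key layout, the sketch below builds the workload prefix and a few metric names from the parameters. The helper names are hypothetical, and the rank count reflects the assumption that the three process dimensions multiply to the number of MPI processes the run needs.

```python
def workload_prefix(npx, npy, npz, nx, ny, nz):
    """Process domain and local problem size, e.g. p2x2x2_n560x280x280."""
    return f"p{npx}x{npy}x{npz}_n{nx}x{ny}x{nz}"


def operation_metric(operation, suffix, prefix):
    """Name of a per-operation metric such as ..._flops or ..._bandwidth."""
    return f"gpu-hpcg/{operation}_{prefix}_{suffix}"


def phase_time_metric(phase, prefix):
    """Name of a time metric for the setup, optimization, or total phase."""
    return f"gpu-hpcg/{phase}_time_{prefix}"


def required_ranks(npx, npy, npz):
    """MPI process count implied by the grid (assumed to be the product)."""
    return npx * npy * npz


if __name__ == "__main__":
    # Hypothetical 2x2x2 process grid with the default local problem size.
    prefix = workload_prefix(2, 2, 2, 560, 280, 280)
    print(required_ranks(2, 2, 2))
    print(operation_metric("spmv", "flops", prefix))
    print(phase_time_metric("setup", prefix))
```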
### `cpu-stream`

#### Introduction

Measure memory bandwidth and computation rate for simple vector kernels,
performed by the [University of Virginia STREAM benchmark](https://www.cs.virginia.edu/stream/ref.html).

#### Metrics

| Name                                                     | Unit             | Description                                                    |
|----------------------------------------------------------|------------------|----------------------------------------------------------------|
| cpu-stream/threads                                       |                  | Number of threads used for the test. Determined by core count. |
| cpu-stream/['copy', 'scale', 'add', 'triad']\_throughput | bandwidth (MB/s) | Memory throughput of the designated kernel operation.          |
| cpu-stream/['copy', 'scale', 'add', 'triad']\_time_avg   | time (s)         | Average elapsed times over all iterations.                     |
| cpu-stream/['copy', 'scale', 'add', 'triad']\_time_min   | time (s)         | Minimum elapsed times over all iterations.                     |
| cpu-stream/['copy', 'scale', 'add', 'triad']\_time_max   | time (s)         | Maximum elapsed times over all iterations.                     |

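For readers who want to relate the throughput and time metrics, the sketch below follows the byte-counting convention of the reference STREAM source, where copy and scale touch two arrays per iteration and add and triad touch three. The 8-byte element size, the decimal megabyte, and the use of the minimum time are assumptions about the configuration, and `stream_bandwidth_mbps` is a hypothetical helper.

```python
# Arrays read or written per iteration by each STREAM kernel.
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_bandwidth_mbps(kernel, array_size, min_time_s, bytes_per_element=8):
    """Bandwidth in MB/s (1 MB = 1e6 bytes) from the best elapsed time."""
    if min_time_s <= 0:
        raise ValueError("min_time_s must be positive")
    bytes_moved = ARRAYS_TOUCHED[kernel] * bytes_per_element * array_size
    return bytes_moved / min_time_s / 1e6


if __name__ == "__main__":
    # Hypothetical run: 10 million doubles per array, best triad time 10 ms.
    print(stream_bandwidth_mbps("triad", 10_000_000, 0.01))
```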
## Communication Benchmarks

### `cpu-memory-bw-latency`

#### Introduction

Measure the memory copy bandwidth and latency across different CPU NUMA nodes,
performed by the [Intel MLC Tool](https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html).

#### Metrics

| Name                                                                    | Unit             | Description                                                         |
|-------------------------------------------------------------------------|------------------|---------------------------------------------------------------------|
| cpu-memory-bw-latency/mem\_bandwidth\_matrix\_numa\_[0-9]+\_[0-9]+\_bw  | bandwidth (MB/s) | Memory bandwidth from the first listed NUMA node to the second.     |
| cpu-memory-bw-latency/mem\_bandwidth\_matrix\_numa\_[0-9]+\_[0-9]+\_lat | time (ns)        | Memory latency from the first listed NUMA node to the second.       |
| cpu-memory-bw-latency/mem\_max\_bandwidth\_all\_reads\_bw               | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, full read.                      |
| cpu-memory-bw-latency/mem\_max\_bandwidth\_3_1\_reads-writes\_bw        | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 3 : 1.           |
| cpu-memory-bw-latency/mem\_max\_bandwidth\_2_1\_reads-writes\_bw        | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 2 : 1.           |
| cpu-memory-bw-latency/mem\_max\_bandwidth\_1_1\_reads-writes\_bw        | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 1 : 1.           |
| cpu-memory-bw-latency/mem\_max\_bandwidth\_stream-triad\_like\_bw       | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, with stream-triad like pattern. |

### `mem-bw`

#### Introduction

Measure the memory copy bandwidth across PCIe and the memory copy bandwidth between GPUs,
performed by the [NVIDIA](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/1_Utilities/bandwidthTest)
or [AMD](https://github.com/ROCm-Developer-Tools/HIP/tree/master/samples/1_Utils/hipBusBandwidth) bandwidth test tool.
The NVIDIA `bandwidthTest` sample was out of date and has been removed as of the CUDA Samples 12.9 release. For up-to-date bandwidth measurements, refer to the nvbandwidth benchmark instead.

#### Metrics

| Name          | Unit             | Description                      |
|---------------|------------------|----------------------------------|
| mem-bw/h2d_bw | bandwidth (GB/s) | Host to device copy bandwidth.   |
| mem-bw/d2h_bw | bandwidth (GB/s) | Device to host copy bandwidth.   |
| mem-bw/d2d_bw | bandwidth (GB/s) | Device to device copy bandwidth. |

### `gpu-copy-bw`

Measure the memory copy bandwidth achieved by the GPU SM or DMA engine, including device-to-host, host-to-device, and device-to-device copies.
For measurements of peer-to-peer communication performance between AMD GPUs, GPU memory buffers are allocated in `hipDeviceMallocUncached` (previously `hipDeviceMallocFinegrained`) mode to maximize performance.

#### Metrics

| Name                                                        | Unit             | Description                                                                                                                              |
|-------------------------------------------------------------|------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| cpu\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_bw    | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading one NUMA node's host memory using DMA engine or GPU SM.                                  |
| gpu[0-9]+\_to\_cpu\_by\_(sm\|dma)\_under\_numa[0-9]+\_bw    | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing one NUMA node's host memory using DMA engine or GPU SM.                                  |
| gpu[0-9]+\_to\_gpu[0-9]+\_by\_(sm\|dma)\_bw                 | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing its own memory using DMA engine or GPU SM.                                    |
| gpu[0-9]+\_to\_gpu[0-9]+\_(read\|write)\_by\_(sm\|dma)\_bw  | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
| cpu\_and\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_bw   | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM.                       |
| gpu[0-9]+\_and\_cpu\_by\_(sm\|dma)\_under\_numa[0-9]+\_bw   | bandwidth (GB/s) | Same as above, but generated by `--dtoh --bidirectional`.                                                                                |
| gpu[0-9]+\_and\_gpu[0-9]+\_by\_(sm\|dma)\_bw                | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing its own memory using DMA engine or GPU SM.                                    |
| gpu[0-9]+\_and\_gpu[0-9]+\_(read\|write)\_by\_(sm\|dma)\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
| gpu[0-9]+\_to\_gpu\_all\_write\_by\_sm\_bw                  | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing all peer GPUs' memory using GPU SM with peer communication enabled.                      |
| gpu\_all\_to\_gpu[0-9]+\_write\_by\_sm\_bw                  | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing one GPU's memory using GPU SM with peer communication enabled.                     |
| gpu\_all\_to\_gpu\_all\_write\_by\_sm\_bw                   | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing all peer GPUs' memory using GPU SM with peer communication enabled.                |

### `gpu-stream`

#### Introduction

Measure GPU memory bandwidth using the BabelStream (`hip-stream`) backend.
The benchmark executes copy, scale, add, triad, and dot operations.
The `array_size` parameter represents the number of elements.
Each benchmark run measures the GPU visible to the current process.

#### Metrics

| Metric Name                                                       | Unit             | Description                                                                                    |
|-------------------------------------------------------------------|------------------|------------------------------------------------------------------------------------------------|
| STREAM\_INIT\_[float\|double]\_array\_[0-9]+\_bw                 | bandwidth (GB/s) | Initialization phase bandwidth for the current benchmark run and one array size.              |
| STREAM\_INIT\_[float\|double]\_array\_[0-9]+\_time               | time (s)         | Initialization phase runtime for the current benchmark run and one array size.                |
| STREAM\_READ\_[float\|double]\_array\_[0-9]+\_bw                 | bandwidth (GB/s) | Read phase bandwidth for the current benchmark run and one array size.                        |
| STREAM\_READ\_[float\|double]\_array\_[0-9]+\_time               | time (s)         | Read phase runtime for the current benchmark run and one array size.                          |
| STREAM\_COPY\_[float\|double]\_array\_[0-9]+\_bw                 | bandwidth (GB/s) | Maximum copy bandwidth for the current benchmark run and one array size.                       |
| STREAM\_COPY\_[float\|double]\_array\_[0-9]+\_time\_min          | time (s)         | Minimum copy runtime for the current benchmark run and one array size.                        |
| STREAM\_COPY\_[float\|double]\_array\_[0-9]+\_time\_max          | time (s)         | Maximum copy runtime for the current benchmark run and one array size.                        |
| STREAM\_COPY\_[float\|double]\_array\_[0-9]+\_time\_avg          | time (s)         | Average copy runtime for the current benchmark run and one array size.                        |
| STREAM\_MUL\_[float\|double]\_array\_[0-9]+\_bw                  | bandwidth (GB/s) | Maximum mul bandwidth for the current benchmark run and one array size.                       |
| STREAM\_MUL\_[float\|double]\_array\_[0-9]+\_time\_min           | time (s)         | Minimum mul runtime for the current benchmark run and one array size.                         |
| STREAM\_MUL\_[float\|double]\_array\_[0-9]+\_time\_max           | time (s)         | Maximum mul runtime for the current benchmark run and one array size.                         |
| STREAM\_MUL\_[float\|double]\_array\_[0-9]+\_time\_avg           | time (s)         | Average mul runtime for the current benchmark run and one array size.                         |
| STREAM\_ADD\_[float\|double]\_array\_[0-9]+\_bw                  | bandwidth (GB/s) | Maximum add bandwidth for the current benchmark run and one array size.                        |
| STREAM\_ADD\_[float\|double]\_array\_[0-9]+\_time\_min           | time (s)         | Minimum add runtime for the current benchmark run and one array size.                         |
| STREAM\_ADD\_[float\|double]\_array\_[0-9]+\_time\_max           | time (s)         | Maximum add runtime for the current benchmark run and one array size.                         |
| STREAM\_ADD\_[float\|double]\_array\_[0-9]+\_time\_avg           | time (s)         | Average add runtime for the current benchmark run and one array size.                         |
| STREAM\_TRIAD\_[float\|double]\_array\_[0-9]+\_bw                | bandwidth (GB/s) | Maximum triad bandwidth for the current benchmark run and one array size.                      |
| STREAM\_TRIAD\_[float\|double]\_array\_[0-9]+\_time\_min         | time (s)         | Minimum triad runtime for the current benchmark run and one array size.                       |
| STREAM\_TRIAD\_[float\|double]\_array\_[0-9]+\_time\_max         | time (s)         | Maximum triad runtime for the current benchmark run and one array size.                       |
| STREAM\_TRIAD\_[float\|double]\_array\_[0-9]+\_time\_avg         | time (s)         | Average triad runtime for the current benchmark run and one array size.                       |
| STREAM\_DOT\_[float\|double]\_array\_[0-9]+\_bw                  | bandwidth (GB/s) | Maximum dot bandwidth for the current benchmark run and one array size.                        |
| STREAM\_DOT\_[float\|double]\_array\_[0-9]+\_time\_min           | time (s)         | Minimum dot runtime for the current benchmark run and one array size.                         |
| STREAM\_DOT\_[float\|double]\_array\_[0-9]+\_time\_max           | time (s)         | Maximum dot runtime for the current benchmark run and one array size.                         |
| STREAM\_DOT\_[float\|double]\_array\_[0-9]+\_time\_avg           | time (s)         | Average dot runtime for the current benchmark run and one array size.                         |

`gpu-stream` reports `phase` and `function` metrics. `_ratio` and `block_*` metrics are removed.
Bandwidth metrics are converted from BabelStream `max_mbytes_per_sec` by using `GB/s = MB/s / 1000`.
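The unit conversion above can be sketched in a few lines of Python. The function name and the sample value are hypothetical and only illustrate the decimal (1 GB = 1000 MB) conversion; they are not taken from the benchmark's source.

```python
def mbytes_to_gbytes_per_sec(max_mbytes_per_sec: float) -> float:
    """Convert a BabelStream MB/s figure to GB/s using decimal units."""
    return max_mbytes_per_sec / 1000


# Hypothetical BabelStream `max_mbytes_per_sec` value for the copy function.
copy_bw_gbps = mbytes_to_gbytes_per_sec(1_350_000.0)
print(copy_bw_gbps)  # 1350.0
```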

### `ib-loopback`

#### Introduction

Measure the InfiniBand loopback verbs bandwidth, performed by
[OFED performance tests](https://github.com/linux-rdma/perftest/tree/7504ce48ac396a02f4d00de359257b2cb8458f06).

#### Metrics

| Name                                | Unit             | Description                                                  |
|-------------------------------------|------------------|--------------------------------------------------------------|
| ib-loopback/ib_write_bw_${msg_size} | bandwidth (GB/s) | InfiniBand loopback write bandwidth with given message size. |
| ib-loopback/ib_read_bw_${msg_size}  | bandwidth (GB/s) | InfiniBand loopback read bandwidth with given message size.  |
| ib-loopback/ib_send_bw_${msg_size}  | bandwidth (GB/s) | InfiniBand loopback send bandwidth with given message size.  |

### `nccl-bw` / `rccl-bw`

#### Introduction

Measure the performance of NCCL/RCCL operations under multi-node traffic patterns,
performed by [nccl-tests](https://github.com/NVIDIA/nccl-tests/tree/44df0bf010dcc95e840ca0fb7466c67cff3f1f0f)
or [rccl-tests](https://github.com/ROCmSoftwarePlatform/rccl-tests/tree/dc1ad4853d7ec738387d42a75a58a98d7af00c7b).
The following operations are currently supported: allreduce, allgather, broadcast, reduce, reducescatter, alltoall.
Both in-place and out-of-place measurements are supported.

The following traffic patterns are supported:
* `all-nodes`: validate the NCCL/RCCL performance across all VM nodes simultaneously.
* `pair-wise`: validate the NCCL/RCCL performance across VM pairs, covering all possible combinations in parallel.
* `k-batch`: validate the NCCL/RCCL performance across VM groups with a specified batch scale.
* `topo-aware`: validate the NCCL/RCCL performance across VM pairs with different distances/hops as a quick test.
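As an illustration of the `pair-wise` idea, the sketch below builds rounds of disjoint VM pairs so that every pair is covered exactly once and all pairs within a round could run in parallel. It uses the classic round-robin (circle) scheduling method, which is an assumption made for illustration; the function name `pairwise_rounds` is hypothetical and the benchmark's actual grouping logic may differ.

```python
def pairwise_rounds(nodes):
    """Return rounds of disjoint pairs so that every pair of nodes meets exactly once."""
    nodes = list(nodes)
    if len(nodes) % 2:
        nodes.append(None)  # placeholder: its partner sits out that round
    n = len(nodes)
    rounds = []
    for _ in range(n - 1):
        # Pair the i-th node from the front with the i-th node from the back.
        pairs = [(nodes[i], nodes[n - 1 - i]) for i in range(n // 2)]
        rounds.append([p for p in pairs if None not in p])
        # Keep the first node fixed and rotate the rest by one position.
        nodes = [nodes[0]] + [nodes[-1]] + nodes[1:-1]
    return rounds


# Four hypothetical VMs, identified by index.
for round_index, pairs in enumerate(pairwise_rounds(range(4))):
    print(round_index, pairs)
```

With an even number of VMs, each round uses every VM exactly once; with an odd number, one VM is idle per round.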

#### Metrics

| Name                                   | Unit             | Description                                                 |
|----------------------------------------|------------------|-------------------------------------------------------------|
| nccl-bw/${operation}_${msg_size}_time  | time (us)        | NCCL operation latency with given message size.             |
| nccl-bw/${operation}_${msg_size}_algbw | bandwidth (GB/s) | NCCL operation algorithm bandwidth with given message size. |
| nccl-bw/${operation}_${msg_size}_busbw | bandwidth (GB/s) | NCCL operation bus bandwidth with given message size.       |
| rccl-bw/${operation}_${msg_size}_time  | time (us)        | RCCL operation latency with given message size.             |
| rccl-bw/${operation}_${msg_size}_algbw | bandwidth (GB/s) | RCCL operation algorithm bandwidth with given message size. |
| rccl-bw/${operation}_${msg_size}_busbw | bandwidth (GB/s) | RCCL operation bus bandwidth with given message size.       |

If MPI mode is enabled and a traffic pattern is specified, the metric name pattern changes to `nccl-bw/${operation}_${serial_index}_${parallel_index}:${msg_size}_time`, where:
- `serial_index` is the index of the host group among the groups that run in serial.
- `parallel_index` is the index of the host list among the lists that run in parallel.
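A result consumer may want to split such metric names back into their fields. The following is a minimal sketch, assuming that `msg_size` is a plain integer, that operation names are lowercase letters, and that the `algbw`/`busbw` metrics and the `rccl-bw` prefix follow the same layout as the `_time` example; `METRIC_RE` and `parse_metric` are hypothetical names.

```python
import re

# Assumed layout: <benchmark>/<operation>_<serial>_<parallel>:<msg_size>_<kind>
METRIC_RE = re.compile(
    r"^(?P<benchmark>nccl-bw|rccl-bw)/"
    r"(?P<operation>[a-z]+)_"
    r"(?P<serial_index>\d+)_(?P<parallel_index>\d+):"
    r"(?P<msg_size>\d+)_(?P<kind>time|algbw|busbw)$"
)


def parse_metric(name):
    """Return the metric fields as a dict, or None if the name has a different layout."""
    match = METRIC_RE.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("serial_index", "parallel_index", "msg_size"):
        fields[key] = int(fields[key])
    return fields


print(parse_metric("nccl-bw/allreduce_0_1:8388608_time"))
```

Names without the index segment, such as `nccl-bw/allreduce_8388608_time`, do not match and yield `None`.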

### `tcp-connectivity`

#### Introduction

Test the TCP connectivity between the current node and the nodes in the hostfile,
performed by [tcping](https://github.com/zhengxiaowai/tcping).

#### Metrics

| Metrics                                         | Unit      | Description                                                                                 |
|-------------------------------------------------|-----------|---------------------------------------------------------------------------------------------|
| tcp-connectivity/${hostname/ip}_successed_count | count     | Number of successful TCP connections between the current node and the given host.           |
| tcp-connectivity/${hostname/ip}_failed_count    | count     | Number of failed TCP connections between the current node and the given host.               |
| tcp-connectivity/${hostname/ip}_success_rate    |           | Success rate (successful/total) of TCP connections between the current node and the given host. |
| tcp-connectivity/${hostname/ip}_time_min        | time (ms) | Minimum latency of TCP connections between the current node and the given host.             |
| tcp-connectivity/${hostname/ip}_time_max        | time (ms) | Maximum latency of TCP connections between the current node and the given host.             |
| tcp-connectivity/${hostname/ip}_time_avg        | time (ms) | Average latency of TCP connections between the current node and the given host.             |
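How these per-host values relate to individual connection attempts can be sketched as follows. This is an illustration only: the `summarize` helper and its input format (one latency in ms per attempt, `None` for a failure) are hypothetical, and expressing the success rate as a fraction rather than a percentage is an assumption.

```python
def summarize(latencies_ms):
    """Summarize connection attempts; each entry is a latency in ms, or None if the attempt failed."""
    ok = [t for t in latencies_ms if t is not None]
    total = len(latencies_ms)
    return {
        "successed_count": len(ok),
        "failed_count": total - len(ok),
        # Assumption: rate expressed as a fraction of all attempts.
        "success_rate": len(ok) / total if total else 0.0,
        "time_min": min(ok) if ok else None,
        "time_max": max(ok) if ok else None,
        "time_avg": sum(ok) / len(ok) if ok else None,
    }


# Four hypothetical attempts against one host, one of which failed.
print(summarize([10.0, None, 30.0, 20.0]))
```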

### `gpcnet-network-test` / `gpcnet-network-load-test`

#### Introduction

A distributed test of global network performance and congestion,
performed by [GPCNET](https://github.com/netbench/GPCNET).

`gpcnet-network-test`: full-system network tests in random and natural ring, alltoall, and allreduce patterns. Requires at least 2 nodes.

`gpcnet-network-load-test`: selected full-system network tests run with four congestors to measure network congestion or contention. Requires at least 10 nodes.

 - Supported network tests: RR Two-sided Lat (8 B), RR Get Lat (8 B), RR Two-sided BW (131072 B), RR Put BW (131072 B), RR Two-sided BW+Sync (131072 B), Nat Two-sided BW (131072 B), Multiple Allreduce (8 B), Multiple Alltoall (4096 B)
 - Supported congestors: Alltoall (4096 B), Two-sided Incast (4096 B), Put Incast (4096 B), Get Bcast (4096 B)

#### Metrics

| Metrics                                                 | Unit                   | Description                                                                                                                                                     |
|---------------------------------------------------------|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| gpcnet-network-test/rr_two-sided_lat_${stat}            | time (us)              | Statistics (min, max, avg, 99%, 99.9%) collected across all nodes for the 'random ring communication pattern two-sided latency' network test.                   |
| gpcnet-network-test/rr_two-sided+sync_bw_${stat}        | bandwidth (MiB/s/rank) | Statistics (min, max, avg, 99%, 99.9%) collected across all nodes for the 'random ring communication pattern two-sided bandwidth with barrier' network test.    |
| gpcnet-network-test/multiple_allreduce_time_${stat}     | time (us)              | Statistics (min, max, avg, 99%, 99.9%) collected across all nodes for the 'Multiple Allreduce (8 B)' network test.                                              |
| gpcnet-network-test/rr_get_lat_${stat}                  | time (us)              | Statistics (min, max, avg, 99%, 99.9%) collected across all nodes for the 'RR Get Lat (8 B)' network test.                                                      |
| gpcnet-network-test/rr_two-sided_bw_${stat}             | bandwidth (MiB/s/rank) | Statistics (min, max, avg, 99%, 99.9%) collected across all nodes for the 'RR Two-sided BW (131072 B)' network test.                                            |
| gpcnet-network-test/nat_two-sided_bw_${stat}            | bandwidth (MiB/s/rank) | Statistics (min, max, avg, 99%, 99.9%) collected across all nodes for the 'Nat Two-sided BW (131072 B)' network test.                                           |
| gpcnet-network-test/multiple_alltoall_bw_${stat}        | bandwidth (MiB/s/rank) | Statistics (min, max, avg, 99%, 99.9%) collected across all nodes for the 'Multiple Alltoall (4096 B)' network test.                                            |
| gpcnet-network-load-test/rr_two-sided_lat_x_${stat}     | factor (x)             | Congestion impact factor summary for the corresponding network test.                                                                                            |
| gpcnet-network-load-test/rr_two-sided+sync_bw_x_${stat} | factor (x)             | Congestion impact factor summary for the corresponding network test.                                                                                            |
| gpcnet-network-load-test/multiple_allreduce_x_${stat}   | factor (x)             | Congestion impact factor summary for the corresponding network test.                                                                                            |

### `ib-traffic`

#### Introduction

Measure the InfiniBand performance under multi-node traffic patterns.

The direction between client and server can be 'cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu'.

The traffic pattern is defined in a config file; predefined configs are provided for the one-to-many, many-to-one, and all-to-all patterns.
Each row in the config is one round, and all node pairs in a row run the IB command simultaneously.

Besides the above three patterns, ib-traffic also supports a topology-aware traffic pattern. To run ib-traffic with the topology-aware
pattern, specify the following 3 required and 2 optional parameters in the YAML config file:
   - `--pattern`: **topo-aware**
   - `--ibstat`: **path to ibstat output**
   - `--ibnetdiscover`: **path to ibnetdiscover output**
   - `--min_dist`: **minimum distance of VM pairs (optional, default 2)**
   - `--max_dist`: **maximum distance of VM pairs (optional, default 6)**

Each row in the config file contains all VM pairs with a fixed distance (number of hops). That is, by default the 1st, 2nd, and 3rd rows contain
all VM pairs with topology distances of 2, 4, and 6, respectively.
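The row layout described above can be sketched as a simple grouping of VM pairs by hop distance. This is a minimal illustration, assuming the pairwise distances are already known; the `rows_by_distance` helper and the sample hostnames are hypothetical, and the sketch does not cover how distances are derived from the `ibstat` and `ibnetdiscover` outputs.

```python
from collections import defaultdict


def rows_by_distance(pair_distance, min_dist=2, max_dist=6):
    """Group VM pairs into rows, one row per hop distance within [min_dist, max_dist]."""
    rows = defaultdict(list)
    for pair, dist in pair_distance.items():
        if min_dist <= dist <= max_dist:
            rows[dist].append(pair)
    # Rows are ordered by increasing distance.
    return [sorted(rows[d]) for d in sorted(rows)]


# Hypothetical hop distances between VM pairs.
distances = {
    ("vm0", "vm1"): 2,
    ("vm2", "vm3"): 2,
    ("vm0", "vm2"): 4,
    ("vm0", "vm4"): 6,
    ("vm1", "vm5"): 8,  # beyond the default max_dist, so it is dropped
}
for row in rows_by_distance(distances):
    print(row)
```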

#### Metrics

| Metrics                                                                                     | Unit             | Description |
|---------------------------------------------------------------------------------------------|------------------|-------------|
| ib-traffic/ib\_write\_bw\_${msg_size}\_${direction}\_${line}\_${pair}:${server}\_${client}  | bandwidth (GB/s) | The maximum bandwidth of perftest (`ib_write_bw`, `ib_send_bw`, `ib_read_bw`) using ${msg_size} with ${direction} ('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu'), run between the ${pair}<sup>th</sup> node pair in the ${line}<sup>th</sup> line of the config. ${server} and ${client} are the hostnames of the server and client. |
| ib-traffic/ib\_write\_lat\_${msg_size}\_${direction}\_${line}\_${pair}:${server}\_${client} | time (us)        | The maximum latency of perftest (`ib_write_lat`, `ib_send_lat`, `ib_read_lat`) using ${msg_size} with ${direction} ('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu'), run between the ${pair}<sup>th</sup> node pair in the ${line}<sup>th</sup> line of the config. ${server} and ${client} are the hostnames of the server and client. |

### `nvbandwidth`

#### Introduction

Measure bandwidth and latency for various memcpy patterns across different links using copy engine or kernel copy methods,
performed by [nvbandwidth](https://github.com/NVIDIA/nvbandwidth).

#### Metrics

| Metrics                                                 | Unit                   | Description                                                                                                                                                                |
|---------------------------------------------------------|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| host_to_device_memcpy_ce_cpu[0-9]_gpu[0-9]_bw      | GB/s                | Host to device CE memcpy using cuMemcpyAsync                      |
| host_to_device_memcpy_ce_sum_bw                    | GB/s                | Sum of the output matrix                                           |
| device_to_host_memcpy_ce_cpu[0-9]_gpu[0-9]_bw      | GB/s                | Device to host CE memcpy using cuMemcpyAsync                      |
| device_to_host_memcpy_ce_sum_bw                    | GB/s                | Sum of the output matrix                                           |
| host_to_device_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s      | A host to device copy is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. |
| host_to_device_bidirectional_memcpy_ce_sum_bw      | GB/s                | Sum of the output matrix                                           |
| device_to_host_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s      | A device to host copy is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. |
| device_to_host_bidirectional_memcpy_ce_sum_bw      | GB/s                | Sum of the output matrix                                           |
| device_to_device_memcpy_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s               | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_memcpy_read_ce_sum_bw             | GB/s                | Sum of the output matrix                                           |
| device_to_device_memcpy_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s              | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_memcpy_write_ce_sum_bw            | GB/s                | Sum of the output matrix                                           |
| device_to_device_bidirectional_memcpy_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. A copy in the opposite direction of the measured copy is run simultaneously but not measured. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_bidirectional_memcpy_read_ce_sum_bw | GB/s               | Sum of the output matrix                                           |
| device_to_device_bidirectional_memcpy_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. A copy in the opposite direction of the measured copy is run simultaneously but not measured. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_bidirectional_memcpy_write_ce_sum_bw | GB/s               | Sum of the output matrix                                           |
| all_to_host_memcpy_ce_cpu[0-9]_gpu[0-9]_bw         | GB/s                | Measures bandwidth of cuMemcpyAsync between a single device and the host while simultaneously running copies from all other devices to the host. |
| all_to_host_memcpy_ce_sum_bw                       | GB/s                | Sum of the output matrix                                           |
| all_to_host_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s              | A device to host copy is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic. |
| all_to_host_bidirectional_memcpy_ce_sum_bw         | GB/s                | Sum of the output matrix                                           |
| host_to_all_memcpy_ce_cpu[0-9]_gpu[0-9]_bw         | GB/s                | Measures bandwidth of cuMemcpyAsync from the host to a single device while simultaneously running copies from the host to all other devices. |
| host_to_all_memcpy_ce_sum_bw                       | GB/s                | Sum of the output matrix                                           |
| host_to_all_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s              | A host to device copy is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic. |
| host_to_all_bidirectional_memcpy_ce_sum_bw         | GB/s                | Sum of the output matrix                                           |
| all_to_one_write_ce_gpu[0-9]_gpu[0-9]_bw           | GB/s                | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| all_to_one_write_ce_sum_bw                         | GB/s                | Sum of the output matrix                                           |
| all_to_one_read_ce_gpu[0-9]_gpu[0-9]_bw            | GB/s                | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| all_to_one_read_ce_sum_bw                          | GB/s                | Sum of the output matrix                                           |
| one_to_all_write_ce_gpu[0-9]_gpu[0-9]_bw           | GB/s                | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| one_to_all_write_ce_sum_bw                         | GB/s                | Sum of the output matrix                                           |
| one_to_all_read_ce_gpu[0-9]_gpu[0-9]_bw            | GB/s                | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| one_to_all_read_ce_sum_bw                          | GB/s                | Sum of the output matrix                                           |
| host_to_device_memcpy_sm_cpu[0-9]_gpu[0-9]_bw      | GB/s                | Host to device SM memcpy using a copy kernel                      |
| host_to_device_memcpy_sm_sum_bw                    | GB/s                | Sum of the output matrix                                           |
| device_to_host_memcpy_sm_cpu[0-9]_gpu[0-9]_bw      | GB/s                | Device to host SM memcpy using a copy kernel                      |
| device_to_host_memcpy_sm_sum_bw                    | GB/s                | Sum of the output matrix                                           |
| device_to_device_memcpy_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s               | Measures bandwidth of a copy kernel between each pair of accessible peers. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_memcpy_read_sm_sum_bw             | GB/s                | Sum of the output matrix                                           |
| device_to_device_memcpy_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s              | Measures bandwidth of a copy kernel between each pair of accessible peers. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_memcpy_write_sm_sum_bw            | GB/s                | Sum of the output matrix                                           |
| device_to_device_bidirectional_memcpy_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run in both directions between each pair, and the sum is reported. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_bidirectional_memcpy_read_sm_sum_bw | GB/s               | Sum of the output matrix                                           |
| device_to_device_bidirectional_memcpy_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run in both directions between each pair, and the sum is reported. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_bidirectional_memcpy_write_sm_sum_bw | GB/s               | Sum of the output matrix                                           |
| all_to_host_memcpy_sm_cpu[0-9]_gpu[0-9]_bw         | GB/s                | Measures bandwidth of a copy kernel between a single device and the host while simultaneously running copies from all other devices to the host. |
| all_to_host_memcpy_sm_sum_bw                       | GB/s                | Sum of the output matrix                                           |
| all_to_host_bidirectional_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s              | A device to host bandwidth of a copy kernel is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic using copy kernels. |
| all_to_host_bidirectional_memcpy_sm_sum_bw         | GB/s                | Sum of the output matrix                                           |
| host_to_all_memcpy_sm_cpu[0-9]_gpu[0-9]_bw         | GB/s                | Measures bandwidth of a copy kernel from the host to a single device while simultaneously running copies from the host to all other devices. |
| host_to_all_memcpy_sm_sum_bw                       | GB/s                | Sum of the output matrix                                           |
| host_to_all_bidirectional_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s              | A host to device bandwidth of a copy kernel is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic using copy kernels. |
| host_to_all_bidirectional_memcpy_sm_sum_bw         | GB/s                | Sum of the output matrix                                           |
| all_to_one_write_sm_gpu[0-9]_gpu[0-9]_bw           | GB/s                | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| all_to_one_write_sm_sum_bw                         | GB/s                | Sum of the output matrix                                           |
| all_to_one_read_sm_gpu[0-9]_gpu[0-9]_bw            | GB/s                | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| all_to_one_read_sm_sum_bw                          | GB/s                | Sum of the output matrix                                           |
| one_to_all_write_sm_gpu[0-9]_gpu[0-9]_bw           | GB/s                | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| one_to_all_write_sm_sum_bw                         | GB/s                | Sum of the output matrix                                           |
| one_to_all_read_sm_gpu[0-9]_gpu[0-9]_bw            | GB/s                | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| one_to_all_read_sm_sum_bw                          | GB/s                | Sum of the output matrix                                           |
| host_device_latency_sm_cpu[0-9]_gpu[0-9]_lat       | µs                  | Host-device SM copy latency measured with a pointer-chase kernel. |
| host_device_latency_sm_sum_lat                     | µs                  | Sum of the output matrix                                           |
| device_to_device_latency_sm_gpu[0-9]_gpu[0-9]_lat  | µs                  | Measures latency of a pointer dereference operation between each pair of accessible peers. Memory is allocated on a GPU and is accessed by the peer GPU to determine latency. |
| device_to_device_latency_sm_sum_lat                | µs                  | Sum of the output matrix                                           |
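
As a rough illustration of how the per-pair entries and the `*_sum_*` entries in the table above relate, the sketch below flattens a small result matrix into metric names of the same shape. The helper name `matrix_to_metrics`, the sample values, the row/column order used in the metric name, and the skipping of unmeasured pairs are all assumptions made for illustration, not the benchmark's actual code.

```python
def matrix_to_metrics(test_name, matrix, row_prefix, col_prefix, suffix):
    """Flatten a 2-D result matrix into per-pair metrics plus a sum metric.

    Entries that are None (pair not measured) are skipped; this handling
    is an assumption made for the sketch.
    """
    metrics = {}
    total = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is None:
                continue
            metrics[f"{test_name}_{row_prefix}{i}_{col_prefix}{j}_{suffix}"] = value
            total += value
    # The "sum" metric adds up every measured entry of the matrix.
    metrics[f"{test_name}_sum_{suffix}"] = total
    return metrics


# Hypothetical 2-GPU latency matrix in microseconds (diagonal not measured).
latency_us = [
    [None, 1.5],
    [2.0, None],
]
metrics = matrix_to_metrics("device_to_device_latency_sm", latency_us, "gpu", "gpu", "lat")
for name, value in metrics.items():
    print(name, value)
```

With these sample values the sum metric is the total of the two measured pairs.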

## Computation-communication Benchmarks

### `computation-communication-overlap`

#### Introduction

Test the performance of a single node when communication and computation overlap.

#### Metrics

| Name                                                  | Unit      | Description                                                  |
|-------------------------------------------------------|-----------|--------------------------------------------------------------|
| pytorch-computation-communication-overlap/mul_time    | time (ms) | Time of communication and mul kernel computation overlap.    |
| pytorch-computation-communication-overlap/matmul_time | time (ms) | Time of communication and matmul kernel computation overlap. |

### `sharding-matmul`

#### Introduction

Test the performance of a large-scale matmul operation sharded across multiple GPUs:
* allreduce: Each GPU computes part of the matmul, and AllReduce merges the partial results into one tensor.
* allgather: Each GPU computes part of the matmul, and AllGather + Concat merges the partial results into one tensor.
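
The two merge strategies above can be sketched on plain Python lists, with a loop over ranks standing in for the GPUs. This is a minimal sketch, not the benchmark's implementation: the function names are hypothetical, and the choice of which dimension is sharded in each mode (the inner dimension for allreduce, the output columns for allgather) is an assumption.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sharded_matmul_allreduce(a, b, world_size):
    """Shard the inner dimension K; every rank yields a full-size partial sum."""
    k = len(b)
    assert k % world_size == 0, "K must divide evenly across ranks"
    step = k // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * step, (rank + 1) * step
        a_shard = [row[lo:hi] for row in a]  # M x (K / world_size)
        b_shard = b[lo:hi]                   # (K / world_size) x N
        partials.append(matmul(a_shard, b_shard))
    # AllReduce (sum): add the partial results element-wise.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def sharded_matmul_allgather(a, b, world_size):
    """Shard the columns of B; every rank yields one column block of the result."""
    n = len(b[0])
    assert n % world_size == 0, "N must divide evenly across ranks"
    step = n // world_size
    blocks = []
    for rank in range(world_size):
        lo, hi = rank * step, (rank + 1) * step
        b_shard = [row[lo:hi] for row in b]  # K x (N / world_size)
        blocks.append(matmul(a, b_shard))    # M x (N / world_size)
    # AllGather + Concat: join the column blocks row by row.
    return [sum((block[i] for block in blocks), []) for i in range(len(a))]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
reference = matmul(a, b)
allreduce_result = sharded_matmul_allreduce(a, b, world_size=2)
allgather_result = sharded_matmul_allgather(a, b, world_size=2)
print(reference, allreduce_result, allgather_result)
```

Both strategies produce the same tensor as the unsharded product; they differ in how much data each rank holds and exchanges.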

#### Metrics

| Name                                   | Unit      | Description                              |
|----------------------------------------|-----------|------------------------------------------|
| pytorch-sharding-matmul/allreduce_time | time (ms) | Time of sharding matmul using allreduce. |
| pytorch-sharding-matmul/allgather_time | time (ms) | Time of sharding matmul using allgather. |

### `dist-inference`

#### Introduction

Test the performance of distributed model inference. Both a PyTorch implementation and a C++ implementation are supported.

#### Parameters

| Parameter                | Default     | Description                                                                 |
|--------------------------|-------------|-----------------------------------------------------------------------------|
| `--use_pytorch`          | `false`     | Use the PyTorch implementation. If omitted, the C++ implementation is used. |
| `--batch_size`           | `64`        | Batch size of the generated input tensor.                                   |
| `--input_size`           | `1024`      | Input dimension of the synthetic model.                                     |
| `--hidden_size`          | `1024`      | Hidden dimension of the synthetic model.                                    |
| `--alpha`                | `1.0`       | Alpha coefficient for `D = alpha * (A * B) + beta * C`.                    |
| `--beta`                 | `1.0`       | Beta coefficient for `D = alpha * (A * B) + beta * C`.                     |
| `--num_layers`           | `1`         | Number of repeated compute-communicate-activate layers.                     |
| `--computation_kernel`   | `matmul`    | Computation kernel: `addmm`, `matmul`, or `mul`.                           |
| `--communication_kernel` | `allreduce` | Communication kernel: `allgather`, `allreduce`, or `alltoall`.             |
| `--activation_kernel`    | `relu`      | Activation kernel: `relu`, `sigmoid`, or `tanh`.                           |
| `--precision`            | `float32`   | Model precision, such as `float32` or `float16`.                           |
| `--num_warmup`           | `50`        | Number of warmup steps excluded from metrics.                              |
| `--num_steps`            | `10000`     | Number of measured inference steps.                                        |
| `--distributed_impl`     | `ddp`       | Distributed implementation for the PyTorch path.                           |
| `--distributed_backend`  | `nccl`      | Distributed backend for the PyTorch path.                                  |
| `--use_cuda_graph`       | `false`     | Launch kernels in CUDA graph mode when supported.                          |
| `--tune_gemm`            | `false`     | Tune GEMM performance before measurement in the C++ implementation.         |
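
To make the role of `--alpha` and `--beta` concrete, the sketch below evaluates `D = alpha * (A * B) + beta * C` on small lists and then applies a ReLU activation, roughly mirroring one compute-activate step with the communication step left out. The function names `addmm` and `relu` and the sample matrices are illustrative assumptions, not the benchmark's code.

```python
def addmm(a, b, c, alpha=1.0, beta=1.0):
    """Return D = alpha * (A x B) + beta * C for matrices stored as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [
            alpha * sum(a[i][k] * b[k][j] for k in range(inner)) + beta * c[i][j]
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def relu(m):
    """Element-wise ReLU activation."""
    return [[max(0.0, v) for v in row] for row in m]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity, so A x B == A
c = [[-10.0, 0.0], [0.0, 1.0]]

d = addmm(a, b, c, alpha=2.0, beta=1.0)
activated = relu(d)
print(d)
print(activated)
```

With the defaults `alpha = 1.0` and `beta = 1.0`, the expression reduces to `A * B + C`.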

#### Metrics

| Name                                            | Unit      | Description                                           |
|-------------------------------------------------|-----------|-------------------------------------------------------|
| pytorch-dist-inference/step_times               | time (ms) | Average time of model inference runs.                 |
| pytorch-dist-inference/step_times_${percentile} | time (ms) | Percentile (50, 90, 95, 99, 99.9) time of model inference runs. |
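
The relationship between the average and percentile metrics can be sketched as follows. This is a minimal sketch under stated assumptions: the helper name `summarize_step_times` is hypothetical, warmup steps are simply dropped from the front of the list, and percentiles use the nearest-rank method, which may differ from the interpolation the benchmark actually applies.

```python
import math
import statistics


def summarize_step_times(step_times_ms, num_warmup, percentiles=(50, 90, 95, 99, 99.9)):
    """Summarize per-step times (ms) into an average and nearest-rank percentiles."""
    measured = sorted(step_times_ms[num_warmup:])  # warmup steps are excluded
    summary = {"step_times": statistics.fmean(measured)}
    for p in percentiles:
        # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
        rank = max(1, math.ceil(p * len(measured) / 100))
        summary[f"step_times_{p}"] = measured[rank - 1]
    return summary


# Two hypothetical warmup steps followed by ten measured steps of 1.0 to 10.0 ms.
step_times = [9.0, 8.0] + [float(i) for i in range(1, 11)]
summary = summarize_step_times(step_times, num_warmup=2)
for name, value in summary.items():
    print(name, value)
```

With only ten samples, the higher percentiles all collapse to the slowest step, which is why a large `--num_steps` matters for meaningful tail values.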

## Storage Benchmarks

### `disk-benchmark`

#### Introduction

Measure disk performance using [FIO](https://github.com/axboe/fio/tree/0313e938c9c8bb37d71dade239f1f5326677b079).

#### Metrics

| Name                                                          | Unit         | Description                                              |
|---------------------------------------------------------------|--------------|----------------------------------------------------------|
| disk-benchmark/${disk_name}_rand_read_write_bs                | size (bytes) | Disk random read write block size.                       |
| disk-benchmark/${disk_name}_rand_read_write_read_iops         | IOPS         | Disk random read write read IOPS.                        |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_95.0  | time (ns)    | Disk random read write read latency in 95.0 percentile.  |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_99.0  | time (ns)    | Disk random read write read latency in 99.0 percentile.  |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_99.9  | time (ns)    | Disk random read write read latency in 99.9 percentile.  |
| disk-benchmark/${disk_name}_rand_read_write_write_iops        | IOPS         | Disk random read write write IOPS.                       |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_95.0 | time (ns)    | Disk random read write write latency in 95.0 percentile. |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_99.0 | time (ns)    | Disk random read write write latency in 99.0 percentile. |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_99.9 | time (ns)    | Disk random read write write latency in 99.9 percentile. |