Unverified Commit 976803f8 authored by Yifan Xiong, committed by GitHub

Docs - Add introduction and metrics in benchmarks docs (#233)

Add introduction and metrics for micro-benchmarks and model-benchmarks document.
# Micro Benchmarks
## Benchmarking list
### Computation benchmark
### Communication benchmark
### Computation-communication benchmark
### Storage benchmark
## Benchmarking metrics
<table>
<tbody>
<tr valign="top">
<td align="center" valign="middle">
<b>Metrics</b>
</td>
<td>
<ul><li><b>Computation Benchmark</b></li>
<ul><li><b>GEMM FLOPS</b></li>
<ul>
<li>GFLOPS</li>
<li>TensorCore</li>
<li>cuBLAS</li>
<li>cuDNN</li>
</ul>
</ul>
<ul><li><b>Kernel Launch Time</b></li>
<ul>
<li>Kernel_Launch_Event_Time</li>
<li>Kernel_Launch_Wall_Time</li>
</ul>
</ul>
<ul><li><b>Operator Performance</b></li>
<ul><li>MatMul</li><li>Sharding_MatMul</li></ul>
</ul>
</ul>
<ul><li><b>Communication Benchmark</b></li>
<ul><li><b>Memory</b></li>
<ul><li>H2D_Mem_BW_&lt;GPU ID&gt;</li>
<li>D2H_Mem_BW_&lt;GPU ID&gt;</li></ul>
</ul>
<ul><li><b>Device P2P Bandwidth</b></li>
<ul><li>P2P_BW_Max</li><li>P2P_BW_Min</li><li>P2P_BW_Avg</li></ul>
</ul>
<ul><li><b>RDMA</b></li>
<ul><li>RDMA_Peak</li><li>RDMA_Avg</li></ul>
</ul>
<ul><li><b>NCCL</b></li>
<ul><li>NCCL_AllReduce</li><li>NCCL_AllGather</li><li>NCCL_broadcast</li><li>NCCL_reduce</li><li>NCCL_reduce_scatter</li></ul>
</ul>
</ul>
<ul><li><b>Computation-Communication Benchmark</b></li>
<ul><li><b>Mul_During_NCCL</b></li><li><b>MatMul_During_NCCL</b></li></ul>
</ul>
<ul><li><b>Storage Benchmark</b></li>
<ul><li><b>Disk</b></li>
<ul>
<li>Seq_Read/Seq_Write</li><li>Rand_Read/Rand_Write</li>
<li>Seq_R/W_Read</li><li>Seq_R/W_Write</li><li>Rand_R/W_Read</li><li>Rand_R/W_Write</li>
</ul>
</ul>
</ul>
</td>
</tr>
</tbody>
</table>
## Computation Benchmarks
### `kernel-launch`
#### Introduction
Measure the GPU kernel launch latency,
defined as the time from the beginning of the launch API call to the beginning of the kernel execution.
#### Metrics
| Name | Unit | Description |
|------------------------------|-----------|--------------------------------------|
| kernel-launch/event_overhead | time (ms) | Launch latency measured in GPU time. |
| kernel-launch/wall_overhead | time (ms) | Launch latency measured in CPU time. |
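
The metrics above come from the benchmark's own binaries. Purely as an illustration of the event-time vs. wall-time distinction, here is a hedged PyTorch sketch using `torch.cuda.Event` and a host timer; the tiny in-place `add_` kernel is a stand-in for the (near-)empty kernel launches the benchmark measures.

```python
import time
import torch

def launch_overhead_sketch(num_steps=2000):
    """Sketch only: average per-launch time in GPU (event) time and CPU (wall) time."""
    x = torch.zeros(1, device='cuda')
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)

    # GPU time: bracket the launches with CUDA events recorded on the stream.
    start_evt.record()
    for _ in range(num_steps):
        x.add_(0.0)  # stand-in for a tiny kernel launch
    end_evt.record()
    torch.cuda.synchronize()
    event_ms = start_evt.elapsed_time(end_evt) / num_steps

    # CPU time: bracket the same launches with a host-side timer.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(num_steps):
        x.add_(0.0)
    torch.cuda.synchronize()
    wall_ms = (time.perf_counter() - t0) * 1000 / num_steps

    return event_ms, wall_ms
```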
### `gemm-flops`
#### Introduction
Measure the GPU GEMM FLOPS for different float and int data types, with or without Tensor Core (XDLOPS),
performed by NVIDIA [cutlass](https://github.com/NVIDIA/cutlass/tree/ccb697bac77fcc898e9c897b2c90aa5b60ac72fb)
or AMD [rocblas-bench](https://github.com/ROCmSoftwarePlatform/rocBLAS/tree/develop/clients/benchmarks).
#### Metrics
| Name | Unit | Description |
|------------------------|----------------|---------------------------------------------------------|
| gemm-flops/FP64 | FLOPS (GFLOPS) | GEMM float64 peak FLOPS. |
| gemm-flops/FP32 | FLOPS (GFLOPS) | GEMM float32 peak FLOPS. |
| gemm-flops/FP16 | FLOPS (GFLOPS) | GEMM float16 peak FLOPS. |
| gemm-flops/FP64_TC | FLOPS (GFLOPS) | GEMM float64 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/TF32_TC | FLOPS (GFLOPS) | GEMM tensor-float32 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/FP16_TC | FLOPS (GFLOPS) | GEMM float16 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/BF16_TC | FLOPS (GFLOPS) | GEMM bfloat16 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/INT8_TC | IOPS (GIOPS) | GEMM int8 peak IOPS with NVIDIA Tensor Core. |
| gemm-flops/INT4_TC | IOPS (GIOPS) | GEMM int4 peak IOPS with NVIDIA Tensor Core. |
| gemm-flops/FP32_xDLOPS | FLOPS (GFLOPS) | GEMM float32 peak FLOPS with AMD XDLOPS.                 |
| gemm-flops/FP16_xDLOPS | FLOPS (GFLOPS) | GEMM float16 peak FLOPS with AMD XDLOPS. |
| gemm-flops/BF16_xDLOPS | FLOPS (GFLOPS) | GEMM bfloat16 peak FLOPS with AMD XDLOPS. |
| gemm-flops/INT8_xDLOPS | IOPS (GIOPS) | GEMM int8 peak IOPS with AMD XDLOPS. |
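
The FLOPS numbers themselves are produced by the cutlass or rocblas-bench binaries linked above. As a worked illustration of how a GEMM throughput figure follows from matrix shape and elapsed time (a GEMM of shape M×N×K costs 2·M·N·K floating-point operations), here is a hedged PyTorch stand-in:

```python
import torch

def gemm_gflops_sketch(m=8192, n=8192, k=8192, dtype=torch.float16, iters=20):
    """Sketch only: estimate GEMM throughput as 2*M*N*K / elapsed_time.
    The benchmark itself drives cutlass / rocblas-bench, not PyTorch."""
    a = torch.randn(m, k, dtype=dtype, device='cuda')
    b = torch.randn(k, n, dtype=dtype, device='cuda')
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    torch.matmul(a, b)  # warm up
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1000 / iters
    return 2 * m * n * k / seconds / 1e9  # GFLOPS (one multiply-add = 2 FLOPs)
```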
### `matmul`
#### Introduction
Measure the performance of a large-scale matmul operation using `torch.matmul` on a single GPU.
#### Metrics
| Name | Unit | Description |
|---------------------------|-----------|--------------------------------|
| pytorch-matmul/nosharding | time (ms) | Time of pure matmul operation. |
### `cublas-function`
TODO
### `cudnn-function`
TODO
## Communication Benchmarks
### `mem-bw`
#### Introduction
Measure the memory copy bandwidth across PCIe and between GPUs,
performed by [NVIDIA](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/bandwidthTest)
or [AMD](https://github.com/ROCm-Developer-Tools/HIP/tree/master/samples/1_Utils/hipBusBandwidth) bandwidth test tool.
#### Metrics
| Name | Unit | Description |
|-------------------|------------------|----------------------------------|
| mem-bw/H2D_Mem_BW | bandwidth (GB/s) | Host to device copy bandwidth. |
| mem-bw/D2H_Mem_BW | bandwidth (GB/s) | Device to host copy bandwidth. |
| mem-bw/D2D_Mem_BW | bandwidth (GB/s) | Device to device copy bandwidth. |
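
As an illustration of what the H2D/D2H numbers measure, below is a hedged PyTorch sketch that times copies between pinned host memory and device memory; the reported metrics come from the vendor bandwidth tools above, not from code like this.

```python
import torch

def copy_bandwidth_sketch(size_mb=256, iters=50):
    """Sketch only: host-to-device and device-to-host copy bandwidth in GB/s."""
    n = size_mb * 1024 * 1024
    host = torch.empty(n, dtype=torch.uint8, pin_memory=True)  # pinned, so the DMA rate is measured
    dev = torch.empty(n, dtype=torch.uint8, device='cuda')
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    results = {}
    for name, src, dst in [('H2D_Mem_BW', host, dev), ('D2H_Mem_BW', dev, host)]:
        dst.copy_(src)  # warm up
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            dst.copy_(src, non_blocking=True)
        end.record()
        torch.cuda.synchronize()
        seconds = start.elapsed_time(end) / 1000
        results[name] = n * iters / seconds / 1e9  # GB/s
    return results
```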
### `gpu-sm-copy-bw`
#### Introduction
Measure the memory copy bandwidth across PCIe and between GPUs, initiated by GPU SM.
#### Metrics
| Name | Unit | Description |
|---------------------|------------------|------------------------------------------------------|
| gpu-sm-copy-bw/htod | bandwidth (GB/s) | Host to device copy bandwidth initiated by GPU SM.   |
| gpu-sm-copy-bw/dtoh | bandwidth (GB/s) | Device to host copy bandwidth initiated by GPU SM.   |
### `ib-loopback`
#### Introduction
Measure the InfiniBand loopback verbs bandwidth, performed by
[OFED performance tests](https://github.com/linux-rdma/perftest/tree/7504ce48ac396a02f4d00de359257b2cb8458f06).
#### Metrics
| Name | Unit | Description |
|----------------------------------------------------|------------------|--------------------------------------------------------------|
| ib-loopback/IB\_write\_${msg\_size}\_Avg\_${ib\_dev} | bandwidth (MB/s) | InfiniBand loopback write bandwidth with given message size.  |
| ib-loopback/IB\_read\_${msg\_size}\_Avg\_${ib\_dev}  | bandwidth (MB/s) | InfiniBand loopback read bandwidth with given message size.   |
| ib-loopback/IB\_send\_${msg\_size}\_Avg\_${ib\_dev}  | bandwidth (MB/s) | InfiniBand loopback send bandwidth with given message size.   |
### `nccl-bw` / `rccl-bw`
#### Introduction
Measure the performance of NCCL/RCCL operations,
performed by [nccl-tests](https://github.com/NVIDIA/nccl-tests/tree/44df0bf010dcc95e840ca0fb7466c67cff3f1f0f)
or [rccl-tests](https://github.com/ROCmSoftwarePlatform/rccl-tests/tree/dc1ad4853d7ec738387d42a75a58a98d7af00c7b).
The following operations are currently supported: allreduce, allgather, broadcast, reduce, reducescatter, alltoall.
#### Metrics
| Name | Unit | Description |
|----------------------------------------|------------------|-------------------------------------------------------------|
| nccl-bw/${operation}_${msg_size}_time  | time (us)        | NCCL operation latency with given message size.              |
| nccl-bw/${operation}_${msg_size}_algbw | bandwidth (GB/s) | NCCL operation algorithm bandwidth with given message size. |
| nccl-bw/${operation}_${msg_size}_busbw | bandwidth (GB/s) | NCCL operation bus bandwidth with given message size. |
| rccl-bw/${operation}_${msg_size}_time  | time (us)        | RCCL operation latency with given message size.              |
| rccl-bw/${operation}_${msg_size}_algbw | bandwidth (GB/s) | RCCL operation algorithm bandwidth with given message size. |
| rccl-bw/${operation}_${msg_size}_busbw | bandwidth (GB/s) | RCCL operation bus bandwidth with given message size. |
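
Following the nccl-tests performance notes, the algorithm bandwidth (algbw) is the message size divided by the elapsed time, and the bus bandwidth (busbw) scales it by an operation-dependent factor so that numbers are comparable across operations and rank counts. A small sketch of that conversion, with the factors taken from those notes and shown here for illustration only:

```python
def nccl_bandwidths(operation, msg_size_bytes, time_us, num_ranks):
    """Sketch only: derive algbw and busbw the way nccl-tests reports them."""
    algbw = msg_size_bytes / (time_us * 1e-6) / 1e9  # GB/s
    factor = {
        'allreduce': 2 * (num_ranks - 1) / num_ranks,
        'allgather': (num_ranks - 1) / num_ranks,
        'reducescatter': (num_ranks - 1) / num_ranks,
        'alltoall': (num_ranks - 1) / num_ranks,
        'broadcast': 1.0,
        'reduce': 1.0,
    }[operation]
    return algbw, algbw * factor  # (algbw, busbw)

# Example: an 8-rank allreduce of 1 GiB finishing in 20 ms gives
# algbw ≈ 53.7 GB/s and busbw ≈ 94.0 GB/s.
```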
## Computation-communication Benchmarks
### `computation-communication-overlap`
#### Introduction
Test the single-node performance when communication and computation overlap.
#### Metrics
| Name | Unit | Description |
|-------------------------------------------------------|-----------|--------------------------------------------------------------|
| pytorch-computation-communication-overlap/mul_cost | time (ms) | Time of communication and mul kernel computation overlap. |
| pytorch-computation-communication-overlap/matmul_cost | time (ms) | Time of communication and matmul kernel computation overlap. |
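
A minimal sketch of the overlap pattern, assuming `torch.distributed` has already been initialized with the NCCL backend; the benchmark's actual kernels, sizes, and scheduling may differ.

```python
import torch
import torch.distributed as dist

def overlap_step(x, weight, comm_buf):
    """Sketch only: run a matmul while an asynchronous all-reduce is in flight."""
    handle = dist.all_reduce(comm_buf, async_op=True)  # communication starts
    y = torch.matmul(x, weight)                        # computation overlaps it
    handle.wait()                                      # wait for the communication
    torch.cuda.synchronize()
    return y
```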
### `sharding-matmul`
#### Introduction
Test the performance of large-scale matmul operations with multiple GPUs (a minimal sketch of both strategies follows the list):
* allreduce: each GPU computes part of the matrix multiplication, and AllReduce merges the partial results into one tensor.
* allgather: each GPU computes part of the matrix multiplication, and AllGather + Concat merges the partial results into one tensor.
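
A minimal sketch of the two strategies, assuming `torch.distributed` is initialized with the NCCL backend; the shapes and sharding axes here are illustrative rather than the benchmark's exact configuration.

```python
import torch
import torch.distributed as dist

def matmul_allreduce(x_cols, w_rows):
    # x is sharded along its columns and w along its rows (the inner dimension),
    # so each rank's product is a partial sum of the full output.
    partial = torch.matmul(x_cols, w_rows)
    dist.all_reduce(partial)  # sum the partial results in place
    return partial

def matmul_allgather(x_rows, w):
    # x is sharded along its rows, so each rank owns one row block of the output;
    # AllGather + Concat reassembles the full result.
    block = torch.matmul(x_rows, w)
    blocks = [torch.empty_like(block) for _ in range(dist.get_world_size())]
    dist.all_gather(blocks, block)
    return torch.cat(blocks, dim=0)
```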
#### Metrics
| Name | Unit | Description |
|-----------------------------------|-----------|------------------------------------------|
| pytorch-sharding-matmul/allreduce | time (ms) | Time of sharding matmul using allreduce. |
| pytorch-sharding-matmul/allgather | time (ms) | Time of sharding matmul using allgather. |
## Storage Benchmarks
### `disk-benchmark`
#### Introduction
Measure the disk performance through [FIO](https://github.com/axboe/fio/tree/0313e938c9c8bb37d71dade239f1f5326677b079).
#### Metrics
| Name | Unit | Description |
|--------------------------------------------------------------------|--------------|----------------------------------------------------------|
| disk-benchmark/${disk_name}_rand_read_write_bs                      | size (bytes) | Block size of the random read-write workload.                       |
| disk-benchmark/${disk_name}_rand_read_write_read_iops               | IOPS         | Read IOPS of the random read-write workload.                        |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_95.000000   | time (ns)    | 95th percentile read latency of the random read-write workload.     |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_99.000000   | time (ns)    | 99th percentile read latency of the random read-write workload.     |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_99.900000   | time (ns)    | 99.9th percentile read latency of the random read-write workload.   |
| disk-benchmark/${disk_name}_rand_read_write_write_iops              | IOPS         | Write IOPS of the random read-write workload.                       |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_95.000000  | time (ns)    | 95th percentile write latency of the random read-write workload.    |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_99.000000  | time (ns)    | 99th percentile write latency of the random read-write workload.    |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_99.900000  | time (ns)    | 99.9th percentile write latency of the random read-write workload.  |
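
The benchmark drives FIO as an external tool. The sketch below shows the kind of invocation the metric names imply (a time-based random read/write mix with JSON output) and how a couple of the numbers can be read back; the exact block size, queue depth, read/write mix, and duration used by the benchmark, as well as the scratch file path, are assumptions here.

```python
import json
import subprocess

def rand_read_write_sketch(path='/tmp/fio_test_file', runtime_s=60):
    """Sketch only: run fio's randrw workload and extract read/write IOPS.
    The path is a hypothetical scratch file, not the benchmark's target."""
    cmd = [
        'fio', '--name=rand_read_write', f'--filename={path}', '--size=4G',
        '--rw=randrw', '--rwmixread=80', '--bs=4k', '--ioengine=libaio',
        '--iodepth=64', '--direct=1', '--time_based', f'--runtime={runtime_s}',
        '--output-format=json',
    ]
    output = subprocess.run(cmd, capture_output=True, check=True, text=True).stdout
    job = json.loads(output)['jobs'][0]
    return {'read_iops': job['read']['iops'], 'write_iops': job['write']['iops']}
```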
# Model Benchmarks
## Benchmarking list
### GPT-2 models
### BERT models
### LSTM models
### CNN models
## Benchmarking metrics
<table>
<tbody>
<tr valign="top">
<td align="center" valign="middle">
<b>Metrics</b>
</td>
<td>
<ul><li><b>CNN models</b></li>
<ul>
<li><b>ResNet</b></li>
<ul><li>ResNet-50</li><li>ResNet-101</li><li>ResNet-152</li></ul>
</ul>
<ul>
<li><b>DenseNet</b></li>
<ul><li>DenseNet-169</li><li>DenseNet-201</li></ul>
</ul>
<ul>
<li><b>VGG</b></li>
<ul><li>VGG-11</li><li>VGG-13</li><li>VGG-16</li><li>VGG-19</li></ul>
</ul>
<ul><li><b>Other CNN models</b></li><ul><li>...</li></ul></ul>
</ul>
<ul><li><b>BERT models</b></li>
<ul><li><b>BERT-Base</b></li><li><b>BERT-Large</b></li></ul>
</ul>
<ul><li><b>LSTM</b></li></ul>
<ul><li><b>GPT-2</b></li></ul>
</td>
</tr>
</tbody>
</table>
## PyTorch Model Benchmarks
### `gpt_models`
#### Introduction
Run training or inference tasks with single or half precision for GPT models,
including gpt2-small, gpt2-medium, gpt2-large and gpt2-xl.
#### Metrics
| Name | Unit | Description |
|---------------------------------------------------------------|------------------------|---------------------------------------------|
| gpt_models/pytorch-${model_name}/steptime_train_float32 | time (ms) | Train step time with single precision. |
| gpt_models/pytorch-${model_name}/throughput_train_float32 | throughput (samples/s) | Train throughput with single precision. |
| gpt_models/pytorch-${model_name}/steptime_inference_float32 | time (ms) | Inference step time with single precision. |
| gpt_models/pytorch-${model_name}/throughput_inference_float32 | throughput (samples/s) | Inference throughput with single precision. |
| gpt_models/pytorch-${model_name}/steptime_train_float16 | time (ms) | Train step time with half precision. |
| gpt_models/pytorch-${model_name}/throughput_train_float16 | throughput (samples/s) | Train throughput with half precision. |
| gpt_models/pytorch-${model_name}/steptime_inference_float16 | time (ms) | Inference step time with half precision. |
| gpt_models/pytorch-${model_name}/throughput_inference_float16 | throughput (samples/s) | Inference throughput with half precision. |
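
The step-time and throughput metrics are two views of the same measurement: throughput is the batch size divided by the step time. A small illustrative conversion (the batch size each model benchmark uses comes from its own configuration):

```python
def throughput_from_steptime(step_time_ms, batch_size):
    """Samples per second implied by a measured step time."""
    return batch_size * 1000.0 / step_time_ms

# Example: a 250 ms training step at batch size 32 gives 128 samples/s.
```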
### `bert_models`
#### Introduction
Run training or inference tasks with single or half precision for BERT models, including bert-base and bert-large.
#### Metrics
| Name | Unit | Description |
|----------------------------------------------------------------|------------------------|---------------------------------------------|
| bert_models/pytorch-${model_name}/steptime_train_float32 | time (ms) | Train step time with single precision. |
| bert_models/pytorch-${model_name}/throughput_train_float32 | throughput (samples/s) | Train throughput with single precision. |
| bert_models/pytorch-${model_name}/steptime_inference_float32 | time (ms) | Inference step time with single precision. |
| bert_models/pytorch-${model_name}/throughput_inference_float32 | throughput (samples/s) | Inference throughput with single precision. |
| bert_models/pytorch-${model_name}/steptime_train_float16 | time (ms) | Train step time with half precision. |
| bert_models/pytorch-${model_name}/throughput_train_float16 | throughput (samples/s) | Train throughput with half precision. |
| bert_models/pytorch-${model_name}/steptime_inference_float16 | time (ms) | Inference step time with half precision. |
| bert_models/pytorch-${model_name}/throughput_inference_float16 | throughput (samples/s) | Inference throughput with half precision. |
### `lstm_models`
#### Introduction
Run training or inference tasks with single or half precision for one bidirectional LSTM model.
#### Metrics
| Name | Unit | Description |
|-------------------------------------------------------|------------------------|---------------------------------------------|
| lstm_models/pytorch-lstm/steptime_train_float32 | time (ms) | Train step time with single precision. |
| lstm_models/pytorch-lstm/throughput_train_float32 | throughput (samples/s) | Train throughput with single precision. |
| lstm_models/pytorch-lstm/steptime_inference_float32 | time (ms) | Inference step time with single precision. |
| lstm_models/pytorch-lstm/throughput_inference_float32 | throughput (samples/s) | Inference throughput with single precision. |
| lstm_models/pytorch-lstm/steptime_train_float16 | time (ms) | Train step time with half precision. |
| lstm_models/pytorch-lstm/throughput_train_float16 | throughput (samples/s) | Train throughput with half precision. |
| lstm_models/pytorch-lstm/steptime_inference_float16 | time (ms) | Inference step time with half precision. |
| lstm_models/pytorch-lstm/throughput_inference_float16 | throughput (samples/s) | Inference throughput with half precision. |
### `cnn_models`
#### Introduction
Run training or inference tasks with single or half precision for CNN models listed in
[`torchvision.models`](https://pytorch.org/vision/0.8/models.html), including:
* resnet: resnet18, resnet34, resnet50, resnet101, resnet152
* resnext: resnext50_32x4d, resnext101_32x8d
* wide_resnet: wide_resnet50_2, wide_resnet101_2
* densenet: densenet121, densenet169, densenet201, densenet161
* vgg: vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19
* mnasnet: mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3
* mobilenet: mobilenet_v2
* shufflenet: shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0
* squeezenet: squeezenet1_0, squeezenet1_1
* others: alexnet, googlenet, inception_v3
#### Metrics
| Name | Unit | Description |
|---------------------------------------------------------------|------------------------|---------------------------------------------|
| cnn_models/pytorch-${model_name}/steptime_train_float32 | time (ms) | Train step time with single precision. |
| cnn_models/pytorch-${model_name}/throughput_train_float32 | throughput (samples/s) | Train throughput with single precision. |
| cnn_models/pytorch-${model_name}/steptime_inference_float32 | time (ms) | Inference step time with single precision. |
| cnn_models/pytorch-${model_name}/throughput_inference_float32 | throughput (samples/s) | Inference throughput with single precision. |
| cnn_models/pytorch-${model_name}/steptime_train_float16 | time (ms) | Train step time with half precision. |
| cnn_models/pytorch-${model_name}/throughput_train_float16 | throughput (samples/s) | Train throughput with half precision. |
| cnn_models/pytorch-${model_name}/steptime_inference_float16 | time (ms) | Inference step time with half precision. |
| cnn_models/pytorch-${model_name}/throughput_inference_float16 | throughput (samples/s) | Inference throughput with half precision. |
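
For illustration, a hedged sketch of a float32 training-step timing loop for one of the `torchvision` models listed above; the benchmark's real harness, batch size, warm-up policy, and half-precision path may differ.

```python
import time
import torch
import torchvision.models as models

def train_step_benchmark_sketch(model_name='resnet50', batch_size=32, num_steps=50):
    """Sketch only: average training step time (ms) and implied throughput (samples/s)."""
    model = getattr(models, model_name)().cuda().train()
    data = torch.randn(batch_size, 3, 224, 224, device='cuda')
    target = torch.randint(0, 1000, (batch_size,), device='cuda')
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    step_times = []
    for _ in range(num_steps):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        optimizer.zero_grad()
        loss = loss_fn(model(data), target)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()
        step_times.append((time.perf_counter() - t0) * 1000)  # ms per step

    step_time_ms = sum(step_times) / len(step_times)
    return step_time_ms, batch_size * 1000.0 / step_time_ms  # (steptime, throughput)
```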