---
slug: release-sb-v0.3
title: Releasing SuperBench v0.3
author: Peng Cheng
author_title: SuperBench Team
author_url: https://github.com/cp5555
author_image_url: https://github.com/cp5555.png
tags: [superbench, announcement, release]
---

We are very happy to announce that **SuperBench 0.3.0** is officially released today!

You can install and try SuperBench by following the [Getting Started Tutorial](https://microsoft.github.io/superbenchmark/docs/getting-started/installation).

## SuperBench 0.3.0 Release Notes

### SuperBench Framework

#### Runner

- Implement MPI mode.

#### Benchmarks

- Support Docker benchmark.

### Single-node Validation

#### Micro Benchmarks

1. Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)

   | Metrics        | Unit | Description                         |
   |----------------|------|-------------------------------------|
   | H2D_Mem_BW_GPU | GB/s | host-to-GPU bandwidth for each GPU  |
   | D2H_Mem_BW_GPU | GB/s | GPU-to-host bandwidth for each GPU  |

2. IBLoopback (Tool: PerfTest – Standard RDMA Test Tool)

   | Metrics  | Unit | Description                                                   |
   |----------|------|---------------------------------------------------------------|
   | IB_Write | MB/s | The IB write loopback throughput with different message sizes |
   | IB_Read  | MB/s | The IB read loopback throughput with different message sizes  |
   | IB_Send  | MB/s | The IB send loopback throughput with different message sizes  |

3. NCCL/RCCL (Tool: NCCL/RCCL Tests)

   | Metrics             | Unit | Description                                                     |
   |---------------------|------|-----------------------------------------------------------------|
   | NCCL_AllReduce      | GB/s | The NCCL AllReduce performance with different message sizes     |
   | NCCL_AllGather      | GB/s | The NCCL AllGather performance with different message sizes     |
   | NCCL_broadcast      | GB/s | The NCCL Broadcast performance with different message sizes     |
   | NCCL_reduce         | GB/s | The NCCL Reduce performance with different message sizes        |
   | NCCL_reduce_scatter | GB/s | The NCCL ReduceScatter performance with different message sizes |

4. Disk (Tool: FIO – Standard Disk Performance Tool)

   | Metrics        | Unit | Description                                                                     |
   |----------------|------|---------------------------------------------------------------------------------|
   | Seq_Read       | MB/s | Sequential read performance                                                     |
   | Seq_Write      | MB/s | Sequential write performance                                                    |
   | Rand_Read      | MB/s | Random read performance                                                         |
   | Rand_Write     | MB/s | Random write performance                                                        |
   | Seq_R/W_Read   | MB/s | Read performance in sequential read/write, fixed measurement (read:write = 4:1) |
   | Seq_R/W_Write  | MB/s | Write performance in sequential read/write (read:write = 4:1)                   |
   | Rand_R/W_Read  | MB/s | Read performance in random read/write (read:write = 4:1)                        |
   | Rand_R/W_Write | MB/s | Write performance in random read/write (read:write = 4:1)                       |

5. H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)

   | Metrics       | Unit | Description                                         |
   |---------------|------|-----------------------------------------------------|
   | H2D_SM_BW_GPU | GB/s | host-to-GPU bandwidth using GPU kernel for each GPU |
   | D2H_SM_BW_GPU | GB/s | GPU-to-host bandwidth using GPU kernel for each GPU |
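
The NCCL/RCCL metrics above are reported in GB/s. As a rough illustration of how such a figure can be derived, the sketch below follows the algorithm-bandwidth and bus-bandwidth convention described in the NCCL Tests performance notes. Whether SuperBench reports algorithm or bus bandwidth is not stated in these notes, and the function names here are hypothetical.

```python
# Sketch only: bandwidth convention from the NCCL Tests performance notes.
# Function names are illustrative and not part of SuperBench.

def bus_bandwidth_factor(collective: str, n_ranks: int) -> float:
    """Return the factor that converts algorithm bandwidth to bus bandwidth."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    if collective == "allreduce":
        return 2 * (n_ranks - 1) / n_ranks
    if collective in ("allgather", "reducescatter"):
        return (n_ranks - 1) / n_ranks
    if collective in ("broadcast", "reduce"):
        return 1.0
    raise ValueError(f"unknown collective: {collective}")


def algorithm_bandwidth_gbs(size_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth: message size divided by elapsed time, in GB/s."""
    return size_bytes / seconds / 1e9


def bus_bandwidth_gbs(collective: str, size_bytes: float,
                      seconds: float, n_ranks: int) -> float:
    """Bus bandwidth: algorithm bandwidth scaled by the collective's factor."""
    return (algorithm_bandwidth_gbs(size_bytes, seconds)
            * bus_bandwidth_factor(collective, n_ranks))
```

With 8 ranks, the AllReduce factor is 1.75, so an operation whose algorithm bandwidth is 100 GB/s corresponds to a bus bandwidth of 175 GB/s.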

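The mixed read/write disk metrics above use a read:write ratio of 4:1, which corresponds to 80% reads. The sketch below shows one way to express that ratio with standard FIO options (`--rw`, `--rwmixread`). It is an assumption about how such a workload could be set up, not the command SuperBench actually runs; the job name, file path, and helper names are hypothetical.

```python
# Sketch only: maps a read:write ratio to FIO's --rwmixread percentage.
# Helper names, the job name, and the file path are illustrative.
from typing import List


def read_percentage(read_parts: int, write_parts: int) -> int:
    """Convert a read:write ratio such as 4:1 into a read percentage."""
    return round(100 * read_parts / (read_parts + write_parts))


def fio_mixed_args(random: bool, filename: str,
                   read_parts: int = 4, write_parts: int = 1) -> List[str]:
    """Build an FIO argument list for a mixed read/write job."""
    pattern = "randrw" if random else "rw"  # random vs sequential mixed I/O
    return [
        "fio",
        "--name=mixed_rw",
        f"--filename={filename}",
        f"--rw={pattern}",
        f"--rwmixread={read_percentage(read_parts, write_parts)}",
        "--direct=1",  # bypass the page cache
    ]


if __name__ == "__main__":
    print(" ".join(fio_mixed_args(random=True, filename="/tmp/fio_test_file")))
```

FIO reports read and write throughput separately for a mixed job, which matches the separate `_Read` and `_Write` metrics in the table.
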
### AMD GPU Support

#### Docker Image Support

- ROCm 4.2 PyTorch 1.7.0
- ROCm 4.0 PyTorch 1.7.0

#### Micro Benchmarks

1. Kernel Launch (Tool: MSR-A build)

   | Metrics                  | Unit      | Description                                                  |
   |--------------------------|-----------|--------------------------------------------------------------|
   | Kernel_Launch_Event_Time | Time (ms) | Dispatch latency measured in GPU time using hipEventRecord() |
   | Kernel_Launch_Wall_Time  | Time (ms) | Dispatch latency measured in CPU time                        |

2. GEMM FLOPS (Tool: AMD rocblas-bench Tool)

   | Metrics  | Unit   | Description                   |
   |----------|--------|-------------------------------|
   | FP64     | GFLOPS | FP64 FLOPS without MatrixCore |
   | FP32(MC) | GFLOPS | FP32 FLOPS with MatrixCore    |
   | FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore    |
   | BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore    |
   | INT8(MC) | GOPS   | INT8 OPS with MatrixCore      |

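For reference, GEMM throughput figures like those above are conventionally derived from the matrix dimensions and the measured time, counting one multiply and one add per inner-product term. The sketch below uses that standard `2 * m * n * k` convention; it does not reproduce rocblas-bench internals, and the function names are hypothetical.

```python
# Sketch only: the usual operation-count convention for C = A * B,
# where A is m x k and B is k x n. Names are illustrative.

def gemm_op_count(m: int, n: int, k: int) -> int:
    """Each of the m * n outputs needs k multiplies and k adds."""
    return 2 * m * n * k


def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in GFLOPS (or GOPS for integer types such as INT8)."""
    return gemm_op_count(m, n, k) / seconds / 1e9
```

For example, a 1000 x 1000 x 1000 GEMM that completes in 1 ms corresponds to about 2000 GFLOPS.
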
#### E2E Benchmarks

1. CNN models -- Use PyTorch torchvision models
   - ResNet: ResNet-50, ResNet-101, ResNet-152
   - DenseNet: DenseNet-169, DenseNet-201
   - VGG: VGG-11, VGG-13, VGG-16, VGG-19

2. BERT -- Use Hugging Face Transformers
   - BERT
   - BERT Large

3. LSTM -- Use PyTorch
4. GPT-2 -- Use Hugging Face Transformers

### Bug Fix

- Fix VGG models failing on A100 GPU with batch_size=128

### Other Improvements

1. Contribution related
   - Contribution rules
   - System information collection

2. Document
   - Add release process doc
   - Add design documents
   - Add developer guide doc for coding style
   - Add contribution rules
   - Add docker image list
   - Add initial validation results