Release SuperBench v0.3.0

SuperBench v0.3.0 Release Notes
===============================

SuperBench Framework
--------------------

__Runner__

- Implement MPI mode.

__Benchmarks__

- Support Docker benchmark.

Single-node Validation
----------------------

__Micro Benchmarks__

1. Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)

   | Metrics        | Unit | Description                         |
   |----------------|------|-------------------------------------|
   | H2D_Mem_BW_GPU | GB/s | host-to-GPU bandwidth for each GPU  |
   | D2H_Mem_BW_GPU | GB/s | GPU-to-host bandwidth for each GPU  |

2. IBLoopback (Tool: PerfTest – Standard RDMA Test Tool)

   | Metrics  | Unit | Description                                                   |
   |----------|------|---------------------------------------------------------------|
   | IB_Write | MB/s | The IB write loopback throughput with different message sizes |
   | IB_Read  | MB/s | The IB read loopback throughput with different message sizes  |
   | IB_Send  | MB/s | The IB send loopback throughput with different message sizes  |

3. NCCL/RCCL (Tool: NCCL/RCCL Tests)

   | Metrics             | Unit | Description                                                     |
   |---------------------|------|-----------------------------------------------------------------|
   | NCCL_AllReduce      | GB/s | The NCCL AllReduce performance with different message sizes     |
   | NCCL_AllGather      | GB/s | The NCCL AllGather performance with different message sizes     |
   | NCCL_broadcast      | GB/s | The NCCL Broadcast performance with different message sizes     |
   | NCCL_reduce         | GB/s | The NCCL Reduce performance with different message sizes        |
   | NCCL_reduce_scatter | GB/s | The NCCL ReduceScatter performance with different message sizes |

4. Disk (Tool: FIO – Standard Disk Performance Tool)

   | Metrics        | Unit | Description                                                                     |
   |----------------|------|---------------------------------------------------------------------------------|
   | Seq_Read       | MB/s | Sequential read performance                                                     |
   | Seq_Write      | MB/s | Sequential write performance                                                    |
   | Rand_Read      | MB/s | Random read performance                                                         |
   | Rand_Write     | MB/s | Random write performance                                                        |
   | Seq_R/W_Read   | MB/s | Read performance in sequential read/write at a fixed ratio (read:write = 4:1)   |
   | Seq_R/W_Write  | MB/s | Write performance in sequential read/write (read:write = 4:1)                   |
   | Rand_R/W_Read  | MB/s | Read performance in random read/write (read:write = 4:1)                        |
   | Rand_R/W_Write | MB/s | Write performance in random read/write (read:write = 4:1)                       |

5. H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)

   | Metrics       | Unit | Description                                         |
   |---------------|------|-----------------------------------------------------|
   | H2D_SM_BW_GPU | GB/s | host-to-GPU bandwidth using GPU kernel for each GPU |
   | D2H_SM_BW_GPU | GB/s | GPU-to-host bandwidth using GPU kernel for each GPU |
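
The mixed-workload Disk metrics above use a 4:1 read-to-write ratio, which corresponds to 80% reads. As a minimal sketch, assuming the mixed runs are driven through FIO's `rw`/`randrw` modes with the `rwmixread` option (these notes do not list the FIO options SuperBench actually passes), the ratio could be turned into arguments as follows. The helper names `rwmix_read_percent` and `fio_mixed_args` are hypothetical.

```python
from typing import List


def rwmix_read_percent(read_parts: int, write_parts: int) -> int:
    """Convert a read:write ratio into the read share (in percent) of a mixed workload."""
    if read_parts < 0 or write_parts < 0 or read_parts + write_parts == 0:
        raise ValueError("ratio parts must be non-negative and not both zero")
    return round(100 * read_parts / (read_parts + write_parts))


def fio_mixed_args(sequential: bool, read_parts: int = 4, write_parts: int = 1) -> List[str]:
    """Build FIO arguments for a mixed read/write job (hypothetical helper).

    Assumption: `--rw=rw` selects mixed sequential I/O, `--rw=randrw` selects
    mixed random I/O, and `--rwmixread` sets the percentage of reads.
    """
    mode = "rw" if sequential else "randrw"
    percent = rwmix_read_percent(read_parts, write_parts)
    return [f"--rw={mode}", f"--rwmixread={percent}"]


if __name__ == "__main__":
    # Arguments for the sequential and random 4:1 mixed workloads.
    print(fio_mixed_args(sequential=True))
    print(fio_mixed_args(sequential=False))
```

With the default 4:1 ratio, both calls yield `--rwmixread=80`; only the `--rw` mode differs between the sequential and random cases.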

AMD GPU Support
---------------

__Docker Image Support__

- ROCm 4.2 PyTorch 1.7.0
- ROCm 4.0 PyTorch 1.7.0

__Micro Benchmarks__

1. Kernel Launch (Tool: MSR-A build)

   | Metrics                  | Unit      | Description                                                  |
   |--------------------------|-----------|--------------------------------------------------------------|
   | Kernel_Launch_Event_Time | Time (ms) | Dispatch latency measured in GPU time using hipEventRecord() |
   | Kernel_Launch_Wall_Time  | Time (ms) | Dispatch latency measured in CPU time                        |

2. GEMM FLOPS (Tool: AMD rocblas-bench Tool)

   | Metrics  | Unit   | Description                   |
   |----------|--------|-------------------------------|
   | FP64     | GFLOPS | FP64 FLOPS without MatrixCore |
   | FP32(MC) | GFLOPS | FP32 FLOPS with MatrixCore    |
   | FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore    |
   | BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore    |
   | INT8(MC) | GOPS   | INT8 OPS with MatrixCore      |
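
GEMM throughput figures such as those above are conventionally derived from the matrix dimensions and the measured run time, counting 2·m·n·k operations for an m×k by k×n multiply (one multiply and one add per inner-product term). A minimal sketch of that arithmetic follows. `gemm_gflops` is a hypothetical helper, not part of SuperBench or rocblas-bench, and these notes do not state how the reported values are computed.

```python
def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Return throughput in GFLOPS (or GOPS for integer types) for one GEMM.

    Assumes the conventional operation count of 2 * m * n * k for
    C = A (m x k) * B (k x n), ignoring any alpha/beta scaling work.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    operations = 2.0 * m * n * k
    return operations / seconds / 1e9


if __name__ == "__main__":
    # Example: a 1024 x 1024 x 1024 GEMM that took 1 ms.
    print(f"{gemm_gflops(1024, 1024, 1024, 1e-3):.1f} GFLOPS")
```

The same formula applies to every precision in the table; only the data type and the measured time change.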

__E2E Benchmarks__

1. CNN models -- Use PyTorch torchvision models
   - ResNet: ResNet-50, ResNet-101, ResNet-152
   - DenseNet: DenseNet-169, DenseNet-201
   - VGG: VGG-11, VGG-13, VGG-16, VGG-19

2. BERT -- Use Hugging Face Transformers
   - BERT
   - BERT Large

3. LSTM -- Use PyTorch
4. GPT-2 -- Use Hugging Face Transformers

Bug Fix
-------

- Fix VGG models failing on A100 GPU with batch_size=128

Other Improvements
------------------

1. Contribution related
   - Contribution rules
   - System information collection

2. Document
   - Add release process doc
   - Add design documents
   - Add developer guide doc for coding style
   - Add contribution rules
   - Add docker image list
   - Add initial validation results