Release SuperBench v0.3.0 SuperBench v0.3.0 Release Notes =============================== SuperBench Framework -------------------- __Runner__ - Implement MPI mode. __Benchmarks__ - Support Docker benchmark. Single-node Validation ---------------------- __Micro Benchmarks__ 1. Memory (Tool: NVIDIA/AMD Bandwidth Test Tool) | Metrics | Unit | Description | |----------------|------|-------------------------------------| | H2D_Mem_BW_GPU | GB/s | host-to-GPU bandwidth for each GPU | | D2H_Mem_BW_GPU | GB/s | GPU-to-host bandwidth for each GPU | 2. IBLoopback (Tool: PerfTest – Standard RDMA Test Tool) | Metrics | Unit | Description | |----------|------|---------------------------------------------------------------| | IB_Write | MB/s | The IB write loopback throughput with different message sizes | | IB_Read | MB/s | The IB read loopback throughput with different message sizes | | IB_Send | MB/s | The IB send loopback throughput with different message sizes | 3. NCCL/RCCL (Tool: NCCL/RCCL Tests) | Metrics | Unit | Description | |---------------------|------|-----------------------------------------------------------------| | NCCL_AllReduce | GB/s | The NCCL AllReduce performance with different message sizes | | NCCL_AllGather | GB/s | The NCCL AllGather performance with different message sizes | | NCCL_broadcast | GB/s | The NCCL Broadcast performance with different message sizes | | NCCL_reduce | GB/s | The NCCL Reduce performance with different message sizes | | NCCL_reduce_scatter | GB/s | The NCCL ReduceScatter performance with different message sizes | 4. Disk (Tool: FIO – Standard Disk Performance Tool) | Metrics | Unit | Description | |----------------|------|---------------------------------------------------------------------------------| | Seq_Read | MB/s | Sequential read performance | | Seq_Write | MB/s | Sequential write performance | | Rand_Read | MB/s | Random read performance | | Rand_Write | MB/s | Random write performance | | Seq_R/W_Read | MB/s | Read performance in sequential read/write, fixed measurement (read:write = 4:1) | | Seq_R/W_Write | MB/s | Write performance in sequential read/write (read:write = 4:1) | | Rand_R/W_Read | MB/s | Read performance in random read/write (read:write = 4:1) | | Rand_R/W_Write | MB/s | Write performance in random read/write (read:write = 4:1) | 5. H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build) | Metrics | Unit | Description | |---------------|------|-----------------------------------------------------| | H2D_SM_BW_GPU | GB/s | host-to-GPU bandwidth using GPU kernel for each GPU | | D2H_SM_BW_GPU | GB/s | GPU-to-host bandwidth using GPU kernel for each GPU | AMD GPU Support --------------- __Docker Image Support__ - ROCm 4.2 PyTorch 1.7.0 - ROCm 4.0 PyTorch 1.7.0 __Micro Benchmarks__ 1. Kernel Launch (Tool: MSR-A build) | Metrics | Unit | Description | |--------------------------|-----------|--------------------------------------------------------------| | Kernel_Launch_Event_Time | Time (ms) | Dispatch latency measured in GPU time using hipEventRecord() | | Kernel_Launch_Wall_Time | Time (ms) | Dispatch latency measured in CPU time | 2. GEMM FLOPS (Tool: AMD rocblas-bench Tool) | Metrics | Unit | Description | |----------|--------|-------------------------------| | FP64 | GFLOPS | FP64 FLOPS without MatrixCore | | FP32(MC) | GFLOPS | TF32 FLOPS with MatrixCore | | FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore | | BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore | | INT8(MC) | GOPS | INT8 FLOPS with MatrixCore | __E2E Benchmarks__ 1. CNN models -- Use PyTorch torchvision models - ResNet: ResNet-50, ResNet-101, ResNet-152 - DenseNet: DenseNet-169, DenseNet-201 - VGG: VGG-11, VGG-13, VGG-16, VGG-19 2. BERT -- Use huggingface Transformers - BERT - BERT Large 3. LSTM -- Use PyTorch 4. GPT-2 -- Use huggingface Transformers Bug Fix ------- - VGG models failed on A100 GPU with batch_size=128 Other Improvement ----------------- 1. Contribution related - Contribute rule - System information collection 2. Document - Add release process doc - Add design documents - Add developer guide doc for coding style - Add contribution rules - Add docker image list - Add initial validation results
This tag has no release notes.