-
SuperBench 0.12.0 Release Notes
===============================

SuperBench Improvements
-----------------------

- Optimize cutlass build process for faster builds and smaller binaries.
- Improve image build pipeline.
- Add support for arm64 builds.
- Upgrade pipeline dependencies.
- Fix SuperBench installation and code lint issues.
- Update Flake8 repository.
- Add support for the latest Python versions.
- Enhance error handling for `pkg_resources` imports.
- Update ROCm image build labels.
- Add CUDA 12.8 and CUDA 12.9 support.
- Consolidate multi-architecture Docker images.
- Upgrade runner OS to latest version.
- Fix typos in documentation and code.

Micro-benchmark Improvements
----------------------------

- Add general CPU bandwidth and latency benchmarks.
- Add nvbandwidth build process and benchmarks.
- Add architecture support for 10.0 in gemm-flops.
- Add GPU Stream micro-benchmark.
- Add FP4 GEMM FLOPS support in `cublaslt_gemm` benchmark.
- Add Grace CPU support for CPU Stream benchmark.
- Revise CPU Stream benchmark.
- Fix NUMA error on Grace CPU in gpu-copy benchmark.
- Bump onnxruntime-gpu dependency from 1.10.0 to 1.12.0.
- Fix stderr message in gpu-copy benchmark.
- Fix TensorRT inference parsing.
- Handle N/A values in nvbandwidth benchmark.
- Avoid unintended nvbandwidth function calls in all benchmarks.
- Support CUDA arch flag and autotuning in `cublaslt` GEMM.

Model-benchmark Improvements
----------------------------

- Add LLaMA-2 model benchmarks.
- Add Mixture of Experts model benchmarks.
- Add DeepSeek inference benchmark (AMD GPU).

Result Analysis
---------------

- Enhance logging for diagnosis rule baseline errors.

Documentation Updates
---------------------

- Update CODEOWNERS file.
-
SuperBench 0.10.0 Release Notes
===============================

SuperBench Improvements
-----------------------

- Support monitoring for AMD GPUs.
- Support ROCm 5.7 and ROCm 6.0 dockerfile.
- Add MSCCL support for NVIDIA GPUs.
- Fix NUMA domains swap issue in NDv4 topology file.
- Add NDv5 topology file.
- Fix NCCL and NCCL-test to 2.18.3 for hang issue in CUDA 12.2.

Micro-benchmark Improvements
----------------------------

- Add HPL random generator to gemm-flops with ROCm.
- Add DirectXGPURenderFPS benchmark to measure the FPS of rendering simple frames.
- Add HWDecoderFPS benchmark to measure the FPS of hardware decoder performance.
- Update Docker image for H100 support.
- Update MLC version to 3.10 for CUDA/ROCm dockerfile.
- Fix bug in GPU Burn test.
- Support INT8 in cublaslt function.
- Add hipBLASLt function benchmark.
- Support cpu-gpu and gpu-cpu in ib-validation.
- Support graph mode in NCCL/RCCL benchmarks for latency metrics.
- Support cpp implementation in distributed inference benchmark.
- Add O2 option for gpu copy ROCm build.
- Support different hipblasLt data types in dist inference.
- Support in-place in NCCL/RCCL benchmark.
- Support data type option in NCCL/RCCL benchmark.
- Improve P2P performance with fine-grained GPU memory in GPU-copy test for AMD GPUs.
- Update hipblaslt GEMM metric unit to tflops.
- Support FP8 for hipblaslt benchmark.

Model Benchmark Improvements
----------------------------

- Change torch.distributed.launch to torchrun.
- Support Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark.

Result Analysis
---------------

- Support baseline generation from multiple nodes.
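
One practical consequence of moving from torch.distributed.launch to torchrun is that the local rank is delivered through the `LOCAL_RANK` environment variable rather than a `--local_rank` command-line argument. The sketch below shows one way a training script might accept both; the helper name `get_local_rank` is hypothetical and is not taken from SuperBench.

```python
import argparse
import os


def get_local_rank(argv=None, environ=None):
    """Return the local rank from LOCAL_RANK, else from --local_rank, else 0.

    Hypothetical helper for illustration only.
    """
    env = os.environ if environ is None else environ
    # torchrun exports LOCAL_RANK for every worker process.
    if "LOCAL_RANK" in env:
        return int(env["LOCAL_RANK"])
    # The legacy launcher passed the rank as a command-line argument instead.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", "--local-rank", type=int, default=0)
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank
```

Passing `argv` and `environ` explicitly keeps the helper easy to test; in a real script both can be left as `None` to read `sys.argv` and `os.environ`.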
-
SuperBench v0.8.0 Release Notes
===============================

SuperBench Improvements
-----------------------

- Support SuperBench Executor running on Windows.
- Remove fixed rccl version in rocm5.1.x docker file.
- Upgrade networkx version to fix installation compatibility issue.
- Pin setuptools version to v65.7.0.
- Limit ansible_runner version for Python 3.6.
- Support cgroup V2 when reading system metrics in monitor.
- Fix analyzer bug in Python 3.8 due to pandas API change.
- Collect real-time GPU power in monitor.
- Remove unreachable condition when writing host list in mpi mode.
- Upgrade Docker image with cuda12.1, nccl 2.17.1-1, hpcx v2.14, and mlc 3.10.
- Fix wrong unit of cpu-memory-bw-latency in documentation.

Micro-benchmark Improvements
----------------------------

- Add STREAM benchmark for sustainable memory bandwidth and the corresponding computation rate.
- Add HPL Benchmark for HPC Linpack Benchmark.
- Support flexible warmup and non-random data initialization in cublas-benchmark.
- Support error tolerance in micro-benchmark for CuDNN function.
- Add distributed inference benchmark.
- Support tensor core precisions (e.g., FP8) and batch/shape range in cublaslt gemm.

Model Benchmark Improvements
----------------------------

- Fix torch.dist init issue with multiple models.
- Support TE FP8 in BERT/GPT2 model.
- Add num_workers configurations in model benchmark.
-
SuperBench v0.7.0 Release Notes
===============================

SuperBench Improvements
-----------------------

- Support non-zero return code when "sb deploy" or "sb run" fails in Ansible.
- Support log flushing to the result file during runtime.
- Update version to include revision hash and date.
- Support "pattern" in mpi mode to run tasks in parallel.
- Support topo-aware, all-pair, and K-batch pattern in mpi mode.
- Fix Transformers version to avoid TensorRT failure.
- Add CUDA11.8 Docker image for NVIDIA arch90 GPUs.
- Support "sb deploy" without pulling image.

Micro-benchmark Improvements
----------------------------

- Support list of custom config string in cudnn-functions and cublas-functions.
- Support correctness check in cublas-functions.
- Support GEMM-FLOPS for NVIDIA arch90 GPUs.
- Support cuBLASLt FP16 and FP8 GEMM.
- Add wait time option to resolve mem-bw unstable issue.
- Fix bug for incorrect datatype judgement in cublas-function source code.

Model Benchmark Improvements
----------------------------

- Support FP8 in BERT model training.

Distributed Benchmark Improvements
----------------------------------

- Support pair-wise pattern in IB validation benchmark.
- Support topo-aware, pair-wise, and K-batch pattern in nccl-bw benchmark.
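
To picture what an all-pair (pair-wise) pattern involves, the sketch below builds a round-robin schedule in which every pair of nodes is tested exactly once and no node appears twice in the same round, so the pairs within a round can run in parallel. This is a generic circle-method illustration under our own assumptions; the function name `all_pair_rounds` is hypothetical, and these notes do not state that SuperBench generates its rounds this way.

```python
from typing import List, Tuple


def all_pair_rounds(num_nodes: int) -> List[List[Tuple[int, int]]]:
    """Return rounds of disjoint node pairs covering every pair exactly once."""
    nodes = list(range(num_nodes))
    if num_nodes % 2 == 1:
        nodes.append(-1)  # placeholder slot: its partner idles for that round
    n = len(nodes)
    rounds = []
    for _ in range(n - 1):
        pairs = []
        for i in range(n // 2):
            a, b = nodes[i], nodes[n - 1 - i]
            if a != -1 and b != -1:
                pairs.append((min(a, b), max(a, b)))
        rounds.append(pairs)
        # Circle method: keep the first node fixed and rotate the others.
        nodes = [nodes[0]] + [nodes[-1]] + nodes[1:-1]
    return rounds


if __name__ == "__main__":
    for index, pairs in enumerate(all_pair_rounds(4)):
        print(f"round {index}: {pairs}")
```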
-
SuperBench v0.6.0 Release Notes
===============================

SuperBench Improvement
----------------------

- Support running on host directly without Docker.
- Support running `sb` command inside docker image.
- Support ROCm 5.1.1.
- Support ROCm 5.1.3.
- Fix bugs in data diagnosis.
- Fix cmake and build issues.
- Support automatic configuration yaml selection on Azure VM.
- Refine error message when GPU is not detected.
- Add return code for Timeout.
- Update Dockerfile for NCCL/RCCL version, tag name, and verbose output.
- Support node_num=1 in mpi mode.
- Update Python setup for required packages.
- Enhance parameter parsing to allow spaces in value.
- Support NO_COLOR for SuperBench output.

Micro-benchmark Improvements
----------------------------

- Fix issues in ib loopback benchmark.
- Fix stability issue in ib loopback benchmark.

Distributed Benchmark Improvements
----------------------------------

- Enhance pair-wise IB benchmark.
- Fix bug in IB benchmark.
- Support topology-aware IB benchmark.

Data Diagnosis and Analysis
---------------------------

- Add failure check function in data_diagnosis.py.
- Support JSON and JSONL in Diagnosis.
- Add support to store values of metrics in data diagnosis.
- Support exit code of sb result diagnosis.
- Format int type and unify empty value to N/A in diagnosis output files.
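
For readers unfamiliar with the NO_COLOR convention mentioned above: a program suppresses ANSI color codes whenever the `NO_COLOR` environment variable is set to a non-empty value. The sketch below illustrates that rule only; the helper names `use_color` and `colorize` are hypothetical and do not describe how SuperBench implements it.

```python
import os


def use_color(environ=None) -> bool:
    """Return False when NO_COLOR is set to any non-empty value."""
    env = os.environ if environ is None else environ
    return not env.get("NO_COLOR")


def colorize(text: str, code: str = "32", environ=None) -> str:
    """Wrap text in an ANSI color sequence unless NO_COLOR disables it."""
    if use_color(environ):
        return f"\033[{code}m{text}\033[0m"
    return text


if __name__ == "__main__":
    print(colorize("benchmark finished"))
```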
-
Pre-release v0.6.0-rc1.
-
SuperBench v0.5.0 Release Notes
===============================

Micro-benchmark Improvements
----------------------------

- Support NIC only NCCL bandwidth benchmark on single node in NCCL/RCCL bandwidth test.
- Support bi-directional bandwidth benchmark in GPU copy bandwidth test.
- Support data checking in GPU copy bandwidth test.
- Update rccl-tests submodule to fix divide by zero error.
- Add GPU-Burn micro-benchmark.

Model-benchmark Improvements
----------------------------

- Sync results on root rank for e2e model benchmarks in distributed mode.
- Support customized `env` in local and torch.distributed mode.
- Add support for pytorch>=1.9.0.
- Keep BatchNorm as fp32 for pytorch cnn models cast to fp16.
- Remove FP16 samples type converting time.
- Support FAMBench.

Inference Benchmark Improvements
--------------------------------

- Revise the default setting for inference benchmark.
- Add percentile metrics for inference benchmarks.
- Support T4 and A10 in GEMM benchmark.
- Add configuration with inference benchmark.

Other Improvements
------------------

- Add command to support listing all optional parameters for benchmarks.
- Unify benchmark naming convention and support multiple tests with same benchmark and different parameters/options in one configuration file.
- Support timeout to detect the benchmark failure and stop the process automatically.
- Add rocm5.0 dockerfile.
- Improve output interface.

Data Diagnosis and Analysis
---------------------------

- Support multi-benchmark check.
- Support result summary in md, html and excel formats.
- Support data diagnosis in md and html formats.
- Support result output for all nodes in data diagnosis.
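
As a rough illustration of what percentile metrics summarize for inference latency, the sketch below computes nearest-rank percentiles over a list of per-request latencies. The function name, the chosen percentiles, and the nearest-rank method are all assumptions made for this example; the notes above do not specify which percentiles or interpolation method SuperBench reports.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [12.1, 11.8, 12.4, 30.2, 12.0, 11.9, 12.3, 12.2, 12.1, 45.7]
    for pct in (50, 90, 99):
        print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

Tail percentiles such as p99 expose occasional slow requests that a mean would hide, which is the usual reason for reporting them alongside averages.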
-
Pre-release v0.5.0-rc1.
-
SuperBench v0.4.0 Release Notes
===============================

SuperBench Framework
--------------------

__Monitor__

- Add monitor framework for NVIDIA GPU, CPU, memory and disk.

__Data Diagnosis and Analysis__

- Support baseline-based data diagnosis.
- Support basic analysis feature (boxplot figure, outlier detection, etc.).

Single-node Validation
----------------------

__Micro Benchmarks__

- CPU Memory Validation (tool: Intel Memory Latency Checker).
- GPU Copy Bandwidth (tool: built by MSRA).
- Add ORT Model on AMD GPU platform.
- Add inference backend TensorRT.
- Add inference backend ORT.

Multi-node Validation
---------------------

__Micro Benchmarks__

- IB Networking validation.
- TCP validation (tool: TCPing).
- GPCNet Validation (tool: GPCNet).

Other Improvement
-----------------

1. Enhancement
   - Add pipeline for AMD docker.
   - Integrate system config info script with SuperBench.
   - Support FP32 mode without TF32.
   - Refine unit test for microbenchmark.
   - Unify metric names for all benchmarks.

2. Document
   - Add benchmark list.
   - Add monitor document.
   - Add data diagnosis document.
-
SuperBench v0.3.0 Release Notes
===============================

SuperBench Framework
--------------------

__Runner__

- Implement MPI mode.

__Benchmarks__

- Support Docker benchmark.

Single-node Validation
----------------------

__Micro Benchmarks__

1. Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)

   | Metrics        | Unit | Description                         |
   |----------------|------|-------------------------------------|
   | H2D_Mem_BW_GPU | GB/s | host-to-GPU bandwidth for each GPU  |
   | D2H_Mem_BW_GPU | GB/s | GPU-to-host bandwidth for each GPU  |

2. IBLoopback (Tool: PerfTest – Standard RDMA Test Tool)

   | Metrics  | Unit | Description                                                   |
   |----------|------|---------------------------------------------------------------|
   | IB_Write | MB/s | The IB write loopback throughput with different message sizes |
   | IB_Read  | MB/s | The IB read loopback throughput with different message sizes  |
   | IB_Send  | MB/s | The IB send loopback throughput with different message sizes  |

3. NCCL/RCCL (Tool: NCCL/RCCL Tests)

   | Metrics             | Unit | Description                                                     |
   |---------------------|------|-----------------------------------------------------------------|
   | NCCL_AllReduce      | GB/s | The NCCL AllReduce performance with different message sizes     |
   | NCCL_AllGather      | GB/s | The NCCL AllGather performance with different message sizes     |
   | NCCL_broadcast      | GB/s | The NCCL Broadcast performance with different message sizes     |
   | NCCL_reduce         | GB/s | The NCCL Reduce performance with different message sizes        |
   | NCCL_reduce_scatter | GB/s | The NCCL ReduceScatter performance with different message sizes |

4. Disk (Tool: FIO – Standard Disk Performance Tool)

   | Metrics        | Unit | Description                                                                     |
   |----------------|------|---------------------------------------------------------------------------------|
   | Seq_Read       | MB/s | Sequential read performance                                                     |
   | Seq_Write      | MB/s | Sequential write performance                                                    |
   | Rand_Read      | MB/s | Random read performance                                                         |
   | Rand_Write     | MB/s | Random write performance                                                        |
   | Seq_R/W_Read   | MB/s | Read performance in sequential read/write, fixed measurement (read:write = 4:1) |
   | Seq_R/W_Write  | MB/s | Write performance in sequential read/write (read:write = 4:1)                   |
   | Rand_R/W_Read  | MB/s | Read performance in random read/write (read:write = 4:1)                        |
   | Rand_R/W_Write | MB/s | Write performance in random read/write (read:write = 4:1)                       |

5. H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)

   | Metrics       | Unit | Description                                         |
   |---------------|------|-----------------------------------------------------|
   | H2D_SM_BW_GPU | GB/s | host-to-GPU bandwidth using GPU kernel for each GPU |
   | D2H_SM_BW_GPU | GB/s | GPU-to-host bandwidth using GPU kernel for each GPU |

AMD GPU Support
---------------

__Docker Image Support__

- ROCm 4.2 PyTorch 1.7.0
- ROCm 4.0 PyTorch 1.7.0

__Micro Benchmarks__

1. Kernel Launch (Tool: MSR-A build)

   | Metrics                  | Unit      | Description                                                  |
   |--------------------------|-----------|--------------------------------------------------------------|
   | Kernel_Launch_Event_Time | Time (ms) | Dispatch latency measured in GPU time using hipEventRecord() |
   | Kernel_Launch_Wall_Time  | Time (ms) | Dispatch latency measured in CPU time                        |

2. GEMM FLOPS (Tool: AMD rocblas-bench Tool)

   | Metrics  | Unit   | Description                   |
   |----------|--------|-------------------------------|
   | FP64     | GFLOPS | FP64 FLOPS without MatrixCore |
   | FP32(MC) | GFLOPS | TF32 FLOPS with MatrixCore    |
   | FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore    |
   | BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore    |
   | INT8(MC) | GOPS   | INT8 FLOPS with MatrixCore    |

__E2E Benchmarks__

1. CNN models -- Use PyTorch torchvision models
   - ResNet: ResNet-50, ResNet-101, ResNet-152
   - DenseNet: DenseNet-169, DenseNet-201
   - VGG: VGG-11, VGG-13, VGG-16, VGG-19

2. BERT -- Use huggingface Transformers
   - BERT
   - BERT Large

3. LSTM -- Use PyTorch

4. GPT-2 -- Use huggingface Transformers

Bug Fix
-------

- VGG models failed on A100 GPU with batch_size=128

Other Improvement
-----------------

1. Contribution related
   - Contribute rule
   - System information collection

2. Document
   - Add release process doc
   - Add design documents
   - Add developer guide doc for coding style
   - Add contribution rules
   - Add docker image list
   - Add initial validation results
-
SuperBench v0.2.1 Release Notes
===============================

Bug Fixes
---------

* Fix Ansible connection issue when running on localhost.
* Fix crashes in distributed training of VGG models.
* Fix bug when converting bool config to store_true argument.
-
SuperBench v0.2.0 Release Notes
===============================

SuperBench Framework
--------------------

* Implemented a CLI to provide a command line interface.
* Implemented Runner for nodes control and management.
* Implemented Executor.
* Implemented Benchmark framework.

Supported Benchmarks
--------------------

* Supported Micro-benchmarks
  * GEMM FLOPS (GFLOPS, TensorCore, cuBLAS, cuDNN)
  * Kernel Launch Time (Kernel_Launch_Event_Time, Kernel_Launch_Wall_Time)
  * Operator Performance (MatMul, Sharding_MatMul)
* Supported Model-benchmarks
  * CNN models (Reference: [torchvision models](https://github.com/pytorch/vision/tree/v0.8.0/torchvision/models))
    * ResNet (ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152)
    * DenseNet (DenseNet-161, DenseNet-169, DenseNet-201)
    * VGG (VGG-11, VGG-13, VGG-16, VGG-19, VGG11_bn, VGG13_bn, VGG16_bn, VGG19_bn)
    * MNASNet (mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3)
    * AlexNet
    * GoogLeNet
    * Inception_v3
    * mobilenet_v2
    * ResNeXt (resnext50_32x4d, resnext101_32x8d)
    * Wide ResNet (wide_resnet50_2, wide_resnet101_2)
    * ShuffleNet (shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0)
    * SqueezeNet (squeezenet1_0, squeezenet1_1)
  * LSTM model
  * BERT models (BERT-Base, BERT-Large)
  * GPT-2 model (specify which config)

Examples and Documents
----------------------

* Added examples to run benchmarks respectively.
* Tutorial Documents (introduction, getting-started, developer-guides, APIs, benchmarks).
* Built SuperBench [website](https://aka.ms/superbench/).