- 19 Mar, 2026 1 commit
-
-
one authored
- Added Platform.DTK in the microbenchmark framework. - Introduced new DTK hipblaslt benchmark class and corresponding tests. - Updated Dockerfile to include hipblaslt-bench and its permissions. - Registered DTK benchmarks in the benchmark registry for various performance tests. - Enhanced GPU detection logic to recognize HYGON GPUs. This update improves the benchmarking capabilities for DTK, ensuring compatibility and performance testing across platforms.
-
- 17 Nov, 2025 1 commit
-
-
Yuting Jiang authored
Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB device by MPI local rank in ib validation (#733) **Description** add --set_ib_devices option to auto-select IB device by MPI local rank **Major Revision** - Add a new CLI flag --set_ib_devices to automatically select irregular IB devices based on the MPI local rank. - When enabled, the benchmark queries available IB devices via network.get_ib_devices() and selects the device corresponding to OMPI_COMM_WORLD_LOCAL_RANK. - Fall back to existing --ib_dev behavior when the flag is not provided. **Minor Revision** - Add an env in network.get_ib_devices() to allow user to set the device name
-
- 23 Oct, 2025 1 commit
-
-
Yuting Jiang authored
**Description** This PR adds NCU (NVIDIA Nsight Compute) profiling support to the cublaslt-gemm micro benchmark, enabling detailed kernel analysis including DRAM throughput, compute throughput, and launch arguments. **Major Revision** - Add --enable_ncu_profiling and --profiling_metrics for ncu profiling - Modifies command execution to use NCU when profiling is enabled - Updates result parsing to handle both standard and NCU profiled output formats
-
- 22 Oct, 2025 1 commit
-
-
Ziyue Yang authored
Benchmarks: Micro benchmark - Support verification and parallel run for disk performance benchmark (#741) **Description** Adds verification and parallel run support for disk performance benchmark. **Major Revision** - Adds `--verify` flag to support verify written data. - Supports loading benchmark options from `PROC_RANK`, `BLOCK_DEVICES` and `NUMA_NODES` environmental variables. --------- Co-authored-by:guoshzhao <guzhao@microsoft.com>
-
- 29 Sep, 2025 1 commit
-
-
Yuting Jiang authored
**Description** Add numa support for nvbandwidth.
-
- 20 Jun, 2025 1 commit
-
-
WenqingLan1 authored
**Description** Added support for Grace CPU neo2 architecture in CPU Stream. Now CPU Stream supports dual socket benchmarking. Example config for this arch support: ```yaml cpu-stream:numa0: timeout: *default_timeout modes: - name: local parallel: no parameters: cpu_arch: neo2 numa_mem_nodes: 0 cores: 0 1 2 3 4 5 6 7 8 cpu-stream:numa1: timeout: *default_timeout modes: - name: local parallel: no parameters: cpu_arch: neo2 numa_mem_nodes: 1 cores: 64 65 66 67 68 69 70 71 72 cpu-stream:numa-spread: timeout: *default_timeout modes: - name: local parallel: no parameters: cpu_arch: neo2 numa_mem_nodes: 0 1 cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72 ``` --------- Co-authored-by:dpower4 <dilipreddi@gmail.com>
-
- 18 Jun, 2025 1 commit
-
-
WenqingLan1 authored
Added GPU Stream benchmark - measures the GPU memory bandwidth and efficiency for double datatype through various memory operations including copy, scale, add, and triad. - added documentation for `gpu-stream` detailing its introduction, metrics, and descriptions. - added unit tests for `gpu-stream`. Example output is in `superbenchmark/tests/data/gpu_stream.log`.
-
- 14 Jun, 2025 1 commit
-
-
Hongtao Zhang authored
In the current implementation, the CPU‑stream benchmark code renames the binary before the microbench base class can verify its existence, causing the default‐binary check to fail. This PR adds a “default” binary—built with the standard compile parameters—so that the base class can always find and validate it. Once the default binary is in place, the CPU‑stream code will rename it as needed and re‑check its presence before running the benchmark. The PR also enable CPU stream in the default settings. --------- Co-authored-by:Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 15 Feb, 2025 1 commit
-
-
Hongtao Zhang authored
Root Cause: 1. '_get_all_test_cases()' was called in '_parser' while '_parser' was defined in the base class. 2. in '_get_all_test_cases()', cmd path was not included. Fix: 1. Remove '_get_all_test_cases()' from '_parser'. 2. Construct path for cmd. --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 05 Feb, 2025 1 commit
-
-
Hongtao Zhang authored
**Description** 1. Fixed the bug that nvbandwidth benchmark need to handle 'N/A' values in nvbandwidth cmd output. 2. Replaced the input format of test cases with a list. 3. Add nvbandwidth configuration example in default config files. --------- Co-authored-by:
hongtaozhang <hongtaozhang@microsoft.com> Co-authored-by:
Yifan Xiong <yifan.xiong@microsoft.com>
-
- 22 Nov, 2024 1 commit
-
-
Hongtao Zhang authored
**Description** Add nvbandwidth benchmark. --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 20 Nov, 2024 1 commit
-
-
Hongtao Zhang authored
**Description** Add micro benchmark to measure general CPU bandwidth and latency without 'mlc'. Test output: ``` { "cpu-memory-bw-latency/return_code": 0, "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_bw": 5388.75021, "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_lat": 0.185571786, "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_bw": 4634.82028, "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_lat": 0.215758096, } ``` --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 02 Apr, 2024 1 commit
-
-
Ziyue Yang authored
**Description** Adds hipblasLt tuning to dist-inference cpp implementation.
-
- 08 Jan, 2024 1 commit
-
-
Yifan Xiong authored
**Description** Cherry-pick bug fixes from v0.10.0 to main. **Major Revisions** * Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590 * Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591 * Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592 * Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595 * Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596 * CI/CD - Add ndv5 topo file #597 * Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593 * Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599 * Dockerfile - Bug fix for rocm docker build and deploy #598 * Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603 * Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604 * Monitor - Upgrade pyrsmi to amdsmi python library. #601 * Benchmarks: Micro benchmarks - add fp8 and initialization for hipblaslt benchmark #605 * Dockerfile - Add rocm6.0 dockerfile #602 * Bug Fix - Bug fix for latest megatron-lm benchmark #600 * Docs - Upgrade version and release note #606 Co-authored-by:
Ziyue Yang <ziyyang@microsoft.com> Co-authored-by:
Yang Wang <yangwang1@microsoft.com> Co-authored-by:
Yuting Jiang <yutingjiang@microsoft.com> Co-authored-by:
guoshzhao <guzhao@microsoft.com>
-
- 10 Dec, 2023 1 commit
-
-
Ziyue Yang authored
**Description** Add distributed inference benchmark cpp implementation.
-
- 08 Dec, 2023 1 commit
-
-
Ziyue Yang authored
Benchmarks: Micro benchmark - Add one-to-all, all-to-one, all-to-all support to gpu_copy_bw_performance (#588) **Description** Add one-to-all, all-to-one, all-to-all support to gpu_copy_bw_performance, and fix performance bug in gpu_copy
-
- 05 Dec, 2023 1 commit
-
-
Ziyue Yang authored
**Description** Revise NCCL/RCCL benchmarks to graph mode add latency metrics.
-
- 04 Dec, 2023 1 commit
-
-
Yuting Jiang authored
**Description** Benchmarks: micro benchmark - Support cpu-gpu and gpu-cpu in ib-validation **Major Revision** - Support cpu-gpu and gpu-cpu in ib-validation **Minor Revision** - support multi msg size, multi direction, multi ib commands in ib-validation
-
- 22 Nov, 2023 2 commits
-
-
Yuting Jiang authored
**Description** add initialization options for rocm gemm flops.
-
Yuting Jiang authored
**Description** hipblaslt function benchmark and rebase cublaslt function benchmark.
-
- 20 Nov, 2023 1 commit
-
-
Yuting Jiang authored
**Description** add int8 support for cublaslt function.
-
- 14 Nov, 2023 1 commit
-
-
Yuting Jiang authored
**Description** remove cp ptx file in gpu burn test since the command is run inside self.args.bin_dir dir. https://github.com/microsoft/superbenchmark/blob/d246bab430adeb461072918a551b2e2b68c9bce5/superbench/benchmarks/micro_benchmarks/micro_base.py#L183
-
- 06 Jul, 2023 1 commit
-
-
Yuting Jiang authored
**Description** add python code for DirectXGPUEncodingLatency.
-
- 05 Jul, 2023 3 commits
-
-
Yuting Jiang authored
**Description** add python code for DirectXGPUCopy.
-
Yuting Jiang authored
**Description** add python code for DirecXGPUMemBw.
-
Yuting Jiang authored
**Description** add python code for DirectX core flops and init DirectX test pipeline. **Major Revision** - add python code for DirectX core flops - init DirectX test pipeline **Minor Revision** - add test for DirectX core flops
-
- 30 Jun, 2023 2 commits
-
-
Yuting Jiang authored
**Description** add auto selecting algorithm support for cudnn functions. **Major Revision** - add auto selecting algorithm support for cudnn functions in source code - add 'auto_algo' option in benchmark - add related test
-
Yifan Xiong authored
* Update result parsing for newer tensorrt versions * Update arguments when load torchvision models
-
- 24 Mar, 2023 1 commit
-
-
Ziyue Yang authored
**Description** This PR adds a micro-benchmark of distributed model inference workloads. **Major Revision** - Add a new micro-benchmark dist-inference. - Add corresponding example and unit tests. - Update configuration files to include this new micro-benchmark. - Update micro-benchmark README. --------- Co-authored-by:Peng Cheng <chengpeng5555@outlook.com>
-
- 22 Mar, 2023 1 commit
-
-
Yifan Xiong authored
Support batch and shape range with multiplication factors in cublaslt gemm benchmark.
-
- 21 Mar, 2023 1 commit
-
-
rafsalas19 authored
**Description** - Adding HPL benchmark --------- Co-authored-by:
Ubuntu <azureuser@sbtestvm.jzlku1oskncengjiado35wf1hd.ax.internal.cloudapp.net> Co-authored-by:
Peng Cheng <chengpeng5555@outlook.com>
-
- 13 Feb, 2023 1 commit
-
-
rafsalas19 authored
**Description** - Added stream benchmark - Added stream unit test - Added stream example - Modified docker files to build stream --------- Co-authored-by:
Ubuntu <azureuser@sbtestvm.jzlku1oskncengjiado35wf1hd.ax.internal.cloudapp.net> Co-authored-by:
Peng Cheng <chengpeng5555@outlook.com> Co-authored-by:
Yifan Xiong <xiongyf@yandex.com>
-
- 04 Jan, 2023 1 commit
-
-
Yang Wang authored
Support traffic patterns under the different devices in NCCL/RCCL test * change the metrics format if specified the pattern
-
- 03 Jan, 2023 2 commits
-
-
Yifan Xiong authored
Integrate cublaslt-gemm micro-benchmark #451.
-
Yuting Jiang authored
**Description** Add correctness check in cublas-function benchmark. **Major Revision** - add python code of correctness check in cublas-function benchmark and test
-
- 14 Dec, 2022 1 commit
-
-
Yuting Jiang authored
**Description** Add wait time option to resolve mem-bw unstable issue.
-
- 18 Oct, 2022 1 commit
-
-
Yuting Jiang authored
Benchmarks - Add support to allow list of custom config string in cudnn-functions and cublas-functions (#414) **Description** Add support to allow list of custom config string in cudnn-functions and cublas-functions.
-
- 06 Sep, 2022 1 commit
-
-
Yifan Xiong authored
**Description** Cherry-pick bug fixes from v0.6.0 to main. **Major Revisions** * Enable latency test in ib traffic validation distributed benchmark (#396) * Enhance parameter parsing to allow spaces in value (#397) * Update apt packages in dockerfile (#398) * Upgrade colorlog for NO_COLOR support (#404) * Analyzer - Update error handling to support exit code of sb result diagnosis (#403) * Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399) * Enhance timeout cleanup to avoid possible hanging (#405) * Auto generate ibstat file by pssh (#402) * Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406) * Docs - Upgrade version and release note (#407) * Docs - Fix issues in document (#408) Co-authored-by:
Yang Wang <yangwang1@microsoft.com> Co-authored-by:
Yuting Jiang <yutingjiang@microsoft.com>
-
- 26 Jul, 2022 1 commit
-
-
Jie Zhang authored
* Support topo-aware IB performance validation Add a new pattern `topo-aware`, so the user can run IB performance test based on VM's topology information. This way, the user can validate the IB performance across VM pairs with different distance as a quick test instead of pair-wise test. To run with topo-aware pattern, user needs to specify three required (and two optional) parameters in YAML config file: --pattern topo-aware --ibstat path to ibstat output --ibnetdiscover path to ibnetdiscover output --min_dist minimum distance of VM pairs (optional, default 2) --max_dist maximum distance of VM pairs (optional, default 6) The newly added topo_aware module then parses the topology information, builds a graph, and generates the VM pairs with the specified distance (# hops). The specified IB test will then be running across these generated VM pairs. Signed-off-by:
Jie Zhang <jessezhang1010@gmail.com> * Add description about topology aware ib traffic tests Signed-off-by:
Jie Zhang <jessezhang1010@gmail.com> * Add unit test to verify generated topology aware config file This commit adds unit test to verify the generated topology aware config file is correct. To do so, four new data files are added in order to invoke gen_topo_aware_config function to generate topology aware config file, then compares it with the expected config file. Signed-off-by:
Jie Zhang <jessezhang1010@gmail.com> * Fix lint issue on Azure pipeline Signed-off-by:
Jie Zhang <jessezhang1010@gmail.com>
-
- 25 Jul, 2022 1 commit
-
-
Yang Wang authored
Fix an unexpected result value (`-0.125`) issue in ib traffic benchmark when encountering `-1` in raw output * Check if the value is valid before the base conversion * Add a test case to cover this situation
-