- 15 Apr, 2026 1 commit
-
-
one authored
-
- 04 Dec, 2025 1 commit
-
-
Henry Li authored
**Description** The ib-loopback test was regressed due to this recent [change](https://github.com/microsoft/superbenchmark/commit/c65ae56713d6bfcc4a3be37d7fe24779590f9791). When running ib-loopback using the standard [config](https://github.com/microsoft/superbenchmark/blob/c65ae56713d6bfcc4a3be37d7fe24779590f9791/superbench/config/default.yaml#L69 ), the test would fail since it would pass numeric values like `0` into the test command which would break since it is not a valid IB device name. Example failure: ``` [2025-11-25 22:08:38,100 vmssnc6ec000003:141056][micro_base.py:200][INFO] Execute command - round: 0, benchmark: ib-loopback, command: /usr/local/bin/run_perftest_loopback 47 45 /usr/local/b in/ib_write_bw -s 8388608 -F --iters=20000 -d 0 -p 45617 -x 0 --report_gbits. [0]: IB device 0 not found Unable to find the Infiniband/RoCE device IB device 0 not found Unable to find the Infiniband/RoCE device [2025-11-25 22:08:39,113 vmssnc6ec000003:141056][micro_base.py:209][ERROR] Microbenchmark execution failed - round: 0, benchmark: ib-loopback, error message: IB device 0 not found Unable to find the Infiniband/RoCE device IB device 0 not found Unable to find the Infiniband/RoCE device ``` **Major Revision** - Major Revision A - Major Revision B - ... **Minor Revision** - Minor Revision A - Minor Revision B - ... --------- Co-authored-by:
Henry Li <lihl@microsoft.com>
-
- 17 Nov, 2025 1 commit
-
-
Yuting Jiang authored
Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB device by MPI local rank in ib validation (#733) **Description** add --set_ib_devices option to auto-select IB device by MPI local rank **Major Revision** - Add a new CLI flag --set_ib_devices to automatically select irregular IB devices based on the MPI local rank. - When enabled, the benchmark queries available IB devices via network.get_ib_devices() and selects the device corresponding to OMPI_COMM_WORLD_LOCAL_RANK. - Fall back to existing --ib_dev behavior when the flag is not provided. **Minor Revision** - Add an env in network.get_ib_devices() to allow user to set the device name
-
- 10 Oct, 2024 1 commit
-
-
Yuting Jiang authored
**Description** Cherry pick bug fixes from v0.11.0 to main **Major Revision** * #645 * #648 * #646 * #647 * #651 * #652 * #650 --------- Co-authored-by:
hongtaozhang <hongtaozhang@microsoft.com> Co-authored-by:
Yifan Xiong <yifan.xiong@microsoft.com>
-
- 08 Jan, 2024 1 commit
-
-
Yifan Xiong authored
**Description** Cherry-pick bug fixes from v0.10.0 to main. **Major Revisions** * Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590 * Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591 * Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592 * Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595 * Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596 * CI/CD - Add ndv5 topo file #597 * Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593 * Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599 * Dockerfile - Bug fix for rocm docker build and deploy #598 * Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603 * Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604 * Monitor - U...
-
- 27 Nov, 2023 1 commit
-
-
guoshzhao authored
**Description** Add AMD support in monitor. **Major Revision** - Add library pyrsmi to collect metrics. - Currently can get device_utilization, device_power, device_used_memory and device_total_memory.
-
- 05 Jul, 2023 1 commit
-
-
Yuting Jiang authored
**Description** add python code for DirecXGPUMemBw.
-
- 14 Apr, 2023 1 commit
-
-
Yifan Xiong authored
**Description** Cherry-pick bug fixes from v0.8.0 to main. **Major Revisions** * Monitor - Fix the cgroup version checking logic (#502) * Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503) * Fix wrong torch usage in communication wrapper for Distributed Inference Benchmark (#505) * Analyzer: Fix bug in python3.8 due to pandas api change (#504) * Bug - Fix bug to get metric from cmd when error happens (#506) * Monitor - Collect realtime GPU power when benchmarking (#507) * Add num_workers argument in model benchmark (#511) * Remove unreachable condition when write host list (#512) * Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513) * Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515) * Docs - Upgrade version and release note (#508) Co-authored-by:
guoshzhao <guzhao@microsoft.com> Co-authored-by:
Ziyue Yang <ziyyang@microsoft.com> Co-authored-by:
Yuting Jiang <yutingjiang@microsoft.com>
-
- 22 Mar, 2023 1 commit
-
-
guoshzhao authored
**Description** Since ubuntu 22.04 will use cgroup V2 and the file structure changed. Modify the monitor to adapt to cgroup v1 and v2.
-
- 13 Feb, 2023 1 commit
-
-
Yuting Jiang authored
**Description** Support SuperBench Executor running on Windows. **Major Revision** - Lazy import ansible related module
-
- 04 Jan, 2023 1 commit
-
-
Yang Wang authored
**Major Revision** - Add an option for pattern to generate mpi_pattern.txt file if specified the path. - In mpi pattern, serial_index and parallel_index will add in each benchmark as environment variables. **Minor Revision** - Fix typo
-
- 03 Jan, 2023 1 commit
-
-
Yang Wang authored
**Description** Support the following patterns in `mpi` mode: * `k-batch` * `topo-aware`
-
- 30 Dec, 2022 1 commit
-
-
Yuting Jiang authored
**Description** Add stdout logging util module and enable real-time logging flushing in executor **Major Revision** - Add stdout logging util module to redirect stdout into file log - enable stdout logging in executor to write benchmark output into both stdout and file `sb-bench.log` - enable real-time log flushing in run_command of microbenchmarks through config `log_flushing` **Minor Revision** - add log_n_step args to enable regular step time log in model benchmarks - udpate related docs
-
- 29 Dec, 2022 1 commit
-
-
Yang Wang authored
* Extract pair-wise pattern from ib_validation
-
- 29 Nov, 2022 1 commit
-
-
Yang Wang authored
* add mpi-parallels mode * update according to comments * fix and update doc * update * merge into 'mpi' mode * udpate according to comments * fix testcases * fix ansible * regard pattern as field * udpate * fix flake8 version * add flake8 range * remove map-by from host config * udpate comments
-
- 06 Sep, 2022 1 commit
-
-
Yifan Xiong authored
**Description** Cherry-pick bug fixes from v0.6.0 to main. **Major Revisions** * Enable latency test in ib traffic validation distributed benchmark (#396) * Enhance parameter parsing to allow spaces in value (#397) * Update apt packages in dockerfile (#398) * Upgrade colorlog for NO_COLOR support (#404) * Analyzer - Update error handling to support exit code of sb result diagnosis (#403) * Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399) * Enhance timeout cleanup to avoid possible hanging (#405) * Auto generate ibstat file by pssh (#402) * Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406) * Docs - Upgrade version and release note (#407) * Docs - Fix issues in document (#408) Co-authored-by:
Yang Wang <yangwang1@microsoft.com> Co-authored-by:
Yuting Jiang <yutingjiang@microsoft.com>
-
- 13 Aug, 2022 1 commit
-
-
Yang Wang authored
An enhancement for topo-aware IB performance validation #373. This PR will auto-generate a required ibstate file `ib_traffic_topo_aware_ibstat.txt` which is used as input to build a graph.
-
- 26 Jul, 2022 1 commit
-
-
Jie Zhang authored
* Support topo-aware IB performance validation Add a new pattern `topo-aware`, so the user can run IB performance test based on VM's topology information. This way, the user can validate the IB performance across VM pairs with different distance as a quick test instead of pair-wise test. To run with topo-aware pattern, user needs to specify three required (and two optional) parameters in YAML config file: --pattern topo-aware --ibstat path to ibstat output --ibnetdiscover path to ibnetdiscover output --min_dist minimum distance of VM pairs (optional, default 2) --max_dist maximum distance of VM pairs (optional, default 6) The newly added topo_aware module then parses the topology information, builds a graph, and generates the VM pairs with the specified distance (# hops). The specified IB test will then be running across these generated VM pairs. Signed-off-by:
Jie Zhang <jessezhang1010@gmail.com> * Add description about topology aware ib traffic tests Signed-off-by:
Jie Zhang <jessezhang1010@gmail.com> * Add unit test to verify generated topology aware config file This commit adds unit test to verify the generated topology aware config file is correct. To do so, four new data files are added in order to invoke gen_topo_aware_config function to generate topology aware config file, then compares it with the expected config file. Signed-off-by:
Jie Zhang <jessezhang1010@gmail.com> * Fix lint issue on Azure pipeline Signed-off-by:
Jie Zhang <jessezhang1010@gmail.com>
-
- 05 Jul, 2022 1 commit
-
-
Yifan Xiong authored
Support SKU auto detect and using corresponding benchmark config if running on Azure VM.
-
- 24 Jan, 2022 1 commit
-
-
Yuting Jiang authored
**Description** Fix code insecure issue that binds a socket to all network interfaces.
-
- 15 Nov, 2021 1 commit
-
-
guoshzhao authored
**Description** Rename `nvidia_helper` utility as `device_manager` module and support more functions: ``` device_manager.get_device_count() device_manager.get_device_utilization(idx) device_manager.get_device_temperature(idx) device_manager.get_device_power_limit(idx) device_manager.get_device_memory(idx) device_manager.get_device_row_remapped_info(idx) device_manager.get_device_ecc_error(idx) ```
-
- 31 Aug, 2021 1 commit
-
-
guoshzhao authored
**Description** Package frequently-used subprocess invoke into function.
-
- 13 Jul, 2021 1 commit
-
-
Yuting Jiang authored
Update network common utils. Add get_ib_devices in network common utils and move get_free_port from test utils to network common utils
-
- 09 Jul, 2021 1 commit
-
-
guoshzhao authored
* Bug Fix - Fix race condition issue for multi ranks (#117) Fix race condition issue when multi ranks rotating the same directory. * Update pipeline for release branch (#122) * Bug Fix - Fix bug when convert bool config to store_true argument. (#120) Co-authored-by:Yifan Xiong <yifan.xiong@microsoft.com>
-
- 02 Jul, 2021 1 commit
-
-
Yifan Xiong authored
Fetch benchmarks results on all nodes, will rsync after each benchmark. The results directory structure on control node is as follows: ``` outputs/ └── datetime ├── nodes │ └── node-0 │ ├── benchmarks │ │ ├── benchmark-0 │ │ │ ├── rank-0 │ │ │ │ └── results.json │ └── sb-exec.log ├── sb-run.log └── sb.config.yaml ```
-
- 01 Jul, 2021 1 commit
-
-
Yifan Xiong authored
* Support custom output directory. * Update document.
-
- 23 Jun, 2021 1 commit
-
-
Yifan Xiong authored
* Add `sb deploy` command content. * Fix inline if-expression syntax in playbook. * Fix quote escape issue in bash command. * Add custom env in config. * Update default config for multi GPU benchmarks. * Update MANIFEST.in to include jinja2 template. * Require jinja2 minimum version. * Fix occasional duplicate output in Ansible runner. * Fix mixed color from Ansible and Python colorlog. * Update according to comments. * Change superbench.env from list to dict in config file.
-
- 16 Jun, 2021 1 commit
-
-
Yifan Xiong authored
Fix bugs and refine log in single GPU benchmarks: * Fix none framework issue * Fix empty parameter bug * Remove missed mobilenet_v3 models * Change benchmark registration log to debug level * Add pid in logging * Add missing benchmarks in default config * Fix deprecated logging warn
-
- 01 Jun, 2021 1 commit
-
-
guoshzhao authored
-
- 18 May, 2021 1 commit
-
-
Yifan Xiong authored
* use absolute path of input file * parse registry uri from image * merge common parts for arguments processing
-
- 11 May, 2021 1 commit
-
-
Yifan Xiong authored
__Major Revision__ * Support lazy import. * Not importing benchmarks when running `help`, `version`, `deploy` commands, etc.
-
- 29 Mar, 2021 1 commit
-
-
Yifan Xiong authored
Update logger class. * add file handler along with stream handler * add colored formatter
-
- 26 Mar, 2021 1 commit
-
-
Yifan Xiong authored
Use omegaconf to replace hydra for configuration system: * remove hydra * use omegaconf to merge configurations
-
- 12 Mar, 2021 1 commit
-
-
Yifan Xiong authored
- Add CLI commands * sb version * sb deploy * sb exec * sb run - Add interface with executor and runner - Add cli test cases
-
- 04 Mar, 2021 1 commit
-
-
guoshzhao authored
Co-authored-by:Guoshuai Zhao <guzhao@microsoft.com>
-
- 24 Feb, 2021 1 commit
-
-
guoshzhao authored
* benchmarks init. Co-authored-by:Guoshuai Zhao <guzhao@microsoft.com>
-