- 04 Dec, 2025 1 commit
-
-
Henry Li authored
**Description**
The ib-loopback test regressed after this recent [change](https://github.com/microsoft/superbenchmark/commit/c65ae56713d6bfcc4a3be37d7fe24779590f9791). When ib-loopback runs with the standard [config](https://github.com/microsoft/superbenchmark/blob/c65ae56713d6bfcc4a3be37d7fe24779590f9791/superbench/config/default.yaml#L69), the test fails because numeric values such as `0` are passed into the test command, and these are not valid IB device names.

Example failure:
```
[2025-11-25 22:08:38,100 vmssnc6ec000003:141056][micro_base.py:200][INFO] Execute command - round: 0, benchmark: ib-loopback, command: /usr/local/bin/run_perftest_loopback 47 45 /usr/local/bin/ib_write_bw -s 8388608 -F --iters=20000 -d 0 -p 45617 -x 0 --report_gbits.
[0]: IB device 0 not found
Unable to find the Infiniband/RoCE device
IB device 0 not found
Unable to find the Infiniband/RoCE device
[2025-11-25 22:08:39,113 vmssnc6ec000003:141056][micro_base.py:209][ERROR] Microbenchmark execution failed - round: 0, benchmark: ib-loopback, error message: IB device 0 not found
Unable to find the Infiniband/RoCE device
IB device 0 not found
Unable to find the Infiniband/RoCE device
```

--------- Co-authored-by: Henry Li <lihl@microsoft.com>
-
- 17 Nov, 2025 1 commit
-
-
Yuting Jiang authored
Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB device by MPI local rank in ib validation (#733)

**Description**
Add --set_ib_devices option to auto-select IB device by MPI local rank.

**Major Revision**
- Add a new CLI flag --set_ib_devices to automatically select irregular IB devices based on the MPI local rank.
- When enabled, the benchmark queries available IB devices via network.get_ib_devices() and selects the device corresponding to OMPI_COMM_WORLD_LOCAL_RANK.
- Fall back to existing --ib_dev behavior when the flag is not provided.

**Minor Revision**
- Add an env in network.get_ib_devices() to allow the user to set the device name.
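The rank-based selection described in this commit can be illustrated with a minimal sketch. It is not the benchmark's actual code: `select_ib_device` and the device names are hypothetical, and the device list would in practice come from `network.get_ib_devices()`.

```python
import os


def select_ib_device(devices, env=os.environ):
    """Pick the IB device whose list index matches the MPI local rank (hypothetical helper)."""
    # Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK for every process it launches.
    local_rank = int(env.get('OMPI_COMM_WORLD_LOCAL_RANK', '0'))
    if not devices:
        raise ValueError('no IB devices available')
    if local_rank >= len(devices):
        raise IndexError(f'local rank {local_rank} has no matching IB device')
    return devices[local_rank]


# Illustrative device names only; real names depend on the machine.
devices = ['mlx5_0', 'mlx5_1', 'mlx5_2', 'mlx5_3']
print(select_ib_device(devices, {'OMPI_COMM_WORLD_LOCAL_RANK': '2'}))
```

Passing the environment as a parameter keeps the helper easy to test without launching MPI.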
-
- 06 Nov, 2025 1 commit
-
-
WenqingLan1 authored
Updated mlc wget link in dockerfiles. --------- Co-authored-by:guoshzhao <guzhao@microsoft.com>
-
- 05 Nov, 2025 1 commit
-
-
Hongtao Zhang authored
The Python 3.10 verification pipeline failed because of a conflicting 'setuptools' version, as shown below.

<img width="1157" height="622" alt="image" src="https://github.com/user-attachments/assets/ba0f6045-4b92-4fd8-b92f-1c474725534c" />

Root Cause: Modern pip (25.3) uses an isolated build environment with the latest setuptools by default. The pipeline installs setuptools 65.7 in the user environment, but pip builds the package in an isolated environment with a newer setuptools, which conflicts with the version check in setup.py.

Solution: Remove the pip upgrade.

--------- Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 23 Oct, 2025 1 commit
-
-
Yuting Jiang authored
**Description** This PR adds NCU (NVIDIA Nsight Compute) profiling support to the cublaslt-gemm micro benchmark, enabling detailed kernel analysis including DRAM throughput, compute throughput, and launch arguments. **Major Revision** - Add --enable_ncu_profiling and --profiling_metrics for ncu profiling - Modifies command execution to use NCU when profiling is enabled - Updates result parsing to handle both standard and NCU profiled output formats
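As a rough illustration of running a command under NCU when profiling is enabled, the sketch below prefixes a benchmark command with an `ncu --metrics` invocation. The helper name `wrap_with_ncu` is hypothetical, the metric names are only examples, and the PR's real command construction may differ.

```python
import shlex


def wrap_with_ncu(command, metrics):
    """Return an argv list that runs `command` under ncu, collecting `metrics` (hypothetical helper)."""
    # `ncu --metrics a,b <application>` collects the listed metrics for each profiled kernel.
    ncu_prefix = ['ncu', '--metrics', ','.join(metrics)]
    return ncu_prefix + shlex.split(command)


# Example metric names, assumed here for illustration of DRAM and compute throughput.
metrics = [
    'dram__throughput.avg.pct_of_peak_sustained_elapsed',
    'sm__throughput.avg.pct_of_peak_sustained_elapsed',
]
print(wrap_with_ncu('cublaslt_gemm -m 2048 -n 12288 -k 1536 -t fp16', metrics))
```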
-
- 22 Oct, 2025 2 commits
-
-
Ziyue Yang authored
Benchmarks: Micro benchmark - Support verification and parallel run for disk performance benchmark (#741) **Description** Adds verification and parallel run support for disk performance benchmark. **Major Revision** - Adds `--verify` flag to support verify written data. - Supports loading benchmark options from `PROC_RANK`, `BLOCK_DEVICES` and `NUMA_NODES` environmental variables. --------- Co-authored-by:guoshzhao <guzhao@microsoft.com>
-
Hongtao Zhang authored
**Description** The Python 3.10 pipeline failed. **Solution** The log shows that the 'bc' command is missing. Since our image tags are simple, the solution is to remove the 'bc' command directly. --------- Co-authored-by:Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 08 Oct, 2025 2 commits
-
-
Hongtao Zhang authored
To improve benchmark debugging, the following debug methods were added:

PyTorch profiler in model benchmarks
- SB_ENABLE_PYTORCH_PROFILER: switch to enable/disable
- SB_TORCH_PROFILER_TRACE_DIR: log path

These two runtime variables need to be configured in the SB config file.

nsys in the SB runner
- SB_ENABLE_NSYS: switch to enable/disable
- SB_NSYS_TRACE_DIR: log path

These two runtime variables need to be configured in the runner's ENV.

--------- Co-authored-by:Hongtao Zhang <hongtaozhang@microsoft.com>
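A minimal sketch of how such a switch and trace directory might be read from the environment follows. The accepted values (`1`/`true`) and the helper name `profiler_settings` are assumptions for illustration; the commit does not state how the variables are parsed.

```python
import os


def profiler_settings(env=os.environ):
    """Return (enabled, trace_dir) from the profiler variables (hypothetical helper)."""
    # Assumption: '1' or 'true' (case-insensitive) enables the profiler.
    enabled = env.get('SB_ENABLE_PYTORCH_PROFILER', '0').lower() in ('1', 'true')
    # None means no trace directory was configured.
    trace_dir = env.get('SB_TORCH_PROFILER_TRACE_DIR')
    return enabled, trace_dir


print(profiler_settings({
    'SB_ENABLE_PYTORCH_PROFILER': '1',
    'SB_TORCH_PROFILER_TRACE_DIR': '/tmp/traces',
}))
```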
-
Yifan Xiong authored
Fix image merge for release event in GitHub Action.
-
- 01 Oct, 2025 1 commit
-
-
WenqingLan1 authored
Add support for cuda13.0. Add cuda13.0.dockerfile. Add cuda13.0 image building task to github pipeline. Update GPU STREAM to work with cuda13.0. Fix data type conversion perf bug in GPU stream. Update nvbandwidth submodule to be v0.8. Update perftest submodule to be 4bee61f80d9e268fc97eaf40be00409e91d3a19e (recent master). --------- Co-authored-by:
Ubuntu <dilipreddi@gmail.com> Co-authored-by:
guoshzhao <guzhao@microsoft.com>
-
- 30 Sep, 2025 1 commit
-
-
Yuting Jiang authored
Benchmarks: Micro benchmark - Add simultaneous all-to-host / host-to-all bandwidth testcases to nvbandwidth (#736) **Description** Add simultaneous all-to-host / host-to-all bandwidth testcases to nvbandwidth. **Major Revision** - nvbandwidth.patch: Add simultaneous all-to-host / host-to-all bandwidth testcases to nvbandwidth - Upgrade nvbandwidth submodule to v0.8 - Add the patch to the makefile build
-
- 29 Sep, 2025 2 commits
-
-
Yuting Jiang authored
**Description** Add an option to exclude data copy time in model benchmarks. **Major Revision** - Add an option --no_copy - Move the start time to after the data copy finishes
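The idea of moving the start time past the data copy can be sketched as below. This is an illustration only, with hypothetical callables standing in for the real data loading, host-to-device copy, and model step; on a real GPU the copy would also need to be synchronized before the timer starts.

```python
import time


def timed_step(load_batch, copy_to_device, compute, no_copy=False):
    """Time one step; with no_copy=True the timer starts after the copy completes."""
    batch = load_batch()
    start = time.perf_counter()
    device_batch = copy_to_device(batch)
    if no_copy:
        # Restart the clock so the copy is excluded from the measured time.
        start = time.perf_counter()
    compute(device_batch)
    return time.perf_counter() - start


def fake_copy(batch):
    time.sleep(0.05)  # Stand-in for a slow host-to-device copy.
    return batch


with_copy = timed_step(lambda: [1, 2, 3], fake_copy, sum)
without_copy = timed_step(lambda: [1, 2, 3], fake_copy, sum, no_copy=True)
print(with_copy > without_copy)
```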
-
Yuting Jiang authored
**Description** Add numa support for nvbandwidth.
-
- 19 Sep, 2025 1 commit
-
-
Yuting Jiang authored
Benchmarks: micro benchmarks - change cublasLtMatmulDescCreate scaleType from CUDA_R_32F to CUDA_R_16F in FP16 dist inference (#732) **Description** change cublasLtMatmulDescCreate scaleType from CUDA_R_32F to CUDA_R_16F in FP16 dist inference to fix cublaslt error.
-
- 12 Aug, 2025 1 commit
-
-
Hongtao Zhang authored
**Description** Cherry-pick bug fixes from v0.12.0 to main. **Major Revisions** * #725 * #727 * #728 Co-authored-by:
Hongtao Zhang <hongtaozhang@microsoft.com> Co-authored-by:
Yifan Xiong <yixio@microsoft.com> Co-authored-by:
Guoshuai Zhao <guzhao@microsoft.com> --------- Co-authored-by:
Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 30 Jun, 2025 1 commit
-
-
pdr authored
Added MoE model using MixtralConfig. 1. Added 8x7b and 8x22b variants 2. Requires high VRAM as all experts are loaded in memory. Thus, disabled training due to memory constraint on test worker. --------- Co-authored-by:
Hongtao Zhang <garyworkzht@gmail.com> Co-authored-by:
Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by:
Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 26 Jun, 2025 1 commit
-
-
Yuting Jiang authored
**Description** Add deepseek megatron-lm benchmark. --------- Co-authored-by:
yukirora <yuting.jiang@microsoft.com> Co-authored-by:
Hongtao Zhang <garyworkzht@gmail.com> Co-authored-by:
Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 25 Jun, 2025 1 commit
-
-
guoshzhao authored
**Description** Add cuda 12.9 dockerfile and build in pipeline. --------- Co-authored-by:
Guoshuai Zhao <microsoft@microsoft.com> Co-authored-by:
Hongtao Zhang <hongtaozhang@microsoft.com> Co-authored-by:
Hongtao Zhang <garyworkzht@gmail.com>
-
- 24 Jun, 2025 1 commit
-
-
guoshzhao authored
**Description** Add FP4 precision support for cublaslt_gemm benchmark. **Major Revision** - Add new type `fp4e2m1` and `__nv_fp4_e2m1`. - For FP4 matmul, the precision of MatrixC (add) should be FP16 and the precision of MatrixD (output) should be FP4; otherwise, it will not work. - Add macro `CUDA_VERSION` to resolve compatibility issues between different CUDA versions. --------- Co-authored-by:
Ubuntu <aiperf@aiperf000000.hp5z1gqeinfufbj2u3jcty5fme.cdmx.internal.cloudapp.net> Co-authored-by:
AVA <39534996+avazr@users.noreply.github.com> Co-authored-by:
Guoshuai Zhao <microsoft@microsoft.com>
-
- 20 Jun, 2025 2 commits
-
-
Babak Hejazi authored
**Description** Enable autotuning as an opt-in mode when benchmarking cublasLt via `cublaslt_gemm` The implementation is based on https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu The behavior of original benchmark command remains unchanged, e.g.: - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w10000 -i 1000 -t fp8e4m3` The new opt-in options are `-a` (for autotune) and `-I` (for autotune iterations, default is 50, same as the default for `-i`) and `-W` (for autotune warmups, default=20, same as the default for `-w`), e.g.: - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a` - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a -I 10 -W 10` **Note:** This PR also changes the default `gemm_compute_type` for BF16 and FP16 to `CUBLAS_COMPUTE_32F`. **Further observations:** 1. The support matrix of the `cublaslt_gemm` could be furt...
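The general shape of such an autotuning pass (warm up each candidate, time it over a fixed number of iterations, keep the fastest) can be sketched in plain Python. This is not the cuBLASLt code from the PR: the candidates here are arbitrary callables, and only the default counts (20 warmups, 50 iterations) are taken from the description above.

```python
import time


def autotune(candidates, warmups=20, iterations=50):
    """Return (name, average_seconds) for the fastest callable in `candidates`."""
    best_name, best_time = None, float('inf')
    for name, run in candidates.items():
        # Warm up so one-time setup costs do not distort the measurement.
        for _ in range(warmups):
            run()
        start = time.perf_counter()
        for _ in range(iterations):
            run()
        average = (time.perf_counter() - start) / iterations
        if average < best_time:
            best_name, best_time = name, average
    return best_name, best_time


# Hypothetical candidates standing in for different GEMM algorithms.
candidates = {
    'algo_slow': lambda: time.sleep(0.002),
    'algo_fast': lambda: None,
}
print(autotune(candidates, warmups=2, iterations=5)[0])
```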
-
WenqingLan1 authored
**Description**
Added support for Grace CPU neo2 architecture in CPU Stream. Now CPU Stream supports dual socket benchmarking.

Example config for this arch support:
```yaml
cpu-stream:numa0:
  timeout: *default_timeout
  modes:
    - name: local
      parallel: no
  parameters:
    cpu_arch: neo2
    numa_mem_nodes: 0
    cores: 0 1 2 3 4 5 6 7 8
cpu-stream:numa1:
  timeout: *default_timeout
  modes:
    - name: local
      parallel: no
  parameters:
    cpu_arch: neo2
    numa_mem_nodes: 1
    cores: 64 65 66 67 68 69 70 71 72
cpu-stream:numa-spread:
  timeout: *default_timeout
  modes:
    - name: local
      parallel: no
  parameters:
    cpu_arch: neo2
    numa_mem_nodes: 0 1
    cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72
```

--------- Co-authored-by:dpower4 <dilipreddi@gmail.com>
-
- 18 Jun, 2025 1 commit
-
-
WenqingLan1 authored
Added GPU Stream benchmark - measures the GPU memory bandwidth and efficiency for double datatype through various memory operations including copy, scale, add, and triad. - added documentation for `gpu-stream` detailing its introduction, metrics, and descriptions. - added unit tests for `gpu-stream`. Example output is in `superbenchmark/tests/data/gpu_stream.log`.
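For orientation, the sketch below shows how bandwidth is conventionally derived for the four STREAM operations, assuming the standard STREAM accounting (copy and scale touch two arrays, add and triad touch three). The function name is hypothetical, and the actual `gpu-stream` benchmark may report its metrics differently.

```python
# Conventional STREAM kernels:
#   copy:  c[i] = a[i]
#   scale: b[i] = s * c[i]
#   add:   c[i] = a[i] + b[i]
#   triad: a[i] = b[i] + s * c[i]
ARRAYS_TOUCHED = {'copy': 2, 'scale': 2, 'add': 3, 'triad': 3}


def stream_bandwidth_gb_per_s(op, num_elements, seconds, element_bytes=8):
    """Bandwidth in GB/s for one STREAM operation; 8 bytes per element for double."""
    bytes_moved = ARRAYS_TOUCHED[op] * num_elements * element_bytes
    return bytes_moved / seconds / 1e9


# Hypothetical measurement: a triad over one million doubles taking one millisecond.
print(round(stream_bandwidth_gb_per_s('triad', 1_000_000, 0.001), 3))
```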
-
- 14 Jun, 2025 1 commit
-
-
Hongtao Zhang authored
In the current implementation, the CPU‑stream benchmark code renames the binary before the microbench base class can verify its existence, causing the default‐binary check to fail. This PR adds a “default” binary, built with the standard compile parameters, so that the base class can always find and validate it. Once the default binary is in place, the CPU‑stream code renames it as needed and re‑checks its presence before running the benchmark. The PR also enables CPU stream in the default settings. --------- Co-authored-by:Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 05 Jun, 2025 1 commit
-
-
Yifan Xiong authored
Update CODEOWNERS.
-
- 01 May, 2025 1 commit
-
-
pdr authored
adding gb200 cuda arch flag for cublaslt compilation
-
- 30 Apr, 2025 1 commit
-
-
Hongtao Zhang authored
- Upgrade OS of github runner used by lint to the latest. - Add symbolic link for clang-format to version 14. - Update importlib_metadata version since it is too old (inside nvcr.io/nvidia/pytorch:20.12-py3) and failed the 11.1 build. --------- Co-authored-by:
hongtaozhang <hongtaozhang@microsoft.com> Co-authored-by:
Yifan Xiong <yifan.xiong@microsoft.com>
-
- 09 Apr, 2025 1 commit
-
-
Yifan Xiong authored
Merge multi-arch image in build pipeline.
-
- 21 Mar, 2025 1 commit
-
-
pdr authored
**Description**
- Updated docker for 12.8.
- Use the latest cutlass release 3.8 with ARCH 100 (blackwell) support.
- Add the latest nccl-test release with ARCH 100 (blackwell).
- Updated msccl to support building for sm_100.
- No breaking changes, so backward compatible; tested with cuda 12.4.

--------- Co-authored-by:Hongtao Zhang <garyworkzht@gmail.com>
-
- 12 Mar, 2025 1 commit
-
-
Hongtao Zhang authored
This is caused by the matrix strategy’s default "fail-fast" setting. In GitHub Actions, when a job runs with a matrix, the individual configurations run in parallel. By default, if one matrix job (for example, the one labeled "rocm6_2_rocm6_2_x_superbe") fails, the remaining parallel jobs are canceled automatically. In our current image build pipeline, the arm64 build job is always canceled because of the rocm build job. As a temporary solution, a non-existent label is used in the job config to prevent the rocm build job from being scheduled. --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 08 Mar, 2025 1 commit
-
-
Hongtao Zhang authored
This enhancement addresses an issue in mypy where it may report missing pkg_resources even when ignore_missing_imports = True is set and the package is installed. Adding this configuration ensures that pkg_resources is properly skipped during type checking. Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 07 Mar, 2025 1 commit
-
-
Yifan Xiong authored
Add image build on arm64 arch.
-
- 04 Mar, 2025 1 commit
-
-
Jorge Esguerra authored
Improves logging info for diagnosis rule op baseline errors. This allows developers to easily detect errors in their rule files as well as baseline files, improving end-user experience.
-
- 25 Feb, 2025 2 commits
-
-
Maxim Evtush authored
Co-authored-by:
Yifan Xiong <yifan.xiong@microsoft.com> Co-authored-by:
Hongtao Zhang <garyworkzht@gmail.com>
-
Hongtao Zhang authored
Added support for Python 3.11, 3.12 and 3.13. yapf is not compatible with Python 3.12+, so we disable yapf in py3.12 for now. https://github.com/google/yapf/issues/1258 https://github.com/google/yapf/issues/1266 --------- Co-authored-by:
hongtaozhang <hongtaozhang@microsoft.com>
-
- 15 Feb, 2025 1 commit
-
-
Hongtao Zhang authored
Root Cause: 1. '_get_all_test_cases()' was called in '_parser' while '_parser' was defined in the base class. 2. in '_get_all_test_cases()', cmd path was not included. Fix: 1. Remove '_get_all_test_cases()' from '_parser'. 2. Construct path for cmd. --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 05 Feb, 2025 2 commits
-
-
Hongtao Zhang authored
**Description** 1. Fixed a bug: the nvbandwidth benchmark needs to handle 'N/A' values in the nvbandwidth command output. 2. Replaced the input format of test cases with a list. 3. Added an nvbandwidth configuration example to the default config files. --------- Co-authored-by:
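Handling 'N/A' cells when parsing a bandwidth matrix row can be sketched as follows. The row layout (a row index followed by values) and the helper name are assumptions for illustration, not the benchmark's actual parser.

```python
def parse_matrix_row(line):
    """Parse one matrix row into (row_index, values), mapping 'N/A' to None."""
    tokens = line.split()
    row_index = int(tokens[0])
    # 'N/A' marks a cell with no measurement, for example a device paired with itself.
    values = [None if token == 'N/A' else float(token) for token in tokens[1:]]
    return row_index, values


# Illustrative rows, not copied from a real log.
print(parse_matrix_row(' 0       N/A    276.07'))
print(parse_matrix_row(' 1    276.36       N/A'))
```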
hongtaozhang <hongtaozhang@microsoft.com> Co-authored-by:
Yifan Xiong <yifan.xiong@microsoft.com>
-
Kirill Prosvirov authored
**Description**
Today I was running a benchmark on my machine and encountered a fancy issue with tensorrt-inference. I got code 33, which according to the source code is:
```
MICROBENCHMARK_RESULT_PARSING_FAILURE = 33
```
I dived deep into the code and found the following problem. The parser stumbled upon getting to the following line:
```
[11/28/2024-17:03:11] [I] Latency: min = 7.2793 ms, max = 10.1606 ms, mean = 7.41642 ms, median = 7.39551 ms, percentile(99%) = 8 ms
```
I ran it separately on the code and found that the regular expression was not suitable for cases like this, where an INT appears as a result in milliseconds. That's why this pull request was created. I came up with the closest possible regular expression to fix this issue without introducing any other bug.

**Major Revision**
- 0.11.0
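A pattern that accepts both integer and decimal millisecond values for this log line might look like the sketch below. It is one possible expression written for illustration, not necessarily the one merged in the pull request.

```python
import re

# Matches an integer or a decimal number, e.g. "8" or "7.2793".
NUMBER = r'\d+(?:\.\d+)?'
LATENCY_RE = re.compile(
    rf'Latency: min = ({NUMBER}) ms, max = ({NUMBER}) ms, mean = ({NUMBER}) ms, '
    rf'median = ({NUMBER}) ms, percentile\(99%\) = ({NUMBER}) ms'
)

line = ('[11/28/2024-17:03:11] [I] Latency: min = 7.2793 ms, max = 10.1606 ms, '
        'mean = 7.41642 ms, median = 7.39551 ms, percentile(99%) = 8 ms')
match = LATENCY_RE.search(line)
values = [float(value) for value in match.groups()]
print(values)
```

Making the fractional part optional is what lets the integer `8` in the last field match.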
-
- 04 Feb, 2025 3 commits
-
-
pdr authored
Flake8 has moved away from gitlab to github. Updating the repo path in the pre commit config.
-
Hongtao Zhang authored
**Description** Introduce architecture support for version 10.0 in gemm-flops.
-
Yifan Xiong authored
Fix installation and lint issues: * Fix transformer installation in Python3.7 due to upgrade of safetensors. * Fix lint issues in mypy 1.14.1.
-