Commits · 44e35cda22a53d2af76454566a5fe17aaf9cfc86 · tsoc / superbenchmark

30 Jun, 2025 1 commit

Benchmarks: Add Mixture of Experts Model (#679) · 44e35cda

pdr authored Jun 30, 2025



Added MoE model using MixtralConfig.

1. Added 8x7b and 8x22b variants.
2. Requires high VRAM because all experts are loaded in memory, so
training is disabled due to memory constraints on the test worker.
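
To illustrate why loading every expert drives the memory requirement, here is a back-of-the-envelope parameter count for an 8x7b-style model. The hyperparameters are the publicly documented Mixtral-8x7B values and are an assumption here; the values this benchmark passes to MixtralConfig are not shown in this commit and may differ.

```python
# Rough parameter count for a Mixtral-style MoE model.
# Assumption: publicly documented Mixtral-8x7B hyperparameters; the
# benchmark's actual configuration is not shown in this commit.
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
HEADS = 32
KV_HEADS = 8
VOCAB = 32000
EXPERTS = 8

head_dim = HIDDEN // HEADS
# Attention: q and o projections are HIDDEN x HIDDEN; k and v use grouped KV heads.
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * (KV_HEADS * head_dim)
# Each expert is a gated MLP with three HIDDEN x INTERMEDIATE matrices.
one_expert = 3 * HIDDEN * INTERMEDIATE
router = HIDDEN * EXPERTS
norms = 2 * HIDDEN

per_layer = attention + EXPERTS * one_expert + router + norms
embeddings = 2 * VOCAB * HIDDEN  # input embedding plus output head
total_params = LAYERS * per_layer + embeddings + HIDDEN  # plus final norm

expert_params = LAYERS * EXPERTS * one_expert
bytes_16bit = 2 * total_params

print(f"total parameters: {total_params / 1e9:.1f}B")
print(f"share held by experts: {expert_params / total_params:.1%}")
print(f"weights in 16-bit precision: {bytes_16bit / 2**30:.0f} GiB")
```

This counts weights only; activations, gradients, and optimizer state for training come on top of it.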

---------
Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

44e35cda

26 Jun, 2025 1 commit

Benchmarks - Add deepseek megatron-lm benchmark (#713) · deef9a3d

Yuting Jiang authored Jun 27, 2025



**Description**
Add deepseek megatron-lm benchmark.

---------
Co-authored-by: yukirora <yuting.jiang@microsoft.com>
Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

deef9a3d

25 Jun, 2025 1 commit

Dockerfile - Add cuda12.9 docker image (#716) · a56356d8

guoshzhao authored Jun 25, 2025



**Description**
Add cuda 12.9 dockerfile and build in pipeline.

---------
Co-authored-by: Guoshuai Zhao <microsoft@microsoft.com>
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>

a56356d8

24 Jun, 2025 1 commit

Benchmarks - Add FP4 GEMM FLOPS support for cublaslt_gemm benchmark (#711) · b795477e

guoshzhao authored Jun 24, 2025



**Description**
Add FP4 precision support for cublaslt_gemm benchmark.

**Major Revision**
- Add new type `fp4e2m1` and `__nv_fp4_e2m1`.
- For FP4 matmul, the precision of MatrixC (add) must be FP16 and the
precision of MatrixD (output) must be FP4; otherwise, it will not work.
- Add macro `CUDA_VERSION` to resolve the compatibility issue of
different CUDA versions.
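
For reference, `e2m1` denotes 1 sign bit, 2 exponent bits, and 1 mantissa bit. A minimal decoding sketch, assuming the OCP microscaling E2M1 layout (exponent bias 1, no infinities or NaNs); the function name is illustrative and not part of the benchmark:

```python
def decode_fp4_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (0..15) into a Python float."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        # Subnormal: no implicit leading one, effective exponent is 1 - bias = 0.
        magnitude = mantissa * 0.5
    else:
        # Normal: implicit leading one, exponent bias of 1.
        magnitude = (1.0 + mantissa * 0.5) * 2.0 ** (exponent - 1)
    return sign * magnitude


# Decode the eight non-negative codes.
positive_values = [decode_fp4_e2m1(c) for c in range(8)]
print(positive_values)
```

Under this layout the format covers only a handful of magnitudes between 0 and 6.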

---------
Co-authored-by: Ubuntu <aiperf@aiperf000000.hp5z1gqeinfufbj2u3jcty5fme.cdmx.internal.cloudapp.net>
Co-authored-by: AVA <39534996+avazr@users.noreply.github.com>
Co-authored-by: Guoshuai Zhao <microsoft@microsoft.com>

b795477e

20 Jun, 2025 2 commits

Benchmark - Support autotuning in cublaslt gemm (#706) · 60b13256

Babak Hejazi authored Jun 20, 2025

**Description**
Enable autotuning as an opt-in mode when benchmarking cuBLASLt via
`cublaslt_gemm`.

The implementation is based on
https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu

The behavior of original benchmark command remains unchanged, e.g.:
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w10000 -i 1000 -t fp8e4m3`

The new opt-in options are `-a` (enable autotune), `-I` (autotune
iterations, default 50, same as the default for `-i`), and `-W` (autotune
warmups, default 20, same as the default for `-w`), e.g.:
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3
-a`
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a
-I 10 -W 10`
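
Conceptually, autotuning times each candidate after a few warmup runs and keeps the fastest. The sketch below is not the cuBLASLt implementation: the candidates are hypothetical Python callables, and the `warmups` and `iterations` parameters merely mirror `-W` and `-I`.

```python
import time


def time_candidate(run, warmups, iterations, clock=time.perf_counter):
    """Return the mean time per call of `run`, excluding warmup calls."""
    for _ in range(warmups):
        run()
    start = clock()
    for _ in range(iterations):
        run()
    return (clock() - start) / iterations


def autotune(candidates, warmups=20, iterations=50, clock=time.perf_counter):
    """Time every candidate and return (name of the fastest, all timings)."""
    timings = {
        name: time_candidate(run, warmups, iterations, clock)
        for name, run in candidates.items()
    }
    best = min(timings, key=timings.get)
    return best, timings


def manual_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total


# Hypothetical candidates standing in for alternative GEMM algorithms.
demo_candidates = {
    "builtin_sum": lambda: sum(range(2000)),
    "manual_loop": lambda: manual_sum(2000),
}
demo_best, demo_timings = autotune(demo_candidates, warmups=5, iterations=20)
print(demo_best, demo_timings)
```

The `clock` argument is injectable so the selection logic can be checked deterministically without relying on wall-clock noise.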

**Note:** This PR also changes the default `gemm_compute_type` for BF16
and FP16 to `CUBLAS_COMPUTE_32F`.

**Further observations:** 
1. The support matrix of the `cublaslt_gemm` could be further extended
in the future to support non-FP16 output as well for FP8 inputs.
2. Currently, the input matrices are initialized with values of 1.0 and
2.0 which makes them less demanding in terms of power. Another future
extension could be to enable another fill mode for, say, uniform random
numbers between -1 and 1.
3. cuBLAS workspace recommendations are listed under
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace



Update (June 10, 2025): verified using higher level test driver with
these commands:

1. inline:
```
python3 -c "
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.common.utils import logger

parameters = (
    '--num_warmup 10 --num_steps 50 '
    '--shapes 512,512,512 1024,1024,1024 --in_types fp16 fp32 '
    '--enable_autotune --num_warmup_autotune 20 --num_steps_autotune 50'
)
context = BenchmarkRegistry.create_benchmark_context(
    'cublaslt-gemm', platform=Platform.CUDA, parameters=parameters
)
benchmark = BenchmarkRegistry.launch_benchmark(context)
logger.info('Result: {}'.format(benchmark.result))
"
```

2. newly added script: 
`python3 examples/benchmarks/cublaslt_function.py`

---------
Co-authored-by: Babak Hejazi <babakh@nvidia.com>

60b13256

Benchmark - Add Grace CPU support for CPU Stream (#719) · 0b8d1fd4

WenqingLan1 authored Jun 19, 2025



**Description**
Added support for Grace CPU neo2 architecture in CPU Stream. Now CPU
Stream supports dual socket benchmarking.

Example config for this arch support:
```yaml
    cpu-stream:numa0:
      timeout: *default_timeout
      modes:
      - name: local
        parallel: no
      parameters:
        cpu_arch: neo2
        numa_mem_nodes: 0
        cores: 0 1 2 3 4 5 6 7 8
    cpu-stream:numa1:
      timeout: *default_timeout
      modes:
      - name: local
        parallel: no
      parameters:
        cpu_arch: neo2
        numa_mem_nodes: 1
        cores: 64 65 66 67 68 69 70 71 72
    cpu-stream:numa-spread:
      timeout: *default_timeout
      modes:
      - name: local
        parallel: no
      parameters:
        cpu_arch: neo2
        numa_mem_nodes: 0 1
        cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72
```

---------
Co-authored-by: dpower4 <dilipreddi@gmail.com>

0b8d1fd4

18 Jun, 2025 1 commit

Benchmarks - Add GPU Stream Micro Benchmark (#697) · 4eddd50a

WenqingLan1 authored Jun 18, 2025

Added GPU Stream benchmark - measures the GPU memory bandwidth and
efficiency for double datatype through various memory operations
including copy, scale, add, and triad.
- added documentation for `gpu-stream` detailing its introduction,
metrics, and descriptions.
- added unit tests for `gpu-stream`. Example output is in
`superbenchmark/tests/data/gpu_stream.log`.
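
The four operations follow the classic STREAM kernel definitions. A plain-Python sketch for illustration only (the benchmark itself runs GPU kernels); the byte counts use the conventional STREAM accounting of one 8-byte double per array access, and the function names are illustrative:

```python
from array import array


def stream_kernels(a, b, c, scalar):
    """Apply the four STREAM operations in order; arrays are modified in place."""
    n = len(a)
    for i in range(n):
        c[i] = a[i]                  # copy
    for i in range(n):
        b[i] = scalar * c[i]         # scale
    for i in range(n):
        c[i] = a[i] + b[i]           # add
    for i in range(n):
        a[i] = b[i] + scalar * c[i]  # triad


def bytes_moved(n, element_size=8):
    """Bytes read plus written per kernel for arrays of n elements."""
    return {
        "copy": 2 * n * element_size,
        "scale": 2 * n * element_size,
        "add": 3 * n * element_size,
        "triad": 3 * n * element_size,
    }


def bandwidth_gb_per_s(num_bytes, seconds):
    """Convert bytes moved in `seconds` into GB/s (10^9 bytes per second)."""
    return num_bytes / seconds / 1e9


a = array("d", [1.0] * 4)
b = array("d", [2.0] * 4)
c = array("d", [0.0] * 4)
stream_kernels(a, b, c, scalar=3.0)
print(list(a), list(b), list(c))
```

Bandwidth for a kernel is then its byte count divided by the measured kernel time.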

4eddd50a

14 Jun, 2025 1 commit

microbenchmark - CPU Stream Benchmark Revise (#712) · 991c0051

Hongtao Zhang authored Jun 14, 2025



In the current implementation, the CPU‑stream benchmark code renames the
binary before the microbench base class can verify its existence,
causing the default‐binary check to fail.

This PR adds a “default” binary—built with the standard compile
parameters—so that the base class can always find and validate it. Once
the default binary is in place, the CPU‑stream code will rename it as
needed and re‑check its presence before running the benchmark.

The PR also enables CPU stream in the default settings.

---------
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

991c0051

05 Jun, 2025 1 commit

Update CODEOWNERS (#718) · 431bf19c

Yifan Xiong authored Jun 05, 2025

Update CODEOWNERS.

431bf19c

01 May, 2025 1 commit

cuda arch flag for cublaslt (#701) · 3e090482

pdr authored Apr 30, 2025

adding gb200 cuda arch flag for cublaslt compilation

3e090482

30 Apr, 2025 1 commit

CI/CD - Update OS of runner to the latest. (#702) · 330c68aa

Hongtao Zhang authored Apr 30, 2025



- Upgrade OS of github runner used by lint to the latest.
- Add symbolic link for clang-format to version 14.
- Update importlib_metadata version since it is too old (inside
nvcr.io/nvidia/pytorch:20.12-py3) and failed the 11.1 build.

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>

330c68aa

09 Apr, 2025 1 commit

CI/CD - Merge multi-arch image (#696) · b13ef28f

Yifan Xiong authored Apr 08, 2025

Merge multi-arch image in build pipeline.

b13ef28f

21 Mar, 2025 1 commit

Dockerfile - Support cuda12.8 for Blackwell arch (#682) · 294f1f20

pdr authored Mar 20, 2025



**Description**
- Updated docker for CUDA 12.8.
- Use the latest cutlass release 3.8 with ARCH 100 (Blackwell) support.
- Add the latest nccl-test release with ARCH 100 (Blackwell) support.
- Updated msccl to support builds for sm_100.
- No breaking changes; backward compatibility tested with CUDA 12.4.

---------
Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>

294f1f20

12 Mar, 2025 1 commit

CI/CD - Update label in the ROCm image build (#693) · 48cd8a3c

Hongtao Zhang authored Mar 12, 2025



GitHub Actions matrix strategies default to "fail-fast": the individual
matrix configurations run in parallel, and if one matrix job (for
example, the one labeled "rocm6_2_rocm6_2_x_superbe") fails, the
remaining parallel jobs are canceled automatically.

In our current image build pipeline, the arm64 build job is always
canceled because of the rocm build job. As a temporary solution, use a
non-existent label in the job config to prevent the rocm build job from
being scheduled.

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

48cd8a3c

08 Mar, 2025 1 commit

Lint: Enhancement of ignoring errors for import pkg_resources (#692) · 5e32859a

Hongtao Zhang authored Mar 07, 2025

This enhancement addresses an issue in mypy where it may report missing
pkg_resources even when ignore_missing_imports = True is set and the
package is installed. Adding this configuration ensures that
pkg_resources is properly skipped during type checking.

Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

5e32859a

07 Mar, 2025 1 commit

CI/CD - Add image build on arm64 arch (#690) · 300df46b

Yifan Xiong authored Mar 07, 2025

Add image build on arm64 arch.

300df46b

04 Mar, 2025 1 commit

Analyzer - Enhance logging information for diagnosis rule op baseline errors. (#689) · 64edc9c5

Jorge Esguerra authored Mar 04, 2025

Improves logging info for diagnosis rule op baseline errors. This allows
developers to easily detect errors in their rule files as well as
baseline files, improving end-user experience.

64edc9c5

25 Feb, 2025 2 commits

Docs - Fix typos in documentation and code files (#686) · 71573f3c

Maxim Evtush authored Feb 25, 2025


Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>
Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>

71573f3c

Init latest python support. (#687) · 064ca1d0

Hongtao Zhang authored Feb 24, 2025

Added support for Python 3.11, 3.12 and 3.13.

yapf is not compatible with python3.12+, so we disable yapf in py3.12
for now.
https://github.com/google/yapf/issues/1258
https://github.com/google/yapf/issues/1266



---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

064ca1d0

15 Feb, 2025 1 commit

Bugfix: Avoid Unintended nvbandwidth Function Calls in All Benchmarks (#685) · 41a484fa

Hongtao Zhang authored Feb 14, 2025



Root Cause:

1. '_get_all_test_cases()' was called in '_parser' while '_parser' was
defined in the base class.
2. In '_get_all_test_cases()', the cmd path was not included.

Fix:

1. Remove '_get_all_test_cases()' from '_parser'.
2. Construct path for cmd.

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

41a484fa

05 Feb, 2025 2 commits

Bugfix - nvbandwidth benchmark need to handle N/A value (#675) · 45d06647

Hongtao Zhang authored Feb 05, 2025



**Description**

1. Fixed a bug where the nvbandwidth benchmark did not handle 'N/A'
values in the nvbandwidth cmd output.
2. Replaced the input format of test cases with a list.
3. Added an nvbandwidth configuration example to the default config files.
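
A minimal sketch of tolerating 'N/A' cells when parsing one row of a bandwidth matrix. The row layout and the helper name are assumptions for illustration; this is not the parser shipped in this PR.

```python
def parse_matrix_row(line):
    """Parse '<row index> <v0> <v1> ...' and return (row, {col: value}).

    Cells reported as 'N/A' are skipped rather than converted to float.
    """
    fields = line.split()
    row = int(fields[0])
    values = {}
    for col, cell in enumerate(fields[1:]):
        if cell.upper() == "N/A":
            continue  # no measurement for this pair
        values[col] = float(cell)
    return row, values


# Hypothetical sample row: column 0 has no measurement.
row, values = parse_matrix_row(" 0       N/A    276.07")
print(row, values)
```

Skipping the cell keeps the metric absent instead of failing the whole parse on `float('N/A')`.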

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>

45d06647

Bug - Fix tensorrt-inference parsing (#674) · 7af7c0b7

Kirill Prosvirov authored Feb 05, 2025

**Description**
While running a benchmark on my machine today, I encountered an issue
with tensorrt-inference.
I got code 33, which according to the source code is:
```
MICROBENCHMARK_RESULT_PARSING_FAILURE = 33
```
I dug into the code and found the following problem. The parser
stumbled on the following line:
```
[11/28/2024-17:03:11] [I] Latency: min = 7.2793 ms, max = 10.1606 ms, mean = 7.41642 ms, median = 7.39551 ms, percentile(99%) = 8 ms
```
I tested this line separately and found that the regular expression
does not handle cases like this, where a result in milliseconds is an
integer.
That is why this pull request was created.
I came up with the closest possible regular expression that fixes this
issue without introducing any other bug.
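
To illustrate the failure mode: a pattern that requires a decimal point misses `8 ms`, while making the fractional part optional accepts both forms. The patterns below are illustrative and are not the exact expressions from this PR.

```python
import re

LOG_LINE = (
    "[11/28/2024-17:03:11] [I] Latency: min = 7.2793 ms, max = 10.1606 ms, "
    "mean = 7.41642 ms, median = 7.39551 ms, percentile(99%) = 8 ms"
)

# Requires digits on both sides of a decimal point, so '8 ms' does not match.
STRICT = re.compile(r"(\w+)(?:\(\d+%\))? = (\d+\.\d+) ms")
# Optional fractional part: matches '8' as well as '7.2793'.
RELAXED = re.compile(r"(\w+)(?:\(\d+%\))? = (\d+(?:\.\d+)?) ms")


def parse_latency(line, pattern):
    """Return {metric name: value in ms} for every match in the line."""
    return {name: float(value) for name, value in pattern.findall(line)}


strict_result = parse_latency(LOG_LINE, STRICT)
relaxed_result = parse_latency(LOG_LINE, RELAXED)
print(sorted(strict_result))
print(sorted(relaxed_result))
```

With the strict pattern the percentile metric is silently absent, which a caller expecting all five fields would treat as a parsing failure.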

**Major Revision**
- 0.11.0

7af7c0b7

04 Feb, 2025 3 commits

Update Flake8 repo (#683) · b55279ad

pdr authored Feb 04, 2025

Flake8 has moved from GitLab to GitHub.
Update the repo path in the pre-commit config.

b55279ad

Microbenchmark - Add arch support for 10.0 in gemm-flops (#680) · 1d09b111

Hongtao Zhang authored Feb 03, 2025

**Description**
Introduce architecture support for version 10.0 in gemm-flops.

1d09b111

Setup - Fix installation and lint issues (#684) · 424f7b5b

Yifan Xiong authored Feb 03, 2025

Fix installation and lint issues:

* Fix transformer installation in Python3.7 due to upgrade of safetensors.
* Fix lint issues in mypy 1.14.1.

424f7b5b

08 Jan, 2025 1 commit

Bump nanoid from 3.3.6 to 3.3.8 in /website (#678) · 060f4f82

dependabot[bot] authored Jan 07, 2025

Bumps [nanoid](https://github.com/ai/nanoid) from 3.3.6 to 3.3.8.
- [Release notes](https://github.com/ai/nanoid/releases)
- [Changelog](https://github.com/ai/nanoid/blob/main/CHANGELOG.md)
- [Commits](ai/nanoid@3.3.6...3.3.8)

---
updated-dependencies:
- dependency-name: nanoid
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>

060f4f82

28 Nov, 2024 2 commits

Benchmarks - Add LLaMA-2 Models (#668) · 249e21c1

pdr authored Nov 27, 2024

Added llama benchmark - training and inference in accordance with the
existing pytorch models implementation like gpt2, lstm etc.

- added llama fp8 unit test for better code coverage, to reduce memory
required
- updated transformers version >= 4.28.0 for LlamaConfig
- set tokenizers version <= 0.20.3 to avoid 0.20.4 version
[issues](https://github.com/huggingface/tokenizers/issues/1691) with
py3.8
- added llama2 to tensorrt
- llama2 tests not added to test_tensorrt_inference_performance.py due
to large memory requirement for worker gpu. tests validated separately
on gh200

---------
Co-authored-by: dpatlolla <dpatlolla@microsoft.com>

249e21c1

Bug Fix - Fix stderr message in gpu-copy benchmark (#673) · 4e6935ab

pdr authored Nov 27, 2024

Fix ordering of args in err messages.

4e6935ab

27 Nov, 2024 1 commit

CI/CD - Upgrade dependency versions in pipeline (#671) · 96f5ccea

Yifan Xiong authored Nov 26, 2024



Upgrade dependency versions in Azure pipeline:

* Remove Python 3.6 and add Python 3.10 for cpu-unit-test
* Upgrade CUDA from 11.1 to 12.4 for cuda-unit-test
* Update labels accordingly

---------
Co-authored-by: Dilip Patlolla <dilipreddi@gmail.com>

96f5ccea

22 Nov, 2024 1 commit

Benchmarks: micro benchmarks - add nvbandwidth benchmark (#669) · 7cef624e

Hongtao Zhang authored Nov 21, 2024



**Description**

Add nvbandwidth benchmark.

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

7cef624e

21 Nov, 2024 2 commits

Benchmarks: micro benchmarks - add nvbandwidth build (#665) · c8c52eb2

Hongtao Zhang authored Nov 21, 2024

**Description**
Add nvbandwidth build to repo

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

c8c52eb2

Docs - Update CODEOWNERS (#670) · 54eeac25

Yifan Xiong authored Nov 20, 2024

Update CODEOWNERS for docs.

54eeac25

20 Nov, 2024 1 commit

Benchmarks: micro benchmarks - add general CPU bandwidth and latency benchmark (#662) · 9c35e80a

Hongtao Zhang authored Nov 20, 2024



**Description**
Add micro benchmark to measure general CPU bandwidth and latency without 'mlc'.

Test output:
```
{
"cpu-memory-bw-latency/return_code": 0,
"cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_bw": 5388.75021,
"cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_lat": 0.185571786,
"cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_bw": 4634.82028,
"cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_lat": 0.215758096,
}
```

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

9c35e80a

15 Nov, 2024 1 commit

Dependency - Bump onnxruntime-gpu version from 1.10.0 to 1.12.0 (#663) · a8a7bed2

Hongtao Zhang authored Nov 14, 2024



**Description**

Bump onnxruntime-gpu from 1.10.0 to 1.12.0.

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

a8a7bed2

07 Nov, 2024 2 commits

Bump webpack from 5.76.1 to 5.96.1 in /website (#661) · 83ee4eba

dependabot[bot] authored Nov 07, 2024

Bumps [webpack](https://github.com/webpack/webpack) from 5.76.1 to 5.96.1.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](webpack/webpack@v5.76.1...v5.96.1)

---
updated-dependencies:
- dependency-name: webpack
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>

83ee4eba

Bump cookie and express in /website (#655) · c9b2b455

dependabot[bot] authored Nov 07, 2024

Bumps [cookie](https://github.com/jshttp/cookie) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together.

Updates `cookie` from 0.6.0 to 0.7.1
- [Release notes](https://github.com/jshttp/cookie/releases)
- [Commits](jshttp/cookie@v0.6.0...v0.7.1)

Updates `express` from 4.21.0 to 4.21.1
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/4.21.1/History.md)
- [Commits](expressjs/express@4.21.0...4.21.1)

---
updated-dependencies:
- dependency-name: cookie
  dependency-type: indirect
- dependency-name: express
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>

c9b2b455

06 Nov, 2024 1 commit

Dockerfile - Add support for arm64 build (#660) · 47949127

pdr authored Nov 06, 2024

Add support for arm64 build:

- Updated dockerfile for arm64 build
- extend cpu stream compilation for neoverse 
- handle onnxruntime-gpu installation
- third party builds filtering based on arch
- disable cuda decode perf build for non x86

47949127

05 Nov, 2024 1 commit

Bug Fix - Fix numa error on grace cpu in gpu-copy (#658) · 59d36f7f

pdr authored Nov 05, 2024

The current GPU Copy BW Performance benchmark fails on Nvidia Grace
systems. These systems expose memory-only numa nodes, so
numa_run_on_node fails for such nodes and the benchmark halts completely.

This fix checks whether the numa node has any CPU cores assigned; if it
has none, that node is skipped during the args creation and the
benchmark continues.
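
The check can be sketched as follows. This is an illustration in Python, not the benchmark's implementation; it assumes the Linux sysfs `cpulist` format (for example `0-71`, or an empty string for a memory-only node), and the helper names are hypothetical.

```python
def parse_cpulist(cpulist):
    """Expand a sysfs-style CPU list such as '0-3,8' into a list of core IDs."""
    cores = []
    for part in cpulist.strip().split(","):
        if not part:
            continue  # an empty list means the node has no CPU cores
        if "-" in part:
            first, last = part.split("-")
            cores.extend(range(int(first), int(last) + 1))
        else:
            cores.append(int(part))
    return cores


def nodes_with_cpus(node_cpulists):
    """Return NUMA node IDs that have at least one CPU core assigned."""
    return [
        node
        for node, cpulist in sorted(node_cpulists.items())
        if parse_cpulist(cpulist)
    ]


# Hypothetical topology: nodes 0 and 1 have cores, node 2 is memory-only.
topology = {0: "0-71", 1: "72-143", 2: ""}
usable_nodes = nodes_with_cpus(topology)
print(usable_nodes)
```

Only the nodes returned here would be used when building per-node benchmark arguments.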

59d36f7f

02 Nov, 2024 1 commit

CI/CD - Update Image Build Pipeline (#659) · 61770b89

Yifan Xiong authored Nov 01, 2024

**Description**

Update image build.

**Major Revision**

* Remove ROCm 6.0 image due to outdated packages
* Remove build tag for ROCm
* Preserve build cache for 30 days

61770b89

10 Oct, 2024 1 commit

Release - SuperBench v0.11.0 (#654) · 949f9cb4

Yuting Jiang authored Oct 10, 2024



**Description**
Cherry pick bug fixes from v0.11.0 to main

**Major Revision**
* #645 
* #648 
* #646 
* #647 
* #651 
* #652 
* #650

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>

949f9cb4