Commits · 294f1f20bed36200cfd35359b62bc8bfa25de901 · tsoc / superbenchmark

21 Mar, 2025 1 commit

Dockerfile - Support cuda12.8 for Blackwell arch (#682) · 294f1f20

pdr authored Mar 20, 2025



**Description**
Updated the Docker image for CUDA 12.8.
Use the latest cutlass release, 3.8, with ARCH 100 (Blackwell) support.
Add the latest nccl-tests release with ARCH 100 (Blackwell) support.
Updated msccl to support building for sm_100.
No breaking changes; backward compatibility tested with CUDA 12.4.

---------
Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>

294f1f20

25 Feb, 2025 1 commit

Docs - Fix typos in documentation and code files (#686) · 71573f3c

Maxim Evtush authored Feb 25, 2025


Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>
Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>

71573f3c

15 Feb, 2025 1 commit

Bugfix: Avoid Unintended nvbandwidth Function Calls in All Benchmarks (#685) · 41a484fa

Hongtao Zhang authored Feb 14, 2025



Root Cause:

1. '_get_all_test_cases()' was called in '_parser', and '_parser' is
defined in the base class, so the call ran for every benchmark.
2. In '_get_all_test_cases()', the cmd path was not included.

Fix:

1. Remove '_get_all_test_cases()' from '_parser'.
2. Construct path for cmd.

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

41a484fa

05 Feb, 2025 2 commits

Bugfix - nvbandwidth benchmark need to handle N/A value (#675) · 45d06647

Hongtao Zhang authored Feb 05, 2025



**Description**

1. Fixed a bug where the nvbandwidth benchmark did not handle 'N/A'
values in nvbandwidth cmd output.
2. Replaced the input format of test cases with a list.
3. Added an nvbandwidth configuration example to the default config files.
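
A minimal sketch of tolerating 'N/A' cells when parsing a matrix row. The row layout (a row index followed by whitespace-separated cells), the sample values, and the helper name `parse_matrix_row` are assumptions for illustration; this is not the benchmark's actual parser.

```python
import math


def parse_matrix_row(line):
    """Parse one matrix row: a row index followed by numeric or 'N/A' cells.

    Hypothetical helper: 'N/A' cells become NaN so callers can skip them.
    """
    tokens = line.split()
    row_index = int(tokens[0])
    values = [math.nan if tok == 'N/A' else float(tok) for tok in tokens[1:]]
    return row_index, values


# Assumed sample row; the diagonal cell has no measurement.
row, values = parse_matrix_row(' 0       N/A    276.07    276.36')
# Keep only the cells that carry a measurement.
valid = {col: v for col, v in enumerate(values) if not math.isnan(v)}
print(row, valid)
```

Mapping 'N/A' to NaN keeps column positions intact, so the remaining cells can still be reported under the right column index.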

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>

45d06647

Bug - Fix tensorrt-inference parsing (#674) · 7af7c0b7

Kirill Prosvirov authored Feb 05, 2025

**Description**
Today I was running a benchmark on my machine and ran into an issue
with tensorrt-inference.
I got return code 33, which according to the source code is:
```
MICROBENCHMARK_RESULT_PARSING_FAILURE = 33
```
I dug into the code and found the problem: the parser failed on the
following line:
```
[11/28/2024-17:03:11] [I] Latency: min = 7.2793 ms, max = 10.1606 ms, mean = 7.41642 ms, median = 7.39551 ms, percentile(99%) = 8 ms
```
I tested the regular expression against this line separately and found
that it does not handle cases where a value in milliseconds is an
integer (here, `8 ms`).
This pull request fixes that with a regular expression that stays as
close as possible to the original, to avoid introducing other bugs.
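
The failure mode can be reproduced with a small sketch. The pattern below is an illustrative assumption, not the expression merged in this pull request; it only shows one way to accept both integer and decimal values in the log line quoted above.

```python
import re

# Matches 'name = value ms' where value may be an integer (8) or a decimal (7.2793).
# The optional '(NN%)' suffix covers names such as 'percentile(99%)'.
LATENCY_FIELD = re.compile(r'(\w+(?:\(\d+%\))?) = (\d+(?:\.\d+)?) ms')

line = (
    '[11/28/2024-17:03:11] [I] Latency: min = 7.2793 ms, max = 10.1606 ms, '
    'mean = 7.41642 ms, median = 7.39551 ms, percentile(99%) = 8 ms'
)

latency = {name: float(value) for name, value in LATENCY_FIELD.findall(line)}
print(latency)
```

Making the fractional part optional with `(?:\.\d+)?` is what lets `8 ms` match; a pattern that requires a decimal point would skip that field.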

**Major Revision**
- 0.11.0

7af7c0b7

04 Feb, 2025 1 commit
- Microbenchmark - Add arch support for 10.0 in gemm-flops (#680) · 1d09b111
  Hongtao Zhang authored Feb 03, 2025
```
**Description**
Introduce architecture support for version 10.0 in gemm-flops.
```
  1d09b111
28 Nov, 2024 2 commits

Benchmarks - Add LLaMA-2 Models (#668) · 249e21c1

pdr authored Nov 27, 2024

Added llama benchmark - training and inference in accordance with the
existing pytorch models implementation like gpt2, lstm etc.

- added llama fp8 unit test for better code coverage, to reduce memory
required
- updated transformers version to >= 4.28.0 for LlamaConfig
- set tokenizers version <= 0.20.3 to avoid 0.20.4 version
[issues](https://github.com/huggingface/tokenizers/issues/1691) with
py3.8
- added llama2 to tensorrt
- llama2 tests not added to test_tensorrt_inference_performance.py due
to large memory requirement for worker gpu. tests validated separately
on gh200

---------
Co-authored-by: dpatlolla <dpatlolla@microsoft.com>

249e21c1

Bug Fix - Fix stderr message in gpu-copy benchmark (#673) · 4e6935ab
pdr authored Nov 27, 2024
```
Fix ordering of args in err messages.
```
4e6935ab

22 Nov, 2024 1 commit

Benchmarks: micro benchmarks - add nvbandwidth benchmark (#669) · 7cef624e

Hongtao Zhang authored Nov 21, 2024



**Description**

Add nvbandwidth benchmark.

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

7cef624e

20 Nov, 2024 1 commit

Benchmarks: micro benchmarks - add general CPU bandwidth and latency benchmark (#662) · 9c35e80a

Hongtao Zhang authored Nov 20, 2024



**Description**
Add micro benchmark to measure general CPU bandwidth and latency without 'mlc'.

Test output:
```
{
"cpu-memory-bw-latency/return_code": 0,
"cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_bw": 5388.75021,
"cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_lat": 0.185571786,
"cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_bw": 4634.82028,
"cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_lat": 0.215758096,
}
```
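
A short sketch of grouping the metrics above by NUMA node pair. It assumes the two indices in each metric name identify a pair of NUMA nodes and that the `_bw` and `_lat` suffixes denote bandwidth and latency; the regular expression and variable names are illustrative only.

```python
import re

# Sample values copied from the test output above.
result = {
    'cpu-memory-bw-latency/return_code': 0,
    'cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_bw': 5388.75021,
    'cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_lat': 0.185571786,
    'cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_bw': 4634.82028,
    'cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_lat': 0.215758096,
}

# Assumed naming scheme: ..._numa_<first>_<second>_<bw|lat>
METRIC = re.compile(r'mem_bandwidth_matrix_numa_(\d+)_(\d+)_(bw|lat)$')

matrix = {}
for name, value in result.items():
    match = METRIC.search(name)
    if match:
        first, second, kind = int(match.group(1)), int(match.group(2)), match.group(3)
        matrix.setdefault((first, second), {})[kind] = value

for pair, metrics in sorted(matrix.items()):
    print(pair, metrics)
```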

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

9c35e80a

06 Nov, 2024 1 commit

Dockerfile - Add support for arm64 build (#660) · 47949127

pdr authored Nov 06, 2024

Add support for arm64 build:

- Updated dockerfile for arm64 build
- extend cpu stream compilation for neoverse 
- handle onnxruntime-gpu installation
- third party builds filtering based on arch
- disable cuda decode perf build for non x86

47949127

05 Nov, 2024 1 commit

Bug Fix - Fix numa error on grace cpu in gpu-copy (#658) · 59d36f7f

pdr authored Nov 05, 2024

The current GPU Copy BW Performance benchmark fails on NVIDIA Grace
systems. These systems have memory-only NUMA nodes, so numa_run_on_node
fails for those nodes and the benchmark halts completely.

This fix checks whether each NUMA node has CPU cores assigned. If a node
has none, that node is skipped during args creation and the benchmark
continues.
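
The check can be sketched as follows. This is not the benchmark's implementation: it assumes the Linux sysfs layout, where each NUMA node directory has a `cpulist` file that is empty for memory-only nodes, and the function name `numa_nodes_with_cpus` is hypothetical.

```python
from pathlib import Path


def numa_nodes_with_cpus(sysfs_root='/sys/devices/system/node'):
    """Return ids of NUMA nodes that have at least one CPU core assigned.

    Assumes the Linux sysfs layout; memory-only nodes have an empty
    'cpulist' file and are skipped.
    """
    nodes = []
    for node_dir in Path(sysfs_root).glob('node[0-9]*'):
        cpulist = node_dir / 'cpulist'
        if cpulist.is_file() and cpulist.read_text().strip():
            nodes.append(int(node_dir.name[len('node'):]))
    return sorted(nodes)
```

Calling `numa_nodes_with_cpus()` returns only the node ids that are safe to bind a process to; the `sysfs_root` parameter exists so the function can be pointed at a test directory.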

59d36f7f

26 Jul, 2024 1 commit
- Benchmarks: Micro benchmarks - add support for NVIDIA L4/L40/L40s GPUs in gemm-flops (#634) · e304cf15
  Yuting Jiang authored Jul 26, 2024
```
**Description**
Add support for GPU ARCH 8.9 for NVIDIA L4/L40/L40s GPUs in gemm-flops.
```
  e304cf15
02 Apr, 2024 1 commit
- Benchmarks: Revise Code - Add hipblasLt tuning to dist-inference cpp implementation (#616) · cc89ee59
  Ziyue Yang authored Apr 02, 2024
```
**Description**
Adds hipblasLt tuning to dist-inference cpp implementation.
```
  cc89ee59
08 Jan, 2024 1 commit

Release - SuperBench v0.10.0 (#607) · 2c88db90

Yifan Xiong authored Jan 07, 2024



**Description**

Cherry-pick bug fixes from v0.10.0 to main.

**Major Revisions**

* Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
* Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
* Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
* Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
* Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
* CI/CD - Add ndv5 topo file #597
* Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
* Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
* Dockerfile - Bug fix for rocm docker build and deploy #598
* Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
* Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
* Monitor - Upgrade pyrsmi to amdsmi python library. #601
* Benchmarks: Micro benchmarks - add fp8 and initialization for hipblaslt benchmark #605
* Dockerfile - Add rocm6.0 dockerfile #602
* Bug Fix - Bug fix for latest megatron-lm benchmark #600
* Docs - Upgrade version and release note #606
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
Co-authored-by: guoshzhao <guzhao@microsoft.com>

2c88db90

11 Dec, 2023 1 commit
- Benchmark: Revision - Fix -O2 option passing in gpu_copy ROCm build (#589) · 2c2096ed
  Ziyue Yang authored Dec 11, 2023
```
**Description**
`add_compile_options` does not work for the ROCm build; set
`CMAKE_CXX_FLAGS` instead.
```
  2c2096ed
10 Dec, 2023 1 commit
- Benchmarks: Microbenchmark - Add distributed inference benchmark cpp implementation (#586) · 719a427f
  Ziyue Yang authored Dec 11, 2023
```
**Description**
Add distributed inference benchmark cpp implementation.
```
  719a427f
09 Dec, 2023 1 commit

Dockerfile - Upgrade to rocm5.7 dockerfile (#587) · 1f5031bd

Yuting Jiang authored Dec 10, 2023



**Description**
upgrade to rocm5.7 dockerfile.

---------
Co-authored-by: yukirora <yuting.jiang@microsoft.com>

1f5031bd

08 Dec, 2023 1 commit

Benchmarks: Micro benchmark - Add one-to-all, all-to-one, all-to-all support... · 4fa60be7

Ziyue Yang authored Dec 08, 2023

Benchmarks: Micro benchmark - Add one-to-all, all-to-one, all-to-all support to gpu_copy_bw_performance (#588)

**Description**
Add one-to-all, all-to-one, all-to-all support to
gpu_copy_bw_performance, and fix performance bug in gpu_copy

4fa60be7

05 Dec, 2023 1 commit
- Benchmarks: Micro benchmark - Add graph mode in NCCL/RCCL benchmarks for latency metrics (#583) · 254ea7fe
  Ziyue Yang authored Dec 05, 2023
```
**Description**
Revise NCCL/RCCL benchmarks to add graph mode for latency metrics.
```
  254ea7fe
04 Dec, 2023 1 commit

Benchmarks: micro benchmark - Support cpu-gpu and gpu-cpu in ib-validation (#581) · 9ae8c670

Yuting Jiang authored Dec 04, 2023

**Description**
Benchmarks: micro benchmark - Support cpu-gpu and gpu-cpu in
ib-validation

**Major Revision**
- Support cpu-gpu and gpu-cpu in ib-validation


**Minor Revision**
- support multi msg size, multi direction, multi ib commands in
ib-validation

9ae8c670

22 Nov, 2023 3 commits
- Dockerfile - Upgrade Docker image to CUDA 12.2 (#577) · 1ad1c21c
  Yifan Xiong authored Nov 22, 2023
```
Upgrade Docker image to CUDA 12.2 for H100:
* upgrade base image to 23.10
* fix onnxruntime version in python3.10
* fix compilation errors
```
  1ad1c21c
- Benchmarks: Micro benchmark - add initialization options for rocm gemm flops (#578) · 2235e084
  Yuting Jiang authored Nov 22, 2023
```
**Description**
add initialization options for rocm gemm flops.
```
  2235e084
- Benchmarks: Micro benchmark - Add hipBLASLt function benchmark (#576) · 79089b65
  Yuting Jiang authored Nov 22, 2023
```
**Description**
Add hipblaslt function benchmark and rebase cublaslt function benchmark.
```
  79089b65
20 Nov, 2023 1 commit
- Benchmarks: micro benchmarks - add int8 support for cublaslt function (#574) · f53d941a
  Yuting Jiang authored Nov 20, 2023
```
**Description**
add int8 support for cublaslt function.
```
  f53d941a
14 Nov, 2023 1 commit

Bug Fix - remove cp ptx file command in gpu burn test (#567) · c7800bb8

Yuting Jiang authored Nov 14, 2023

**Description**
Remove the command that copies the ptx file in the gpu burn test, since
the command is already run inside the self.args.bin_dir directory.


https://github.com/microsoft/superbenchmark/blob/d246bab430adeb461072918a551b2e2b68c9bce5/superbench/benchmarks/micro_benchmarks/micro_base.py#L183

c7800bb8

22 Aug, 2023 1 commit
- Benchmarks: micro benchmark - source code for evaluating NVDEC decoding performance (#560) · 27a10811
  Yuting Jiang authored Aug 22, 2023
```
**Description**
source code for evaluating NVDEC decoding performance.

---------
Co-authored-by: yukirora <yuting.jiang@microsoft.com>
```
  27a10811
18 Aug, 2023 1 commit
- Benchmarks: micro benchmarks - add source code for DirectXRenderPerf (#549) · 6c0205ce
  Yuting Jiang authored Aug 18, 2023
```
**Description**
add source code for DirectXRenderPerf.

---------
Co-authored-by: yukirora <yuting.jiang@microsoft.com>
```
  6c0205ce
27 Jul, 2023 1 commit

Release - SuperBench v0.9.0 (#558) · e1df877b

Yuting Jiang authored Jul 27, 2023

**Description**
Cherry-pick bug fixes from v0.9.0 to main.

**Major Revision**
- CI/CD: pipeline - clean more disk space to fix rocm building image
pipeline (#555)
- Benchmarks: bug fix - use absolute path for input file in
DirectXEncodingLatency (#554)
- CI/CD - add push win docker image on release branch in pipeline (#552)
- Docs - Upgrade version and release note (#557)

e1df877b

06 Jul, 2023 1 commit
- Benchmarks: micro benchmarks - add python code for DirectXGPUEncodingLatency (#548) · e8ac0b1e
  Yuting Jiang authored Jul 06, 2023
```
**Description**
add python code for DirectXGPUEncodingLatency.
```
  e8ac0b1e
05 Jul, 2023 3 commits
- Benchmarks: micro benchmarks - add python code for DirectXGPUCopy (#546) · c8c079c2
  Yuting Jiang authored Jul 06, 2023
```
**Description**
add python code for DirectXGPUCopy.
```
  c8c079c2
- Benchmarks: micro benchmarks - add python code for DirecXGPUMemBw (#547) · af4cfd5b
  Yuting Jiang authored Jul 05, 2023
```
**Description**
add python code for DirecXGPUMemBw.
```
  af4cfd5b
- Benchmarks: micro benchmarks - add python code for DirectXGPUCoreFlops (#542) · f1d608ae
  Yuting Jiang authored Jul 05, 2023
```
**Description**
add python code for DirectX core flops and init DirectX test pipeline.

**Major Revision**
- add python code for DirectX core flops 
- init DirectX test pipeline


**Minor Revision**
- add test for DirectX core flops
```
  f1d608ae
30 Jun, 2023 2 commits

Benchmarks: microbenchmark - add auto selecting algorithm support for cudnn functions (#540) · 97f7b1df

Yuting Jiang authored Jun 30, 2023

**Description**
add auto selecting algorithm support for cudnn functions.

**Major Revision**
- add auto selecting algorithm support for cudnn functions in source
code
- add 'auto_algo' option in benchmark
- add related test

97f7b1df

Benchmarks - Update result parsing in tensorrt inference (#541) · 7184bdd1
Yifan Xiong authored Jun 30, 2023
```
* Update result parsing for newer tensorrt versions
* Update arguments when load torchvision models
```
7184bdd1

29 Jun, 2023 3 commits
- Benchmarks: Add benchmark - Add source code of DirectxGPUCopy microbenchmark (#486) · f2599137
  Yuting Jiang authored Jun 29, 2023
```
**Description**
Add source code of DirectxGPUCopy microbenchmark.
```
  f2599137
- Benchmarks: Add benchmark - Add source code of DirectxGPUMemBw microbenchmark (#487) · af4d18de
  Yuting Jiang authored Jun 29, 2023
```
**Description**
Add source code of DirectxGPUMemBw microbenchmark.

---------
Co-authored-by: v-junlinlv <v-junlinlv@microsoft.com>
```
  af4d18de
- Benchmarks: Add benchmark - Add source code of DirectXGPUCoreFLOPs microbenchmark (#488) · 3a6622f7
  Yuting Jiang authored Jun 29, 2023
```
**Description**
Add source code of DirectXGPUCoreFLOPs microbenchmark.

---------
Co-authored-by: v-junlinlv <v-junlinlv@microsoft.com>
```
  3a6622f7
24 Apr, 2023 1 commit

Benchmarks - Revise step time collection in distributed inference benchmark (#524) · 4cb431ca

Ziyue Yang authored Apr 24, 2023

**Description**
This commit revises the distributed inference benchmark to give a
unified step time result by taking the maximum step time across GPUs.
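
A minimal sketch of the idea, assuming the maximum is taken per step across GPUs. The function name and the sample timings are hypothetical, and the real benchmark gathers its timings differently.

```python
def unified_step_times(per_gpu_step_times):
    """Combine per-GPU step times into one series by taking the per-step maximum."""
    return [max(step) for step in zip(*per_gpu_step_times)]


# Hypothetical step times in milliseconds for two GPUs over three steps.
gpu0 = [10.2, 10.4, 10.1]
gpu1 = [10.5, 10.3, 10.6]
print(unified_step_times([gpu0, gpu1]))
```

Taking the maximum reflects that a distributed step is only complete once the slowest GPU has finished it.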

4cb431ca

14 Apr, 2023 1 commit

Release - SuperBench v0.8.0 (#517) · 51761b3a

Yifan Xiong authored Apr 14, 2023



**Description**

Cherry-pick bug fixes from v0.8.0 to main.

**Major Revisions**

* Monitor - Fix the cgroup version checking logic (#502)
* Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503)
* Fix wrong torch usage in communication wrapper for Distributed
Inference Benchmark (#505)
* Analyzer: Fix bug in python3.8 due to pandas api change (#504)
* Bug - Fix bug to get metric from cmd when error happens (#506)
* Monitor - Collect realtime GPU power when benchmarking (#507)
* Add num_workers argument in model benchmark (#511)
* Remove unreachable condition when write host list (#512)
* Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513)
* Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515)
* Docs - Upgrade version and release note (#508)
Co-authored-by: guoshzhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>

51761b3a