- 30 Jun, 2025 1 commit
-
-
pdr authored
Added MoE model using MixtralConfig. 1. Added 8x7b and 8x22b variants 2. Requires high VRAM as all experts are loaded in memory. Thus, disabled training due to memory constraint on test worker. --------- Co-authored-by:
Hongtao Zhang <garyworkzht@gmail.com> Co-authored-by:
Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by:
Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 26 Jun, 2025 1 commit
-
-
Yuting Jiang authored
**Description** Add deepseek megatron-lm benchmark. --------- Co-authored-by:
yukirora <yuting.jiang@microsoft.com> Co-authored-by:
Hongtao Zhang <garyworkzht@gmail.com> Co-authored-by:
Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 25 Jun, 2025 1 commit
-
-
guoshzhao authored
**Description** Add cuda 12.9 dockerfile and build in pipeline. --------- Co-authored-by:
Guoshuai Zhao <microsoft@microsoft.com> Co-authored-by:
Hongtao Zhang <hongtaozhang@microsoft.com> Co-authored-by:
Hongtao Zhang <garyworkzht@gmail.com>
-
- 24 Jun, 2025 1 commit
-
-
guoshzhao authored
**Description** Add FP4 precision support for cublaslt_gemm benchmark. **Major Revision** - Add new type `fp4e2m1` and `__nv_fp4_e2m1`. - For FP4 matmul, precision of MatrixC (add) should be FP16, precision of MatricD (output) should be FP4, otherwise, it will not work. - Add macro `CUDA_VERSION` to resolve the compatibility issue of different CUDA versions. --------- Co-authored-by:
Ubuntu <aiperf@aiperf000000.hp5z1gqeinfufbj2u3jcty5fme.cdmx.internal.cloudapp.net> Co-authored-by:
AVA <39534996+avazr@users.noreply.github.com> Co-authored-by:
Guoshuai Zhao <microsoft@microsoft.com>
-
- 20 Jun, 2025 2 commits
-
-
Babak Hejazi authored
**Description** Enable autotuning as an opt-in mode when benchmarking cublasLt via `cublaslt_gemm` The implementation is based on https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu The behavior of original benchmark command remains unchanged, e.g.: - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w10000 -i 1000 -t fp8e4m3` The new opt-in options are `-a` (for autotune) and `-I` (for autotune iterations, default is 50, same as the default for `-i`) and `-W` (for autotune warmups, default=20, same as the default for `-w`), e.g.: - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a` - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a -I 10 -W 10` **Note:** This PR also changes the default `gemm_compute_type` for BF16 and FP16 to `CUBLAS_COMPUTE_32F`. **Further observations:** 1. The support matrix of the `cublaslt_gemm` could be furt...
-
WenqingLan1 authored
**Description** Added support for Grace CPU neo2 architecture in CPU Stream. Now CPU Stream supports dual socket benchmarking. Example config for this arch support: ```yaml cpu-stream:numa0: timeout: *default_timeout modes: - name: local parallel: no parameters: cpu_arch: neo2 numa_mem_nodes: 0 cores: 0 1 2 3 4 5 6 7 8 cpu-stream:numa1: timeout: *default_timeout modes: - name: local parallel: no parameters: cpu_arch: neo2 numa_mem_nodes: 1 cores: 64 65 66 67 68 69 70 71 72 cpu-stream:numa-spread: timeout: *default_timeout modes: - name: local parallel: no parameters: cpu_arch: neo2 numa_mem_nodes: 0 1 cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72 ``` --------- Co-authored-by:dpower4 <dilipreddi@gmail.com>
-
- 18 Jun, 2025 1 commit
-
-
WenqingLan1 authored
Added GPU Stream benchmark - measures the GPU memory bandwidth and efficiency for double datatype through various memory operations including copy, scale, add, and triad. - added documentation for `gpu-stream` detailing its introduction, metrics, and descriptions. - added unit tests for `gpu-stream`. Example output is in `superbenchmark/tests/data/gpu_stream.log`.
-
- 14 Jun, 2025 1 commit
-
-
Hongtao Zhang authored
In the current implementation, the CPU‑stream benchmark code renames the binary before the microbench base class can verify its existence, causing the default‐binary check to fail. This PR adds a “default” binary—built with the standard compile parameters—so that the base class can always find and validate it. Once the default binary is in place, the CPU‑stream code will rename it as needed and re‑check its presence before running the benchmark. The PR also enable CPU stream in the default settings. --------- Co-authored-by:Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 05 Jun, 2025 1 commit
-
-
Yifan Xiong authored
Update CODEOWNERS.
-
- 01 May, 2025 1 commit
-
-
pdr authored
adding gb200 cuda arch flag for cublaslt compilation
-
- 30 Apr, 2025 1 commit
-
-
Hongtao Zhang authored
- Upgrade OS of github runner used by lint to the latest. - Add symbolic link for clang-format to version 14. - Update importlib_metadata version since it is too old (inside nvcr.io/nvidia/pytorch:20.12-py3) and failed the 11.1 build. --------- Co-authored-by:
hongtaozhang <hongtaozhang@microsoft.com> Co-authored-by:
Yifan Xiong <yifan.xiong@microsoft.com>
-
- 09 Apr, 2025 1 commit
-
-
Yifan Xiong authored
Merge multi-arch image in build pipeline.
-
- 21 Mar, 2025 1 commit
-
-
pdr authored
**Description** Updated docker for 12.8 Use cutlass latest relase 3.8 with ARCH 100(blackwell) support add latest nccl-test release with ARCH 100(blackwell) Updated msccl to support build for sm_100 No breaking changes, so backward compatible tested with cuda 12.4 --------- Co-authored-by:Hongtao Zhang <garyworkzht@gmail.com>
-
- 12 Mar, 2025 1 commit
-
-
Hongtao Zhang authored
Due to the matrix strategy’s default "fail-fast" setting. In GitHub Actions, when running a job with a matrix, the individual configurations run in parallel. By default, if one matrix job (for example, the one labeled "rocm6_2_rocm6_2_x_superbe") fails, the remaining parallel jobs are canceled automatically. In our current build image pipeline, the arm64 build job always are canceled by the rocm build job. So, using a non-existent label in the job config to prevent rocm build job from scheduling for a temporary solution. --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 08 Mar, 2025 1 commit
-
-
Hongtao Zhang authored
This enhancement addresses an issue in mypy where it may report missing pkg_resources even when ignore_missing_imports = True is set and the package is installed. Adding this configuration ensures that pkg_resources is properly skipped during type checking. Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 07 Mar, 2025 1 commit
-
-
Yifan Xiong authored
Add image build on arm64 arch.
-
- 04 Mar, 2025 1 commit
-
-
Jorge Esguerra authored
Improves logging info for diagnosis rule op baseline errors. This allows developers to easily detect errors in their rule files as well as baseline files, improving end-user experience.
-
- 25 Feb, 2025 2 commits
-
-
Maxim Evtush authored
Co-authored-by:
Yifan Xiong <yifan.xiong@microsoft.com> Co-authored-by:
Hongtao Zhang <garyworkzht@gmail.com>
-
Hongtao Zhang authored
Added support for Python 3.11, 3.12 and 3.13. yapf is not compatiable with python3.12+, so we disable yapf in py3.12 for now. https://github.com/google/yapf/issues/1258 https://github.com/google/yapf/issues/1266 --------- Co-authored-by:
hongtaozhang <hongtaozhang@microsoft.com>
-
- 15 Feb, 2025 1 commit
-
-
Hongtao Zhang authored
Root Cause: 1. '_get_all_test_cases()' was called in '_parser' while '_parser' was defined in the base class. 2. in '_get_all_test_cases()', cmd path was not included. Fix: 1. Remove '_get_all_test_cases()' from '_parser'. 2. Construct path for cmd. --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 05 Feb, 2025 2 commits
-
-
Hongtao Zhang authored
**Description** 1. Fixed the bug that nvbandwidth benchmark need to handle 'N/A' values in nvbandwidth cmd output. 2. Replaced the input format of test cases with a list. 3. Add nvbandwidth configuration example in default config files. --------- Co-authored-by:
hongtaozhang <hongtaozhang@microsoft.com> Co-authored-by:
Yifan Xiong <yifan.xiong@microsoft.com>
-
Kirill Prosvirov authored
**Description** Today I was running a benchmark on my machine. And encountered a fancy issue with tensorrt-inference. I got code 33, which according to the source code is: ``` MICROBENCHMARK_RESULT_PARSING_FAILURE = 33 ``` I dived deep into the code and found out the following problem. The parser stumbled upon getting to the following line: ``` [11/28/2024-17:03:11] [I] Latency: min = 7.2793 ms, max = 10.1606 ms, mean = 7.41642 ms, median = 7.39551 ms, percentile(99%) = 8 ms ``` I ran it separately on the code and found out that the regular expression was not suitable for the cases like this, when you encounter an INT as a result in milliseconds. That's why this pull request is created. I came up with the closest possible regular expression to fix this issue and not to introduce any other bug. **Major Revision** - 0.11.0
-
- 04 Feb, 2025 3 commits
-
-
pdr authored
Flake8 has moved away from gitlab to github. Updating the repo path in the pre commit config.
-
Hongtao Zhang authored
**Description** Introduce architecture support for version 10.0 in gemm-flops.
-
Yifan Xiong authored
Fix installation and lint issues: * Fix transformer installation in Python3.7 due to upgrade of safetensors. * Fix lint issues in mypy 1.14.1.
-
- 08 Jan, 2025 1 commit
-
-
dependabot[bot] authored
Bumps [nanoid](https://github.com/ai/nanoid) from 3.3.6 to 3.3.8. - [Release notes](https://github.com/ai/nanoid/releases) - [Changelog](https://github.com/ai/nanoid/blob/main/CHANGELOG.md ) - [Commits](ai/nanoid@3.3.6...3.3.8) --- updated-dependencies: - dependency-name: nanoid dependency-type: indirect ... Signed-off-by:
dependabot[bot] <support@github.com>
-
- 28 Nov, 2024 2 commits
-
-
pdr authored
Added llama benchmark - training and inference in accordance with the existing pytorch models implementation like gpt2, lstm etc. - added llama fp8 unit test for better code coverage, to reduce memory required - updated transformers version >= 4.28.0 for LLamaConfig - set tokenizers version <= 0.20.3 to avoid 0.20.4 version [issues](https://github.com/huggingface/tokenizers/issues/1691 ) with py3.8 - added llama2 to tensorrt - llama2 tests not added to test_tensorrt_inference_performance.py due to large memory requirement for worker gpu. tests validated separately on gh200 --------- Co-authored-by:
dpatlolla <dpatlolla@microsoft.com>
-
pdr authored
Fix ordering of args in err messages.
-
- 27 Nov, 2024 1 commit
-
-
Yifan Xiong authored
Upgrade dependency versions in Azure pipeline: * Remove Python 3.6 and add Python 3.10 for cpu-unit-test * Upgrade CUDA from 11.1 to 12.4 for cuda-unit-test * Update labels accordingly --------- Co-authored-by:Dilip Patlolla <dilipreddi@gmail.com>
-
- 22 Nov, 2024 1 commit
-
-
Hongtao Zhang authored
**Description** Add nvbandwidth benchmark. --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 21 Nov, 2024 2 commits
-
-
Hongtao Zhang authored
**Description** Add nvbandwidth build to repo --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
Yifan Xiong authored
Update CODEOWNERS for docs.
-
- 20 Nov, 2024 1 commit
-
-
Hongtao Zhang authored
**Description** Add micro benchmark to measure general CPU bandwidth and latency without 'mlc'. Test output: ``` { "cpu-memory-bw-latency/return_code": 0, "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_bw": 5388.75021, "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_lat": 0.185571786, "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_bw": 4634.82028, "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_lat": 0.215758096, } ``` --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 15 Nov, 2024 1 commit
-
-
Hongtao Zhang authored
**Description** Bump onnxruntime-gpu from 1.10.0 to 1.12.0. --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 07 Nov, 2024 2 commits
-
-
dependabot[bot] authored
Bumps [webpack](https://github.com/webpack/webpack) from 5.76.1 to 5.96.1. - [Release notes](https://github.com/webpack/webpack/releases ) - [Commits](webpack/webpack@v5.76.1...v5.96.1) --- updated-dependencies: - dependency-name: webpack dependency-type: indirect ... Signed-off-by:
dependabot[bot] <support@github.com>
-
dependabot[bot] authored
Bumps [cookie](https://github.com/jshttp/cookie) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together. Updates `cookie` from 0.6.0 to 0.7.1 - [Release notes](https://github.com/jshttp/cookie/releases) - [Commits](jshttp/cookie@v0.6.0...v0.7.1) Updates `express` from 4.21.0 to 4.21.1 - [Release notes](https://github.com/expressjs/express/releases) - [Changelog](https://github.com/expressjs/express/blob/4.21.1/History.md ) - [Commits](expressjs/express@4.21.0...4.21.1) --- updated-dependencies: - dependency-name: cookie dependency-type: indirect - dependency-name: express dependency-type: indirect ... Signed-off-by:
dependabot[bot] <support@github.com>
-
- 06 Nov, 2024 1 commit
-
-
pdr authored
Add support for arm64 build: - Updated dockerfile for arm64 build - extend cpu stream compilation for neoverse - handle onnxruntime-gpu installation - third party builds filtering based on arch - disable cuda decode perf build for non x86
-
- 05 Nov, 2024 1 commit
-
-
pdr authored
The current GPU Copy BW Performance fails on Nvidia Grace systems. This is due to the memory only numa node and thus the numa_run_on_node fails for such nodes and halts completely. This fix checks for the presence of assigned CPU cores for the numa node, on checking if it has no cpu cores assigned, it skips that specific node during the args creation and continues.
-
- 02 Nov, 2024 1 commit
-
-
Yifan Xiong authored
**Description** Update image build. **Major Revision** * Remove ROCm 6.0 image due to outdated packages * Remove build tag for ROCm * Preserve build cache for 30 days
-
- 10 Oct, 2024 1 commit
-
-
Yuting Jiang authored
**Description** Cherry pick bug fixes from v0.11.0 to main **Major Revision** * #645 * #648 * #646 * #647 * #651 * #652 * #650 --------- Co-authored-by:
hongtaozhang <hongtaozhang@microsoft.com> Co-authored-by:
Yifan Xiong <yifan.xiong@microsoft.com>
-