- 23 Oct, 2025 1 commit
-
-
Yuting Jiang authored
**Description** This PR adds NCU (NVIDIA Nsight Compute) profiling support to the cublaslt-gemm micro benchmark, enabling detailed kernel analysis including DRAM throughput, compute throughput, and launch arguments. **Major Revision** - Add `--enable_ncu_profiling` and `--profiling_metrics` options for NCU profiling - Modify command execution to use NCU when profiling is enabled - Update result parsing to handle both standard and NCU-profiled output formats
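The command-wrapping step above can be sketched as follows. This is a minimal illustration, not the actual benchmark code; the helper name and the two default metric names are assumptions chosen to match the DRAM/compute throughput metrics the description mentions.

```python
def build_command(base_cmd, enable_ncu_profiling=False, profiling_metrics=None):
    """Wrap a benchmark command with NCU when profiling is enabled.

    Hypothetical sketch: the real flag handling lives in the benchmark
    class, and the default metric names below are illustrative.
    """
    if not enable_ncu_profiling:
        return base_cmd
    metrics = profiling_metrics or [
        'dram__throughput.avg.pct_of_peak_sustained_elapsed',  # DRAM throughput
        'sm__throughput.avg.pct_of_peak_sustained_elapsed',    # compute throughput
    ]
    # Prefix the original command with the ncu CLI and requested metrics.
    return ['ncu', '--csv', '--metrics', ','.join(metrics)] + base_cmd
```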
-
- 22 Oct, 2025 2 commits
-
-
Ziyue Yang authored
Benchmarks: Micro benchmark - Support verification and parallel run for disk performance benchmark (#741) **Description** Adds verification and parallel-run support for the disk performance benchmark. **Major Revision** - Adds `--verify` flag to support verifying written data. - Supports loading benchmark options from the `PROC_RANK`, `BLOCK_DEVICES` and `NUMA_NODES` environment variables. --------- Co-authored-by: guoshzhao <guzhao@microsoft.com>
-
Hongtao Zhang authored
**Description** The Python 3.10 pipeline failed. **Solution** The log shows the 'bc' command is missing. Since our image tags are simple, the fix is to remove the 'bc' usage directly. --------- Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 08 Oct, 2025 2 commits
-
-
Hongtao Zhang authored
To improve benchmark debugging, the following debug methods were added: PyTorch profiler in model benchmarks - SB_ENABLE_PYTORCH_PROFILER: switch to enable/disable - SB_TORCH_PROFILER_TRACE_DIR: log path. These two runtime variables need to be configured in the SB config file. nsys in the SB runner - SB_ENABLE_NSYS: switch to enable/disable - SB_NSYS_TRACE_DIR: log path. These two runtime variables need to be configured in the runner's ENV. --------- Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
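Resolving such a pair of switch/path variables could be sketched as below. The variable names come from the commit message; the truthy-value handling and the default trace directory are assumptions for illustration.

```python
import os

def profiler_settings(environ=os.environ):
    """Resolve the PyTorch-profiler toggles described above.

    Hedged sketch: which strings count as 'enabled' and the fallback
    trace directory are assumptions, not the real defaults.
    """
    enabled = environ.get('SB_ENABLE_PYTORCH_PROFILER', '').lower() in ('1', 'true', 'yes')
    trace_dir = environ.get('SB_TORCH_PROFILER_TRACE_DIR', './traces')
    return enabled, trace_dir
```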
-
Yifan Xiong authored
Fix image merge for release event in GitHub Action.
-
- 01 Oct, 2025 1 commit
-
-
WenqingLan1 authored
Add support for CUDA 13.0. Add cuda13.0.dockerfile. Add cuda13.0 image-building task to the GitHub pipeline. Update GPU STREAM to work with CUDA 13.0. Fix data-type-conversion perf bug in GPU STREAM. Update nvbandwidth submodule to v0.8. Update perftest submodule to 4bee61f80d9e268fc97eaf40be00409e91d3a19e (recent master). --------- Co-authored-by: Ubuntu <dilipreddi@gmail.com> Co-authored-by: guoshzhao <guzhao@microsoft.com>
-
- 30 Sep, 2025 1 commit
-
-
Yuting Jiang authored
Benchmarks: Micro benchmark - Add simultaneous all-to-host / host-to-all bandwidth test cases to nvbandwidth (#736) **Description** Add simultaneous all-to-host / host-to-all bandwidth test cases to nvbandwidth. **Major Revision** - nvbandwidth.patch: add simultaneous all-to-host / host-to-all bandwidth test cases to nvbandwidth - upgrade the nvbandwidth submodule to v0.8 - add the patch to the makefile build
-
- 29 Sep, 2025 2 commits
-
-
Yuting Jiang authored
**Description** Add an option to exclude data-copy time in model benchmarks. **Major Revision** - Add an option `--no_copy` - Move the start time to after the data copy finishes
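The timing change can be sketched as follows; a minimal illustration, assuming `copy` and `compute` stand in for the real host-to-device transfer and model step (the function and parameter names are hypothetical, not from the benchmark).

```python
import time

def timed_step(data_on_host, copy, compute, no_copy=False):
    """Time one step, optionally excluding the data copy.

    Sketch of the --no_copy behavior described above: when no_copy is
    set, the copy happens before the timer starts.
    """
    if no_copy:
        device_data = copy(data_on_host)   # copy first, untimed
        start = time.time()
    else:
        start = time.time()                # copy counted in the step time
        device_data = copy(data_on_host)
    compute(device_data)
    return time.time() - start
```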
-
Yuting Jiang authored
**Description** Add numa support for nvbandwidth.
-
- 19 Sep, 2025 1 commit
-
-
Yuting Jiang authored
Benchmarks: micro benchmarks - change cublasLtMatmulDescCreate scaleType from CUDA_R_32F to CUDA_R_16F in FP16 dist inference (#732) **Description** Change the cublasLtMatmulDescCreate scaleType from CUDA_R_32F to CUDA_R_16F in FP16 dist inference to fix a cuBLASLt error.
-
- 12 Aug, 2025 1 commit
-
-
Hongtao Zhang authored
**Description** Cherry-pick bug fixes from v0.12.0 to main. **Major Revisions** * #725 * #727 * #728 --------- Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com> Co-authored-by: Yifan Xiong <yixio@microsoft.com> Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
-
- 30 Jun, 2025 1 commit
-
-
pdr authored
Added MoE model using MixtralConfig. 1. Added 8x7b and 8x22b variants 2. Requires high VRAM as all experts are loaded in memory; training is therefore disabled due to memory constraints on the test worker. --------- Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 26 Jun, 2025 1 commit
-
-
Yuting Jiang authored
**Description** Add DeepSeek Megatron-LM benchmark. --------- Co-authored-by: yukirora <yuting.jiang@microsoft.com> Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com> Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 25 Jun, 2025 1 commit
-
-
guoshzhao authored
**Description** Add CUDA 12.9 dockerfile and build it in the pipeline. --------- Co-authored-by: Guoshuai Zhao <microsoft@microsoft.com> Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com> Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>
-
- 24 Jun, 2025 1 commit
-
-
guoshzhao authored
**Description** Add FP4 precision support for the cublaslt_gemm benchmark. **Major Revision** - Add new type `fp4e2m1` and `__nv_fp4_e2m1`. - For FP4 matmul, the precision of MatrixC (add) should be FP16 and the precision of MatrixD (output) should be FP4; otherwise it will not work. - Add the `CUDA_VERSION` macro to resolve compatibility issues across different CUDA versions. --------- Co-authored-by: Ubuntu <aiperf@aiperf000000.hp5z1gqeinfufbj2u3jcty5fme.cdmx.internal.cloudapp.net> Co-authored-by: AVA <39534996+avazr@users.noreply.github.com> Co-authored-by: Guoshuai Zhao <microsoft@microsoft.com>
-
- 20 Jun, 2025 2 commits
-
-
Babak Hejazi authored
**Description** Enable autotuning as an opt-in mode when benchmarking cuBLASLt via `cublaslt_gemm`. The implementation is based on https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu The behavior of the original benchmark command remains unchanged, e.g.: - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3` The new opt-in options are `-a` (for autotune), `-I` (for autotune iterations, default 50, same as the default for `-i`) and `-W` (for autotune warmups, default 20, same as the default for `-w`), e.g.: - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a` - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a -I 10 -W 10` **Note:** This PR also changes the default `gemm_compute_type` for BF16 and FP16 to `CUBLAS_COMPUTE_32F`. **Further observations:** 1. The support matrix of the `cublaslt_gemm` could be furt...
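The autotune loop pattern (warm up, time each candidate, keep the fastest, as in NVIDIA's LtSgemmSimpleAutoTuning sample referenced above) can be sketched in plain Python. Everything here is illustrative: `run(algo)` stands in for a single cublasLtMatmul call with a candidate algorithm.

```python
import time

def autotune(algorithms, run, warmups=20, iterations=50):
    """Pick the fastest algorithm by timed trials.

    Sketch of the generic autotune-selection pattern; not the actual
    cublaslt_gemm implementation.
    """
    best_algo, best_time = None, float('inf')
    for algo in algorithms:
        for _ in range(warmups):          # untimed warmup runs (-W)
            run(algo)
        start = time.perf_counter()
        for _ in range(iterations):       # timed runs (-I)
            run(algo)
        elapsed = (time.perf_counter() - start) / iterations
        if elapsed < best_time:
            best_algo, best_time = algo, elapsed
    return best_algo
```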
-
WenqingLan1 authored
**Description** Added support for the Grace CPU neo2 architecture in CPU Stream. CPU Stream now supports dual-socket benchmarking. Example config for this arch support:
```yaml
cpu-stream:numa0:
  timeout: *default_timeout
  modes:
    - name: local
      parallel: no
  parameters:
    cpu_arch: neo2
    numa_mem_nodes: 0
    cores: 0 1 2 3 4 5 6 7 8
cpu-stream:numa1:
  timeout: *default_timeout
  modes:
    - name: local
      parallel: no
  parameters:
    cpu_arch: neo2
    numa_mem_nodes: 1
    cores: 64 65 66 67 68 69 70 71 72
cpu-stream:numa-spread:
  timeout: *default_timeout
  modes:
    - name: local
      parallel: no
  parameters:
    cpu_arch: neo2
    numa_mem_nodes: 0 1
    cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72
```
--------- Co-authored-by: dpower4 <dilipreddi@gmail.com>
-
- 18 Jun, 2025 1 commit
-
-
WenqingLan1 authored
Added GPU Stream benchmark - measures GPU memory bandwidth and efficiency for the double datatype through various memory operations including copy, scale, add, and triad. - added documentation for `gpu-stream` detailing its introduction, metrics, and descriptions. - added unit tests for `gpu-stream`. Example output is in `superbenchmark/tests/data/gpu_stream.log`.
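The four operations named above are the classic STREAM kernels; a plain-Python reference sketch is below for illustration (the real benchmark runs these as GPU kernels over large double-precision arrays, and the function name here is hypothetical).

```python
def stream_kernels(a, b, c, scalar=3.0):
    """Reference versions of the four STREAM operations.

    Illustrative sketch only, computed from the input arrays a, b, c.
    """
    copy  = [x for x in a]                          # copy:  c[i] = a[i]
    scale = [scalar * x for x in c]                 # scale: b[i] = q * c[i]
    add   = [x + y for x, y in zip(a, b)]           # add:   c[i] = a[i] + b[i]
    triad = [x + scalar * y for x, y in zip(b, c)]  # triad: a[i] = b[i] + q * c[i]
    return copy, scale, add, triad
```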
-
- 14 Jun, 2025 1 commit
-
-
Hongtao Zhang authored
In the current implementation, the CPU-stream benchmark code renames the binary before the microbench base class can verify its existence, causing the default-binary check to fail. This PR adds a "default" binary, built with the standard compile parameters, so that the base class can always find and validate it. Once the default binary is in place, the CPU-stream code will rename it as needed and re-check its presence before running the benchmark. The PR also enables CPU stream in the default settings. --------- Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 05 Jun, 2025 1 commit
-
-
Yifan Xiong authored
Update CODEOWNERS.
-
- 01 May, 2025 1 commit
-
-
pdr authored
Add GB200 CUDA arch flag for cuBLASLt compilation.
-
- 30 Apr, 2025 1 commit
-
-
Hongtao Zhang authored
- Upgrade the OS of the GitHub runner used by lint to the latest. - Add symbolic link for clang-format to version 14. - Update the importlib_metadata version since it is too old (inside nvcr.io/nvidia/pytorch:20.12-py3) and failed the 11.1 build. --------- Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com> Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>
-
- 09 Apr, 2025 1 commit
-
-
Yifan Xiong authored
Merge multi-arch image in build pipeline.
-
- 21 Mar, 2025 1 commit
-
-
pdr authored
**Description** Updated docker for CUDA 12.8. Use latest cutlass release 3.8 with ARCH 100 (Blackwell) support. Add latest nccl-tests release with ARCH 100 (Blackwell) support. Updated msccl to support building for sm_100. No breaking changes, so backward compatible; tested with CUDA 12.4. --------- Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>
-
- 12 Mar, 2025 1 commit
-
-
Hongtao Zhang authored
In GitHub Actions, when running a job with a matrix, the individual configurations run in parallel. Due to the matrix strategy's default "fail-fast" setting, if one matrix job (for example, the one labeled "rocm6_2_rocm6_2_x_superbe") fails, the remaining parallel jobs are canceled automatically. In our current build-image pipeline, the arm64 build job is always canceled by the failing ROCm build job. So, as a temporary solution, a non-existent label is used in the job config to prevent the ROCm build job from being scheduled. --------- Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
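For reference, GitHub Actions also exposes a direct switch for this cancellation behavior; a hedged sketch of what disabling fail-fast on the matrix looks like (job and matrix names here are illustrative, and this is not the workaround the commit actually took):

```yaml
jobs:
  build-image:
    strategy:
      fail-fast: false   # do not cancel other matrix jobs when one fails
      matrix:
        target: [cuda12.4, rocm6.2, arm64]
```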
-
- 08 Mar, 2025 1 commit
-
-
Hongtao Zhang authored
This enhancement addresses an issue in mypy where it may report missing pkg_resources even when `ignore_missing_imports = True` is set and the package is installed. Adding this configuration ensures that pkg_resources is properly skipped during type checking. Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
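A per-module mypy override of this kind typically looks like the fragment below (mypy.ini syntax; the exact file and section used in the repo may differ):

```ini
[mypy-pkg_resources]
ignore_missing_imports = True
```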
-
- 07 Mar, 2025 1 commit
-
-
Yifan Xiong authored
Add image build on arm64 arch.
-
- 04 Mar, 2025 1 commit
-
-
Jorge Esguerra authored
Improves logging info for diagnosis rule op baseline errors. This allows developers to easily detect errors in their rule files as well as baseline files, improving end-user experience.
-
- 25 Feb, 2025 2 commits
-
-
Maxim Evtush authored
Co-authored-by:
Yifan Xiong <yifan.xiong@microsoft.com> Co-authored-by:
Hongtao Zhang <garyworkzht@gmail.com>
-
Hongtao Zhang authored
Added support for Python 3.11, 3.12 and 3.13. yapf is not compatible with Python 3.12+, so we disable yapf in py3.12 for now. https://github.com/google/yapf/issues/1258 https://github.com/google/yapf/issues/1266 --------- Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
-
- 15 Feb, 2025 1 commit
-
-
Hongtao Zhang authored
Root Cause: 1. `_get_all_test_cases()` was called in `_parser`, while `_parser` is defined in the base class. 2. In `_get_all_test_cases()`, the cmd path was not included. Fix: 1. Remove `_get_all_test_cases()` from `_parser`. 2. Construct the path for cmd. --------- Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
-
- 05 Feb, 2025 2 commits
-
-
Hongtao Zhang authored
**Description** 1. Fixed the bug that the nvbandwidth benchmark needs to handle 'N/A' values in the nvbandwidth cmd output. 2. Replaced the input format of test cases with a list. 3. Added an nvbandwidth configuration example in the default config files. --------- Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com> Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>
-
Kirill Prosvirov authored
**Description** Today I was running a benchmark on my machine and encountered a fancy issue with tensorrt-inference. I got code 33, which according to the source code is: ``` MICROBENCHMARK_RESULT_PARSING_FAILURE = 33 ``` I dived deep into the code and found the following problem. The parser stumbled upon reaching this line: ``` [11/28/2024-17:03:11] [I] Latency: min = 7.2793 ms, max = 10.1606 ms, mean = 7.41642 ms, median = 7.39551 ms, percentile(99%) = 8 ms ``` I ran it separately against the code and found that the regular expression did not handle cases like this, where the result in milliseconds is an INT. That's why this pull request was created. I came up with the closest possible regular expression that fixes this issue without introducing any other bug. **Major Revision** - 0.11.0
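The failure mode described above can be demonstrated with a small regex sketch: a float-only pattern such as `\d+\.\d+` misses the integer value `8 ms`, while making the fractional part optional accepts both. The expression below is illustrative; the exact pattern in the repo may differ.

```python
import re

# Fractional part made optional so both '7.2793 ms' and '8 ms' match.
LATENCY_RE = re.compile(r'percentile\(99%\) = (\d+(?:\.\d+)?) ms')

line = ('[I] Latency: min = 7.2793 ms, max = 10.1606 ms, mean = 7.41642 ms, '
        'median = 7.39551 ms, percentile(99%) = 8 ms')
match = LATENCY_RE.search(line)
```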
-
- 04 Feb, 2025 3 commits
-
-
pdr authored
Flake8 has moved from GitLab to GitHub. Updating the repo path in the pre-commit config.
-
Hongtao Zhang authored
**Description** Introduce architecture support for version 10.0 in gemm-flops.
-
Yifan Xiong authored
Fix installation and lint issues: * Fix transformer installation in Python3.7 due to upgrade of safetensors. * Fix lint issues in mypy 1.14.1.
-
- 08 Jan, 2025 1 commit
-
-
dependabot[bot] authored
Bumps [nanoid](https://github.com/ai/nanoid) from 3.3.6 to 3.3.8. - [Release notes](https://github.com/ai/nanoid/releases) - [Changelog](https://github.com/ai/nanoid/blob/main/CHANGELOG.md) - [Commits](ai/nanoid@3.3.6...3.3.8) --- updated-dependencies: - dependency-name: nanoid dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>
-
- 28 Nov, 2024 2 commits
-
-
pdr authored
Added llama benchmark - training and inference in accordance with the existing PyTorch model implementations like gpt2, lstm, etc. - added llama fp8 unit test for better code coverage, to reduce memory required - updated transformers version >= 4.28.0 for LLamaConfig - set tokenizers version <= 0.20.3 to avoid version 0.20.4 [issues](https://github.com/huggingface/tokenizers/issues/1691) with py3.8 - added llama2 to tensorrt - llama2 tests not added to test_tensorrt_inference_performance.py due to the large memory requirement for the worker GPU; tests validated separately on GH200 --------- Co-authored-by: dpatlolla <dpatlolla@microsoft.com>
-
pdr authored
Fix ordering of args in err messages.
-
- 27 Nov, 2024 1 commit
-
-
Yifan Xiong authored
Upgrade dependency versions in the Azure pipeline: * Remove Python 3.6 and add Python 3.10 for cpu-unit-test * Upgrade CUDA from 11.1 to 12.4 for cuda-unit-test * Update labels accordingly --------- Co-authored-by: Dilip Patlolla <dilipreddi@gmail.com>
-