- 01 Apr, 2026 2 commits
- 19 Mar, 2026 2 commits
-
-
one authored
-
one authored
- Added Platform.DTK in the microbenchmark framework. - Introduced new DTK hipblaslt benchmark class and corresponding tests. - Updated Dockerfile to include hipblaslt-bench and its permissions. - Registered DTK benchmarks in the benchmark registry for various performance tests. - Enhanced GPU detection logic to recognize HYGON GPUs. This update improves the benchmarking capabilities for DTK, ensuring compatibility and performance testing across platforms.
-
- 17 Nov, 2025 1 commit
-
-
Yuting Jiang authored
Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB device by MPI local rank in ib validation (#733) **Description** add --set_ib_devices option to auto-select IB device by MPI local rank **Major Revision** - Add a new CLI flag --set_ib_devices to automatically select irregular IB devices based on the MPI local rank. - When enabled, the benchmark queries available IB devices via network.get_ib_devices() and selects the device corresponding to OMPI_COMM_WORLD_LOCAL_RANK. - Fall back to existing --ib_dev behavior when the flag is not provided. **Minor Revision** - Add an env in network.get_ib_devices() to allow user to set the device name
-
- 23 Oct, 2025 1 commit
-
-
Yuting Jiang authored
**Description** This PR adds NCU (NVIDIA Nsight Compute) profiling support to the cublaslt-gemm micro benchmark, enabling detailed kernel analysis including DRAM throughput, compute throughput, and launch arguments. **Major Revision** - Add --enable_ncu_profiling and --profiling_metrics for ncu profiling - Modifies command execution to use NCU when profiling is enabled - Updates result parsing to handle both standard and NCU profiled output formats
-
- 22 Oct, 2025 1 commit
-
-
Ziyue Yang authored
Benchmarks: Micro benchmark - Support verification and parallel run for disk performance benchmark (#741) **Description** Adds verification and parallel run support for disk performance benchmark. **Major Revision** - Adds `--verify` flag to support verify written data. - Supports loading benchmark options from `PROC_RANK`, `BLOCK_DEVICES` and `NUMA_NODES` environmental variables. --------- Co-authored-by:guoshzhao <guzhao@microsoft.com>
-
- 29 Sep, 2025 2 commits
-
-
Yuting Jiang authored
**Description** add option to exclude data copy time in model benchmarks. **Major Revision** - add an option --no_copy - move start time after data copy finish
-
Yuting Jiang authored
**Description** Add numa support for nvbandwidth.
-
- 30 Jun, 2025 1 commit
-
-
pdr authored
Added MoE model using MixtralConfig. 1. Added 8x7b and 8x22b variants 2. Requires high VRAM as all experts are loaded in memory. Thus, disabled training due to memory constraint on test worker. --------- Co-authored-by:
Hongtao Zhang <garyworkzht@gmail.com> Co-authored-by:
Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by:
Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 26 Jun, 2025 1 commit
-
-
Yuting Jiang authored
**Description** Add deepseek megatron-lm benchmark. --------- Co-authored-by:
yukirora <yuting.jiang@microsoft.com> Co-authored-by:
Hongtao Zhang <garyworkzht@gmail.com> Co-authored-by:
Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 25 Jun, 2025 1 commit
-
-
guoshzhao authored
**Description** Add cuda 12.9 dockerfile and build in pipeline. --------- Co-authored-by:
Guoshuai Zhao <microsoft@microsoft.com> Co-authored-by:
Hongtao Zhang <hongtaozhang@microsoft.com> Co-authored-by:
Hongtao Zhang <garyworkzht@gmail.com>
-
- 20 Jun, 2025 1 commit
-
-
WenqingLan1 authored
**Description** Added support for Grace CPU neo2 architecture in CPU Stream. Now CPU Stream supports dual socket benchmarking. Example config for this arch support: ```yaml cpu-stream:numa0: timeout: *default_timeout modes: - name: local parallel: no parameters: cpu_arch: neo2 numa_mem_nodes: 0 cores: 0 1 2 3 4 5 6 7 8 cpu-stream:numa1: timeout: *default_timeout modes: - name: local parallel: no parameters: cpu_arch: neo2 numa_mem_nodes: 1 cores: 64 65 66 67 68 69 70 71 72 cpu-stream:numa-spread: timeout: *default_timeout modes: - name: local parallel: no parameters: cpu_arch: neo2 numa_mem_nodes: 0 1 cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72 ``` --------- Co-authored-by:dpower4 <dilipreddi@gmail.com>
-
- 18 Jun, 2025 1 commit
-
-
WenqingLan1 authored
Added GPU Stream benchmark - measures the GPU memory bandwidth and efficiency for double datatype through various memory operations including copy, scale, add, and triad. - added documentation for `gpu-stream` detailing its introduction, metrics, and descriptions. - added unit tests for `gpu-stream`. Example output is in `superbenchmark/tests/data/gpu_stream.log`.
-
- 14 Jun, 2025 1 commit
-
-
Hongtao Zhang authored
In the current implementation, the CPU‑stream benchmark code renames the binary before the microbench base class can verify its existence, causing the default‐binary check to fail. This PR adds a “default” binary—built with the standard compile parameters—so that the base class can always find and validate it. Once the default binary is in place, the CPU‑stream code will rename it as needed and re‑check its presence before running the benchmark. The PR also enable CPU stream in the default settings. --------- Co-authored-by:Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 04 Mar, 2025 1 commit
-
-
Jorge Esguerra authored
Improves logging info for diagnosis rule op baseline errors. This allows developers to easily detect errors in their rule files as well as baseline files, improving end-user experience.
-
- 15 Feb, 2025 1 commit
-
-
Hongtao Zhang authored
Root Cause: 1. '_get_all_test_cases()' was called in '_parser' while '_parser' was defined in the base class. 2. in '_get_all_test_cases()', cmd path was not included. Fix: 1. Remove '_get_all_test_cases()' from '_parser'. 2. Construct path for cmd. --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 05 Feb, 2025 1 commit
-
-
Hongtao Zhang authored
**Description** 1. Fixed the bug that nvbandwidth benchmark need to handle 'N/A' values in nvbandwidth cmd output. 2. Replaced the input format of test cases with a list. 3. Add nvbandwidth configuration example in default config files. --------- Co-authored-by:
hongtaozhang <hongtaozhang@microsoft.com> Co-authored-by:
Yifan Xiong <yifan.xiong@microsoft.com>
-
- 28 Nov, 2024 1 commit
-
-
pdr authored
Added llama benchmark - training and inference in accordance with the existing pytorch models implementation like gpt2, lstm etc. - added llama fp8 unit test for better code coverage, to reduce memory required - updated transformers version >= 4.28.0 for LLamaConfig - set tokenizers version <= 0.20.3 to avoid 0.20.4 version [issues](https://github.com/huggingface/tokenizers/issues/1691 ) with py3.8 - added llama2 to tensorrt - llama2 tests not added to test_tensorrt_inference_performance.py due to large memory requirement for worker gpu. tests validated separately on gh200 --------- Co-authored-by:
dpatlolla <dpatlolla@microsoft.com>
-
- 27 Nov, 2024 1 commit
-
-
Yifan Xiong authored
Upgrade dependency versions in Azure pipeline: * Remove Python 3.6 and add Python 3.10 for cpu-unit-test * Upgrade CUDA from 11.1 to 12.4 for cuda-unit-test * Update labels accordingly --------- Co-authored-by:Dilip Patlolla <dilipreddi@gmail.com>
-
- 22 Nov, 2024 1 commit
-
-
Hongtao Zhang authored
**Description** Add nvbandwidth benchmark. --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 20 Nov, 2024 1 commit
-
-
Hongtao Zhang authored
**Description** Add micro benchmark to measure general CPU bandwidth and latency without 'mlc'. Test output: ``` { "cpu-memory-bw-latency/return_code": 0, "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_bw": 5388.75021, "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_lat": 0.185571786, "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_bw": 4634.82028, "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_lat": 0.215758096, } ``` --------- Co-authored-by:hongtaozhang <hongtaozhang@microsoft.com>
-
- 20 Aug, 2024 1 commit
-
-
Yang Wang authored
**Description** Fix executor for Benchmark Execution Without Explicit Framework Field
-
- 23 Jul, 2024 1 commit
-
-
Yang Wang authored
Update `omegaconf` version to [2.3.0](https://pypi.org/project/omegaconf/2.3.0/) as omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 24.1 will enforce this behaviour change. Discussion can be found at https://github.com/pypa/pip/issues/12063.
-
- 02 Apr, 2024 1 commit
-
-
Ziyue Yang authored
**Description** Adds hipblasLt tuning to dist-inference cpp implementation.
-
- 08 Jan, 2024 1 commit
-
-
Yifan Xiong authored
**Description** Cherry-pick bug fixes from v0.10.0 to main. **Major Revisions** * Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590 * Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591 * Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592 * Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595 * Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596 * CI/CD - Add ndv5 topo file #597 * Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593 * Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599 * Dockerfile - Bug fix for rocm docker build and deploy #598 * Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603 * Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604 * Monitor - U...
-
- 10 Dec, 2023 1 commit
-
-
Ziyue Yang authored
**Description** Add distributed inference benchmark cpp implementation.
-
- 08 Dec, 2023 1 commit
-
-
Ziyue Yang authored
Benchmarks: Micro benchmark - Add one-to-all, all-to-one, all-to-all support to gpu_copy_bw_performance (#588) **Description** Add one-to-all, all-to-one, all-to-all support to gpu_copy_bw_performance, and fix performance bug in gpu_copy
-
- 07 Dec, 2023 1 commit
-
-
Yuting Jiang authored
**Description** Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark
-
- 05 Dec, 2023 1 commit
-
-
Ziyue Yang authored
**Description** Revise NCCL/RCCL benchmarks to graph mode add latency metrics.
-
- 04 Dec, 2023 1 commit
-
-
Yuting Jiang authored
**Description** Benchmarks: micro benchmark - Support cpu-gpu and gpu-cpu in ib-validation **Major Revision** - Support cpu-gpu and gpu-cpu in ib-validation **Minor Revision** - support multi msg size, multi direction, multi ib commands in ib-validation
-
- 22 Nov, 2023 3 commits
-
-
Yuting Jiang authored
**Description** add initialization options for rocm gemm flops.
-
Yuting Jiang authored
**Description** hipblaslt function benchmark and rebase cublaslt function benchmark.
-
guoshzhao authored
**Description** Generate baseline given results from multiple nodes. **Major Revision** - Add sub command `sb result generate-baseline` - Add UT and docs --------- Co-authored-by:
454314380 <454314380@qq.com> Co-authored-by:
Yuting Jiang <yutingjiang@microsoft.com>
-
- 20 Nov, 2023 1 commit
-
-
Yuting Jiang authored
**Description** add int8 support for cublaslt function.
-
- 14 Nov, 2023 1 commit
-
-
Yuting Jiang authored
**Description** remove cp ptx file in gpu burn test since the command is run inside self.args.bin_dir dir. https://github.com/microsoft/superbenchmark/blob/d246bab430adeb461072918a551b2e2b68c9bce5/superbench/benchmarks/micro_benchmarks/micro_base.py#L183
-
- 08 Aug, 2023 1 commit
-
-
pnunna93 authored
This PR has following changes - torch.distributed.launch changed to torchrun. torch.distributed.launch is deprecated in latest Pytorch and is recommended to move to torchrun - https://pytorch.org/docs/stable/elastic/run.html - Changes to AMD GPU detection logic. The AMD GPU detection logic throws warning when containers have only renderD in /dev/dri, this change would resolve those warnings --------- Co-authored-by:
Yuting Jiang <yutingjiang@microsoft.com>
-
- 06 Jul, 2023 1 commit
-
-
Yuting Jiang authored
**Description** add python code for DirectXGPUEncodingLatency.
-
- 05 Jul, 2023 2 commits
-
-
Yuting Jiang authored
**Description** add python code for DirectXGPUCopy.
-
Yuting Jiang authored
**Description** add python code for DirecXGPUMemBw.
-