Unverified Commit 51761b3a authored by Yifan Xiong, committed by GitHub

Release - SuperBench v0.8.0 (#517)



**Description**

Cherry-pick bug fixes from v0.8.0 to main.

**Major Revisions**

* Monitor - Fix the cgroup version checking logic (#502)
* Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503)
* Fix wrong torch usage in communication wrapper for Distributed Inference Benchmark (#505)
* Analyzer - Fix bug in Python 3.8 due to pandas API change (#504)
* Bug - Fix bug in getting metric from cmd when an error happens (#506)
* Monitor - Collect realtime GPU power when benchmarking (#507)
* Add num_workers argument in model benchmark (#511)
* Remove unreachable condition when writing host list (#512)
* Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513)
* Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515)
* Docs - Upgrade version and release notes (#508)
Co-authored-by: guoshzhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
parent 97c9a41f
@@ -181,7 +181,7 @@ def _init_dataloader(self):
             dataset=self._dataset,
             batch_size=self._args.batch_size,
             shuffle=False,
-            num_workers=8,
+            num_workers=self._args.num_workers,
             sampler=train_sampler,
             drop_last=True,
             pin_memory=self._args.pin_memory
......
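The hunk above replaces the hard-coded `num_workers=8` with a user-supplied argument. A minimal, self-contained sketch of wiring such an argument from the command line into DataLoader-style keyword arguments (the parser and helper name here are illustrative, not SuperBench's actual argument class):

```python
import argparse

def build_dataloader_kwargs(argv):
    """Parse benchmark arguments and map them to DataLoader keyword arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch_size', type=int, default=32)
    # Previously hard-coded to 8; now user-configurable, 8 stays the default.
    parser.add_argument(
        '--num_workers', type=int, default=8,
        help='Number of subprocesses to use for data loading.'
    )
    parser.add_argument('--pin_memory', action='store_true')
    args = parser.parse_args(argv)
    return {
        'batch_size': args.batch_size,
        'shuffle': False,
        'num_workers': args.num_workers,
        'drop_last': True,
        'pin_memory': args.pin_memory,
    }

kwargs = build_dataloader_kwargs(['--num_workers', '4'])
print(kwargs['num_workers'])  # → 4
```

Keeping 8 as the default preserves the old behavior for existing configs while letting users tune worker count per machine.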
@@ -72,6 +72,22 @@ def get_device_temperature(self, idx):
             temp = None
         return temp

+    def get_device_power(self, idx):
+        """Get the realtime power of device, unit: watt.
+
+        Args:
+            idx (int): device index.
+
+        Return:
+            power (float): the realtime power of device, None means failed to get the data.
+        """
+        try:
+            power = nvml.nvmlDeviceGetPowerUsage(self._device_handlers[idx])
+        except Exception as err:
+            logger.error('Get device power failed: {}'.format(str(err)))
+            return None
+        return int(int(power) / 1000)
+
     def get_device_power_limit(self, idx):
         """Get the power management limit of device, unit: watt.
......
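`nvmlDeviceGetPowerUsage` reports power in milliwatts, so the new method divides by 1000 and truncates to whole watts. The conversion in isolation (helper name is illustrative):

```python
def milliwatts_to_watts(power_mw):
    """Convert an NVML power reading (milliwatts) to whole watts,
    truncating toward zero as the monitor's get_device_power does."""
    return int(int(power_mw) / 1000)

print(milliwatts_to_watts(257345))  # → 257 (truncated, not rounded)
```

Truncation loses at most one watt, which is negligible at the 250-400 W scale of datacenter GPUs.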
@@ -182,15 +182,14 @@ def gen_traffic_pattern_host_groups(host_list, pattern, mpi_pattern_path, benchmark_name):
         logger.error('Unsupported traffic pattern: {}'.format(pattern.type))
     host_groups = __convert_config_to_host_group(config, host_list)
     # write traffic pattern host groups to specified path
-    if pattern.mpi_pattern:
-        with open(mpi_pattern_path, 'a') as f:
-            f.write('benchmark_name: {} pattern_type: {}'.format(benchmark_name, pattern.type) + '\n')
-            for host_group in host_groups:
-                row = []
-                for host_list in host_group:
-                    group = ','.join(host_list)
-                    row.append(group)
-                group = ';'.join(row)
-                f.write(group + '\n')
-            f.write('\n')
+    with open(mpi_pattern_path, 'a') as f:
+        f.write('benchmark_name: {} pattern_type: {}'.format(benchmark_name, pattern.type) + '\n')
+        for host_group in host_groups:
+            row = []
+            for host_list in host_group:
+                group = ','.join(host_list)
+                row.append(group)
+            group = ';'.join(row)
+            f.write(group + '\n')
+        f.write('\n')
     return host_groups
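With the unreachable `pattern.mpi_pattern` guard removed, the host groups are always appended to the pattern file. The serialization format (hosts joined by `,`, host lists joined by `;`, one group per line) can be sketched as a pure function (helper name and sample hosts are illustrative):

```python
def serialize_host_groups(benchmark_name, pattern_type, host_groups):
    """Render host groups in the same layout the generator writes:
    a header line, then one line per group, then a blank separator line."""
    lines = ['benchmark_name: {} pattern_type: {}'.format(benchmark_name, pattern_type)]
    for host_group in host_groups:
        # Each host list becomes 'h0,h1'; lists within a group are ';'-separated.
        lines.append(';'.join(','.join(host_list) for host_list in host_group))
    lines.append('')  # trailing blank line between pattern blocks
    return '\n'.join(lines) + '\n'

text = serialize_host_groups('mpi-bw', 'all-nodes', [[['node0', 'node1'], ['node2', 'node3']]])
print(text)
```

This yields `benchmark_name: mpi-bw pattern_type: all-nodes` followed by `node0,node1;node2,node3` and a blank line.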
@@ -3,7 +3,7 @@
 # Server:
 # - Product: HPE Apollo 6500
-version: v0.7
+version: v0.8
 superbench:
   enable: null
   var:
......
@@ -4,7 +4,7 @@
 # - Product: G482-Z53
 # - Link: https://www.gigabyte.cn/FileUpload/Global/MicroSite/553/G482-Z53.html
-version: v0.7
+version: v0.8
 superbench:
   enable: null
   var:
......
-version: v0.7
+version: v0.8
 superbench:
   enable: null
   monitor:
......
-version: v0.7
+version: v0.8
 superbench:
   enable: null
   monitor:
......
-version: v0.7
+version: v0.8
 superbench:
   enable: null
   monitor:
......
@@ -3,7 +3,7 @@
 # Azure NDm A100 v4
 # reference: https://docs.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series
-version: v0.7
+version: v0.8
 superbench:
   enable: null
   monitor:
......
 # SuperBench Config
-version: v0.7
+version: v0.8
 superbench:
   enable: null
   monitor:
......
 # SuperBench Config
-version: v0.7
+version: v0.8
 superbench:
   enable: null
   monitor:
......
@@ -38,16 +38,7 @@ def __init__(self, container_name, sample_duration, sample_interval, output_file):
         self.__unit_MiByte = 1024 * 1024 * 1.0
         self.__output_handler = open(self.__output_file, 'a')
         self.__cgroup = 1
-        output = run_command('grep cgroup /proc/filesystems', quiet=True)
-        if output.returncode != 0:
-            logger.error('Failed to check the cgroup version, will assume using cgroup V1.')
-        else:
-            if 'cgroup2' in output.stdout:
-                self.__cgroup = 2
-        logger.info('cgroup version: {}.'.format(self.__cgroup))

     def __preprocess(self):
         """Preprocess/preparation operations before the monitoring.
@@ -77,13 +68,15 @@ def __preprocess(self):
             container_pid = output.stdout
             try:
-                if self.__cgroup == 1:
-                    self._cpu_file = glob.glob('/sys/fs/cgroup/cpuacct/docker/{}*/cpuacct.stat'.format(container_id))[0]
+                cpu_file_cgroup_v1 = glob.glob('/sys/fs/cgroup/cpuacct/docker/{}*/cpuacct.stat'.format(container_id))
+                if len(cpu_file_cgroup_v1) > 0:
+                    self._cpu_file = cpu_file_cgroup_v1[0]
                     self._mem_file = glob.glob(
                         '/sys/fs/cgroup/memory/docker/{}*/memory.usage_in_bytes'.format(container_id)
                     )[0]
                     self._net_file = '/proc/{}/net/dev'.format(container_pid)
                 else:
+                    self.__cgroup = 2
                     self._cpu_file = glob.glob(
                         '/sys/fs/cgroup/system.slice/docker-{}*.scope/cpu.stat'.format(container_id)
                     )[0]
@@ -99,10 +92,12 @@ def __preprocess(self):
                 )
                 return False
         else:
-            if self.__cgroup == 1:
-                self._cpu_file = '/sys/fs/cgroup/cpuacct/cpuacct.stat'
+            cpu_file_cgroup_v1 = '/sys/fs/cgroup/cpuacct/cpuacct.stat'
+            if os.path.exists(cpu_file_cgroup_v1):
+                self._cpu_file = cpu_file_cgroup_v1
                 self._mem_file = '/sys/fs/cgroup/memory/memory.usage_in_bytes'
             else:
+                self.__cgroup = 2
                 self._cpu_file = '/sys/fs/cgroup/cpu.stat'
                 self._mem_file = '/sys/fs/cgroup/memory.stat'
             self._net_file = '/proc/net/dev'
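The fix replaces up-front parsing of `/proc/filesystems` with an existence check on the cgroup v1 controller path, falling back to the v2 unified hierarchy. The decision logic can be sketched with an injected predicate so it runs and tests anywhere, not only on a Linux host (function name is illustrative):

```python
def detect_cgroup_version(path_exists):
    """Pick the CPU accounting file by probing for the cgroup v1 controller,
    mirroring the monitor's new check. `path_exists` is injected (e.g.
    os.path.exists in production) so the logic is testable off-host."""
    cpu_file_cgroup_v1 = '/sys/fs/cgroup/cpuacct/cpuacct.stat'
    if path_exists(cpu_file_cgroup_v1):
        return 1, cpu_file_cgroup_v1
    # v1 controller absent: assume the cgroup v2 unified hierarchy.
    return 2, '/sys/fs/cgroup/cpu.stat'

version, cpu_file = detect_cgroup_version(lambda p: False)
print(version, cpu_file)  # → 2 /sys/fs/cgroup/cpu.stat
```

Probing the actual file is more robust than grepping `/proc/filesystems`, because a kernel can support cgroup2 while the host is still mounted with the v1 hierarchy.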
@@ -199,6 +194,7 @@ def __sample_gpu_metrics(self, record):
         for i in range(device_count):
             record.gpu_usage.append(dm.device_manager.get_device_utilization(i))
             record.gpu_temperature.append(dm.device_manager.get_device_temperature(i))
+            record.gpu_power.append(dm.device_manager.get_device_power(i))
             record.gpu_power_limit.append(dm.device_manager.get_device_power_limit(i))
             mem_used, mem_total = dm.device_manager.get_device_memory(i)
             record.gpu_mem_used.append(mem_used)
......
@@ -14,6 +14,7 @@ class MonitorRecord:
     """Record class to save all monitoring data."""
     reduce_ops = {
         'gpu_temperature': ReduceType.MAX,
+        'gpu_power': ReduceType.MAX,
         'gpu_power_limit': ReduceType.MIN,
         'gpu_corrected_ecc': ReduceType.LAST,
         'gpu_uncorrected_ecc': ReduceType.LAST,
@@ -28,6 +29,7 @@ def __init__(self):
         self.__mem_total = None
         self.__gpu_usage = list()
         self.__gpu_temperature = list()
+        self.__gpu_power = list()
         self.__gpu_power_limit = list()
         self.__gpu_mem_used = list()
         self.__gpu_mem_total = list()
@@ -112,6 +114,20 @@ def gpu_temperature(self, gpu_temperature):
         """
         self.__gpu_temperature = gpu_temperature

+    @property
+    def gpu_power(self):
+        """Decoration function to access __gpu_power."""
+        return self.__gpu_power
+
+    @gpu_power.setter
+    def gpu_power(self, gpu_power):
+        """Set the gpu realtime power, unit: Watt.
+
+        Args:
+            gpu_power (list): list of gpu realtime power.
+        """
+        self.__gpu_power = gpu_power
+
     @property
     def gpu_power_limit(self):
         """Decoration function to access __gpu_power_limit."""
......
@@ -387,8 +387,9 @@ def __merge_monitor_metrics(self, node_path):
                     metrics_dict[metric].append(value)

         for metric, values in metrics_dict.items():
+            prefix = metric.split(':')[0]
             for pattern, reduce_type in MonitorRecord.reduce_ops.items():
-                if pattern in metric:
+                if pattern == prefix:
                     reduce_func = Reducer.get_reduce_func(reduce_type)
                     metric_name = 'monitor/{}'.format(metric)
                     metrics_summary[metric_name] = reduce_func(values)
......
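This change matters once `gpu_power` joins `reduce_ops`: the old substring test `pattern in metric` would match `'gpu_power'` against `'gpu_power_limit:0'` and apply the wrong reduce op, while comparing the exact prefix before `:` is unambiguous. A standalone illustration (using builtin `max`/`min` in place of the ReduceType indirection):

```python
# Same key order as MonitorRecord.reduce_ops: 'gpu_power' precedes 'gpu_power_limit'.
reduce_ops = {'gpu_power': max, 'gpu_power_limit': min}

def pick_reduce(metric):
    """Match a per-device metric like 'gpu_power_limit:0' to its reduce op
    by exact prefix, not substring containment."""
    prefix = metric.split(':')[0]
    for pattern, func in reduce_ops.items():
        if pattern == prefix:
            return func
    return None

# Substring matching would have returned max (the 'gpu_power' entry) here;
# prefix matching correctly selects min for the power limit.
print(pick_reduce('gpu_power_limit:0').__name__)  # → min
```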
@@ -167,6 +167,7 @@ def test_arguments_related_interfaces():
   --no_gpu              Disable GPU training.
   --num_steps int       The number of test step.
   --num_warmup int      The number of warmup step.
+  --num_workers int     Number of subprocesses to use for data loading.
   --pin_memory          Enable option to pin memory in data loader.
   --precision Precision [Precision ...]
                         Model precision. E.g. fp8_hybrid fp8_e4m3 fp8_e5m2
@@ -206,6 +207,7 @@ def test_preprocess():
   --no_gpu              Disable GPU training.
   --num_steps int       The number of test step.
   --num_warmup int      The number of warmup step.
+  --num_workers int     Number of subprocesses to use for data loading.
   --pin_memory          Enable option to pin memory in data loader.
   --precision Precision [Precision ...]
                         Model precision. E.g. fp8_hybrid fp8_e4m3 fp8_e5m2
......
@@ -44,8 +44,8 @@ def test_monitor(self):
         monitor._Monitor__sample_gpu_metrics(record)
         gpu_list_metrics = [
-            record.gpu_usage, record.gpu_temperature, record.gpu_power_limit, record.gpu_mem_used, record.gpu_mem_total,
-            record.gpu_corrected_ecc, record.gpu_uncorrected_ecc
+            record.gpu_usage, record.gpu_temperature, record.gpu_power, record.gpu_power_limit, record.gpu_mem_used,
+            record.gpu_mem_total, record.gpu_corrected_ecc, record.gpu_uncorrected_ecc
         ]
         for metric in gpu_list_metrics:
             assert (metric)
......
@@ -17,6 +17,7 @@ def test_monitor_record():
     mr.mem_total = 1024
     mr.gpu_usage = [90, 80, 86, 72, 79, 81, 94, 85]
     mr.gpu_temperature = [62, 75, 69, 63, 72, 77, 80, 71]
+    mr.gpu_power = [257, 290, 280, 262, 291, 284, 281, 273]
     mr.gpu_power_limit = [400, 400, 400, 350, 400, 400, 400, 400]
     mr.gpu_mem_used = [2550, 2680, 2543, 2588, 2612, 2603, 2515, 2593]
     mr.gpu_mem_total = [16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216, 16777216]
@@ -59,6 +60,14 @@ def test_monitor_record():
         'gpu_temperature:5': 77,
         'gpu_temperature:6': 80,
         'gpu_temperature:7': 71,
+        'gpu_power:0': 257,
+        'gpu_power:1': 290,
+        'gpu_power:2': 280,
+        'gpu_power:3': 262,
+        'gpu_power:4': 291,
+        'gpu_power:5': 284,
+        'gpu_power:6': 281,
+        'gpu_power:7': 273,
         'gpu_power_limit:0': 400,
         'gpu_power_limit:1': 400,
         'gpu_power_limit:2': 400,
......
---
slug: release-sb-v0.8
title: Releasing SuperBench v0.8
author: Peng Cheng
author_title: SuperBench Team
author_url: https://github.com/cp5555
author_image_url: https://github.com/cp5555.png
tags: [superbench, announcement, release]
---
We are very happy to announce that **SuperBench v0.8.0** is officially released today!
You can install and try SuperBench by following the [Getting Started Tutorial](https://microsoft.github.io/superbenchmark/docs/getting-started/installation).
## SuperBench 0.8.0 Release Notes
### SuperBench Improvements
- Support SuperBench Executor running on Windows.
- Remove fixed rccl version in rocm5.1.x docker file.
- Upgrade networkx version to fix installation compatibility issue.
- Pin setuptools version to v65.7.0.
- Limit ansible_runner version for Python 3.6.
- Support cgroup V2 when reading system metrics in monitor.
- Fix analyzer bug in Python 3.8 due to pandas api change.
- Collect real-time GPU power in monitor.
- Remove unreachable condition when writing host list in mpi mode.
- Upgrade Docker image with cuda12.1, nccl 2.17.1-1, hpcx v2.14, and mlc 3.10.
- Fix wrong unit of cpu-memory-bw-latency in document.
### Micro-benchmark Improvements
- Add STREAM benchmark for sustainable memory bandwidth and the corresponding computation rate.
- Add HPL benchmark for HPC Linpack.
- Support flexible warmup and non-random data initialization in cublas-benchmark.
- Support error tolerance in micro-benchmark for cuDNN functions.
- Add distributed inference benchmark.
- Support tensor core precisions (e.g., FP8) and batch/shape range in cublaslt gemm.
### Model Benchmark Improvements
- Fix torch.dist init issue with multiple models.
- Support TE FP8 in BERT/GPT2 models.
- Add num_workers configurations in model benchmark.
@@ -101,7 +101,7 @@ module.exports = {
     announcementBar: {
       id: 'supportus',
       content:
-        '📢 <a href="https://microsoft.github.io/superbenchmark/blog/release-sb-v0.7">v0.7.0</a> has been released! ' +
+        '📢 <a href="https://microsoft.github.io/superbenchmark/blog/release-sb-v0.8">v0.8.0</a> has been released! ' +
         '⭐️ If you like SuperBench, give it a star on <a target="_blank" rel="noopener noreferrer" href="https://github.com/microsoft/superbenchmark">GitHub</a>! ⭐️',
     },
     algolia:
......
 {
   "name": "superbench-website",
-  "version": "0.7.0",
+  "version": "0.8.0",
   "lockfileVersion": 1,
   "requires": true,
   "dependencies": {
......