Commit 63e9b2d1 authored by Yifan Xiong, committed by GitHub

Release - SuperBench v0.6.0 (#409)



**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
parent 733860d7
@@ -54,7 +54,7 @@ def test_cudnn_functions():
     context = BenchmarkRegistry.create_benchmark_context(
         'cudnn-function',
         platform=Platform.CUDA,
-        parameters='--num_warmup 10 --num_steps 10 --num_in_step 100 --config_json_str ' + custom_config_str
+        parameters=f"--num_warmup 10 --num_steps 10 --num_in_step 100 --config_json_str '{custom_config_str}'"
     )
     assert (BenchmarkRegistry.is_benchmark_context_valid(context))
......
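The test change above quotes the JSON value so that it survives as a single argument even though it contains spaces. A minimal sketch of that idea, assuming shell-style tokenization with the standard library's `shlex` ahead of `argparse`; the helper `parse_parameters` and the sample config are hypothetical and are not taken from SuperBench's own parser:

```python
import argparse
import json
import shlex


def parse_parameters(parameters: str) -> argparse.Namespace:
    """Split a parameter string shell-style, then parse it with argparse.

    Hypothetical helper for illustration; not SuperBench's implementation.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_warmup', type=int, default=0)
    parser.add_argument('--config_json_str', type=str, default=None)
    # shlex.split keeps quoted text as one token and drops the outer quotes.
    return parser.parse_args(shlex.split(parameters))


# Illustrative config only: the value contains spaces and double quotes.
custom_config_str = json.dumps({'algo': [0, 1], 'name': 'conv forward'})
args = parse_parameters(f"--num_warmup 10 --config_json_str '{custom_config_str}'")
config = json.loads(args.config_json_str)
```

Under the same assumption, `shlex.split('--ib_dev "$(echo mlx5_0)"')` yields the tokens `--ib_dev` and `$(echo mlx5_0)`, which is consistent with the unquoted `$(echo mlx5_0)` expected in the IB traffic test's command string.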
@@ -178,14 +178,14 @@ def test_ib_traffic_performance(self, mock_gpu):
         assert (ret is True)
         # Generate config
-        parameters = '--ib_dev mlx5_0 --iters 2000 --msg_size 33554432 --hostfile hostfile'
+        parameters = '--ib_dev "$(echo mlx5_0)" --iters 2000 --msg_size 33554432 --hostfile hostfile'
         benchmark = benchmark_class(benchmark_name, parameters=parameters)
         os.environ['OMPI_COMM_WORLD_SIZE'] = '4'
         ret = benchmark._preprocess()
         Path('config.txt').unlink()
         assert (ret)
         expect_command = "ib_validation --cmd_prefix '" + benchmark._args.bin_dir + \
-            "/ib_write_bw -F -n 2000 -d mlx5_0 -s 33554432 --report_gbits' " + \
+            "/ib_write_bw -F -n 2000 -d $(echo mlx5_0) -s 33554432 --report_gbits' " + \
             f'--timeout 120 --hostfile hostfile --input_config {os.getcwd()}/config.txt'
         command = benchmark._bin_name + benchmark._commands[0].split(benchmark._bin_name)[1]
         assert (command == expect_command)
@@ -206,6 +206,17 @@ def test_ib_traffic_performance(self, mock_gpu):
         command = benchmark._bin_name + benchmark._commands[0].split(benchmark._bin_name)[1]
         assert (command == expect_command)
+        parameters = '--command ib_read_lat --ib_dev mlx5_0 --iters 2000 --msg_size 33554432 ' + \
+            '--pattern one-to-one --hostfile hostfile --gpu_dev 0'
+        mock_gpu.return_value = 'nvidia'
+        benchmark = benchmark_class(benchmark_name, parameters=parameters)
+        ret = benchmark._preprocess()
+        expect_command = "ib_validation --cmd_prefix '" + benchmark._args.bin_dir + \
+            "/ib_read_lat -F -n 2000 -d mlx5_0 -s 33554432 --report_gbits' " + \
+            f'--timeout 120 --hostfile hostfile --input_config {os.getcwd()}/config.txt'
+        command = benchmark._bin_name + benchmark._commands[0].split(benchmark._bin_name)[1]
+        assert (command == expect_command)
         # Custom config
         config = ['0,1', '1,0;0,1', '0,1;1,0', '1,0;0,1']
         with open('test_config.txt', 'w') as f:
......
@@ -118,6 +118,11 @@ def test_sb_result_diagnosis(self):
             'sb result diagnosis -d {dir}/test_results.jsonl -r {dir}/test_rules.yaml -b {dir}/test_baseline.json'.
             format(dir=test_analyzer_dir) + ' --output-dir outputs/test-diagnosis/ --output-all'
         )
+        self.cmd(
+            'sb result diagnosis -d {dir}/test_results.jsonl -r {dir}/test_rules_without_baseline.yaml'.
+            format(dir=test_analyzer_dir) +
+            ' --output-dir outputs/test-diagnosis/ --output-all --output-file-format json'
+        )
         # test invalid output format
         self.cmd(
             'sb result diagnosis -d {dir}/test_results.jsonl -r {dir}/test_rules.yaml -b {dir}/test_baseline.json'.
......
 <table>
 <thead>
 <tr>
-<th>machine</th>
+<th>index</th>
 <th>Category</th>
 <th>Defective Details</th>
 <th>kernel-launch/event_overhead:0</th>
@@ -53,7 +53,7 @@
 <td>-1.17%</td>
 <td>-4.03%</td>
 <td>-1.01%</td>
-<td>0.0</td>
+<td>0</td>
 <td>0.0%</td>
 <td>0.0%</td>
 <td>1.95%</td>
@@ -78,7 +78,7 @@
 <td>0.78%</td>
 <td>-1.17%</td>
 <td>1.95%</td>
-<td>0.0</td>
+<td>0</td>
 </tr>
 <tr>
 <td>sb-validation-03</td>
@@ -92,7 +92,7 @@
 <td>-1.17%</td>
 <td>-4.03%</td>
 <td>-1.01%</td>
-<td>0.0</td>
+<td>0</td>
 <td>0.0%</td>
 <td>0.0%</td>
 <td>1.95%</td>
@@ -101,23 +101,23 @@
 <td>-1.95%</td>
 <td>1.85%</td>
 <td>4.39%</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>nan</td>
-<td>1.0</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>N/A</td>
+<td>1</td>
 </tr>
 </tbody>
 </table>
\ No newline at end of file
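The expected-output changes in this commit follow two rules from the release notes: whole-number values are written as ints (`0.0` becomes `0`) and empty values become `N/A` (`nan` becomes `N/A`). A minimal sketch of such a cell formatter, assuming values arrive as plain Python objects; the function name `format_cell` is hypothetical and this is not the analyzer's actual code:

```python
import math


def format_cell(value):
    """Render whole-number floats as ints and empty values as 'N/A'.

    Hypothetical helper for illustration only.
    """
    # Treat None and empty strings as missing.
    if value is None or value == '':
        return 'N/A'
    if isinstance(value, float):
        # NaN marks a missing metric in the diagnosis table.
        if math.isnan(value):
            return 'N/A'
        # 33878.0 -> 33878, while 33911.1 is left untouched.
        if value.is_integer():
            return int(value)
    return value
```

Strings that are already formatted, such as `'0.0%'`, pass through unchanged, which is consistent with the percentage cells staying the same in the table above.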
 [
   {
+    "index": "sb-validation-01",
+    "diagnosis/accept": false,
+    "diagnosis/issue_num": 1,
+    "diagnosis/category": "KernelLaunch",
+    "diagnosis/issue_details": "kernel-launch/event_overhead:0(B/L: 0.0060 VAL: 0.1000 VAR: 1577.85% Rule:lambda x:x>0.05)",
     "bert_models/pytorch-bert-base/steptime_train_float32": 114.5916701062,
     "bert_models/pytorch-bert-base/throughput_train_float32": 279.8794623591,
     "bert_models/pytorch-bert-base/steptime_train_float16": 83.8895108318,
@@ -48,7 +53,7 @@
     "gemm-flops/FP32:5": 18347.1,
     "gemm-flops/FP32:6": 18247.4,
     "gemm-flops/FP32:7": 18318.4,
-    "gemm-flops/FP16:0": 33878.0,
+    "gemm-flops/FP16:0": 33878,
     "gemm-flops/FP16:1": 33911.1,
     "gemm-flops/FP16:2": 33769.3,
     "gemm-flops/FP16:3": 33909.9,
@@ -60,50 +65,50 @@
     "gemm-flops/FP64_TC:1": 18924.2,
     "gemm-flops/FP64_TC:2": 18930.3,
     "gemm-flops/FP64_TC:3": 18971.9,
-    "gemm-flops/FP64_TC:4": 18946.0,
-    "gemm-flops/FP64_TC:5": 18945.0,
+    "gemm-flops/FP64_TC:4": 18946,
+    "gemm-flops/FP64_TC:5": 18945,
     "gemm-flops/FP64_TC:6": 18822.9,
     "gemm-flops/FP64_TC:7": 18911.1,
-    "gemm-flops/TF32_TC:0": 127900.0,
-    "gemm-flops/TF32_TC:1": 129094.0,
-    "gemm-flops/TF32_TC:2": 127831.0,
-    "gemm-flops/TF32_TC:3": 128709.0,
-    "gemm-flops/TF32_TC:4": 127388.0,
-    "gemm-flops/TF32_TC:5": 127861.0,
-    "gemm-flops/TF32_TC:6": 128492.0,
-    "gemm-flops/TF32_TC:7": 127720.0,
-    "gemm-flops/BF16_TC:0": 264965.0,
-    "gemm-flops/BF16_TC:1": 266638.0,
-    "gemm-flops/BF16_TC:2": 263151.0,
-    "gemm-flops/BF16_TC:3": 264752.0,
-    "gemm-flops/BF16_TC:4": 263049.0,
-    "gemm-flops/BF16_TC:5": 266605.0,
-    "gemm-flops/BF16_TC:6": 267501.0,
-    "gemm-flops/BF16_TC:7": 263880.0,
-    "gemm-flops/FP16_TC:0": 279474.0,
-    "gemm-flops/FP16_TC:1": 281256.0,
-    "gemm-flops/FP16_TC:2": 277403.0,
-    "gemm-flops/FP16_TC:3": 279147.0,
-    "gemm-flops/FP16_TC:4": 277587.0,
-    "gemm-flops/FP16_TC:5": 281537.0,
-    "gemm-flops/FP16_TC:6": 282132.0,
-    "gemm-flops/FP16_TC:7": 277788.0,
-    "gemm-flops/INT8_TC:0": 475160.0,
-    "gemm-flops/INT8_TC:1": 477725.0,
-    "gemm-flops/INT8_TC:2": 471621.0,
-    "gemm-flops/INT8_TC:3": 473716.0,
-    "gemm-flops/INT8_TC:4": 472124.0,
-    "gemm-flops/INT8_TC:5": 479972.0,
-    "gemm-flops/INT8_TC:6": 481327.0,
-    "gemm-flops/INT8_TC:7": 474710.0,
-    "gemm-flops/INT4_TC:0": 970330.0,
-    "gemm-flops/INT4_TC:1": 976837.0,
-    "gemm-flops/INT4_TC:2": 966003.0,
-    "gemm-flops/INT4_TC:3": 971315.0,
-    "gemm-flops/INT4_TC:4": 964441.0,
-    "gemm-flops/INT4_TC:5": 982461.0,
-    "gemm-flops/INT4_TC:6": 979610.0,
-    "gemm-flops/INT4_TC:7": 968359.0,
+    "gemm-flops/TF32_TC:0": 127900,
+    "gemm-flops/TF32_TC:1": 129094,
+    "gemm-flops/TF32_TC:2": 127831,
+    "gemm-flops/TF32_TC:3": 128709,
+    "gemm-flops/TF32_TC:4": 127388,
+    "gemm-flops/TF32_TC:5": 127861,
+    "gemm-flops/TF32_TC:6": 128492,
+    "gemm-flops/TF32_TC:7": 127720,
+    "gemm-flops/BF16_TC:0": 264965,
+    "gemm-flops/BF16_TC:1": 266638,
+    "gemm-flops/BF16_TC:2": 263151,
+    "gemm-flops/BF16_TC:3": 264752,
+    "gemm-flops/BF16_TC:4": 263049,
+    "gemm-flops/BF16_TC:5": 266605,
+    "gemm-flops/BF16_TC:6": 267501,
+    "gemm-flops/BF16_TC:7": 263880,
+    "gemm-flops/FP16_TC:0": 279474,
+    "gemm-flops/FP16_TC:1": 281256,
+    "gemm-flops/FP16_TC:2": 277403,
+    "gemm-flops/FP16_TC:3": 279147,
+    "gemm-flops/FP16_TC:4": 277587,
+    "gemm-flops/FP16_TC:5": 281537,
+    "gemm-flops/FP16_TC:6": 282132,
+    "gemm-flops/FP16_TC:7": 277788,
+    "gemm-flops/INT8_TC:0": 475160,
+    "gemm-flops/INT8_TC:1": 477725,
+    "gemm-flops/INT8_TC:2": 471621,
+    "gemm-flops/INT8_TC:3": 473716,
+    "gemm-flops/INT8_TC:4": 472124,
+    "gemm-flops/INT8_TC:5": 479972,
+    "gemm-flops/INT8_TC:6": 481327,
+    "gemm-flops/INT8_TC:7": 474710,
+    "gemm-flops/INT4_TC:0": 970330,
+    "gemm-flops/INT4_TC:1": 976837,
+    "gemm-flops/INT4_TC:2": 966003,
+    "gemm-flops/INT4_TC:3": 971315,
+    "gemm-flops/INT4_TC:4": 964441,
+    "gemm-flops/INT4_TC:5": 982461,
+    "gemm-flops/INT4_TC:6": 979610,
+    "gemm-flops/INT4_TC:7": 968359,
     "gpt_models/pytorch-gpt2-large/steptime_train_float32": 295.0526971836,
     "gpt_models/pytorch-gpt2-large/throughput_train_float32": 27.1154543969,
     "gpt_models/pytorch-gpt2-large/steptime_train_float16": 194.4957742235,
@@ -292,7 +297,7 @@
     "ib-loopback/IB_write_2097152_Avg_7:0": 23930.64,
     "ib-loopback/IB_write_4194304_Avg_7:0": 23845.63,
     "ib-loopback/IB_write_8388608_Avg_7:0": 23896.94,
-    "kernel-launch/return_code": 0.0,
+    "kernel-launch/return_code": 0,
     "kernel-launch/event_overhead:0": 0.1,
     "kernel-launch/event_overhead:1": 0.00595,
     "kernel-launch/event_overhead:2": 0.00557,
@@ -314,10 +319,10 @@
     "lstm_models/pytorch-lstm/steptime_train_float16": 25.9531298652,
     "lstm_models/pytorch-lstm/throughput_train_float16": 9069.9080925588,
     "pytorch-matmul/nosharding": 34.6449975967,
-    "mem-bw/return_code": 0.0,
+    "mem-bw/return_code": 0,
     "mem-bw/H2D_Mem_BW:0": 25.6,
     "mem-bw/H2D_Mem_BW:1": 25.8,
-    "mem-bw/H2D_Mem_BW:2": 26.0,
+    "mem-bw/H2D_Mem_BW:2": 26,
     "mem-bw/H2D_Mem_BW:3": 26.1,
     "mem-bw/H2D_Mem_BW:4": 26.2,
     "mem-bw/H2D_Mem_BW:5": 25.8,
@@ -331,7 +336,7 @@
     "mem-bw/D2H_Mem_BW:5": 24.3,
     "mem-bw/D2H_Mem_BW:6": 23.9,
     "mem-bw/D2H_Mem_BW:7": 24.6,
-    "mem-bw/D2D_Mem_BW:0": 1118.0,
+    "mem-bw/D2D_Mem_BW:0": 1118,
     "mem-bw/D2D_Mem_BW:1": 1114.6,
     "mem-bw/D2D_Mem_BW:2": 1119.7,
     "mem-bw/D2D_Mem_BW:3": 1121.9,
@@ -339,20 +344,20 @@
     "mem-bw/D2D_Mem_BW:5": 1110.1,
     "mem-bw/D2D_Mem_BW:6": 1123.3,
     "mem-bw/D2D_Mem_BW:7": 1117.6,
-    "nccl-bw/allreduce_8_busbw:0": 0.0,
-    "nccl-bw/allreduce_8_algbw:0": 0.0,
+    "nccl-bw/allreduce_8_busbw:0": 0,
+    "nccl-bw/allreduce_8_algbw:0": 0,
     "nccl-bw/allreduce_8_time:0": 37.84,
-    "nccl-bw/allreduce_16_busbw:0": 0.0,
-    "nccl-bw/allreduce_16_algbw:0": 0.0,
+    "nccl-bw/allreduce_16_busbw:0": 0,
+    "nccl-bw/allreduce_16_algbw:0": 0,
     "nccl-bw/allreduce_16_time:0": 36.42,
-    "nccl-bw/allreduce_32_busbw:0": 0.0,
-    "nccl-bw/allreduce_32_algbw:0": 0.0,
+    "nccl-bw/allreduce_32_busbw:0": 0,
+    "nccl-bw/allreduce_32_algbw:0": 0,
     "nccl-bw/allreduce_32_time:0": 36.87,
-    "nccl-bw/allreduce_64_busbw:0": 0.0,
-    "nccl-bw/allreduce_64_algbw:0": 0.0,
+    "nccl-bw/allreduce_64_busbw:0": 0,
+    "nccl-bw/allreduce_64_algbw:0": 0,
     "nccl-bw/allreduce_64_time:0": 35.83,
     "nccl-bw/allreduce_128_busbw:0": 0.01,
-    "nccl-bw/allreduce_128_algbw:0": 0.0,
+    "nccl-bw/allreduce_128_algbw:0": 0,
     "nccl-bw/allreduce_128_time:0": 36.91,
     "nccl-bw/allreduce_256_busbw:0": 0.01,
     "nccl-bw/allreduce_256_algbw:0": 0.01,
@@ -378,7 +383,7 @@
     "nccl-bw/allreduce_32768_busbw:0": 1.52,
     "nccl-bw/allreduce_32768_algbw:0": 0.87,
     "nccl-bw/allreduce_32768_time:0": 37.64,
-    "nccl-bw/allreduce_65536_busbw:0": 3.0,
+    "nccl-bw/allreduce_65536_busbw:0": 3,
     "nccl-bw/allreduce_65536_algbw:0": 1.71,
     "nccl-bw/allreduce_65536_time:0": 38.22,
     "nccl-bw/allreduce_131072_busbw:0": 5.31,
@@ -401,7 +406,7 @@
     "nccl-bw/allreduce_4194304_time:0": 111.6,
     "nccl-bw/allreduce_8388608_busbw:0": 89.51,
     "nccl-bw/allreduce_8388608_algbw:0": 51.15,
-    "nccl-bw/allreduce_8388608_time:0": 164.0,
+    "nccl-bw/allreduce_8388608_time:0": 164,
     "nccl-bw/allreduce_16777216_busbw:0": 114.38,
     "nccl-bw/allreduce_16777216_algbw:0": 65.36,
     "nccl-bw/allreduce_16777216_time:0": 256.7,
@@ -425,13 +430,13 @@
     "nccl-bw/allreduce_1073741824_time:0": 8164.5,
     "nccl-bw/allreduce_2147483648_busbw:0": 231.89,
     "nccl-bw/allreduce_2147483648_algbw:0": 132.51,
-    "nccl-bw/allreduce_2147483648_time:0": 16207.0,
+    "nccl-bw/allreduce_2147483648_time:0": 16207,
     "nccl-bw/allreduce_4294967296_busbw:0": 234.45,
     "nccl-bw/allreduce_4294967296_algbw:0": 133.97,
-    "nccl-bw/allreduce_4294967296_time:0": 32059.0,
+    "nccl-bw/allreduce_4294967296_time:0": 32059,
     "nccl-bw/allreduce_8589934592_busbw:0": 235.36,
     "nccl-bw/allreduce_8589934592_algbw:0": 134.49,
-    "nccl-bw/allreduce_8589934592_time:0": 63870.0,
+    "nccl-bw/allreduce_8589934592_time:0": 63870,
     "resnet_models/pytorch-resnet50/steptime_train_float32": 253.9552273229,
     "resnet_models/pytorch-resnet50/throughput_train_float32": 760.334809913,
     "resnet_models/pytorch-resnet50/steptime_train_float16": 200.0860618427,
@@ -461,14 +466,14 @@
     "vgg_models/pytorch-vgg19/steptime_train_float32": 74.9348710524,
     "vgg_models/pytorch-vgg19/throughput_train_float32": 429.8092158311,
     "vgg_models/pytorch-vgg19/steptime_train_float16": 45.2033062465,
-    "vgg_models/pytorch-vgg19/throughput_train_float16": 709.1127328377,
-    "diagnosis/accept": false,
-    "diagnosis/issue_num": 1,
-    "diagnosis/category": "KernelLaunch",
-    "diagnosis/issue_details": "kernel-launch/event_overhead:0(B/L: 0.0060 VAL: 0.1000 VAR: 1577.85% Rule:lambda x:x>0.05)",
-    "Index": "sb-validation-01"
+    "vgg_models/pytorch-vgg19/throughput_train_float16": 709.1127328377
   },
   {
+    "index": "sb-validation-02",
+    "diagnosis/accept": true,
+    "diagnosis/issue_num": 0,
+    "diagnosis/category": "N/A",
+    "diagnosis/issue_details": "N/A",
     "bert_models/pytorch-bert-base/steptime_train_float32": 114.5916701062,
     "bert_models/pytorch-bert-base/throughput_train_float32": 279.8794623591,
     "bert_models/pytorch-bert-base/steptime_train_float16": 83.8895108318,
@@ -517,7 +522,7 @@
     "gemm-flops/FP32:5": 18347.1,
     "gemm-flops/FP32:6": 18247.4,
     "gemm-flops/FP32:7": 18318.4,
-    "gemm-flops/FP16:0": 33878.0,
+    "gemm-flops/FP16:0": 33878,
     "gemm-flops/FP16:1": 33911.1,
     "gemm-flops/FP16:2": 33769.3,
     "gemm-flops/FP16:3": 33909.9,
@@ -529,50 +534,50 @@
     "gemm-flops/FP64_TC:1": 18924.2,
     "gemm-flops/FP64_TC:2": 18930.3,
     "gemm-flops/FP64_TC:3": 18971.9,
-    "gemm-flops/FP64_TC:4": 18946.0,
-    "gemm-flops/FP64_TC:5": 18945.0,
+    "gemm-flops/FP64_TC:4": 18946,
+    "gemm-flops/FP64_TC:5": 18945,
     "gemm-flops/FP64_TC:6": 18822.9,
     "gemm-flops/FP64_TC:7": 18911.1,
-    "gemm-flops/TF32_TC:0": 127900.0,
-    "gemm-flops/TF32_TC:1": 129094.0,
-    "gemm-flops/TF32_TC:2": 127831.0,
-    "gemm-flops/TF32_TC:3": 128709.0,
-    "gemm-flops/TF32_TC:4": 127388.0,
-    "gemm-flops/TF32_TC:5": 127861.0,
-    "gemm-flops/TF32_TC:6": 128492.0,
-    "gemm-flops/TF32_TC:7": 127720.0,
-    "gemm-flops/BF16_TC:0": 264965.0,
-    "gemm-flops/BF16_TC:1": 266638.0,
-    "gemm-flops/BF16_TC:2": 263151.0,
-    "gemm-flops/BF16_TC:3": 264752.0,
-    "gemm-flops/BF16_TC:4": 263049.0,
-    "gemm-flops/BF16_TC:5": 266605.0,
-    "gemm-flops/BF16_TC:6": 267501.0,
-    "gemm-flops/BF16_TC:7": 263880.0,
-    "gemm-flops/FP16_TC:0": 279474.0,
-    "gemm-flops/FP16_TC:1": 281256.0,
-    "gemm-flops/FP16_TC:2": 277403.0,
-    "gemm-flops/FP16_TC:3": 279147.0,
-    "gemm-flops/FP16_TC:4": 277587.0,
-    "gemm-flops/FP16_TC:5": 281537.0,
-    "gemm-flops/FP16_TC:6": 282132.0,
-    "gemm-flops/FP16_TC:7": 277788.0,
-    "gemm-flops/INT8_TC:0": 475160.0,
-    "gemm-flops/INT8_TC:1": 477725.0,
-    "gemm-flops/INT8_TC:2": 471621.0,
-    "gemm-flops/INT8_TC:3": 473716.0,
-    "gemm-flops/INT8_TC:4": 472124.0,
-    "gemm-flops/INT8_TC:5": 479972.0,
-    "gemm-flops/INT8_TC:6": 481327.0,
-    "gemm-flops/INT8_TC:7": 474710.0,
-    "gemm-flops/INT4_TC:0": 970330.0,
-    "gemm-flops/INT4_TC:1": 976837.0,
-    "gemm-flops/INT4_TC:2": 966003.0,
-    "gemm-flops/INT4_TC:3": 971315.0,
-    "gemm-flops/INT4_TC:4": 964441.0,
-    "gemm-flops/INT4_TC:5": 982461.0,
-    "gemm-flops/INT4_TC:6": 979610.0,
-    "gemm-flops/INT4_TC:7": 968359.0,
+    "gemm-flops/TF32_TC:0": 127900,
+    "gemm-flops/TF32_TC:1": 129094,
+    "gemm-flops/TF32_TC:2": 127831,
+    "gemm-flops/TF32_TC:3": 128709,
+    "gemm-flops/TF32_TC:4": 127388,
+    "gemm-flops/TF32_TC:5": 127861,
+    "gemm-flops/TF32_TC:6": 128492,
+    "gemm-flops/TF32_TC:7": 127720,
+    "gemm-flops/BF16_TC:0": 264965,
+    "gemm-flops/BF16_TC:1": 266638,
+    "gemm-flops/BF16_TC:2": 263151,
+    "gemm-flops/BF16_TC:3": 264752,
+    "gemm-flops/BF16_TC:4": 263049,
+    "gemm-flops/BF16_TC:5": 266605,
+    "gemm-flops/BF16_TC:6": 267501,
+    "gemm-flops/BF16_TC:7": 263880,
+    "gemm-flops/FP16_TC:0": 279474,
+    "gemm-flops/FP16_TC:1": 281256,
+    "gemm-flops/FP16_TC:2": 277403,
+    "gemm-flops/FP16_TC:3": 279147,
+    "gemm-flops/FP16_TC:4": 277587,
+    "gemm-flops/FP16_TC:5": 281537,
+    "gemm-flops/FP16_TC:6": 282132,
+    "gemm-flops/FP16_TC:7": 277788,
+    "gemm-flops/INT8_TC:0": 475160,
+    "gemm-flops/INT8_TC:1": 477725,
+    "gemm-flops/INT8_TC:2": 471621,
+    "gemm-flops/INT8_TC:3": 473716,
+    "gemm-flops/INT8_TC:4": 472124,
+    "gemm-flops/INT8_TC:5": 479972,
+    "gemm-flops/INT8_TC:6": 481327,
+    "gemm-flops/INT8_TC:7": 474710,
+    "gemm-flops/INT4_TC:0": 970330,
+    "gemm-flops/INT4_TC:1": 976837,
+    "gemm-flops/INT4_TC:2": 966003,
+    "gemm-flops/INT4_TC:3": 971315,
+    "gemm-flops/INT4_TC:4": 964441,
+    "gemm-flops/INT4_TC:5": 982461,
+    "gemm-flops/INT4_TC:6": 979610,
+    "gemm-flops/INT4_TC:7": 968359,
     "gpt_models/pytorch-gpt2-large/steptime_train_float32": 295.0526971836,
     "gpt_models/pytorch-gpt2-large/throughput_train_float32": 27.1154543969,
     "gpt_models/pytorch-gpt2-large/steptime_train_float16": 194.4957742235,
@@ -761,7 +766,7 @@
     "ib-loopback/IB_write_2097152_Avg_7:0": 23930.64,
     "ib-loopback/IB_write_4194304_Avg_7:0": 23845.63,
     "ib-loopback/IB_write_8388608_Avg_7:0": 23896.94,
-    "kernel-launch/return_code": 0.0,
+    "kernel-launch/return_code": 0,
     "kernel-launch/event_overhead:0": 0.00595,
     "kernel-launch/event_overhead:1": 0.00595,
     "kernel-launch/event_overhead:2": 0.00557,
@@ -783,10 +788,10 @@
     "lstm_models/pytorch-lstm/steptime_train_float16": 25.9531298652,
     "lstm_models/pytorch-lstm/throughput_train_float16": 9069.9080925588,
     "pytorch-matmul/nosharding": 34.6449975967,
-    "mem-bw/return_code": 0.0,
+    "mem-bw/return_code": 0,
     "mem-bw/H2D_Mem_BW:0": 25.6,
     "mem-bw/H2D_Mem_BW:1": 25.8,
-    "mem-bw/H2D_Mem_BW:2": 26.0,
+    "mem-bw/H2D_Mem_BW:2": 26,
     "mem-bw/H2D_Mem_BW:3": 26.1,
     "mem-bw/H2D_Mem_BW:4": 26.2,
     "mem-bw/H2D_Mem_BW:5": 25.8,
@@ -800,7 +805,7 @@
     "mem-bw/D2H_Mem_BW:5": 24.3,
     "mem-bw/D2H_Mem_BW:6": 23.9,
     "mem-bw/D2H_Mem_BW:7": 24.6,
-    "mem-bw/D2D_Mem_BW:0": 1118.0,
+    "mem-bw/D2D_Mem_BW:0": 1118,
     "mem-bw/D2D_Mem_BW:1": 1114.6,
     "mem-bw/D2D_Mem_BW:2": 1119.7,
     "mem-bw/D2D_Mem_BW:3": 1121.9,
@@ -808,20 +813,20 @@
     "mem-bw/D2D_Mem_BW:5": 1110.1,
     "mem-bw/D2D_Mem_BW:6": 1123.3,
     "mem-bw/D2D_Mem_BW:7": 1117.6,
-    "nccl-bw/allreduce_8_busbw:0": 0.0,
-    "nccl-bw/allreduce_8_algbw:0": 0.0,
+    "nccl-bw/allreduce_8_busbw:0": 0,
+    "nccl-bw/allreduce_8_algbw:0": 0,
     "nccl-bw/allreduce_8_time:0": 37.84,
-    "nccl-bw/allreduce_16_busbw:0": 0.0,
-    "nccl-bw/allreduce_16_algbw:0": 0.0,
+    "nccl-bw/allreduce_16_busbw:0": 0,
+    "nccl-bw/allreduce_16_algbw:0": 0,
     "nccl-bw/allreduce_16_time:0": 36.42,
-    "nccl-bw/allreduce_32_busbw:0": 0.0,
-    "nccl-bw/allreduce_32_algbw:0": 0.0,
+    "nccl-bw/allreduce_32_busbw:0": 0,
+    "nccl-bw/allreduce_32_algbw:0": 0,
     "nccl-bw/allreduce_32_time:0": 36.87,
-    "nccl-bw/allreduce_64_busbw:0": 0.0,
-    "nccl-bw/allreduce_64_algbw:0": 0.0,
+    "nccl-bw/allreduce_64_busbw:0": 0,
+    "nccl-bw/allreduce_64_algbw:0": 0,
     "nccl-bw/allreduce_64_time:0": 35.83,
     "nccl-bw/allreduce_128_busbw:0": 0.01,
-    "nccl-bw/allreduce_128_algbw:0": 0.0,
+    "nccl-bw/allreduce_128_algbw:0": 0,
     "nccl-bw/allreduce_128_time:0": 36.91,
     "nccl-bw/allreduce_256_busbw:0": 0.01,
     "nccl-bw/allreduce_256_algbw:0": 0.01,
@@ -847,7 +852,7 @@
     "nccl-bw/allreduce_32768_busbw:0": 1.52,
     "nccl-bw/allreduce_32768_algbw:0": 0.87,
     "nccl-bw/allreduce_32768_time:0": 37.64,
-    "nccl-bw/allreduce_65536_busbw:0": 3.0,
+    "nccl-bw/allreduce_65536_busbw:0": 3,
     "nccl-bw/allreduce_65536_algbw:0": 1.71,
     "nccl-bw/allreduce_65536_time:0": 38.22,
     "nccl-bw/allreduce_131072_busbw:0": 5.31,
@@ -870,7 +875,7 @@
     "nccl-bw/allreduce_4194304_time:0": 111.6,
     "nccl-bw/allreduce_8388608_busbw:0": 89.51,
     "nccl-bw/allreduce_8388608_algbw:0": 51.15,
-    "nccl-bw/allreduce_8388608_time:0": 164.0,
+    "nccl-bw/allreduce_8388608_time:0": 164,
     "nccl-bw/allreduce_16777216_busbw:0": 114.38,
     "nccl-bw/allreduce_16777216_algbw:0": 65.36,
     "nccl-bw/allreduce_16777216_time:0": 256.7,
@@ -894,13 +899,13 @@
     "nccl-bw/allreduce_1073741824_time:0": 8164.5,
     "nccl-bw/allreduce_2147483648_busbw:0": 231.89,
     "nccl-bw/allreduce_2147483648_algbw:0": 132.51,
-    "nccl-bw/allreduce_2147483648_time:0": 16207.0,
+    "nccl-bw/allreduce_2147483648_time:0": 16207,
     "nccl-bw/allreduce_4294967296_busbw:0": 234.45,
     "nccl-bw/allreduce_4294967296_algbw:0": 133.97,
-    "nccl-bw/allreduce_4294967296_time:0": 32059.0,
+    "nccl-bw/allreduce_4294967296_time:0": 32059,
     "nccl-bw/allreduce_8589934592_busbw:0": 235.36,
     "nccl-bw/allreduce_8589934592_algbw:0": 134.49,
-    "nccl-bw/allreduce_8589934592_time:0": 63870.0,
+    "nccl-bw/allreduce_8589934592_time:0": 63870,
     "resnet_models/pytorch-resnet50/steptime_train_float32": 253.9552273229,
     "resnet_models/pytorch-resnet50/throughput_train_float32": 760.334809913,
     "resnet_models/pytorch-resnet50/steptime_train_float16": 200.0860618427,
@@ -930,14 +935,14 @@
     "vgg_models/pytorch-vgg19/steptime_train_float32": 74.9348710524,
     "vgg_models/pytorch-vgg19/throughput_train_float32": 429.8092158311,
     "vgg_models/pytorch-vgg19/steptime_train_float16": 45.2033062465,
-    "vgg_models/pytorch-vgg19/throughput_train_float16": 709.1127328377,
-    "diagnosis/accept": true,
-    "diagnosis/issue_num": 0,
-    "diagnosis/category": "",
-    "diagnosis/issue_details": "",
-    "Index": "sb-validation-02"
+    "vgg_models/pytorch-vgg19/throughput_train_float16": 709.1127328377
   },
   {
+    "index": "sb-validation-03",
+    "diagnosis/accept": false,
+    "diagnosis/issue_num": 17,
+    "diagnosis/category": "FailedTest",
+    "diagnosis/issue_details": "mem-bw/D2H_Mem_BW:0_miss,mem-bw/D2H_Mem_BW:1_miss,mem-bw/D2H_Mem_BW:2_miss,mem-bw/D2H_Mem_BW:3_miss,mem-bw/D2H_Mem_BW:4_miss,mem-bw/D2H_Mem_BW:5_miss,mem-bw/D2H_Mem_BW:6_miss,mem-bw/D2H_Mem_BW:7_miss,mem-bw/H2D_Mem_BW:0_miss,mem-bw/H2D_Mem_BW:1_miss,mem-bw/H2D_Mem_BW:2_miss,mem-bw/H2D_Mem_BW:3_miss,mem-bw/H2D_Mem_BW:4_miss,mem-bw/H2D_Mem_BW:5_miss,mem-bw/H2D_Mem_BW:6_miss,mem-bw/H2D_Mem_BW:7_miss,mem-bw/return_code(VAL: 1.0000 Rule:lambda x:x>0)",
     "bert_models/pytorch-bert-base/steptime_train_float32": 114.5916701062,
     "bert_models/pytorch-bert-base/throughput_train_float32": 279.8794623591,
     "bert_models/pytorch-bert-base/steptime_train_float16": 83.8895108318,
@@ -986,7 +991,7 @@
     "gemm-flops/FP32:5": 18347.1,
     "gemm-flops/FP32:6": 18247.4,
     "gemm-flops/FP32:7": 18318.4,
-    "gemm-flops/FP16:0": 33878.0,
+    "gemm-flops/FP16:0": 33878,
     "gemm-flops/FP16:1": 33911.1,
     "gemm-flops/FP16:2": 33769.3,
     "gemm-flops/FP16:3": 33909.9,
...@@ -998,50 +1003,50 @@ ...@@ -998,50 +1003,50 @@
"gemm-flops/FP64_TC:1": 18924.2, "gemm-flops/FP64_TC:1": 18924.2,
"gemm-flops/FP64_TC:2": 18930.3, "gemm-flops/FP64_TC:2": 18930.3,
"gemm-flops/FP64_TC:3": 18971.9, "gemm-flops/FP64_TC:3": 18971.9,
"gemm-flops/FP64_TC:4": 18946.0, "gemm-flops/FP64_TC:4": 18946,
"gemm-flops/FP64_TC:5": 18945.0, "gemm-flops/FP64_TC:5": 18945,
"gemm-flops/FP64_TC:6": 18822.9, "gemm-flops/FP64_TC:6": 18822.9,
"gemm-flops/FP64_TC:7": 18911.1, "gemm-flops/FP64_TC:7": 18911.1,
"gemm-flops/TF32_TC:0": 127900.0, "gemm-flops/TF32_TC:0": 127900,
"gemm-flops/TF32_TC:1": 129094.0, "gemm-flops/TF32_TC:1": 129094,
"gemm-flops/TF32_TC:2": 127831.0, "gemm-flops/TF32_TC:2": 127831,
"gemm-flops/TF32_TC:3": 128709.0, "gemm-flops/TF32_TC:3": 128709,
"gemm-flops/TF32_TC:4": 127388.0, "gemm-flops/TF32_TC:4": 127388,
"gemm-flops/TF32_TC:5": 127861.0, "gemm-flops/TF32_TC:5": 127861,
"gemm-flops/TF32_TC:6": 128492.0, "gemm-flops/TF32_TC:6": 128492,
"gemm-flops/TF32_TC:7": 127720.0, "gemm-flops/TF32_TC:7": 127720,
"gemm-flops/BF16_TC:0": 264965.0, "gemm-flops/BF16_TC:0": 264965,
"gemm-flops/BF16_TC:1": 266638.0, "gemm-flops/BF16_TC:1": 266638,
"gemm-flops/BF16_TC:2": 263151.0, "gemm-flops/BF16_TC:2": 263151,
"gemm-flops/BF16_TC:3": 264752.0, "gemm-flops/BF16_TC:3": 264752,
"gemm-flops/BF16_TC:4": 263049.0, "gemm-flops/BF16_TC:4": 263049,
"gemm-flops/BF16_TC:5": 266605.0, "gemm-flops/BF16_TC:5": 266605,
"gemm-flops/BF16_TC:6": 267501.0, "gemm-flops/BF16_TC:6": 267501,
"gemm-flops/BF16_TC:7": 263880.0, "gemm-flops/BF16_TC:7": 263880,
"gemm-flops/FP16_TC:0": 279474.0, "gemm-flops/FP16_TC:0": 279474,
"gemm-flops/FP16_TC:1": 281256.0, "gemm-flops/FP16_TC:1": 281256,
"gemm-flops/FP16_TC:2": 277403.0, "gemm-flops/FP16_TC:2": 277403,
"gemm-flops/FP16_TC:3": 279147.0, "gemm-flops/FP16_TC:3": 279147,
"gemm-flops/FP16_TC:4": 277587.0, "gemm-flops/FP16_TC:4": 277587,
"gemm-flops/FP16_TC:5": 281537.0, "gemm-flops/FP16_TC:5": 281537,
"gemm-flops/FP16_TC:6": 282132.0, "gemm-flops/FP16_TC:6": 282132,
"gemm-flops/FP16_TC:7": 277788.0, "gemm-flops/FP16_TC:7": 277788,
"gemm-flops/INT8_TC:0": 475160.0, "gemm-flops/INT8_TC:0": 475160,
"gemm-flops/INT8_TC:1": 477725.0, "gemm-flops/INT8_TC:1": 477725,
"gemm-flops/INT8_TC:2": 471621.0, "gemm-flops/INT8_TC:2": 471621,
"gemm-flops/INT8_TC:3": 473716.0, "gemm-flops/INT8_TC:3": 473716,
"gemm-flops/INT8_TC:4": 472124.0, "gemm-flops/INT8_TC:4": 472124,
"gemm-flops/INT8_TC:5": 479972.0, "gemm-flops/INT8_TC:5": 479972,
"gemm-flops/INT8_TC:6": 481327.0, "gemm-flops/INT8_TC:6": 481327,
"gemm-flops/INT8_TC:7": 474710.0, "gemm-flops/INT8_TC:7": 474710,
"gemm-flops/INT4_TC:0": 970330.0, "gemm-flops/INT4_TC:0": 970330,
"gemm-flops/INT4_TC:1": 976837.0, "gemm-flops/INT4_TC:1": 976837,
"gemm-flops/INT4_TC:2": 966003.0, "gemm-flops/INT4_TC:2": 966003,
"gemm-flops/INT4_TC:3": 971315.0, "gemm-flops/INT4_TC:3": 971315,
"gemm-flops/INT4_TC:4": 964441.0, "gemm-flops/INT4_TC:4": 964441,
"gemm-flops/INT4_TC:5": 982461.0, "gemm-flops/INT4_TC:5": 982461,
"gemm-flops/INT4_TC:6": 979610.0, "gemm-flops/INT4_TC:6": 979610,
"gemm-flops/INT4_TC:7": 968359.0, "gemm-flops/INT4_TC:7": 968359,
"gpt_models/pytorch-gpt2-large/steptime_train_float32": 295.0526971836, "gpt_models/pytorch-gpt2-large/steptime_train_float32": 295.0526971836,
"gpt_models/pytorch-gpt2-large/throughput_train_float32": 27.1154543969, "gpt_models/pytorch-gpt2-large/throughput_train_float32": 27.1154543969,
"gpt_models/pytorch-gpt2-large/steptime_train_float16": 194.4957742235, "gpt_models/pytorch-gpt2-large/steptime_train_float16": 194.4957742235,
...@@ -1230,7 +1235,7 @@ ...@@ -1230,7 +1235,7 @@
"ib-loopback/IB_write_2097152_Avg_7:0": 23930.64, "ib-loopback/IB_write_2097152_Avg_7:0": 23930.64,
"ib-loopback/IB_write_4194304_Avg_7:0": 23845.63, "ib-loopback/IB_write_4194304_Avg_7:0": 23845.63,
"ib-loopback/IB_write_8388608_Avg_7:0": 23896.94, "ib-loopback/IB_write_8388608_Avg_7:0": 23896.94,
"kernel-launch/return_code": 0.0, "kernel-launch/return_code": 0,
"kernel-launch/event_overhead:0": 0.00596, "kernel-launch/event_overhead:0": 0.00596,
"kernel-launch/event_overhead:1": 0.00595, "kernel-launch/event_overhead:1": 0.00595,
"kernel-launch/event_overhead:2": 0.00557, "kernel-launch/event_overhead:2": 0.00557,
...@@ -1252,45 +1257,45 @@ ...@@ -1252,45 +1257,45 @@
"lstm_models/pytorch-lstm/steptime_train_float16": 25.9531298652, "lstm_models/pytorch-lstm/steptime_train_float16": 25.9531298652,
"lstm_models/pytorch-lstm/throughput_train_float16": 9069.9080925588, "lstm_models/pytorch-lstm/throughput_train_float16": 9069.9080925588,
"pytorch-matmul/nosharding": 34.6449975967, "pytorch-matmul/nosharding": 34.6449975967,
"mem-bw/return_code": 1.0, "mem-bw/return_code": 1,
"mem-bw/H2D_Mem_BW:0": "", "mem-bw/H2D_Mem_BW:0": "N/A",
"mem-bw/H2D_Mem_BW:1": "", "mem-bw/H2D_Mem_BW:1": "N/A",
"mem-bw/H2D_Mem_BW:2": "", "mem-bw/H2D_Mem_BW:2": "N/A",
"mem-bw/H2D_Mem_BW:3": "", "mem-bw/H2D_Mem_BW:3": "N/A",
"mem-bw/H2D_Mem_BW:4": "", "mem-bw/H2D_Mem_BW:4": "N/A",
"mem-bw/H2D_Mem_BW:5": "", "mem-bw/H2D_Mem_BW:5": "N/A",
"mem-bw/H2D_Mem_BW:6": "", "mem-bw/H2D_Mem_BW:6": "N/A",
"mem-bw/H2D_Mem_BW:7": "", "mem-bw/H2D_Mem_BW:7": "N/A",
"mem-bw/D2H_Mem_BW:0": "", "mem-bw/D2H_Mem_BW:0": "N/A",
"mem-bw/D2H_Mem_BW:1": "", "mem-bw/D2H_Mem_BW:1": "N/A",
"mem-bw/D2H_Mem_BW:2": "", "mem-bw/D2H_Mem_BW:2": "N/A",
"mem-bw/D2H_Mem_BW:3": "", "mem-bw/D2H_Mem_BW:3": "N/A",
"mem-bw/D2H_Mem_BW:4": "", "mem-bw/D2H_Mem_BW:4": "N/A",
"mem-bw/D2H_Mem_BW:5": "", "mem-bw/D2H_Mem_BW:5": "N/A",
"mem-bw/D2H_Mem_BW:6": "", "mem-bw/D2H_Mem_BW:6": "N/A",
"mem-bw/D2H_Mem_BW:7": "", "mem-bw/D2H_Mem_BW:7": "N/A",
"mem-bw/D2D_Mem_BW:0": "", "mem-bw/D2D_Mem_BW:0": "N/A",
"mem-bw/D2D_Mem_BW:1": "", "mem-bw/D2D_Mem_BW:1": "N/A",
"mem-bw/D2D_Mem_BW:2": "", "mem-bw/D2D_Mem_BW:2": "N/A",
"mem-bw/D2D_Mem_BW:3": "", "mem-bw/D2D_Mem_BW:3": "N/A",
"mem-bw/D2D_Mem_BW:4": "", "mem-bw/D2D_Mem_BW:4": "N/A",
"mem-bw/D2D_Mem_BW:5": "", "mem-bw/D2D_Mem_BW:5": "N/A",
"mem-bw/D2D_Mem_BW:6": "", "mem-bw/D2D_Mem_BW:6": "N/A",
"mem-bw/D2D_Mem_BW:7": "", "mem-bw/D2D_Mem_BW:7": "N/A",
"nccl-bw/allreduce_8_busbw:0": 0.0, "nccl-bw/allreduce_8_busbw:0": 0,
"nccl-bw/allreduce_8_algbw:0": 0.0, "nccl-bw/allreduce_8_algbw:0": 0,
"nccl-bw/allreduce_8_time:0": 37.84, "nccl-bw/allreduce_8_time:0": 37.84,
"nccl-bw/allreduce_16_busbw:0": 0.0, "nccl-bw/allreduce_16_busbw:0": 0,
"nccl-bw/allreduce_16_algbw:0": 0.0, "nccl-bw/allreduce_16_algbw:0": 0,
"nccl-bw/allreduce_16_time:0": 36.42, "nccl-bw/allreduce_16_time:0": 36.42,
"nccl-bw/allreduce_32_busbw:0": 0.0, "nccl-bw/allreduce_32_busbw:0": 0,
"nccl-bw/allreduce_32_algbw:0": 0.0, "nccl-bw/allreduce_32_algbw:0": 0,
"nccl-bw/allreduce_32_time:0": 36.87, "nccl-bw/allreduce_32_time:0": 36.87,
"nccl-bw/allreduce_64_busbw:0": 0.0, "nccl-bw/allreduce_64_busbw:0": 0,
"nccl-bw/allreduce_64_algbw:0": 0.0, "nccl-bw/allreduce_64_algbw:0": 0,
"nccl-bw/allreduce_64_time:0": 35.83, "nccl-bw/allreduce_64_time:0": 35.83,
"nccl-bw/allreduce_128_busbw:0": 0.01, "nccl-bw/allreduce_128_busbw:0": 0.01,
"nccl-bw/allreduce_128_algbw:0": 0.0, "nccl-bw/allreduce_128_algbw:0": 0,
"nccl-bw/allreduce_128_time:0": 36.91, "nccl-bw/allreduce_128_time:0": 36.91,
"nccl-bw/allreduce_256_busbw:0": 0.01, "nccl-bw/allreduce_256_busbw:0": 0.01,
"nccl-bw/allreduce_256_algbw:0": 0.01, "nccl-bw/allreduce_256_algbw:0": 0.01,
...@@ -1316,7 +1321,7 @@ ...@@ -1316,7 +1321,7 @@
"nccl-bw/allreduce_32768_busbw:0": 1.52, "nccl-bw/allreduce_32768_busbw:0": 1.52,
"nccl-bw/allreduce_32768_algbw:0": 0.87, "nccl-bw/allreduce_32768_algbw:0": 0.87,
"nccl-bw/allreduce_32768_time:0": 37.64, "nccl-bw/allreduce_32768_time:0": 37.64,
"nccl-bw/allreduce_65536_busbw:0": 3.0, "nccl-bw/allreduce_65536_busbw:0": 3,
"nccl-bw/allreduce_65536_algbw:0": 1.71, "nccl-bw/allreduce_65536_algbw:0": 1.71,
"nccl-bw/allreduce_65536_time:0": 38.22, "nccl-bw/allreduce_65536_time:0": 38.22,
"nccl-bw/allreduce_131072_busbw:0": 5.31, "nccl-bw/allreduce_131072_busbw:0": 5.31,
...@@ -1339,7 +1344,7 @@ ...@@ -1339,7 +1344,7 @@
"nccl-bw/allreduce_4194304_time:0": 111.6, "nccl-bw/allreduce_4194304_time:0": 111.6,
"nccl-bw/allreduce_8388608_busbw:0": 89.51, "nccl-bw/allreduce_8388608_busbw:0": 89.51,
"nccl-bw/allreduce_8388608_algbw:0": 51.15, "nccl-bw/allreduce_8388608_algbw:0": 51.15,
"nccl-bw/allreduce_8388608_time:0": 164.0, "nccl-bw/allreduce_8388608_time:0": 164,
"nccl-bw/allreduce_16777216_busbw:0": 114.38, "nccl-bw/allreduce_16777216_busbw:0": 114.38,
"nccl-bw/allreduce_16777216_algbw:0": 65.36, "nccl-bw/allreduce_16777216_algbw:0": 65.36,
"nccl-bw/allreduce_16777216_time:0": 256.7, "nccl-bw/allreduce_16777216_time:0": 256.7,
...@@ -1363,13 +1368,13 @@ ...@@ -1363,13 +1368,13 @@
"nccl-bw/allreduce_1073741824_time:0": 8164.5, "nccl-bw/allreduce_1073741824_time:0": 8164.5,
"nccl-bw/allreduce_2147483648_busbw:0": 231.89, "nccl-bw/allreduce_2147483648_busbw:0": 231.89,
"nccl-bw/allreduce_2147483648_algbw:0": 132.51, "nccl-bw/allreduce_2147483648_algbw:0": 132.51,
"nccl-bw/allreduce_2147483648_time:0": 16207.0, "nccl-bw/allreduce_2147483648_time:0": 16207,
"nccl-bw/allreduce_4294967296_busbw:0": 234.45, "nccl-bw/allreduce_4294967296_busbw:0": 234.45,
"nccl-bw/allreduce_4294967296_algbw:0": 133.97, "nccl-bw/allreduce_4294967296_algbw:0": 133.97,
"nccl-bw/allreduce_4294967296_time:0": 32059.0, "nccl-bw/allreduce_4294967296_time:0": 32059,
"nccl-bw/allreduce_8589934592_busbw:0": 235.36, "nccl-bw/allreduce_8589934592_busbw:0": 235.36,
"nccl-bw/allreduce_8589934592_algbw:0": 134.49, "nccl-bw/allreduce_8589934592_algbw:0": 134.49,
"nccl-bw/allreduce_8589934592_time:0": 63870.0, "nccl-bw/allreduce_8589934592_time:0": 63870,
"resnet_models/pytorch-resnet50/steptime_train_float32": 253.9552273229, "resnet_models/pytorch-resnet50/steptime_train_float32": 253.9552273229,
"resnet_models/pytorch-resnet50/throughput_train_float32": 760.334809913, "resnet_models/pytorch-resnet50/throughput_train_float32": 760.334809913,
"resnet_models/pytorch-resnet50/steptime_train_float16": 200.0860618427, "resnet_models/pytorch-resnet50/steptime_train_float16": 200.0860618427,
...@@ -1399,11 +1404,6 @@ ...@@ -1399,11 +1404,6 @@
"vgg_models/pytorch-vgg19/steptime_train_float32": 74.9348710524, "vgg_models/pytorch-vgg19/steptime_train_float32": 74.9348710524,
"vgg_models/pytorch-vgg19/throughput_train_float32": 429.8092158311, "vgg_models/pytorch-vgg19/throughput_train_float32": 429.8092158311,
"vgg_models/pytorch-vgg19/steptime_train_float16": 45.2033062465, "vgg_models/pytorch-vgg19/steptime_train_float16": 45.2033062465,
"vgg_models/pytorch-vgg19/throughput_train_float16": 709.1127328377, "vgg_models/pytorch-vgg19/throughput_train_float16": 709.1127328377
"diagnosis/accept": false,
"diagnosis/issue_num": 17,
"diagnosis/category": "FailedTest",
"diagnosis/issue_details": "mem-bw/D2H_Mem_BW:0_miss,mem-bw/D2H_Mem_BW:1_miss,mem-bw/D2H_Mem_BW:2_miss,mem-bw/D2H_Mem_BW:3_miss,mem-bw/D2H_Mem_BW:4_miss,mem-bw/D2H_Mem_BW:5_miss,mem-bw/D2H_Mem_BW:6_miss,mem-bw/D2H_Mem_BW:7_miss,mem-bw/H2D_Mem_BW:0_miss,mem-bw/H2D_Mem_BW:1_miss,mem-bw/H2D_Mem_BW:2_miss,mem-bw/H2D_Mem_BW:3_miss,mem-bw/H2D_Mem_BW:4_miss,mem-bw/H2D_Mem_BW:5_miss,mem-bw/H2D_Mem_BW:6_miss,mem-bw/H2D_Mem_BW:7_miss,mem-bw/return_code(VAL: 1.0000 Rule:lambda x:x>0)",
"Index": "sb-validation-03"
} }
] ]
\ No newline at end of file
{"Category": "KernelLaunch", "Defective Details": "kernel-launch/event_overhead:0(B/L: 0.0060 VAL: 0.1000 VAR: 1577.85% Rule:lambda x:x>0.05)", "kernel-launch/event_overhead:0": 15.7785234899, "kernel-launch/event_overhead:1": -0.0016778523, "kernel-launch/event_overhead:2": -0.0654362416, "kernel-launch/event_overhead:3": -0.0771812081, "kernel-launch/event_overhead:4": -0.0067114094, "kernel-launch/event_overhead:5": -0.0117449664, "kernel-launch/event_overhead:6": -0.0402684564, "kernel-launch/event_overhead:7": -0.0100671141, "kernel-launch/return_code": 0.0, "kernel-launch/wall_overhead:0": 0.0, "kernel-launch/wall_overhead:1": 0.0, "kernel-launch/wall_overhead:2": 0.0194931774, "kernel-launch/wall_overhead:3": 0.022417154, "kernel-launch/wall_overhead:4": 0.0360623782, "kernel-launch/wall_overhead:5": -0.0194931774, "kernel-launch/wall_overhead:6": 0.0185185185, "kernel-launch/wall_overhead:7": 0.0438596491, "mem-bw/D2H_Mem_BW:0": 0.0, "mem-bw/D2H_Mem_BW:1": 0.012345679, "mem-bw/D2H_Mem_BW:2": 0.0082304527, "mem-bw/D2H_Mem_BW:3": 0.012345679, "mem-bw/D2H_Mem_BW:4": 0.0, "mem-bw/D2H_Mem_BW:5": 0.0, "mem-bw/D2H_Mem_BW:6": -0.0164609053, "mem-bw/D2H_Mem_BW:7": 0.012345679, "mem-bw/H2D_Mem_BW:0": 0.0, "mem-bw/H2D_Mem_BW:1": 0.0078125, "mem-bw/H2D_Mem_BW:2": 0.015625, "mem-bw/H2D_Mem_BW:3": 0.01953125, "mem-bw/H2D_Mem_BW:4": 0.0234375, "mem-bw/H2D_Mem_BW:5": 0.0078125, "mem-bw/H2D_Mem_BW:6": -0.01171875, "mem-bw/H2D_Mem_BW:7": 0.01953125, "mem-bw/return_code": 0.0, "Index": "sb-validation-01"} {"Category": "KernelLaunch", "Defective Details": "kernel-launch/event_overhead:0(B/L: 0.0060 VAL: 0.1000 VAR: 1577.85% Rule:lambda x:x>0.05)", "kernel-launch/event_overhead:0": 15.7785234899, "kernel-launch/event_overhead:1": -0.0016778523, "kernel-launch/event_overhead:2": -0.0654362416, "kernel-launch/event_overhead:3": -0.0771812081, "kernel-launch/event_overhead:4": -0.0067114094, "kernel-launch/event_overhead:5": -0.0117449664, "kernel-launch/event_overhead:6": 
-0.0402684564, "kernel-launch/event_overhead:7": -0.0100671141, "kernel-launch/return_code": 0, "kernel-launch/wall_overhead:0": 0, "kernel-launch/wall_overhead:1": 0, "kernel-launch/wall_overhead:2": 0.0194931774, "kernel-launch/wall_overhead:3": 0.022417154, "kernel-launch/wall_overhead:4": 0.0360623782, "kernel-launch/wall_overhead:5": -0.0194931774, "kernel-launch/wall_overhead:6": 0.0185185185, "kernel-launch/wall_overhead:7": 0.0438596491, "mem-bw/D2H_Mem_BW:0": 0, "mem-bw/D2H_Mem_BW:1": 0.012345679, "mem-bw/D2H_Mem_BW:2": 0.0082304527, "mem-bw/D2H_Mem_BW:3": 0.012345679, "mem-bw/D2H_Mem_BW:4": 0, "mem-bw/D2H_Mem_BW:5": 0, "mem-bw/D2H_Mem_BW:6": -0.0164609053, "mem-bw/D2H_Mem_BW:7": 0.012345679, "mem-bw/H2D_Mem_BW:0": 0, "mem-bw/H2D_Mem_BW:1": 0.0078125, "mem-bw/H2D_Mem_BW:2": 0.015625, "mem-bw/H2D_Mem_BW:3": 0.01953125, "mem-bw/H2D_Mem_BW:4": 0.0234375, "mem-bw/H2D_Mem_BW:5": 0.0078125, "mem-bw/H2D_Mem_BW:6": -0.01171875, "mem-bw/H2D_Mem_BW:7": 0.01953125, "mem-bw/return_code": 0, "index": "sb-validation-01"}
{"Category": "FailedTest", "Defective Details": "mem-bw/D2H_Mem_BW:0_miss,mem-bw/D2H_Mem_BW:1_miss,mem-bw/D2H_Mem_BW:2_miss,mem-bw/D2H_Mem_BW:3_miss,mem-bw/D2H_Mem_BW:4_miss,mem-bw/D2H_Mem_BW:5_miss,mem-bw/D2H_Mem_BW:6_miss,mem-bw/D2H_Mem_BW:7_miss,mem-bw/H2D_Mem_BW:0_miss,mem-bw/H2D_Mem_BW:1_miss,mem-bw/H2D_Mem_BW:2_miss,mem-bw/H2D_Mem_BW:3_miss,mem-bw/H2D_Mem_BW:4_miss,mem-bw/H2D_Mem_BW:5_miss,mem-bw/H2D_Mem_BW:6_miss,mem-bw/H2D_Mem_BW:7_miss,mem-bw/return_code(VAL: 1.0000 Rule:lambda x:x>0)", "kernel-launch/event_overhead:0": 0.0, "kernel-launch/event_overhead:1": -0.0016778523, "kernel-launch/event_overhead:2": -0.0654362416, "kernel-launch/event_overhead:3": -0.0771812081, "kernel-launch/event_overhead:4": -0.0067114094, "kernel-launch/event_overhead:5": -0.0117449664, "kernel-launch/event_overhead:6": -0.0402684564, "kernel-launch/event_overhead:7": -0.0100671141, "kernel-launch/return_code": 0.0, "kernel-launch/wall_overhead:0": 0.0, "kernel-launch/wall_overhead:1": 0.0, "kernel-launch/wall_overhead:2": 0.0194931774, "kernel-launch/wall_overhead:3": 0.022417154, "kernel-launch/wall_overhead:4": 0.0360623782, "kernel-launch/wall_overhead:5": -0.0194931774, "kernel-launch/wall_overhead:6": 0.0185185185, "kernel-launch/wall_overhead:7": 0.0438596491, "mem-bw/D2H_Mem_BW:0": null, "mem-bw/D2H_Mem_BW:1": null, "mem-bw/D2H_Mem_BW:2": null, "mem-bw/D2H_Mem_BW:3": null, "mem-bw/D2H_Mem_BW:4": null, "mem-bw/D2H_Mem_BW:5": null, "mem-bw/D2H_Mem_BW:6": null, "mem-bw/D2H_Mem_BW:7": null, "mem-bw/H2D_Mem_BW:0": null, "mem-bw/H2D_Mem_BW:1": null, "mem-bw/H2D_Mem_BW:2": null, "mem-bw/H2D_Mem_BW:3": null, "mem-bw/H2D_Mem_BW:4": null, "mem-bw/H2D_Mem_BW:5": null, "mem-bw/H2D_Mem_BW:6": null, "mem-bw/H2D_Mem_BW:7": null, "mem-bw/return_code": 1.0, "Index": "sb-validation-03"} {"Category": "FailedTest", "Defective Details": 
"mem-bw/D2H_Mem_BW:0_miss,mem-bw/D2H_Mem_BW:1_miss,mem-bw/D2H_Mem_BW:2_miss,mem-bw/D2H_Mem_BW:3_miss,mem-bw/D2H_Mem_BW:4_miss,mem-bw/D2H_Mem_BW:5_miss,mem-bw/D2H_Mem_BW:6_miss,mem-bw/D2H_Mem_BW:7_miss,mem-bw/H2D_Mem_BW:0_miss,mem-bw/H2D_Mem_BW:1_miss,mem-bw/H2D_Mem_BW:2_miss,mem-bw/H2D_Mem_BW:3_miss,mem-bw/H2D_Mem_BW:4_miss,mem-bw/H2D_Mem_BW:5_miss,mem-bw/H2D_Mem_BW:6_miss,mem-bw/H2D_Mem_BW:7_miss,mem-bw/return_code(VAL: 1.0000 Rule:lambda x:x>0)", "kernel-launch/event_overhead:0": 0.0, "kernel-launch/event_overhead:1": -0.0016778523, "kernel-launch/event_overhead:2": -0.0654362416, "kernel-launch/event_overhead:3": -0.0771812081, "kernel-launch/event_overhead:4": -0.0067114094, "kernel-launch/event_overhead:5": -0.0117449664, "kernel-launch/event_overhead:6": -0.0402684564, "kernel-launch/event_overhead:7": -0.0100671141, "kernel-launch/return_code": 0, "kernel-launch/wall_overhead:0": 0, "kernel-launch/wall_overhead:1": 0, "kernel-launch/wall_overhead:2": 0.0194931774, "kernel-launch/wall_overhead:3": 0.022417154, "kernel-launch/wall_overhead:4": 0.0360623782, "kernel-launch/wall_overhead:5": -0.0194931774, "kernel-launch/wall_overhead:6": 0.0185185185, "kernel-launch/wall_overhead:7": 0.0438596491, "mem-bw/D2H_Mem_BW:0": "N/A", "mem-bw/D2H_Mem_BW:1": "N/A", "mem-bw/D2H_Mem_BW:2": "N/A", "mem-bw/D2H_Mem_BW:3": "N/A", "mem-bw/D2H_Mem_BW:4": "N/A", "mem-bw/D2H_Mem_BW:5": "N/A", "mem-bw/D2H_Mem_BW:6": "N/A", "mem-bw/D2H_Mem_BW:7": "N/A", "mem-bw/H2D_Mem_BW:0": "N/A", "mem-bw/H2D_Mem_BW:1": "N/A", "mem-bw/H2D_Mem_BW:2": "N/A", "mem-bw/H2D_Mem_BW:3": "N/A", "mem-bw/H2D_Mem_BW:4": "N/A", "mem-bw/H2D_Mem_BW:5": "N/A", "mem-bw/H2D_Mem_BW:6": "N/A", "mem-bw/H2D_Mem_BW:7": "N/A", "mem-bw/return_code": 1, "index": "sb-validation-03"}
| machine | Category | Defective Details | kernel-launch/event_overhead:0 | kernel-launch/event_overhead:1 | kernel-launch/event_overhead:2 | kernel-launch/event_overhead:3 | kernel-launch/event_overhead:4 | kernel-launch/event_overhead:5 | kernel-launch/event_overhead:6 | kernel-launch/event_overhead:7 | kernel-launch/return_code | kernel-launch/wall_overhead:0 | kernel-launch/wall_overhead:1 | kernel-launch/wall_overhead:2 | kernel-launch/wall_overhead:3 | kernel-launch/wall_overhead:4 | kernel-launch/wall_overhead:5 | kernel-launch/wall_overhead:6 | kernel-launch/wall_overhead:7 | mem-bw/D2H_Mem_BW:0 | mem-bw/D2H_Mem_BW:1 | mem-bw/D2H_Mem_BW:2 | mem-bw/D2H_Mem_BW:3 | mem-bw/D2H_Mem_BW:4 | mem-bw/D2H_Mem_BW:5 | mem-bw/D2H_Mem_BW:6 | mem-bw/D2H_Mem_BW:7 | mem-bw/H2D_Mem_BW:0 | mem-bw/H2D_Mem_BW:1 | mem-bw/H2D_Mem_BW:2 | mem-bw/H2D_Mem_BW:3 | mem-bw/H2D_Mem_BW:4 | mem-bw/H2D_Mem_BW:5 | mem-bw/H2D_Mem_BW:6 | mem-bw/H2D_Mem_BW:7 | mem-bw/return_code | | index | Category | Defective Details | kernel-launch/event_overhead:0 | kernel-launch/event_overhead:1 | kernel-launch/event_overhead:2 | kernel-launch/event_overhead:3 | kernel-launch/event_overhead:4 | kernel-launch/event_overhead:5 | kernel-launch/event_overhead:6 | kernel-launch/event_overhead:7 | kernel-launch/return_code | kernel-launch/wall_overhead:0 | kernel-launch/wall_overhead:1 | kernel-launch/wall_overhead:2 | kernel-launch/wall_overhead:3 | kernel-launch/wall_overhead:4 | kernel-launch/wall_overhead:5 | kernel-launch/wall_overhead:6 | kernel-launch/wall_overhead:7 | mem-bw/D2H_Mem_BW:0 | mem-bw/D2H_Mem_BW:1 | mem-bw/D2H_Mem_BW:2 | mem-bw/D2H_Mem_BW:3 | mem-bw/D2H_Mem_BW:4 | mem-bw/D2H_Mem_BW:5 | mem-bw/D2H_Mem_BW:6 | mem-bw/D2H_Mem_BW:7 | mem-bw/H2D_Mem_BW:0 | mem-bw/H2D_Mem_BW:1 | mem-bw/H2D_Mem_BW:2 | mem-bw/H2D_Mem_BW:3 | mem-bw/H2D_Mem_BW:4 | mem-bw/H2D_Mem_BW:5 | mem-bw/H2D_Mem_BW:6 | mem-bw/H2D_Mem_BW:7 | mem-bw/return_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sb-validation-01 | KernelLaunch | kernel-launch/event_overhead:0(B/L: 0.0060 VAL: 0.1000 VAR: 1577.85% Rule:lambda x:x>0.05) | 1577.85% | -0.17% | -6.54% | -7.72% | -0.67% | -1.17% | -4.03% | -1.01% | 0.0 | 0.0% | 0.0% | 1.95% | 2.24% | 3.61% | -1.95% | 1.85% | 4.39% | 0.0% | 1.23% | 0.82% | 1.23% | 0.0% | 0.0% | -1.65% | 1.23% | 0.0% | 0.78% | 1.56% | 1.95% | 2.34% | 0.78% | -1.17% | 1.95% | 0.0 | | sb-validation-01 | KernelLaunch | kernel-launch/event_overhead:0(B/L: 0.0060 VAL: 0.1000 VAR: 1577.85% Rule:lambda x:x>0.05) | 1577.85% | -0.17% | -6.54% | -7.72% | -0.67% | -1.17% | -4.03% | -1.01% | 0 | 0.0% | 0.0% | 1.95% | 2.24% | 3.61% | -1.95% | 1.85% | 4.39% | 0.0% | 1.23% | 0.82% | 1.23% | 0.0% | 0.0% | -1.65% | 1.23% | 0.0% | 0.78% | 1.56% | 1.95% | 2.34% | 0.78% | -1.17% | 1.95% | 0 |
| sb-validation-03 | FailedTest | mem-bw/D2H_Mem_BW:0_miss,mem-bw/D2H_Mem_BW:1_miss,mem-bw/D2H_Mem_BW:2_miss,mem-bw/D2H_Mem_BW:3_miss,mem-bw/D2H_Mem_BW:4_miss,mem-bw/D2H_Mem_BW:5_miss,mem-bw/D2H_Mem_BW:6_miss,mem-bw/D2H_Mem_BW:7_miss,mem-bw/H2D_Mem_BW:0_miss,mem-bw/H2D_Mem_BW:1_miss,mem-bw/H2D_Mem_BW:2_miss,mem-bw/H2D_Mem_BW:3_miss,mem-bw/H2D_Mem_BW:4_miss,mem-bw/H2D_Mem_BW:5_miss,mem-bw/H2D_Mem_BW:6_miss,mem-bw/H2D_Mem_BW:7_miss,mem-bw/return_code(VAL: 1.0000 Rule:lambda x:x>0) | 0.0% | -0.17% | -6.54% | -7.72% | -0.67% | -1.17% | -4.03% | -1.01% | 0.0 | 0.0% | 0.0% | 1.95% | 2.24% | 3.61% | -1.95% | 1.85% | 4.39% | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1.0 | | sb-validation-03 | FailedTest | mem-bw/D2H_Mem_BW:0_miss,mem-bw/D2H_Mem_BW:1_miss,mem-bw/D2H_Mem_BW:2_miss,mem-bw/D2H_Mem_BW:3_miss,mem-bw/D2H_Mem_BW:4_miss,mem-bw/D2H_Mem_BW:5_miss,mem-bw/D2H_Mem_BW:6_miss,mem-bw/D2H_Mem_BW:7_miss,mem-bw/H2D_Mem_BW:0_miss,mem-bw/H2D_Mem_BW:1_miss,mem-bw/H2D_Mem_BW:2_miss,mem-bw/H2D_Mem_BW:3_miss,mem-bw/H2D_Mem_BW:4_miss,mem-bw/H2D_Mem_BW:5_miss,mem-bw/H2D_Mem_BW:6_miss,mem-bw/H2D_Mem_BW:7_miss,mem-bw/return_code(VAL: 1.0000 Rule:lambda x:x>0) | 0.0% | -0.17% | -6.54% | -7.72% | -0.67% | -1.17% | -4.03% | -1.01% | 0 | 0.0% | 0.0% | 1.95% | 2.24% | 3.61% | -1.95% | 1.85% | 4.39% | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 1 |
[ [
{ {
"index": "sb-validation-01",
"diagnosis/category": "KernelLaunch", "diagnosis/category": "KernelLaunch",
"diagnosis/issue_details": "kernel-launch/event_overhead:0(B/L: 0.0060 VAL: 0.1000 VAR: 1577.85% Rule:lambda x:x>0.05)", "diagnosis/issue_details": "kernel-launch/event_overhead:0(B/L: 0.0060 VAL: 0.1000 VAR: 1577.85% Rule:lambda x:x>0.05)",
"kernel-launch/event_overhead:0": 15.7785234899, "kernel-launch/event_overhead:0": 15.7785234899,
...@@ -10,24 +11,24 @@ ...@@ -10,24 +11,24 @@
"kernel-launch/event_overhead:5": -0.0117449664, "kernel-launch/event_overhead:5": -0.0117449664,
"kernel-launch/event_overhead:6": -0.0402684564, "kernel-launch/event_overhead:6": -0.0402684564,
"kernel-launch/event_overhead:7": -0.0100671141, "kernel-launch/event_overhead:7": -0.0100671141,
"kernel-launch/return_code": 0.0, "kernel-launch/return_code": 0,
"kernel-launch/wall_overhead:0": 0.0, "kernel-launch/wall_overhead:0": 0,
"kernel-launch/wall_overhead:1": 0.0, "kernel-launch/wall_overhead:1": 0,
"kernel-launch/wall_overhead:2": 0.0194931774, "kernel-launch/wall_overhead:2": 0.0194931774,
"kernel-launch/wall_overhead:3": 0.022417154, "kernel-launch/wall_overhead:3": 0.022417154,
"kernel-launch/wall_overhead:4": 0.0360623782, "kernel-launch/wall_overhead:4": 0.0360623782,
"kernel-launch/wall_overhead:5": -0.0194931774, "kernel-launch/wall_overhead:5": -0.0194931774,
"kernel-launch/wall_overhead:6": 0.0185185185, "kernel-launch/wall_overhead:6": 0.0185185185,
"kernel-launch/wall_overhead:7": 0.0438596491, "kernel-launch/wall_overhead:7": 0.0438596491,
"mem-bw/D2H_Mem_BW:0": 0.0, "mem-bw/D2H_Mem_BW:0": 0,
"mem-bw/D2H_Mem_BW:1": 0.012345679, "mem-bw/D2H_Mem_BW:1": 0.012345679,
"mem-bw/D2H_Mem_BW:2": 0.0082304527, "mem-bw/D2H_Mem_BW:2": 0.0082304527,
"mem-bw/D2H_Mem_BW:3": 0.012345679, "mem-bw/D2H_Mem_BW:3": 0.012345679,
"mem-bw/D2H_Mem_BW:4": 0.0, "mem-bw/D2H_Mem_BW:4": 0,
"mem-bw/D2H_Mem_BW:5": 0.0, "mem-bw/D2H_Mem_BW:5": 0,
"mem-bw/D2H_Mem_BW:6": -0.0164609053, "mem-bw/D2H_Mem_BW:6": -0.0164609053,
"mem-bw/D2H_Mem_BW:7": 0.012345679, "mem-bw/D2H_Mem_BW:7": 0.012345679,
"mem-bw/H2D_Mem_BW:0": 0.0, "mem-bw/H2D_Mem_BW:0": 0,
"mem-bw/H2D_Mem_BW:1": 0.0078125, "mem-bw/H2D_Mem_BW:1": 0.0078125,
"mem-bw/H2D_Mem_BW:2": 0.015625, "mem-bw/H2D_Mem_BW:2": 0.015625,
"mem-bw/H2D_Mem_BW:3": 0.01953125, "mem-bw/H2D_Mem_BW:3": 0.01953125,
...@@ -35,10 +36,10 @@ ...@@ -35,10 +36,10 @@
"mem-bw/H2D_Mem_BW:5": 0.0078125, "mem-bw/H2D_Mem_BW:5": 0.0078125,
"mem-bw/H2D_Mem_BW:6": -0.01171875, "mem-bw/H2D_Mem_BW:6": -0.01171875,
"mem-bw/H2D_Mem_BW:7": 0.01953125, "mem-bw/H2D_Mem_BW:7": 0.01953125,
"mem-bw/return_code": 0.0, "mem-bw/return_code": 0
"Index": "sb-validation-01"
}, },
{ {
"index": "sb-validation-03",
"diagnosis/category": "FailedTest", "diagnosis/category": "FailedTest",
"diagnosis/issue_details": "mem-bw/D2H_Mem_BW:0_miss,mem-bw/D2H_Mem_BW:1_miss,mem-bw/D2H_Mem_BW:2_miss,mem-bw/D2H_Mem_BW:3_miss,mem-bw/D2H_Mem_BW:4_miss,mem-bw/D2H_Mem_BW:5_miss,mem-bw/D2H_Mem_BW:6_miss,mem-bw/D2H_Mem_BW:7_miss,mem-bw/H2D_Mem_BW:0_miss,mem-bw/H2D_Mem_BW:1_miss,mem-bw/H2D_Mem_BW:2_miss,mem-bw/H2D_Mem_BW:3_miss,mem-bw/H2D_Mem_BW:4_miss,mem-bw/H2D_Mem_BW:5_miss,mem-bw/H2D_Mem_BW:6_miss,mem-bw/H2D_Mem_BW:7_miss,mem-bw/return_code(VAL: 1.0000 Rule:lambda x:x>0)", "diagnosis/issue_details": "mem-bw/D2H_Mem_BW:0_miss,mem-bw/D2H_Mem_BW:1_miss,mem-bw/D2H_Mem_BW:2_miss,mem-bw/D2H_Mem_BW:3_miss,mem-bw/D2H_Mem_BW:4_miss,mem-bw/D2H_Mem_BW:5_miss,mem-bw/D2H_Mem_BW:6_miss,mem-bw/D2H_Mem_BW:7_miss,mem-bw/H2D_Mem_BW:0_miss,mem-bw/H2D_Mem_BW:1_miss,mem-bw/H2D_Mem_BW:2_miss,mem-bw/H2D_Mem_BW:3_miss,mem-bw/H2D_Mem_BW:4_miss,mem-bw/H2D_Mem_BW:5_miss,mem-bw/H2D_Mem_BW:6_miss,mem-bw/H2D_Mem_BW:7_miss,mem-bw/return_code(VAL: 1.0000 Rule:lambda x:x>0)",
"kernel-launch/event_overhead:0": 0.0, "kernel-launch/event_overhead:0": 0.0,
...@@ -49,32 +50,31 @@ ...@@ -49,32 +50,31 @@
"kernel-launch/event_overhead:5": -0.0117449664, "kernel-launch/event_overhead:5": -0.0117449664,
"kernel-launch/event_overhead:6": -0.0402684564, "kernel-launch/event_overhead:6": -0.0402684564,
"kernel-launch/event_overhead:7": -0.0100671141, "kernel-launch/event_overhead:7": -0.0100671141,
"kernel-launch/return_code": 0.0, "kernel-launch/return_code": 0,
"kernel-launch/wall_overhead:0": 0.0, "kernel-launch/wall_overhead:0": 0,
"kernel-launch/wall_overhead:1": 0.0, "kernel-launch/wall_overhead:1": 0,
"kernel-launch/wall_overhead:2": 0.0194931774, "kernel-launch/wall_overhead:2": 0.0194931774,
"kernel-launch/wall_overhead:3": 0.022417154, "kernel-launch/wall_overhead:3": 0.022417154,
"kernel-launch/wall_overhead:4": 0.0360623782, "kernel-launch/wall_overhead:4": 0.0360623782,
"kernel-launch/wall_overhead:5": -0.0194931774, "kernel-launch/wall_overhead:5": -0.0194931774,
"kernel-launch/wall_overhead:6": 0.0185185185, "kernel-launch/wall_overhead:6": 0.0185185185,
"kernel-launch/wall_overhead:7": 0.0438596491, "kernel-launch/wall_overhead:7": 0.0438596491,
"mem-bw/D2H_Mem_BW:0": null, "mem-bw/D2H_Mem_BW:0": "N/A",
"mem-bw/D2H_Mem_BW:1": null, "mem-bw/D2H_Mem_BW:1": "N/A",
"mem-bw/D2H_Mem_BW:2": null, "mem-bw/D2H_Mem_BW:2": "N/A",
"mem-bw/D2H_Mem_BW:3": null, "mem-bw/D2H_Mem_BW:3": "N/A",
"mem-bw/D2H_Mem_BW:4": null, "mem-bw/D2H_Mem_BW:4": "N/A",
"mem-bw/D2H_Mem_BW:5": null, "mem-bw/D2H_Mem_BW:5": "N/A",
"mem-bw/D2H_Mem_BW:6": null, "mem-bw/D2H_Mem_BW:6": "N/A",
"mem-bw/D2H_Mem_BW:7": null, "mem-bw/D2H_Mem_BW:7": "N/A",
"mem-bw/H2D_Mem_BW:0": null, "mem-bw/H2D_Mem_BW:0": "N/A",
"mem-bw/H2D_Mem_BW:1": null, "mem-bw/H2D_Mem_BW:1": "N/A",
"mem-bw/H2D_Mem_BW:2": null, "mem-bw/H2D_Mem_BW:2": "N/A",
"mem-bw/H2D_Mem_BW:3": null, "mem-bw/H2D_Mem_BW:3": "N/A",
"mem-bw/H2D_Mem_BW:4": null, "mem-bw/H2D_Mem_BW:4": "N/A",
"mem-bw/H2D_Mem_BW:5": null, "mem-bw/H2D_Mem_BW:5": "N/A",
"mem-bw/H2D_Mem_BW:6": null, "mem-bw/H2D_Mem_BW:6": "N/A",
"mem-bw/H2D_Mem_BW:7": null, "mem-bw/H2D_Mem_BW:7": "N/A",
"mem-bw/return_code": 1.0, "mem-bw/return_code": 1
"Index": "sb-validation-03"
} }
] ]
\ No newline at end of file
VM_hostname vma414bbc00005I VM_hostname vma414bbc00005I
0x0ff08c4321664e96 0ff08c4321664e96
VM_hostname vma414bbc00005J VM_hostname vma414bbc00005J
0x0ff08c43217299f2 0ff08c43217299f2
VM_hostname vma414bbc00005K VM_hostname vma414bbc00005K
0x0ff08c4321729742 0ff08c4321729742
VM_hostname vma414bbc00005L VM_hostname vma414bbc00005L
0x0ff08c4321729986 0ff08c4321729986
VM_hostname vma414bbc00005M VM_hostname vma414bbc00005M
0x1c34da03005baca4 1c34da03005baca4
VM_hostname vma414bbc00005N VM_hostname vma414bbc00005N
0x0ff08c432166275a 0ff08c432166275a
VM_hostname vma414bbc00005O VM_hostname vma414bbc00005O
0x0ff08c4321664b66 0ff08c4321664b66
VM_hostname vma414bbc00005P VM_hostname vma414bbc00005P
0x0ff08c432166274e 0ff08c432166274e
VM_hostname vma414bbc00005Q VM_hostname vma414bbc00005Q
0x0ff08c4321664f2a 0ff08c4321664f2a
VM_hostname vma414bbc00005R VM_hostname vma414bbc00005R
0x043f720300e61112 043f720300e61112
---
slug: release-sb-v0.6
title: Releasing SuperBench v0.6
author: Peng Cheng
author_title: SuperBench Team
author_url: https://github.com/cp5555
author_image_url: https://github.com/cp5555.png
tags: [superbench, announcement, release]
---
We are very happy to announce that **SuperBench 0.6.0** is officially released today!
You can install and try SuperBench by following the [Getting Started Tutorial](https://microsoft.github.io/superbenchmark/docs/getting-started/installation).
## SuperBench 0.6.0 Release Notes
### SuperBench Improvements
- Support running on host directly without Docker.
- Support running `sb` command inside docker image.
- Support ROCm 5.1.1.
- Support ROCm 5.1.3.
- Fix bugs in data diagnosis.
- Fix cmake and build issues.
- Support automatic configuration yaml selection on Azure VM.
- Refine error message when GPU is not detected.
- Add return code for Timeout.
- Update Dockerfile for NCCL/RCCL version, tag name, and verbose output.
- Support node_num=1 in mpi mode.
- Update Python setup for required packages.
- Enhance parameter parsing to allow spaces in value.
- Support NO_COLOR for SuperBench output.
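
To illustrate what "allow spaces in value" means for benchmark parameters, here is a minimal sketch using only the Python standard library. It is not SuperBench's actual parser: the helper `parse_parameters` and the reduced argument set are assumptions for illustration. It shows how a quoted value, such as a JSON string containing spaces, can survive tokenization when the parameter string is split with shell-style quoting rules.

```python
import argparse
import json
import shlex


def parse_parameters(parameters):
    """Parse a benchmark parameter string, honoring shell-style quotes.

    Hypothetical helper for illustration only; not SuperBench's parser.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('--num_warmup', type=int, default=0)
    parser.add_argument('--config_json_str', type=str, default=None)
    # shlex.split keeps a quoted value together as one token,
    # even when the value itself contains spaces.
    args, _unknown = parser.parse_known_args(shlex.split(parameters))
    return args


args = parse_parameters("--num_warmup 10 --config_json_str '{\"algo\": [0, 1]}'")
config = json.loads(args.config_json_str)
print(args.num_warmup, config)
```

A plain `str.split()` would break the JSON value at each space, which is why the value is wrapped in quotes in the parameter string.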
### Micro-benchmark Improvements
- Fix issues in ib loopback benchmark.
- Fix stability issue in ib loopback benchmark.
### Distributed Benchmark Improvements
- Enhance pair-wise IB benchmark.
- Fix bug in IB benchmark.
- Support topology-aware IB benchmark.
### Data Diagnosis and Analysis
- Add failure check function in data_diagnosis.py.
- Support JSON and JSONL in Diagnosis.
- Add support to store values of metrics in data diagnosis.
- Support exit code of sb result diagnosis.
- Format int type and unify empty value to N/A in diagnosis output files.
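
The last item can be pictured with a short sketch. This is an assumed, simplified normalization and not the code in `data_diagnosis.py`: the helper names `normalize_value` and `normalize_row` are hypothetical, and the real analyzer may treat some columns differently. It converts whole-number floats such as `33878.0` to `33878` and maps empty values (`None`, `""`, `NaN`) to the string `N/A`.

```python
import math


def normalize_value(value):
    """Map empty values to 'N/A' and whole-number floats to int.

    Hypothetical helper; a sketch of the idea, not SuperBench's code.
    """
    if value is None or value == '':
        return 'N/A'
    if isinstance(value, float):
        if math.isnan(value):
            return 'N/A'
        if value.is_integer():
            return int(value)
    return value


def normalize_row(row):
    """Apply normalize_value to every metric in one result row."""
    return {key: normalize_value(val) for key, val in row.items()}


row = {
    'gemm-flops/FP16:0': 33878.0,
    'mem-bw/return_code': 1.0,
    'mem-bw/H2D_Mem_BW:0': None,
    'mem-bw/D2H_Mem_BW:0': '',
    'nccl-bw/allreduce_8_time:0': 37.84,
}
print(normalize_row(row))
```

Using a single placeholder for every kind of missing value keeps the JSON, JSONL, and Markdown outputs consistent with one another.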
...@@ -101,7 +101,7 @@ module.exports = { ...@@ -101,7 +101,7 @@ module.exports = {
announcementBar: { announcementBar: {
id: 'supportus', id: 'supportus',
content: content:
'📢 <a href="https://microsoft.github.io/superbenchmark/blog/release-sb-v0.5">v0.5.0</a> has been released! ' + '📢 <a href="https://microsoft.github.io/superbenchmark/blog/release-sb-v0.6">v0.6.0</a> has been released! ' +
'⭐️ If you like SuperBench, give it a star on <a target="_blank" rel="noopener noreferrer" href="https://github.com/microsoft/superbenchmark">GitHub</a>! ⭐️', '⭐️ If you like SuperBench, give it a star on <a target="_blank" rel="noopener noreferrer" href="https://github.com/microsoft/superbenchmark">GitHub</a>! ⭐️',
}, },
algolia: { algolia: {
......
{ {
"name": "superbench-website", "name": "superbench-website",
"version": "0.5.0", "version": "0.6.0",
"lockfileVersion": 1, "lockfileVersion": 1,
"requires": true, "requires": true,
"dependencies": { "dependencies": {
......
{ {
"name": "superbench-website", "name": "superbench-website",
"version": "0.5.0", "version": "0.6.0",
"private": true, "private": true,
"scripts": { "scripts": {
"docusaurus": "docusaurus", "docusaurus": "docusaurus",
......