| onnxruntime-ort-models/bert_large_uncased_ngpu_1_train_throughput | throughput (samples/s) | The training throughput of the bert-large-uncased model on 1 GPU. |
| onnxruntime-ort-models/bert_large_uncased_ngpu_8_train_throughput | throughput (samples/s) | The training throughput of the bert-large-uncased model on 8 GPUs. |
| onnxruntime-ort-models/distilbert_base_uncased_ngpu_1_train_throughput | throughput (samples/s) | The training throughput of the distilbert-base-uncased model on 1 GPU. |
| onnxruntime-ort-models/distilbert_base_uncased_ngpu_8_train_throughput | throughput (samples/s) | The training throughput of the distilbert-base-uncased model on 8 GPUs. |
| onnxruntime-ort-models/gpt2_ngpu_1_train_throughput | throughput (samples/s) | The training throughput of the gpt2 model on 1 GPU. |
| onnxruntime-ort-models/gpt2_ngpu_8_train_throughput | throughput (samples/s) | The training throughput of the gpt2 model on 8 GPUs. |
| onnxruntime-ort-models/facebook_bart_large_ngpu_1_train_throughput | throughput (samples/s) | The training throughput of the facebook/bart-large model on 1 GPU. |
| onnxruntime-ort-models/facebook_bart_large_ngpu_8_train_throughput | throughput (samples/s) | The training throughput of the facebook/bart-large model on 8 GPUs. |
| onnxruntime-ort-models/roberta_large_ngpu_1_train_throughput | throughput (samples/s) | The training throughput of the roberta-large model on 1 GPU. |
| onnxruntime-ort-models/roberta_large_ngpu_8_train_throughput | throughput (samples/s) | The training throughput of the roberta-large model on 8 GPUs. |
| tensorrt-inference/gpu_lat_ms_mean | time (ms) | The mean GPU latency to execute the kernels for a query. |
| tensorrt-inference/gpu_lat_ms_99 | time (ms) | The 99th percentile GPU latency to execute the kernels for a query. |
| tensorrt-inference/host_lat_ms_mean | time (ms) | The mean host latency (H2D copy, GPU execution, and D2H copy) for a query. |
| tensorrt-inference/host_lat_ms_99 | time (ms) | The 99th percentile host latency (H2D copy, GPU execution, and D2H copy) for a query. |
| tensorrt-inference/end_to_end_lat_ms_mean | time (ms) | The mean duration from when the H2D copy of a query is issued to when the D2H copy of the same query completes. |
| tensorrt-inference/end_to_end_lat_ms_99 | time (ms) | The 99th percentile duration from when the H2D copy of a query is issued to when the D2H copy of the same query completes. |
| tensorrt-inference/${model}_gpu_time_mean | time (ms) | The mean GPU latency to execute the kernels for a query. |
| tensorrt-inference/${model}_gpu_time_99 | time (ms) | The 99th percentile GPU latency to execute the kernels for a query. |
| tensorrt-inference/${model}_host_time_mean | time (ms) | The mean host latency (H2D copy, GPU execution, and D2H copy) for a query. |
| tensorrt-inference/${model}_host_time_99 | time (ms) | The 99th percentile host latency (H2D copy, GPU execution, and D2H copy) for a query. |
| tensorrt-inference/${model}_end_to_end_time_mean | time (ms) | The mean duration from when the H2D copy of a query is issued to when the D2H copy of the same query completes. |
| tensorrt-inference/${model}_end_to_end_time_99 | time (ms) | The 99th percentile duration from when the H2D copy of a query is issued to when the D2H copy of the same query completes. |
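
The `_mean` and `_99` suffixes on the latency metrics above are simple aggregates over the per-query timings collected during the run. A minimal sketch of that aggregation, assuming the raw per-query latencies (in milliseconds) are already available as a Python list; the nearest-rank percentile here is only an illustration, not necessarily the exact method the benchmark uses:

```python
# Sketch: derive *_mean and *_99 style statistics from raw per-query
# latencies in milliseconds. The percentile handling is simplified and
# may differ from the benchmark's exact method.
import statistics

def latency_stats(latencies_ms):
    """Return the mean and 99th-percentile latency of a list of samples."""
    ordered = sorted(latencies_ms)
    p99_index = min(len(ordered) - 1, round(0.99 * (len(ordered) - 1)))
    return {"mean": statistics.mean(ordered), "99": ordered[p99_index]}

# Example: 1000 synthetic GPU latencies around 2 ms.
print(latency_stats([2.0 + 0.001 * i for i in range(1000)]))
```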
## Communication Benchmarks
...
...
@@ -95,11 +95,11 @@ or [AMD](https://github.com/ROCm-Developer-Tools/HIP/tree/master/samples/1_Utils
| cpu\_to\_gpu[0-9]+\_by\_gpu[0-9]+\_using\_(sm\|dma)\_under_numa[0-9]+ | bandwidth (GB/s) | The bandwidth measured when each GPU reads from each NUMA node's host memory, using either the DMA engine or the GPU SM. |
| gpu[0-9]+\_to\_cpu\_by\_gpu[0-9]+\_using\_(sm\|dma)\_under_numa[0-9]+ | bandwidth (GB/s) | The bandwidth measured when each GPU writes to each NUMA node's host memory, using either the DMA engine or the GPU SM. |
| gpu[0-9]+\_to_gpu[0-9]+\_by\_gpu[0-9]+\_using\_(sm\|dma)\_under_numa[0-9]+ | bandwidth (GB/s) | The bandwidth measured when each GPU reads from or writes to the other GPUs with peer communication enabled, using either the DMA engine or the GPU SM. |
| cpu\_to\_gpu[0-9]+\_by\_gpu[0-9]+\_using\_(sm\|dma)\_under_numa[0-9]+_bw | bandwidth (GB/s) | The bandwidth measured when each GPU reads from each NUMA node's host memory, using either the DMA engine or the GPU SM. |
| gpu[0-9]+\_to\_cpu\_by\_gpu[0-9]+\_using\_(sm\|dma)\_under_numa[0-9]+_bw | bandwidth (GB/s) | The bandwidth measured when each GPU writes to each NUMA node's host memory, using either the DMA engine or the GPU SM. |
| gpu[0-9]+\_to_gpu[0-9]+\_by\_gpu[0-9]+\_using\_(sm\|dma)\_under_numa[0-9]+_bw | bandwidth (GB/s) | The bandwidth measured when each GPU reads from or writes to the other GPUs with peer communication enabled, using either the DMA engine or the GPU SM. |
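
Each metric name above encodes the source, destination, initiating GPU, copy engine (SM or DMA), and NUMA node. A small sketch of unpacking such a name with a regular expression; the sample metric name is made up for illustration, only the naming pattern is taken from the table:

```python
# Sketch: parse a gpu-copy-bw style metric name into its components.
# The example name is hypothetical; only the naming pattern comes from
# the table above.
import re

PATTERN = re.compile(
    r"^(?P<src>cpu|gpu\d+)_to_(?P<dst>cpu|gpu\d+)"
    r"_by_(?P<worker>gpu\d+)_using_(?P<engine>sm|dma)"
    r"_under_numa(?P<numa>\d+)_bw$"
)

match = PATTERN.match("cpu_to_gpu0_by_gpu0_using_sm_under_numa1_bw")
if match:
    print(match.groupdict())
    # {'src': 'cpu', 'dst': 'gpu0', 'worker': 'gpu0', 'engine': 'sm', 'numa': '1'}
```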
### `ib-loopback`
...
...
@@ -122,11 +122,11 @@ Measure the InfiniBand loopback verbs bandwidth, performed by
| {benchmark_name}/${test_title}_RRTwo-sidedLat(8B)_${stat} | time (usec) | Statistical values (Min, Max, Avg, 99%, 99.9%) across all nodes for the 'random ring communication pattern two-sided latency' network test. |
| {benchmark_name}/${test_title}_RRTwo-sidedBW+Sync(131072B)_${stat} | bandwidth (MiB/s/rank) | Statistical values (Min, Max, Avg, 99%, 99.9%) across all nodes for the 'random ring communication pattern two-sided bandwidth with barrier' network test. |
| {benchmark_name}/${test_title}_MultipleAllreduce(8B)_${stat} | time (usec) | Statistical values (Min, Max, Avg, 99%, 99.9%) across all nodes for the 'multiple allreduce bandwidth' network test. |
| {benchmark_name}/${test_title}_GetBcast(4096B)_${stat} | bandwidth (MB/s/rank) | Statistical values (Min, Max, Avg, 99%, 99.9%) across all nodes for the 'Get Bcast (4096 B)' congestion test. |
| {benchmark_name}/${test_title}_PutIncast(4096B)_${stat} | bandwidth (MB/s/rank) | Statistical values (Min, Max, Avg, 99%, 99.9%) across all nodes for the 'Put Incast (4096 B)' congestion test. |
| {benchmark_name}/${test_title}_Two-sidedIncast(4096B)_${stat} | bandwidth (MB/s/rank) | Statistical values (Min, Max, Avg, 99%, 99.9%) across all nodes for the 'Two-sided Incast (4096 B)' congestion test. |
| {benchmark_name}/${test_title}_Alltoall(4096B)_${stat} | bandwidth (MB/s/rank) | Statistical values (Min, Max, Avg, 99%, 99.9%) across all nodes for the 'Alltoall (4096 B)' congestion test. |
| gpcnet-network-load-test/${test_title}_${network_test_algo}_${stat} | times (x) | Summary of the congestion impact factor for each network test algorithm. |
| gpcnet-network-test/rr_two-sided_lat_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'random ring communication pattern two-sided latency' network test. |
| gpcnet-network-test/rr_two-sided+sync_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'random ring communication pattern two-sided bandwidth with barrier' network test. |
| gpcnet-network-test/multiple_allreduce_time_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'multiple allreduce bandwidth' network test. |
| gpcnet-network-test/rr_get_lat_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'RR GetLat (8 B)' network test. |
| gpcnet-network-test/rr_two-sided_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'RR Two-sidedBW (131072 B)' network test. |
| gpcnet-network-test/nat_two-sided_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'Nat Two-sidedBW (131072 B)' network test. |
| gpcnet-network-test/multiple_alltoall_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'Multiple Alltoall (4096 B)' network test. |
| gpcnet-network-load-test/rr_two-sided_lat_x_${stat} | factor (x) | Summary of the congestion impact factor for the 'random ring two-sided latency' network test. |
| gpcnet-network-load-test/rr_two-sided+sync_bw_x_${stat} | factor (x) | Summary of the congestion impact factor for the 'random ring two-sided bandwidth with barrier' network test. |
| gpcnet-network-load-test/multiple_allreduce_x_${stat} | factor (x) | Summary of the congestion impact factor for the 'multiple allreduce' network test. |
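
For every row above, the `${stat}` placeholder expands into one concrete metric per summary statistic computed over all participating nodes. A rough sketch of that expansion; the exact spelling of the statistic suffixes is an assumption for illustration:

```python
# Sketch: expand the ${stat} placeholder into concrete metric names.
# The suffix spellings below are assumptions based on the descriptions
# (min, max, avg, 99%, 99.9%), not necessarily the exact output names.
STATS = ["min", "max", "avg", "99%", "99.9%"]

def expand(metric_template):
    """Return one metric name per summary statistic."""
    return [metric_template.replace("${stat}", stat) for stat in STATS]

for name in expand("gpcnet-network-test/rr_two-sided_lat_${stat}"):
    print(name)
# gpcnet-network-test/rr_two-sided_lat_min
# gpcnet-network-test/rr_two-sided_lat_max
# ...
```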
### `ib-traffic`
...
...
@@ -204,11 +205,11 @@ The traffic pattern is defined in a config file, which is pre-defined for one-to
Each row in the config is one round, and all pairs of nodes in a row run the IB command simultaneously (a parsing sketch follows the table below).
| ib-traffic/${command}-${line}-${pair} | bandwidth (MB/s) | The average bandwidth of the IB command (ib_write_bw, ib_send_bw, ib_read_bw) run between the ${pair}<sup>th</sup> node pair in the ${line}<sup>th</sup> line of the config. |
| ib-traffic/${command}-${line}-${pair} | time (us) | The max latency of the IB command (ib_write_lat, ib_send_lat, ib_read_lat) run between the ${pair}<sup>th</sup> node pair in the ${line}<sup>th</sup> line of the config. |
| ib-traffic/${command}_${line}_${pair}_${server}_${client}_bw | bandwidth (GB/s) | The max bandwidth of the IB command (ib_write_bw, ib_send_bw, ib_read_bw) run between the ${pair}<sup>th</sup> node pair in the ${line}<sup>th</sup> line of the config; ${server} and ${client} are the hostnames of the server and client nodes. |
| ib-traffic/${command}_${line}_${pair}_${server}_${client}_lat | time (us) | The max latency of the IB command (ib_write_lat, ib_send_lat, ib_read_lat) run between the ${pair}<sup>th</sup> node pair in the ${line}<sup>th</sup> line of the config; ${server} and ${client} are the hostnames of the server and client nodes. |
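
The sketch below illustrates one plausible way to read such a config into rounds of node pairs; the separator characters and the example file contents are assumptions for illustration, not the benchmark's definitive format:

```python
# Sketch: read an ib-traffic style config where each line is one round and
# each pair of nodes in that line runs the IB command simultaneously.
# The ';' and ',' separators below are assumptions for illustration.
def parse_config(path):
    rounds = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            pairs = [tuple(int(n) for n in pair.split(","))
                     for pair in line.split(";")]
            rounds.append(pairs)
    return rounds

# Hypothetical config content for 4 nodes (one-to-one pattern):
#   0,1;2,3
#   0,2;1,3
#   0,3;1,2
# parse_config("config.txt") -> [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(0, 3), (1, 2)]]
```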
## Computation-communication Benchmarks
...
...
@@ -223,8 +224,8 @@ Test the performance of single node when communication and computation overlap.