# Benchmarks

Here we provide the benchmark speed test results of LiBai's models compared with the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) implementations. For LiBai V0.2.0 we only benchmark speed on up to 32 GPUs across 4 nodes, and all experiments were conducted under the same settings for a fair comparison.

## Settings

### Environments

- The commit of LiBai used for comparison: [commit](https://github.com/Oneflow-Inc/libai/commit/9fc504c457da4fd1e92d854c60b7271c89a55222)
- The commit of OneFlow used for comparison: [commit](https://github.com/Oneflow-Inc/oneflow/commit/55b822e4d3c88757d11077d7546981309125c73f)
- The commit of Megatron-LM used for comparison: [commit](https://github.com/NVIDIA/Megatron-LM/commit/e156d2fea7fc5c98e645f7742eb86b643956d840)

### Model Hyper-parameters

- **BERT Model**

  ```python
  num_layers = 24  # 48 when pipeline-parallel size = 8, see "nl" below
  num_attention_heads = 16
  hidden_size = 1024
  seq_length = 512
  ```

- **GPT-2 Model**

  ```python
  num_layers = 24  # 48 when pipeline-parallel size = 8, see "nl" below
  num_attention_heads = 16
  hidden_size = 1024
  seq_length = 1024
  ```

## Main Results

The evaluation indicators used in the following tables:

- **fp16**: mixed-precision training
- **nl**: number of layers (when the pipeline-parallel size is 8, we increase the number of layers from 24 to 48 so that each pipeline stage gets a comparable amount of computation)
- **ac**: activation checkpointing enabled
- **mb**: micro-batch size per GPU
- **gb**: total global batch size
- **d x m x p**:
  - **d**: data-parallel size
  - **m**: tensor-model-parallel size
  - **p**: pipeline-model-parallel size
- **1n1g**: 1 node, 1 GPU
- **2n8g**: 2 nodes, 8 GPUs per node (16 GPUs in total)
- **4n8g**: 4 nodes, 8 GPUs per node (32 GPUs in total)
- `grad_acc_num_step = global_batch_size / (micro_batch_size * data_parallel_size)` (see the sketch after this list)
- **samples/s**: throughput, in samples per second
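The run names in the tables encode all of these settings. As a minimal sketch (a hypothetical helper, not part of LiBai or Megatron-LM), the following decodes a run name and derives `grad_acc_num_step` from the formula above:

```python
# Hypothetical helper: decode the benchmark run names used in the
# tables below and derive grad_acc_num_step from the legend's formula.
import re

def decode_benchmark_name(name: str) -> dict:
    """Parse names like 'nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g'."""
    match = re.match(
        r"nl(?P<nl>\d+)_fp16_(?P<d>\d+)x(?P<m>\d+)x(?P<p>\d+)"
        r"(?P<ac>_ac)?_mb(?P<mb>\d+)_gb(?P<gb>\d+)_(?P<nodes>\d+)n(?P<gpus>\d+)g",
        name,
    )
    if match is None:
        raise ValueError(f"unrecognized benchmark name: {name}")
    d, mb, gb = int(match["d"]), int(match["mb"]), int(match["gb"])
    return {
        "num_layers": int(match["nl"]),
        "data_parallel_size": d,
        "tensor_model_parallel_size": int(match["m"]),
        "pipeline_model_parallel_size": int(match["p"]),
        "activation_checkpointing": match["ac"] is not None,
        "micro_batch_size": mb,
        "global_batch_size": gb,
        "total_gpus": int(match["nodes"]) * int(match["gpus"]),
        # grad_acc_num_step = global_batch_size / (micro_batch_size * data_parallel_size)
        "grad_acc_num_step": gb // (mb * d),
    }

# e.g. a 3-D parallel BERT row from the tables below:
print(decode_benchmark_name("nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g"))
# -> 16 GPUs in total, grad_acc_num_step = 2048 / (128 * 2) = 8
```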
### Data Parallel

| BERT | LiBai (samples/s) | Megatron (samples/s) |
|:---|---:|---:|
| `nl24_fp16_1x1x1_mb24_gb24_1n1g` | 46.91 | 42.6 |
| `nl24_fp16_4x1x1_mb16_gb64_1n4g` | 176.88 | 154.7 |
| `nl24_fp16_8x1x1_mb16_gb128_1n8g` | 351.57 | 309.2 |
| `nl24_fp16_16x1x1_mb16_gb256_2n8g` | 675.87 | 534.7 |
| `nl24_fp16_32x1x1_mb16_gb512_4n8g` | 1207.65 | 950.3 |
| GPT-2 | LiBai (samples/s) | Megatron (samples/s) |
|:---|---:|---:|
| `nl24_fp16_1x1x1_mb6_gb6_1n1g` | 17.52 | 15.5 |
| `nl24_fp16_4x1x1_mb4_gb16_1n4g` | 63.45 | 53.3 |
| `nl24_fp16_8x1x1_mb4_gb32_1n8g` | 125.64 | 107.9 |
| `nl24_fp16_16x1x1_mb4_gb64_2n8g` | 215.35 | 176.0 |
| `nl24_fp16_32x1x1_mb4_gb128_4n8g` | 329.58 | 296.6 |
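For context, the LiBai columns above correspond to the following data-parallel scaling efficiencies relative to each model's single-GPU throughput (a quick arithmetic check computed directly from the tables; note the single-GPU rows use a larger micro-batch than the multi-GPU rows, so the comparison is approximate):

```python
# Quick arithmetic check (not from the LiBai repo): data-parallel scaling
# efficiency relative to each model's single-GPU throughput. The 1-GPU rows
# use a larger micro-batch (mb24/mb6) than the multi-GPU rows (mb16/mb4),
# so this is only a rough indicator.
results = {
    "BERT":  {1: 46.91, 4: 176.88, 8: 351.57, 16: 675.87, 32: 1207.65},
    "GPT-2": {1: 17.52, 4: 63.45, 8: 125.64, 16: 215.35, 32: 329.58},
}
for model, runs in results.items():
    baseline = runs[1]
    for gpus, throughput in runs.items():
        efficiency = throughput / (baseline * gpus)
        print(f"{model:>5} {gpus:>2} GPUs: {efficiency:6.1%} of linear scaling")
# e.g. LiBai BERT at 32 GPUs sustains about 80% of linear scaling,
# GPT-2 about 59%.
```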
### Tensor Model Parallel
| BERT | LiBai (samples/s) | Megatron (samples/s) |
|:---|---:|---:|
| `nl24_fp16_1x1x1_ac_mb128_gb1024_1n1g` | 35.74 | 33.6 |
| `nl24_fp16_1x4x1_ac_mb128_gb1024_1n4g` | 87.12 | 86.6 |
| `nl24_fp16_1x8x1_ac_mb128_gb1024_1n8g` | 131.94 | 128.7 |
| GPT-2 | LiBai (samples/s) | Megatron (samples/s) |
|:---|---:|---:|
| `nl24_fp16_1x1x1_mb6_gb6_1n1g` | 17.52 | 15.5 |
| `nl24_fp16_1x4x1_mb6_gb6_1n4g` | 40.38 | 38.0 |
| `nl24_fp16_1x8x1_mb8_gb8_1n8g` | 60.53 | 55.7 |
### Pipeline Model Parallel
| BERT | LiBai (samples/s) | Megatron (samples/s) |
|:---|---:|---:|
| `nl24_fp16_1x1x1_ac_mb128_gb1024_1n1g` | 35.74 | 33.6 |
| `nl24_fp16_1x1x4_ac_mb128_gb1024_1n4g` | 103.6 | 88.7 |
| `nl48_fp16_1x1x8_ac_mb64_gb1024_1n8g` | 94.4 | 85.5 |
| GPT-2 | LiBai (samples/s) | Megatron (samples/s) |
|:---|---:|---:|
| `nl24_fp16_1x1x1_ac_mb32_gb256_1n1g` | 14.43 | 13.3 |
| `nl24_fp16_1x1x4_ac_mb32_gb256_1n4g` | 41.9 | 33.2 |
| `nl48_fp16_1x1x8_ac_mb24_gb384_1n8g` | 37.4 | 31.8 |
### 2-D Parallel

#### Data Parallel + Tensor Model Parallel
| BERT | LiBai (samples/s) | Megatron (samples/s) |
|:---|---:|---:|
| `nl24_fp16_2x2x1_ac_mb128_gb2048_1n4g` | 88.47 | 86.6 |
| `nl24_fp16_4x2x1_ac_mb128_gb4096_1n8g` | 175.94 | 172.0 |
| `nl24_fp16_8x2x1_ac_mb128_gb8192_2n8g` | 348.58 | 343.8 |
| `nl24_fp16_2x8x1_ac_mb128_gb2048_2n8g` | 261.78 | 255.8 |
| `nl24_fp16_4x4x1_ac_mb128_gb2048_2n8g` | 338.97 | 337.3 |
| GPT-2 | LiBai (samples/s) | Megatron (samples/s) |
|:---|---:|---:|
| `nl24_fp16_2x2x1_ac_mb32_gb512_1n4g` | 37.63 | 36.9 |
| `nl24_fp16_4x2x1_ac_mb32_gb1024_1n8g` | 74.35 | 73.2 |
| `nl24_fp16_8x2x1_ac_mb32_gb2048_2n8g` | 148.94 | 146.5 |
| `nl24_fp16_2x8x1_ac_mb32_gb512_2n8g` | 116.04 | 109.1 |
| `nl24_fp16_4x4x1_ac_mb32_gb512_2n8g` | 141.25 | 138.1 |
#### Data Parallel + Pipeline Model Parallel
| BERT | LiBai (samples/s) | Megatron (samples/s) |
|:---|---:|---:|
| `nl24_fp16_2x1x4_ac_mb128_gb2048_1n8g` | 207.31 | 175.0 |
| `nl24_fp16_4x1x4_ac_mb128_gb4096_2n8g` | 406.24 | 342.9 |
| `nl24_fp16_8x1x4_ac_mb128_gb8192_4n8g` | 805.04 | 650.7 |
| GPT-2 | LiBai (samples/s) | Megatron (samples/s) |
|:---|---:|---:|
| `nl24_fp16_2x1x4_ac_mb32_gb512_1n8g` | 83.12 | 65.3 |
| `nl24_fp16_4x1x4_ac_mb32_gb1024_2n8g` | 164.23 | 128.4 |
| `nl24_fp16_8x1x4_ac_mb32_gb2048_4n8g` | 322.42 | 247.3 |
### 3-D Parallel
| BERT | LiBai (samples/s) | Megatron (samples/s) |
|:---|---:|---:|
| `nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g` | 267.39 | 233.7 |
| `nl24_fp16_4x2x4_ac_mb192_gb6144_4n8g` | 503.51 | 439.4 |
| `nl24_fp16_2x4x4_ac_mb256_gb4096_4n8g` | 405.75 | 338.7 |
| GPT-2 | LiBai (samples/s) | Megatron (samples/s) |
|:---|---:|---:|
| `nl24_fp16_2x2x4_ac_mb32_gb1024_2n8g` | 128.77 | 106.3 |
| `nl24_fp16_4x2x4_ac_mb48_gb1536_4n8g` | 209.32 | 179.5 |
| `nl24_fp16_2x4x4_ac_mb64_gb1024_4n8g` | 186.67 | 178.2 |
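To connect the run names back to an actual launch, here is a hedged sketch of how a 3-D layout such as `2x2x4` (the first BERT row above) would look in a LiBai-style lazy config. The `train.dist.*` keys follow LiBai's distributed-configuration docs, but the surrounding settings are illustrative assumptions, not the exact config files used for these benchmarks.

```python
# Illustrative sketch of a LiBai lazy config for a run like
# nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g (BERT, 3-D parallel).
# The train.dist.* keys follow LiBai's distributed-configuration docs;
# the batch-size and fp16/ac keys are assumptions for illustration.
from libai.config import get_config

train = get_config("common/train.py").train  # LiBai's common training config

train.dist.data_parallel_size = 2       # d in "2x2x4"
train.dist.tensor_parallel_size = 2     # m in "2x2x4"
train.dist.pipeline_parallel_size = 4   # p in "2x2x4"

train.train_micro_batch_size = 128      # mb128 (assumed key name)
train.global_batch_size = 2048          # gb2048 (assumed key name)
train.amp.enabled = True                # fp16 (assumed key name)
train.activation_checkpoint.enabled = True  # ac (assumed key name)
# => grad_acc_num_step = 2048 / (128 * 2) = 8 accumulation steps
```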