# Benchmarks
Here we provide the benchmark speed test results of LiBai's models compared with the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) implementations. In LiBai V0.2.0, we only benchmark training speed on up to 32 GPUs across 4 nodes, and all experiments were conducted under the same settings for a fair comparison.
## Settings
### Environments
- Commit of LiBai used for comparison: [9fc504c](https://github.com/Oneflow-Inc/libai/commit/9fc504c457da4fd1e92d854c60b7271c89a55222)
- Commit of OneFlow used for comparison: [55b822e](https://github.com/Oneflow-Inc/oneflow/commit/55b822e4d3c88757d11077d7546981309125c73f)
- Commit of Megatron-LM used for comparison: [e156d2f](https://github.com/NVIDIA/Megatron-LM/commit/e156d2fea7fc5c98e645f7742eb86b643956d840)
### Model Hyper-parameters
- **BERT Model**
```python
num_layers = 24/48
num_attention_heads = 16
hidden_size = 1024
seq_length = 512
```
- **GPT-2 Model**
```python
num_layers = 24/48
num_attention_heads = 16
hidden_size = 1024
seq_length = 1024
```
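
For orientation, the sketch below derives two quantities implied by these hyper-parameters: the per-head attention dimension and a rough per-model parameter count. The 12 · L · H² estimate is the standard transformer-block approximation (embeddings and biases ignored); it is an illustrative calculation, not a figure reported by this benchmark.

```python
# Hyper-parameters shared by both benchmark models (from the listings above).
hidden_size = 1024
num_attention_heads = 16

# Per-head dimension used inside the attention layers.
head_dim = hidden_size // num_attention_heads  # 1024 / 16 = 64

# Rough parameter count of the transformer blocks only, using the common
# 12 * num_layers * hidden_size^2 approximation (embeddings/biases ignored).
for num_layers in (24, 48):
    approx_params = 12 * num_layers * hidden_size ** 2
    print(f"nl{num_layers}: ~{approx_params / 1e6:.0f}M block parameters")
```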
## Main Results
Here we explain the notation used in the following tables:
- **fp16**: mixed precision training
- **nl**: num layers (when pipeline parallel size = 8, we increase num layers from 24 to 48 so that each pipeline stage holds enough layers, 48 / 8 = 6, to keep the per-stage computation reasonable)
- **ac**: activation checkpointing enabled
- **mb**: micro-batch size per GPU
- **gb**: total global batch size
- **d x m x p**:
- d: data-parallel-size
- m: tensor-model-parallel-size
- p: pipeline-model-parallel-size
- **1n1g**: 1 node, 1 GPU
- **2n8g**: 2 nodes, 8 GPUs per node, 16 GPUs in total
- **4n8g**: 4 nodes, 8 GPUs per node, 32 GPUs in total
- `grad_acc_num_step = global_batch_size / (micro_batch_size * data_parallel_size)` (see the worked sketch after this list)
- **samples/s**: throughput
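
As a worked example of this naming convention and the gradient-accumulation formula, here is a minimal sketch that parses one configuration tag from the tables below. The parser is purely illustrative and not part of LiBai or Megatron-LM; it only restates the conventions listed above.

```python
import re

def parse_config_tag(tag: str) -> dict:
    """Split a benchmark tag such as 'nl24_fp16_2x1x4_ac_mb128_gb2048_1n8g'
    into its fields (illustrative helper, not part of LiBai)."""
    d, m, p = map(int, re.search(r"_(\d+)x(\d+)x(\d+)_", tag).groups())
    nodes, gpus_per_node = map(int, re.search(r"_(\d+)n(\d+)g$", tag).groups())
    return dict(
        num_layers=int(re.search(r"nl(\d+)", tag).group(1)),
        data_parallel_size=d,
        tensor_model_parallel_size=m,
        pipeline_model_parallel_size=p,
        activation_checkpointing="_ac_" in tag,
        micro_batch_size=int(re.search(r"_mb(\d+)_", tag).group(1)),
        global_batch_size=int(re.search(r"_gb(\d+)_", tag).group(1)),
        total_gpus=nodes * gpus_per_node,
    )

cfg = parse_config_tag("nl24_fp16_2x1x4_ac_mb128_gb2048_1n8g")

# d x m x p must account for every GPU in the run: 2 * 1 * 4 = 8 = 1n8g.
assert (cfg["data_parallel_size"]
        * cfg["tensor_model_parallel_size"]
        * cfg["pipeline_model_parallel_size"]) == cfg["total_gpus"]

# grad_acc_num_step = global_batch_size / (micro_batch_size * data_parallel_size)
grad_acc_num_step = cfg["global_batch_size"] // (
    cfg["micro_batch_size"] * cfg["data_parallel_size"]
)
print(grad_acc_num_step)  # 2048 / (128 * 2) = 8
```

For instance, `nl24_fp16_2x1x4_ac_mb128_gb2048_1n8g` runs 2-way data parallel with 4 pipeline stages on 8 GPUs and accumulates gradients over 8 micro-batches per step.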
### Data Parallel
| Configuration | LiBai (samples/s) | Megatron-LM (samples/s) |
|---------------|-------------------|-------------------------|
| nl24_fp16_1x1x1_mb24_gb24_1n1g | 46.91 | 42.6 |
| nl24_fp16_4x1x1_mb16_gb64_1n4g | 176.88 | 154.7 |
| nl24_fp16_8x1x1_mb16_gb128_1n8g | 351.57 | 309.2 |
| nl24_fp16_16x1x1_mb16_gb256_2n8g | 675.87 | 534.7 |
| nl24_fp16_32x1x1_mb16_gb512_4n8g | 1207.65 | 950.3 |
| nl24_fp16_1x1x1_mb6_gb6_1n1g | 17.52 | 15.5 |
| nl24_fp16_4x1x1_mb4_gb16_1n4g | 63.45 | 53.3 |
| nl24_fp16_8x1x1_mb4_gb32_1n8g | 125.64 | 107.9 |
| nl24_fp16_16x1x1_mb4_gb64_2n8g | 215.35 | 176.0 |
| nl24_fp16_32x1x1_mb4_gb128_4n8g | 329.58 | 296.6 |
### Tensor Model Parallel
| Configuration | LiBai (samples/s) | Megatron-LM (samples/s) |
|---------------|-------------------|-------------------------|
| nl24_fp16_1x1x1_ac_mb128_gb1024_1n1g | 35.74 | 33.6 |
| nl24_fp16_1x4x1_ac_mb128_gb1024_1n4g | 87.12 | 86.6 |
| nl24_fp16_1x8x1_ac_mb128_gb1024_1n8g | 131.94 | 128.7 |
| nl24_fp16_1x1x1_mb6_gb6_1n1g | 17.52 | 15.5 |
| nl24_fp16_1x4x1_mb6_gb6_1n4g | 40.38 | 38.0 |
| nl24_fp16_1x8x1_mb8_gb8_1n8g | 60.53 | 55.7 |
### Pipeline Model Parallel
| Configuration | LiBai (samples/s) | Megatron-LM (samples/s) |
|---------------|-------------------|-------------------------|
| nl24_fp16_1x1x1_ac_mb128_gb1024_1n1g | 35.74 | 33.6 |
| nl24_fp16_1x1x4_ac_mb128_gb1024_1n4g | 103.6 | 88.7 |
| nl48_fp16_1x1x8_ac_mb64_gb1024_1n8g | 94.4 | 85.5 |
| nl24_fp16_1x1x1_ac_mb32_gb256_1n1g | 14.43 | 13.3 |
| nl24_fp16_1x1x4_ac_mb32_gb256_1n4g | 41.9 | 33.2 |
| nl48_fp16_1x1x8_ac_mb24_gb384_1n8g | 37.4 | 31.8 |
### 2-D Parallel
#### Data Parallel + Tensor Model Parallel
| Configuration | LiBai (samples/s) | Megatron-LM (samples/s) |
|---------------|-------------------|-------------------------|
| nl24_fp16_2x2x1_ac_mb128_gb2048_1n4g | 88.47 | 86.6 |
| nl24_fp16_4x2x1_ac_mb128_gb4096_1n8g | 175.94 | 172.0 |
| nl24_fp16_8x2x1_ac_mb128_gb8192_2n8g | 348.58 | 343.8 |
| nl24_fp16_2x8x1_ac_mb128_gb2048_2n8g | 261.78 | 255.8 |
| nl24_fp16_4x4x1_ac_mb128_gb2048_2n8g | 338.97 | 337.3 |
| nl24_fp16_2x2x1_ac_mb32_gb512_1n4g | 37.63 | 36.9 |
| nl24_fp16_4x2x1_ac_mb32_gb1024_1n8g | 74.35 | 73.2 |
| nl24_fp16_8x2x1_ac_mb32_gb2048_2n8g | 148.94 | 146.5 |
| nl24_fp16_2x8x1_ac_mb32_gb512_2n8g | 116.04 | 109.1 |
| nl24_fp16_4x4x1_ac_mb32_gb512_2n8g | 141.25 | 138.1 |
#### Data Parallel + Pipeline Model Parallel
| Configuration | LiBai (samples/s) | Megatron-LM (samples/s) |
|---------------|-------------------|-------------------------|
| nl24_fp16_2x1x4_ac_mb128_gb2048_1n8g | 207.31 | 175.0 |
| nl24_fp16_4x1x4_ac_mb128_gb4096_2n8g | 406.24 | 342.9 |
| nl24_fp16_8x1x4_ac_mb128_gb8192_4n8g | 805.04 | 650.7 |
| nl24_fp16_2x1x4_ac_mb32_gb512_1n8g | 83.12 | 65.3 |
| nl24_fp16_4x1x4_ac_mb32_gb1024_2n8g | 164.23 | 128.4 |
| nl24_fp16_8x1x4_ac_mb32_gb2048_4n8g | 322.42 | 247.3 |
### 3-D Parallel
| Configuration | LiBai (samples/s) | Megatron-LM (samples/s) |
|---------------|-------------------|-------------------------|
| nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g | 267.39 | 233.7 |
| nl24_fp16_4x2x4_ac_mb192_gb6144_4n8g | 503.51 | 439.4 |
| nl24_fp16_2x4x4_ac_mb256_gb4096_4n8g | 405.75 | 338.7 |
| nl24_fp16_2x2x4_ac_mb32_gb1024_2n8g | 128.77 | 106.3 |
| nl24_fp16_4x2x4_ac_mb48_gb1536_4n8g | 209.32 | 179.5 |
| nl24_fp16_2x4x4_ac_mb64_gb1024_4n8g | 186.67 | 178.2 |