
# Introduction to BERT
## Application area
Large-scale natural language understanding model
## Target accuracy
Masked-LM accuracy reaches 0.72
## Basic model parameters
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "type_vocab_size": 2,
  "vocab_size": 30522
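
The parameters above match the commonly published BERT-large shape (24 layers, hidden size 1024, 16 heads). As a minimal sketch of how such a config can be sanity-checked before training, the snippet below parses the values and derives the per-head dimension and the feed-forward expansion ratio. The names `BERT_CONFIG_JSON` and `attention_head_size` are illustrative assumptions and do not come from the training scripts.

```python
import json

# Values copied from the parameter list above; the variable name is hypothetical.
BERT_CONFIG_JSON = """
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
"""

config = json.loads(BERT_CONFIG_JSON)


def attention_head_size(cfg):
    """Return the per-head dimension; hidden_size must divide evenly by the head count."""
    hidden = cfg["hidden_size"]
    heads = cfg["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be a multiple of num_attention_heads")
    return hidden // heads


head_size = attention_head_size(config)
ffn_ratio = config["intermediate_size"] // config["hidden_size"]
print(f"head size: {head_size}, feed-forward expansion ratio: {ffn_ratio}")
```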
  
# Preparation before testing

## Dataset preparation

###  progress bars in model download and training scripts
boto3==1.14.0
gdown==3.13.0
git+https://github.com/mlcommons/logging.git@2.0.0-rc2
h5py==2.10.0
html2text==2020.1.16
ipdb==0.13.2
nltk==3.5
onnxruntime==1.3.0
parameterized
progressbar==2.5
requests==2.23.0
six==1.15.0
tensorflow==2.2.0
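When preprocessing the data, match the versions of all the libraries listed above as closely as possible; otherwise the md5 checksums of the generated files may not match.
See README.md in the bert directory for how to build the dataset.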
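
Since mismatched library versions can show up as md5 differences, it may help to check each generated file against its reference checksum. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_md5` are hypothetical, and the reference checksums themselves must come from the dataset instructions, not from this snippet.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_md5):
    """Return True when the file's md5 equals the expected hex digest (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A call would look like `verify_md5(some_file, reference_md5)`, where both arguments are placeholders for a preprocessed file and its published checksum.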

## Environment setup

1. Prepare the DTK 21.04 environment.

2. The MLPerf bert folder contains paddlepaddle_rocm-0.0.0-cp36-cp36m-linux_x86_64.whl; install it with:
python3 -m pip install  paddlepaddle_rocm-0.0.0-cp36-cp36m-linux_x86_64.whl

## Install Python dependencies

 pip install -r requirements.txt  -i https://pypi.tuna.tsinghua.edu.cn/simple --trusted-host pypi.tuna.tsinghua.edu.cn

# Test scripts
## 8-GPU test with exchange padding enabled

cp rundir_8gpu_exchange/* .
sbatch run_sbatch.sh

## 1024-GPU large-scale concurrency test

cp rundir_8gpu_exchange/* .
sbatch run_sbatch.sh

Output is written to the worker.* files.

# Summary of optimization test results
Test data is stored in: result.log
## Scalability tests


| GPUs | Per-GPU batch_size | gradient_accumulation | Throughput (seq/s) | Parallel efficiency                             |
|------|--------------------|-----------------------|--------------------|-------------------------------------------------|
| 4    | 4                  | 1                     | 36.69              | 100%                                            |
| 8    | 4                  | 1                     | 65.7               | 89.53%                                          |
| 1024 | 4                  | 1                     | 7723.38            | 82.23%                                          |
| 1024 | 8                  | 1                     | 9362.93-9416.84    | 99.6%-100.25% (baseline: single node, 4 GPUs)   |
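
The efficiency figures in the table above are consistent with dividing the measured throughput by the throughput that linear scaling from the 4-GPU run would give. A minimal sketch of that calculation follows, assuming this is the definition the table uses; the function name `parallel_efficiency` is illustrative.

```python
def parallel_efficiency(throughput, num_gpus, base_throughput, base_gpus):
    """Measured throughput divided by the ideal, linearly scaled baseline throughput."""
    ideal = base_throughput * num_gpus / base_gpus
    return throughput / ideal


# Baseline taken from the 4-GPU row of the table.
BASE_GPUS = 4
BASE_THROUGHPUT = 36.69

# (GPU count, measured throughput in seq/s) for the batch_size 4 rows of the table.
measurements = [(8, 65.7), (1024, 7723.38)]

efficiencies = {
    gpus: parallel_efficiency(seq_per_s, gpus, BASE_THROUGHPUT, BASE_GPUS)
    for gpus, seq_per_s in measurements
}

for gpus, eff in efficiencies.items():
    print(f"{gpus} GPUs: {eff:.2%}")
```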

## Performance optimization tests

| GPUs | Per-GPU batch_size | gradient_accumulation_steps | global batch size |         | Mixed precision | GEMM optimization | softmax+softmax_cross_entropy | distributed_fused_lamb | GeLU approximation | exchange padding    | global_steps to converge | walltime(s) |
|------|--------------------|-----------------------------|-------------------|---------|-----------------|-------------------|-------------------------------|------------------------|--------------------|---------------------|--------------------------|-------------|
| 8    | 4                  | 14                          | 448               | Before: |                 | 51.26 seq/s       | 85.3 seq/s                    | 89.59 seq/s            |                    | 6697 (global steps) | 6697                     | 32522.67    |
|      |                    |                             |                   | After:  | 91.92 seq/s     | 85.3 seq/s        | 89.59 seq/s                   | 91.92 seq/s            | 91.92 seq/s        | 5692 (global steps) | 5692                     |             |
| 1024 | 4                  | 1                           | 4096              | Before: | 4458.04 seq/s   |                   | 7461 seq/s                    | 5174.44 seq/s          | 7353.08 seq/s      | must be off         | 684                      | 369.325     |
|      |                    |                             |                   | After:  | 7723.38 seq/s   | 5174.44 seq/s     | 7723.38 seq/s                 | 7461 seq/s             | 7723.38 seq/s      |                     |                          |             |
| 1024 | 8                  | 2                           | 16384             | Before: | ---             |                   | 10634 seq/s                   | 9083 seq/s             |                    | must be off         | 794                      | 580.618     |
|      |                    |                             |                   | After:  | 11330.07 seq/s  | 9083 seq/s        | 11330.07 seq/s                | 10634 seq/s            | 11330.07 seq/s     |                     |                          |             |
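
The global batch size values in the table above equal the product of GPU count, per-GPU batch size, and gradient accumulation steps. The short sketch below reproduces the three configurations; `global_batch_size` is an illustrative helper, not a function from the training scripts.

```python
def global_batch_size(num_gpus, per_gpu_batch_size, accumulation_steps):
    """Sequences contributing to one optimizer step across all GPUs."""
    return num_gpus * per_gpu_batch_size * accumulation_steps


# (GPUs, per-GPU batch size, gradient accumulation steps) from the table.
table_configs = [(8, 4, 14), (1024, 4, 1), (1024, 8, 2)]

global_sizes = [global_batch_size(*cfg) for cfg in table_configs]

for cfg, size in zip(table_configs, global_sizes):
    print(f"GPUs={cfg[0]}, batch={cfg[1]}, accumulation={cfg[2]} -> global batch size {size}")
```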