# BERT
## Paper

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

* https://arxiv.org/abs/1810.04805

## Model Structure
The core of BERT is a stack of Transformer encoders. BERT-large is the larger, more complex configuration of the model: it stacks 24 Transformer encoder layers with a hidden size of 1024, for roughly 340M parameters in total. During pre-training, BERT-large is trained on large amounts of unlabeled text and optimized with two tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP).
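
As a rough sanity check on the 340M figure, the sketch below estimates BERT-large's parameter count directly from its hyperparameters. The vocabulary size of 30522 comes from the original BERT WordPiece vocabulary and is an assumption here; the remaining sizes are from the paper.

```python
# Back-of-envelope parameter count for BERT-large (sketch only).
vocab_size, max_positions, type_vocab = 30522, 512, 2   # vocab size assumed from the original BERT release
hidden, ffn, num_layers = 1024, 4096, 24

embeddings = (vocab_size + max_positions + type_vocab) * hidden   # token + position + segment embeddings
per_layer = (
    4 * (hidden * hidden + hidden)          # Q, K, V and output projections (weights + biases)
    + 2 * hidden * ffn + ffn + hidden       # feed-forward up/down projections (weights + biases)
    + 2 * 2 * hidden                        # two LayerNorms (scale + bias each)
)
total = embeddings + num_layers * per_layer
print(f"~{total / 1e6:.0f}M parameters")    # ~334M, consistent with the ~340M quoted above
```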

The figure below shows the BERT model structure:

![figure1](figure1.png)

## Algorithm Principle

BERT is trained on large amounts of unlabeled text in a self-supervised manner, encoding the linguistic knowledge contained in the text (lexical, syntactic, and semantic features) into the parameters of the Transformer encoder layers. Specifically, it uses Masked LM and Next Sentence Prediction to capture word-level and sentence-level representations, respectively.
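
To make the MLM objective concrete, the sketch below applies BERT's standard 80/10/10 masking rule: 15% of input tokens are selected for prediction, of which 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. The tokens and vocabulary here are toy placeholders, not the repository's actual preprocessing code.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style MLM masking (sketch). Returns (masked_tokens, labels), where
    labels[i] holds the original token at selected positions and None elsewhere,
    so the MLM loss is computed only on the selected positions."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        labels[i] = tok
        r = random.random()
        if r < 0.8:                    # 80%: replace with [MASK]
            masked[i] = "[MASK]"
        elif r < 0.9:                  # 10%: replace with a random vocabulary token
            masked[i] = random.choice(vocab)
        # remaining 10%: keep the original token
    return masked, labels

# Toy example with placeholder tokens (not the real WordPiece vocabulary).
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
print(mask_tokens(tokens, vocab))
```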

![figure2](bert.png)

## Environment Setup

A Docker image for training is provided and can be pulled from [光源](https://www.sourcefind.cn/#/service-details):

    docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:mlperf_paddle_bert_mpirun
    # <Image ID>: replace with the ID of the Docker image pulled above
    # <Host Path>: path on the host machine
    # <Container Path>: mount path inside the container
    docker run -it --name mlperf_bert --shm-size=32G  --device=/dev/kfd --device=/dev/dri/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v <Host Path>:<Container Path> <Image ID> /bin/bash

Image version dependencies:
* DTK driver: dtk21.04
* Python: 3.6.8

Note: this image currently supports only Z100/Z100L series cards.
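
Once inside the container, a quick way to confirm that the bundled framework can see the accelerator devices is PaddlePaddle's built-in self-check. This assumes the image ships a PaddlePaddle build that provides `paddle.utils.run_check()`:

```python
# Minimal environment sanity check inside the container
# (assumes the image provides a working PaddlePaddle installation).
import paddle

print(paddle.__version__)   # framework version bundled in the image
paddle.utils.run_check()    # verifies that PaddlePaddle can run on the available devices
```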

Test directory:

```
/root/mlperf-paddle_bert.20220919-training-bert/training/bert
```

## Dataset

The training data comes from [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia), a widely used NLP corpus containing Wikipedia articles and their abstracts (the first paragraph of each article). It can be used for a variety of text tasks such as text classification, summarization, and named entity recognition.

Data download and preprocessing can be done as follows; the resulting input data is shown in the figure below:

    ./input_preprocessing/prepare_data.sh --outputdir /workspace/bert_data 
![dataset](dataset.png)
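
If you want a quick look at the raw Wikipedia corpus before running the preprocessing script, a minimal sketch using the Hugging Face `datasets` library is shown below. The snapshot name `20231101.en` is an assumed example configuration; this is not part of the repository's own pipeline, which is handled end to end by `prepare_data.sh`.

```python
# Optional: inspect the raw Wikipedia corpus (not part of prepare_data.sh).
# The snapshot "20231101.en" is an assumed example configuration.
from datasets import load_dataset

wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
print(wiki)                      # number of articles and available fields
print(wiki[0]["title"])          # title of the first article
print(wiki[0]["text"][:200])     # first 200 characters of its body
```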

The pretrained model needs to be converted from the downloaded checkpoint as follows:

     python3 models/load_tf_checkpoint.py \
        /workspace/bert_data/phase1/model.ckpt-28252 \
        /workspace/bert_data/phase1/model.ckpt-28252.tf_pickled

This produces the pretrained model in the /workspace/bert_data directory, laid out as follows:

    /workspace/bert_data
    └── phase1
        └── model.ckpt-28252.tf_pickled    # pretrained model
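
The exact layout of the pickled file is determined by `models/load_tf_checkpoint.py`; assuming it is a plain pickle mapping parameter names to weight arrays, it can be inspected with a sketch like the one below (this assumption about the file's structure is not taken from the repository).

```python
# Inspect the converted checkpoint (sketch; assumes a plain pickle whose
# top-level object maps parameter names to weight arrays).
import pickle

with open("/workspace/bert_data/phase1/model.ckpt-28252.tf_pickled", "rb") as f:
    state = pickle.load(f)

for name in list(state)[:5]:     # show the first few entries
    value = state[name]
    print(name, getattr(value, "shape", type(value)))
```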

## Training

### Single-Node Multi-Card

Run the performance and accuracy tests on a single node with 8 cards:

    bash run_8gpu.sh
    
    # Environment configuration and data locations vary between setups; adjust the following line in the run_benchmark_8gpu.sh script as needed:
    BASE_DATA_DIR=${BASE_DATA_DIR:-"/public/DL_DATA/mlperf/bert"}    # set this to the actual data path

## Results

![result](result.png)

## Accuracy

Using the input data above on 8 Z100L accelerator cards, training reaches the official convergence requirement, i.e. the target Masked-LM accuracy of 0.72.

| Cards | Precision       | Processes | Accuracy achieved        |
| ----- | --------------- | --------- | ------------------------ |
| 8     | Mixed precision | 8         | 0.72 Masked-LM accuracy  |
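
For reference, Masked-LM accuracy is the fraction of masked positions whose original token is predicted correctly. A minimal sketch of the metric is shown below; the use of -100 as the "ignore" label for unmasked positions is a common convention assumed here, not necessarily what this repository uses.

```python
import numpy as np

def masked_lm_accuracy(predictions, labels, ignore_index=-100):
    """Fraction of masked positions predicted correctly.
    predictions: predicted token ids; labels: original token ids at masked
    positions and ignore_index everywhere else."""
    predictions, labels = np.asarray(predictions), np.asarray(labels)
    masked = labels != ignore_index               # score only the masked positions
    return float((predictions[masked] == labels[masked]).mean())

# Toy example: 3 of the 4 masked positions are predicted correctly -> 0.75
preds  = [101, 2009, 2003, 1037, 3231, 102]
labels = [-100, 2009, 2003, 1037, 9999, -100]
print(masked_lm_accuracy(preds, labels))
```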

## Application Scenarios

### Algorithm Category

Natural language processing

### Key Application Industries

Retail, broadcast media

## Source Repository and Issue Feedback

* https://developer.sourcefind.cn/codes/modelzoo/mlperf_bert_paddle

## References
* https://mlcommons.org/en/
* https://github.com/mlcommons
* https://github.com/mlcommons/training_results_v2.1/tree/main/Baidu/benchmarks/bert/implementations/8_node_64_A100_PaddlePaddle