# bert-large

## Paper

`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding`

- [https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805)

## Model Architecture

<img src="http://developer.sourcefind.cn/codes/modelzoo/bert-large_oneflow/-/raw/main/bert%E6%A8%A1%E5%9E%8B%E7%BB%93%E6%9E%84.png" alt="bert模型结构.png" style="zoom: 50%;" />


So that users can quickly verify BERT pretraining with OneFlow-Libai, measure performance, or check accuracy, we provide an example BERT network. Its main parameters are:

```python
model.cfg.num_attention_heads = 16
model.cfg.hidden_size = 768
model.cfg.hidden_layers = 8
```

The full Bert-Large network configuration is in configs/common/model/bert.py.
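
For orientation, a LiBai config file typically composes settings by importing a base config and overriding its fields. Below is a minimal sketch, assuming the module path matches the `configs/common/model/bert.py` file above and that it exposes a `pretrain_model` object (both are assumptions; see `configs/bert_large_pretrain.py` in this repo for the authoritative usage):

```python
# Minimal sketch of a LiBai-style config override. The module path and the
# `pretrain_model` name are assumptions, not verified against this repo.
from configs.common.model.bert import pretrain_model as model

# The shrunken example network from this README; the standard Bert-Large
# uses hidden_size=1024 with 24 layers, so this variant trains much faster.
model.cfg.num_attention_heads = 16
model.cfg.hidden_size = 768
model.cfg.hidden_layers = 8
```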

## Algorithm

BERT (Bidirectional Encoder Representations from Transformers) is a pretrained language-representation model. Rather than pretraining with a traditional unidirectional language model, or shallowly concatenating two unidirectional models, it uses a new **masked language model (MLM)** objective, which lets it produce **deep bidirectional** language representations. Earlier pretrained models were constrained to a single direction (*left-to-right or right-to-left*), which limited their representational power: they could only capture context from one side. BERT instead pretrains with MLM and builds the entire model from deep bidirectional Transformer components (*a unidirectional Transformer is usually called a Transformer decoder, where each token can only attend to the tokens to its left; a bidirectional Transformer is called a Transformer encoder, where each token attends to all tokens*), so it ultimately produces deep bidirectional representations that **fuse left and right context**.

<img src="http://developer.sourcefind.cn/codes/modelzoo/bert-large_oneflow/-/raw/main/bert%E7%AE%97%E6%B3%95%E5%8E%9F%E7%90%86.png" alt="bert算法原理.png" style="zoom: 50%;" />
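
To make the MLM objective concrete, here is a minimal, self-contained sketch of the corruption step from the paper (15% of tokens are selected; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged). The token IDs are invented for illustration, and the `-100` ignore-label is a common framework convention, not something this repo prescribes:

```python
import random

MASK_ID = 103       # id of the [MASK] token (BERT convention; illustrative)
VOCAB_SIZE = 30522  # BERT English vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted_ids, labels); labels are -100 where no loss is computed."""
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue                      # 85% of positions are left alone
        labels[i] = tok                   # the model must recover the original
        r = random.random()
        if r < 0.8:                       # 80%: replace with [MASK]
            corrupted[i] = MASK_ID
        elif r < 0.9:                     # 10%: replace with a random token
            corrupted[i] = random.randrange(VOCAB_SIZE)
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

print(mask_tokens([7592, 2088, 2003, 2307]))  # tiny made-up example input
```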

## Environment Setup

### Docker

```plaintext
docker pull image.sourcefind.cn:5000/dcu/admin/base/oneflow:0.9.1-centos7.6-dtk-22.10.1-py39-latest
# Replace <Your Image ID> with the ID of the docker image pulled above
docker run --shm-size 16g --network=host --name=bert_oneflow --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $PWD/bert-large_oneflow:/home/bert-large_oneflow -it <Your Image ID> bash
cd /home/bert-large_oneflow
pip3 install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
pip3 install pybind11 -i https://mirrors.aliyun.com/pypi/simple
pip3 install -e . -i https://mirrors.aliyun.com/pypi/simple
```
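
As an optional sanity check (our suggestion, not part of the official steps), you can confirm inside the container that OneFlow imports cleanly and sees the four DCU devices; `oneflow.cuda.device_count()` follows OneFlow's PyTorch-like API:

```python
import oneflow as flow

print(flow.__version__)          # expect 0.9.1 for this image
print(flow.cuda.device_count())  # expect 4 on a 4-card DCU node
```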

## Dataset
Several small datasets are bundled under the libai directory so users can verify quickly; the layout is:

    $ tree nlp_data
    ├── data
    ├── bert-base-chinese-vocab
    ├── gpt2-merges
    └── gpt2-vocab

Training data:
- https://oneflow-static.oss-cn-beijing.aliyuncs.com/ci-files/dataset/libai/gpt_dataset/gpt2-merges.txt
- https://oneflow-static.oss-cn-beijing.aliyuncs.com/ci-files/dataset/libai/gpt_dataset/gpt2-vocab.json
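
If the two files are not already present, a short script can fetch them. This is a convenience sketch only; the `nlp_data/data` destination is an assumption, so adjust it to wherever your config expects the GPT-2 vocab files:

```python
import os
import urllib.request

BASE = "https://oneflow-static.oss-cn-beijing.aliyuncs.com/ci-files/dataset/libai/gpt_dataset"
DEST = "nlp_data/data"  # assumed destination; match your dataset config

os.makedirs(DEST, exist_ok=True)
for name in ("gpt2-merges.txt", "gpt2-vocab.json"):
    urllib.request.urlretrieve(f"{BASE}/{name}", os.path.join(DEST, name))
    print("downloaded", name)
```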


## Training

The pretraining script runs on 1 node with 4 DCU-Z100-16G cards.

The parallelism strategy is set in configs/bert_large_pretrain.py, with automatic mixed precision (AMP) enabled:

```python
train.amp.enabled = True
train.train_micro_batch_size = 16
train.dist.data_parallel_size = 4
train.dist.tensor_parallel_size = 1
train.dist.pipeline_parallel_size = 1
```
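
As a quick consistency check (our arithmetic, not output from the repo): the three parallel degrees multiply to the number of devices required, and with no gradient accumulation the effective global batch size is the micro batch size times the data-parallel degree:

```python
data_parallel, tensor_parallel, pipeline_parallel = 4, 1, 1
micro_batch = 16

devices_needed = data_parallel * tensor_parallel * pipeline_parallel
global_batch = micro_batch * data_parallel  # assumes no gradient accumulation

print(devices_needed)  # 4  -> matches the 4 DCU-Z100-16G cards
print(global_batch)    # 64 samples per optimizer step
```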

Pretraining command:

    cd /home/bert-large_oneflow
    bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 4
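
Following LiBai's usual `train.sh` convention, the three arguments are the training entry script, the config file, and the number of devices; the trailing `4` matches the four DCU cards configured above.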

## Results

### Accuracy

GPGPU used: 4 × DCU-Z100-16G.

Model accuracy:

| Cards | Distributed framework |                         Convergence                          |
| :--: | :--------: | :----------------------------------------------------------: |
|  4   | Libai-main | total_loss: 6.555, lm_loss: 5.973, sop_loss: 0.583 (at 10000 iters) |

## Application Scenarios

### Algorithm Category

`Dialogue & Question Answering`

### Key Application Industries

`Healthcare, Education, Research, Finance`

## Source Repository and Issue Reporting

* https://developer.sourcefind.cn/codes/modelzoo/bert-large_oneflow

## References
* https://libai.readthedocs.io/en/latest/tutorials/get_started/quick_run.html
* https://github.com/Oneflow-Inc/oneflow
* https://github.com/Oneflow-Inc/libai/blob/main/docs/source/notes/FAQ.md