# Bidirectional Encoder Representations from Transformers (BERT)
## Model Introduction
BERT, short for Bidirectional Encoder Representations from Transformers, is a pre-trained language representation model. Unlike earlier approaches, which pre-train with a traditional unidirectional language model or shallowly concatenate two unidirectional models, it uses a new **masked language model (MLM)** objective, which lets it learn **deep bidirectional** language representations.
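
The MLM objective is easy to sketch: roughly 15% of the input tokens are selected, and each selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, or left unchanged 10% of the time, while the model is trained to recover the originals. Below is a minimal Python sketch; the token IDs are illustrative assumptions, not LiBai API values.

```
# Minimal sketch of BERT-style MLM masking (the 15% / 80-10-10 rule).
# MASK_ID and VOCAB_SIZE are illustrative assumptions, not LiBai values.
import random

MASK_ID = 103        # assumed id of [MASK] in the vocabulary
VOCAB_SIZE = 30522   # assumed vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (masked inputs, labels); -100 marks positions without loss."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok                      # the model must recover this token
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_ID              # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels
```
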
## Model Architecture
Earlier pre-trained models were constrained by unidirectional language modeling (*left-to-right or right-to-left*), which limited their representational power: each token could only see context from one direction. BERT instead pre-trains with MLM and builds the whole model from deep bidirectional Transformer components (*a unidirectional Transformer is usually called a Transformer decoder, where each token only attends to the tokens to its left, while a bidirectional Transformer is called a Transformer encoder, where each token attends to all tokens*), so it ultimately produces deep bidirectional representations that **fuse left and right context**.
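
Concretely, the encoder/decoder distinction comes down to the attention mask. A purely illustrative NumPy sketch (not LiBai code):

```
# Decoder-style (unidirectional) vs. encoder-style (bidirectional) attention
# masks; True means "position j is visible to position i".
import numpy as np

seq_len = 5

# Transformer decoder: token i attends only to positions <= i (causal mask).
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Transformer encoder (BERT): every token attends to every token.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

print(causal_mask.astype(int))         # lower-triangular matrix of ones
print(bidirectional_mask.astype(int))  # all-ones matrix
```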

To let users quickly verify BERT pre-training with OneFlow-LiBai, measure performance, or check accuracy, we provide an example BERT network whose main parameters are:

```
model.cfg.num_attention_heads = 16
model.cfg.hidden_size = 768
model.cfg.hidden_layers = 8
```

The complete BERT-Large network configuration is in `configs/common/model/bert.py`.
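
For context, a LiBai-style pre-training config is itself a Python file that composes the model with training, optimizer, and data settings, so any field can be overridden in plain Python. The sketch below is an assumption based on LiBai's repository layout, not an exact copy of this repo's files:

```
# Sketch of how a LiBai-style pre-training config is composed; the
# relative import paths are assumptions based on LiBai's repo layout.
from .common.model.bert import pretrain_model as model
from .common.train import train
from .common.optim import optim
from .common.data.bert_dataset import dataloader, tokenization

# Fields can then be overridden directly:
model.cfg.num_attention_heads = 16
model.cfg.hidden_size = 768
model.cfg.hidden_layers = 8
```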

## Dataset
Several small datasets are bundled under the libai directory so users can verify quickly; the path is:

    ./nlp_data
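
For reference, LiBai's quick-run tutorial (linked under References) points the config at a sample dataset through data-prefix and vocab-file fields. Below is a hedged sketch for the bundled data; the import path and file names are assumptions about this repo's layout:

```
# Hedged sketch: wiring the bundled sample data into the pre-training
# config. Import path and file names are assumptions, not verified paths.
from configs.common.data.bert_dataset import dataloader, tokenization

vocab_file = "./nlp_data/bert-base-chinese-vocab.txt"
data_prefix = "./nlp_data/loss_compara_content_sentence"  # expects a .bin/.idx pair

tokenization.tokenizer.vocab_file = vocab_file
dataloader.train.dataset[0].data_prefix = data_prefix
dataloader.train.dataset[0].indexed_dataset.data_prefix = data_prefix
```
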
## BERT Pre-training

### Environment Setup

Running inside Docker is recommended. A pre-built image can be pulled from [SourceFind (光源)](https://www.sourcefind.cn/#/service-details): `image.sourcefind.cn:5000/dcu/admin/base/oneflow:0.9.1-centos7.6-dtk-22.10.1-py39-latest`

Inside the Docker container:

    cd libai
    pip3 install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
    pip3 install pybind11 -i https://mirrors.aliyun.com/pypi/simple
    pip3 install -e . -i https://mirrors.aliyun.com/pypi/simple

### Training

The pre-training script targets a single node with four DCU-Z100-16G cards.

The parallelism strategy is set in `configs/bert_large_pretrain.py`, with automatic mixed precision (AMP) enabled:

```
train.amp.enabled = True
train.train_micro_batch_size = 16
train.dist.data_parallel_size = 4
train.dist.tensor_parallel_size = 1
train.dist.pipeline_parallel_size = 1
```
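
As in Megatron-style 3-D parallelism, the three parallel degrees multiply to the total device count, which is why 4 × 1 × 1 matches the four DCUs here. A one-line sanity check:

```
# Sanity check implied by the settings above: the parallel degrees must
# multiply to the total number of devices (4 DCUs in this example).
data_parallel_size, tensor_parallel_size, pipeline_parallel_size = 4, 1, 1
num_devices = 4

assert data_parallel_size * tensor_parallel_size * pipeline_parallel_size == num_devices
```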

Pre-training command (the arguments to `tools/train.sh` are the training entry script, the config file, and the number of devices):

    cd libai
    bash tools/train.sh tools/train_net.py configs/bert_large_pretrain.py 4

### Accuracy

Training data: [gpt_dataset](https://oneflow-static.oss-cn-beijing.aliyuncs.com/ci-files/dataset/libai/gpt_dataset)

GPGPUs used: 4 × DCU-Z100-16G.

Model accuracy:

| Cards | Distributed framework |                         Convergence                          |
| :---: | :-------------------: | :----------------------------------------------------------: |
|   4   |      Libai-main       | total_loss: 6.555, lm_loss: 5.973, sop_loss: 0.583 (at 10,000 iters) |

## Source Repository and Issue Reporting
* https://developer.hpccube.com/codes/modelzoo/bert-large_oneflow

## References
* https://libai.readthedocs.io/en/latest/tutorials/get_started/quick_run.html
* https://github.com/Oneflow-Inc/oneflow
* https://github.com/Oneflow-Inc/libai/blob/main/docs/source/notes/FAQ.md