"git@developer.sourcefind.cn:OpenDAS/dgl.git" did not exist on "e864f9c50f51f6b3ac4940e4d3c6675ab1979972"
README.md 4.75 KB
Newer Older
wanglch's avatar
wanglch committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
# UMT5

**Note: downstream tasks require training the model first; see train_model.py for the training code.**
<div align="center">
    <img src="./docs/T5_task.png"/>
</div>

## Paper
- [UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining](https://arxiv.org/abs/2304.09151)

## Model Structure

umT5 is the multilingual version of T5 and retains most of T5's versatility. It is pretrained on mC4, a multilingual Common Crawl corpus covering 101 languages. It uses an encoder-decoder architecture with 12 encoder layers and 12 decoder layers, totaling about 220M parameters, roughly twice the size of bert-base.

<div align="center">
    <img src="./docs/T5_structure.png"/>
</div>
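
The depth and width described above can be checked directly from the checkpoint config (a quick sketch; assumes `google/umt5-base` is reachable locally, from huggingface, or via hf-mirror):

```python
from transformers import AutoConfig

# Inspect the umt5-base architecture from its config.
config = AutoConfig.from_pretrained("google/umt5-base")
print(config.num_layers, config.num_decoder_layers)  # encoder / decoder depth
print(config.d_model, config.d_ff)                   # hidden and FFN width
```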

## Algorithm

Overall, mT5 (whose architecture umT5 shares) follows T5 closely and is largely the same model; on the architecture side, however, mT5 adopts the T5.1.1 recipe, which we briefly introduce here.

The main change comes from the paper [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202), which borrows the **GLU** (Gated Linear Unit) of [Language Modeling with Gated Convolutional Networks](https://arxiv.org/abs/1612.08083) to strengthen the FFN block. Specifically, the original T5 FFN is (T5 has no bias terms):

<div align=center>
    <img src="./docs/equation1.png"/>
</div>

which becomes:

<div align=center>
    <img src="./docs/euqation2.png"/>
</div>
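
In PyTorch the gated FFN looks roughly as follows (a minimal sketch of the T5.1.1-style GEGLU block; the module name and default dimensions are illustrative, not lifted from the `transformers` source):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    """T5.1.1-style FFN: FFN(x) = (GELU(x W1) * (x W2)) W3, with no biases."""
    def __init__(self, d_model=768, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate branch
        self.w2 = nn.Linear(d_model, d_ff, bias=False)  # linear branch
        self.w3 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x):
        return self.w3(F.gelu(self.w1(x)) * self.w2(x))

ffn = GatedFFN()
print(ffn(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```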


## Environment Setup
### Docker (Option 1)
Pull the Docker image from [光源](https://www.sourcefind.cn/#/service-details) and follow these steps:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10

docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=64G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name umt5 <your imageID> bash

cd /path/your_code_data/umt5

pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```

### Dockerfile (Option 2)
```
cd /path/your_code_data/umt5/docker

docker build --no-cache -t umt5:latest .

docker run --shm-size=64G --name umt5 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it umt5 bash
```
### Anaconda (Option 3)
Detailed steps for local setup and compilation are provided here:

The DCU-specific deep learning libraries required by this project can be downloaded and installed from the [光合](https://developer.sourcefind.cn/tool/) developer community.
```
DTK driver: dtk24.04.1
python: python3.10
torch: 2.1.0
torchvision: 0.16.0
deepspeed: 0.12.3
```
`Tips: the versions of the DTK driver, python, torch, and the other DCU-related tools above must correspond exactly.`

```
conda create -n umt5 python=3.10

conda activate umt5 

cd /path/your_code_data/umt5

pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple
```
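
After installation, a quick sanity check (a minimal sketch; on DCU the ROCm/HIP stack is exposed through PyTorch's regular `torch.cuda` API):

```python
import torch

print(torch.__version__)           # expect 2.1.0
print(torch.cuda.is_available())   # True once the DCU driver stack is visible
print(torch.cuda.device_count())
```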

## Dataset

We use the Large-scale Chinese Short Text Summarization corpus [LCSTS](http://icrc.hitsz.edu.cn/Article/show/139.html), built from short news posts on Sina Weibo and containing over 2 million text-summary pairs.

```
├── data
|   ├── data1.tsv
|   ├── data2.tsv
|   └── data3.tsv
```
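The TSV files can be read with a small helper like the one below (a sketch; the two-column layout of source text followed by reference summary is an assumption about the prepared files, so adjust to match yours):

```python
import csv

def load_lcsts(path):
    """Read one LCSTS TSV file into a list of {text, summary} dicts.

    Assumes column 0 is the source text and column 1 the summary;
    adjust if your prepared files differ.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                pairs.append({"text": row[0], "summary": row[1]})
    return pairs

samples = load_lcsts("./data/data1.tsv")
print(len(samples))
```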
## Training

### Single node, single card

```
python train_single_dcu.py
```
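
For reference, the core of such a fine-tune with Hugging Face `transformers` looks roughly like this (a sketch of what `train_single_dcu.py` presumably does; the hyperparameters, paths, and the `load_lcsts` helper from the Dataset section are illustrative, not taken from the script):

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/umt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/umt5-base")

def preprocess(batch):
    # Tokenize source texts; text_target tokenizes the reference summaries.
    inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

# Reuses the load_lcsts() helper sketched in the Dataset section.
train_dataset = (Dataset.from_list(load_lcsts("./data/data1.tsv"))
                 .map(preprocess, batched=True,
                      remove_columns=["text", "summary"]))

args = Seq2SeqTrainingArguments(
    output_dir="./umt5-lcsts",       # checkpoint directory (placeholder)
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=1,
    logging_steps=100,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("./umt5-lcsts")
```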

## Inference

### Summarization

Summarization requires a trained model first: download umt5-base from hf-mirror or huggingface, fine-tune it with **train_single_dcu.py**, and save the trained weights; these weights are then loaded for summarization. Reading comprehension and translation tasks follow the same procedure.

```
python umt5_summary.py
```
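
The generation step inside `umt5_summary.py` presumably amounts to something like this (a sketch; the checkpoint path and generation parameters are placeholders):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned weights saved after training; the path is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("./umt5-lcsts")
model = AutoModelForSeq2SeqLM.from_pretrained("./umt5-lcsts").eval()

text = "..."  # a Weibo-style short news item
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```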

## Results

<div align=center>
    <img src="./docs/result_2.png"/>
</div>

### Accuracy
Test data: [LCSTS](http://icrc.hitsz.edu.cn/Article/show/139.html); accelerator cards used: V100S/K100.

Test results:
| device | ROUGE-1 | ROUGE-2 | ROUGE-L |
| :------: | :------: | :------: | :------: |
| V100S | 26.12 | 14.81 | 23.62 |
| K100 | 26.94 | 15.38 | 24.24 |
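
The scores can be reproduced with, for example, the `rouge` package (a sketch; for Chinese text the hypothesis and reference are scored at the character level by inserting spaces, since the scorer splits on whitespace):

```python
from rouge import Rouge  # pip install rouge

def chars(s):
    # Space-join characters so the whitespace-based scorer
    # effectively computes character-level ROUGE for Chinese.
    return " ".join(s)

hyp = chars("模型生成的摘要文本")
ref = chars("人工标注的参考摘要")
scores = Rouge().get_scores(hyp, ref)[0]
print(scores["rouge-1"]["f"], scores["rouge-2"]["f"], scores["rouge-l"]["f"])
```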

## Application Scenarios
### Algorithm Category
`Text summarization`

### Key Application Industries
`Finance, education, government, scientific research, manufacturing, energy, broadcast media`

## Pretrained Weights

- [umt5-base pretrained model on hf-mirror](https://hf-mirror.com/google/umt5-base/tree/main)

- [umt5 model collection on hf-mirror](https://hf-mirror.com/collections/google/mt5-release-65005f1a520f8d7b4d039509)


## Source Repository & Issue Feedback
- http://developer.sourcefind.cn/codes/modelzoo/umt5.git

## References
- [UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining](https://arxiv.org/abs/2304.09151)

- [google-research/multilingual-t5](https://github.com/google-research/multilingual-t5)