# Document Visual Question Answering (DOC-VQA)

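VQA (Visual Question Answering) is the task of asking and answering questions about image content. DOC-VQA is a VQA task in which the questions concern the text content of document images.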

The DOC-VQA algorithms in PP-Structure are built on PaddleNLP, a natural language processing library.

The main features are:

- Integrates the [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) model and the PP-OCR inference engine.
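- Supports multimodal Semantic Entity Recognition (SER) and Relation Extraction (RE). SER recognizes and classifies the text in an image; RE extracts relations between text regions in an image, for example pairing each question with its answer.
- Supports custom training for both the SER and RE tasks.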
- Supports end-to-end prediction and evaluation with the OCR + SER system.
- Supports end-to-end prediction with the OCR + SER + RE system.
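
As a rough illustration of how the two tasks fit together, the sketch below models SER output as labeled text entities and RE output as index pairs that link a question to its answer. The `Entity` class, the `qa_pairs` helper, and the sample values are hypothetical and are not part of the PaddleOCR code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entity:
    """One text region after SER: recognized text plus its predicted class."""
    text: str
    label: str  # one of "QUESTION", "ANSWER", "HEADER"


def qa_pairs(entities: List[Entity],
             relations: List[Tuple[int, int]]) -> List[Tuple[str, str]]:
    """Turn RE output (question_index, answer_index) into (question, answer) text pairs.

    Pairs whose endpoints do not carry the expected labels are skipped.
    """
    pairs = []
    for q_idx, a_idx in relations:
        question, answer = entities[q_idx], entities[a_idx]
        if question.label == "QUESTION" and answer.label == "ANSWER":
            pairs.append((question.text, answer.text))
    return pairs


# Hypothetical SER result for a small form.
entities = [
    Entity("Registration Form", "HEADER"),
    Entity("Name", "QUESTION"),
    Entity("Li Lei", "ANSWER"),
    Entity("Date", "QUESTION"),
    Entity("2021-12-01", "ANSWER"),
]
# Hypothetical RE result: each tuple links a question to its answer.
relations = [(1, 2), (3, 4)]

print(qa_pairs(entities, relations))
```

In this picture, SER alone is enough to label every text region, while RE adds the links needed to read a form as key-value pairs.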


This project is an open-source implementation of [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2,
and includes code for fine-tuning on the [XFUND dataset](https://github.com/doc-analysis/XFUND).

## 1. Performance

We evaluated the algorithms on the [XFUND](https://github.com/doc-analysis/XFUND) evaluation set. The results are as follows:

| Task | F1 | Model download |
|:---:|:---:|:---:|
| SER | 0.9056 | [link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_ser_pretrained.tar) |
| RE | 0.7113 | [link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_re_pretrained.tar) |



## 2. Demo

**Note:** The test images come from the XFUND dataset.

### 2.1 SER

![](./images/result_ser/zh_val_0_ser.jpg) | ![](./images/result_ser/zh_val_42_ser.jpg)
---|---

Boxes of different colors in the images denote different categories. The XFUND dataset has three categories: `QUESTION`, `ANSWER`, and `HEADER`.

* Dark purple: HEADER
* Light purple: QUESTION
* Army green: ANSWER

The category and the OCR recognition result are also shown at the top left of each OCR detection box.


### 2.2 RE

![](./images/result_re/zh_val_21_re.jpg) | ![](./images/result_re/zh_val_40_re.jpg)
---|---


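In the images, red boxes denote questions, blue boxes denote answers, and a green line connects each question to its answer. The category and the OCR recognition result are also shown at the top left of each OCR detection box.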


## 3. Installation

### 3.1 Install dependencies

- **(1) Install PaddlePaddle**

```bash
pip3 install --upgrade pip

# GPU version
python3 -m pip install paddlepaddle-gpu==2.2 -i https://mirror.baidu.com/pypi/simple

# CPU version
python3 -m pip install paddlepaddle==2.2 -i https://mirror.baidu.com/pypi/simple

```
For other requirements, follow the instructions in the [installation guide](https://www.paddlepaddle.org.cn/install/quick).
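
To confirm that the installation succeeded, a quick check along the following lines may help. This is a minimal sketch: the helper name `installed_paddle_version` is hypothetical, and it only reports whether the `paddle` package can be imported and which version it is.

```python
import importlib.util
from typing import Optional


def installed_paddle_version() -> Optional[str]:
    """Return the installed PaddlePaddle version, or None if it is not installed."""
    if importlib.util.find_spec("paddle") is None:
        return None
    import paddle  # imported lazily so the check also works without Paddle
    return paddle.__version__


version = installed_paddle_version()
if version is None:
    print("PaddlePaddle is not installed")
else:
    print("PaddlePaddle version:", version)
```

For a fuller check of the runtime, including GPU support, the PaddlePaddle installation guide suggests running `paddle.utils.run_check()` in a Python session.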


### 3.2 Install PaddleOCR (includes PP-OCR and VQA)

- **(1) Quick install of the PaddleOCR whl package via pip (inference only)**

```bash
pip install paddleocr
```

- **(2) Download the VQA source code (inference + training)**

```bash
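# Recommended
git clone https://github.com/PaddlePaddle/PaddleOCR

# If the clone fails because of network problems, you can use the mirror hosted on Gitee instead:
git clone https://gitee.com/paddlepaddle/PaddleOCR

# Note: the Gitee mirror may lag behind this GitHub project by 3 to 5 days, so prefer the recommended method.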
```

- **(3) Install PaddleNLP**

```bash
# Install PaddleNLP from the latest source code
git clone https://github.com/PaddlePaddle/PaddleNLP -b develop
cd PaddleNLP
pip3 install -e .
```


- **(4) Install the VQA `requirements`**

```bash
cd ppstructure/vqa
pip install -r requirements.txt
```

## 4. Usage


### 4.1 Prepare data and pretrained models

Download link for the preprocessed XFUND Chinese dataset: [https://paddleocr.bj.bcebos.com/dataset/XFUND.tar](https://paddleocr.bj.bcebos.com/dataset/XFUND.tar)


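Download and extract the dataset, then place the extracted `XFUND` folder in the current directory.

```shell
wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar
# Extract the archive in the current directory
tar -xf XFUND.tar
```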
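
Before training, it can be useful to confirm that the extracted dataset has the layout the commands in this README expect. The sketch below checks only the four paths referenced by those commands; the helper name `missing_paths` is hypothetical, and the archive may contain other files that are not checked here.

```python
from pathlib import Path
from typing import List

# Paths that the training and evaluation commands in this README refer to.
EXPECTED_PATHS = [
    "zh_train/image",
    "zh_train/xfun_normalize_train.json",
    "zh_val/image",
    "zh_val/xfun_normalize_val.json",
]


def missing_paths(root: str = "XFUND") -> List[str]:
    """Return the expected dataset paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]


missing = missing_paths()
if missing:
    print("Missing under XFUND/:", ", ".join(missing))
else:
    print("Dataset layout looks complete")
```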

To convert the XFUND data for other languages, refer to the [XFUND data conversion script](helper/trans_xfun_data.py).

If you want to try prediction right away, download the pretrained models we provide and skip training.


### 4.2 SER task

* Start training

```shell
python3.7 train_ser.py \
    --model_name_or_path "layoutxlm-base-uncased" \
    --train_data_dir "XFUND/zh_train/image" \
    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
    --eval_data_dir "XFUND/zh_val/image" \
    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
    --num_train_epochs 200 \
    --eval_steps 10 \
    --output_dir "./output/ser/" \
    --learning_rate 5e-5 \
    --warmup_steps 50 \
    --evaluate_during_training \
    --seed 2048
```

Metrics such as `precision`, `recall`, and `f1` are printed at the end, and the model and training log are saved in the `./output/ser/` folder.
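
The `--learning_rate 5e-5` and `--warmup_steps 50` flags in the training command suggest that the learning rate is ramped up over the first 50 steps. The sketch below assumes a simple linear warmup to the base rate and a constant rate afterwards. This is an assumption for illustration only: the scheduler actually used by `train_ser.py` is not described here and may differ (for example, it may decay the rate after warmup). The function name `warmup_lr` is hypothetical.

```python
def warmup_lr(step: int, base_lr: float = 5e-5, warmup_steps: int = 50) -> float:
    """Learning rate at `step` (0-based): linear warmup, then constant (assumed)."""
    if step < warmup_steps:
        # Ramp linearly so that the base rate is reached on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    return base_lr


# Print the rate at a few steps during and after warmup.
for step in (0, 24, 49, 100):
    print(step, warmup_lr(step))
```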

* Resume training

```shell
python3.7 train_ser.py \
    --model_name_or_path "model_path" \
    --train_data_dir "XFUND/zh_train/image" \
    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
    --eval_data_dir "XFUND/zh_val/image" \
    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
    --num_train_epochs 200 \
    --eval_steps 10 \
    --output_dir "./output/ser/" \
    --learning_rate 5e-5 \
    --warmup_steps 50 \
    --evaluate_during_training \
    --seed 2048 \
    --resume
```

* Evaluation
```shell
export CUDA_VISIBLE_DEVICES=0
python3 eval_ser.py \
    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
    --eval_data_dir "XFUND/zh_val/image" \
    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
    --per_gpu_eval_batch_size 8 \
    --output_dir "output/ser/"  \
    --seed 2048
```
Metrics such as `precision`, `recall`, and `f1` are printed at the end.
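
For reference, the three metrics are related as in the sketch below, which uses the standard definitions. How the evaluation script counts a correct prediction is not specified here, so the function name and the counts are hypothetical and are not results from this project.

```python
from typing import Tuple


def precision_recall_f1(true_positives: int,
                        num_predicted: int,
                        num_gold: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts.

    precision = TP / predicted, recall = TP / gold,
    F1 = harmonic mean of precision and recall.
    """
    precision = true_positives / num_predicted if num_predicted else 0.0
    recall = true_positives / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, not results from this project.
p, r, f1 = precision_recall_f1(true_positives=90, num_predicted=100, num_gold=120)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```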

* Predict using the OCR results provided with the evaluation set

```shell
export CUDA_VISIBLE_DEVICES=0
python3.7 infer_ser.py \
    --model_name_or_path "./PP-Layout_v1.0_ser_pretrained/" \
    --output_dir "output_res/" \
    --infer_imgs "XFUND/zh_val/image/" \
    --ocr_json_path "XFUND/zh_val/xfun_normalize_val.json"
```

The visualized prediction images and a prediction result text file named `infer_results.txt` are saved in the `output_res` directory.

* Predict with the `OCR engine + SER` pipeline

```shell
export CUDA_VISIBLE_DEVICES=0
python3.7 infer_ser_e2e.py \
    --model_name_or_path "./output/PP-Layout_v1.0_ser_pretrained/" \
    --max_seq_length 512 \
    --output_dir "output_res_e2e/" \
    --infer_imgs "images/input/zh_val_0.jpg"
```

* Evaluate the `OCR engine + SER` prediction system end to end

```shell
export CUDA_VISIBLE_DEVICES=0
python3.7 helper/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json  --pred_json_path output_res/infer_results.txt
```


### 4.3 RE task

* Start training

```shell
export CUDA_VISIBLE_DEVICES=0
python3 train_re.py \
    --model_name_or_path "layoutxlm-base-uncased" \
    --train_data_dir "XFUND/zh_train/image" \
    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
    --eval_data_dir "XFUND/zh_val/image" \
    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
    --label_map_path 'labels/labels_ser.txt' \
    --num_train_epochs 200 \
    --eval_steps 10 \
    --output_dir "output/re/"  \
    --learning_rate 5e-5 \
    --warmup_steps 50 \
    --per_gpu_train_batch_size 8 \
    --per_gpu_eval_batch_size 8 \
    --evaluate_during_training \
    --seed 2048

```

* Resume training

```shell
export CUDA_VISIBLE_DEVICES=0
python3 train_re.py \
    --model_name_or_path "model_path" \
    --train_data_dir "XFUND/zh_train/image" \
    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
    --eval_data_dir "XFUND/zh_val/image" \
    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
    --label_map_path 'labels/labels_ser.txt' \
    --num_train_epochs 2 \
    --eval_steps 10 \
    --output_dir "output/re/"  \
    --learning_rate 5e-5 \
    --warmup_steps 50 \
    --per_gpu_train_batch_size 8 \
    --per_gpu_eval_batch_size 8 \
    --evaluate_during_training \
    --seed 2048 \
    --resume

```

Metrics such as `precision`, `recall`, and `f1` are printed at the end, and the model and training log are saved in the `./output/re/` folder.

* Evaluation
```shell
export CUDA_VISIBLE_DEVICES=0
python3 eval_re.py \
    --model_name_or_path "output/check/checkpoint-best" \
    --max_seq_length 512 \
    --eval_data_dir "XFUND/zh_val/image" \
    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
    --label_map_path 'labels/labels_ser.txt' \
    --output_dir "output/re_test/"  \
    --per_gpu_eval_batch_size 8 \
    --seed 2048
```
Metrics such as `precision`, `recall`, and `f1` are printed at the end.


* Predict using the OCR results provided with the evaluation set

```shell
export CUDA_VISIBLE_DEVICES=0
python3 infer_re.py \
    --model_name_or_path "./PP-Layout_v1.0_re_pretrained/" \
    --max_seq_length 512 \
    --eval_data_dir "XFUND/zh_val/image" \
    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
    --label_map_path 'labels/labels_ser.txt' \
    --output_dir "output_res"  \
    --per_gpu_eval_batch_size 1 \
    --seed 2048
```

The visualized prediction images and a prediction result text file named `infer_results.txt` are saved in the `output_res` directory.

* Predict with the `OCR engine + SER + RE` pipeline

```shell
export CUDA_VISIBLE_DEVICES=0
python3.7 infer_ser_re_e2e.py \
    --model_name_or_path "./PP-Layout_v1.0_ser_pretrained/" \
    --re_model_name_or_path "./PP-Layout_v1.0_re_pretrained/" \
    --max_seq_length 512 \
    --output_dir "output_ser_re_e2e_train/" \
    --infer_imgs "images/input/zh_val_21.jpg"
```

## References

- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf
- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm
- XFUND dataset, https://github.com/doc-analysis/XFUND