English | [简体中文](README_ch.md)

- [Document Visual Question Answering (Doc-VQA)](#document-visual-question-answering)
  - [1. Introduction](#1-introduction)
  - [2. Performance](#2-performance)
  - [3. Demo](#3-demo)
    - [3.1 SER](#31-ser)
    - [3.2 RE](#32-re)
  - [4. Install](#4-install)
    - [4.1 Install dependencies](#41-install-dependencies)
    - [4.2 Install PaddleOCR](#42-install-paddleocr)
  - [5. Usage](#5-usage)
    - [5.1 Data and Model Preparation](#51-data-and-model-preparation)
    - [5.2 SER](#52-ser)
    - [5.3 RE](#53-re)
  - [6. Reference Links](#6-reference-links)

# Document Visual Question Answering

## 1. Introduction

VQA (Visual Question Answering) is the task of answering questions about the content of an image. Doc-VQA is a VQA subtask in which the questions concern the textual content of document images.

The Doc-VQA algorithm in PP-Structure is built on top of the PaddleNLP natural language processing library.

The main features are as follows:

- Integrates the [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) model with the PP-OCR prediction engine.
- Supports Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multimodal methods. SER recognizes and classifies the text in an image; RE extracts relations between text regions, such as matching a question with its answer.
- Supports custom training for both SER and RE tasks.
- Supports end-to-end prediction and evaluation of the OCR+SER system.
- Supports end-to-end prediction of the OCR+SER+RE system.


This project is an open-source implementation of [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2, and includes fine-tuning code on the [XFUND dataset](https://github.com/doc-analysis/XFUND).

## 2. Performance

We evaluated the algorithm on the Chinese subset of [XFUND](https://github.com/doc-analysis/XFUND), with the following performance:

| Model | Task | hmean | Download link |
|:---:|:---:|:---:|:---:|
| LayoutXLM | SER | 0.9038 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar) |
| LayoutXLM | RE | 0.7483 | [link](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar) |
| LayoutLMv2 | SER | 0.8544 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLMv2_xfun_zh.tar) |
| LayoutLMv2 | RE | 0.6777 | [link](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutLMv2_xfun_zh.tar) |
| LayoutLM | SER | 0.7731 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLM_xfun_zh.tar) |

## 3. Demo

**Note:** The test images are from the XFUND dataset.

<a name="31"></a>
### 3.1 SER

![](../docs/vqa/result_ser/zh_val_0_ser.jpg) | ![](../docs/vqa/result_ser/zh_val_42_ser.jpg)
---|---

Boxes of different colors in the figure represent different categories. The XFUND dataset has three categories: `QUESTION`, `ANSWER`, and `HEADER`:

* Dark purple: HEADER
* Light purple: QUESTION
* Army green: ANSWER

The predicted category and the OCR result are also shown at the top left of each OCR detection box.

<a name="32"></a>
### 3.2 RE

![](../docs/vqa/result_re/zh_val_21_re.jpg) | ![](../docs/vqa/result_re/zh_val_40_re.jpg)
---|---


In the figure, red boxes represent questions, blue boxes represent answers, and each question is connected to its answer by a green line. The predicted category and the OCR result are also shown at the top left of each OCR detection box.

## 4. Install

### 4.1 Install dependencies

- **(1) Install PaddlePaddle**

```bash
python3 -m pip install --upgrade pip

# GPU installation
python3 -m pip install "paddlepaddle-gpu>=2.2" -i https://mirror.baidu.com/pypi/simple

# CPU installation
python3 -m pip install "paddlepaddle>=2.2" -i https://mirror.baidu.com/pypi/simple

```
For more installation options, refer to the [installation documentation](https://www.paddlepaddle.org.cn/install/quick).
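
You can verify the installation with PaddlePaddle's built-in sanity check:

```bash
# Prints "PaddlePaddle is installed successfully!" if everything works
python3 -c "import paddle; paddle.utils.run_check()"
```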

### 4.2 Install PaddleOCR

- **(1) Quickly install the PaddleOCR whl package via pip (prediction only)**

```bash
python3 -m pip install paddleocr
```

- **(2) Download VQA source code (prediction + training)**

```bash
# [Recommended]
git clone https://github.com/PaddlePaddle/PaddleOCR

# If cloning from GitHub fails due to network problems, you can also use the mirror hosted on Gitee:
git clone https://gitee.com/paddlepaddle/PaddleOCR

# Note: The Gitee mirror may lag behind this GitHub repository by 3 to 5 days; prefer the recommended method above.
```

- **(3) Install VQA's `requirements`**

```bash
python3 -m pip install -r ppstructure/vqa/requirements.txt
```

## 5. Usage

### 5.1 Data and Model Preparation

If you just want to try prediction, you can download the pretrained models we provide and skip the training step.

* Download the processed dataset

The processed XFUND Chinese dataset can be downloaded from [https://paddleocr.bj.bcebos.com/dataset/XFUND.tar](https://paddleocr.bj.bcebos.com/dataset/XFUND.tar).


Download and extract the dataset, placing it in the current directory:

```shell
wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar
tar -xf XFUND.tar
```

* Convert the dataset

If you want to train on other XFUND subsets, convert them first with the following command:

```bash
python3 ppstructure/vqa/tools/trans_xfun_data.py --ori_gt_path=path/to/json_path --output_path=path/to/save_path
```
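
For example, a hypothetical invocation for the German subset might look like this (the `de.train.json` file name follows the XFUND release convention, and the output path is an arbitrary example):

```bash
python3 ppstructure/vqa/tools/trans_xfun_data.py --ori_gt_path=XFUND/de.train.json --output_path=XFUND/de_train/xfun_normalize_train.json
```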

* Download the pretrained models
```bash
mkdir pretrain && cd pretrain
# download the SER model
wget https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar && tar -xvf ser_LayoutXLM_xfun_zh.tar
# download the RE model
wget https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar && tar -xvf re_LayoutXLM_xfun_zh.tar
cd ../
```
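
After extraction you should end up with a layout like the sketch below (directory names are assumed to match the tar file names; verify against your download). These are the paths passed to `Architecture.Backbone.checkpoints` in the prediction commands later in this document.

```
pretrain/
├── ser_LayoutXLM_xfun_zh/   # pretrained SER model
└── re_LayoutXLM_xfun_zh/    # pretrained RE model
```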

<a name="52"></a>
### 5.2 SER

Before starting training, you need to modify the following four fields in `configs/vqa/ser/layoutxlm.yml`:

1. `Train.dataset.data_dir`: points to the directory where the training images are stored
2. `Train.dataset.label_file_list`: points to the training label file
3. `Eval.dataset.data_dir`: points to the directory where the validation images are stored
4. `Eval.dataset.label_file_list`: points to the validation label file
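
For example, with the XFUND Chinese dataset unzipped into the current directory, the fields could look like the minimal sketch below. The `zh_train` paths are assumptions inferred by symmetry with the `zh_val` files used in the end-to-end evaluation; verify them against your unzipped data.

```yaml
Train:
  dataset:
    data_dir: XFUND/zh_train/image                 # training images (assumed layout)
    label_file_list:
      - XFUND/zh_train/xfun_normalize_train.json   # training labels (assumed name)
Eval:
  dataset:
    data_dir: XFUND/zh_val/image                   # validation images (assumed layout)
    label_file_list:
      - XFUND/zh_val/xfun_normalize_val.json       # validation labels
```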

* Start training
```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml
```

Finally, `precision`, `recall`, `hmean`, and other metrics will be printed. The training log, the best model, and the model of the latest epoch are saved in the `./output/ser_layoutxlm/` folder.
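
As a rough sketch, that folder will contain files along these lines (exact names may differ across PaddleOCR versions):

```
output/ser_layoutxlm/
├── config.yml         # snapshot of the training configuration
├── train.log          # training log with the printed metrics
├── best_accuracy.*    # checkpoint with the best hmean on the validation set
└── latest.*           # checkpoint saved after the most recent epoch
```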

* Resume training

To resume training, assign the folder path of the previously trained model to the `Architecture.Backbone.checkpoints` field.

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```

* Evaluate

Evaluation requires assigning the folder path of the model to be evaluated to the `Architecture.Backbone.checkpoints` field.

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```

Finally, `precision`, `recall`, `hmean`, and other metrics will be printed.

* Run the `OCR engine + SER` cascade prediction

Use the following command to run the `OCR engine + SER` cascade prediction, taking the pretrained SER model as an example:

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/infer_vqa_token_ser.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ Global.infer_img=doc/vqa/input/zh_val_42.jpg
```

Finally, the visualized prediction image and a text file of the prediction results will be saved in the directory configured by the `Global.save_res_path` field of the config; the text file is named `infer_results.txt`.
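
Like the checkpoint path, `Global.save_res_path` can be overridden on the command line with `-o`; a sketch with an arbitrary output directory:

```bash
CUDA_VISIBLE_DEVICES=0 python3 tools/infer_vqa_token_ser.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ Global.save_res_path=./output/ser_res/ Global.infer_img=doc/vqa/input/zh_val_42.jpg
```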

* End-to-end evaluation of `OCR engine + SER` prediction system

First use the `tools/infer_vqa_token_ser.py` script to run prediction on the dataset, then evaluate with the following command:

```shell
export CUDA_VISIBLE_DEVICES=0
python3 ppstructure/vqa/tools/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt
```

<a name="53"></a>
### 5.3 RE

* Start training

Before starting training, you need to modify the same four fields as for SER, this time in `configs/vqa/re/layoutxlm.yml` (see the example in section 5.2 SER):

1. `Train.dataset.data_dir`: points to the directory where the training images are stored
2. `Train.dataset.label_file_list`: points to the training label file
3. `Eval.dataset.data_dir`: points to the directory where the validation images are stored
4. `Eval.dataset.label_file_list`: points to the validation label file

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml
```

Finally, `precision`, `recall`, `hmean`, and other metrics will be printed. The training log, the best model, and the model of the latest epoch are saved in the `./output/re_layoutxlm/` folder.

* Resume training

To resume training, assign the folder path of the previously trained model to the `Architecture.Backbone.checkpoints` field.

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```

* Evaluate

Evaluation requires assigning the folder path of the model to be evaluated to the `Architecture.Backbone.checkpoints` field.

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```

Finally, `precision`, `recall`, `hmean`, and other metrics will be printed.

* Run the `OCR engine + SER + RE` cascade prediction

Use the following command to run the `OCR engine + SER + RE` cascade prediction, taking the pretrained SER and RE models as an example:
```shell
export CUDA_VISIBLE_DEVICES=0
python3 tools/infer_vqa_token_ser_re.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/re_LayoutXLM_xfun_zh/ Global.infer_img=doc/vqa/input/zh_val_21.jpg -c_ser configs/vqa/ser/layoutxlm.yml -o_ser Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/
```

Finally, the visualized prediction image and a text file of the prediction results will be saved in the directory configured by the `Global.save_res_path` field of the config; the text file is named `infer_results.txt`.

## 6. Reference Links

- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf
- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm
- XFUND dataset, https://github.com/doc-analysis/XFUND

## License

The content of this project is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.