English | [简体中文](README_ch.md)

- [Document Visual Question Answering (Doc-VQA)](#Document-Visual-Question-Answering)
  - [1. Introduction](#1-Introduction)
  - [2. Performance](#2-performance)
  - [3. Demo](#3-demo)
    - [3.1 SER](#31-ser)
    - [3.2 RE](#32-re)
  - [4. Install](#4-Install)
    - [4.1 Install dependencies](#41-Install-dependencies)
    - [4.2 Install PaddleOCR](#42-Install-PaddleOCR)
  - [5. Usage](#5-Usage)
    - [5.1 Data and Model Preparation](#51-Data-and-Model-Preparation)
    - [5.2 SER](#52-ser)
    - [5.3 RE](#53-re)
  - [6. Reference Links](#6-Reference-Links)

# Document Visual Question Answering

## 1. Introduction

VQA (Visual Question Answering) is the task of answering questions about image content. DOC-VQA is a subtask of VQA that focuses on questions about the text content of document images.

The DOC-VQA algorithm in PP-Structure is developed based on the PaddleNLP natural language processing library.

The main features are as follows:

- Integrates the [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) model with the PP-OCR prediction engine.
- Supports Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multimodal methods. The SER task recognizes and classifies the text in an image; the RE task extracts relations between text items, for example pairing each question on a form with its answer.
- Supports custom training for SER tasks and RE tasks.
- Supports end-to-end system prediction and evaluation of OCR+SER.
- Supports end-to-end system prediction of OCR+SER+RE.


This project is an open-source implementation of [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2, and includes fine-tuning code on the [XFUND dataset](https://github.com/doc-analysis/XFUND).

## 2. Performance

We evaluate the algorithm on the Chinese subset of [XFUND](https://github.com/doc-analysis/XFUND); the performance is as follows:

| Model | Task | hmean | Model download address |
|:---:|:---:|:---:|:---:|
| LayoutXLM | SER | 0.9038 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar) |
| LayoutXLM | RE | 0.7483 | [link](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar) |
| LayoutLMv2 | SER | 0.8544 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLMv2_xfun_zh.tar) |
| LayoutLMv2 | RE | 0.6777 | [link](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutLMv2_xfun_zh.tar) |
| LayoutLM | SER | 0.7731 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLM_xfun_zh.tar) |

## 3. Demo

**Note:** The test images are from the XFUND dataset.

### 3.1 SER

![](../../doc/vqa/result_ser/zh_val_0_ser.jpg) | ![](../../doc/vqa/result_ser/zh_val_42_ser.jpg)
---|---

Boxes of different colors in the figure represent different categories. The XFUND dataset has three categories: `QUESTION`, `ANSWER`, and `HEADER`:

* Dark purple: HEADER
* Light purple: QUESTION
* Army green: ANSWER

The corresponding category and the OCR recognition result are also marked at the upper left of each OCR detection box.

### 3.2 RE

![](../../doc/vqa/result_re/zh_val_21_re.jpg) | ![](../../doc/vqa/result_re/zh_val_40_re.jpg)
---|---


In the figure, red boxes represent questions, blue boxes represent answers, and each question is connected to its answer by a green line. The corresponding category and the OCR recognition result are also marked at the upper left of each OCR detection box.

## 4. Install

### 4.1 Install dependencies

- **(1) Install PaddlePaddle**

```bash
python3 -m pip install --upgrade pip

# GPU installation
python3 -m pip install "paddlepaddle-gpu>=2.2" -i https://mirror.baidu.com/pypi/simple

# CPU installation
python3 -m pip install "paddlepaddle>=2.2" -i https://mirror.baidu.com/pypi/simple

```
For more details, please refer to the [installation documentation](https://www.paddlepaddle.org.cn/install/quick).
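
To quickly verify that PaddlePaddle is installed and working, you can run its built-in self-check (a minimal sketch; `paddle.utils.run_check()` is the standard installation test in Paddle 2.x):

```bash
# print the installed version and run Paddle's installation self-check
python3 -c "import paddle; print(paddle.__version__); paddle.utils.run_check()"
```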

### 4.2 Install PaddleOCR

- **(1) Quickly install the PaddleOCR whl package via pip (prediction only)**

```bash
python3 -m pip install paddleocr
```

- **(2) Download VQA source code (prediction + training)**

```bash
# [Recommended]
git clone https://github.com/PaddlePaddle/PaddleOCR

# If the pull fails due to network problems, you can also use the mirror hosted on Gitee:
git clone https://gitee.com/paddlepaddle/PaddleOCR

# Note: The Gitee mirror may not sync with this GitHub project in real time;
# there can be a delay of 3 to 5 days. Please prefer the recommended method.
```

- **(3) Install VQA's `requirements`**

```bash
python3 -m pip install -r ppstructure/vqa/requirements.txt
```

## 5. Usage

### 5.1 Data and Model Preparation

If you just want to try the prediction process, you can download the pre-trained models we provide and skip the training step.

* Download the processed dataset

The processed XFUND Chinese dataset can be downloaded from [https://paddleocr.bj.bcebos.com/dataset/XFUND.tar](https://paddleocr.bj.bcebos.com/dataset/XFUND.tar).


Download and extract the dataset, and place it in the current directory:

```shell
wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar
tar -xf XFUND.tar
```

* Convert the dataset

If you need to train on other XFUND subsets, you can use the following command to convert the dataset:

```bash
python3 ppstructure/vqa/tools/trans_xfun_data.py --ori_gt_path=path/to/json_path --output_path=path/to/save_path
```
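
For example, to convert another language subset (a hypothetical illustration; `fr.val.json` stands for wherever you saved the original XFUND annotation file, and the output path is your choice):

```bash
# hypothetical paths -- adjust to your local layout
python3 ppstructure/vqa/tools/trans_xfun_data.py \
    --ori_gt_path=XFUND/fr.val.json \
    --output_path=XFUND/fr_val.json
```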

* Download the pretrained models
```bash
mkdir pretrain && cd pretrain
# download the SER model
wget https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar && tar -xvf ser_LayoutXLM_xfun_zh.tar
# download the RE model
wget https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar && tar -xvf re_LayoutXLM_xfun_zh.tar
cd ../
```

### 5.2 SER

Before starting training, you need to modify the following four fields in the config file `configs/vqa/ser/layoutxlm.yml` (a sketch of this part of the config follows the list):

1. `Train.dataset.data_dir`: the directory where the training images are stored
2. `Train.dataset.label_file_list`: the path to the training label file
3. `Eval.dataset.data_dir`: the directory where the validation images are stored
4. `Eval.dataset.label_file_list`: the path to the validation label file
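
A minimal sketch of the relevant part of the config, assuming the XFUND data layout from Section 5.1 (the directory and file names below are assumptions; all other config keys are elided):

```yaml
# sketch of the four fields in configs/vqa/ser/layoutxlm.yml
Train:
  dataset:
    data_dir: XFUND/zh_train/image                  # training images (assumed layout)
    label_file_list:
      - XFUND/zh_train/xfun_normalize_train.json    # training labels (assumed name)
Eval:
  dataset:
    data_dir: XFUND/zh_val/image                    # validation images (assumed layout)
    label_file_list:
      - XFUND/zh_val/xfun_normalize_val.json        # validation labels (also used in the end-to-end evaluation below)
```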

* Start training
```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml
```
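
If multiple GPUs are available, training can also be launched with Paddle's distributed launcher (a sketch, assuming the standard `paddle.distributed.launch` module of Paddle 2.x):

```shell
# multi-GPU training; adjust the GPU ids to your machine
python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/vqa/ser/layoutxlm.yml
```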

Finally, metrics such as `precision`, `recall`, and `hmean` will be printed.
The training log, the best model, and the model of the latest epoch will be saved in the `./output/ser_layoutxlm/` folder.

* Resume training

To resume training, assign the folder path of the previously trained model to the `Architecture.Backbone.checkpoints` field.

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```

* Evaluate

Evaluation requires assigning the folder path of the model to be evaluated to the `Architecture.Backbone.checkpoints` field.

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```
Finally, metrics such as `precision`, `recall`, and `hmean` will be printed.

* `OCR engine + SER` cascade prediction

Use the following command to run the `OCR engine + SER` cascade prediction, taking the pretrained SER model as an example:

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/infer_vqa_token_ser.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ Global.infer_img=doc/vqa/input/zh_val_42.jpg
```

Finally, the visualized prediction image and the prediction text file will be saved in the directory configured by the `Global.save_res_path` field. The prediction text file is named `infer_results.txt`.

* End-to-end evaluation of the `OCR engine + SER` prediction system

First run the `tools/infer_vqa_token_ser.py` script to generate predictions for the dataset, then use the following command to evaluate:

```shell
export CUDA_VISIBLE_DEVICES=0
python3 tools/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt
```

### 5.3 RE

* Start training

Before starting training, you need to modify the same four fields in the config file `configs/vqa/re/layoutxlm.yml` (see the config sketch in Section 5.2):

1. `Train.dataset.data_dir`: the directory where the training images are stored
2. `Train.dataset.label_file_list`: the path to the training label file
3. `Eval.dataset.data_dir`: the directory where the validation images are stored
4. `Eval.dataset.label_file_list`: the path to the validation label file

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml
```

Finally, metrics such as `precision`, `recall`, and `hmean` will be printed.
The training log, the best model, and the model of the latest epoch will be saved in the `./output/re_layoutxlm/` folder.

* Resume training

To resume training, assign the folder path of the previously trained model to the `Architecture.Backbone.checkpoints` field.

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```

* Evaluate

Evaluation requires assigning the folder path of the model to be evaluated to the `Architecture.Backbone.checkpoints` field.

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```
Finally, metrics such as `precision`, `recall`, and `hmean` will be printed.

* `OCR engine + SER + RE` cascade prediction

Use the following command to run the `OCR engine + SER + RE` cascade prediction, taking the pretrained SER and RE models as an example (`-c`/`-o` configure the RE model, while `-c_ser`/`-o_ser` configure the SER model):
```shell
export CUDA_VISIBLE_DEVICES=0
python3 tools/infer_vqa_token_ser_re.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/re_LayoutXLM_xfun_zh/ Global.infer_img=doc/vqa/input/zh_val_21.jpg -c_ser configs/vqa/ser/layoutxlm.yml -o_ser Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/
```

Finally, the visualized prediction image and the prediction text file will be saved in the directory configured by the `Global.save_res_path` field. The prediction text file is named `infer_results.txt`.

## 6. Reference Links

- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf
- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm
- XFUND dataset, https://github.com/doc-analysis/XFUND

## License

The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.