VQA (Visual Question Answering) is the task of answering questions about the content of an image. DOC-VQA is a subtask of VQA in which the questions focus on the textual content of document images.
- Integrates the [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) model and the PP-OCR prediction engine.
- Supports Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multimodal methods. The SER task recognizes and classifies the text in an image; the RE task extracts relations between pieces of text in an image, such as matching question-answer pairs (illustrated in the sketch after this list).
- Supports custom training for both SER and RE tasks.
- Supports end-to-end prediction and evaluation for the OCR+SER system.
- Supports end-to-end prediction for the OCR+SER+RE system.
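To make the two tasks concrete, here is a hypothetical sketch of the kind of result SER and RE produce for a form image. The field names below are illustrative only and are not this project's actual output schema:

```python
# Hypothetical illustration of SER and RE results for one form image.
# SER assigns a category to each detected text region;
# RE links regions into (question, answer) pairs.
ser_result = [
    {"id": 0, "text": "Name:", "bbox": [60, 40, 160, 70], "label": "QUESTION"},
    {"id": 1, "text": "Zhang San", "bbox": [170, 40, 300, 70], "label": "ANSWER"},
    {"id": 2, "text": "Application Form", "bbox": [50, 5, 400, 35], "label": "HEADER"},
]

re_result = [(0, 1)]  # "Name:" (id 0) is answered by "Zhang San" (id 1)
```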
This project is an open-source implementation of [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2, including fine-tuning code on the [XFUND dataset](https://github.com/doc-analysis/XFUND).
Boxes with different colors in the figure represent different categories. For the XFUND dataset, there are three categories: `QUESTION`, `ANSWER`, and `HEADER`:
* Dark purple: `HEADER`
* Light purple: `QUESTION`
* Army green: `ANSWER`
The corresponding category and OCR recognition result are also marked at the top left of each OCR detection box.
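For reference, when these categories are used for token-level SER training, a BIO tagging scheme is common. Below is a minimal sketch assuming the standard BIO convention; the label set and ordering in this project's actual label file may differ:

```python
# A common BIO tagging scheme for the three XFUND categories plus "O"
# (text belonging to no entity). Note: the exact label set and ordering
# used by this project's label file may differ.
SER_LABELS = [
    "O",
    "B-QUESTION", "I-QUESTION",
    "B-ANSWER", "I-ANSWER",
    "B-HEADER", "I-HEADER",
]
label2id = {label: idx for idx, label in enumerate(SER_LABELS)}
id2label = {idx: label for label, idx in label2id.items()}
```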
In the figure, red boxes represent questions, blue boxes represent answers, and each question is connected to its corresponding answer by a green line. The corresponding category and OCR recognition result are also marked at the top left of each OCR detection box.
**Note**: The code hosted on Gitee may not be synchronized with this GitHub project in real time; there can be a delay of 3 to 5 days. Please prefer the recommended method.
If you just want to try the prediction process, you can download the pre-trained models we provide, skip training, and run prediction directly.
The download address of the processed XFUND Chinese dataset: [https://paddleocr.bj.bcebos.com/dataset/XFUND.tar](https://paddleocr.bj.bcebos.com/dataset/XFUND.tar).
Download and extract the dataset, then place it in the current directory.
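For example, assuming a standard Python environment, the download and extraction can be done as follows:

```python
import tarfile
import urllib.request

# Download the processed XFUND Chinese dataset and extract it into
# the current directory.
url = "https://paddleocr.bj.bcebos.com/dataset/XFUND.tar"
urllib.request.urlretrieve(url, "XFUND.tar")
with tarfile.open("XFUND.tar") as tar:
    tar.extractall(path=".")
```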
Finally, the visualized prediction images and the prediction result text file are saved in the directory configured by the `config.Global.save_res_path` field; the prediction result text file is named `infer_results.txt`.
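Once prediction finishes, the outputs can be inspected as in the sketch below. The `save_res_path` value is an assumption, and the exact line format of `infer_results.txt` depends on the task, so the snippet only prints raw lines:

```python
import os

# Assumed value; use whatever config.Global.save_res_path is set to.
save_res_path = "./output/ser/"
res_file = os.path.join(save_res_path, "infer_results.txt")

# Each line of infer_results.txt corresponds to one image's prediction;
# print the first few lines to inspect the raw format.
with open(res_file, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        print(line.rstrip())
```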
* End-to-end evaluation of `OCR engine + SER` prediction system
Finally, the visualized prediction images and the prediction result text file are saved in the directory configured by the `config.Global.save_res_path` field; the prediction result text file is named `infer_results.txt`.
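For reference, the end-to-end SER evaluation mentioned above compares predicted entities against ground-truth entities with precision, recall, and F1. Below is a minimal sketch of that computation; it is not this project's evaluation script, and the `(text, label)` entity representation is an assumption:

```python
def entity_f1(pred_entities, gt_entities):
    """Precision/recall/F1 over sets of hashable entities,
    e.g. (text, label) tuples. This mirrors the usual SER metric,
    not necessarily this project's exact implementation."""
    pred, gt = set(pred_entities), set(gt_entities)
    tp = len(pred & gt)  # true positives: predictions that match ground truth
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example:
# entity_f1([("Name:", "QUESTION"), ("Zhang San", "ANSWER")],
#           [("Name:", "QUESTION")])  -> (0.5, 1.0, 0.666...)
```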
## 6. Reference Links
- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf
The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.