Commit 4b214948 authored by zhiminzhang0830's avatar zhiminzhang0830

Merge branch 'dygraph' of https://github.com/PaddlePaddle/PaddleOCR into new_branch

parents 917606b4 6e607a0f
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Text Recognition Algorithm Theory\n",
"\n",
"This chapter mainly introduces the theoretical knowledge of text recognition algorithms, including background introduction, algorithm classification and some classic paper ideas.\n",
"\n",
"Through the study of this chapter, you can master:\n",
"\n",
"1. The goal of text recognition\n",
"\n",
"2. Classification of text recognition algorithms\n",
"\n",
"3. Typical ideas of various algorithms\n",
"\n",
"\n",
"## 1 Background Introduction\n",
"\n",
"Text recognition is a subtask of OCR (Optical Character Recognition), and its task is to recognize the text content of a fixed area. In the two-stage method of OCR, it is followed by text detection and converts image information into text information.\n",
"\n",
"Specifically, the model inputs a positioned text line, and the model predicts the text content and confidence level in the picture. The visualization results are shown in the following figure:\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/a7c3404f778b489db9c1f686c7d2ff4d63b67c429b454f98b91ade7b89f8e903 width=\"600\"></center>\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/e72b1d6f80c342ac951d092bc8c325149cebb3763ec849ec8a2f54e7c8ad60ca width=\"600\"></center>\n",
"<br><center>Figure 1: Visualization results of model predicttion</center>\n",
"\n",
"There are many application scenarios for text recognition, including document recognition, road sign recognition, license plate recognition, industrial number recognition, etc. According to actual scenarios, text recognition tasks can be divided into two categories: **Regular text recognition** and **Irregular Text recognition**.\n",
"\n",
"* Regular text recognition: mainly refers to printed fonts, scanned text, etc., and the text is considered to be roughly in the horizontal position\n",
"\n",
"* Irregular text recognition: It often appears in natural scenes, and due to the huge differences in text curvature, direction, deformation, etc., the text is often not in the horizontal position, and there are problems such as bending, occlusion, and blurring.\n",
"\n",
"\n",
"The figure below shows the data patterns of IC15 and IC13, which represent irregular text and regular text respectively. It can be seen that irregular text often has problems such as distortion, blurring, and large font differences. It is closer to the real scene and is also more challenging.\n",
"\n",
"Therefore, the current major algorithms are trying to obtain higher indicators on irregular data sets.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/bae4fce1370b4751a3779542323d0765a02a44eace7b44d2a87a241c13c6f8cf width=\"400\">\n",
"<br><center>Figure 2: IC15 picture sample (irregular text)</center>\n",
"<img src=https://ai-studio-static-online.cdn.bcebos.com/b55800d3276f4f5fad170ea1b567eb770177fce226f945fba5d3247a48c15c34 width=\"400\"></center>\n",
"<br><center>Figure 3: IC13 picture sample (rule text)</center>\n",
"\n",
"\n",
"When comparing the capabilities of different recognition algorithms, they are often compared on these two types of public data sets. Comparing the effects on multiple dimensions, currently the more common English benchmark data sets are classified as follows:\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/4d0aada261064031a16816b39a37f2ff6af70dbb57004cb7a106ae6485f14684 width=\"600\"></center>\n",
"<br><center>Figure 4: Common English benchmark data sets</center>\n",
"\n",
"## 2 Text Recognition Algorithm Classification\n",
"\n",
"In the traditional text recognition method, the task is divided into 3 steps, namely image preprocessing, character segmentation and character recognition. It is necessary to model a specific scene, and it will become invalid once the scene changes. In the face of complex text backgrounds and scene changes, methods based on deep learning have better performance.\n",
"\n",
"Most existing recognition algorithms can be represented by the following unified framework, and the algorithm flow is divided into 4 stages:\n",
"\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/a2750f4170864f69a3af36fc13db7b606d851f2f467d43cea6fbf3521e65450f)\n",
"\n",
"\n",
"We have sorted out the mainstream algorithm categories and main papers, refer to the following table:\n",
"<center>\n",
" \n",
"| Algorithm category | Main ideas | Main papers |\n",
"| -------- | --------------- | -------- |\n",
"| Traditional algorithm | Sliding window, character extraction, dynamic programming |-|\n",
"| ctc | Based on ctc method, sequence is not aligned, faster recognition | CRNN, Rosetta |\n",
"| Attention | Attention-based method, applied to unconventional text | RARE, DAN, PREN |\n",
"| Transformer | Transformer-based method | SRN, NRTR, Master, ABINet |\n",
"| Correction | The correction module learns the text boundary and corrects it to the horizontal direction | RARE, ASTER, SAR |\n",
"| Segmentation | Based on the method of segmentation, extract the character position and then do classification | Text Scanner, Mask TextSpotter |\n",
" \n",
"</center>\n",
"\n",
"\n",
"### 2.1 Regular Text Recognition\n",
"\n",
"\n",
"There are two mainstream algorithms for text recognition, namely the CTC (Conectionist Temporal Classification)-based algorithm and the Sequence2Sequence algorithm. The difference is mainly in the decoding stage.\n",
"\n",
"The CTC-based algorithm connects the encoded sequence to the CTC for decoding; the Sequence2Sequence-based method connects the sequence to the Recurrent Neural Network (RNN) module for cyclic decoding. Both methods have been verified to be effective and mainstream. Two major practices.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/f64eee66e4a6426f934c1befc3b138629324cf7360c74f72bd6cf3c0de9d49bd width=\"600\"></center>\n",
"<br><center>Figure 5: Left: CTC-based method, right: Sequece2Sequence-based method </center>\n",
"\n",
"\n",
"#### 2.1.1 Algorithm Based on CTC\n",
"\n",
"The most typical algorithm based on CTC is CRNN (Convolutional Recurrent Neural Network) [1], and its feature extraction part uses mainstream convolutional structures, commonly used ResNet, MobileNet, VGG, etc. Due to the particularity of text recognition tasks, there is a large amount of contextual information in the input data. The convolution kernel characteristics of convolutional neural networks make it more focused on local information and lack long-dependent modeling capabilities, so it is difficult to use only convolutional networks. Dig into the contextual connections between texts. In order to solve this problem, the CRNN text recognition algorithm introduces the bidirectional LSTM (Long Short-Term Memory) to enhance the context modeling. Experiments prove that the bidirectional LSTM module can effectively extract the context information in the picture. Finally, the output feature sequence is input to the CTC module, and the sequence result is directly decoded. This structure has been verified to be effective and widely used in text recognition tasks. Rosetta [2] is a recognition network proposed by FaceBook, which consists of a fully convolutional model and CTC. Gao Y [3] et al. used CNN convolution instead of LSTM, with fewer parameters, and the performance improvement accuracy was the same.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/d3c96dd9e9794fddb12fa16f926abdd3485194f0a2b749e792e436037490899b width=\"600\"></center>\n",
"<center>Figure 6: CRNN structure diagram </center>\n",
"\n",
"\n",
"#### 2.1.2 Sequence2Sequence algorithm\n",
"\n",
"In the Sequence2Sequence algorithm, the Encoder encodes all input sequences into a unified semantic vector, which is then decoded by the Decoder. In the decoding process of the decoder, the output of the previous moment is continuously used as the input of the next moment, and the decoding is performed in a loop until the stop character is output. The general encoder is an RNN. For each input word, the encoder outputs a vector and hidden state, and uses the hidden state for the next input word to get the semantic vector in a loop; the decoder is another RNN, which receives the encoder Output a vector and output a series of words to create a transformation. Inspired by Sequence2Sequence in the field of translation, Shi [4] proposed an attention-based codec framework to recognize text. In this way, rnn can learn character-level language models hidden in strings from training data.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/f575333696b7438d919975dc218e61ccda1305b638c5497f92b46a7ec3b85243 width=\"400\" hight=\"500\"></center>\n",
"<center>Figure 7: Sequence2Sequence structure diagram </center>\n",
"\n",
"The above two algorithms have very good effects on regular text, but due to the limitations of network design, this type of method is difficult to solve the task of irregular text recognition of bending and rotation. In order to solve such problems, some algorithm researchers have proposed a series of improved algorithms on the basis of the above two types of algorithms.\n",
"\n",
"### 2.2 Irregular Text Recognition\n",
"\n",
"* Irregular text recognition algorithms can be divided into 4 categories: correction-based methods; Attention-based methods; segmentation-based methods; and Transformer-based methods.\n",
"\n",
"#### 2.2.1 Correction-based Method\n",
"\n",
"The correction-based method uses some visual transformation modules to convert irregular text into regular text as much as possible, and then uses conventional methods for recognition.\n",
"\n",
"The RARE [4] model first proposed a correction scheme for irregular text. The entire network is divided into two main parts: a spatial transformation network STN (Spatial Transformer Network) and a recognition network based on Sequence2Squence. Among them, STN is the correction module. Irregular text images enter STN and are transformed into a horizontal image through TPS (Thin-Plate-Spline). This transformation can correct curved and transmissive text to a certain extent, and send it to sequence recognition after correction. Network for decoding.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/66406f89507245e8a57969b9bed26bfe0227a8cf17a84873902dd4a464b97bb5 width=\"600\"></center>\n",
"<center>Figure 8: RARE structure diagram </center>\n",
"\n",
"The RARE paper pointed out that this method has greater advantages in irregular text data sets, especially comparing the two data sets CUTE80 and SVTP, which are more than 5 percentage points higher than CRNN, which proves the effectiveness of the correction module. Based on this [6] also combines a text recognition system with a spatial transformation network (STN) and an attention-based sequence recognition network.\n",
"\n",
"Correction-based methods have better migration. In addition to Attention-based methods such as RARE, STAR-Net [5] applies correction modules to CTC-based algorithms, which is also a good improvement compared to traditional CRNN.\n",
"\n",
"#### 2.2.2 Attention-based Method\n",
"\n",
"The Attention-based method mainly focuses on the correlation between the parts of the sequence. This method was first proposed in the field of machine translation. It is believed that the result of the current word in the process of text translation is mainly affected by certain words, so it needs to be The decisive word has greater weight. The same is true in the field of text recognition. When decoding the encoded sequence, each step selects the appropriate context to generate the next state, which is conducive to obtaining more accurate results.\n",
"\n",
"R^2AM [7] first introduced Attention into the field of text recognition. The model first extracts the encoded image features from the input image through a recursive convolutional layer, and then uses the implicitly learned character-level language statistics to decode the output through a recurrent neural network character. In the decoding process, the Attention mechanism is introduced to realize soft feature selection to make better use of image features. This selective processing method is more in line with human intuition.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/a64ef10d4082422c8ac81dcda4ab75bf1db285d6b5fd462a8f309240445654d5 width=\"600\"></center>\n",
"<center>Figure 9: R^2AM structure drawing </center>\n",
"\n",
"A large number of algorithms will be explored and updated in the field of Attention in the future. For example, SAR[8] extends 1D attention to 2D attention. The RARE mentioned in the correction module is also a method based on Attention. Experiments prove that the Attention-based method has a good accuracy improvement compared with the CTC method.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/4e2507fb58d94ec7a9b4d17151a986c84c5053114e05440cb1e7df423d32cb02 width=\"600\"></center>\n",
"<center>Figure 10: Attention diagram</center>\n",
"\n",
"\n",
"#### 2.2.3 Method Based on Segmentation\n",
"\n",
"The method based on segmentation is to treat each character of the text line as an independent individual, and it is easier to recognize the segmented individual characters than to recognize the entire text line after correction. It attempts to locate the position of each character in the input text image, and applies a character classifier to obtain these recognition results, simplifying the complex global problem into a local problem solving, and it has a relatively good effect in the irregular text scene. However, this method requires character-level labeling, and there is a certain degree of difficulty in data acquisition. Lyu [9] et al. proposed an instance word segmentation model for word recognition, which uses a method based on FCN (Fully Convolutional Network) in its recognition part. [10] Considering the problem of text recognition from a two-dimensional perspective, a character attention FCN is designed to solve the problem of text recognition. When the text is bent or severely distorted, this method has better positioning results for both regular and irregular text.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/fd3e8ef0d6ce4249b01c072de31297ca5d02fc84649846388f890163b624ff10 width=\"800\"></center>\n",
"<center>Figure 11: Mask TextSpotter structure diagram </center>\n",
"\n",
"\n",
"\n",
"#### 2.2.4 Transformer-based Method\n",
"\n",
"With the rapid development of Transformer, both classification and detection fields have verified the effectiveness of Transformer in visual tasks. As mentioned in the regular text recognition part, CNN has limitations in long-dependency modeling. The Transformer structure just solves this problem. It can focus on global information in the feature extractor and can replace additional context modeling modules (LSTM ).\n",
"\n",
"Part of the text recognition algorithm uses Transformer's Encoder structure and convolution to extract sequence features. The Encoder is composed of multiple blocks stacked by MultiHeadAttentionLayer and Positionwise Feedforward Layer. The self-attention in MulitHeadAttention uses matrix multiplication to simulate the timing calculation of RNN, breaking the barrier of long-term dependence on timing in RNN. There are also some algorithms that use Transformer's Decoder module to decode, which can obtain stronger semantic information than traditional RNNs, and parallel computing has higher efficiency.\n",
"\n",
"The SRN[11] algorithm connects the Encoder module of Transformer to ResNet50 to enhance the 2D visual features. A parallel attention module is proposed, which uses the reading order as a query, making the calculation independent of time, and finally outputs the aligned visual features of all time steps in parallel. In addition, SRN also uses Transformer's Eecoder as a semantic module to integrate the visual information and semantic information of the picture, which has greater benefits in irregular text such as occlusion and blur.\n",
"\n",
"NRTR [12] uses a complete Transformer structure to encode and decode the input picture, and only uses a few simple convolutional layers for high-level feature extraction, and verifies the effectiveness of the Transformer structure in text recognition.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/e7859f4469a842f0bd450e7e793a679d6e828007544241d09785c9b4ea2424a2 width=\"800\"></center>\n",
"<center>Figure 12: NRTR structure drawing </center>\n",
"\n",
"SRACN [13] uses Transformer's decoder to replace LSTM, once again verifying the efficiency and accuracy advantages of parallel training.\n",
"\n",
"## 3 Summary\n",
"\n",
"This section mainly introduces the theoretical knowledge and mainstream algorithms related to text recognition, including CTC-based methods, Sequence2Sequence-based methods, and segmentation-based methods. The ideas and contributions of classic papers are listed respectively. The next section will explain the practical course based on the CRNN algorithm, from networking to optimization to complete the entire training process,\n",
"\n",
"## 4 Reference\n",
"\n",
"\n",
"[1]Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence, 39(11), 2298-2304.\n",
"\n",
"[2]Fedor Borisyuk, Albert Gordo, and Viswanath Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 71–79. ACM, 2018.\n",
"\n",
"[3]Gao, Y., Chen, Y., Wang, J., & Lu, H. (2017). Reading scene text with attention convolutional sequence modeling. arXiv preprint arXiv:1709.04303.\n",
"\n",
"[4]Shi, B., Wang, X., Lyu, P., Yao, C., & Bai, X. (2016). Robust scene text recognition with automatic rectification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4168-4176).\n",
"\n",
"[5] Star-Net Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spa- tial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.\n",
"\n",
"[6]Baoguang Shi, Mingkun Yang, XingGang Wang, Pengyuan Lyu, Xiang Bai, and Cong Yao. Aster: An attentional scene text recognizer with flexible rectification. IEEE transactions on pattern analysis and machine intelligence, 31(11):855–868, 2018.\n",
"\n",
"[7] Lee C Y , Osindero S . Recursive Recurrent Nets with Attention Modeling for OCR in the Wild[C]// IEEE Conference on Computer Vision & Pattern Recognition. IEEE, 2016.\n",
"\n",
"[8]Li, H., Wang, P., Shen, C., & Zhang, G. (2019, July). Show, attend and read: A simple and strong baseline for irregular text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 8610-8617).\n",
"\n",
"[9]P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai. Multi-oriented scene text detection via corner localization and region segmentation. In Proc. CVPR, pages 7553–7563, 2018.\n",
"\n",
"[10] Liao, M., Zhang, J., Wan, Z., Xie, F., Liang, J., Lyu, P., ... & Bai, X. (2019, July). Scene text recognition from two-dimensional perspective. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 8714-8721).\n",
"\n",
"[11] Yu, D., Li, X., Zhang, C., Liu, T., Han, J., Liu, J., & Ding, E. (2020). Towards accurate scene text recognition with semantic reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12113-12122).\n",
"\n",
"[12] Sheng, F., Chen, Z., & Xu, B. (2019, September). NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) (pp. 781-786). IEEE.\n",
"\n",
"[13]Yang, L., Wang, P., Li, H., Li, Z., & Zhang, Y. (2020). A holistic representation guided attention network for scene text recognition. Neurocomputing, 414, 67-75.\n",
"\n",
"[14]Wang, T., Zhu, Y., Jin, L., Luo, C., Chen, X., Wu, Y., ... & Cai, M. (2020, April). Decoupled attention network for text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 07, pp. 12216-12224).\n",
"\n",
"[15] Wang, Y., Xie, H., Fang, S., Wang, J., Zhu, S., & Zhang, Y. (2021). From two to one: A new scene text recognizer with visual language modeling network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 14194-14203).\n",
"\n",
"[16] Fang, S., Xie, H., Wang, Y., Mao, Z., & Zhang, Y. (2021). Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7098-7107).\n",
"\n",
"[17] Yan, R., Peng, L., Xiao, S., & Yao, G. (2021). Primitive Representation Learning for Scene Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 284-293)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Document Analysis Technology\n",
"\n",
"This chapter mainly introduces the theoretical knowledge of document analysis technology, including background introduction, algorithm classification and corresponding ideas.\n",
"\n",
"Through the study of this chapter, you can master:\n",
"\n",
"1. Classification and typical ideas of layout analysis\n",
"2. Classification and typical ideas of table recognition\n",
"3. Classification and typical ideas of information extraction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Layout analysis is mainly used for document retrieval, key information extraction, content classification, etc. Its task is mainly to classify the content of document images. Content categories can generally be divided into plain text, titles, tables, pictures, and lists. However, the diversity and complexity of document layout, formats, poor document image quality, and the lack of large-scale annotated datasets make layout analysis still a challenging task. Document analysis often includes the following research directions:\n",
"\n",
"1. Layout analysis module: Divide each document page into different content areas. This module can be used not only to delimit relevant and irrelevant areas, but also to classify the types of content it recognizes.\n",
"2. Optical Character Recognition (OCR) module: Locate and recognize all text present in the document.\n",
"3. Form recognition module: Recognize and convert the form information in the document into an excel file.\n",
"4. Information extraction module: Use OCR results and image information to understand and identify the specific information expressed in the document or the relationship between the information.\n",
"\n",
"Since the OCR module has been introduced in detail in the previous chapters, the following three modules will be introduced separately for the above layout analysis, table recognition and information extraction. For each module, the classic or common methods and data sets of the module will be introduced."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Layout Analysis\n",
"\n",
"### 1.1 Background Introduction\n",
"\n",
"Layout analysis is mainly used for document retrieval, key information extraction, content classification, etc. Its task is mainly to classify document images. Content categories can generally be divided into plain text, titles, tables, pictures, and lists. However, the diversity and complexity of document layouts, formats, poor document image quality, and the lack of large-scale annotated data sets make layout analysis still a challenging task.\n",
"The visualization of the layout analysis task is shown in the figure below:\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/2510dc76c66c49b8af079f25d08a9dcba726b2ce53d14c8ba5cd9bd57acecf19\" width=\"1000\"/></center>\n",
"<center>Figure 1: Layout analysis effect diagram</center>\n",
"\n",
"The existing solutions are generally based on target detection or semantic segmentation methods, which are based on detecting or segmenting different patterns in the document as different targets.\n",
"\n",
"Some representative papers are divided into the above two categories, as shown in the following table:\n",
"\n",
"| Category | Main paper |\n",
"| ---------------- | -------- |\n",
"| Method based on target detection | [Visual Detection with Context](https://aclanthology.org/D19-1348.pdf),[Object Detection](https://arxiv.org/pdf/2003.13197v1.pdf),[VSR](https://arxiv.org/pdf/2105.06220v1.pdf)\n",
"| Semantic segmentation method |[Semantic Segmentation](https://arxiv.org/pdf/1911.12170v2.pdf) |\n",
"\n",
"\n",
"### 1.2 Method Based on Target Detection \n",
"\n",
"Soto Carlos [1] is based on the target detection algorithm Faster R-CNN, combines context information and uses the inherent location information of the document content to improve the area detection performance. Li Kai [2] et al. also proposed a document analysis method based on object detection, which solved the cross-domain problem by introducing a feature pyramid alignment module, a region alignment module, and a rendering layer alignment module. These three modules complement each other. And adjust the domain from a general image perspective and a specific document image perspective, thus solving the problem of large labeled training datasets being different from the target domain. The following figure is a flow chart of layout analysis based on the target detection Faster R-CNN algorithm. \n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/d396e0d6183243898c0961250ee7a49bc536677079fb4ba2ac87c653f5472f01\" width=\"800\"/></center>\n",
"<center>Figure 2: Flow chart of layout analysis based on Faster R-CNN</center>\n",
"\n",
"### 1.3 Methods Based on Semantic Segmentation \n",
"\n",
"Sarkar Mausoom [3] et al. proposed a priori-based segmentation mechanism to train a document segmentation model on very high-resolution images, which solves the problem of indistinguishable and merging of dense regions caused by excessively shrinking the original image. Zhang Peng [4] et al. proposed a unified framework VSR (Vision, Semantics and Relations) for document layout analysis in combination with the vision, semantics and relations in the document. The framework uses a two-stream network to extract the visual and Semantic features, and adaptively fusion of these features through the adaptive aggregation module, solves the limitations of the existing CV-based methods of low efficiency of different modal fusion and lack of relationship modeling between layout components.\n",
"\n",
"### 1.4 Data Set\n",
"\n",
"Although the existing methods can solve the layout analysis task to a certain extent, these methods rely on a large amount of labeled training data. Recently, many data sets have been proposed for document analysis tasks.\n",
"\n",
"1. PubLayNet[5]: The data set contains 500,000 document images, of which 400,000 are used for training, 50,000 are used for verification, and 50,000 are used for testing. Five forms of table, text, image, title and list are marked.\n",
"2. HJDataset[6]: The data set contains 2271 document images. In addition to the bounding box and mask of the content area, it also includes the hierarchical structure and reading order of layout elements.\n",
"\n",
"A sample of the PubLayNet data set is shown in the figure below:\n",
"<center class=\"two\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/4b153117c9384f98a0ce5a6c6e7c205a4b1c57e95c894ccb9688cbfc94e68a1c\" width=\"400\"/><img src=\"https://ai-studio-static-online.cdn.bcebos.com/efb9faea39554760b280f9e0e70631d2915399fa97774eecaa44ee84411c4994\" width=\"400\"/>\n",
"</center>\n",
"<center>Figure 3: PubLayNet example</center>\n",
"Reference:\n",
"\n",
"[1]:Soto C, Yoo S. Visual detection with context for document layout analysis[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 3464-3470.\n",
"\n",
"[2]:Li K, Wigington C, Tensmeyer C, et al. Cross-domain document object detection: Benchmark suite and method[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 12915-12924.\n",
"\n",
"[3]:Sarkar M, Aggarwal M, Jain A, et al. Document Structure Extraction using Prior based High Resolution Hierarchical Semantic Segmentation[C]//European Conference on Computer Vision. Springer, Cham, 2020: 649-666.\n",
"\n",
"[4]:Zhang P, Li C, Qiao L, et al. VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations[J]. arXiv preprint arXiv:2105.06220, 2021.\n",
"\n",
"[5]:Zhong X, Tang J, Yepes A J. Publaynet: largest dataset ever for document layout analysis[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1015-1022.\n",
"\n",
"[6]:Li M, Xu Y, Cui L, et al. DocBank: A benchmark dataset for document layout analysis[J]. arXiv preprint arXiv:2006.01038, 2020.\n",
"\n",
"[7]:Shen Z, Zhang K, Dell M. A large dataset of historical japanese documents with complex layouts[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020: 548-549."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Form Recognition\n",
"\n",
"### 2.1 Background Introduction\n",
"\n",
"Tables are common page elements in various types of documents. With the explosive growth of various types of documents, how to efficiently find tables from documents and obtain content and structure information, that is, table identification has become an urgent problem to be solved. The difficulties of form identification are summarized as follows:\n",
"\n",
"1. The types and styles of tables are complex and diverse, such as *different rows and columns combined, different content text types*, etc.\n",
"2. The style of the document itself has various styles.\n",
"3. The lighting environment during shooting, etc.\n",
"\n",
"The task of table recognition is to convert the table information in the document to an excel file. The task visualization is as follows:\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/99faa017e28b4928a408573406870ecaa251b626e0e84ab685e4b6f06f601a5f\" width=\"1600\"/></center>\n",
"\n",
"\n",
"<center>Figure 4: Example image of table recognition, the left side is the original image, and the right side is the result image after table recognition, presented in Excel format</center>\n",
"\n",
"Existing table recognition algorithms can be divided into the following four categories according to the principle of table structure reconstruction:\n",
"1. Method based on heuristic rules\n",
"2. CNN-based method\n",
"3. GCN-based method\n",
"4. Method based on End to End\n",
"\n",
"Some representative papers are divided into the above four categories, as shown in the following table:\n",
"\n",
"| Category | Idea | Main papers |\n",
"| ---------------- | ---- | -------- |\n",
"|Method based on heuristic rules|Manual design rules, connected domain detection analysis and processing|[T-Rect](https://www.researchgate.net/profile/Andreas-Dengel/publication/249657389_A_Paper-to-HTML_Table_Converting_System/links/0c9605322c9a67274d000000/A-Paper-to-HTML-Table-Converting-System.pdf),[pdf2table](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.724.7272&rep=rep1&type=pdf)|\n",
"| CNN-based method | target detection, semantic segmentation | [CascadeTabNet](https://arxiv.org/pdf/2004.12629v2.pdf), [Multi-Type-TD-TSR](https://arxiv.org/pdf/2105.11021v1.pdf), [LGPMA](https://arxiv.org/pdf/2105.06224v2.pdf), [tabstruct-net](https://arxiv.org/pdf/2010.04565v1.pdf), [CDeC-Net](https://arxiv.org/pdf/2008.10831v1.pdf), [TableNet](https://arxiv.org/pdf/2001.01469v1.pdf), [TableSense](https://arxiv.org/pdf/2106.13500v1.pdf), [Deepdesrt](https://www.dfki.de/fileadmin/user_upload/import/9672_PID4966073.pdf), [Deeptabstr](https://www.dfki.de/fileadmin/user_upload/import/10649_DeepTabStR.pdf), [GTE](https://arxiv.org/pdf/2005.00589v2.pdf), [Cycle-CenterNet](https://arxiv.org/pdf/2109.02199v1.pdf), [FCN](https://www.researchgate.net/publication/339027294_Rethinking_Semantic_Segmentation_for_Table_Structure_Recognition_in_Documents)|\n",
"| GCN-based method | Based on graph neural network, the table recognition is regarded as a graph reconstruction problem | [GNN](https://arxiv.org/pdf/1905.13391v2.pdf), [TGRNet](https://arxiv.org/pdf/2106.10598v3.pdf), [GraphTSR](https://arxiv.org/pdf/1908.04729v2.pdf)|\n",
"| Method based on End to End | Use attention mechanism | [Table-Master](https://arxiv.org/pdf/2105.01848v1.pdf)|\n",
"\n",
"### 2.2 Traditional Algorithm Based on Heuristic Rules\n",
"Early research on table recognition was mainly based on heuristic rules. For example, the T-Rect system proposed by Kieninger [1] et al. used a bottom-up method to analyze the connected domain of document images, and then merge them according to defined rules to obtain logical text blocks. Then, pdf2table proposed by Yildiz[2] et al. is the first method for table recognition on PDF documents, which utilizes some unique information of PDF files (such as text, drawing paths and other information that are difficult to obtain in image documents) to assist with table recognition. In recent work, Koci[3] et al. expressed the layout area in the page as a graph, and then used the Remove and Conquer (RAC) algorithm to identify the table as a subgraph.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/66aeedb3f0924d80aee15f185e6799cc687b51fc20b74b98b338ca2ea25be3f3\" width=\"1000\"/></center>\n",
"<center>Figure 5: Schematic diagram of heuristic algorithm</center>\n",
"\n",
"### 2.3 Method Based on Deep Learning CNN\n",
"With the rapid development of deep learning technology in computer vision, natural language processing, speech processing and other fields, researchers have applied deep learning technology to the field of table recognition and achieved good results.\n",
"\n",
"In the DeepTabStR algorithm, Siddiqui Shoaib Ahmed [12] et al. described the table structure recognition problem as an object detection problem, and used deformable convolution to better detect table cells. Raja Sachin[6] et al. proposed that TabStruct-Net combines cell detection and structure recognition visually to perform table structure recognition, which solves the problem of recognition errors due to large changes in the table layout. However, this method cannot Deal with the problem of more empty cells in rows and columns.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/838be28836444bc1835ac30a25613d8b045a1b5aedd44b258499fe9f93dd298f\" width=\"1600\"/></center>\n",
"<center>Figure 6: Schematic diagram of algorithm based on deep learning CNN</center>\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/4c40dda737bd44b09a533e1b1dd2e4c6a90ceea083bf4238b7f3c7b21087f409\" width=\"1600\"/></center>\n",
"<center>Figure 7: Example of algorithm error based on deep learning CNN</center>\n",
"\n",
"The previous table structure recognition method generally starts from the elements of different granularities (row/column, text area), and it is easy to ignore the problem of merging empty cells. Qiao Liang [10] et al. proposed a new framework LGPMA, which makes full use of the information from local and global features through mask re-scoring strategy, and then can obtain more reliable alignment of the cell area, and finally introduces the inclusion of cell matching, empty The table structure restoration pipeline of cell search and empty cell merging handles the problem of table structure identification.\n",
"\n",
"In addition to the above separate table recognition algorithms, there are also some methods that complete table detection and table recognition in one model. Schreiber Sebastian [11] et al. proposed DeepDeSRT, which uses Faster RCNN for table detection and FCN semantic segmentation model for Table structure row and column detection, but this method uses two independent models to solve these two problems. Prasad Devashish [4] et al. proposed an end-to-end deep learning method CascadeTabNet, which uses the Cascade Mask R-CNN HRNet model to perform table detection and structure recognition at the same time, which solves the problem of using two independent methods to process table recognition in the past. The insufficiency of the problem. Paliwal Shubham [8] et al. proposed a novel end-to-end deep multi-task architecture TableNet, which is used for table detection and structure recognition. At the same time, additional spatial semantic features are added to TableNet during training to further improve the performance of the model. Zheng Xinyi [13] et al. proposed a system framework GTE for table recognition, using a cell detection network to guide the training of the table detection network, and proposed a hierarchical network and a new clustering-based cell structure recognition algorithm, the framework can be connected to the back of any target detection model to facilitate the training of different table recognition algorithms. Previous research mainly focused on parsing from scanned PDF documents with a simple layout and well-aligned table images. However, the tables in real scenes are generally complex and may have serious deformation, bending or occlusion. Therefore, Long Rujiao [14] et al. also constructed a table recognition data set WTW in a realistic complex scene, and proposed a Cycle-CenterNet method, which uses the cyclic pairing module optimization and the proposed new pairing loss to accurately group discrete units into structured In the table, the performance of table recognition is improved.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a01f714cbe1f42fc9c45c6658317d9d7da2cec9726844f6b9fa75e30cadc9f76\" width=\"1600\"/></center>\n",
"<center>Figure 8: Schematic diagram of end-to-end algorithm</center>\n",
"\n",
"The CNN-based method cannot handle the cross-row and column tables well, so in the follow-up method, two research methods are divided to solve the cross-row and column problems in the table.\n",
"\n",
"### 2.4 Method Based on Deep Learning GCN\n",
"In recent years, with the rise of Graph Convolutional Networks (Graph Convolutional Network), some researchers have tried to apply graph neural networks to table structure recognition problems. Qasim Shah Rukh [20] et al. converted the table structure recognition problem into a graph problem compatible with graph neural networks, and designed a novel differentiable architecture that can not only take advantage of the advantages of convolutional neural networks to extract features, but also The advantages of the effective interaction between the vertices of the graph neural network can be used, but this method only uses the location features of the cells, and does not use the semantic features. Chi Zewen [19] et al. proposed a novel graph neural network, GraphTSR, for table structure recognition in PDF files. It takes cells in the table as input, and then uses the characteristics of the edges and nodes of the graph to be connected. Predicting the relationship between cells to identify the table structure solves the problem of cell identification across rows or columns to a certain extent. Xue Wenyuan [21] et al. reformulated the problem of table structure recognition as table map reconstruction, and proposed an end-to-end method for table structure recognition, TGRNet, which includes cell detection branch and cell logic location branch. , These two branches jointly predict the spatial and logical positions of different cells, which solves the problem that the previous method did not pay attention to the logical position of cells.\n",
"\n",
"Diagram of GraphTSR table recognition algorithm:\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/8ff89661142045a8aef54f8a7a2c69b1d243f8269034406a9e66bee2149f730f\" width=\"1600\"/></center>\n",
"<center>Figure 9: Diagram of GraphTSR table recognition algorithm</center>\n",
"\n",
"### 2.5 Based on An End-to-End Approach\n",
"\n",
"Different from other post-processing to complete the reconstruction of the table structure, based on the end-to-end method, directly use the network to complete the HTML representation output of the table structure\n",
"\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/7865e58a83824facacfaa91bec12ccf834217cb706454dc5a0c165c203db79fb) | ![](https://ai-studio-static-online.cdn.bcebos.com/77d913b1b92f4a349b8f448e08ba78458d687eef4af142678a073830999f3edc))\n",
"---|---\n",
"Figure 10: Input and output of the end-to-end method | Figure 11: Image Caption example\n",
"\n",
"Most end-to-end methods use Image Caption's Seq2Seq method to complete the prediction of the table structure, such as some methods based on Attention or Transformer.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/3571280a9c364d3499a062e3edc724294fb5eaef8b38440991941e87f0af0c3b\" width=\"800\"/></center>\n",
"<center>Figure 12: Schematic diagram of Seq2Seq</center>\n",
"\n",
"Ye Jiaquan [22] obtained the table structure output model in TableMaster by improving the Master text algorithm based on Transformer. In addition, a branch is added for the coordinate regression of the box. The author did not split the model into two branches in the last layer, but decoupled the sequence prediction and the box regression into two after the first Transformer decoding layer. Branches. The comparison between its network structure and the original Master network is shown in the figure below:\n",
"\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/f573709447a848b4ba7c73a2e297f0304caaca57c5c94588aada1f4cd893946c\" width=\"800\"/></center>\n",
"<center>Figure 13: Left: master network diagram, right: TableMaster network diagram</center>\n",
"\n",
"\n",
"### 2.6 Data Set\n",
"\n",
"Since the deep learning method is data-driven, a large amount of labeled data is required to train the model, and the small size of the existing data set is also an important constraint, so some data sets have also been proposed.\n",
"\n",
"1. PubTabNet[16]: Contains 568k table images and corresponding structured HTML representations.\n",
"2. PubMed Tables (PubTables-1M) [17]: Table structure recognition data set, containing highly detailed structural annotations, 460,589 pdf images used for table inspection tasks, and 947,642 table images used for table recognition tasks.\n",
"3. TableBank[18]: Table detection and recognition data set, using Word and Latex documents on the Internet to construct table data containing 417K high-quality annotations.\n",
"4. SciTSR[19]: Table structure recognition data set, most of the images are converted from the paper, which contains 15,000 tables from PDF files and their corresponding structure tags.\n",
"5. TabStructDB[12]: Includes 1081 table areas, which are marked with dense row and column information.\n",
"6. WTW[14]: Large-scale data set scene table detection and recognition data set, this data set contains table data in various deformation, bending and occlusion situations, and contains 14,581 images in total.\n",
"\n",
"Data set example\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/c9763df56e67434f97cd435100d50ded71ba66d9d4f04d7f8f896d613cdf02b0\" /></center>\n",
"<center>Figure 14: Sample diagram of PubTables-1M data set</center>\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/64de203bbe584642a74f844ac4b61d1ec3c5a38cacb84443ac961fbcc54a66ce\" width=\"600\"/></center>\n",
"<center>Figure 15: Sample diagram of WTW data set</center>\n",
"\n",
"\n",
"\n",
"Reference\n",
"\n",
"[1]:Kieninger T, Dengel A. A paper-to-HTML table converting system[C]//Proceedings of document analysis systems (DAS). 1998, 98: 356-365.\n",
"\n",
"[2]:Yildiz B, Kaiser K, Miksch S. pdf2table: A method to extract table information from pdf files[C]//IICAI. 2005: 1773-1785.\n",
"\n",
"[3]:Koci E, Thiele M, Lehner W, et al. Table recognition in spreadsheets via a graph representation[C]//2018 13th IAPR International Workshop on Document Analysis Systems (DAS). IEEE, 2018: 139-144.\n",
"\n",
"[4]:Prasad D, Gadpal A, Kapadni K, et al. CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020: 572-573.\n",
"\n",
"[5]:Fischer P, Smajic A, Abrami G, et al. Multi-Type-TD-TSR–Extracting Tables from Document Images Using a Multi-stage Pipeline for Table Detection and Table Structure Recognition: From OCR to Structured Table Representations[C]//German Conference on Artificial Intelligence (Künstliche Intelligenz). Springer, Cham, 2021: 95-108.\n",
"\n",
"[6]:Raja S, Mondal A, Jawahar C V. Table structure recognition using top-down and bottom-up cues[C]//European Conference on Computer Vision. Springer, Cham, 2020: 70-86.\n",
"\n",
"[7]:Agarwal M, Mondal A, Jawahar C V. Cdec-net: Composite deformable cascade network for table detection in document images[C]//2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021: 9491-9498.\n",
"\n",
"[8]:Paliwal S S, Vishwanath D, Rahul R, et al. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 128-133.\n",
"\n",
"[9]:Dong H, Liu S, Han S, et al. Tablesense: Spreadsheet table detection with convolutional neural networks[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33(01): 69-76.\n",
"\n",
"[10]:Qiao L, Li Z, Cheng Z, et al. LGPMA: Complicated Table Structure Recognition with Local and Global Pyramid Mask Alignment[J]. arXiv preprint arXiv:2105.06224, 2021.\n",
"\n",
"[11]:Schreiber S, Agne S, Wolf I, et al. Deepdesrt: Deep learning for detection and structure recognition of tables in document images[C]//2017 14th IAPR international conference on document analysis and recognition (ICDAR). IEEE, 2017, 1: 1162-1167.\n",
"\n",
"[12]:Siddiqui S A, Fateh I A, Rizvi S T R, et al. Deeptabstr: Deep learning based table structure recognition[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1403-1409.\n",
"\n",
"[13]:Zheng X, Burdick D, Popa L, et al. Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021: 697-706.\n",
"\n",
"[14]:Long R, Wang W, Xue N, et al. Parsing Table Structures in the Wild[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 944-952.\n",
"\n",
"[15]:Siddiqui S A, Khan P I, Dengel A, et al. Rethinking semantic segmentation for table structure recognition in documents[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1397-1402.\n",
"\n",
"[16]:Zhong X, ShafieiBavani E, Jimeno Yepes A. Image-based table recognition: data, model, and evaluation[C]//Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. Springer International Publishing, 2020: 564-580.\n",
"\n",
"[17]:Smock B, Pesala R, Abraham R. PubTables-1M: Towards a universal dataset and metrics for training and evaluating table extraction models[J]. arXiv preprint arXiv:2110.00061, 2021.\n",
"\n",
"[18]:Li M, Cui L, Huang S, et al. Tablebank: Table benchmark for image-based table detection and recognition[C]//Proceedings of the 12th Language Resources and Evaluation Conference. 2020: 1918-1925.\n",
"\n",
"[19]:Chi Z, Huang H, Xu H D, et al. Complicated table structure recognition[J]. arXiv preprint arXiv:1908.04729, 2019.\n",
"\n",
"[20]:Qasim S R, Mahmood H, Shafait F. Rethinking table recognition using graph neural networks[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 142-147.\n",
"\n",
"[21]:Xue W, Yu B, Wang W, et al. TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition[J]. arXiv preprint arXiv:2106.10598, 2021.\n",
"\n",
"[22]:Ye J, Qi X, He Y, et al. PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML[J]. arXiv preprint arXiv:2105.01848, 2021.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Document VQA\n",
"\n",
"The boss sent a task: develop an ID card recognition system\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/63bbe893465e4f98b3aec80a042758b520d43e1a993a47e39bce1123c2d29b3f\" width=\"1600\"/></center>\n",
"\n",
"> How to choose a plan\n",
"> 1. Use rules to extract information after text detection\n",
"> 2. Use scale type to extract information after text detection\n",
"> 3. Outsourcing\n",
"\n",
"\n",
"### 3.1 Background Introduction\n",
"In the VQA (Visual Question Answering) task, questions and answers are mainly aimed at the content of the image, but for text images, the content of interest is the text information in the image, so this type of method can be divided into Text-VQA and text-VQA in natural scenes. DocVQA of the scanned document scene, the relationship between the three is shown in the figure below.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a91cfd5152284152b020ca8a396db7a21fd982e3661540d5998cc19c17d84861\" width=\"600\"/></center>\n",
"<center>Figure 16: VQA level</center>\n",
"\n",
"The sample pictures of VQA, Text-VQA and DocVQA are shown in the figure below.\n",
"\n",
"|Task type|VQA | Text-VQA | DocVQA| \n",
"|---|---|---|---|\n",
"|Task description|Ask questions regarding **picture content**|Ask questions regarding **text content on pictures**|Ask questions regarding **text content of document images**|\n",
"|Sample picture|![vqa](https://ai-studio-static-online.cdn.bcebos.com/fc21b593276247249591231b3373608151ed8ae7787f4d6ba39e8779fdd12201)|![textvqa](https://ai-studio-static-online.cdn.bcebos.com/cd2404edf3bf430b89eb9b2509714499380cd02e4aa74ec39ca6d7aebcf9a559)|![docvqa](https://ai-studio-static-online.cdn.bcebos.com/0eec30a6f91b4f949c56729b856f7ff600d06abee0774642801c070303edfe83)|\n",
"\n",
"Because DocVQA is closer to actual application scenarios, a large number of academic and industrial work has emerged. In common scenarios, the questions asked in DocVQA are fixed. For example, the questions in the ID card scenario are generally\n",
"1. What is the citizenship number?\n",
"2. What is your name?\n",
"3. What is a clan?\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/2d2b86468daf47c98be01f44b8d6efa64bc09e43cd764298afb127f19b07aede\" width=\"800\"/></center>\n",
"<center>Figure 17: Example of an ID card</center>\n",
"\n",
"\n",
"Based on this prior knowledge, DocVQA's research began to lean towards the Key Information Extraction (KIE) task. This time we also mainly discuss the KIE-related research. The KIE task mainly extracts the key information needed from the image, such as extracting from the ID card. Name and citizen identification number information.\n",
"\n",
"KIE is usually divided into two sub-tasks for research\n",
"1. SER: Semantic Entity Recognition, to classify each detected text, such as dividing it into name and ID. As shown in the black box and red box in the figure below.\n",
"2. RE: Relation Extraction, which classifies each detected text, such as dividing it into questions and answers. Then find the corresponding answer to each question. As shown in the figure below, the red and black boxes represent the question and the answer, respectively, and the yellow line represents the correspondence between the question and the answer.\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/899470ba601349fbbc402a4c83e6cdaee08aaa10b5004977b1f684f346ebe31f\" width=\"800\"/></center>\n",
"<center>Figure 18: Example of SER, RE task</center>\n",
"\n",
"The general KIE method is researched based on Named Entity Recognition (NER) [4], but this type of method only uses the text information in the image and lacks the use of visual and structural information, so the accuracy is not high. On this basis, the methods in recent years have begun to integrate visual and structural information with text information. According to the principles used when fusing multimodal information, these methods can be divided into the following three types:\n",
"\n",
"1. Grid-based approach\n",
"1. Token-based approach\n",
"2. GCN-based method\n",
"3. Based on the End to End method\n",
"\n",
"Some representative papers are divided into the above three categories, as shown in the following table:\n",
"\n",
"| Category | Ideas | Main Papers |\n",
"| ---------------- | ---- | -------- |\n",
"| Grid-based method | Fusion of multi-modal information on images (text, layout, image) | [Chargrid](https://arxiv.org/pdf/1809.08799) |\n",
"| Token-based method|Using methods such as Bert for multi-modal information fusion|[LayoutLM](https://arxiv.org/pdf/1912.13318), [LayoutLMv2](https://arxiv.org/pdf/2012.14740), [StrucText](https://arxiv.org/pdf/2108.02923), |\n",
"| GCN-based method|Using graph network structure for multi-modal information fusion|[GCN](https://arxiv.org/pdf/1903.11279), [PICK](https://arxiv.org/pdf/2004.07464), [SDMG-R](https://arxiv.org/pdf/2103.14470), [SERA](https://arxiv.org/pdf/2110.09915) |\n",
"| Based on End to End method | Unify OCR and key information extraction into one network | [Trie](https://arxiv.org/pdf/2005.13118) |\n",
"\n",
"### 3.2 Grid-Based Method\n",
"\n",
"The Grid-based method performs multimodal information fusion at the image level. Chargrid[5] firstly performs character-level text detection and recognition on the image, and then completes the construction of the network input by filling the one-hot encoding of the category into the corresponding character area (the non-black part in the right image below) , the input is finally passed through the CNN network of the encoder-decoder structure to perform coordinate detection and category classification of key information.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/f248841769ec4312a9015b4befda37bf29db66226431420ca1faad517783875e\" width=\"800\"/></center>\n",
"<center>Figure 19: Chargrid data example</center>\n",
"\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/0682e52e275b4187a0e74f54961a50091fd3a0cdff734e17bedcbc993f6e29f9\" width=\"800\"/></center>\n",
"<center>Figure 20: Chargrid network</center>\n",
"\n",
"\n",
"Compared with the traditional method based only on text, this method can use both text information and structural information, so it can achieve a certain accuracy improvement. It's good to combine the two.\n",
"\n",
"### 3.3 Token-Based Method\n",
"LayoutLM[6] encodes 2D position information and text information together into the BERT model, and draws on the pre-training ideas of Bert in NLP, pre-training on large-scale data sets, and in downstream tasks, LayoutLM also adds image information To further improve the performance of the model. Although LayoutLM combines text, location and image information, the image information is fused in the training of downstream tasks, so the multi-modal fusion of the three types of information is not sufficient. Based on LayoutLM, LayoutLMv2 [7] integrates image information with text and layout information in the pre-training stage through transformers, and also adds a spatial perception self-attention mechanism to the Transformer to assist the model to better integrate visual and text features. Although LayoutLMv2 fuses text, location and image information in the pre-training stage, the visual features learned by the model are not fine enough due to the limitation of the pre-training task. StrucTexT [8] based on the previous multi-modal methods, proposed two new tasks, Sentence Length Prediction (SLP) and Paired Boxes Direction (PBD) in the pre-training task to help the network learn fine visual features. Among them, the SLP task makes the model Learn the length of the text segment, the PDB task allows the model to learn the matching relationship between Box directions. Through these two new pre-training tasks, the deep cross-modal fusion between text, visual and layout information can be accelerated.\n",
"\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/17a26ade09ee4311b90e49a1c61d88a72a82104478434f9dabd99c27a65d789b) | ![](https://ai-studio-static-online.cdn.bcebos.com/d75addba67ef4b06a02ae40145e609d3692d613ff9b74cec85123335b465b3cc))\n",
"---|---\n",
"Figure 21: Transformer algorithm flow chart | Figure 22: LayoutLMv2 algorithm flow chart\n",
"\n",
"### 3.4 GCN-Based Method\n",
"\n",
"Although the existing GCN-based methods [10] use text and structure information, they do not make good use of image information. PICK [11] added image information to the GCN network and proposed a graph learning module to automatically learn edge types. SDMG-R [12] encodes the image as a bimodal graph. The nodes of the graph are the visual and textual information of the text area. The edges represent the direct spatial relationship between adjacent texts. By iteratively spreading information along the edges and inferring graph node categories, SDMG -R solves the problem that existing methods are incapable of unseen templates.\n",
"\n",
"\n",
"The PICK flow chart is shown in the figure below:\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/d3282959e6b2448c89b762b3b9bbf6197a0364b101214a1f83cf01a28623c01c\" width=\"800\"/></center>\n",
"<center>Figure 23: PICK algorithm flow chart</center>\n",
"\n",
"SERA[10]The biaffine parser in dependency syntax analysis is introduced into document relation extraction, and GCN is used to fuse text and visual information.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a97b7647968a4fa59e7b14b384dd7ffe812f158db8f741459b6e6bb0e8b657c7\" width=\"800\"/></center>\n",
"<center>Figure 24: SERA algorithm flow chart</center>\n",
"\n",
"### 3.5 Method Based on End to End\n",
"\n",
"Existing methods divide KIE into two independent tasks: text reading and information extraction. However, they mainly focus on improving the task of information extraction, ignoring that text reading and information extraction are interrelated. Therefore, Trie [9] Proposed a unified end-to-end network that can learn these two tasks at the same time and reinforce each other in the learning process.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/6e4a3b0f65254f6b9d40cea0875854d4f47e1dca6b1e408cad435b3629600608\" width=\"1300\"/></center>\n",
"<center>Figure 25: Trie algorithm flow chart</center>\n",
"\n",
"\n",
"### 3.6 Data Set\n",
"The data sets used for KIE mainly include the following two:\n",
"1. SROIE: Task 3 of the SROIE data set [2] aims to extract four predefined information from the scanned receipt: company, date, address or total. There are 626 samples in the data set for training and 347 samples for testing.\n",
"2. FUNSD: FUNSD data set [3] is a data set used to extract form information from scanned documents. It contains 199 marked real scan forms. Of the 199 samples, 149 are used for training and 50 are used for testing. The FUNSD data set assigns a semantic entity tag to each word: question, answer, title or other.\n",
"3. XFUN: The XFUN data set is a multilingual data set proposed by Microsoft. It contains 7 languages. Each language contains 149 training sets and 50 test sets.\n",
"\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/dfdf530d79504761919c1f093f9a86dac21e6db3304c4892998ea1823f3187c6) | ![](https://ai-studio-static-online.cdn.bcebos.com/3b2a9f9476be4e7f892b73bd7096ce8d88fe98a70bae47e6ab4c5fcc87e83861))\n",
"---|---\n",
"Figure 26: sroie example image | Figure 27: xfun example image\n",
"\n",
"Reference:\n",
"\n",
"[1]:Mathew M, Karatzas D, Jawahar C V. Docvqa: A dataset for vqa on document images[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021: 2200-2209.\n",
"\n",
"[2]:Huang Z, Chen K, He J, et al. Icdar2019 competition on scanned receipt ocr and information extraction[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1516-1520.\n",
"\n",
"[3]:Jaume G, Ekenel H K, Thiran J P. Funsd: A dataset for form understanding in noisy scanned documents[C]//2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). IEEE, 2019, 2: 1-6.\n",
"\n",
"[4]:Lample G, Ballesteros M, Subramanian S, et al. Neural architectures for named entity recognition[J]. arXiv preprint arXiv:1603.01360, 2016.\n",
"\n",
"[5]:Katti A R, Reisswig C, Guder C, et al. Chargrid: Towards understanding 2d documents[J]. arXiv preprint arXiv:1809.08799, 2018.\n",
"\n",
"[6]:Xu Y, Li M, Cui L, et al. Layoutlm: Pre-training of text and layout for document image understanding[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020: 1192-1200.\n",
"\n",
"[7]:Xu Y, Xu Y, Lv T, et al. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding[J]. arXiv preprint arXiv:2012.14740, 2020.\n",
"\n",
"[8]:Li Y, Qian Y, Yu Y, et al. StrucTexT: Structured Text Understanding with Multi-Modal Transformers[C]//Proceedings of the 29th ACM International Conference on Multimedia. 2021: 1912-1920.\n",
"\n",
"[9]:Zhang P, Xu Y, Cheng Z, et al. Trie: End-to-end text reading and information extraction for document understanding[C]//Proceedings of the 28th ACM International Conference on Multimedia. 2020: 1413-1422.\n",
"\n",
"[10]:Liu X, Gao F, Zhang Q, et al. Graph convolution for multimodal information extraction from visually rich documents[J]. arXiv preprint arXiv:1903.11279, 2019.\n",
"\n",
"[11]:Yu W, Lu N, Qi X, et al. Pick: Processing key information extraction from documents using improved graph learning-convolutional networks[C]//2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021: 4363-4370.\n",
"\n",
"[12]:Sun H, Kuang Z, Yue X, et al. Spatial Dual-Modality Graph Reasoning for Key Information Extraction[J]. arXiv preprint arXiv:2103.14470, 2021."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Summary\n",
"In this section, we mainly introduce the theoretical knowledge of three sub-modules related to document analysis technology: layout analysis, table recognition and information extraction. Below we will explain this form recognition and DOC-VQA practical tutorial based on the PaddleOCR framework."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# 1. Course Prerequisites\n",
"\n",
"The OCR model involved in this course is based on deep learning, so its related basic knowledge, environment configuration, project engineering and other materials will be introduced in this section, especially for readers who are not familiar with deep learning. content.\n",
"\n",
"### 1.1 Preliminary Knowledge\n",
"\n",
"The \"learning\" of deep learning has been developed from the content of neurons, perceptrons, and multilayer neural networks in machine learning. Therefore, understanding the basic machine learning algorithms is of great help to the understanding and application of deep learning. The \"deepness\" of deep learning is embodied in a series of vector-based mathematical operations such as convolution and pooling used in the process of processing a large amount of information. If you lack the theoretical foundation of the two, you can learn from teacher Li Hongyi's [Linear Algebra](https://aistudio.baidu.com/aistudio/course/introduce/2063) and [Machine Learning](https://aistudio.baidu.com/aistudio/course/introduce/1978) courses.\n",
"\n",
"For the understanding of deep learning itself, you can refer to the zero-based course of Bai Ran, an outstanding architect of Baidu: [Baidu architects take you hands-on with zero-based practice deep learning](https://aistudio.baidu.com/aistudio/course/introduce/1297), which covers the development history of deep learning and introduces the complete components of deep learning through a classic case. It is a set of practice-oriented deep learning courses.\n",
"\n",
"For the practice of theoretical knowledge, [Python basic knowledge](https://aistudio.baidu.com/aistudio/course/introduce/1224) is essential. At the same time, in order to quickly reproduce the deep learning model, the deep learning framework used in this course For: Flying PaddlePaddle. If you have used other frameworks, you can quickly learn how to use flying paddles through [Quick Start Document](https://www.paddlepaddle.org.cn/documentation/docs/zh/practices/quick_start/hello_paddle.html).\n",
"\n",
"### 1.2 Basic Environment Preparation\n",
"\n",
"If you want to run the code of this course in a local environment and have not built a Python environment before, you can follow the [zero-base operating environment preparation](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/doc/doc_ch/environment.md), install Anaconda or docker environment according to your operating system.\n",
"\n",
"If you don't have local resources, you can run the code through the AI Studio training platform. Each item in it is presented in a notebook, which is convenient for developers to learn. If you are not familiar with the related operations of Notebook, you can refer to [AI Studio Project Description](https://ai.baidu.com/ai-doc/AISTUDIO/0k3e2tfzm).\n",
"\n",
"### 1.3 Get and Run the Code\n",
"\n",
"This course relies on the formation of PaddleOCR's code repository. First, clone the complete project of PaddleOCR:\n",
"\n",
"```bash\n",
"# [recommend]\n",
"git clone https://github.com/PaddlePaddle/PaddleOCR\n",
"\n",
"# If you cannot pull successfully due to network problems, you can also choose to use the hosting on Code Cloud:\n",
"git clone https://gitee.com/paddlepaddle/PaddleOCR\n",
"```\n",
"\n",
 Note: The code cloud">
"> Note: The code hosted on Gitee may not be synchronized with this GitHub project in real time; there can be a delay of 3~5 days, so please use the recommended source first.\n",
">\n",
 If you are not familiar">
"> If you are not familiar with git, you can also download a zip archive directly via the `Code` button on the PaddleOCR homepage.\n",
"\n",
"Then install third-party libraries:\n",
"\n",
"```bash\n",
"cd PaddleOCR\n",
"pip3 install -r requirements.txt\n",
"```\n",
"\n",
"\n",
"\n",
"### 1.4 Access to Information\n",
"\n",
"[PaddleOCR Usage Document](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/README.md) describes in detail how to use PaddleOCR to complete model application, training and deployment. The document is rich in content, most of the user’s questions are described in the document or FAQ, especially in [FAQ](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/doc/doc_en/FAQ_en.md), in accordance with the application process of deep learning, has precipitated the user's common questions, it is recommended that you read it carefully.\n",
"\n",
"### 1.5 Ask for Help\n",
"\n",
"If you encounter BUG, ease of use or documentation related issues while using PaddleOCR, you can contact the official via [Github issue](https://github.com/PaddlePaddle/PaddleOCR/issues), please follow the issue template Provide as much information as possible so that official personnel can quickly locate the problem. At the same time, the WeChat group is the daily communication position for the majority of PaddleOCR users, and it is more suitable for asking some consulting questions. In addition to the PaddleOCR team members, there will also be enthusiastic developers answering your questions."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
......@@ -22,7 +22,8 @@ from .make_shrink_map import MakeShrinkMap
from .random_crop_data import EastRandomCropData, RandomCropImgMask
from .make_pse_gt import MakePseGt
from .rec_img_aug import RecAug, RecResizeImg, ClsResizeImg, SRNRecResizeImg, NRTRRecResizeImg, SARRecResizeImg
from .rec_img_aug import RecAug, RecResizeImg, ClsResizeImg, \
SRNRecResizeImg, NRTRRecResizeImg, SARRecResizeImg, PRENResizeImg
from .randaugment import RandAugment
from .copy_paste import CopyPaste
from .ColorJitter import ColorJitter
......
......@@ -785,6 +785,53 @@ class SARLabelEncode(BaseRecLabelEncode):
return [self.padding_idx]
class PRENLabelEncode(BaseRecLabelEncode):
def __init__(self,
max_text_length,
character_dict_path,
use_space_char=False,
**kwargs):
super(PRENLabelEncode, self).__init__(
max_text_length, character_dict_path, use_space_char)
def add_special_char(self, dict_character):
padding_str = '<PAD>' # 0
end_str = '<EOS>' # 1
unknown_str = '<UNK>' # 2
dict_character = [padding_str, end_str, unknown_str] + dict_character
self.padding_idx = 0
self.end_idx = 1
self.unknown_idx = 2
return dict_character
def encode(self, text):
if len(text) == 0 or len(text) >= self.max_text_len:
return None
if self.lower:
text = text.lower()
text_list = []
for char in text:
if char not in self.dict:
text_list.append(self.unknown_idx)
else:
text_list.append(self.dict[char])
text_list.append(self.end_idx)
if len(text_list) < self.max_text_len:
text_list += [self.padding_idx] * (
self.max_text_len - len(text_list))
return text_list
def __call__(self, data):
text = data['label']
encoded_text = self.encode(text)
if encoded_text is None:
return None
data['label'] = np.array(encoded_text)
return data
class VQATokenLabelEncode(object):
"""
Label encode for NLP VQA methods
......@@ -799,7 +846,7 @@ class VQATokenLabelEncode(object):
ocr_engine=None,
**kwargs):
super(VQATokenLabelEncode, self).__init__()
from paddlenlp.transformers import LayoutXLMTokenizer, LayoutLMTokenizer
from paddlenlp.transformers import LayoutXLMTokenizer, LayoutLMTokenizer, LayoutLMv2Tokenizer
from ppocr.utils.utility import load_vqa_bio_label_maps
tokenizer_dict = {
'LayoutXLM': {
......@@ -809,6 +856,10 @@ class VQATokenLabelEncode(object):
'LayoutLM': {
'class': LayoutLMTokenizer,
'pretrained_model': 'layoutlm-base-uncased'
},
'LayoutLMv2': {
'class': LayoutLMv2Tokenizer,
'pretrained_model': 'layoutlmv2-base-uncased'
}
}
self.contains_re = contains_re
......
......@@ -141,6 +141,25 @@ class SARRecResizeImg(object):
return data
class PRENResizeImg(object):
def __init__(self, image_shape, **kwargs):
"""
According to the original paper's realization, a hard resize method is used here.
So you may want to optimize it to better fit your task.
"""
self.dst_h, self.dst_w = image_shape
def __call__(self, data):
img = data['image']
resized_img = cv2.resize(
img, (self.dst_w, self.dst_h), interpolation=cv2.INTER_LINEAR)
resized_img = resized_img.transpose((2, 0, 1)) / 255
resized_img -= 0.5
resized_img /= 0.5
data['image'] = resized_img.astype(np.float32)
return data
def resize_norm_img_sar(img, image_shape, width_downsample_ratio=0.25):
imgC, imgH, imgW_min, imgW_max = image_shape
h = img.shape[0]
......
......@@ -12,6 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from collections import defaultdict
class VQASerTokenChunk(object):
def __init__(self, max_seq_len=512, infer_mode=False, **kwargs):
......@@ -39,6 +41,8 @@ class VQASerTokenChunk(object):
encoded_inputs_example[key] = data[key]
encoded_inputs_all.append(encoded_inputs_example)
if len(encoded_inputs_all) == 0:
return None
return encoded_inputs_all[0]
......@@ -101,17 +105,18 @@ class VQAReTokenChunk(object):
"entities": self.reformat(entities_in_this_span),
"relations": self.reformat(relations_in_this_span),
})
item['entities']['label'] = [
self.entities_labels[x] for x in item['entities']['label']
]
encoded_inputs_all.append(item)
if len(item['entities']) > 0:
item['entities']['label'] = [
self.entities_labels[x] for x in item['entities']['label']
]
encoded_inputs_all.append(item)
if len(encoded_inputs_all) == 0:
return None
return encoded_inputs_all[0]
def reformat(self, data):
new_data = {}
new_data = defaultdict(list)
for item in data:
for k, v in item.items():
if k not in new_data:
new_data[k] = []
new_data[k].append(v)
return new_data
......@@ -33,6 +33,7 @@ from .rec_srn_loss import SRNLoss
from .rec_nrtr_loss import NRTRLoss
from .rec_sar_loss import SARLoss
from .rec_aster_loss import AsterLoss
from .rec_pren_loss import PRENLoss
# cls loss
from .cls_loss import ClsLoss
......@@ -59,7 +60,7 @@ def build_loss(config):
'DBLoss', 'PSELoss', 'EASTLoss', 'SASTLoss', 'FCELoss', 'CTCLoss',
'ClsLoss', 'AttentionLoss', 'SRNLoss', 'PGLoss', 'CombinedLoss',
'NRTRLoss', 'TableAttentionLoss', 'SARLoss', 'AsterLoss', 'SDMGRLoss',
'VQASerTokenLayoutLMLoss', 'LossFromOutput'
'VQASerTokenLayoutLMLoss', 'LossFromOutput', 'PRENLoss'
]
config = copy.deepcopy(config)
module_name = config.pop('name')
......
......@@ -31,7 +31,8 @@ class CTCLoss(nn.Layer):
predicts = predicts[-1]
predicts = predicts.transpose((1, 0, 2))
N, B, _ = predicts.shape
preds_lengths = paddle.to_tensor([N] * B, dtype='int64')
preds_lengths = paddle.to_tensor(
[N] * B, dtype='int64', place=paddle.CPUPlace())
labels = batch[1].astype("int32")
label_lengths = batch[2].astype('int64')
loss = self.loss_func(predicts, labels, preds_lengths, label_lengths)
......
# copyright (c) 2022 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from paddle import nn
class PRENLoss(nn.Layer):
def __init__(self, **kwargs):
super(PRENLoss, self).__init__()
# note: 0 is padding idx
self.loss_func = nn.CrossEntropyLoss(reduction='mean', ignore_index=0)
def forward(self, predicts, batch):
loss = self.loss_func(predicts, batch[1].astype('int64'))
return {'loss': loss}
......@@ -30,9 +30,10 @@ def build_backbone(config, model_type):
from .rec_resnet_31 import ResNet31
from .rec_resnet_aster import ResNet_ASTER
from .rec_micronet import MicroNet
from .rec_efficientb3_pren import EfficientNetb3_PREN
support_dict = [
'MobileNetV1Enhance', 'MobileNetV3', 'ResNet', 'ResNetFPN', 'MTB',
"ResNet31", "ResNet_ASTER", 'MicroNet'
"ResNet31", "ResNet_ASTER", 'MicroNet', 'EfficientNetb3_PREN'
]
elif model_type == "e2e":
from .e2e_resnet_vd_pg import ResNet
......@@ -45,8 +46,11 @@ def build_backbone(config, model_type):
from .table_mobilenet_v3 import MobileNetV3
support_dict = ["ResNet", "MobileNetV3"]
elif model_type == 'vqa':
from .vqa_layoutlm import LayoutLMForSer, LayoutXLMForSer, LayoutXLMForRe
support_dict = ["LayoutLMForSer", "LayoutXLMForSer", 'LayoutXLMForRe']
from .vqa_layoutlm import LayoutLMForSer, LayoutLMv2ForSer, LayoutLMv2ForRe, LayoutXLMForSer, LayoutXLMForRe
support_dict = [
"LayoutLMForSer", "LayoutLMv2ForSer", 'LayoutLMv2ForRe',
"LayoutXLMForSer", 'LayoutXLMForRe'
]
else:
raise NotImplementedError
......
# copyright (c) 2022 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Code is refer from:
https://github.com/RuijieJ/pren/blob/main/Nets/EfficientNet.py
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import math
from collections import namedtuple
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
__all__ = ['EfficientNetb3']
class EffB3Params:
@staticmethod
def get_global_params():
"""
The following are EfficientNet-B3's architecture hyperparameters, but to fit the scene
text recognition task, the resolution (image_size) here is changed
from 300 to 64.
"""
GlobalParams = namedtuple('GlobalParams', [
'drop_connect_rate', 'width_coefficient', 'depth_coefficient',
'depth_divisor', 'image_size'
])
global_params = GlobalParams(
drop_connect_rate=0.3,
width_coefficient=1.2,
depth_coefficient=1.4,
depth_divisor=8,
image_size=64)
return global_params
@staticmethod
def get_block_params():
BlockParams = namedtuple('BlockParams', [
'kernel_size', 'num_repeat', 'input_filters', 'output_filters',
'expand_ratio', 'id_skip', 'se_ratio', 'stride'
])
block_params = [
BlockParams(3, 1, 32, 16, 1, True, 0.25, 1),
BlockParams(3, 2, 16, 24, 6, True, 0.25, 2),
BlockParams(5, 2, 24, 40, 6, True, 0.25, 2),
BlockParams(3, 3, 40, 80, 6, True, 0.25, 2),
BlockParams(5, 3, 80, 112, 6, True, 0.25, 1),
BlockParams(5, 4, 112, 192, 6, True, 0.25, 2),
BlockParams(3, 1, 192, 320, 6, True, 0.25, 1)
]
return block_params
class EffUtils:
@staticmethod
def round_filters(filters, global_params):
"""Calculate and round number of filters based on depth multiplier."""
multiplier = global_params.width_coefficient
if not multiplier:
return filters
divisor = global_params.depth_divisor
filters *= multiplier
new_filters = int(filters + divisor / 2) // divisor * divisor
if new_filters < 0.9 * filters:
new_filters += divisor
return int(new_filters)
@staticmethod
def round_repeats(repeats, global_params):
"""Round number of filters based on depth multiplier."""
multiplier = global_params.depth_coefficient
if not multiplier:
return repeats
return int(math.ceil(multiplier * repeats))
class ConvBlock(nn.Layer):
def __init__(self, block_params):
super(ConvBlock, self).__init__()
self.block_args = block_params
self.has_se = (self.block_args.se_ratio is not None) and \
(0 < self.block_args.se_ratio <= 1)
self.id_skip = block_params.id_skip
# expansion phase
self.input_filters = self.block_args.input_filters
output_filters = \
self.block_args.input_filters * self.block_args.expand_ratio
if self.block_args.expand_ratio != 1:
self.expand_conv = nn.Conv2D(
self.input_filters, output_filters, 1, bias_attr=False)
self.bn0 = nn.BatchNorm(output_filters)
# depthwise conv phase
k = self.block_args.kernel_size
s = self.block_args.stride
self.depthwise_conv = nn.Conv2D(
output_filters,
output_filters,
groups=output_filters,
kernel_size=k,
stride=s,
padding='same',
bias_attr=False)
self.bn1 = nn.BatchNorm(output_filters)
# squeeze and excitation layer, if desired
if self.has_se:
num_squeezed_channels = max(1,
int(self.block_args.input_filters *
self.block_args.se_ratio))
self.se_reduce = nn.Conv2D(output_filters, num_squeezed_channels, 1)
self.se_expand = nn.Conv2D(num_squeezed_channels, output_filters, 1)
# output phase
self.final_oup = self.block_args.output_filters
self.project_conv = nn.Conv2D(
output_filters, self.final_oup, 1, bias_attr=False)
self.bn2 = nn.BatchNorm(self.final_oup)
self.swish = nn.Swish()
def drop_connect(self, inputs, p, training):
if not training:
return inputs
batch_size = inputs.shape[0]
keep_prob = 1 - p
random_tensor = keep_prob
random_tensor += paddle.rand([batch_size, 1, 1, 1], dtype=inputs.dtype)
random_tensor = paddle.to_tensor(random_tensor, place=inputs.place)
binary_tensor = paddle.floor(random_tensor)
output = inputs / keep_prob * binary_tensor
return output
def forward(self, inputs, drop_connect_rate=None):
# expansion and depthwise conv
x = inputs
if self.block_args.expand_ratio != 1:
x = self.swish(self.bn0(self.expand_conv(inputs)))
x = self.swish(self.bn1(self.depthwise_conv(x)))
# squeeze and excitation
if self.has_se:
x_squeezed = F.adaptive_avg_pool2d(x, 1)
x_squeezed = self.se_expand(self.swish(self.se_reduce(x_squeezed)))
x = F.sigmoid(x_squeezed) * x
x = self.bn2(self.project_conv(x))
# skip connection and drop connect
if self.id_skip and self.block_args.stride == 1 and \
self.input_filters == self.final_oup:
if drop_connect_rate:
x = self.drop_connect(
x, p=drop_connect_rate, training=self.training)
x = x + inputs
return x
class EfficientNetb3_PREN(nn.Layer):
def __init__(self, in_channels):
super(EfficientNetb3_PREN, self).__init__()
self.blocks_params = EffB3Params.get_block_params()
self.global_params = EffB3Params.get_global_params()
self.out_channels = []
# stem
stem_channels = EffUtils.round_filters(32, self.global_params)
self.conv_stem = nn.Conv2D(
in_channels, stem_channels, 3, 2, padding='same', bias_attr=False)
self.bn0 = nn.BatchNorm(stem_channels)
self.blocks = []
# to extract three feature maps for fpn based on efficientnetb3 backbone
self.concerned_block_idxes = [7, 17, 25]
concerned_idx = 0
for i, block_params in enumerate(self.blocks_params):
block_params = block_params._replace(
input_filters=EffUtils.round_filters(block_params.input_filters,
self.global_params),
output_filters=EffUtils.round_filters(
block_params.output_filters, self.global_params),
num_repeat=EffUtils.round_repeats(block_params.num_repeat,
self.global_params))
self.blocks.append(
self.add_sublayer("{}-0".format(i), ConvBlock(block_params)))
concerned_idx += 1
if concerned_idx in self.concerned_block_idxes:
self.out_channels.append(block_params.output_filters)
if block_params.num_repeat > 1:
block_params = block_params._replace(
input_filters=block_params.output_filters, stride=1)
for j in range(block_params.num_repeat - 1):
self.blocks.append(
self.add_sublayer('{}-{}'.format(i, j + 1),
ConvBlock(block_params)))
concerned_idx += 1
if concerned_idx in self.concerned_block_idxes:
self.out_channels.append(block_params.output_filters)
self.swish = nn.Swish()
def forward(self, inputs):
outs = []
x = self.swish(self.bn0(self.conv_stem(inputs)))
for idx, block in enumerate(self.blocks):
drop_connect_rate = self.global_params.drop_connect_rate
if drop_connect_rate:
drop_connect_rate *= float(idx) / len(self.blocks)
x = block(x, drop_connect_rate=drop_connect_rate)
if idx in self.concerned_block_idxes:
outs.append(x)
return outs
......@@ -21,12 +21,14 @@ from paddle import nn
from paddlenlp.transformers import LayoutXLMModel, LayoutXLMForTokenClassification, LayoutXLMForRelationExtraction
from paddlenlp.transformers import LayoutLMModel, LayoutLMForTokenClassification
from paddlenlp.transformers import LayoutLMv2Model, LayoutLMv2ForTokenClassification, LayoutLMv2ForRelationExtraction
__all__ = ["LayoutXLMForSer", 'LayoutLMForSer']
pretrained_model_dict = {
LayoutXLMModel: 'layoutxlm-base-uncased',
LayoutLMModel: 'layoutlm-base-uncased'
LayoutLMModel: 'layoutlm-base-uncased',
LayoutLMv2Model: 'layoutlmv2-base-uncased'
}
......@@ -58,12 +60,34 @@ class NLPBaseModel(nn.Layer):
self.out_channels = 1
class LayoutXLMForSer(NLPBaseModel):
class LayoutLMForSer(NLPBaseModel):
def __init__(self, num_classes, pretrained=True, checkpoints=None,
**kwargs):
super(LayoutXLMForSer, self).__init__(
LayoutXLMModel,
LayoutXLMForTokenClassification,
super(LayoutLMForSer, self).__init__(
LayoutLMModel,
LayoutLMForTokenClassification,
'ser',
pretrained,
checkpoints,
num_classes=num_classes)
def forward(self, x):
x = self.model(
input_ids=x[0],
bbox=x[2],
attention_mask=x[4],
token_type_ids=x[5],
position_ids=None,
output_hidden_states=False)
return x
class LayoutLMv2ForSer(NLPBaseModel):
def __init__(self, num_classes, pretrained=True, checkpoints=None,
**kwargs):
super(LayoutLMv2ForSer, self).__init__(
LayoutLMv2Model,
LayoutLMv2ForTokenClassification,
'ser',
pretrained,
checkpoints,
......@@ -82,12 +106,12 @@ class LayoutXLMForSer(NLPBaseModel):
return x[0]
class LayoutLMForSer(NLPBaseModel):
class LayoutXLMForSer(NLPBaseModel):
def __init__(self, num_classes, pretrained=True, checkpoints=None,
**kwargs):
super(LayoutLMForSer, self).__init__(
LayoutLMModel,
LayoutLMForTokenClassification,
super(LayoutXLMForSer, self).__init__(
LayoutXLMModel,
LayoutXLMForTokenClassification,
'ser',
pretrained,
checkpoints,
......@@ -97,10 +121,33 @@ class LayoutLMForSer(NLPBaseModel):
x = self.model(
input_ids=x[0],
bbox=x[2],
image=x[3],
attention_mask=x[4],
token_type_ids=x[5],
position_ids=None,
output_hidden_states=False)
head_mask=None,
labels=None)
return x[0]
class LayoutLMv2ForRe(NLPBaseModel):
def __init__(self, pretrained=True, checkpoints=None, **kwargs):
super(LayoutLMv2ForRe, self).__init__(LayoutLMv2Model,
LayoutLMv2ForRelationExtraction,
're', pretrained, checkpoints)
def forward(self, x):
x = self.model(
input_ids=x[0],
bbox=x[1],
labels=None,
image=x[2],
attention_mask=x[3],
token_type_ids=x[4],
position_ids=None,
head_mask=None,
entities=x[5],
relations=x[6])
return x
......
......@@ -31,6 +31,7 @@ def build_head(config):
from .rec_nrtr_head import Transformer
from .rec_sar_head import SARHead
from .rec_aster_head import AsterHead
from .rec_pren_head import PRENHead
# cls head
from .cls_head import ClsHead
......@@ -43,7 +44,7 @@ def build_head(config):
support_dict = [
'DBHead', 'PSEHead', 'FCEHead', 'EASTHead', 'SASTHead', 'CTCHead',
'ClsHead', 'AttentionHead', 'SRNHead', 'PGHead', 'Transformer',
'TableAttentionHead', 'SARHead', 'AsterHead', 'SDMGRHead'
'TableAttentionHead', 'SARHead', 'AsterHead', 'SDMGRHead', 'PRENHead'
]
#table head
......
# copyright (c) 2022 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from paddle import nn
from paddle.nn import functional as F
class PRENHead(nn.Layer):
def __init__(self, in_channels, out_channels, **kwargs):
super(PRENHead, self).__init__()
self.linear = nn.Linear(in_channels, out_channels)
def forward(self, x, targets=None):
predicts = self.linear(x)
if not self.training:
predicts = F.softmax(predicts, axis=2)
return predicts