Unverified commit cf613b13 authored by Bin Lu, committed by GitHub

Merge branch 'PaddlePaddle:dygraph' into dygraph

parents 8fe6209d 732fa778
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text detection FAQ\n",
"\n",
"This section lists some of the problems that developers often encounter when using PaddleOCR's text detection model, and gives corresponding solutions or suggestions.\n",
"\n",
"The FAQ is introduced in two parts, namely:\n",
" -Text detection training related\n",
" -Text detection and prediction related"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. FAQ about Text Detection Training\n",
"\n",
"**1.1 What are the text detection algorithms provided by PaddleOCR?**\n",
"\n",
"**A**: PaddleOCR contains a variety of text detection models, including regression-based text detection methods EAST and SAST, and segmentation-based text detection methods DB, PSENet.\n",
"\n",
"\n",
"**1.2: What data sets are used in the Chinese ultra-lightweight and general models in the PaddleOCR project? How many samples were trained, what configuration of GPUs, how many epochs were run, and how long did they run?**\n",
"\n",
"**A**: For the ultra-lightweight DB detection model, the training data includes open source data sets lsvt, rctw, CASIA, CCPD, MSRA, MLT, BornDigit, iflytek, SROIE and synthetic data sets, etc. The total data volume is 10W, The data set is divided into 5 parts. A random sampling strategy is used during training. The training takes about 500 epochs on a 4-card V100GPU, which takes 3 days.\n",
"\n",
"\n",
"**1.3 Does the text detection training label require specific text labeling? What does the \"###\" in the label mean?**\n",
"\n",
"**A**: Text detection training only needs the coordinates of the text area. The label can be four or fourteen points, arranged in the order of upper left, upper right, lower right, and lower left. The label file provided by PaddleOCR contains text fields. For unclear text in the text area, ### will be used instead. When training the detection model, the text field in the label will not be used.\n",
" \n",
"**1.4 Is the effect of the text detection model trained when the text lines are tight?**\n",
"\n",
"**A**: When using segmentation-based methods, such as DB, to detect dense text lines, it is best to collect a batch of data for training, and during training, a binary image will be generated [shrink_ratio](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/ppocr/data/imaug/make_shrink_map.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L37)Turn down the parameter. In addition, when forecasting, you can appropriately reduce [unclip_ratio](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/configs/det/ch_ppocr_v2.0/ch_det_mv3_db_v2.0.yml?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L59) parameter, the larger the unclip_ratio parameter value, the larger the detection frame.\n",
"\n",
"\n",
"**1.5 For some large-sized document images, DB will have more missed inspections during inspection. How to avoid this kind of missed inspections?**\n",
"\n",
"**A**: First of all, you need to determine whether the model is not well-trained or is the problem handled during prediction. If the model is not well trained, it is recommended to add more data for training, or add more data to enhance it during training.\n",
"If the problem is that the predicted image is too large, you can increase the longest side setting parameter [det_limit_side_len] entered during prediction [det_limit_side_len](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L47), which is 960 by default.\n",
"Secondly, you can observe whether the missed text has segmentation results by visualizing the post-processed segmentation map. If there is no segmentation result, the model is not well trained. If there is a complete segmentation area, it means that it is a problem of post-prediction processing. In this case, it is recommended to adjust [DB post-processing parameters](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L51-L53)。\n",
"\n",
"\n",
"**1.6 The problem of missed detection of DB model bending text (such as a slightly deformed document image)?**\n",
"\n",
"**A**: When calculating the average score of the text box in the DB post-processing, it is the average score of the rectangle area, which is easy to cause the missed detection of the curved text. The average score of the polygon area has been added, which will be more accurate, but the speed is somewhat different. Decrease, can be selected as needed, and you can view the [Visual Contrast Effect] (https://github.com/PaddlePaddle/PaddleOCR/pull/2604) in the relevant pr. This function is selected by the parameter [det_db_score_mode](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.1/tools/infer/utility.py#L51), the parameter value is optional [`fast` (default) , `slow`], `fast` corresponds to the original rectangle mode, and `slow` corresponds to the polygon mode. Thanks to the user [buptlihang](https://github.com/buptlihang) for mentioning [pr](https://github.com/PaddlePaddle/PaddleOCR/pull/2574) to help solve this problem.\n",
"\n",
"\n",
"**1.7 For simple OCR tasks with low accuracy requirements, how many data sets do I need to prepare?**\n",
"\n",
"**A**: (1) The amount of training data is related to the complexity of the problem to be solved. The greater the difficulty and the higher the accuracy requirements, the greater the data set requirements, and in general, the more training data in practice, the better the effect.\n",
"\n",
"(2) For scenes with low accuracy requirements, the amount of data required for detection tasks and recognition tasks is different. For inspection tasks, 500 images can guarantee the basic inspection results. For recognition tasks, it is necessary to ensure that the number of line text images in which each character in the recognition dictionary appears in different scenes needs to be greater than 200 (for example, if there are 5 words in the dictionary, each word needs to appear in more than 200 pictures, then The minimum required number of images should be between 200-1000), so that the basic recognition effect can be guaranteed.\n",
"\n",
"\n",
"**1.8 How to get more data when the amount of training data is small?**\n",
"\n",
"**A**: When the amount of training data is small, you can try the following three ways to get more data: (1) Collect more training data manually, the most direct and effective way. (2) Basic image processing or transformation based on PIL and opencv. For example, the three modules of ImageFont, Image, ImageDraw in PIL write text into the background, opencv's rotating affine transformation, Gaussian filtering and so on. (3) Synthesize data using data generation algorithms, such as algorithms such as pix2pix.\n",
"\n",
"\n",
"**1.9 How to replace the backbone of text detection/recognition?**\n",
"\n",
"**A**: Whether for text detection or text recognition, the choice of backbone network is a trade-off between prediction accuracy and prediction efficiency. Generally, a larger backbone network such as ResNet101_vd gives more accurate detection or recognition but a longer prediction time, while a smaller backbone such as MobileNetV3_small_x0_35 predicts faster but with noticeably lower accuracy. Fortunately, the detection and recognition accuracy of different backbone networks is positively correlated with their accuracy on the ImageNet 1000-class image classification task. PaddleClas, the PaddlePaddle image classification suite, summarizes 23 series of classification network structures such as ResNet_vd, Res2Net, HRNet, MobileNetV3 and GhostNet, together with their top-1 accuracy on the ImageNet classification task, their prediction latency on GPU (V100 and T4) and CPU (Snapdragon 855), and download links for the corresponding 117 pre-trained models.\n",
"\n",
"(1) The replacement of the text detection backbone network is mainly to determine 4 stages similar to ResNet to facilitate the integration of subsequent detection heads similar to FPN. In addition, for the text detection problem, the classification pre-training model trained by ImageNet can accelerate the convergence and improve the effect.\n",
"\n",
"(2) The replacement of the backbone network for text recognition requires attention to the drop position of the network width and height stride. Since text recognition generally has a large ratio of width to height, the frequency of height reduction is less, and the frequency of width reduction is more. You can refer to [Changes to the MobileNetV3 backbone network in PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/ppocr/modeling/backbones/rec_mobilenet_v3.py)。\n",
"\n",
"\n",
"**1.10 How to finetune the detection model, such as freezing the previous layer or learning with a small learning rate for some layers?**\n",
"\n",
"**A**: If you freeze certain layers, you can set the stop_gradient property of the variable to True, so that all the parameters before calculating this variable will not be updated, refer to: https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/faq/train_cn.html#id4\n",
"\n",
"If learning with a smaller learning rate for some layers is not very convenient in the static graph, one method is to set a fixed learning rate for the weight attribute when the parameters are initialized, refer to: https://www.paddlepaddle.org.cn/documentation/docs/en/develop/api/paddle/fluid/param_attr/ParamAttr_cn.html#paramattr\n",
"\n",
"In fact, our experiment found that directly loading the model to fine-tune without setting different learning rates of certain layers, the effect is also good.\n",
"\n",
"**1.11 In the preprocessing part of DB, why should the length and width of the picture be processed into multiples of 32?**\n",
"\n",
"**A**: It is related to the stride of the network downsampling. Take the resnet backbone network under inspection as an example. After the image is input to the network, it needs to be downsampled by 2 times for 5 times, a total of 32 times. Therefore, it is recommended that the input image size be a multiple of 32.\n",
"\n",
"\n",
"**1.12 In the PP-OCR series models, why does the backbone network for text detection not use SEBlock?**\n",
"\n",
"**A**: The SE module is an important module of the MobileNetV3 network. Its purpose is to estimate the importance of each feature channel of the feature map, assign weights to each feature of the feature map, and improve the expressive ability of the network. However, for text detection, the resolution of the input network is relatively large, generally 640\\*640. It is difficult to use the SE module to estimate the importance of each feature channel of the feature map. The network improvement ability is limited, but the module is relatively time-consuming. In the PP-OCR system, the backbone network for text detection does not use the SE module. Experiments also show that when the SE module is removed, the size of the ultra-lightweight model can be reduced by 40%, and the text detection effect is basically not affected. For details, please refer to the PP-OCR technical article, https://arxiv.org/abs/2009.09941.\n",
"\n",
"\n",
"**1.13 The PP-OCR detection effect is not good, how to optimize it?**\n",
"\n",
"**A**: Specific analysis of specific issues:\n",
"- If the detection effect is not available on your scene, the first choice is to do finetune training on your data;\n",
"- If the image is too large and the text is too dense, it is recommended not to over-compress the image. You can try to modify the resize logic of the detection preprocessing to prevent the image from being over-compressed;\n",
"- The size of the detection frame is too close to the text or the detection frame is too large, you can adjust the db_unclip_ratio parameter, increasing the parameter can enlarge the detection frame, and reducing the parameter can reduce the size of the detection frame;\n",
"- There are many missed detection problems in the detection frame, which can reduce the threshold parameter det_db_box_thresh for DB detection to prevent some detection frames from being filtered out. You can also try to set det_db_score_mode to'slow';\n",
"- Other methods can choose use_dilation as True to expand the feature map of the detection output. In general, the effect will be improved.\n",
"\n",
"\n",
"## 2. FAQ about Text Detection and Prediction\n",
"\n",
"**2.1 In DB, some boxes are too pasted with text, but some corners of the text are removed to affect the recognition. Is there any way to alleviate this problem?**\n",
"\n",
"**A**: The post-processing parameter [unclip_ratio](https://github.com/PaddlePaddle/PaddleOCR/blob/d80afce9b51f09fd3d90e539c40eba8eb5e50dd6/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L52) can be appropriately increased. the larger the parameter, the larger the text box.\n",
"\n",
"\n",
"**2.2 Why does the PaddleOCR detection prediction only support one image test? That is, test_batch_size_per_card=1**\n",
"\n",
"**A**: When predicting, the image is scaled in equal proportions, the longest side is 960, and the length and width of different images after scaling in equal proportions are inconsistent, and they cannot form a batch, so set test_batch_size to 1.\n",
"\n",
"\n",
"**2.3 Accelerate PaddleOCR's text detection model prediction on the CPU?**\n",
"\n",
"**A**: x86 CPU can use mkldnn (OneDNN) for acceleration; enable [enable_mkldnn](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py#L105) Parameters. In addition, in conjunction with increasing the number of threads used for prediction on the CPU, [num_threads](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py#L106) can effectively speed up the prediction speed on the CPU.\n",
"\n",
"**2.4 Accelerate PaddleOCR's text detection model prediction on GPU?**\n",
"\n",
"**A**: TensorRT is recommended for GPU accelerated prediction.\n",
"- 1. Download the Paddle installation package or prediction library with TensorRT from [link](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html).\n",
"- 2. Download the [TensorRT](https://developer.nvidia.com/tensorrt) from the Nvidia official website. Note that the downloaded TensorRT version is consistent with the TensorRT version compiled in the paddle installation package.\n",
"- 3. Set the environment variable `LD_LIBRARY_PATH` to point to the lib folder of TensorRT\n",
"```\n",
"export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<TensorRT-${version}/lib>\n",
"```\n",
"- 4. Enable [tensorrt option](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L38).\n",
"\n",
"**2.5 How to deploy PaddleOCR model on the mobile terminal?**\n",
"\n",
"**A**: Flying Oar Paddle has a special tool for mobile deployment [PaddleLite](https://github.com/PaddlePaddle/Paddle-Lite), and PaddleOCR provides DB+CRNN as the demo android arm deployment code , Refer to [link](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/deploy/lite/readme.md).\n",
"\n",
"\n",
"**2.6 How to use PaddleOCR multi-process prediction?**\n",
"\n",
"**A**: PaddleOCR recently added [Multi-Process Predictive Control Parameters](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L111), `use_mp` indicates whether When using multiple processes, `total_process_num` indicates the number of processes when using multiple processes. For specific usage, please refer to [document](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/doc/doc_ch/inference.md#1-%E8%B6%85%E8%BD%BB%E9%87%8F%E4%B8%AD%E6%96%87ocr%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86).\n",
"\n",
"**2.7 Video memory explosion and memory leak during prediction?**\n",
"\n",
"**A**: If it is the prediction of the training model, the video memory is not enough because the model is too large or the input image is too large, you can refer to the code and add paddle.no_grad() before the main function runs to reduce the video memory usage. If the memory usage of the inference model is too high, you can add [config.enable_memory_optim()](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L267) to reduce the memory usage when configuring Config.\n",
"\n",
"In addition, regarding the memory leak when using Paddle to predict, it is recommended to install the latest version of paddle. The memory leak has been fixed."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text Detection Algorithm Theory\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1 Text Detection\n",
"\n",
"The task of text detection is to find out the position of text in an image or video. Different from the task of target detection, target detection must not only solve the positioning problem, but also solve the problem of target classification.\n",
"\n",
"The manifestation of text in images can be regarded as a kind of 'target', and general target detection methods are also suitable for text detection. From the perspective of the task itself:\n",
"\n",
"- Target detection: Given an image or video, find out the location (box) of the target, and give the target category;\n",
"- Text detection: Given an input image or video, find out the area of the text, which can be a single character position or a whole text line position;\n",
"\n",
"\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/af2d8eca913a4d5a968945ae6cac180b009c6cc94abc43bfbaf1ba6a3de98125\" width=\"400\" ></center>\n",
"\n",
"<br><center>Figure 1: Schematic diagram of target detection</center>\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/400b9100573b4286b40b0a668358bcab9627f169ab934133a1280361505ddd33\" width=\"1000\" ></center>\n",
"\n",
"<br><center>Figure 2: Schematic diagram of text detection</center>\n",
"\n",
"Object detection and text detection are both \"location\" problems. But text detection does not need to classify the target, and the shape of the text is complex and diverse.\n",
"\n",
"The current text detection is generally natural scene text detection. The difficulty lies in:\n",
"\n",
"1. Text in natural scenes is diverse: text detection is affected by text color, size, font, shape, direction, language, and text length;\n",
"2. Complex background and interference; text detection is affected by image distortion, blur, low resolution, shadow, brightness and other factors;\n",
"3. Dense or overlapping text will affect text detection;\n",
"4. Local consistency of text: a small part of a text line can also be considered as independent text.\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/072f208f2aff47e886cf2cf1378e23c648356686cf1349c799b42f662d8ced00\"\n",
"width=\"1000\" ></center>\n",
"\n",
"<br><center>Figure 3: Text detection scene</center>\n",
"\n",
"In response to the above problems, many text detection algorithms based on deep learning have been derived to solve the problem of text detection in natural scenes. These methods can be divided into regression-based and segmentation-based text detection methods.\n",
"\n",
"The next section will briefly introduce the classic text detection algorithm based on deep learning technology."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## 2 Introduction to Text Detection Methods\n",
"\n",
"\n",
"In recent years, deep learning-based text detection algorithms have emerged one after another. These methods can be roughly divided into two categories:\n",
"1. Regression-based text detection method\n",
"2. Text detection method based on segmentation\n",
"\n",
"\n",
"This section screens the commonly used text detection methods from 2017 to 2021, and is classified according to the above two types of methods as shown in the following table:\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/22314238b70b486f942701107ffddca48b87235a473c4d8db05b317f132daea0\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 4: Text detection algorithm</center>\n",
"\n",
"\n",
"### 2.1 Regression-based Text Detection\n",
"\n",
"The method based on the regression text detection method is similar to the method of the target detection algorithm. The text detection method has only two categories. The text in the image is regarded as the target to be detected, and the rest is regarded as the background.\n",
"\n",
"#### 2.1.1 Horizontal Text Detection\n",
"\n",
"Early text detection algorithms based on deep learning are improved from the target detection method and support horizontal text detection. For example, the TextBoxes algorithm is improved based on the SSD algorithm, and the CTPN is improved based on the two-stage target detection Fast-RCNN algorithm.\n",
"\n",
"In TextBoxes[1], the algorithm is adjusted according to the one-stage target detector SSD, and the default text box is changed to a quadrilateral that adapts to the specifications of the text direction and aspect ratio, providing an end-to-end training text detection method without complicated Post-processing.\n",
"-Use a pre-selection box with a larger aspect ratio\n",
"-The convolution kernel has been changed from 3x3 to 1x5, which is more suitable for long text detection\n",
"-Adopt multi-scale input\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/3864ccf9d009467cbc04225daef0eb562ac0c8c36f9b4f5eab036c319e5f05e7\" width=\"1000\" ></center>\n",
"<br><center>Figure 5: Textbox frame diagram</center>\n",
"\n",
"CTPN[3] is based on the Fast-RCNN algorithm, expands the RPN module and designs a CRNN-based module to allow the entire network to detect text sequences from convolutional features. The two-stage method obtains more accurate feature positioning through ROI Pooling. But TextBoxes and CTPN only support the detection of horizontal text.\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/452833c2016e4cf7b35291efd09740c13c4bfb8f7c56446b8f7a02fc7eb3e901\" width=\"1000\" ></center>\n",
"<br><center>Figure 6: CTPN frame diagram</center>\n",
"\n",
"#### 2.1.2 Any Angle Text Detection\n",
"\n",
"TextBoxes++[2] is improved on the basis of TextBoxes, and supports the detection of text at any angle. Structurally, unlike TextBoxes, TextBoxes++ detects multi-angle text. First, modify the aspect ratio of the preselection box and adjust the aspect ratio to 1, 2, 3, 5, 1/2, 1/3, 1/5. The second is to change the $1*5$ convolution kernel to $3*5$ to better learn the characteristics of the slanted text; finally, the representation information of the output rotating box of TextBoxes++.\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/ae96e3acbac04be296b6d54a4d72e5881d592fcc91f44882b24bc7d38b9d2658\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 7: TextBoxes++ frame diagram</center>\n",
"\n",
"\n",
"EAST [4] proposed a two-stage text detection method for the location of slanted text, including FCN feature extraction and NMS part. EAST proposes a new text detection pipline structure, which can be trained end-to-end and supports the detection of text in any orientation, and has the characteristics of simple structure and high performance. FCN supports the output of inclined rectangular frame and horizontal frame, and the output format can be freely selected.\n",
"-If the output detection shape is RBox, output Box rotation angle and AABB text shape information, AABB represents the offset to the top, bottom, left, and right sides of the text box. RBox can rotate rectangular text.\n",
"-If the output detection box is a four-point box, the last dimension of the output is 8 numbers, which represents the position offset from the four corner vertices of the quadrilateral. This output method can predict irregular quadrilateral text.\n",
"\n",
"Considering that the text box output by FCN is relatively redundant, for example, the box generated by the adjacent pixels of a text area has a high degree of coincidence. But it is not the detection frame generated by the same text, and the degree of coincidence is very small. Therefore, EAST proposes to merge the prediction boxes by row first. Finally, filter the remaining quads with the original NMS.\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/d7411ada08714adab73fa0edf7555a679327b71e29184446a33d81cdd910e4fc\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 8: EAST frame diagram</center>\n",
"\n",
"\n",
"MOST [15] proposed that the TFAM module dynamically adjusts the receptive field of coarse-grained detection results, and also proposed that PA-NMS combines reliable detection and prediction results based on location information. In addition, the Instance-wise IoU loss function is also proposed during training, which is used to balance training to handle text instances of different scales. This method can be combined with the EAST method, and has better detection effect and performance in detecting texts with extreme aspect ratios and different scales.\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/73052d9439714bba86ffe4a959d58c523b07baf3f1d74882b4517e71f5a645fe\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 9: MOST frame diagram</center>\n",
"\n",
"\n",
"#### 2.1.3 Curved Text Detection\n",
"\n",
"Using regression to solve the problem of curved text detection, a simple idea is to describe the boundary polygon of the curved text with multi-point coordinates, and then directly predict the vertex coordinates of the polygon.\n",
"\n",
"CTD [6] proposed to directly predict the boundary polygons of 14 vertices of curved text. The network uses the Bi-LSTM [13] layer to refine the prediction coordinates of the vertices, and realizes the detection of curved text based on the regression method.\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/6e33d76ebb814cac9ebb2942b779054af160857125294cd69481680aca2fa98a\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 10: CTD frame diagram</center>\n",
"\n",
"\n",
"\n",
"LOMO [19] proposes iterative optimization of text localization features to obtain finer text localization for long text and curved text problems. The method consists of three parts: the coordinate regression module DR, the iterative optimization module IRM and the arbitrary shape expression module SEM. They are used to generate approximate text regions, iteratively optimize text localization features, and predict text regions, text centerlines, and text boundaries. Iteratively optimized text features can better solve the problem of long text localization and obtain more accurate text area localization.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/e90adf3ca25a45a0af0b84a181fbe2c4954be1fcca8f4049957128548b7131ef\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 11: LOMO frame diagram</center>\n",
"\n",
"\n",
"Contournet [18] is based on the proposed modeling of text contour points to obtain a curved text detection frame. This method first uses Adaptive-RPN to obtain the proposal features of the text area, and then designs a local orthogonal texture perception LOTM module to learn horizontal and vertical textures. The feature is represented by contour points. Finally, by considering the feature responses in two orthogonal directions at the same time, the Point Re-Scoring algorithm can effectively filter out the prediction of strong unidirectional or weak orthogonal activation, and the final text contour can be used as a A group of high-quality contour points are shown.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/1f59ab5db899412f8c70ba71e8dd31d4ea9480d6511f498ea492c97dd2152384\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 12: Contournet frame diagram</center>\n",
"\n",
"\n",
"PCR [14] proposed progressive coordinate regression to deal with curved text detection. The problem is divided into three stages. Firstly, the text area is roughly detected, and the text box is obtained. In addition, the corners of the smallest bounding box of the text are predicted by the designed Contour Localization Mechanism. Coordinates, and then the curved text is predicted by superimposing multiple CLM modules and RCLM modules. This method uses the text contour information aggregation to obtain a rich text contour feature representation, which can not only suppress the influence of redundant noise points on the coordinate regression, but also locate the text area more accurately.\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/c677c4602cee44999ae4b38bd780b69795887f2ae10747968bb084db6209b6cc\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 13: PCR frame diagram</center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### 2.2 Text Detection Based on Segmentation\n",
"\n",
"Although the regression-based method has achieved good results in text detection, it is often difficult to obtain a smooth text surrounding curve for solving curved text, and the model is more complex and does not have performance advantages. Therefore, researchers proposed a text segmentation method based on image segmentation. First, classify at the pixel level, determine whether each pixel belongs to a text target, obtain the probability map of the text area, and obtain the enclosing curve of the text segmentation area through post-processing. .\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/fb9e50c410984c339481869ba11c1f39f80a4d74920b44b084601f2f8a23099f\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 14: Schematic diagram of text segmentation algorithm</center>\n",
"\n",
"\n",
"Such methods are usually based on segmentation to achieve text detection, and segmentation-based methods have natural advantages for text detection with irregular shapes. The main idea of ​​the segmentation-based text detection method is to obtain the text area in the image through the segmentation method, and then use opencv, polygon and other post-processing to obtain the minimum enclosing curve of the text area.\n",
"\n",
"\n",
"Pixellink [7] uses a segmentation method to solve the text detection problem. The segmentation object is a text area. The pixels in the same text line (word) are linked together to segment the text, and the text bounding box is directly extracted from the segmentation result without a position. Regression can achieve the effect of text detection based on regression. However, there is a problem with the segmentation-based method. For texts with similar positions, the text segmentation area is prone to \"sticky\" problems. Wu, Yue et al. [8] proposed to separate the text while learning the boundary position of the text to better distinguish the text area. In addition, Tian et al. [9] proposed to map the pixels of the same text to the mapping space. In the mapping space, the distance of the mapping vector of the unified text is close, and the distance of the mapping vector of different texts becomes longer.\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/462b5e1472824452a2c530939cda5e59ada226b2d0b745d19dd56068753a7f97\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 15: PixelLink frame diagram</center>\n",
"\n",
"For the multi-scale problem of text detection, MSR [20] proposes to extract multiple scale features of the same image, then merge these features and up-sample to the original image size. The network finally predicts the text center area and each point of the text center area. The x-coordinate offset and y-coordinate offset of the nearest boundary point can finally get the contour coordinate set of the text area.\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/9597efd68a224d60b74d7c51c99f7ff0ba9939e5cdb84fb79209b7e213f7d039\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 16: MSR frame diagram</center>\n",
" \n",
"Aiming at the problem of segmentation-based text algorithms that are difficult to distinguish between adjacent texts, PSENet [10] proposed a progressive scale expansion network to learn text segmentation regions, predict text regions with different shrinkage ratios, and expand the detected text regions one by one. The essence of this method is The above is a variant of the boundary learning method, which can effectively solve the problem of detecting adjacent text of any shape.\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/fa870b69a2a5423cad7422f64c32e0645dfc31a4ecc94a52832cf8742cded5ba\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 17: PSENet frame diagram</center>\n",
"\n",
"Assume that PSENet post-processing uses 3 kernels of different scales, as shown in the above figure s1, s2, and s3. First, start from the minimum kernel s1, calculate the connected domain of the text segmentation area, and get (b), and then expand the connected domain along the up, down, left, and right, and classify the pixels that belong to s2 but not to s1 in the expanded area. When encountering conflicts, the principle of \"first come first served\" is adopted, and the scale expansion operation is repeated, and finally independent segmented regions of different text lines can be obtained.\n",
"\n",
"\n",
"Seglink++ [17] proposed a characterization of the attraction and repulsion relationship between text block units for curved text and dense text, and then designed a minimum spanning tree algorithm to combine the units to obtain the final text detection box, and An instance-aware loss function is proposed so that the Seglink++ method can be trained end-to-end.\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/1a16568361c0468db537ac25882eed096bca83f9c1544a92aee5239890f9d8d9\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 18: Seglink++ frame diagram</center>\n",
"\n",
"Although the segmentation method solves the problem of curved text detection, complex post-processing logic and prediction speed are also goals that need to be optimized.\n",
"\n",
"PAN [11] aims at the problem of slow text detection and prediction speed, and improves the performance of the algorithm from the aspects of network design and post-processing. First, PAN uses the lightweight ResNet18 as the Backbone, and also designs the lightweight feature enhancement module FPEM and feature fusion module FFM to enhance the features extracted by the Backbone. In terms of post-processing, a pixel clustering method is used to merge pixels whose distance from the kernel is less than the threshold d along the predicted text center (kernel). PAN guarantees high accuracy while having faster prediction speed.\n",
"\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/a76771f91db246ee8be062f96fa2a8abc7598dd87e6d4755b63fac71a4ebc170\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 19: PAN frame diagram</center>\n",
"\n",
"DBNet [12] aimed at the problem of time-consuming post-processing that requires the use of thresholds for binarization based on segmentation methods. It proposed a learnable threshold and cleverly designed a binarization function that approximates the step function to make the segmentation The network can learn the threshold of text segmentation end-to-end during training. The automatic adjustment of the threshold not only improves accuracy, but also simplifies post-processing and improves the performance of text detection.\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/0d6423e3c79448f8b09090cf2dcf9d0c7baa0f6856c645808502678ae88d2917\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 20: DB frame diagram</center>\n",
"\n",
"FCENet [16] proposed to express the text enclosing curve with Fourier transform parameters. Since the Fourier coefficient representation can theoretically fit any closed curve, by designing a suitable model to predict an arbitrary shape text enclosing box based on Fourier transform In this way, the detection accuracy of highly curved text instances in natural scene text detection is improved.\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/45e9a374d97145689a961977f896c8f9f470a66655234c1498e1c8477e277954\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 21: FCENet frame diagram</center>\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3 Summary\n",
"\n",
"This section introduces the development of the field of text detection in recent years, including text detection methods based on regression and segmentation, and respectively enumerates and introduces the method ideas of some classic papers. The next section takes the PaddleOCR open source library as an example to introduce in detail the algorithm principles and core code implementation of DBNet."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reference\n",
"1. Liao, Minghui, et al. \"Textboxes: A fast text detector with a single deep neural network.\" Thirty-first AAAI conference on artificial intelligence. 2017.\n",
"2. Liao, Minghui, Baoguang Shi, and Xiang Bai. \"Textboxes++: A single-shot oriented scene text detector.\" IEEE transactions on image processing 27.8 (2018): 3676-3690.\n",
"3. Tian, Zhi, et al. \"Detecting text in natural image with connectionist text proposal network.\" European conference on computer vision. Springer, Cham, 2016.\n",
"4. Zhou, Xinyu, et al. \"East: an efficient and accurate scene text detector.\" Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017.\n",
"5. Wang, Fangfang, et al. \"Geometry-aware scene text detection with instance transformation network.\" Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.\n",
"6. Yuliang, Liu, et al. \"Detecting curve text in the wild: New dataset and new solution.\" arXiv preprint arXiv:1712.02170 (2017).\n",
"7. Deng, Dan, et al. \"Pixellink: Detecting scene text via instance segmentation.\" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018.\n",
"8. Wu, Yue, and Prem Natarajan. \"Self-organized text detection with minimal post-processing via border learning.\" Proceedings of the IEEE International Conference on Computer Vision. 2017.\n",
"9. Tian, Zhuotao, et al. \"Learning shape-aware embedding for scene text detection.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.\n",
"10. Wang, Wenhai, et al. \"Shape robust text detection with progressive scale expansion network.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.\n",
"11. Wang, Wenhai, et al. \"Efficient and accurate arbitrary-shaped text detection with pixel aggregation network.\" Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.\n",
"12. Liao, Minghui, et al. \"Real-time scene text detection with differentiable binarization.\" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 07. 2020.\n",
"13. Hochreiter, Sepp, and Jürgen Schmidhuber. \"Long short-term memory.\" Neural computation 9.8 (1997): 1735-1780.\n",
"14. Dai, Pengwen, et al. \"Progressive Contour Regression for Arbitrary-Shape Scene Text Detection.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.\n",
"15. He, Minghang, et al. \"MOST: A Multi-Oriented Scene Text Detector with Localization Refinement.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.\n",
"16. Zhu, Yiqin, et al. \"Fourier contour embedding for arbitrary-shaped text detection.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.\n",
"17. Tang, Jun, et al. \"Seglink++: Detecting dense and arbitrary-shaped scene text by instance-aware component grouping.\" Pattern recognition 96 (2019): 106954.\n",
"18. Wang, Yuxin, et al. \"Contournet: Taking a further step toward accurate arbitrary-shaped scene text detection.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.\n",
"19. Zhang, Chengquan, et al. \"Look more than once: An accurate detector for text of arbitrary shapes.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.\n",
"20. Xue C, Lu S, Zhang W. Msr: Multi-scale shape regression for scene text detection[J]. arXiv preprint arXiv:1901.02596, 2019. \n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Text Recognition Algorithm Theory\n",
"\n",
"This chapter mainly introduces the theoretical knowledge of text recognition algorithms, including background introduction, algorithm classification and some classic paper ideas.\n",
"\n",
"Through the study of this chapter, you can master:\n",
"\n",
"1. The goal of text recognition\n",
"\n",
"2. Classification of text recognition algorithms\n",
"\n",
"3. Typical ideas of various algorithms\n",
"\n",
"\n",
"## 1 Background Introduction\n",
"\n",
"Text recognition is a subtask of OCR (Optical Character Recognition), and its task is to recognize the text content of a fixed area. In the two-stage method of OCR, it is followed by text detection and converts image information into text information.\n",
"\n",
"Specifically, the model inputs a positioned text line, and the model predicts the text content and confidence level in the picture. The visualization results are shown in the following figure:\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/a7c3404f778b489db9c1f686c7d2ff4d63b67c429b454f98b91ade7b89f8e903 width=\"600\"></center>\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/e72b1d6f80c342ac951d092bc8c325149cebb3763ec849ec8a2f54e7c8ad60ca width=\"600\"></center>\n",
"<br><center>Figure 1: Visualization results of model predicttion</center>\n",
"\n",
"There are many application scenarios for text recognition, including document recognition, road sign recognition, license plate recognition, industrial number recognition, etc. According to actual scenarios, text recognition tasks can be divided into two categories: **Regular text recognition** and **Irregular Text recognition**.\n",
"\n",
"* Regular text recognition: mainly refers to printed fonts, scanned text, etc., and the text is considered to be roughly in the horizontal position\n",
"\n",
"* Irregular text recognition: It often appears in natural scenes, and due to the huge differences in text curvature, direction, deformation, etc., the text is often not in the horizontal position, and there are problems such as bending, occlusion, and blurring.\n",
"\n",
"\n",
"The figure below shows the data patterns of IC15 and IC13, which represent irregular text and regular text respectively. It can be seen that irregular text often has problems such as distortion, blurring, and large font differences. It is closer to the real scene and is also more challenging.\n",
"\n",
"Therefore, the current major algorithms are trying to obtain higher indicators on irregular data sets.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/bae4fce1370b4751a3779542323d0765a02a44eace7b44d2a87a241c13c6f8cf width=\"400\">\n",
"<br><center>Figure 2: IC15 picture sample (irregular text)</center>\n",
"<img src=https://ai-studio-static-online.cdn.bcebos.com/b55800d3276f4f5fad170ea1b567eb770177fce226f945fba5d3247a48c15c34 width=\"400\"></center>\n",
"<br><center>Figure 3: IC13 picture sample (rule text)</center>\n",
"\n",
"\n",
"When comparing the capabilities of different recognition algorithms, they are often compared on these two types of public data sets. Comparing the effects on multiple dimensions, currently the more common English benchmark data sets are classified as follows:\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/4d0aada261064031a16816b39a37f2ff6af70dbb57004cb7a106ae6485f14684 width=\"600\"></center>\n",
"<br><center>Figure 4: Common English benchmark data sets</center>\n",
"\n",
"## 2 Text Recognition Algorithm Classification\n",
"\n",
"In the traditional text recognition method, the task is divided into 3 steps, namely image preprocessing, character segmentation and character recognition. It is necessary to model a specific scene, and it will become invalid once the scene changes. In the face of complex text backgrounds and scene changes, methods based on deep learning have better performance.\n",
"\n",
"Most existing recognition algorithms can be represented by the following unified framework, and the algorithm flow is divided into 4 stages:\n",
"\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/a2750f4170864f69a3af36fc13db7b606d851f2f467d43cea6fbf3521e65450f)\n",
"\n",
"\n",
"We have sorted out the mainstream algorithm categories and main papers, refer to the following table:\n",
"<center>\n",
" \n",
"| Algorithm category | Main ideas | Main papers |\n",
"| -------- | --------------- | -------- |\n",
"| Traditional algorithm | Sliding window, character extraction, dynamic programming |-|\n",
"| ctc | Based on ctc method, sequence is not aligned, faster recognition | CRNN, Rosetta |\n",
"| Attention | Attention-based method, applied to unconventional text | RARE, DAN, PREN |\n",
"| Transformer | Transformer-based method | SRN, NRTR, Master, ABINet |\n",
"| Correction | The correction module learns the text boundary and corrects it to the horizontal direction | RARE, ASTER, SAR |\n",
"| Segmentation | Based on the method of segmentation, extract the character position and then do classification | Text Scanner, Mask TextSpotter |\n",
" \n",
"</center>\n",
"\n",
"\n",
"### 2.1 Regular Text Recognition\n",
"\n",
"\n",
"There are two mainstream algorithms for text recognition, namely the CTC (Conectionist Temporal Classification)-based algorithm and the Sequence2Sequence algorithm. The difference is mainly in the decoding stage.\n",
"\n",
"The CTC-based algorithm connects the encoded sequence to the CTC for decoding; the Sequence2Sequence-based method connects the sequence to the Recurrent Neural Network (RNN) module for cyclic decoding. Both methods have been verified to be effective and mainstream. Two major practices.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/f64eee66e4a6426f934c1befc3b138629324cf7360c74f72bd6cf3c0de9d49bd width=\"600\"></center>\n",
"<br><center>Figure 5: Left: CTC-based method, right: Sequece2Sequence-based method </center>\n",
"\n",
"\n",
"#### 2.1.1 Algorithm Based on CTC\n",
"\n",
"The most typical algorithm based on CTC is CRNN (Convolutional Recurrent Neural Network) [1], and its feature extraction part uses mainstream convolutional structures, commonly used ResNet, MobileNet, VGG, etc. Due to the particularity of text recognition tasks, there is a large amount of contextual information in the input data. The convolution kernel characteristics of convolutional neural networks make it more focused on local information and lack long-dependent modeling capabilities, so it is difficult to use only convolutional networks. Dig into the contextual connections between texts. In order to solve this problem, the CRNN text recognition algorithm introduces the bidirectional LSTM (Long Short-Term Memory) to enhance the context modeling. Experiments prove that the bidirectional LSTM module can effectively extract the context information in the picture. Finally, the output feature sequence is input to the CTC module, and the sequence result is directly decoded. This structure has been verified to be effective and widely used in text recognition tasks. Rosetta [2] is a recognition network proposed by FaceBook, which consists of a fully convolutional model and CTC. Gao Y [3] et al. used CNN convolution instead of LSTM, with fewer parameters, and the performance improvement accuracy was the same.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/d3c96dd9e9794fddb12fa16f926abdd3485194f0a2b749e792e436037490899b width=\"600\"></center>\n",
"<center>Figure 6: CRNN structure diagram </center>\n",
"\n",
"\n",
"#### 2.1.2 Sequence2Sequence algorithm\n",
"\n",
"In the Sequence2Sequence algorithm, the Encoder encodes all input sequences into a unified semantic vector, which is then decoded by the Decoder. In the decoding process of the decoder, the output of the previous moment is continuously used as the input of the next moment, and the decoding is performed in a loop until the stop character is output. The general encoder is an RNN. For each input word, the encoder outputs a vector and hidden state, and uses the hidden state for the next input word to get the semantic vector in a loop; the decoder is another RNN, which receives the encoder Output a vector and output a series of words to create a transformation. Inspired by Sequence2Sequence in the field of translation, Shi [4] proposed an attention-based codec framework to recognize text. In this way, rnn can learn character-level language models hidden in strings from training data.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/f575333696b7438d919975dc218e61ccda1305b638c5497f92b46a7ec3b85243 width=\"400\" hight=\"500\"></center>\n",
"<center>Figure 7: Sequence2Sequence structure diagram </center>\n",
"\n",
"The above two algorithms have very good effects on regular text, but due to the limitations of network design, this type of method is difficult to solve the task of irregular text recognition of bending and rotation. In order to solve such problems, some algorithm researchers have proposed a series of improved algorithms on the basis of the above two types of algorithms.\n",
"\n",
"### 2.2 Irregular Text Recognition\n",
"\n",
"* Irregular text recognition algorithms can be divided into 4 categories: correction-based methods; Attention-based methods; segmentation-based methods; and Transformer-based methods.\n",
"\n",
"#### 2.2.1 Correction-based Method\n",
"\n",
"The correction-based method uses some visual transformation modules to convert irregular text into regular text as much as possible, and then uses conventional methods for recognition.\n",
"\n",
"The RARE [4] model first proposed a correction scheme for irregular text. The entire network is divided into two main parts: a spatial transformation network STN (Spatial Transformer Network) and a recognition network based on Sequence2Squence. Among them, STN is the correction module. Irregular text images enter STN and are transformed into a horizontal image through TPS (Thin-Plate-Spline). This transformation can correct curved and transmissive text to a certain extent, and send it to sequence recognition after correction. Network for decoding.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/66406f89507245e8a57969b9bed26bfe0227a8cf17a84873902dd4a464b97bb5 width=\"600\"></center>\n",
"<center>Figure 8: RARE structure diagram </center>\n",
"\n",
"The RARE paper pointed out that this method has greater advantages in irregular text data sets, especially comparing the two data sets CUTE80 and SVTP, which are more than 5 percentage points higher than CRNN, which proves the effectiveness of the correction module. Based on this [6] also combines a text recognition system with a spatial transformation network (STN) and an attention-based sequence recognition network.\n",
"\n",
"Correction-based methods have better migration. In addition to Attention-based methods such as RARE, STAR-Net [5] applies correction modules to CTC-based algorithms, which is also a good improvement compared to traditional CRNN.\n",
"\n",
"#### 2.2.2 Attention-based Method\n",
"\n",
"The Attention-based method mainly focuses on the correlation between the parts of the sequence. This method was first proposed in the field of machine translation. It is believed that the result of the current word in the process of text translation is mainly affected by certain words, so it needs to be The decisive word has greater weight. The same is true in the field of text recognition. When decoding the encoded sequence, each step selects the appropriate context to generate the next state, which is conducive to obtaining more accurate results.\n",
"\n",
"R^2AM [7] first introduced Attention into the field of text recognition. The model first extracts the encoded image features from the input image through a recursive convolutional layer, and then uses the implicitly learned character-level language statistics to decode the output through a recurrent neural network character. In the decoding process, the Attention mechanism is introduced to realize soft feature selection to make better use of image features. This selective processing method is more in line with human intuition.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/a64ef10d4082422c8ac81dcda4ab75bf1db285d6b5fd462a8f309240445654d5 width=\"600\"></center>\n",
"<center>Figure 9: R^2AM structure drawing </center>\n",
"\n",
"A large number of algorithms will be explored and updated in the field of Attention in the future. For example, SAR[8] extends 1D attention to 2D attention. The RARE mentioned in the correction module is also a method based on Attention. Experiments prove that the Attention-based method has a good accuracy improvement compared with the CTC method.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/4e2507fb58d94ec7a9b4d17151a986c84c5053114e05440cb1e7df423d32cb02 width=\"600\"></center>\n",
"<center>Figure 10: Attention diagram</center>\n",
"\n",
"\n",
"#### 2.2.3 Method Based on Segmentation\n",
"\n",
"The method based on segmentation is to treat each character of the text line as an independent individual, and it is easier to recognize the segmented individual characters than to recognize the entire text line after correction. It attempts to locate the position of each character in the input text image, and applies a character classifier to obtain these recognition results, simplifying the complex global problem into a local problem solving, and it has a relatively good effect in the irregular text scene. However, this method requires character-level labeling, and there is a certain degree of difficulty in data acquisition. Lyu [9] et al. proposed an instance word segmentation model for word recognition, which uses a method based on FCN (Fully Convolutional Network) in its recognition part. [10] Considering the problem of text recognition from a two-dimensional perspective, a character attention FCN is designed to solve the problem of text recognition. When the text is bent or severely distorted, this method has better positioning results for both regular and irregular text.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/fd3e8ef0d6ce4249b01c072de31297ca5d02fc84649846388f890163b624ff10 width=\"800\"></center>\n",
"<center>Figure 11: Mask TextSpotter structure diagram </center>\n",
"\n",
"\n",
"\n",
"#### 2.2.4 Transformer-based Method\n",
"\n",
"With the rapid development of Transformer, both classification and detection fields have verified the effectiveness of Transformer in visual tasks. As mentioned in the regular text recognition part, CNN has limitations in long-dependency modeling. The Transformer structure just solves this problem. It can focus on global information in the feature extractor and can replace additional context modeling modules (LSTM ).\n",
"\n",
"Part of the text recognition algorithm uses Transformer's Encoder structure and convolution to extract sequence features. The Encoder is composed of multiple blocks stacked by MultiHeadAttentionLayer and Positionwise Feedforward Layer. The self-attention in MulitHeadAttention uses matrix multiplication to simulate the timing calculation of RNN, breaking the barrier of long-term dependence on timing in RNN. There are also some algorithms that use Transformer's Decoder module to decode, which can obtain stronger semantic information than traditional RNNs, and parallel computing has higher efficiency.\n",
"\n",
"The SRN[11] algorithm connects the Encoder module of Transformer to ResNet50 to enhance the 2D visual features. A parallel attention module is proposed, which uses the reading order as a query, making the calculation independent of time, and finally outputs the aligned visual features of all time steps in parallel. In addition, SRN also uses Transformer's Eecoder as a semantic module to integrate the visual information and semantic information of the picture, which has greater benefits in irregular text such as occlusion and blur.\n",
"\n",
"NRTR [12] uses a complete Transformer structure to encode and decode the input picture, and only uses a few simple convolutional layers for high-level feature extraction, and verifies the effectiveness of the Transformer structure in text recognition.\n",
"\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/e7859f4469a842f0bd450e7e793a679d6e828007544241d09785c9b4ea2424a2 width=\"800\"></center>\n",
"<center>Figure 12: NRTR structure drawing </center>\n",
"\n",
"SRACN [13] uses Transformer's decoder to replace LSTM, once again verifying the efficiency and accuracy advantages of parallel training.\n",
"\n",
"## 3 Summary\n",
"\n",
"This section mainly introduces the theoretical knowledge and mainstream algorithms related to text recognition, including CTC-based methods, Sequence2Sequence-based methods, and segmentation-based methods. The ideas and contributions of classic papers are listed respectively. The next section will explain the practical course based on the CRNN algorithm, from networking to optimization to complete the entire training process,\n",
"\n",
"## 4 Reference\n",
"\n",
"\n",
"[1]Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence, 39(11), 2298-2304.\n",
"\n",
"[2]Fedor Borisyuk, Albert Gordo, and Viswanath Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 71–79. ACM, 2018.\n",
"\n",
"[3]Gao, Y., Chen, Y., Wang, J., & Lu, H. (2017). Reading scene text with attention convolutional sequence modeling. arXiv preprint arXiv:1709.04303.\n",
"\n",
"[4]Shi, B., Wang, X., Lyu, P., Yao, C., & Bai, X. (2016). Robust scene text recognition with automatic rectification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4168-4176).\n",
"\n",
"[5] Star-Net Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spa- tial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.\n",
"\n",
"[6]Baoguang Shi, Mingkun Yang, XingGang Wang, Pengyuan Lyu, Xiang Bai, and Cong Yao. Aster: An attentional scene text recognizer with flexible rectification. IEEE transactions on pattern analysis and machine intelligence, 31(11):855–868, 2018.\n",
"\n",
"[7] Lee C Y , Osindero S . Recursive Recurrent Nets with Attention Modeling for OCR in the Wild[C]// IEEE Conference on Computer Vision & Pattern Recognition. IEEE, 2016.\n",
"\n",
"[8]Li, H., Wang, P., Shen, C., & Zhang, G. (2019, July). Show, attend and read: A simple and strong baseline for irregular text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 8610-8617).\n",
"\n",
"[9]P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai. Multi-oriented scene text detection via corner localization and region segmentation. In Proc. CVPR, pages 7553–7563, 2018.\n",
"\n",
"[10] Liao, M., Zhang, J., Wan, Z., Xie, F., Liang, J., Lyu, P., ... & Bai, X. (2019, July). Scene text recognition from two-dimensional perspective. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 8714-8721).\n",
"\n",
"[11] Yu, D., Li, X., Zhang, C., Liu, T., Han, J., Liu, J., & Ding, E. (2020). Towards accurate scene text recognition with semantic reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12113-12122).\n",
"\n",
"[12] Sheng, F., Chen, Z., & Xu, B. (2019, September). NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) (pp. 781-786). IEEE.\n",
"\n",
"[13]Yang, L., Wang, P., Li, H., Li, Z., & Zhang, Y. (2020). A holistic representation guided attention network for scene text recognition. Neurocomputing, 414, 67-75.\n",
"\n",
"[14]Wang, T., Zhu, Y., Jin, L., Luo, C., Chen, X., Wu, Y., ... & Cai, M. (2020, April). Decoupled attention network for text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 07, pp. 12216-12224).\n",
"\n",
"[15] Wang, Y., Xie, H., Fang, S., Wang, J., Zhu, S., & Zhang, Y. (2021). From two to one: A new scene text recognizer with visual language modeling network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 14194-14203).\n",
"\n",
"[16] Fang, S., Xie, H., Wang, Y., Mao, Z., & Zhang, Y. (2021). Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7098-7107).\n",
"\n",
"[17] Yan, R., Peng, L., Xiao, S., & Yao, G. (2021). Primitive Representation Learning for Scene Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 284-293)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Document Analysis Technology\n",
"\n",
"This chapter mainly introduces the theoretical knowledge of document analysis technology, including background introduction, algorithm classification and corresponding ideas.\n",
"\n",
"Through the study of this chapter, you can master:\n",
"\n",
"1. Classification and typical ideas of layout analysis\n",
"2. Classification and typical ideas of table recognition\n",
"3. Classification and typical ideas of information extraction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Layout analysis is mainly used for document retrieval, key information extraction, content classification, etc. Its task is mainly to classify the content of document images. Content categories can generally be divided into plain text, titles, tables, pictures, and lists. However, the diversity and complexity of document layout, formats, poor document image quality, and the lack of large-scale annotated datasets make layout analysis still a challenging task. Document analysis often includes the following research directions:\n",
"\n",
"1. Layout analysis module: Divide each document page into different content areas. This module can be used not only to delimit relevant and irrelevant areas, but also to classify the types of content it recognizes.\n",
"2. Optical Character Recognition (OCR) module: Locate and recognize all text present in the document.\n",
"3. Form recognition module: Recognize and convert the form information in the document into an excel file.\n",
"4. Information extraction module: Use OCR results and image information to understand and identify the specific information expressed in the document or the relationship between the information.\n",
"\n",
"Since the OCR module has been introduced in detail in the previous chapters, the following three modules will be introduced separately for the above layout analysis, table recognition and information extraction. For each module, the classic or common methods and data sets of the module will be introduced."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Layout Analysis\n",
"\n",
"### 1.1 Background Introduction\n",
"\n",
"Layout analysis is mainly used for document retrieval, key information extraction, content classification, etc. Its task is mainly to classify document images. Content categories can generally be divided into plain text, titles, tables, pictures, and lists. However, the diversity and complexity of document layouts, formats, poor document image quality, and the lack of large-scale annotated data sets make layout analysis still a challenging task.\n",
"The visualization of the layout analysis task is shown in the figure below:\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/2510dc76c66c49b8af079f25d08a9dcba726b2ce53d14c8ba5cd9bd57acecf19\" width=\"1000\"/></center>\n",
"<center>Figure 1: Layout analysis effect diagram</center>\n",
"\n",
"The existing solutions are generally based on target detection or semantic segmentation methods, which are based on detecting or segmenting different patterns in the document as different targets.\n",
"\n",
"Some representative papers are divided into the above two categories, as shown in the following table:\n",
"\n",
"| Category | Main paper |\n",
"| ---------------- | -------- |\n",
"| Method based on target detection | [Visual Detection with Context](https://aclanthology.org/D19-1348.pdf),[Object Detection](https://arxiv.org/pdf/2003.13197v1.pdf),[VSR](https://arxiv.org/pdf/2105.06220v1.pdf)\n",
"| Semantic segmentation method |[Semantic Segmentation](https://arxiv.org/pdf/1911.12170v2.pdf) |\n",
"\n",
"\n",
"### 1.2 Method Based on Target Detection \n",
"\n",
"Soto Carlos [1] is based on the target detection algorithm Faster R-CNN, combines context information and uses the inherent location information of the document content to improve the area detection performance. Li Kai [2] et al. also proposed a document analysis method based on object detection, which solved the cross-domain problem by introducing a feature pyramid alignment module, a region alignment module, and a rendering layer alignment module. These three modules complement each other. And adjust the domain from a general image perspective and a specific document image perspective, thus solving the problem of large labeled training datasets being different from the target domain. The following figure is a flow chart of layout analysis based on the target detection Faster R-CNN algorithm. \n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/d396e0d6183243898c0961250ee7a49bc536677079fb4ba2ac87c653f5472f01\" width=\"800\"/></center>\n",
"<center>Figure 2: Flow chart of layout analysis based on Faster R-CNN</center>\n",
"\n",
"### 1.3 Methods Based on Semantic Segmentation \n",
"\n",
"Sarkar Mausoom [3] et al. proposed a priori-based segmentation mechanism to train a document segmentation model on very high-resolution images, which solves the problem of indistinguishable and merging of dense regions caused by excessively shrinking the original image. Zhang Peng [4] et al. proposed a unified framework VSR (Vision, Semantics and Relations) for document layout analysis in combination with the vision, semantics and relations in the document. The framework uses a two-stream network to extract the visual and Semantic features, and adaptively fusion of these features through the adaptive aggregation module, solves the limitations of the existing CV-based methods of low efficiency of different modal fusion and lack of relationship modeling between layout components.\n",
"\n",
"### 1.4 Data Set\n",
"\n",
"Although the existing methods can solve the layout analysis task to a certain extent, these methods rely on a large amount of labeled training data. Recently, many data sets have been proposed for document analysis tasks.\n",
"\n",
"1. PubLayNet[5]: The data set contains 500,000 document images, of which 400,000 are used for training, 50,000 are used for verification, and 50,000 are used for testing. Five forms of table, text, image, title and list are marked.\n",
"2. HJDataset[6]: The data set contains 2271 document images. In addition to the bounding box and mask of the content area, it also includes the hierarchical structure and reading order of layout elements.\n",
"\n",
"A sample of the PubLayNet data set is shown in the figure below:\n",
"<center class=\"two\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/4b153117c9384f98a0ce5a6c6e7c205a4b1c57e95c894ccb9688cbfc94e68a1c\" width=\"400\"/><img src=\"https://ai-studio-static-online.cdn.bcebos.com/efb9faea39554760b280f9e0e70631d2915399fa97774eecaa44ee84411c4994\" width=\"400\"/>\n",
"</center>\n",
"<center>Figure 3: PubLayNet example</center>\n",
"Reference:\n",
"\n",
"[1]:Soto C, Yoo S. Visual detection with context for document layout analysis[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 3464-3470.\n",
"\n",
"[2]:Li K, Wigington C, Tensmeyer C, et al. Cross-domain document object detection: Benchmark suite and method[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 12915-12924.\n",
"\n",
"[3]:Sarkar M, Aggarwal M, Jain A, et al. Document Structure Extraction using Prior based High Resolution Hierarchical Semantic Segmentation[C]//European Conference on Computer Vision. Springer, Cham, 2020: 649-666.\n",
"\n",
"[4]:Zhang P, Li C, Qiao L, et al. VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations[J]. arXiv preprint arXiv:2105.06220, 2021.\n",
"\n",
"[5]:Zhong X, Tang J, Yepes A J. Publaynet: largest dataset ever for document layout analysis[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1015-1022.\n",
"\n",
"[6]:Li M, Xu Y, Cui L, et al. DocBank: A benchmark dataset for document layout analysis[J]. arXiv preprint arXiv:2006.01038, 2020.\n",
"\n",
"[7]:Shen Z, Zhang K, Dell M. A large dataset of historical japanese documents with complex layouts[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020: 548-549."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Form Recognition\n",
"\n",
"### 2.1 Background Introduction\n",
"\n",
"Tables are common page elements in various types of documents. With the explosive growth of various types of documents, how to efficiently find tables from documents and obtain content and structure information, that is, table identification has become an urgent problem to be solved. The difficulties of form identification are summarized as follows:\n",
"\n",
"1. The types and styles of tables are complex and diverse, such as *different rows and columns combined, different content text types*, etc.\n",
"2. The style of the document itself has various styles.\n",
"3. The lighting environment during shooting, etc.\n",
"\n",
"The task of table recognition is to convert the table information in the document to an excel file. The task visualization is as follows:\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/99faa017e28b4928a408573406870ecaa251b626e0e84ab685e4b6f06f601a5f\" width=\"1600\"/></center>\n",
"\n",
"\n",
"<center>Figure 4: Example image of table recognition, the left side is the original image, and the right side is the result image after table recognition, presented in Excel format</center>\n",
"\n",
"Existing table recognition algorithms can be divided into the following four categories according to the principle of table structure reconstruction:\n",
"1. Method based on heuristic rules\n",
"2. CNN-based method\n",
"3. GCN-based method\n",
"4. Method based on End to End\n",
"\n",
"Some representative papers are divided into the above four categories, as shown in the following table:\n",
"\n",
"| Category | Idea | Main papers |\n",
"| ---------------- | ---- | -------- |\n",
"|Method based on heuristic rules|Manual design rules, connected domain detection analysis and processing|[T-Rect](https://www.researchgate.net/profile/Andreas-Dengel/publication/249657389_A_Paper-to-HTML_Table_Converting_System/links/0c9605322c9a67274d000000/A-Paper-to-HTML-Table-Converting-System.pdf),[pdf2table](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.724.7272&rep=rep1&type=pdf)|\n",
"| CNN-based method | target detection, semantic segmentation | [CascadeTabNet](https://arxiv.org/pdf/2004.12629v2.pdf), [Multi-Type-TD-TSR](https://arxiv.org/pdf/2105.11021v1.pdf), [LGPMA](https://arxiv.org/pdf/2105.06224v2.pdf), [tabstruct-net](https://arxiv.org/pdf/2010.04565v1.pdf), [CDeC-Net](https://arxiv.org/pdf/2008.10831v1.pdf), [TableNet](https://arxiv.org/pdf/2001.01469v1.pdf), [TableSense](https://arxiv.org/pdf/2106.13500v1.pdf), [Deepdesrt](https://www.dfki.de/fileadmin/user_upload/import/9672_PID4966073.pdf), [Deeptabstr](https://www.dfki.de/fileadmin/user_upload/import/10649_DeepTabStR.pdf), [GTE](https://arxiv.org/pdf/2005.00589v2.pdf), [Cycle-CenterNet](https://arxiv.org/pdf/2109.02199v1.pdf), [FCN](https://www.researchgate.net/publication/339027294_Rethinking_Semantic_Segmentation_for_Table_Structure_Recognition_in_Documents)|\n",
"| GCN-based method | Based on graph neural network, the table recognition is regarded as a graph reconstruction problem | [GNN](https://arxiv.org/pdf/1905.13391v2.pdf), [TGRNet](https://arxiv.org/pdf/2106.10598v3.pdf), [GraphTSR](https://arxiv.org/pdf/1908.04729v2.pdf)|\n",
"| Method based on End to End | Use attention mechanism | [Table-Master](https://arxiv.org/pdf/2105.01848v1.pdf)|\n",
"\n",
"### 2.2 Traditional Algorithm Based on Heuristic Rules\n",
"Early research on table recognition was mainly based on heuristic rules. For example, the T-Rect system proposed by Kieninger [1] et al. used a bottom-up method to analyze the connected domain of document images, and then merge them according to defined rules to obtain logical text blocks. Then, pdf2table proposed by Yildiz[2] et al. is the first method for table recognition on PDF documents, which utilizes some unique information of PDF files (such as text, drawing paths and other information that are difficult to obtain in image documents) to assist with table recognition. In recent work, Koci[3] et al. expressed the layout area in the page as a graph, and then used the Remove and Conquer (RAC) algorithm to identify the table as a subgraph.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/66aeedb3f0924d80aee15f185e6799cc687b51fc20b74b98b338ca2ea25be3f3\" width=\"1000\"/></center>\n",
"<center>Figure 5: Schematic diagram of heuristic algorithm</center>\n",
"\n",
"### 2.3 Method Based on Deep Learning CNN\n",
"With the rapid development of deep learning technology in computer vision, natural language processing, speech processing and other fields, researchers have applied deep learning technology to the field of table recognition and achieved good results.\n",
"\n",
"In the DeepTabStR algorithm, Siddiqui Shoaib Ahmed [12] et al. described the table structure recognition problem as an object detection problem, and used deformable convolution to better detect table cells. Raja Sachin[6] et al. proposed that TabStruct-Net combines cell detection and structure recognition visually to perform table structure recognition, which solves the problem of recognition errors due to large changes in the table layout. However, this method cannot Deal with the problem of more empty cells in rows and columns.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/838be28836444bc1835ac30a25613d8b045a1b5aedd44b258499fe9f93dd298f\" width=\"1600\"/></center>\n",
"<center>Figure 6: Schematic diagram of algorithm based on deep learning CNN</center>\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/4c40dda737bd44b09a533e1b1dd2e4c6a90ceea083bf4238b7f3c7b21087f409\" width=\"1600\"/></center>\n",
"<center>Figure 7: Example of algorithm error based on deep learning CNN</center>\n",
"\n",
"The previous table structure recognition method generally starts from the elements of different granularities (row/column, text area), and it is easy to ignore the problem of merging empty cells. Qiao Liang [10] et al. proposed a new framework LGPMA, which makes full use of the information from local and global features through mask re-scoring strategy, and then can obtain more reliable alignment of the cell area, and finally introduces the inclusion of cell matching, empty The table structure restoration pipeline of cell search and empty cell merging handles the problem of table structure identification.\n",
"\n",
"In addition to the above separate table recognition algorithms, there are also some methods that complete table detection and table recognition in one model. Schreiber Sebastian [11] et al. proposed DeepDeSRT, which uses Faster RCNN for table detection and FCN semantic segmentation model for Table structure row and column detection, but this method uses two independent models to solve these two problems. Prasad Devashish [4] et al. proposed an end-to-end deep learning method CascadeTabNet, which uses the Cascade Mask R-CNN HRNet model to perform table detection and structure recognition at the same time, which solves the problem of using two independent methods to process table recognition in the past. The insufficiency of the problem. Paliwal Shubham [8] et al. proposed a novel end-to-end deep multi-task architecture TableNet, which is used for table detection and structure recognition. At the same time, additional spatial semantic features are added to TableNet during training to further improve the performance of the model. Zheng Xinyi [13] et al. proposed a system framework GTE for table recognition, using a cell detection network to guide the training of the table detection network, and proposed a hierarchical network and a new clustering-based cell structure recognition algorithm, the framework can be connected to the back of any target detection model to facilitate the training of different table recognition algorithms. Previous research mainly focused on parsing from scanned PDF documents with a simple layout and well-aligned table images. However, the tables in real scenes are generally complex and may have serious deformation, bending or occlusion. Therefore, Long Rujiao [14] et al. also constructed a table recognition data set WTW in a realistic complex scene, and proposed a Cycle-CenterNet method, which uses the cyclic pairing module optimization and the proposed new pairing loss to accurately group discrete units into structured In the table, the performance of table recognition is improved.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a01f714cbe1f42fc9c45c6658317d9d7da2cec9726844f6b9fa75e30cadc9f76\" width=\"1600\"/></center>\n",
"<center>Figure 8: Schematic diagram of end-to-end algorithm</center>\n",
"\n",
"The CNN-based method cannot handle the cross-row and column tables well, so in the follow-up method, two research methods are divided to solve the cross-row and column problems in the table.\n",
"\n",
"### 2.4 Method Based on Deep Learning GCN\n",
"In recent years, with the rise of Graph Convolutional Networks (Graph Convolutional Network), some researchers have tried to apply graph neural networks to table structure recognition problems. Qasim Shah Rukh [20] et al. converted the table structure recognition problem into a graph problem compatible with graph neural networks, and designed a novel differentiable architecture that can not only take advantage of the advantages of convolutional neural networks to extract features, but also The advantages of the effective interaction between the vertices of the graph neural network can be used, but this method only uses the location features of the cells, and does not use the semantic features. Chi Zewen [19] et al. proposed a novel graph neural network, GraphTSR, for table structure recognition in PDF files. It takes cells in the table as input, and then uses the characteristics of the edges and nodes of the graph to be connected. Predicting the relationship between cells to identify the table structure solves the problem of cell identification across rows or columns to a certain extent. Xue Wenyuan [21] et al. reformulated the problem of table structure recognition as table map reconstruction, and proposed an end-to-end method for table structure recognition, TGRNet, which includes cell detection branch and cell logic location branch. , These two branches jointly predict the spatial and logical positions of different cells, which solves the problem that the previous method did not pay attention to the logical position of cells.\n",
"\n",
"Diagram of GraphTSR table recognition algorithm:\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/8ff89661142045a8aef54f8a7a2c69b1d243f8269034406a9e66bee2149f730f\" width=\"1600\"/></center>\n",
"<center>Figure 9: Diagram of GraphTSR table recognition algorithm</center>\n",
"\n",
"### 2.5 Based on An End-to-End Approach\n",
"\n",
"Different from other post-processing to complete the reconstruction of the table structure, based on the end-to-end method, directly use the network to complete the HTML representation output of the table structure\n",
"\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/7865e58a83824facacfaa91bec12ccf834217cb706454dc5a0c165c203db79fb) | ![](https://ai-studio-static-online.cdn.bcebos.com/77d913b1b92f4a349b8f448e08ba78458d687eef4af142678a073830999f3edc))\n",
"---|---\n",
"Figure 10: Input and output of the end-to-end method | Figure 11: Image Caption example\n",
"\n",
"Most end-to-end methods use Image Caption's Seq2Seq method to complete the prediction of the table structure, such as some methods based on Attention or Transformer.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/3571280a9c364d3499a062e3edc724294fb5eaef8b38440991941e87f0af0c3b\" width=\"800\"/></center>\n",
"<center>Figure 12: Schematic diagram of Seq2Seq</center>\n",
"\n",
"Ye Jiaquan [22] obtained the table structure output model in TableMaster by improving the Master text algorithm based on Transformer. In addition, a branch is added for the coordinate regression of the box. The author did not split the model into two branches in the last layer, but decoupled the sequence prediction and the box regression into two after the first Transformer decoding layer. Branches. The comparison between its network structure and the original Master network is shown in the figure below:\n",
"\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/f573709447a848b4ba7c73a2e297f0304caaca57c5c94588aada1f4cd893946c\" width=\"800\"/></center>\n",
"<center>Figure 13: Left: master network diagram, right: TableMaster network diagram</center>\n",
"\n",
"\n",
"### 2.6 Data Set\n",
"\n",
"Since the deep learning method is data-driven, a large amount of labeled data is required to train the model, and the small size of the existing data set is also an important constraint, so some data sets have also been proposed.\n",
"\n",
"1. PubTabNet[16]: Contains 568k table images and corresponding structured HTML representations.\n",
"2. PubMed Tables (PubTables-1M) [17]: Table structure recognition data set, containing highly detailed structural annotations, 460,589 pdf images used for table inspection tasks, and 947,642 table images used for table recognition tasks.\n",
"3. TableBank[18]: Table detection and recognition data set, using Word and Latex documents on the Internet to construct table data containing 417K high-quality annotations.\n",
"4. SciTSR[19]: Table structure recognition data set, most of the images are converted from the paper, which contains 15,000 tables from PDF files and their corresponding structure tags.\n",
"5. TabStructDB[12]: Includes 1081 table areas, which are marked with dense row and column information.\n",
"6. WTW[14]: Large-scale data set scene table detection and recognition data set, this data set contains table data in various deformation, bending and occlusion situations, and contains 14,581 images in total.\n",
"\n",
"Data set example\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/c9763df56e67434f97cd435100d50ded71ba66d9d4f04d7f8f896d613cdf02b0\" /></center>\n",
"<center>Figure 14: Sample diagram of PubTables-1M data set</center>\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/64de203bbe584642a74f844ac4b61d1ec3c5a38cacb84443ac961fbcc54a66ce\" width=\"600\"/></center>\n",
"<center>Figure 15: Sample diagram of WTW data set</center>\n",
"\n",
"\n",
"\n",
"Reference\n",
"\n",
"[1]:Kieninger T, Dengel A. A paper-to-HTML table converting system[C]//Proceedings of document analysis systems (DAS). 1998, 98: 356-365.\n",
"\n",
"[2]:Yildiz B, Kaiser K, Miksch S. pdf2table: A method to extract table information from pdf files[C]//IICAI. 2005: 1773-1785.\n",
"\n",
"[3]:Koci E, Thiele M, Lehner W, et al. Table recognition in spreadsheets via a graph representation[C]//2018 13th IAPR International Workshop on Document Analysis Systems (DAS). IEEE, 2018: 139-144.\n",
"\n",
"[4]:Prasad D, Gadpal A, Kapadni K, et al. CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020: 572-573.\n",
"\n",
"[5]:Fischer P, Smajic A, Abrami G, et al. Multi-Type-TD-TSR–Extracting Tables from Document Images Using a Multi-stage Pipeline for Table Detection and Table Structure Recognition: From OCR to Structured Table Representations[C]//German Conference on Artificial Intelligence (Künstliche Intelligenz). Springer, Cham, 2021: 95-108.\n",
"\n",
"[6]:Raja S, Mondal A, Jawahar C V. Table structure recognition using top-down and bottom-up cues[C]//European Conference on Computer Vision. Springer, Cham, 2020: 70-86.\n",
"\n",
"[7]:Agarwal M, Mondal A, Jawahar C V. Cdec-net: Composite deformable cascade network for table detection in document images[C]//2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021: 9491-9498.\n",
"\n",
"[8]:Paliwal S S, Vishwanath D, Rahul R, et al. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 128-133.\n",
"\n",
"[9]:Dong H, Liu S, Han S, et al. Tablesense: Spreadsheet table detection with convolutional neural networks[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33(01): 69-76.\n",
"\n",
"[10]:Qiao L, Li Z, Cheng Z, et al. LGPMA: Complicated Table Structure Recognition with Local and Global Pyramid Mask Alignment[J]. arXiv preprint arXiv:2105.06224, 2021.\n",
"\n",
"[11]:Schreiber S, Agne S, Wolf I, et al. Deepdesrt: Deep learning for detection and structure recognition of tables in document images[C]//2017 14th IAPR international conference on document analysis and recognition (ICDAR). IEEE, 2017, 1: 1162-1167.\n",
"\n",
"[12]:Siddiqui S A, Fateh I A, Rizvi S T R, et al. Deeptabstr: Deep learning based table structure recognition[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1403-1409.\n",
"\n",
"[13]:Zheng X, Burdick D, Popa L, et al. Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021: 697-706.\n",
"\n",
"[14]:Long R, Wang W, Xue N, et al. Parsing Table Structures in the Wild[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 944-952.\n",
"\n",
"[15]:Siddiqui S A, Khan P I, Dengel A, et al. Rethinking semantic segmentation for table structure recognition in documents[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1397-1402.\n",
"\n",
"[16]:Zhong X, ShafieiBavani E, Jimeno Yepes A. Image-based table recognition: data, model, and evaluation[C]//Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. Springer International Publishing, 2020: 564-580.\n",
"\n",
"[17]:Smock B, Pesala R, Abraham R. PubTables-1M: Towards a universal dataset and metrics for training and evaluating table extraction models[J]. arXiv preprint arXiv:2110.00061, 2021.\n",
"\n",
"[18]:Li M, Cui L, Huang S, et al. Tablebank: Table benchmark for image-based table detection and recognition[C]//Proceedings of the 12th Language Resources and Evaluation Conference. 2020: 1918-1925.\n",
"\n",
"[19]:Chi Z, Huang H, Xu H D, et al. Complicated table structure recognition[J]. arXiv preprint arXiv:1908.04729, 2019.\n",
"\n",
"[20]:Qasim S R, Mahmood H, Shafait F. Rethinking table recognition using graph neural networks[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 142-147.\n",
"\n",
"[21]:Xue W, Yu B, Wang W, et al. TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition[J]. arXiv preprint arXiv:2106.10598, 2021.\n",
"\n",
"[22]:Ye J, Qi X, He Y, et al. PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML[J]. arXiv preprint arXiv:2105.01848, 2021.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Document VQA\n",
"\n",
"The boss sent a task: develop an ID card recognition system\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/63bbe893465e4f98b3aec80a042758b520d43e1a993a47e39bce1123c2d29b3f\" width=\"1600\"/></center>\n",
"\n",
"> How to choose a plan\n",
"> 1. Use rules to extract information after text detection\n",
"> 2. Use scale type to extract information after text detection\n",
"> 3. Outsourcing\n",
"\n",
"\n",
"### 3.1 Background Introduction\n",
"In the VQA (Visual Question Answering) task, questions and answers are mainly aimed at the content of the image, but for text images, the content of interest is the text information in the image, so this type of method can be divided into Text-VQA and text-VQA in natural scenes. DocVQA of the scanned document scene, the relationship between the three is shown in the figure below.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a91cfd5152284152b020ca8a396db7a21fd982e3661540d5998cc19c17d84861\" width=\"600\"/></center>\n",
"<center>Figure 16: VQA level</center>\n",
"\n",
"The sample pictures of VQA, Text-VQA and DocVQA are shown in the figure below.\n",
"\n",
"|Task type|VQA | Text-VQA | DocVQA| \n",
"|---|---|---|---|\n",
"|Task description|Ask questions regarding **picture content**|Ask questions regarding **text content on pictures**|Ask questions regarding **text content of document images**|\n",
"|Sample picture|![vqa](https://ai-studio-static-online.cdn.bcebos.com/fc21b593276247249591231b3373608151ed8ae7787f4d6ba39e8779fdd12201)|![textvqa](https://ai-studio-static-online.cdn.bcebos.com/cd2404edf3bf430b89eb9b2509714499380cd02e4aa74ec39ca6d7aebcf9a559)|![docvqa](https://ai-studio-static-online.cdn.bcebos.com/0eec30a6f91b4f949c56729b856f7ff600d06abee0774642801c070303edfe83)|\n",
"\n",
"Because DocVQA is closer to actual application scenarios, a large number of academic and industrial work has emerged. In common scenarios, the questions asked in DocVQA are fixed. For example, the questions in the ID card scenario are generally\n",
"1. What is the citizenship number?\n",
"2. What is your name?\n",
"3. What is a clan?\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/2d2b86468daf47c98be01f44b8d6efa64bc09e43cd764298afb127f19b07aede\" width=\"800\"/></center>\n",
"<center>Figure 17: Example of an ID card</center>\n",
"\n",
"\n",
"Based on this prior knowledge, DocVQA's research began to lean towards the Key Information Extraction (KIE) task. This time we also mainly discuss the KIE-related research. The KIE task mainly extracts the key information needed from the image, such as extracting from the ID card. Name and citizen identification number information.\n",
"\n",
"KIE is usually divided into two sub-tasks for research\n",
"1. SER: Semantic Entity Recognition, to classify each detected text, such as dividing it into name and ID. As shown in the black box and red box in the figure below.\n",
"2. RE: Relation Extraction, which classifies each detected text, such as dividing it into questions and answers. Then find the corresponding answer to each question. As shown in the figure below, the red and black boxes represent the question and the answer, respectively, and the yellow line represents the correspondence between the question and the answer.\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/899470ba601349fbbc402a4c83e6cdaee08aaa10b5004977b1f684f346ebe31f\" width=\"800\"/></center>\n",
"<center>Figure 18: Example of SER, RE task</center>\n",
"\n",
"The general KIE method is researched based on Named Entity Recognition (NER) [4], but this type of method only uses the text information in the image and lacks the use of visual and structural information, so the accuracy is not high. On this basis, the methods in recent years have begun to integrate visual and structural information with text information. According to the principles used when fusing multimodal information, these methods can be divided into the following three types:\n",
"\n",
"1. Grid-based approach\n",
"1. Token-based approach\n",
"2. GCN-based method\n",
"3. Based on the End to End method\n",
"\n",
"Some representative papers are divided into the above three categories, as shown in the following table:\n",
"\n",
"| Category | Ideas | Main Papers |\n",
"| ---------------- | ---- | -------- |\n",
"| Grid-based method | Fusion of multi-modal information on images (text, layout, image) | [Chargrid](https://arxiv.org/pdf/1809.08799) |\n",
"| Token-based method|Using methods such as Bert for multi-modal information fusion|[LayoutLM](https://arxiv.org/pdf/1912.13318), [LayoutLMv2](https://arxiv.org/pdf/2012.14740), [StrucText](https://arxiv.org/pdf/2108.02923), |\n",
"| GCN-based method|Using graph network structure for multi-modal information fusion|[GCN](https://arxiv.org/pdf/1903.11279), [PICK](https://arxiv.org/pdf/2004.07464), [SDMG-R](https://arxiv.org/pdf/2103.14470), [SERA](https://arxiv.org/pdf/2110.09915) |\n",
"| Based on End to End method | Unify OCR and key information extraction into one network | [Trie](https://arxiv.org/pdf/2005.13118) |\n",
"\n",
"### 3.2 Grid-Based Method\n",
"\n",
"The Grid-based method performs multimodal information fusion at the image level. Chargrid[5] firstly performs character-level text detection and recognition on the image, and then completes the construction of the network input by filling the one-hot encoding of the category into the corresponding character area (the non-black part in the right image below) , the input is finally passed through the CNN network of the encoder-decoder structure to perform coordinate detection and category classification of key information.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/f248841769ec4312a9015b4befda37bf29db66226431420ca1faad517783875e\" width=\"800\"/></center>\n",
"<center>Figure 19: Chargrid data example</center>\n",
"\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/0682e52e275b4187a0e74f54961a50091fd3a0cdff734e17bedcbc993f6e29f9\" width=\"800\"/></center>\n",
"<center>Figure 20: Chargrid network</center>\n",
"\n",
"\n",
"Compared with the traditional method based only on text, this method can use both text information and structural information, so it can achieve a certain accuracy improvement. It's good to combine the two.\n",
"\n",
"### 3.3 Token-Based Method\n",
"LayoutLM[6] encodes 2D position information and text information together into the BERT model, and draws on the pre-training ideas of Bert in NLP, pre-training on large-scale data sets, and in downstream tasks, LayoutLM also adds image information To further improve the performance of the model. Although LayoutLM combines text, location and image information, the image information is fused in the training of downstream tasks, so the multi-modal fusion of the three types of information is not sufficient. Based on LayoutLM, LayoutLMv2 [7] integrates image information with text and layout information in the pre-training stage through transformers, and also adds a spatial perception self-attention mechanism to the Transformer to assist the model to better integrate visual and text features. Although LayoutLMv2 fuses text, location and image information in the pre-training stage, the visual features learned by the model are not fine enough due to the limitation of the pre-training task. StrucTexT [8] based on the previous multi-modal methods, proposed two new tasks, Sentence Length Prediction (SLP) and Paired Boxes Direction (PBD) in the pre-training task to help the network learn fine visual features. Among them, the SLP task makes the model Learn the length of the text segment, the PDB task allows the model to learn the matching relationship between Box directions. Through these two new pre-training tasks, the deep cross-modal fusion between text, visual and layout information can be accelerated.\n",
"\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/17a26ade09ee4311b90e49a1c61d88a72a82104478434f9dabd99c27a65d789b) | ![](https://ai-studio-static-online.cdn.bcebos.com/d75addba67ef4b06a02ae40145e609d3692d613ff9b74cec85123335b465b3cc))\n",
"---|---\n",
"Figure 21: Transformer algorithm flow chart | Figure 22: LayoutLMv2 algorithm flow chart\n",
"\n",
"### 3.4 GCN-Based Method\n",
"\n",
"Although the existing GCN-based methods [10] use text and structure information, they do not make good use of image information. PICK [11] added image information to the GCN network and proposed a graph learning module to automatically learn edge types. SDMG-R [12] encodes the image as a bimodal graph. The nodes of the graph are the visual and textual information of the text area. The edges represent the direct spatial relationship between adjacent texts. By iteratively spreading information along the edges and inferring graph node categories, SDMG -R solves the problem that existing methods are incapable of unseen templates.\n",
"\n",
"\n",
"The PICK flow chart is shown in the figure below:\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/d3282959e6b2448c89b762b3b9bbf6197a0364b101214a1f83cf01a28623c01c\" width=\"800\"/></center>\n",
"<center>Figure 23: PICK algorithm flow chart</center>\n",
"\n",
"SERA[10]The biaffine parser in dependency syntax analysis is introduced into document relation extraction, and GCN is used to fuse text and visual information.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a97b7647968a4fa59e7b14b384dd7ffe812f158db8f741459b6e6bb0e8b657c7\" width=\"800\"/></center>\n",
"<center>Figure 24: SERA algorithm flow chart</center>\n",
"\n",
"### 3.5 Method Based on End to End\n",
"\n",
"Existing methods divide KIE into two independent tasks: text reading and information extraction. However, they mainly focus on improving the task of information extraction, ignoring that text reading and information extraction are interrelated. Therefore, Trie [9] Proposed a unified end-to-end network that can learn these two tasks at the same time and reinforce each other in the learning process.\n",
"\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/6e4a3b0f65254f6b9d40cea0875854d4f47e1dca6b1e408cad435b3629600608\" width=\"1300\"/></center>\n",
"<center>Figure 25: Trie algorithm flow chart</center>\n",
"\n",
"\n",
"### 3.6 Data Set\n",
"The data sets used for KIE mainly include the following two:\n",
"1. SROIE: Task 3 of the SROIE data set [2] aims to extract four predefined information from the scanned receipt: company, date, address or total. There are 626 samples in the data set for training and 347 samples for testing.\n",
"2. FUNSD: FUNSD data set [3] is a data set used to extract form information from scanned documents. It contains 199 marked real scan forms. Of the 199 samples, 149 are used for training and 50 are used for testing. The FUNSD data set assigns a semantic entity tag to each word: question, answer, title or other.\n",
"3. XFUN: The XFUN data set is a multilingual data set proposed by Microsoft. It contains 7 languages. Each language contains 149 training sets and 50 test sets.\n",
"\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/dfdf530d79504761919c1f093f9a86dac21e6db3304c4892998ea1823f3187c6) | ![](https://ai-studio-static-online.cdn.bcebos.com/3b2a9f9476be4e7f892b73bd7096ce8d88fe98a70bae47e6ab4c5fcc87e83861))\n",
"---|---\n",
"Figure 26: sroie example image | Figure 27: xfun example image\n",
"\n",
"Reference:\n",
"\n",
"[1]:Mathew M, Karatzas D, Jawahar C V. Docvqa: A dataset for vqa on document images[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021: 2200-2209.\n",
"\n",
"[2]:Huang Z, Chen K, He J, et al. Icdar2019 competition on scanned receipt ocr and information extraction[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1516-1520.\n",
"\n",
"[3]:Jaume G, Ekenel H K, Thiran J P. Funsd: A dataset for form understanding in noisy scanned documents[C]//2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). IEEE, 2019, 2: 1-6.\n",
"\n",
"[4]:Lample G, Ballesteros M, Subramanian S, et al. Neural architectures for named entity recognition[J]. arXiv preprint arXiv:1603.01360, 2016.\n",
"\n",
"[5]:Katti A R, Reisswig C, Guder C, et al. Chargrid: Towards understanding 2d documents[J]. arXiv preprint arXiv:1809.08799, 2018.\n",
"\n",
"[6]:Xu Y, Li M, Cui L, et al. Layoutlm: Pre-training of text and layout for document image understanding[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020: 1192-1200.\n",
"\n",
"[7]:Xu Y, Xu Y, Lv T, et al. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding[J]. arXiv preprint arXiv:2012.14740, 2020.\n",
"\n",
"[8]:Li Y, Qian Y, Yu Y, et al. StrucTexT: Structured Text Understanding with Multi-Modal Transformers[C]//Proceedings of the 29th ACM International Conference on Multimedia. 2021: 1912-1920.\n",
"\n",
"[9]:Zhang P, Xu Y, Cheng Z, et al. Trie: End-to-end text reading and information extraction for document understanding[C]//Proceedings of the 28th ACM International Conference on Multimedia. 2020: 1413-1422.\n",
"\n",
"[10]:Liu X, Gao F, Zhang Q, et al. Graph convolution for multimodal information extraction from visually rich documents[J]. arXiv preprint arXiv:1903.11279, 2019.\n",
"\n",
"[11]:Yu W, Lu N, Qi X, et al. Pick: Processing key information extraction from documents using improved graph learning-convolutional networks[C]//2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021: 4363-4370.\n",
"\n",
"[12]:Sun H, Kuang Z, Yue X, et al. Spatial Dual-Modality Graph Reasoning for Key Information Extraction[J]. arXiv preprint arXiv:2103.14470, 2021."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Summary\n",
"In this section, we mainly introduce the theoretical knowledge of three sub-modules related to document analysis technology: layout analysis, table recognition and information extraction. Below we will explain this form recognition and DOC-VQA practical tutorial based on the PaddleOCR framework."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# 1. Course Prerequisites\n",
"\n",
"The OCR model involved in this course is based on deep learning, so its related basic knowledge, environment configuration, project engineering and other materials will be introduced in this section, especially for readers who are not familiar with deep learning. content.\n",
"\n",
"### 1.1 Preliminary Knowledge\n",
"\n",
"The \"learning\" of deep learning has been developed from the content of neurons, perceptrons, and multilayer neural networks in machine learning. Therefore, understanding the basic machine learning algorithms is of great help to the understanding and application of deep learning. The \"deepness\" of deep learning is embodied in a series of vector-based mathematical operations such as convolution and pooling used in the process of processing a large amount of information. If you lack the theoretical foundation of the two, you can learn from teacher Li Hongyi's [Linear Algebra](https://aistudio.baidu.com/aistudio/course/introduce/2063) and [Machine Learning](https://aistudio.baidu.com/aistudio/course/introduce/1978) courses.\n",
"\n",
"For the understanding of deep learning itself, you can refer to the zero-based course of Bai Ran, an outstanding architect of Baidu: [Baidu architects take you hands-on with zero-based practice deep learning](https://aistudio.baidu.com/aistudio/course/introduce/1297), which covers the development history of deep learning and introduces the complete components of deep learning through a classic case. It is a set of practice-oriented deep learning courses.\n",
"\n",
"For the practice of theoretical knowledge, [Python basic knowledge](https://aistudio.baidu.com/aistudio/course/introduce/1224) is essential. At the same time, in order to quickly reproduce the deep learning model, the deep learning framework used in this course For: Flying PaddlePaddle. If you have used other frameworks, you can quickly learn how to use flying paddles through [Quick Start Document](https://www.paddlepaddle.org.cn/documentation/docs/zh/practices/quick_start/hello_paddle.html).\n",
"\n",
"### 1.2 Basic Environment Preparation\n",
"\n",
"If you want to run the code of this course in a local environment and have not built a Python environment before, you can follow the [zero-base operating environment preparation](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/doc/doc_ch/environment.md), install Anaconda or docker environment according to your operating system.\n",
"\n",
"If you don't have local resources, you can run the code through the AI Studio training platform. Each item in it is presented in a notebook, which is convenient for developers to learn. If you are not familiar with the related operations of Notebook, you can refer to [AI Studio Project Description](https://ai.baidu.com/ai-doc/AISTUDIO/0k3e2tfzm).\n",
"\n",
"### 1.3 Get and Run the Code\n",
"\n",
"This course relies on the formation of PaddleOCR's code repository. First, clone the complete project of PaddleOCR:\n",
"\n",
"```bash\n",
"# [recommend]\n",
"git clone https://github.com/PaddlePaddle/PaddleOCR\n",
"\n",
"# If you cannot pull successfully due to network problems, you can also choose to use the hosting on Code Cloud:\n",
"git clone https://gitee.com/paddlepaddle/PaddleOCR\n",
"```\n",
"\n",
"> Note: The code cloud hosted code may not be able to synchronize the update of this github project in real time, there is a delay of 3~5 days, please use the recommended method first.\n",
">\n",
"> If you are not familiar with git operations, you can download the compressed package directly from the `Code` on the homepage of PaddleOCR\n",
"\n",
"Then install third-party libraries:\n",
"\n",
"```bash\n",
"cd PaddleOCR\n",
"pip3 install -r requirements.txt\n",
"```\n",
"\n",
"\n",
"\n",
"### 1.4 Access to Information\n",
"\n",
"[PaddleOCR Usage Document](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/README.md) describes in detail how to use PaddleOCR to complete model application, training and deployment. The document is rich in content, most of the user’s questions are described in the document or FAQ, especially in [FAQ](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/doc/doc_en/FAQ_en.md), in accordance with the application process of deep learning, has precipitated the user's common questions, it is recommended that you read it carefully.\n",
"\n",
"### 1.5 Ask for Help\n",
"\n",
"If you encounter BUG, ease of use or documentation related issues while using PaddleOCR, you can contact the official via [Github issue](https://github.com/PaddlePaddle/PaddleOCR/issues), please follow the issue template Provide as much information as possible so that official personnel can quickly locate the problem. At the same time, the WeChat group is the daily communication position for the majority of PaddleOCR users, and it is more suitable for asking some consulting questions. In addition to the PaddleOCR team members, there will also be enthusiastic developers answering your questions."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
@@ -31,7 +31,8 @@ class CTCLoss(nn.Layer):
         predicts = predicts[-1]
         predicts = predicts.transpose((1, 0, 2))
         N, B, _ = predicts.shape
-        preds_lengths = paddle.to_tensor([N] * B, dtype='int64')
+        preds_lengths = paddle.to_tensor(
+            [N] * B, dtype='int64', place=paddle.CPUPlace())
         labels = batch[1].astype("int32")
         label_lengths = batch[2].astype('int64')
         loss = self.loss_func(predicts, labels, preds_lengths, label_lengths)
...
@@ -16,6 +16,7 @@
 class ClsMetric(object):
     def __init__(self, main_indicator='acc', **kwargs):
         self.main_indicator = main_indicator
+        self.eps = 1e-5
         self.reset()

     def __call__(self, pred_label, *args, **kwargs):
@@ -28,7 +29,7 @@ class ClsMetric(object):
             all_num += 1
         self.correct_num += correct_num
         self.all_num += all_num
-        return {'acc': correct_num / all_num, }
+        return {'acc': correct_num / (all_num + self.eps), }

     def get_metric(self):
         """
@@ -36,7 +37,7 @@ class ClsMetric(object):
             'acc': 0
         }
         """
-        acc = self.correct_num / self.all_num
+        acc = self.correct_num / (self.all_num + self.eps)
         self.reset()
         return {'acc': acc}
...
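The recurring change across these metric classes is the same: an `eps` term is added to every accuracy denominator. A minimal illustration of what this guards against (not repository code):

```python
# Without eps, an empty evaluation batch (all_num == 0) raises ZeroDivisionError.
eps = 1e-5
correct_num, all_num = 0, 0
acc = correct_num / (all_num + eps)  # 0.0 instead of a crash
print(acc)
```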
@@ -20,6 +20,7 @@ class RecMetric(object):
     def __init__(self, main_indicator='acc', is_filter=False, **kwargs):
         self.main_indicator = main_indicator
         self.is_filter = is_filter
+        self.eps = 1e-5
         self.reset()

     def _normalize_text(self, text):
@@ -47,8 +48,8 @@ class RecMetric(object):
         self.all_num += all_num
         self.norm_edit_dis += norm_edit_dis
         return {
-            'acc': correct_num / all_num,
-            'norm_edit_dis': 1 - norm_edit_dis / (all_num + 1e-3)
+            'acc': correct_num / (all_num + self.eps),
+            'norm_edit_dis': 1 - norm_edit_dis / (all_num + self.eps)
         }

     def get_metric(self):
@@ -58,8 +59,8 @@ class RecMetric(object):
             'norm_edit_dis': 0,
         }
         """
-        acc = 1.0 * self.correct_num / (self.all_num + 1e-3)
-        norm_edit_dis = 1 - self.norm_edit_dis / (self.all_num + 1e-3)
+        acc = 1.0 * self.correct_num / (self.all_num + self.eps)
+        norm_edit_dis = 1 - self.norm_edit_dis / (self.all_num + self.eps)
         self.reset()
         return {'acc': acc, 'norm_edit_dis': norm_edit_dis}
...
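For the recognition metric, the same `eps` also enters the `norm_edit_dis` aggregation. A small sketch of that aggregation, assuming per-sample normalized edit distances have already been computed elsewhere:

```python
eps = 1e-5
per_sample_dists = [0.0, 0.25, 1.0]  # hypothetical normalized edit distances
norm_edit_dis = sum(per_sample_dists)
all_num = len(per_sample_dists)
score = 1 - norm_edit_dis / (all_num + eps)  # closer to 1 means better recognition
print(round(score, 4))
```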
@@ -12,9 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import numpy as np

 class TableMetric(object):
     def __init__(self, main_indicator='acc', **kwargs):
         self.main_indicator = main_indicator
+        self.eps = 1e-5
         self.reset()

     def __call__(self, pred, batch, *args, **kwargs):
@@ -31,9 +34,7 @@ class TableMetric(object):
                 correct_num += 1
         self.correct_num += correct_num
         self.all_num += all_num
-        return {
-            'acc': correct_num * 1.0 / all_num,
-        }
+        return {'acc': correct_num * 1.0 / (all_num + self.eps), }

     def get_metric(self):
         """
@@ -41,7 +42,7 @@ class TableMetric(object):
             'acc': 0,
         }
         """
-        acc = 1.0 * self.correct_num / self.all_num
+        acc = 1.0 * self.correct_num / (self.all_num + self.eps)
         self.reset()
         return {'acc': acc}
...
@@ -105,3 +105,22 @@ def set_seed(seed=1024):
     random.seed(seed)
     np.random.seed(seed)
     paddle.seed(seed)
+
+
+class AverageMeter:
+    def __init__(self):
+        self.reset()
+
+    def reset(self):
+        """reset"""
+        self.val = 0
+        self.avg = 0
+        self.sum = 0
+        self.count = 0
+
+    def update(self, val, n=1):
+        """update"""
+        self.val = val
+        self.sum += val * n
+        self.count += n
+        self.avg = self.sum / self.count
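A minimal usage sketch for the newly added `AverageMeter` (the values below are made up; in training code `n` would typically be the batch size):

```python
meter = AverageMeter()
for loss in [0.9, 0.7, 0.4]:
    meter.update(loss, n=8)  # n: number of samples covered by this value
print(meter.val)             # last recorded value: 0.4
print(round(meter.avg, 4))   # running average over all samples so far
```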
@@ -133,16 +133,16 @@ cd ppstructure
 # Download the models
 mkdir inference && cd inference
-# Download the detection model of the ultra-lightweight Chinese OCR model and unpack it
-wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_det_infer.tar && tar xf ch_ppocr_mobile_v2.0_det_infer.tar
+# Download the PP-OCRv2 text detection model and unpack it
+wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_slim_quant_infer.tar && tar xf ch_PP-OCRv2_det_slim_quant_infer.tar
-# Download the recognition model of the ultra-lightweight Chinese OCR model and unpack it
-wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_rec_infer.tar && tar xf ch_ppocr_mobile_v2.0_rec_infer.tar
+# Download the PP-OCRv2 text recognition model and unpack it
+wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_slim_quant_infer.tar && tar xf ch_PP-OCRv2_rec_slim_quant_infer.tar
-# Download the ultra-lightweight English table model and unpack it
+# Download the ultra-lightweight English table structure prediction model and unpack it
 wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar && tar xf en_ppocr_mobile_v2.0_table_structure_infer.tar
 cd ..
-python3 predict_system.py --det_model_dir=inference/ch_ppocr_mobile_v2.0_det_infer \
-    --rec_model_dir=inference/ch_ppocr_mobile_v2.0_rec_infer \
+python3 predict_system.py --det_model_dir=inference/ch_PP-OCRv2_det_slim_quant_infer \
+    --rec_model_dir=inference/ch_PP-OCRv2_rec_slim_quant_infer \
     --table_model_dir=inference/en_ppocr_mobile_v2.0_table_structure_infer \
     --image_dir=../doc/table/1.png \
     --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt \
...
@@ -41,7 +41,7 @@ wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_tab
 wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar && tar xf en_ppocr_mobile_v2.0_table_structure_infer.tar
 cd ..
 # run
-python3 table/predict_table.py --det_model_dir=inference/en_ppocr_mobile_v2.0_table_det_infer --rec_model_dir=inference/en_ppocr_mobile_v2.0_table_rec_infer --table_model_dir=inference/en_ppocr_mobile_v2.0_table_structure_infer --image_dir=../doc/table/table.jpg --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt --rec_char_dict_path=../ppocr/utils/dict/en_dict.txt --det_limit_side_len=736 --det_limit_type=min --output ../output/table
+python3 table/predict_table.py --det_model_dir=inference/en_ppocr_mobile_v2.0_table_det_infer --rec_model_dir=inference/en_ppocr_mobile_v2.0_table_rec_infer --table_model_dir=inference/en_ppocr_mobile_v2.0_table_structure_infer --image_dir=../doc/table/table.jpg --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt --det_limit_side_len=736 --det_limit_type=min --output ../output/table
 ```
 Note: The above model is trained on the PubLayNet dataset and only supports English scanning scenarios. If you need to identify other scenarios, you need to train the model yourself and replace the three fields `det_model_dir`, `rec_model_dir`, `table_model_dir`.
...
@@ -56,7 +56,7 @@ wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_tab
 wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar && tar xf en_ppocr_mobile_v2.0_table_structure_infer.tar
 cd ..
 # Run prediction
-python3 table/predict_table.py --det_model_dir=inference/en_ppocr_mobile_v2.0_table_det_infer --rec_model_dir=inference/en_ppocr_mobile_v2.0_table_rec_infer --table_model_dir=inference/en_ppocr_mobile_v2.0_table_structure_infer --image_dir=../doc/table/table.jpg --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt --rec_char_dict_path=../ppocr/utils/dict/en_dict.txt --det_limit_side_len=736 --det_limit_type=min --output ../output/table
+python3 table/predict_table.py --det_model_dir=inference/en_ppocr_mobile_v2.0_table_det_infer --rec_model_dir=inference/en_ppocr_mobile_v2.0_table_rec_infer --table_model_dir=inference/en_ppocr_mobile_v2.0_table_structure_infer --image_dir=../doc/table/table.jpg --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt --det_limit_side_len=736 --det_limit_type=min --output ../output/table
 ```
 After the run finishes, the Excel table for each image is saved to the directory specified by the output field.
...
@@ -3,7 +3,7 @@
 The main program of the basic Linux training and inference function test is test_train_python.sh, which can test basic Python-based functions such as model training and evaluation, including pruning, quantization and distillation training.

-![](./tipc_train.png)
+![](./test_tipc/tipc_train.png)

 As shown in the figure above, the test chain mainly checks that models with shared weights and custom OPs train normally and that the slim-related training pipelines run correctly.
@@ -28,23 +28,23 @@ pip3 install -r requirements.txt
 - Mode 1: lite_train_lite_infer, trains with a small amount of data; used to quickly verify that the training-to-inference pipeline runs end to end, without verifying accuracy or speed;
 ```
-bash test_tipc/test_train_python.sh ./test_tipc/ch_ppocr_mobile_v2.0_det/train_infer_python.txt 'lite_train_lite_infer'
+bash test_tipc/test_train_python.sh ./test_tipc/train_infer_python.txt 'lite_train_lite_infer'
 ```
 - Mode 2: whole_train_whole_infer, trains with the full dataset; used to verify that the training-to-inference pipeline runs end to end and to check the final training accuracy of the model;
 ```
-bash test_tipc/test_train_python.sh ./test_tipc/ch_ppocr_mobile_v2.0_det/train_infer_python.txt 'whole_train_whole_infer'
+bash test_tipc/test_train_python.sh ./test_tipc/train_infer_python.txt 'whole_train_whole_infer'
 ```
 For training modes such as quantization and pruning, different configuration files are required. The test command for quantization-aware training is:
 ```
-bash test_tipc/test_train_python.sh ./test_tipc/ch_ppocr_mobile_v2.0_det/train_infer_python_PACT.txt 'lite_train_lite_infer'
+bash test_tipc/test_train_python.sh ./test_tipc/train_infer_python_PACT.txt 'lite_train_lite_infer'
 ```
 Similarly, FPGM pruning is run as follows:
 ```
-bash test_tipc/test_train_python.sh ./test_tipc/ch_ppocr_mobile_v2.0_det/train_infer_python_FPGM.txt 'lite_train_lite_infer'
+bash test_tipc/test_train_python.sh ./test_tipc/train_infer_python_FPGM.txt 'lite_train_lite_infer'
 ```
 After running the corresponding command, the run logs are automatically saved under the `test_tipc/output` folder. For example, after running the 'lite_train_lite_infer' mode, the test_tipc/extra_output folder contains the following files:
...