At present, the open-source models, datasets, and their scales are as follows:
English dataset: MJSynth and SynthText synthetic datasets, with tens of millions of samples in total.
Chinese dataset: LSVT street-view dataset with cropped text regions, 300,000 images in total. In addition, 5,000,000 images were synthesized based on the LSVT corpus.
Among them, the public datasets are open-sourced; users can search for and download them themselves, or refer to [Chinese dataset](dataset/datasets_en.md). The synthetic data is not open-sourced, but users can generate it themselves with open-source synthesis tools. Currently available synthesis tools include [text_renderer](https://github.com/Sanster/text_renderer), [SynthText](https://github.com/ankush-me/SynthText), [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator), etc.
10. **Error when using a model with the TPS module for prediction**
Error message: Input(X) dims[3] and Input(Grid) dims[2] should be equal, but received X dimension[3]\(108) != Grid dimension[2]\(100)
Please prepare your environment by referring to [prepare the environment](./environment_en.md) and [clone the repo](./clone_en.md).
<a name="3"></a>
## 3. Model Training / Evaluation / Prediction
The above FCE model is trained using the CTW1500 text detection public dataset. For the download of the dataset, please refer to [ocr_datasets](./dataset/ocr_datasets_en.md).
After the data download is complete, please refer to [Text Detection Training Tutorial](./detection.md) for training. PaddleOCR has modularized the code structure, so that you only need to **replace the configuration file** to train different detection models.
<a name="4"></a>
## 4. Inference and Deployment
<a name="4-1"></a>
### 4.1 Python Inference
First, convert the model saved during FCE text detection training into an inference model. Taking the model based on the Resnet50_vd_dcn backbone network and trained on the CTW1500 English dataset as an example ([model download link](https://paddleocr.bj.bcebos.com/contribution/det_r50_dcn_fce_ctw_v2.0_train.tar)), you can use the following command to convert it.
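The invocation below is a sketch: the config path, pretrained-model path, and output directory are illustrative and should be adapted to your training setup. After conversion, FCE detection can be run with `tools/infer/predict_det.py`:

```
# Export the trained FCE model to an inference model (paths are illustrative)
python3 tools/export_model.py -c configs/det/det_r50_vd_dcn_fce_ctw.yml \
    -o Global.pretrained_model=./det_r50_dcn_fce_ctw_v2.0_train/best_accuracy \
       Global.save_inference_dir=./inference/det_fce

# Run FCE text detection with the exported model (replace image_dir with your own image)
python3 tools/infer/predict_det.py --image_dir="./doc/imgs_en/img_10.jpg" \
    --det_model_dir="./inference/det_fce/" --det_algorithm="FCE"
```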
The visualized text detection results are saved to the `./inference_results` folder by default, and the name of the result file is prefixed with 'det_res'. Examples of results are as follows:

If you want to perform curved text detection, you can execute the following command:
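A possible invocation is sketched below; the polygon box-type flag is an assumption and its exact name may vary across PaddleOCR versions:

```
# Output polygon boxes for curved text; the box-type flag name is an assumption
python3 tools/infer/predict_det.py --image_dir="./doc/imgs_en/img_10.jpg" \
    --det_model_dir="./inference/det_fce/" --det_algorithm="FCE" \
    --det_fce_box_type=poly
```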
The visualized text detection results are saved to the `./inference_results` folder by default, and the name of the result file is prefixed with 'det_res'. Examples of results are as follows:

**Note**: Since the CTW1500 dataset has only 1,000 training images, mainly covering English scenes, the above model gives very poor detection results on Chinese text images.
<a name="4-2"></a>
### 4.2 C++ Inference
Since the post-processing is not implemented in C++, the FCE text detection model does not support C++ inference.
<a name="4-3"></a>
### 4.3 Serving
Not supported
<a name="4-4"></a>
### 4.4 More
Not supported
<a name="5"></a>
## 5. FAQ
## Citation
```bibtex
@InProceedings{zhu2021fourier,
  title={Fourier Contour Embedding for Arbitrary-Shaped Text Detection},
  author={Yiqin Zhu and Jianyong Chen and Lingyu Liang and Zhanghui Kuang and Lianwen Jin and Wayne Zhang},
  booktitle={CVPR},
  year={2021}
}
```
Please prepare your environment by referring to [prepare the environment](./environment_en.md) and [clone the repo](./clone_en.md).
<a name="3"></a>
## 3. Model Training / Evaluation / Prediction
The above PSE model is trained using the ICDAR2015 text detection public dataset. For the download of the dataset, please refer to [ocr_datasets](./dataset/ocr_datasets_en.md).
After the data download is complete, please refer to [Text Detection Training Tutorial](./detection.md) for training. PaddleOCR has modularized the code structure, so that you only need to **replace the configuration file** to train different detection models.
<a name="4"></a>
## 4. Inference and Deployment
<a name="4-1"></a>
### 4.1 Python Inference
First, convert the model saved during PSE text detection training into an inference model. Taking the model based on the Resnet50_vd backbone network and trained on the ICDAR2015 English dataset as an example ([model download link](https://paddleocr.bj.bcebos.com/dygraph_v2.1/en_det/det_r50_vd_pse_v2.0_train.tar)), you can use the following command to convert it.
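The invocation below is a sketch: the config path, pretrained-model path, and output directory are illustrative and should be adapted to your training setup. After conversion, PSE detection can be run with `tools/infer/predict_det.py`:

```
# Export the trained PSE model to an inference model (paths are illustrative)
python3 tools/export_model.py -c configs/det/det_r50_vd_pse.yml \
    -o Global.pretrained_model=./det_r50_vd_pse_v2.0_train/best_accuracy \
       Global.save_inference_dir=./inference/det_pse

# Run PSE text detection with the exported model (replace image_dir with your own image)
python3 tools/infer/predict_det.py --image_dir="./doc/imgs_en/img_10.jpg" \
    --det_model_dir="./inference/det_pse/" --det_algorithm="PSE"
```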
The visualized text detection results are saved to the `./inference_results` folder by default, and the name of the result file is prefixed with 'det_res'. Examples of results are as follows:

If you want to perform curved text detection, you can execute the following command:
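A possible invocation is sketched below; the polygon box-type flag is an assumption and its exact name may vary across PaddleOCR versions:

```
# Output polygon boxes for curved text; the box-type flag name is an assumption
python3 tools/infer/predict_det.py --image_dir="./doc/imgs_en/img_10.jpg" \
    --det_model_dir="./inference/det_pse/" --det_algorithm="PSE" \
    --det_pse_box_type=poly
```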
The visualized text detection results are saved to the `./inference_results` folder by default, and the name of the result file is prefixed with 'det_res'. Examples of results are as follows:

**Note**: Since the ICDAR2015 dataset has only 1,000 training images, mainly covering English scenes, the above model gives very poor detection results on Chinese or curved text images.
<a name="4-2"></a>
### 4.2 C++ Inference
Since the post-processing is not implemented in C++, the PSE text detection model does not support C++ inference.
<a name="4-3"></a>
### 4.3 Serving
Not supported
<a name="4-4"></a>
### 4.4 More
Not supported
<a name="5"></a>
## 5. FAQ
## Citation
```bibtex
@inproceedings{wang2019shape,
  title={Shape robust text detection with progressive scale expansion network},
  author={Wang, Wenhai and Xie, Enze and Li, Xiang and Hou, Wenbo and Lu, Tong and Yu, Gang and Shao, Shuai},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2019}
}
```
- **Introduction**: A total of 450,000 Chinese street-view images, including 50,000 fully labeled images (20,000 test + 30,000 training) with text coordinates and text content, and 400,000 weakly labeled images with text content only, as shown in the following figure:
In addition to the open-source data above, users can also use synthesis tools to generate data themselves.
#### 2. ICDAR2017-RCTW-17
- **Data sources**: https://rctw.vlrlab.net/
- **Introduction**: It contains more than 12,000 images, most of which were collected in the wild with mobile-phone cameras; some are screenshots. The images cover a variety of scenes, including street views, posters, menus, indoor scenes, and screenshots of mobile applications.
- **Introduction**: A total of 290,000 images are included, of which 210,000 are used as the training set (with labels) and 80,000 as the test set (without labels). The dataset is collected from Chinese street views and is formed by cropping the text-line regions (such as shop signs and landmarks) from street-view images. All images are preprocessed: using an affine transform, each text region is proportionally mapped to an image with a height of 48 pixels, as shown in the figure:
- 5,990 characters, including Chinese characters, English letters, numbers, and punctuation (character set: https://github.com/YCG09/chinese_ocr/blob/master/train/char_std_5990.txt)
- Each sample is fixed at 10 characters, randomly cropped from sentences in the corpus
- **Introduction**: It includes 10,166 images, 5,603 in the training set and 4,563 in the test set. It is composed of three parts: Total-Text, SCUT-CTW1500, and Baidu curved scene text, covering text of various shapes, including horizontal, multi-oriented, and curved.
Here we have sorted out the commonly used handwritten OCR datasets, which are being updated continuously. Welcome to contribute datasets~
- **Data introduction**:
* It includes online and offline handwritten data. `HWDB1.0~1.2` contains 3,895,135 handwritten single-character samples in 7,356 categories (7,185 Chinese characters and 171 English letters, numbers, and symbols); `HWDB2.0~2.2` contains 5,091 pages of images, divided into 52,230 text lines and 1,349,414 characters. All text and character samples are stored as grayscale images. Some sample characters are shown below.
- **Usage suggestion**: The data consists of single characters on a white background, which can be combined into a large number of text lines for training. The white background can be made transparent, making it easy to add various backgrounds. If semantics are required, it is recommended to crop single characters from a real corpus to form text lines.
- **Data introduction**: The NIST19 dataset is suitable for training handwritten-document and character recognition models. It is extracted from handwritten sample forms of 3,600 writers and contains 810,000 character images in total. Nine of them are shown below.
The image annotation after **json.dumps()** encoding is a list containing multiple dictionaries.
The `points` in the dictionary represent the coordinates (x, y) of the four points of the text box, arranged clockwise from the point at the upper left corner.
`transcription` represents the text of the current text box. **When its content is "###" it means that the text box is invalid and will be skipped during training.**
If you want to train PaddleOCR on other datasets, please build the annotation file according to the above format.
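For reference, a single annotation line has roughly the following shape (the file name, text, and coordinates are illustrative); the image path and the json-encoded annotation are separated by `\t`:

```
" Image file name             Image annotation information encoded by json.dumps"
ch4_test_images/img_61.jpg    [{"transcription": "MASA", "points": [[310, 104], [416, 141], [418, 216], [312, 179]]}]
```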
### 1.2 Public dataset
| dataset | Image download link | PaddleOCR format annotation download link |
|---|---|---|
| ctw1500 | https://paddleocr.bj.bcebos.com/dataset/ctw1500.zip | Included in the downloaded image zip |
| total text | https://paddleocr.bj.bcebos.com/dataset/total_text.tar | Included in the downloaded image zip |
#### 1.2.1 ICDAR 2015
The icdar2015 dataset contains a training set of 1,000 images and a test set of 500 images, both captured with wearable cameras. It can be downloaded from the link in the table above; registration is required for downloading.
After registering and logging in, download the parts marked in the red box in the figure below. Save the content downloaded via `Training Set Images` to the folder `icdar_c4_train_imgs`, and the content downloaded via `Test Set Images` to the folder `ch4_test_images`.
Decompress the downloaded dataset to the working directory, assuming it is decompressed under PaddleOCR/train_data/. Then download the PaddleOCR format annotation file from the table above.
PaddleOCR also provides a data format conversion script that converts the labels from the official website into the PaddleOCR format. The conversion tool is in `ppocr/utils/gen_label.py`; the following takes the training set as an example:
```
# Convert the label file downloaded from the official website to train_icdar2015_label.txt
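# A sketch of the conversion command; the paths below are placeholders to be replaced with your own
python gen_label.py --mode="det" --root_path="/path/to/icdar_c4_train_imgs/" \
                    --input_path="/path/to/ch4_training_localization_transcription_gt" \
                    --output_label="/path/to/train_icdar2015_label.txt"

# After decompressing the images and downloading the annotation files, PaddleOCR/train_data/ contains: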
└─ icdar_c4_train_imgs/ Training data of icdar dataset
└─ ch4_test_images/ Testing data of icdar dataset
└─ train_icdar2015_label.txt Training annotation of icdar dataset
└─ test_icdar2015_label.txt Test annotation of icdar dataset
```
## 2. Text recognition
### 2.1 PaddleOCR text recognition format annotation
The text recognition algorithm in PaddleOCR supports two data formats:
- `lmdb`: used to train on datasets stored in lmdb format, loaded with [lmdb_dataset.py](../../../ppocr/data/lmdb_dataset.py);
- `common dataset`: used to train on datasets stored in text files, loaded with [simple_dataset.py](../../../ppocr/data/simple_dataset.py).
If you want to use your own data for training, please refer to the following to organize your data.
- Training set
It is recommended to put the training images in the same folder and use a txt file (rec_gt_train.txt) to store the image paths and labels. The contents of the txt file are as follows:
* Note: By default, the image path and the image label are separated by `\t`; using any other separator will cause a training error.
```
" Image file name Image annotation "
train_data/rec/train/word_001.jpg 简单可依赖
train_data/rec/train/word_002.jpg 用科技让复杂的世界更简单
...
```
The final training set should have the following file structure:
```
|-train_data
|-rec
|- rec_gt_train.txt
|- train
|- word_001.png
|- word_002.jpg
|- word_003.jpg
| ...
```
- Test set
Similar to the training set, the test set also needs a folder containing all images (test) and a rec_gt_test.txt file. The structure of the test set is as follows:
```
|-train_data
|-rec
|-ic15_data
|- rec_gt_test.txt
|- test
|- word_001.jpg
|- word_002.jpg
|- word_003.jpg
| ...
```
### 2.2 Public dataset
| dataset | Image download link | PaddleOCR format annotation download link |
|---|---|---|
| en benchmark(MJ, SJ, IIIT, SVT, IC03, IC13, IC15, SVTP, and CUTE.) | [DTRB](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here) | LMDB format, which can be loaded directly with [lmdb_dataset.py](../../../ppocr/data/lmdb_dataset.py) |
| Multilingual datasets |[Baidu network disk](https://pan.baidu.com/s/1bS_u207Rm7YbY33wOECKDA) Extraction code: frgi <br>[google drive](https://drive.google.com/file/d/18cSWX7wXSy4G0tbKJ0d9PuIaiwRLHpjA/view) | Included in the downloaded image zip |
#### 2.2.1 ICDAR 2015
The ICDAR 2015 dataset can be downloaded from the link in the table above for quick validation. The lmdb format dataset required by en benchmark can also be downloaded from the table above.
Then download the PaddleOCR format annotation file from the table above.
PaddleOCR also provides a data format conversion script that converts the labels from the ICDAR official website into the data format supported by PaddleOCR. The conversion tool is in `ppocr/utils/gen_label.py`; the following takes the training set as an example:
```
# Convert the label file downloaded from the official website to rec_gt_label.txt
# (a sketch; the input path is a placeholder to be replaced with your own)
python gen_label.py --mode="rec" --input_path="{path/of/origin/label}" --output_label="rec_gt_label.txt"
```
The data format is as follows, (a) is the original picture, (b) is the Ground Truth text file corresponding to each picture:

## 3. Data storage path
The default storage path for PaddleOCR training data is `PaddleOCR/train_data`. If you already have a dataset on your disk, just create a soft link to the dataset directory:
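For example (a sketch; replace the placeholder paths with your own):

```
# Linux / macOS
ln -s <path/to/dataset> <path/to/paddle_ocr>/train_data/dataset
# Windows
mklink /d <path\to\paddle_ocr>\train_data\dataset <path\to\dataset>
```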
- [2. TAL Table Recognition Competition Dataset](#2-tal-table-recognition-competition-dataset)
Here are the commonly used table recognition datasets, which are being updated continuously. Welcome to contribute datasets~
## Dataset Summary
| dataset | Image download link | PPOCR format annotation download link |
|---|---|---|
| PubTabNet |https://github.com/ibm-aur-nlp/PubTabNet| jsonl format, which can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
| TAL Table Recognition Competition Dataset |https://ai.100tal.com/dataset| jsonl format, which can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
## 1. PubTabNet
- **Data Introduction**: The training set of the PubTabNet dataset contains 500,000 images, and the validation set contains 9,000 images. Some of the images are visualized below.
- **Note**: Use of this dataset requires compliance with the [CDLA-Permissive](https://cdla.io/permissive-1-0/) license.
## 2. TAL Table Recognition Competition Dataset
- **Data Introduction**: The training set of the TAL table recognition competition dataset contains 16,000 images. The validation set does not provide trainable annotations.
Here we have sorted out the commonly used vertical and multi-language OCR datasets, which are being updated continuously. Welcome to contribute datasets~
- **Data introduction**: This is a data synthesis toolkit that outputs captcha images from input text. Several demo images generated with the toolkit are shown below.


- **Download address**: The dataset is generated on demand and has no download address.
  - [2.2 Load Trained Model and Continue Training](#22-load-trained-model-and-continue-training)
  - [2.3 Training with New Backbone](#23-training-with-new-backbone)
  - [2.4 Training with Knowledge Distillation](#24-training-with-knowledge-distillation)
- [3. Evaluation and Test](#3-evaluation-and-test)
  - [3.1 Evaluation](#31-evaluation)
  - [3.2 Test](#32-test)
- [4. Inference](#4-inference)
- [5. FAQ](#5-faq)
## 1. Data and Weights Preparation
### 1.1 Data Preparation
The icdar2015 dataset contains a training set of 1,000 images and a test set of 500 images, both captured with wearable cameras. It can be obtained from the [official website](https://rrc.cvc.uab.es/?ch=4&com=downloads); registration is required for downloading.
To prepare datasets, refer to [ocr_datasets](./dataset/ocr_datasets_en.md).
After registering and logging in, download the parts marked in the red box in the figure below. Save the content downloaded via `Training Set Images` to the folder `icdar_c4_train_imgs`, and the content downloaded via `Test Set Images` to the folder `ch4_test_images`.
Decompress the downloaded dataset to the working directory, assuming it is decompressed under PaddleOCR/train_data/. In addition, PaddleOCR collects the many scattered annotation files into two separate annotation files, one for training and one for testing, which can be downloaded with wget:
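For example (a sketch; substitute the actual annotation-file download links given in the dataset documentation):

```
# Download the PaddleOCR-format annotation files into ./train_data/
# (the URLs are placeholders; use the links from the dataset documentation)
wget -P ./train_data/ <train_icdar2015_label.txt download link>
wget -P ./train_data/ <test_icdar2015_label.txt download link>
```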
The image annotation after **json.dumps()** encoding is a list containing multiple dictionaries.
The `points` in the dictionary represent the coordinates (x, y) of the four points of the text box, arranged clockwise from the point at the upper left corner.
`transcription` represents the text of the current text box. **When its content is "###" it means that the text box is invalid and will be skipped during training.**
If you want to train PaddleOCR on other datasets, please build the annotation file according to the above format.
  - [2.3 Multi-language Training](#Multi_language)
  - [2.4 Training with Knowledge Distillation](#24-training-with-knowledge-distillation)
- [3. Evaluation](#3-evaluation)
- [4. Prediction](#4-prediction)
- [5. Convert to Inference Model](#5-convert-to-inference-model)
<a name="DATA_PREPARATION"></a>
## 1. Data Preparation
### 1.1 Dataset Preparation
PaddleOCR supports two data formats:
- `LMDB`: used to train on datasets stored in lmdb format (LMDBDataSet);
- `general data`: used to train on datasets stored in text files (SimpleDataSet).
To prepare datasets, refer to [ocr_datasets](./dataset/ocr_datasets.md).
Please organize the dataset as follows:
The default storage path for training data is `PaddleOCR/train_data`. If you already have a dataset on your disk, just create a soft link to the dataset directory:
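For example (a sketch; replace the placeholder paths with your own):

```
# Linux / macOS
ln -s <path/to/dataset> <path/to/paddle_ocr>/train_data/dataset
# Windows
mklink /d <path\to\paddle_ocr>\train_data\dataset <path\to\dataset>
```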
If you want to use your own data for training, please refer to the following to organize your data.
- Training set
It is recommended to put the training images in the same folder and use a txt file (rec_gt_train.txt) to store the image paths and labels. The contents of the txt file are as follows:
* Note: By default, the image path and the image label are separated by `\t`; using any other separator will cause a training error.
```
" Image file name Image annotation "
train_data/rec/train/word_001.jpg 简单可依赖
train_data/rec/train/word_002.jpg 用科技让复杂的世界更简单
...
```
The final training set should have the following file structure:
```
|-train_data
|-rec
|- rec_gt_train.txt
|- train
|- word_001.png
|- word_002.jpg
|- word_003.jpg
| ...
```
- Test set
Similar to the training set, the test set also needs a folder containing all images (test) and a rec_gt_test.txt file. The structure of the test set is as follows:
```
|-train_data
|-rec
|-ic15_data
|- rec_gt_test.txt
|- test
|- word_001.jpg
|- word_002.jpg
|- word_003.jpg
| ...
```
<a name="Dataset_download"></a>
### 1.2 Dataset Download
- ICDAR2015
If you do not have a dataset locally, you can download it on the official website [icdar2015](http://rrc.cvc.uab.es/?ch=4&com=downloads).
Also refer to [DTRB](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here) to download the lmdb-format datasets required for the benchmark.
If you want to reproduce the SAR paper, you also need to download the extra dataset [SynthAdd](https://pan.baidu.com/share/init?surl=uV0LtoNmcxbO-0YA7Ch4dg) (extraction code: 627x). Besides, the icdar2013, icdar2015, cocotext, and IIIT5k datasets are also used for training. For details, please refer to the SAR paper.
PaddleOCR provides label files for training the icdar2015 dataset, which can be downloaded in the following ways:
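For example (a sketch; substitute the actual label-file download links given in the dataset documentation):

```
# Download the icdar2015 label files into ./train_data/ic15_data/
# (the URLs are placeholders; use the links from the dataset documentation)
wget -P ./train_data/ic15_data/ <rec_gt_train.txt download link>
wget -P ./train_data/ic15_data/ <rec_gt_test.txt download link>
```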
The data format is as follows, (a) is the original picture, (b) is the Ground Truth text file corresponding to each picture:

- Multilingual dataset
The multi-language model is trained in the same way as the Chinese model. The training set consists of 1 million synthetic images; a small number of fonts and test data can be downloaded using the following two methods.
Finally, a dictionary ({word_dict_name}.txt) needs to be provided so that, when the model is trained, every character that appears can be mapped to an index in the dictionary.
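As an illustration, the dictionary file lists one character per line, and the line number (starting from 0) serves as that character's index; the characters below are arbitrary examples:

```
a
b
c
1
2
的
简
```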