Merge pull request #969 from opendatalab/release-0.9.3

Release 0.9.3

Merge pull request #969 from opendatalab/release-0.9.3
Release 0.9.3
845a3ff0 · Xiaomeng Zhao · GitHub · d0558abb · 6083e109 · 845a3ff0
Unverified Commit 845a3ff0 authored Nov 15, 2024 by Xiaomeng Zhao Committed by GitHub Nov 15, 2024
20 changed files
--- a/.gitignore
+++ b/.gitignore
@@ -48,3 +48,6 @@ debug_utils/
 # sphinx docs
 _build/
+output/
\ No newline at end of file
--- a/README.md
+++ b/README.md
@@ -42,6 +42,7 @@
 </div>
 # Changelog
+- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.
 - 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition functionality.
 - 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
  - Refactored the sorting module code to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading order sorting, ensuring high accuracy in various layouts.
@@ -246,7 +247,7 @@ You can modify certain configurations in this file to enable or disable features
        "enable": true  // The formula recognition feature is enabled by default. If you need to disable it, please change the value here to "false".
    },
    "table-config": {
-        "model": "tablemaster",  // When using structEqTable, please change to "struct_eqtable".
+        "model": "rapid_table",  // When using structEqTable, please change to "struct_eqtable".
        "enable": false, // The table recognition feature is disabled by default. If you need to enable it, please change the value here to "true".
        "max_time": 400
    }
@@ -261,7 +262,7 @@ If your device supports CUDA and meets the GPU requirements of the mainline envi
 - [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
 - Quick Deployment with Docker
 > [!IMPORTANT]
-> Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
+> Docker requires a GPU with at least 8GB of VRAM, and all acceleration features are enabled by default.
 >
 > Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker.
 > 
@@ -421,7 +422,9 @@ This project currently uses PyMuPDF to achieve advanced functionality. However,
 # Acknowledgments
 - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
+- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
 - [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
+- [RapidTable](https://github.com/RapidAI/RapidTable)
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
 - [layoutreader](https://github.com/ppaanngggg/layoutreader)

--- a/README_ja-JP.md
+++ b/README_ja-JP.md
+> [!Warning]
+> このドキュメントはすでに古くなっています。最新版のドキュメントを参照してください：[ENGLISH](README.md)。
 <div id="top">
 <p align="center">
@@ -18,9 +20,7 @@
 <a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 200px; height: 55px;"/></a>
-<div align="center" style="color: red; background-color: #ffdddd; padding: 10px; border: 1px solid red; border-radius: 5px;">
-  <strong>NOTE：</strong> このドキュメントはすでに古くなっています。最新版のドキュメントを参照してください。
-</div>
 [English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)

--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -42,7 +42,7 @@
 </div>
 # 更新记录
+- 2024/11/15 0.9.3发布，为表格识别功能接入了[RapidTable](https://github.com/RapidAI/RapidTable),单表解析速度提升10倍以上，准确率更高，显存占用更低
 - 2024/11/06 0.9.2发布，为表格识别功能接入了[StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B)模型
 - 2024/10/31 0.9.0发布，这是我们进行了大量代码重构的全新版本，解决了众多问题，提升了性能，降低了硬件需求，并提供了更丰富的易用性：
  - 重构排序模块代码，使用 [layoutreader](https://github.com/ppaanngggg/layoutreader) 进行阅读顺序排序，确保在各种排版下都能实现极高准确率
@@ -188,13 +188,13 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
        <td rowspan="2">GPU硬件支持列表</td>
        <td colspan="2">最低要求 8G+显存</td>
        <td colspan="2">3060ti/3070/4060<br>
-        8G显存可开启layout、公式识别和ocr加速</td>
+        8G显存可开启全部加速功能(表格仅限rapid_table)</td>
        <td rowspan="2">None</td>
    </tr>
    <tr>
        <td colspan="2">推荐配置 10G+显存</td>
        <td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070tisuper/4080/4090<br>
-        10G显存及以上可以同时开启layout、公式识别和ocr加速和表格识别加速<br>
+        10G显存及以上可开启全部加速功能<br>
        </td>
    </tr>
 </table>
@@ -251,7 +251,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
        "enable": true  // 公式识别功能默认是开启的，如果需要关闭请修改此处的值为"false"
    },
    "table-config": {
-        "model": "tablemaster",  // 使用structEqTable请修改为"struct_eqtable"
+        "model": "rapid_table",  // 使用structEqTable请修改为"struct_eqtable"
        "enable": false, // 表格识别功能默认是关闭的，如果需要开启请修改此处的值为"true"
        "max_time": 400
    }
@@ -266,7 +266,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
 - [Windows10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
 - 使用Docker快速部署
 > [!IMPORTANT]
-> Docker 需设备gpu显存大于等于16GB，默认开启所有加速功能
+> Docker 需设备gpu显存大于等于8GB，默认开启所有加速功能
 > 
 > 运行本docker前可以通过以下命令检测自己的设备是否支持在docker上使用CUDA加速
 > 
@@ -431,6 +431,7 @@ TODO
 - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
 - [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
 - [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
+- [RapidTable](https://github.com/RapidAI/RapidTable)
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
 - [layoutreader](https://github.com/ppaanngggg/layoutreader)

--- a/demo/magic_pdf_parse_main.py
+++ b/demo/magic_pdf_parse_main.py
@@ -19,9 +19,10 @@ def json_md_dump(
        pdf_name,
        content_list,
        md_content,
+        orig_model_list,
 ):
    # 写入模型结果到 model.json
-    orig_model_list = copy.deepcopy(pipe.model_list)
    md_writer.write(
        content=json.dumps(orig_model_list, ensure_ascii=False, indent=4),
        path=f"{pdf_name}_model.json"
@@ -87,9 +88,12 @@ def pdf_parse_main(
        pdf_bytes = open(pdf_path, "rb").read()  # 读取 pdf 文件的二进制数据
+        orig_model_list = []
        if model_json_path:
            # 读取已经被模型解析后的pdf文件的 json 原始数据，list 类型
            model_json = json.loads(open(model_json_path, "r", encoding="utf-8").read())
+            orig_model_list = copy.deepcopy(model_json)
        else:
            model_json = []
@@ -115,8 +119,9 @@ def pdf_parse_main(
        pipe.pipe_classify()
        # 如果没有传入模型数据，则使用内置模型解析
-        if not model_json:
+        if len(model_json) == 0:
            pipe.pipe_analyze()  # 解析
+            orig_model_list = copy.deepcopy(pipe.model_list)
        # 执行解析
        pipe.pipe_parse()
@@ -126,7 +131,7 @@ def pdf_parse_main(
        md_content = pipe.pipe_mk_markdown(image_path_parent, drop_mode="none")
        if is_json_md_dump:
-            json_md_dump(pipe, md_writer, pdf_name, content_list, md_content)
+            json_md_dump(pipe, md_writer, pdf_name, content_list, md_content, orig_model_list)
        if is_draw_visualization_bbox:
            draw_visualization_bbox(pipe.pdf_mid_data['pdf_info'], pdf_bytes, output_path, pdf_name)

--- a/magic-pdf.template.json
+++ b/magic-pdf.template.json
@@ -15,7 +15,7 @@
        "enable": true
    },
    "table-config": {
-        "model": "tablemaster",
+        "model": "rapid_table",
        "enable": false,
        "max_time": 400
    },

--- a/magic_pdf/dict2md/ocr_mkcontent.py
+++ b/magic_pdf/dict2md/ocr_mkcontent.py
@@ -168,7 +168,7 @@ def merge_para_with_text(para_block):
                        # 如果是前一行带有-连字符，那么末尾不应该加空格
                        if __is_hyphen_at_line_end(content):
                            para_text += content[:-1]
-                        elif len(content) == 1 and content not in ['A', 'I', 'a', 'i']:
+                        elif len(content) == 1 and content not in ['A', 'I', 'a', 'i'] and not content.isdigit():
                            para_text += content
                        else:  # 西方文本语境下 content间需要空格分隔
                            para_text += f"{content} "

--- a/magic_pdf/libs/Constants.py
+++ b/magic_pdf/libs/Constants.py
@@ -50,4 +50,6 @@ class MODEL_NAME:
    YOLO_V8_MFD = "yolo_v8_mfd"
    UniMerNet_v2_Small = "unimernet_small"
\ No newline at end of file
+    RAPID_TABLE = "rapid_table"
\ No newline at end of file
--- a/magic_pdf/libs/config_reader.py
+++ b/magic_pdf/libs/config_reader.py
@@ -92,7 +92,7 @@ def get_table_recog_config():
    table_config = config.get('table-config')
    if table_config is None:
        logger.warning(f"'table-config' not found in {CONFIG_FILE_NAME}, use 'False' as default")
-        return json.loads(f'{{"model": "{MODEL_NAME.TABLE_MASTER}","enable": false, "max_time": 400}}')
+        return json.loads(f'{{"model": "{MODEL_NAME.RAPID_TABLE}","enable": false, "max_time": 400}}')
    else:
        return table_config

--- a/magic_pdf/libs/draw_bbox.py
+++ b/magic_pdf/libs/draw_bbox.py
@@ -369,10 +369,16 @@ def draw_line_sort_bbox(pdf_info, pdf_bytes, out_path, filename):
            if block['type'] in [BlockType.Image, BlockType.Table]:
                for sub_block in block['blocks']:
                    if sub_block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
-                        for line in sub_block['virtual_lines']:
+                        if len(sub_block['virtual_lines']) > 0 and sub_block['virtual_lines'][0].get('index', None) is not None:
-                            bbox = line['bbox']
+                            for line in sub_block['virtual_lines']:
-                            index = line['index']
+                                bbox = line['bbox']
-                            page_line_list.append({'index': index, 'bbox': bbox})
+                                index = line['index']
+                                page_line_list.append({'index': index, 'bbox': bbox})
+                        else:
+                            for line in sub_block['lines']:
+                                bbox = line['bbox']
+                                index = line['index']
+                                page_line_list.append({'index': index, 'bbox': bbox})
                    elif sub_block['type'] in [BlockType.ImageCaption, BlockType.TableCaption, BlockType.ImageFootnote, BlockType.TableFootnote]:
                        for line in sub_block['lines']:
                            bbox = line['bbox']

--- a/magic_pdf/model/pdf_extract_kit.py
+++ b/magic_pdf/model/pdf_extract_kit.py
--- a/magic_pdf/model/pek_sub_modules/post_process.py
+++ b/magic_pdf/model/pek_sub_modules/post_process.py
-import re
-def layout_rm_equation(layout_res):
-    rm_idxs = []
-    for idx, ele in enumerate(layout_res['layout_dets']):
-        if ele['category_id'] == 10:
-            rm_idxs.append(idx)
-    for idx in rm_idxs[::-1]:
-        del layout_res['layout_dets'][idx]
-    return layout_res
-def get_croped_image(image_pil, bbox):
-    x_min, y_min, x_max, y_max = bbox
-    croped_img = image_pil.crop((x_min, y_min, x_max, y_max))
-    return croped_img
-def latex_rm_whitespace(s: str):
-    """Remove unnecessary whitespace from LaTeX code.
-    """
-    text_reg = r'(\\(operatorname|mathrm|text|mathbf)\s?\*? {.*?})'
-    letter = '[a-zA-Z]'
-    noletter = '[\W_^\d]'
-    names = [x[0].replace(' ', '') for x in re.findall(text_reg, s)]
-    s = re.sub(text_reg, lambda match: str(names.pop(0)), s)
-    news = s
-    while True:
-        s = news
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', news)
-        news = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', news)
-        if news == s:
-            break
-    return s
\ No newline at end of file
--- a/magic_pdf/model/pek_sub_modules/__init__.py
+++ b/magic_pdf/model/pek_sub_modules/__init__.py
--- a/magic_pdf/model/pek_sub_modules/layoutlmv3/__init__.py
+++ b/magic_pdf/model/pek_sub_modules/layoutlmv3/__init__.py
--- a/magic_pdf/model/sub_modules/layout/doclayout_yolo/DocLayoutYOLO.py
+++ b/magic_pdf/model/sub_modules/layout/doclayout_yolo/DocLayoutYOLO.py
+from doclayout_yolo import YOLOv10
+class DocLayoutYOLOModel(object):
+    def __init__(self, weight, device):
+        self.model = YOLOv10(weight)
+        self.device = device
+    def predict(self, image):
+        layout_res = []
+        doclayout_yolo_res = self.model.predict(image, imgsz=1024, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
+        for xyxy, conf, cla in zip(doclayout_yolo_res.boxes.xyxy.cpu(), doclayout_yolo_res.boxes.conf.cpu(),
+                                   doclayout_yolo_res.boxes.cls.cpu()):
+            xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
+            new_item = {
+                'category_id': int(cla.item()),
+                'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
+                'score': round(float(conf.item()), 3),
+            }
+            layout_res.append(new_item)
+        return layout_res
\ No newline at end of file
--- a/magic_pdf/model/pek_sub_modules/structeqtable/__init__.py
+++ b/magic_pdf/model/pek_sub_modules/structeqtable/__init__.py
--- a/magic_pdf/model/v3/__init__.py
+++ b/magic_pdf/model/v3/__init__.py
--- a/magic_pdf/model/pek_sub_modules/layoutlmv3/backbone.py
+++ b/magic_pdf/model/pek_sub_modules/layoutlmv3/backbone.py
--- a/magic_pdf/model/pek_sub_modules/layoutlmv3/beit.py
+++ b/magic_pdf/model/pek_sub_modules/layoutlmv3/beit.py
--- a/magic_pdf/model/pek_sub_modules/layoutlmv3/deit.py
+++ b/magic_pdf/model/pek_sub_modules/layoutlmv3/deit.py