Merge pull request #2514 from opendatalab/release-1.3.12

Release 1.3.12

Commit a989444e, authored May 24, 2025 by Xiaomeng Zhao; committed by GitHub May 24, 2025
13 changed files
--- a/README.md
+++ b/README.md
@@ -48,6 +48,20 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
 </div>

 # Changelog
+- 2025/05/24 1.3.12 Released
+  - Added support for ppocrv5 model, updated `ch_server` model to `PP-OCRv5_rec_server` and `ch_lite` model to `PP-OCRv5_rec_mobile` (model update required)
+    - In testing, we found that ppocrv5(server) shows some improvement for handwritten documents, but slightly lower accuracy than v4_server_doc for other document types. Therefore, the default ch model remains unchanged as `PP-OCRv4_server_rec_doc`.
+    - Since ppocrv5 enhances recognition of handwritten text and special characters, you can manually select a ppocrv5 model for mixed Japanese and Traditional Chinese text and for handwritten documents
+    - You can select the appropriate model through the lang parameter `lang='ch_server'` (python api) or `--lang ch_server` (command line):
+      - `ch`: `PP-OCRv4_rec_server_doc` (default) (Chinese, English, Japanese, Traditional Chinese mixed/15k dictionary)
+      - `ch_server`: `PP-OCRv5_rec_server` (Chinese, English, Japanese, Traditional Chinese mixed + handwriting/18k dictionary)
+      - `ch_lite`: `PP-OCRv5_rec_mobile` (Chinese, English, Japanese, Traditional Chinese mixed + handwriting/18k dictionary)
+      - `ch_server_v4`: `PP-OCRv4_rec_server` (Chinese, English mixed/6k dictionary)
+      - `ch_lite_v4`: `PP-OCRv4_rec_mobile` (Chinese, English mixed/6k dictionary)
+  - Added support for handwritten documents by optimizing layout recognition of handwritten text areas
+    - This feature is enabled by default; no additional configuration is needed
+    - You can refer to the instructions above to manually select ppocrv5 model for better handwritten document parsing
+  - The demos on `huggingface` and `modelscope` have been updated to support handwriting recognition and ppocrv5 models, which you can experience online
 - 2025/04/29 1.3.10 Released
  - Support for custom formula delimiters can be achieved by modifying the `latex-delimiter-config` item in the `magic-pdf.json` file under the user directory.
 - 2025/04/27 1.3.9 Released  
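
The `lang` keys listed in the 1.3.12 changelog entry can be read as a simple lookup table. The following is a minimal sketch transcribed from that list; `OCR_LANG_MODELS` and `resolve_rec_model` are hypothetical names used only for illustration and are not part of the MinerU API.

```python
# Mapping transcribed from the 1.3.12 changelog entry.
# OCR_LANG_MODELS and resolve_rec_model are hypothetical, illustrative names.
OCR_LANG_MODELS = {
    "ch": "PP-OCRv4_rec_server_doc",       # default, 15k dictionary
    "ch_server": "PP-OCRv5_rec_server",    # 18k dictionary, adds handwriting
    "ch_lite": "PP-OCRv5_rec_mobile",      # 18k dictionary, adds handwriting
    "ch_server_v4": "PP-OCRv4_rec_server", # 6k dictionary
    "ch_lite_v4": "PP-OCRv4_rec_mobile",   # 6k dictionary
}


def resolve_rec_model(lang: str = "ch") -> str:
    """Return the recognition model name the changelog associates with `lang`."""
    try:
        return OCR_LANG_MODELS[lang]
    except KeyError:
        known = ", ".join(sorted(OCR_LANG_MODELS))
        raise ValueError(f"unknown lang {lang!r}; expected one of: {known}") from None
```

Note that `ch_server` and `ch_lite` now point at the v5 models, so anyone who wants the previous behaviour has to switch to the `_v4` keys explicitly.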

--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -47,6 +47,20 @@
 </div>

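 # Changelog
+- 2025/05/24 1.3.12 Released
+  - Added support for the ppocrv5 model: the `ch_server` model is updated to `PP-OCRv5_rec_server` and the `ch_lite` model to `PP-OCRv5_rec_mobile` (model update required)
+    - In testing, ppocrv5(server) showed some improvement on handwritten documents but slightly lower accuracy than v4_server_doc on other document types, so the default ch model is unchanged and remains `PP-OCRv4_server_rec_doc`.
+    - Because ppocrv5 strengthens recognition of handwriting and special characters, you can manually select a ppocrv5 model for mixed Japanese and Traditional Chinese scenarios and for handwritten documents
+    - You can select the model through the lang parameter, `lang='ch_server'` (python api) or `--lang ch_server` (command line):
+      - `ch`: `PP-OCRv4_rec_server_doc` (default) (Chinese, English, Japanese, Traditional Chinese mixed/15k dictionary)
+      - `ch_server`: `PP-OCRv5_rec_server` (Chinese, English, Japanese, Traditional Chinese mixed + handwriting/18k dictionary)
+      - `ch_lite`: `PP-OCRv5_rec_mobile` (Chinese, English, Japanese, Traditional Chinese mixed + handwriting/18k dictionary)
+      - `ch_server_v4`: `PP-OCRv4_rec_server` (Chinese, English mixed/6k dictionary)
+      - `ch_lite_v4`: `PP-OCRv4_rec_mobile` (Chinese, English mixed/6k dictionary)
+  - Added support for handwritten documents: layout recognition of handwritten text areas has been optimized, so handwritten documents can now be parsed
+    - Supported by default; no additional configuration needed
+    - Refer to the instructions above to manually select a ppocrv5 model for better handwritten document parsing
+  - The `huggingface` and `modelscope` demos have been updated to versions that support handwriting recognition and the ppocrv5 models; you can try them online
 - 2025/04/29 1.3.10 Released
  - Custom formula delimiters are supported by modifying the `latex-delimiter-config` item in the `magic-pdf.json` file under the user directory.
 - 2025/04/27 1.3.9 Released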

--- a/magic_pdf/data/utils.py
+++ b/magic_pdf/data/utils.py
@@ -10,22 +10,22 @@ from loguru import logger



-def fitz_doc_to_image(doc, dpi=200) -> dict:
+def fitz_doc_to_image(page, dpi=200) -> dict:
    """Convert fitz.Document to image, Then convert the image to numpy array.

    Args:
-        doc (_type_): PyMuPDF page
+        page (_type_): PyMuPDF page
        dpi (int, optional): rendering resolution in DPI. Defaults to 200.

    Returns:
        dict:  {'img': numpy array, 'width': width, 'height': height }
    """
    mat = fitz.Matrix(dpi / 72, dpi / 72)
-    pm = doc.get_pixmap(matrix=mat, alpha=False)
+    pm = page.get_pixmap(matrix=mat, alpha=False)

    # If the width or height exceeds 4500 after scaling, do not scale further.
    if pm.width > 4500 or pm.height > 4500:
-        pm = doc.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)
+        pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)

    # Convert pixmap samples directly to numpy array
    img = np.frombuffer(pm.samples, dtype=np.uint8).reshape(pm.height, pm.width, 3)
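
The scaling rule in `fitz_doc_to_image` can be checked without PyMuPDF: the page is rendered at `dpi / 72`, and if either pixel dimension would exceed 4500 the function falls back to a 1:1 matrix. The sketch below reproduces only that arithmetic; `render_size` is a hypothetical helper, and its rounding is an assumption that may differ from PyMuPDF's pixmap dimensions by a pixel.

```python
def render_size(width_pt: float, height_pt: float, dpi: int = 200, limit: int = 4500):
    """Approximate (width_px, height_px, scale) for a page measured in PDF points.

    Mirrors the fallback in fitz_doc_to_image: render at dpi / 72, but use
    scale 1.0 when either dimension would exceed `limit` pixels.
    Hypothetical helper; rounding here is an assumption.
    """
    scale = dpi / 72  # PDF user space has 72 units per inch
    width_px = round(width_pt * dpi / 72)
    height_px = round(height_pt * dpi / 72)
    if width_px > limit or height_px > limit:
        scale = 1.0
        width_px, height_px = round(width_pt), round(height_pt)
    return width_px, height_px, scale
```

For a US Letter page (612 x 792 points) at the default 200 DPI this gives 1700 x 2200 pixels, well under the limit; a 2000 x 2000 point page would exceed it and is rendered unscaled.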

--- a/magic_pdf/dict2md/ocr_mkcontent.py
+++ b/magic_pdf/dict2md/ocr_mkcontent.py
@@ -70,19 +70,34 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
            if mode == 'nlp':
                continue
            elif mode == 'mm':
-                for block in para_block['blocks']:  # 1st: append image_body
-                    if block['type'] == BlockType.ImageBody:
-                        for line in block['lines']:
-                            for span in line['spans']:
-                                if span['type'] == ContentType.Image:
-                                    if span.get('image_path', ''):
-                                        para_text += f"\n![]({join_path(img_buket_path, span['image_path'])})  \n"
-                for block in para_block['blocks']:  # 2nd: append image_caption
-                    if block['type'] == BlockType.ImageCaption:
-                        para_text += merge_para_with_text(block) + '  \n'
-                for block in para_block['blocks']:  # 3rd: append image_footnote
-                    if block['type'] == BlockType.ImageFootnote:
-                        para_text += merge_para_with_text(block) + '  \n'
+                # Check whether an image footnote exists
+                has_image_footnote = any(block['type'] == BlockType.ImageFootnote for block in para_block['blocks'])
+                # If an image footnote exists, append the footnote after the image body
+                if has_image_footnote:
+                    for block in para_block['blocks']:  # 1st: append image_caption
+                        if block['type'] == BlockType.ImageCaption:
+                            para_text += merge_para_with_text(block) + '  \n'
+                    for block in para_block['blocks']:  # 2nd: append image_body
+                        if block['type'] == BlockType.ImageBody:
+                            for line in block['lines']:
+                                for span in line['spans']:
+                                    if span['type'] == ContentType.Image:
+                                        if span.get('image_path', ''):
+                                            para_text += f"![]({img_buket_path}/{span['image_path']})"
+                    for block in para_block['blocks']:  # 3rd: append image_footnote
+                        if block['type'] == BlockType.ImageFootnote:
+                            para_text += '  \n' + merge_para_with_text(block)
+                else:
+                    for block in para_block['blocks']:  # 1st: append image_body
+                        if block['type'] == BlockType.ImageBody:
+                            for line in block['lines']:
+                                for span in line['spans']:
+                                    if span['type'] == ContentType.Image:
+                                        if span.get('image_path', ''):
+                                            para_text += f"![]({img_buket_path}/{span['image_path']})"
+                    for block in para_block['blocks']:  # 2nd: append image_caption
+                        if block['type'] == BlockType.ImageCaption:
+                            para_text += '  \n' + merge_para_with_text(block)
        elif para_type == BlockType.Table:
            if mode == 'nlp':
                continue
@@ -96,20 +111,19 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
                            for span in line['spans']:
                                if span['type'] == ContentType.Table:
                                    # if processed by table model
-                                    if span.get('latex', ''):
-                                        para_text += f"\n\n$\n {span['latex']}\n$\n\n"
-                                    elif span.get('html', ''):
-                                        para_text += f"\n\n{span['html']}\n\n"
+                                    if span.get('html', ''):
+                                        para_text += f"\n{span['html']}\n"
                                    elif span.get('image_path', ''):
-                                        para_text += f"\n![]({join_path(img_buket_path, span['image_path'])})  \n"
+                                        para_text += f"![]({img_buket_path}/{span['image_path']})"
                for block in para_block['blocks']:  # 3rd: append table_footnote
                    if block['type'] == BlockType.TableFootnote:
-                        para_text += merge_para_with_text(block) + '  \n'
+                        para_text += '\n' + merge_para_with_text(block) + '  '

        if para_text.strip() == '':
            continue
        else:
-            page_markdown.append(para_text.strip() + '  ')
+            # page_markdown.append(para_text.strip() + '  ')
+            page_markdown.append(para_text.strip())

    return page_markdown

@@ -257,9 +271,9 @@ def para_to_standard_format_v2(para_block, img_buket_path, page_idx, drop_reason
                        if span['type'] == ContentType.Table:

                            if span.get('latex', ''):
-                                para_content['table_body'] = f"\n\n$\n {span['latex']}\n$\n\n"
+                                para_content['table_body'] = f"{span['latex']}"
                            elif span.get('html', ''):
-                                para_content['table_body'] = f"\n\n{span['html']}\n\n"
+                                para_content['table_body'] = f"{span['html']}"

                            if span.get('image_path', ''):
                                para_content['img_path'] = join_path(img_buket_path, span['image_path'])
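
The new image branch in `ocr_mk_markdown_with_para_core_v2` changes the emission order depending on whether a footnote is present: caption, body, footnote when there is one, and body, caption otherwise. A minimal sketch of that ordering rule follows; the block shape and the string type names are simplified assumptions, not the real `BlockType` values.

```python
def order_image_blocks(blocks):
    """Return image sub-blocks in the order the new branch emits them.

    `blocks` is a list of dicts with a 'type' key. The type strings used here
    ('image_body', 'image_caption', 'image_footnote') are illustrative
    stand-ins for the BlockType members referenced in the diff.
    """
    has_footnote = any(b["type"] == "image_footnote" for b in blocks)
    if has_footnote:
        order = ["image_caption", "image_body", "image_footnote"]
    else:
        order = ["image_body", "image_caption"]
    # One pass per type, as in the diff, so relative order within a type is kept.
    return [b for kind in order for b in blocks if b["type"] == kind]
```

The apparent intent is that a captioned figure with a footnote reads top to bottom as title, image, note, while a figure without a footnote keeps the earlier image-then-caption layout.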

--- a/magic_pdf/model/batch_analyze.py
+++ b/magic_pdf/model/batch_analyze.py
@@ -6,7 +6,7 @@ from tqdm import tqdm
 from magic_pdf.config.constants import MODEL_NAME
 from magic_pdf.model.sub_modules.model_init import AtomModelSingleton
 from magic_pdf.model.sub_modules.model_utils import (
-    clean_vram, crop_img, get_res_list_from_layout_res)
+    clean_vram, crop_img, get_res_list_from_layout_res, get_coords_and_area)
 from magic_pdf.model.sub_modules.ocr.paddleocr2pytorch.ocr_utils import (
    get_adjusted_mfdetrec_res, get_ocr_result_list)

@@ -148,6 +148,19 @@ class BatchAnalyze:
                # Integration results
                if ocr_res:
                    ocr_result_list = get_ocr_result_list(ocr_res, useful_list, ocr_res_list_dict['ocr_enable'], new_image, _lang)
+
+                    if res["category_id"] == 3:
+                        # Sum of the areas of all bboxes in ocr_result_list
+                        ocr_res_area = sum(get_coords_and_area(ocr_res_item)[4] for ocr_res_item in ocr_result_list if 'poly' in ocr_res_item)
+                        # Ratio of ocr_res_area to the area of res
+                        res_area = get_coords_and_area(res)[4]
+                        if res_area > 0:
+                            ratio = ocr_res_area / res_area
+                            if ratio > 0.25:
+                                res["category_id"] = 1
+                            else:
+                                continue
+
                    ocr_res_list_dict['layout_res'].extend(ocr_result_list)

            # det_count += len(ocr_res_list_dict['ocr_res_list'])
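
The hunk in `BatchAnalyze` decides what to do with a category 3 region by comparing the total area of its OCR boxes to the area of the region itself. The sketch below isolates that decision; `poly_area` and `coverage_decision` are hypothetical names, and the diff itself does not say what categories 3 and 1 stand for.

```python
def poly_area(block):
    """Area of the box spanned by poly[0:2] (top-left) and poly[4:6] (bottom-right)."""
    xmin, ymin = int(block["poly"][0]), int(block["poly"][1])
    xmax, ymax = int(block["poly"][4]), int(block["poly"][5])
    return (xmax - xmin) * (ymax - ymin)


def coverage_decision(region, ocr_items, threshold=0.25):
    """Return 'promote', 'skip', or 'keep' for a category 3 region.

    'promote': OCR boxes cover more than `threshold` of the region, so the
               diff sets category_id to 1 and keeps the OCR results.
    'skip':    coverage is at or below the threshold, so the diff discards
               the OCR results for this region (`continue`).
    'keep':    the region has non-positive area; the diff leaves it as is.
    """
    region_area = poly_area(region)
    if region_area <= 0:
        return "keep"
    text_area = sum(poly_area(item) for item in ocr_items if "poly" in item)
    return "promote" if text_area / region_area > threshold else "skip"
```

Because the box areas are simply summed, overlapping OCR boxes are counted more than once, so the ratio is an upper bound on true coverage rather than an exact figure.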

--- a/magic_pdf/model/doc_analyze_by_custom_model.py
+++ b/magic_pdf/model/doc_analyze_by_custom_model.py
@@ -189,7 +189,7 @@ def batch_doc_analyze(
    formula_enable=None,
    table_enable=None,
 ):
-    MIN_BATCH_INFERENCE_SIZE = int(os.environ.get('MINERU_MIN_BATCH_INFERENCE_SIZE', 200))
+    MIN_BATCH_INFERENCE_SIZE = int(os.environ.get('MINERU_MIN_BATCH_INFERENCE_SIZE', 100))
    batch_size = MIN_BATCH_INFERENCE_SIZE
    page_wh_list = []
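
The only change in this file is the fallback for `MINERU_MIN_BATCH_INFERENCE_SIZE`, which drops from 200 to 100. A small sketch of how that lookup behaves with and without the variable set; `min_batch_inference_size` is a hypothetical wrapper around the same `os.environ.get` call.

```python
import os


def min_batch_inference_size(default: int = 100) -> int:
    """Read MINERU_MIN_BATCH_INFERENCE_SIZE, falling back to `default` when unset."""
    return int(os.environ.get("MINERU_MIN_BATCH_INFERENCE_SIZE", default))
```

Setting `MINERU_MIN_BATCH_INFERENCE_SIZE=200` in the environment should therefore restore the previous batch size without touching the code.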


--- a/magic_pdf/model/sub_modules/model_utils.py
+++ b/magic_pdf/model/sub_modules/model_utils.py
@@ -31,10 +31,10 @@ def crop_img(input_res, input_np_img, crop_paste_x=0, crop_paste_y=0):
    return return_image, return_list


-def get_coords_and_area(table):
+def get_coords_and_area(block_with_poly):
    """Extract coordinates and area from a table."""
-    xmin, ymin = int(table['poly'][0]), int(table['poly'][1])
-    xmax, ymax = int(table['poly'][4]), int(table['poly'][5])
+    xmin, ymin = int(block_with_poly['poly'][0]), int(block_with_poly['poly'][1])
+    xmax, ymax = int(block_with_poly['poly'][4]), int(block_with_poly['poly'][5])
    area = (xmax - xmin) * (ymax - ymin)
    return xmin, ymin, xmax, ymax, area

@@ -243,7 +243,7 @@ def get_res_list_from_layout_res(layout_res, iou_threshold=0.7, overlap_threshol
                "bbox": [int(res['poly'][0]), int(res['poly'][1]),
                         int(res['poly'][4]), int(res['poly'][5])],
            })
-        elif category_id in [0, 2, 4, 6, 7]:  # OCR regions
+        elif category_id in [0, 2, 4, 6, 7, 3]:  # OCR regions
            ocr_res_list.append(res)
        elif category_id == 5:  # Table regions
            table_res_list.append(res)

--- a/magic_pdf/model/sub_modules/ocr/paddleocr2pytorch/pytorchocr/modeling/backbones/__init__.py
+++ b/magic_pdf/model/sub_modules/ocr/paddleocr2pytorch/pytorchocr/modeling/backbones/__init__.py
@@ -35,7 +35,7 @@ def build_backbone(config, model_type):
        from .rec_mobilenet_v3 import MobileNetV3
        from .rec_svtrnet import SVTRNet
        from .rec_mv1_enhance import MobileNetV1Enhance
-
+        from .rec_pphgnetv2 import PPHGNetV2_B4
        support_dict = [
            "MobileNetV1Enhance",
            "MobileNetV3",
@@ -48,6 +48,7 @@ def build_backbone(config, model_type):
            "DenseNet",
            "PPLCNetV3",
            "PPHGNet_small",
+            "PPHGNetV2_B4",
        ]
    else:
        raise NotImplementedError

--- a/magic_pdf/model/sub_modules/ocr/paddleocr2pytorch/pytorchocr/modeling/backbones/rec_pphgnetv2.py
+++ b/magic_pdf/model/sub_modules/ocr/paddleocr2pytorch/pytorchocr/modeling/backbones/rec_pphgnetv2.py
--- a/magic_pdf/model/sub_modules/ocr/paddleocr2pytorch/pytorchocr/modeling/necks/rnn.py
+++ b/magic_pdf/model/sub_modules/ocr/paddleocr2pytorch/pytorchocr/modeling/necks/rnn.py
@@ -9,14 +9,27 @@ class Im2Seq(nn.Module):
        super().__init__()
        self.out_channels = in_channels

+    # def forward(self, x):
+    #     B, C, H, W = x.shape
+    #     # assert H == 1
+    #     x = x.squeeze(dim=2)
+    #     # x = x.transpose([0, 2, 1])  # paddle (NTC)(batch, width, channels)
+    #     x = x.permute(0, 2, 1)
+    #     return x
+
    def forward(self, x):
        B, C, H, W = x.shape
-        # assert H == 1
-        x = x.squeeze(dim=2)
-        # x = x.transpose([0, 2, 1])  # paddle (NTC)(batch, width, channels)
-        x = x.permute(0, 2, 1)
-        return x
+        # Handle a 4-D tensor by flattening the spatial dimensions into a sequence
+        if H == 1:
+            # Original logic, for the H == 1 case
+            x = x.squeeze(dim=2)
+            x = x.permute(0, 2, 1)  # (B, W, C)
+        else:
+            # Handle the case where H != 1
+            x = x.permute(0, 2, 3, 1)  # (B, H, W, C)
+            x = x.reshape(B, H * W, C)  # (B, H*W, C)

+        return x

 class EncoderWithRNN_(nn.Module):
    def __init__(self, in_channels, hidden_size):
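
The reworked `Im2Seq.forward` turns a `(B, C, H, W)` feature map into a `(B, H*W, C)` sequence, with `H == 1` kept as a special case. To make the index mapping concrete without requiring PyTorch, here is a pure-Python sketch on nested lists; `im2seq` is an illustrative stand-in and not the module from the diff.

```python
def im2seq(x):
    """Flatten x[b][c][h][w] into out[b][h * W + w][c].

    Pure-Python illustration of permute(0, 2, 3, 1) followed by
    reshape(B, H * W, C); spatial positions are visited in row-major order.
    """
    out = []
    for sample in x:  # sample is indexed [c][h][w]
        C, H, W = len(sample), len(sample[0]), len(sample[0][0])
        out.append([
            [sample[c][h][w] for c in range(C)]
            for h in range(H)
            for w in range(W)
        ])
    return out
```

With `H == 1` the sequence index reduces to `w`, which matches the result of the original squeeze-and-permute path, so the two branches agree on inputs the old code already handled.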

--- a/magic_pdf/model/sub_modules/ocr/paddleocr2pytorch/pytorchocr/utils/resources/arch_config.yaml
+++ b/magic_pdf/model/sub_modules/ocr/paddleocr2pytorch/pytorchocr/utils/resources/arch_config.yaml
@@ -104,6 +104,22 @@ ch_PP-OCRv4_det_infer:
    name: DBHead
    k: 50

+ch_PP-OCRv5_det_infer:
+  model_type: det
+  algorithm: DB
+  Transform: null
+  Backbone:
+    name: PPLCNetV3
+    scale: 0.75
+    det: True
+  Neck:
+    name: RSEFPN
+    out_channels: 96
+    shortcut: True
+  Head:
+    name: DBHead
+    k: 50
+
 ch_PP-OCRv4_det_server_infer:
  model_type: det
  algorithm: DB
@@ -196,6 +212,58 @@ ch_PP-OCRv4_rec_server_doc_infer:
          nrtr_dim: 384
          max_text_length: 25

+ch_PP-OCRv5_rec_server_infer:
+  model_type: rec
+  algorithm: SVTR_HGNet
+  Transform:
+  Backbone:
+    name: PPHGNetV2_B4
+    text_rec: True
+  Head:
+    name: MultiHead
+    out_channels_list:
+      CTCLabelDecode: 18385
+    head_list:
+      - CTCHead:
+          Neck:
+            name: svtr
+            dims: 120
+            depth: 2
+            hidden_dims: 120
+            kernel_size: [ 1, 3 ]
+            use_guide: True
+          Head:
+            fc_decay: 0.00001
+      - NRTRHead:
+          nrtr_dim: 384
+          max_text_length: 25
+
+ch_PP-OCRv5_rec_infer:
+  model_type: rec
+  algorithm: SVTR_HGNet
+  Transform:
+  Backbone:
+    name: PPLCNetV3
+    scale: 0.95
+  Head:
+    name: MultiHead
+    out_channels_list:
+      CTCLabelDecode: 18385
+    head_list:
+      - CTCHead:
+          Neck:
+            name: svtr
+            dims: 120
+            depth: 2
+            hidden_dims: 120
+            kernel_size: [ 1, 3 ]
+            use_guide: True
+          Head:
+            fc_decay: 0.00001
+      - NRTRHead:
+          nrtr_dim: 384
+          max_text_length: 25
+
 chinese_cht_PP-OCRv3_rec_infer:
  model_type: rec
  algorithm: SVTR

--- a/magic_pdf/model/sub_modules/ocr/paddleocr2pytorch/pytorchocr/utils/resources/dict/ppocrv5_dict.txt
+++ b/magic_pdf/model/sub_modules/ocr/paddleocr2pytorch/pytorchocr/utils/resources/dict/ppocrv5_dict.txt
--- a/magic_pdf/model/sub_modules/ocr/paddleocr2pytorch/pytorchocr/utils/resources/models_config.yml
+++ b/magic_pdf/model/sub_modules/ocr/paddleocr2pytorch/pytorchocr/utils/resources/models_config.yml
 lang:
  ch_lite:
+    det: ch_PP-OCRv3_det_infer.pth
+    rec: ch_PP-OCRv5_rec_infer.pth
+    dict: ppocrv5_dict.txt
+  ch_lite_v4:
    det: ch_PP-OCRv3_det_infer.pth
    rec: ch_PP-OCRv4_rec_infer.pth
    dict: ppocr_keys_v1.txt
  ch_server:
+    det: ch_PP-OCRv3_det_infer.pth
+    rec: ch_PP-OCRv5_rec_server_infer.pth
+    dict: ppocrv5_dict.txt
+  ch_server_v4:
    det: ch_PP-OCRv3_det_infer.pth
    rec: ch_PP-OCRv4_rec_server_infer.pth
    dict: ppocr_keys_v1.txt