Unverified Commit a989444e authored by Xiaomeng Zhao's avatar Xiaomeng Zhao Committed by GitHub
Browse files

Merge pull request #2514 from opendatalab/release-1.3.12

Release 1.3.12
parents 40851b1c e3a42955
......@@ -48,6 +48,20 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
</div>
# Changelog
- 2025/05/24 1.3.12 Released
- Added support for ppocrv5 model, updated `ch_server` model to `PP-OCRv5_rec_server` and `ch_lite` model to `PP-OCRv5_rec_mobile` (model update required)
- In testing, we found that ppocrv5(server) shows some improvement for handwritten documents, but slightly lower accuracy than v4_server_doc for other document types. Therefore, the default ch model remains unchanged as `PP-OCRv4_server_rec_doc`.
- Since ppocrv5 enhances recognition capabilities for handwritten text and special characters, you can manually select ppocrv5 models for Japanese, traditional Chinese mixed scenarios and handwritten document scenarios
- You can select the appropriate model through the lang parameter `lang='ch_server'` (python api) or `--lang ch_server` (command line):
- `ch`: `PP-OCRv4_rec_server_doc` (default) (Chinese, English, Japanese, Traditional Chinese mixed/15k dictionary)
- `ch_server`: `PP-OCRv5_rec_server` (Chinese, English, Japanese, Traditional Chinese mixed + handwriting/18k dictionary)
- `ch_lite`: `PP-OCRv5_rec_mobile` (Chinese, English, Japanese, Traditional Chinese mixed + handwriting/18k dictionary)
- `ch_server_v4`: `PP-OCRv4_rec_server` (Chinese, English mixed/6k dictionary)
- `ch_lite_v4`: `PP-OCRv4_rec_mobile` (Chinese, English mixed/6k dictionary)
- Added support for handwritten documents by optimizing layout recognition of handwritten text areas
- This feature is supported by default, no additional configuration needed
- You can refer to the instructions above to manually select ppocrv5 model for better handwritten document parsing
- The demos on `huggingface` and `modelscope` have been updated to support handwriting recognition and ppocrv5 models, which you can experience online
- 2025/04/29 1.3.10 Released
- Support for custom formula delimiters can be achieved by modifying the `latex-delimiter-config` item in the `magic-pdf.json` file under the user directory.
- 2025/04/27 1.3.9 Released
......
......@@ -47,6 +47,20 @@
</div>
# 更新记录
- 2025/05/24 1.3.12 发布
- 增加ppocrv5模型的支持,将`ch_server`模型更新为`PP-OCRv5_rec_server``ch_lite`模型更新为`PP-OCRv5_rec_mobile`(需更新模型)
- 在测试中,发现ppocrv5(server)对手写文档效果有一定提升,但在其余类别文档的精度略差于v4_server_doc,因此默认的ch模型保持不变,仍为`PP-OCRv4_server_rec_doc`
- 由于ppocrv5强化了手写场景和特殊字符的识别能力,因此您可以在日繁混合场景以及手写文档场景下手动选择使用ppocrv5模型
- 您可通过lang参数`lang='ch_server'`(python api)或`--lang ch_server`(命令行)自行选择相应的模型:
- `ch``PP-OCRv4_rec_server_doc`(默认)(中英日繁混合/1.5w字典)
- `ch_server``PP-OCRv5_rec_server`(中英日繁混合+手写场景/1.8w字典)
- `ch_lite``PP-OCRv5_rec_mobile`(中英日繁混合+手写场景/1.8w字典)
- `ch_server_v4``PP-OCRv4_rec_server`(中英混合/6k字典)
- `ch_lite_v4``PP-OCRv4_rec_mobile`(中英混合/6k字典)
- 增加手写文档的支持,通过优化layout对手写文本区域的识别,现已支持手写文档的解析
- 默认支持此功能,无需额外配置
- 可以参考上述说明,手动选择ppocrv5模型以获得更好的手写文档解析效果
- `huggingface``modelscope`的demo已更新为支持手写识别和ppocrv5模型的版本,可自行在线体验
- 2025/04/29 1.3.10 发布
- 支持使用自定义公式标识符,可通过修改用户目录下的`magic-pdf.json`文件中的`latex-delimiter-config`项实现。
- 2025/04/27 1.3.9 发布
......
......@@ -10,22 +10,22 @@ from loguru import logger
def fitz_doc_to_image(doc, dpi=200) -> dict:
def fitz_doc_to_image(page, dpi=200) -> dict:
"""Convert fitz.Document to image, Then convert the image to numpy array.
Args:
doc (_type_): pymudoc page
page (_type_): pymudoc page
dpi (int, optional): reset the dpi of dpi. Defaults to 200.
Returns:
dict: {'img': numpy array, 'width': width, 'height': height }
"""
mat = fitz.Matrix(dpi / 72, dpi / 72)
pm = doc.get_pixmap(matrix=mat, alpha=False)
pm = page.get_pixmap(matrix=mat, alpha=False)
# If the width or height exceeds 4500 after scaling, do not scale further.
if pm.width > 4500 or pm.height > 4500:
pm = doc.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)
pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)
# Convert pixmap samples directly to numpy array
img = np.frombuffer(pm.samples, dtype=np.uint8).reshape(pm.height, pm.width, 3)
......
......@@ -70,19 +70,34 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
if mode == 'nlp':
continue
elif mode == 'mm':
for block in para_block['blocks']: # 1st.拼image_body
if block['type'] == BlockType.ImageBody:
for line in block['lines']:
for span in line['spans']:
if span['type'] == ContentType.Image:
if span.get('image_path', ''):
para_text += f"\n![]({join_path(img_buket_path, span['image_path'])}) \n"
for block in para_block['blocks']: # 2nd.拼image_caption
if block['type'] == BlockType.ImageCaption:
para_text += merge_para_with_text(block) + ' \n'
for block in para_block['blocks']: # 3rd.拼image_footnote
if block['type'] == BlockType.ImageFootnote:
para_text += merge_para_with_text(block) + ' \n'
# 检测是否存在图片脚注
has_image_footnote = any(block['type'] == BlockType.ImageFootnote for block in para_block['blocks'])
# 如果存在图片脚注,则将图片脚注拼接到图片正文后面
if has_image_footnote:
for block in para_block['blocks']: # 1st.拼image_caption
if block['type'] == BlockType.ImageCaption:
para_text += merge_para_with_text(block) + ' \n'
for block in para_block['blocks']: # 2nd.拼image_body
if block['type'] == BlockType.ImageBody:
for line in block['lines']:
for span in line['spans']:
if span['type'] == ContentType.Image:
if span.get('image_path', ''):
para_text += f"![]({img_buket_path}/{span['image_path']})"
for block in para_block['blocks']: # 3rd.拼image_footnote
if block['type'] == BlockType.ImageFootnote:
para_text += ' \n' + merge_para_with_text(block)
else:
for block in para_block['blocks']: # 1st.拼image_body
if block['type'] == BlockType.ImageBody:
for line in block['lines']:
for span in line['spans']:
if span['type'] == ContentType.Image:
if span.get('image_path', ''):
para_text += f"![]({img_buket_path}/{span['image_path']})"
for block in para_block['blocks']: # 2nd.拼image_caption
if block['type'] == BlockType.ImageCaption:
para_text += ' \n' + merge_para_with_text(block)
elif para_type == BlockType.Table:
if mode == 'nlp':
continue
......@@ -96,20 +111,19 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
for span in line['spans']:
if span['type'] == ContentType.Table:
# if processed by table model
if span.get('latex', ''):
para_text += f"\n\n$\n {span['latex']}\n$\n\n"
elif span.get('html', ''):
para_text += f"\n\n{span['html']}\n\n"
if span.get('html', ''):
para_text += f"\n{span['html']}\n"
elif span.get('image_path', ''):
para_text += f"\n![]({join_path(img_buket_path, span['image_path'])}) \n"
para_text += f"![]({img_buket_path}/{span['image_path']})"
for block in para_block['blocks']: # 3rd.拼table_footnote
if block['type'] == BlockType.TableFootnote:
para_text += merge_para_with_text(block) + ' \n'
para_text += '\n' + merge_para_with_text(block) + ' '
if para_text.strip() == '':
continue
else:
page_markdown.append(para_text.strip() + ' ')
# page_markdown.append(para_text.strip() + ' ')
page_markdown.append(para_text.strip())
return page_markdown
......@@ -257,9 +271,9 @@ def para_to_standard_format_v2(para_block, img_buket_path, page_idx, drop_reason
if span['type'] == ContentType.Table:
if span.get('latex', ''):
para_content['table_body'] = f"\n\n$\n {span['latex']}\n$\n\n"
para_content['table_body'] = f"{span['latex']}"
elif span.get('html', ''):
para_content['table_body'] = f"\n\n{span['html']}\n\n"
para_content['table_body'] = f"{span['html']}"
if span.get('image_path', ''):
para_content['img_path'] = join_path(img_buket_path, span['image_path'])
......
......@@ -6,7 +6,7 @@ from tqdm import tqdm
from magic_pdf.config.constants import MODEL_NAME
from magic_pdf.model.sub_modules.model_init import AtomModelSingleton
from magic_pdf.model.sub_modules.model_utils import (
clean_vram, crop_img, get_res_list_from_layout_res)
clean_vram, crop_img, get_res_list_from_layout_res, get_coords_and_area)
from magic_pdf.model.sub_modules.ocr.paddleocr2pytorch.ocr_utils import (
get_adjusted_mfdetrec_res, get_ocr_result_list)
......@@ -148,6 +148,19 @@ class BatchAnalyze:
# Integration results
if ocr_res:
ocr_result_list = get_ocr_result_list(ocr_res, useful_list, ocr_res_list_dict['ocr_enable'], new_image, _lang)
if res["category_id"] == 3:
# ocr_result_list中所有bbox的面积之和
ocr_res_area = sum(get_coords_and_area(ocr_res_item)[4] for ocr_res_item in ocr_result_list if 'poly' in ocr_res_item)
# 求ocr_res_area和res的面积的比值
res_area = get_coords_and_area(res)[4]
if res_area > 0:
ratio = ocr_res_area / res_area
if ratio > 0.25:
res["category_id"] = 1
else:
continue
ocr_res_list_dict['layout_res'].extend(ocr_result_list)
# det_count += len(ocr_res_list_dict['ocr_res_list'])
......
......@@ -189,7 +189,7 @@ def batch_doc_analyze(
formula_enable=None,
table_enable=None,
):
MIN_BATCH_INFERENCE_SIZE = int(os.environ.get('MINERU_MIN_BATCH_INFERENCE_SIZE', 200))
MIN_BATCH_INFERENCE_SIZE = int(os.environ.get('MINERU_MIN_BATCH_INFERENCE_SIZE', 100))
batch_size = MIN_BATCH_INFERENCE_SIZE
page_wh_list = []
......
......@@ -31,10 +31,10 @@ def crop_img(input_res, input_np_img, crop_paste_x=0, crop_paste_y=0):
return return_image, return_list
def get_coords_and_area(table):
def get_coords_and_area(block_with_poly):
"""Extract coordinates and area from a table."""
xmin, ymin = int(table['poly'][0]), int(table['poly'][1])
xmax, ymax = int(table['poly'][4]), int(table['poly'][5])
xmin, ymin = int(block_with_poly['poly'][0]), int(block_with_poly['poly'][1])
xmax, ymax = int(block_with_poly['poly'][4]), int(block_with_poly['poly'][5])
area = (xmax - xmin) * (ymax - ymin)
return xmin, ymin, xmax, ymax, area
......@@ -243,7 +243,7 @@ def get_res_list_from_layout_res(layout_res, iou_threshold=0.7, overlap_threshol
"bbox": [int(res['poly'][0]), int(res['poly'][1]),
int(res['poly'][4]), int(res['poly'][5])],
})
elif category_id in [0, 2, 4, 6, 7]: # OCR regions
elif category_id in [0, 2, 4, 6, 7, 3]: # OCR regions
ocr_res_list.append(res)
elif category_id == 5: # Table regions
table_res_list.append(res)
......
......@@ -35,7 +35,7 @@ def build_backbone(config, model_type):
from .rec_mobilenet_v3 import MobileNetV3
from .rec_svtrnet import SVTRNet
from .rec_mv1_enhance import MobileNetV1Enhance
from .rec_pphgnetv2 import PPHGNetV2_B4
support_dict = [
"MobileNetV1Enhance",
"MobileNetV3",
......@@ -48,6 +48,7 @@ def build_backbone(config, model_type):
"DenseNet",
"PPLCNetV3",
"PPHGNet_small",
"PPHGNetV2_B4",
]
else:
raise NotImplementedError
......
......@@ -9,14 +9,27 @@ class Im2Seq(nn.Module):
super().__init__()
self.out_channels = in_channels
# def forward(self, x):
# B, C, H, W = x.shape
# # assert H == 1
# x = x.squeeze(dim=2)
# # x = x.transpose([0, 2, 1]) # paddle (NTC)(batch, width, channels)
# x = x.permute(0, 2, 1)
# return x
def forward(self, x):
B, C, H, W = x.shape
# assert H == 1
x = x.squeeze(dim=2)
# x = x.transpose([0, 2, 1]) # paddle (NTC)(batch, width, channels)
x = x.permute(0, 2, 1)
return x
# 处理四维张量,将空间维度展平为序列
if H == 1:
# 原来的处理逻辑,适用于H=1的情况
x = x.squeeze(dim=2)
x = x.permute(0, 2, 1) # (B, W, C)
else:
# 处理H不为1的情况
x = x.permute(0, 2, 3, 1) # (B, H, W, C)
x = x.reshape(B, H * W, C) # (B, H*W, C)
return x
class EncoderWithRNN_(nn.Module):
def __init__(self, in_channels, hidden_size):
......
......@@ -104,6 +104,22 @@ ch_PP-OCRv4_det_infer:
name: DBHead
k: 50
ch_PP-OCRv5_det_infer:
model_type: det
algorithm: DB
Transform: null
Backbone:
name: PPLCNetV3
scale: 0.75
det: True
Neck:
name: RSEFPN
out_channels: 96
shortcut: True
Head:
name: DBHead
k: 50
ch_PP-OCRv4_det_server_infer:
model_type: det
algorithm: DB
......@@ -196,6 +212,58 @@ ch_PP-OCRv4_rec_server_doc_infer:
nrtr_dim: 384
max_text_length: 25
ch_PP-OCRv5_rec_server_infer:
model_type: rec
algorithm: SVTR_HGNet
Transform:
Backbone:
name: PPHGNetV2_B4
text_rec: True
Head:
name: MultiHead
out_channels_list:
CTCLabelDecode: 18385
head_list:
- CTCHead:
Neck:
name: svtr
dims: 120
depth: 2
hidden_dims: 120
kernel_size: [ 1, 3 ]
use_guide: True
Head:
fc_decay: 0.00001
- NRTRHead:
nrtr_dim: 384
max_text_length: 25
ch_PP-OCRv5_rec_infer:
model_type: rec
algorithm: SVTR_HGNet
Transform:
Backbone:
name: PPLCNetV3
scale: 0.95
Head:
name: MultiHead
out_channels_list:
CTCLabelDecode: 18385
head_list:
- CTCHead:
Neck:
name: svtr
dims: 120
depth: 2
hidden_dims: 120
kernel_size: [ 1, 3 ]
use_guide: True
Head:
fc_decay: 0.00001
- NRTRHead:
nrtr_dim: 384
max_text_length: 25
chinese_cht_PP-OCRv3_rec_infer:
model_type: rec
algorithm: SVTR
......
lang:
ch_lite:
det: ch_PP-OCRv3_det_infer.pth
rec: ch_PP-OCRv5_rec_infer.pth
dict: ppocrv5_dict.txt
ch_lite_v4:
det: ch_PP-OCRv3_det_infer.pth
rec: ch_PP-OCRv4_rec_infer.pth
dict: ppocr_keys_v1.txt
ch_server:
det: ch_PP-OCRv3_det_infer.pth
rec: ch_PP-OCRv5_rec_server_infer.pth
dict: ppocrv5_dict.txt
ch_server_v4:
det: ch_PP-OCRv3_det_infer.pth
rec: ch_PP-OCRv4_rec_server_infer.pth
dict: ppocr_keys_v1.txt
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment