Unverified commit 845a3ff0 authored by Xiaomeng Zhao, committed by GitHub

Merge pull request #969 from opendatalab/release-0.9.3

Release 0.9.3
parents d0558abb 6083e109
@@ -48,3 +48,6 @@ debug_utils/
# sphinx docs
_build/
output/
\ No newline at end of file
@@ -42,6 +42,7 @@
</div>
# Changelog
- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.
- 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition functionality.
- 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
- Refactored the sorting module code to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading order sorting, ensuring high accuracy in various layouts.
@@ -246,7 +247,7 @@ You can modify certain configurations in this file to enable or disable features
"enable": true // The formula recognition feature is enabled by default. If you need to disable it, please change the value here to "false".
},
"table-config": {
"model": "rapid_table", // When using structEqTable, please change to "struct_eqtable".
"enable": false, // The table recognition feature is disabled by default. If you need to enable it, please change the value here to "true".
"max_time": 400
}
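Since these settings live in a JSON file, switching the table model can be scripted. Below is a minimal, hypothetical helper (not part of magic-pdf; the function name is illustrative). It assumes the real magic-pdf.json is plain JSON without the `//` comments shown in the excerpt above:

```python
import json

# Hypothetical helper: switch the table-recognition model and enable flag
# in a magic-pdf.json-style config file. The "//" comments in the README
# excerpt are illustrative only; the actual file must be valid JSON.
def set_table_config(config_path, model="rapid_table", enable=True):
    with open(config_path, "r", encoding="utf-8") as f:
        cfg = json.load(f)
    table_cfg = cfg.setdefault("table-config", {})
    table_cfg["model"] = model
    table_cfg["enable"] = enable
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(cfg, f, ensure_ascii=False, indent=4)
    return cfg
```

Other keys in the file (such as `max_time`) are left untouched by the round-trip.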
@@ -261,7 +262,7 @@ If your device supports CUDA and meets the GPU requirements of the mainline environment
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
- Quick Deployment with Docker
> [!IMPORTANT]
> Docker requires a GPU with at least 8GB of VRAM, and all acceleration features are enabled by default.
>
> Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker.
>
@@ -421,7 +422,9 @@ This project currently uses PyMuPDF to achieve advanced functionality. However,
# Acknowledgments
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [RapidTable](https://github.com/RapidAI/RapidTable)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [layoutreader](https://github.com/ppaanngggg/layoutreader)
......
> [!WARNING]
> This document is outdated. Please refer to the latest version: [ENGLISH](README.md).
<div id="top">
<p align="center">
@@ -18,9 +20,7 @@
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 200px; height: 55px;"/></a>
[English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)
......
@@ -42,7 +42,7 @@
</div>
# Changelog
- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.
- 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition.
- 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring that addresses numerous issues, improves performance, reduces hardware requirements, and enhances usability:
- Refactored the sorting module to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading-order sorting, ensuring high accuracy across various layouts.
@@ -188,13 +188,13 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
<td rowspan="2">GPU hardware support list</td>
<td colspan="2">Minimum requirement: 8GB+ VRAM</td>
<td colspan="2">3060ti/3070/4060<br>
8GB VRAM enables all acceleration features (tables limited to rapid_table)</td>
<td rowspan="2">None</td>
</tr>
<tr>
<td colspan="2">Recommended: 10GB+ VRAM</td>
<td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070tisuper/4080/4090<br>
10GB VRAM or more enables all acceleration features<br>
</td>
</tr>
</table>
@@ -251,7 +251,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
"enable": true // Formula recognition is enabled by default; to disable it, change this value to "false"
},
"table-config": {
"model": "rapid_table", // To use structEqTable, change this to "struct_eqtable"
"enable": false, // Table recognition is disabled by default; to enable it, change this value to "true"
"max_time": 400
}
@@ -266,7 +266,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
- [Windows10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
- Quick Deployment with Docker
> [!IMPORTANT]
> Docker requires a GPU with at least 8GB of VRAM, and all acceleration features are enabled by default.
>
> Before running this Docker, you can use the following command to check whether your device supports CUDA acceleration on Docker.
>
@@ -431,6 +431,7 @@ TODO
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [RapidTable](https://github.com/RapidAI/RapidTable)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [layoutreader](https://github.com/ppaanngggg/layoutreader)
......
@@ -19,9 +19,10 @@ def json_md_dump(
pdf_name,
content_list,
md_content,
orig_model_list,
):
# Write the model results to model.json
md_writer.write(
content=json.dumps(orig_model_list, ensure_ascii=False, indent=4),
path=f"{pdf_name}_model.json"
@@ -87,9 +88,12 @@ def pdf_parse_main(
pdf_bytes = open(pdf_path, "rb").read()  # Read the binary data of the PDF file
orig_model_list = []
if model_json_path:
# Read the raw JSON data (a list) of a PDF that has already been parsed by the model
model_json = json.loads(open(model_json_path, "r", encoding="utf-8").read())
orig_model_list = copy.deepcopy(model_json)
else:
model_json = []
@@ -115,8 +119,9 @@ def pdf_parse_main(
pipe.pipe_classify()
# If no model data was provided, parse with the built-in model
if len(model_json) == 0:
pipe.pipe_analyze()  # Parse
orig_model_list = copy.deepcopy(pipe.model_list)
# Run parsing
pipe.pipe_parse()
@@ -126,7 +131,7 @@ def pdf_parse_main(
md_content = pipe.pipe_mk_markdown(image_path_parent, drop_mode="none")
if is_json_md_dump:
json_md_dump(pipe, md_writer, pdf_name, content_list, md_content, orig_model_list)
if is_draw_visualization_bbox:
draw_visualization_bbox(pipe.pdf_mid_data['pdf_info'], pdf_bytes, output_path, pdf_name)
......
@@ -15,7 +15,7 @@
"enable": true
},
"table-config": {
"model": "rapid_table",
"enable": false,
"max_time": 400
},
......
@@ -168,7 +168,7 @@ def merge_para_with_text(para_block):
# If the previous line ends with a hyphen, no space should be appended
if __is_hyphen_at_line_end(content):
para_text += content[:-1]
elif len(content) == 1 and content not in ['A', 'I', 'a', 'i'] and not content.isdigit():
para_text += content
else:  # In Western-text contexts, contents are separated by spaces
para_text += f"{content} "
......
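The hyphen-merging hunk above can be exercised in isolation. The sketch below re-creates the joining rule standalone, approximating the repo's `__is_hyphen_at_line_end` helper with a plain `endswith('-')` check (an assumption for illustration):

```python
# Standalone sketch of the line-joining rule from merge_para_with_text.
# __is_hyphen_at_line_end is approximated with endswith('-').
def join_line_contents(contents):
    para_text = ''
    for content in contents:
        if content.endswith('-'):
            # A trailing hyphen means the word continues on the next line:
            # drop the hyphen and append no space.
            para_text += content[:-1]
        elif len(content) == 1 and content not in ['A', 'I', 'a', 'i'] and not content.isdigit():
            # Stray single characters (likely OCR noise) get no trailing space.
            para_text += content
        else:
            # Western text: separate contents with a space.
            para_text += f"{content} "
    return para_text
```

The `isdigit()` clause added in this hunk is what restores the space after standalone digits such as list numbers.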
@@ -51,3 +51,5 @@ class MODEL_NAME:
YOLO_V8_MFD = "yolo_v8_mfd"
UniMerNet_v2_Small = "unimernet_small"
RAPID_TABLE = "rapid_table"
\ No newline at end of file
@@ -92,7 +92,7 @@ def get_table_recog_config():
table_config = config.get('table-config')
if table_config is None:
logger.warning(f"'table-config' not found in {CONFIG_FILE_NAME}, use 'False' as default")
return json.loads(f'{{"model": "{MODEL_NAME.RAPID_TABLE}","enable": false, "max_time": 400}}')
else:
return table_config
......
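The fallback above is a common config-defaulting pattern: a missing `table-config` section yields a `rapid_table` default with recognition disabled. A standalone sketch, with the config dict passed in rather than read from magic-pdf.json and `MODEL_NAME.RAPID_TABLE` inlined as the `"rapid_table"` constant defined earlier in this diff:

```python
import json

RAPID_TABLE = "rapid_table"  # mirrors MODEL_NAME.RAPID_TABLE from the diff

# Sketch of get_table_recog_config: return the 'table-config' section if
# present, otherwise a disabled rapid_table default built from a JSON literal.
def get_table_recog_config(config):
    table_config = config.get('table-config')
    if table_config is None:
        return json.loads(f'{{"model": "{RAPID_TABLE}", "enable": false, "max_time": 400}}')
    return table_config
```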
@@ -369,10 +369,16 @@ def draw_line_sort_bbox(pdf_info, pdf_bytes, out_path, filename):
if block['type'] in [BlockType.Image, BlockType.Table]:
for sub_block in block['blocks']:
if sub_block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
if len(sub_block['virtual_lines']) > 0 and sub_block['virtual_lines'][0].get('index', None) is not None:
for line in sub_block['virtual_lines']:
bbox = line['bbox']
index = line['index']
page_line_list.append({'index': index, 'bbox': bbox})
else:
for line in sub_block['lines']:
bbox = line['bbox']
index = line['index']
page_line_list.append({'index': index, 'bbox': bbox})
elif sub_block['type'] in [BlockType.ImageCaption, BlockType.TableCaption, BlockType.ImageFootnote, BlockType.TableFootnote]:
for line in sub_block['lines']:
bbox = line['bbox']
......
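The new branch above prefers `virtual_lines` only when those lines actually carry a sort `index`, otherwise falling back to the real `lines`. A standalone sketch of that selection logic (field names taken from the hunk; the function name is illustrative):

```python
# Sketch of the fallback added to draw_line_sort_bbox: use an image/table
# body's 'virtual_lines' only when its first entry carries an 'index';
# otherwise fall back to the block's real 'lines'.
def collect_sorted_lines(sub_block):
    vlines = sub_block.get('virtual_lines') or []
    if len(vlines) > 0 and vlines[0].get('index') is not None:
        lines = vlines
    else:
        lines = sub_block.get('lines', [])
    return [{'index': line['index'], 'bbox': line['bbox']} for line in lines]
```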
import re


def layout_rm_equation(layout_res):
    # Remove inline-equation detections (category_id == 10) from the layout results
    rm_idxs = []
    for idx, ele in enumerate(layout_res['layout_dets']):
        if ele['category_id'] == 10:
            rm_idxs.append(idx)
    for idx in rm_idxs[::-1]:
        del layout_res['layout_dets'][idx]
    return layout_res


def get_croped_image(image_pil, bbox):
    # Crop a PIL image to the given (x_min, y_min, x_max, y_max) box
    x_min, y_min, x_max, y_max = bbox
    croped_img = image_pil.crop((x_min, y_min, x_max, y_max))
    return croped_img


def latex_rm_whitespace(s: str):
    """Remove unnecessary whitespace from LaTeX code."""
    # First collapse the spaces inside \operatorname/\mathrm/\text/\mathbf{...}
    # groups, then repeatedly delete spaces between token pairs until the
    # string stops changing.
    text_reg = r'(\\(operatorname|mathrm|text|mathbf)\s?\*? {.*?})'
    letter = '[a-zA-Z]'
    noletter = r'[\W_^\d]'
    names = [x[0].replace(' ', '') for x in re.findall(text_reg, s)]
    s = re.sub(text_reg, lambda match: str(names.pop(0)), s)
    news = s
    while True:
        s = news
        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', news)
        news = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', news)
        if news == s:
            break
    return s
\ No newline at end of file
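The whitespace-collapsing loop in `latex_rm_whitespace` can be tested on its own. The sketch below re-implements just that loop, omitting the `\mathrm`/`\text` name-protection step that precedes it:

```python
import re

# Re-implementation of only the collapsing loop from latex_rm_whitespace:
# spaces between letter/non-letter token pairs are deleted repeatedly until
# the string stops changing. Escaped spaces ("\ ") are left alone.
def collapse_latex_spaces(s: str) -> str:
    letter = '[a-zA-Z]'
    noletter = r'[\W_^\d]'
    news = s
    while True:
        s = news
        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', news)
        news = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', news)
        if news == s:
            break
    return s
```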
from doclayout_yolo import YOLOv10


class DocLayoutYOLOModel(object):
    def __init__(self, weight, device):
        self.model = YOLOv10(weight)
        self.device = device

    def predict(self, image):
        layout_res = []
        doclayout_yolo_res = self.model.predict(
            image, imgsz=1024, conf=0.25, iou=0.45, verbose=True, device=self.device
        )[0]
        for xyxy, conf, cla in zip(
            doclayout_yolo_res.boxes.xyxy.cpu(),
            doclayout_yolo_res.boxes.conf.cpu(),
            doclayout_yolo_res.boxes.cls.cpu(),
        ):
            xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
            new_item = {
                'category_id': int(cla.item()),
                'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
                'score': round(float(conf.item()), 3),
            }
            layout_res.append(new_item)
        return layout_res
\ No newline at end of file
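The `poly` field built in `predict` encodes the four corners of the axis-aligned box clockwise from the top-left. Reproduced standalone for illustration (the helper name is not in the source):

```python
# Convert an (xmin, ymin, xmax, ymax) box into the 8-value 'poly' format
# used in layout_res: TL, TR, BR, BL corners in clockwise order.
def xyxy_to_poly(xmin, ymin, xmax, ymax):
    return [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax]
```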