Unverified commit b4f7b53e authored by Xiaomeng Zhao, committed by GitHub

Merge pull request #1261 from opendatalab/release-0.10.6

Release 0.10.6
parents a962824b d3b51aa5
@@ -137,49 +137,45 @@ poly coordinate format [x0, y0, x1, y1, x2, y2, x3, y3],

some_pdf_middle.json
~~~~~~~~~~~~~~~~~~~~

+----------------+--------------------------------------------------------------+
| Field          | Description                                                  |
+================+==============================================================+
| pdf_info       | list; one dict per PDF page with its parse result (see below)|
+----------------+--------------------------------------------------------------+
| \_parse_type   | ocr \| txt; the mode used by this parse's intermediate state |
+----------------+--------------------------------------------------------------+
| \_version_name | string; the magic-pdf version used for this parse            |
+----------------+--------------------------------------------------------------+

**pdf_info** field structure
+---------------------+----------------------------------------------------------+
| Field               | Description                                              |
+=====================+==========================================================+
| preproc_blocks      | unsegmented intermediate result of PDF preprocessing     |
+---------------------+----------------------------------------------------------+
| layout_bboxes       | layout segmentation result: layout direction (vertical   |
|                     | or horizontal) and bboxes, sorted in reading order       |
+---------------------+----------------------------------------------------------+
| page_idx            | page number, starting from 0                             |
+---------------------+----------------------------------------------------------+
| page_size           | page width and height                                    |
+---------------------+----------------------------------------------------------+
| \_layout_tree       | layout tree structure                                    |
+---------------------+----------------------------------------------------------+
| images              | list of dicts, one dict per img_block                    |
+---------------------+----------------------------------------------------------+
| tables              | list of dicts, one dict per table_block                  |
+---------------------+----------------------------------------------------------+
| interline_equations | list of dicts, one dict per interline_equation_block     |
+---------------------+----------------------------------------------------------+
| discarded_blocks    | list; block info the model returns to be dropped         |
+---------------------+----------------------------------------------------------+
| para_blocks         | result of segmenting preproc_blocks into paragraphs      |
+---------------------+----------------------------------------------------------+

In the table above, ``para_blocks`` is an array of dicts; each dict is a
block structure, and a block supports at most one level of nesting.
@@ -200,20 +196,18 @@ blocks: list; each element is a second-level block in dict form

Fields of a second-level block:

+----------+----------------------------------------------------------------+
| Field    | Description                                                    |
+==========+================================================================+
| type     | block type                                                     |
+----------+----------------------------------------------------------------+
| bbox     | block rectangle coordinates                                    |
+----------+----------------------------------------------------------------+
| lines    | list; each element is a dict describing one line of content    |
+----------+----------------------------------------------------------------+

Second-level block types in detail
@@ -237,22 +231,21 @@ interline_equation: interline (display) equation block

Fields of a line:

+-------+------------------------------------------------------------------+
| Field | Description                                                      |
+=======+==================================================================+
| bbox  | line rectangle coordinates                                       |
+-------+------------------------------------------------------------------+
| spans | list; each element is a dict describing a span, the smallest     |
|       | content unit                                                     |
+-------+------------------------------------------------------------------+

**span**

+-------+------------------------------------------------------------------+
| Field | Description                                                      |
+=======+==================================================================+
| bbox  | span rectangle coordinates                                       |
+-------+------------------------------------------------------------------+
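To make the block → lines → spans nesting described by these tables concrete, here is a hand-written sketch of one block from a hypothetical ``middle.json``. The field layout follows the tables above, but all coordinate values are invented for illustration and were not produced by magic-pdf:

```python
# Illustrative sketch of the nested middle.json block structure.
# Values are made up; only the field layout follows the tables above.
block = {
    "type": "text",                      # block type
    "bbox": [52, 61, 294, 124],          # block rectangle [x0, y0, x1, y1]
    "lines": [                           # one dict per line of content
        {
            "bbox": [52, 61, 294, 82],   # line rectangle
            "spans": [                   # smallest content units
                {"bbox": [52, 61, 294, 82]},
            ],
        },
    ],
}

# Blocks nest at most one level deep, so a fixed double loop walks everything:
for line in block["lines"]:
    for span in line["spans"]:
        print(span["bbox"])
```

Because nesting is capped at one level, consumers never need recursion to traverse a page's blocks.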
Pipeline
========

Minimal example
^^^^^^^^^^^^^^^
.. code:: python

   import os

   from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
   from magic_pdf.data.dataset import PymuDocDataset
   from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze

   # args
   pdf_file_name = "abc.pdf"  # replace with the real pdf path
   name_without_suff = pdf_file_name.split(".")[0]

   # prepare env
   local_image_dir, local_md_dir = "output/images", "output"
   image_dir = str(os.path.basename(local_image_dir))
   os.makedirs(local_image_dir, exist_ok=True)

   image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
       local_md_dir
   )

   # read bytes
   reader1 = FileBasedDataReader("")
   pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

   # proc
   ## Create Dataset Instance
   ds = PymuDocDataset(pdf_bytes)

   ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
Running the code above produces the following output:
.. code:: bash

   output/
   ├── abc.md
   └── images
Aside from environment setup, such as creating directories and importing dependencies, the snippet that actually converts the ``pdf`` to ``markdown`` is:
.. code::

   # read bytes
   reader1 = FileBasedDataReader("")
   pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

   # proc
   ## Create Dataset Instance
   ds = PymuDocDataset(pdf_bytes)

   ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
``ds.apply(doc_analyze, ocr=True)`` produces an ``InferenceResult`` object. Calling ``pipe_ocr_mode`` on the ``InferenceResult`` produces a ``PipeResult`` object, and calling ``dump_md`` on the ``PipeResult`` writes a ``markdown`` file to the given location.

The pipeline execution is shown in the figure below:
.. image:: ../../_static/image/pipeline.drawio.svg
.. raw:: html

   <br/>
Processing is currently divided into three stages, data, inference, and post-processing, corresponding to the ``Dataset``, ``InferenceResult``, and ``PipeResult`` entities in the figure. They are linked together via methods such as ``apply``, ``doc_analyze``, and ``pipe_ocr_mode``.
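The three-stage hand-off can be imitated with a tiny self-contained mock, purely to illustrate the chaining pattern; none of the classes or methods below are the real magic-pdf API:

```python
# Toy model of the Dataset -> InferenceResult -> PipeResult hand-off.
# These mock classes only mirror the chaining pattern; they are NOT magic-pdf.
from typing import Any, Callable


class MockDataset:
    def __init__(self, pages: list):
        self.pages = pages

    def apply(self, proc: Callable, *args, **kwargs) -> Any:
        # apply() simply forwards itself to proc, as magic_pdf's Dataset does
        return proc(self, *args, **kwargs)


class MockInferenceResult:
    def __init__(self, infer_res: list):
        self._infer_res = infer_res

    def pipe_mock_mode(self) -> "MockPipeResult":
        # stands in for pipe_ocr_mode: post-processes per-page results
        return MockPipeResult([r.upper() for r in self._infer_res])


class MockPipeResult:
    def __init__(self, pipe_res: list):
        self._pipe_res = pipe_res

    def dump(self) -> str:
        # stands in for dump_md: renders the final output
        return "\n".join(self._pipe_res)


def mock_analyze(ds: MockDataset) -> MockInferenceResult:
    # stands in for doc_analyze: turns raw pages into per-page results
    return MockInferenceResult([f"page: {p}" for p in ds.pages])


# Same chaining shape as ds.apply(doc_analyze, ...).pipe_ocr_mode(...).dump_md(...)
text = MockDataset(["a", "b"]).apply(mock_analyze).pipe_mock_mode().dump()
print(text)  # PAGE: A and PAGE: B on two lines
```

Each stage returns the next stage's object, which is what makes the one-line chained call in the minimal example possible.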
.. admonition:: Tip
   :class: tip

   For more usage examples of Dataset, InferenceResult, and PipeResult, see :doc:`../quick_start/to_markdown`.
   For more details about Dataset, InferenceResult, and PipeResult, please refer to the English MinerU documentation.
Composing pipelines
^^^^^^^^^^^^^^^^^^^
.. code:: python

   class Dataset(ABC):
       @abstractmethod
       def apply(self, proc: Callable, *args, **kwargs):
           """Apply the given callable.

           Args:
               proc (Callable): invoked as proc(self, *args, **kwargs)

           Returns:
               Any: the result generated by proc
           """
           pass


   class InferenceResult(InferenceResultBase):
       def apply(self, proc: Callable, *args, **kwargs):
           """Apply the given callable.

           Args:
               proc (Callable): invoked as proc(inference_result, *args, **kwargs)

           Returns:
               Any: the result generated by proc
           """
           return proc(copy.deepcopy(self._infer_res), *args, **kwargs)

       def pipe_ocr_mode(
           self,
           imageWriter: DataWriter,
           start_page_id=0,
           end_page_id=None,
           debug_mode=False,
           lang=None,
       ) -> PipeResult:
           pass


   class PipeResult:
       def apply(self, proc: Callable, *args, **kwargs):
           """Apply the given callable.

           Args:
               proc (Callable): invoked as proc(pipeline_result, *args, **kwargs)

           Returns:
               Any: the result generated by proc
           """
           return proc(copy.deepcopy(self._pipe_res), *args, **kwargs)
``Dataset``, ``InferenceResult``, and ``PipeResult`` all provide an ``apply`` method, which can be used to compose operations across stages. As shown below, ``MinerU`` uses it to chain these classes into a single computation.
.. code:: python

   # proc
   ## Create Dataset Instance
   ds = PymuDocDataset(pdf_bytes)

   ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
Users can implement their own composition helpers as needed. For example, the ``apply`` method can be used to count the pages of a ``pdf`` file:
.. code:: python

   from magic_pdf.data.data_reader_writer import FileBasedDataReader
   from magic_pdf.data.dataset import PymuDocDataset

   # args
   pdf_file_name = "abc.pdf"  # replace with the real pdf path

   # read bytes
   reader1 = FileBasedDataReader("")
   pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

   # proc
   ## Create Dataset Instance
   ds = PymuDocDataset(pdf_bytes)

   def count_page(ds) -> int:
       return len(ds)

   print("page number: ", ds.apply(count_page))  # will output the page count of `abc.pdf`
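Since ``apply`` forwards ``*args`` and ``**kwargs`` to the callable, custom helpers can also take extra parameters. A self-contained sketch of that pattern, using a stand-in dataset class (``FakeDataset`` and ``count_page_range`` are illustrative inventions, not part of magic-pdf):

```python
from typing import Callable


class FakeDataset:
    """Stand-in with the same apply() contract as magic_pdf's Dataset (illustrative only)."""

    def __init__(self, n_pages: int):
        self._n = n_pages

    def __len__(self) -> int:
        return self._n

    def apply(self, proc: Callable, *args, **kwargs):
        # extra positional/keyword arguments are forwarded to proc
        return proc(self, *args, **kwargs)


def count_page_range(ds, start=0, end=None) -> int:
    # counts pages in [start, end), clamped to the document length
    end = len(ds) if end is None else min(end, len(ds))
    return max(0, end - start)


ds = FakeDataset(10)
print(ds.apply(count_page_range))        # 10
print(ds.apply(count_page_range, 2, 7))  # 5
```

The same call shape works against a real ``PymuDocDataset``, since its ``apply`` forwards extra arguments in the same way.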
@@ -7,9 +7,9 @@ numpy>=1.21.6,<2.0.0
 fast-langdetect==0.2.0
 scikit-learn>=1.0.2
 pdfminer.six==20231228
-unimernet==0.2.1
+unimernet==0.2.2
 matplotlib
-ultralytics
+ultralytics>=8.3.48
 paddleocr==2.7.3
 paddlepaddle==3.0.0b1
 struct-eqtable==0.3.2
@@ -7,7 +7,7 @@ numpy>=1.21.6,<2.0.0
 pydantic>=2.7.2,<2.8.0
 PyMuPDF>=1.24.9
 scikit-learn>=1.0.2
-torch>=2.2.2,<=2.3.1
+torch>=2.2.2
 transformers
-# pdfminer.six==20231228
+pdfminer.six==20231228
 # The requirements.txt must ensure that only necessary external dependencies are introduced. If there are new dependencies to add, please contact the project administrator.
@@ -36,10 +36,12 @@ if __name__ == '__main__':
 "paddlepaddle==3.0.0b1;platform_system=='Linux'",
 "paddlepaddle==2.6.1;platform_system=='Windows' or platform_system=='Darwin'",
 ],
-"full": ["unimernet==0.2.1",  # upgrade unimernet to 0.2.1
+"full": ["unimernet==0.2.2",  # upgrade unimernet to 0.2.2; removes the torchtext dependency
+"torch>=2.2.2,<=2.3.1",  # torch 2.4.0+ is untested; cap the upper bound for now
+"torchvision>=0.17.2,<=0.18.1",  # torchvision is constrained by the torch version
 "matplotlib<=3.9.0;platform_system=='Windows'",  # 3.9.1+ ships no prebuilt Windows wheels; avoids install failures on Windows machines without a build toolchain
 "matplotlib;platform_system=='Linux' or platform_system=='Darwin'",  # do not cap matplotlib on Linux/macOS, to avoid bugs caused by being unable to update
-"ultralytics",  # yolov8, formula detection
+"ultralytics>=8.3.48",  # yolov8, formula detection
 "paddleocr==2.7.3",  # 2.8.0 and 2.8.1 conflict with detectron2; pin 2.7.3
 "paddlepaddle==3.0.0b1;platform_system=='Linux'",  # fixes a segfault on Linux
 "paddlepaddle==2.6.1;platform_system=='Windows' or platform_system=='Darwin'",  # 3.0.0b1 is slower on Windows; pin 2.6.1
@@ -7,8 +7,11 @@ from lib import common
 import time
 import magic_pdf.model as model_config
 from magic_pdf.pipe.UNIPipe import UNIPipe
-from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
-from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter
+import os
+from magic_pdf.data.data_reader_writer import FileBasedDataWriter
+from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
+from magic_pdf.config.make_content_config import DropMode, MakeMode
+from magic_pdf.pipe.OCRPipe import OCRPipe
 model_config.__use_inside_model__ = True
 pdf_res_path = conf.conf['pdf_res_path']
 code_path = conf.conf['code_path']
@@ -41,7 +44,7 @@ class TestCli:
 pdf_bytes = open(pdf_path, 'rb').read()
 local_image_dir = os.path.join(pdf_dev_path, 'pdf', 'images')
 image_dir = str(os.path.basename(local_image_dir))
-image_writer = DiskReaderWriter(local_image_dir)
+image_writer = FileBasedDataWriter(local_image_dir)
 model_json = list()
 jso_useful_key = {'_pdf_type': '', 'model_list': model_json}
 pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
@@ -77,7 +80,7 @@ class TestCli:
 pdf_bytes = open(pdf_path, 'rb').read()
 local_image_dir = os.path.join(pdf_dev_path, 'pdf', 'images')
 image_dir = str(os.path.basename(local_image_dir))
-image_writer = DiskReaderWriter(local_image_dir)
+image_writer = FileBasedDataWriter(local_image_dir)
 model_json = list()
 jso_useful_key = {'_pdf_type': 'ocr', 'model_list': model_json}
 pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
@@ -112,7 +115,7 @@ class TestCli:
 pdf_bytes = open(pdf_path, 'rb').read()
 local_image_dir = os.path.join(pdf_dev_path, 'pdf', 'images')
 image_dir = str(os.path.basename(local_image_dir))
-image_writer = DiskReaderWriter(local_image_dir)
+image_writer = FileBasedDataWriter(local_image_dir)
 model_json = list()
 jso_useful_key = {'_pdf_type': 'txt', 'model_list': model_json}
 pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
@@ -284,12 +287,13 @@ class TestCli:
 pdf_endpoint = os.environ.get('pdf_endpoint', "")
 s3_pdf_path = conf.conf["s3_pdf_path"]
 image_dir = "s3://" + pdf_bucket + "/mineru/test/output"
-print (image_dir)
-s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
-s3image_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint, parent_path=image_dir)
-pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
-jso_useful_key = {"_pdf_type": "", "model_list": []}
-pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
+prefix = "mineru/test/output"
+reader = S3DataReader(prefix, pdf_bucket, pdf_ak, pdf_sk, pdf_endpoint)
+# = S3DataWriter(prefix, pdf_bucket, pdf_ak, pdf_sk, pdf_endpoint)
+image_writer = S3DataWriter(prefix, pdf_bucket, pdf_ak, pdf_sk, pdf_endpoint)
+pdf_bytes = reader.read(s3_pdf_path)
+model_list = []
+pipe = OCRPipe(pdf_bytes, model_list, image_writer)
 pipe.pipe_classify()
 pipe.pipe_analyze()
 pipe.pipe_parse()
@@ -427,3 +431,4 @@ class TestCli:
 if __name__ == '__main__':
 pytest.main()