Skip to content
GitLab
Menu
Projects
Groups
Snippets
Loading...
Help
Help
Support
Community forum
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in / Register
Toggle navigation
Menu
Open sidebar
wangsen
MinerU
Commits
dd8da7bf
"macapp/src/install.ts" did not exist on "f92385590633d994b2547bc0732fbae1cb2af608"
Commit
dd8da7bf
authored
Nov 08, 2024
by
myhloli
Browse files
Merge remote-tracking branch 'origin/dev' into dev
parents
8ea23813
74fba476
Changes
45
Hide whitespace changes
Inline
Side-by-side
Showing
5 changed files
with
532 additions
and
0 deletions
+532
-0
next_docs/zh_cn/user_guide/quick_start.rst
next_docs/zh_cn/user_guide/quick_start.rst
+13
-0
next_docs/zh_cn/user_guide/quick_start/command_line.rst
next_docs/zh_cn/user_guide/quick_start/command_line.rst
+61
-0
next_docs/zh_cn/user_guide/quick_start/to_markdown.rst
next_docs/zh_cn/user_guide/quick_start/to_markdown.rst
+53
-0
next_docs/zh_cn/user_guide/tutorial.rst
next_docs/zh_cn/user_guide/tutorial.rst
+11
-0
next_docs/zh_cn/user_guide/tutorial/output_file_description.rst
...ocs/zh_cn/user_guide/tutorial/output_file_description.rst
+394
-0
No files found.
next_docs/zh_cn/user_guide/quick_start.rst
0 → 100644
View file @
dd8da7bf
快速开始
==============
从这里开始学习 MinerU 基本使用方法。若还没有安装,请参考安装文档进行安装
.. toctree::
:maxdepth: 1
:caption: 快速开始
quick_start/command_line
quick_start/to_markdown
next_docs/zh_cn/user_guide/quick_start/command_line.rst
0 → 100644
View file @
dd8da7bf
命令行
========
.. code:: bash
magic-pdf --help
Usage: magic-pdf [OPTIONS]
Options:
-v, --version display the version and exit
-p, --path PATH local pdf filepath or directory [required]
-o, --output-dir PATH output local directory [required]
-m, --method [ocr|txt|auto] the method for parsing pdf. ocr: using ocr
technique to extract information from pdf. txt:
suitable for the text-based pdf only and
outperform ocr. auto: automatically choose the
best method for parsing pdf from ocr and txt.
without method specified, auto will be used by
default.
-l, --lang TEXT Input the languages in the pdf (if known) to
improve OCR accuracy. Optional. You should
input "Abbreviation" with language form url: ht
tps://paddlepaddle.github.io/PaddleOCR/en/ppocr
/blog/multi_languages.html#5-support-languages-
and-abbreviations
-d, --debug BOOLEAN Enables detailed debugging information during
the execution of the CLI commands.
-s, --start INTEGER The starting page for PDF parsing, beginning
from 0.
-e, --end INTEGER The ending page for PDF parsing, beginning from
0.
--help Show this message and exit.
## show version
magic-pdf -v
## command line example
magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
``{some_pdf}`` 可以是单个 PDF 文件或者一个包含多个 PDF 文件的目录。 解析的结果文件存放在目录 ``{some_output_dir}`` 下。 生成的结果文件列表如下所示:
.. code:: text
├── some_pdf.md # markdown 文件
├── images # 存放图片目录
├── some_pdf_layout.pdf # layout 绘图 (包含layout阅读顺序)
├── some_pdf_middle.json # minerU 中间处理结果
├── some_pdf_model.json # 模型推理结果
├── some_pdf_origin.pdf # 原 pdf 文件
├── some_pdf_spans.pdf # 最小粒度的bbox位置信息绘图
└── some_pdf_content_list.json # 按阅读顺序排列的富文本json
.. admonition:: Tip
:class: tip
欲知更多有关结果文件的信息,请参考 :doc:`../tutorial/output_file_description`
next_docs/zh_cn/user_guide/quick_start/to_markdown.rst
0 → 100644
View file @
dd8da7bf
转换为 Markdown 文件
========================
.. code:: python
import os
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.pipe.OCRPipe import OCRPipe
## args
model_list = []
pdf_file_name = "abc.pdf" # replace with the real pdf path
## prepare env
local_image_dir, local_md_dir = "output/images", "output"
os.makedirs(local_image_dir, exist_ok=True)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
local_md_dir
) # create 00
image_dir = str(os.path.basename(local_image_dir))
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name) # read the pdf content
pipe = OCRPipe(pdf_bytes, model_list, image_writer)
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
pdf_info = pipe.pdf_mid_data["pdf_info"]
md_content = pipe.pipe_mk_markdown(
image_dir, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
)
if isinstance(md_content, list):
md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
else:
md_writer.write_string(f"{pdf_file_name}.md", md_content)
前去 :doc:`../data/data_reader_writer` 获取更多有关**读写**示例
next_docs/zh_cn/user_guide/tutorial.rst
0 → 100644
View file @
dd8da7bf
教程
===========
让我们通过构建一个最小项目来学习 MinerU
.. toctree::
:maxdepth: 1
:caption: 教程
tutorial/output_file_description
next_docs/zh_cn/user_guide/tutorial/output_file_description.rst
0 → 100644
View file @
dd8da7bf
输出文件格式介绍
===============
``magic-pdf`` 命令执行后除了输出和 markdown
有关的文件以外,还会生成若干个和 markdown
无关的文件。现在将一一介绍这些文件
some_pdf_layout.pdf
~~~~~~~~~~~~~~~~~~~
每一页的 layout 均由一个或多个框组成。
每个框左上脚的数字表明它们的序号。此外 layout.pdf
框内用不同的背景色块圈定不同的内容块。
.. figure:: ../../_static/image/layout_example.png
:alt: layout 页面示例
layout 页面示例
some_pdf_spans.pdf
~~~~~~~~~~~~~~~~~~
根据 span 类型的不同,采用不同颜色线框绘制页面上所有
span。该文件可以用于质检,可以快速排查出文本丢失、行间公式未识别等问题。
.. figure:: ../../_static/image/spans_example.png
:alt: span 页面示例
span 页面示例
some_pdf_model.json
~~~~~~~~~~~~~~~~~~~
结构定义
^^^^^^^^
.. code:: python
from pydantic import BaseModel, Field
from enum import IntEnum
class CategoryType(IntEnum):
title = 0 # 标题
plain_text = 1 # 文本
abandon = 2 # 包括页眉页脚页码和页面注释
figure = 3 # 图片
figure_caption = 4 # 图片描述
table = 5 # 表格
table_caption = 6 # 表格描述
table_footnote = 7 # 表格注释
isolate_formula = 8 # 行间公式
formula_caption = 9 # 行间公式的标号
embedding = 13 # 行内公式
isolated = 14 # 行间公式
text = 15 # ocr 识别结果
class PageInfo(BaseModel):
page_no: int = Field(description="页码序号,第一页的序号是 0", ge=0)
height: int = Field(description="页面高度", gt=0)
width: int = Field(description="页面宽度", ge=0)
class ObjectInferenceResult(BaseModel):
category_id: CategoryType = Field(description="类别", ge=0)
poly: list[float] = Field(description="四边形坐标, 分别是 左上,右上,右下,左下 四点的坐标")
score: float = Field(description="推理结果的置信度")
latex: str | None = Field(description="latex 解析结果", default=None)
html: str | None = Field(description="html 解析结果", default=None)
class PageInferenceResults(BaseModel):
layout_dets: list[ObjectInferenceResult] = Field(description="页面识别结果", ge=0)
page_info: PageInfo = Field(description="页面元信息")
# 所有页面的推理结果按照页码顺序依次放到列表中即为 minerU 推理结果
inference_result: list[PageInferenceResults] = []
poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3],
分别表示左上、右上、右下、左下四点的坐标 |poly 坐标示意图|
示例数据
^^^^^^^^
.. code:: json
[
{
"layout_dets": [
{
"category_id": 2,
"poly": [
99.1906967163086,
100.3119125366211,
730.3707885742188,
100.3119125366211,
730.3707885742188,
245.81326293945312,
99.1906967163086,
245.81326293945312
],
"score": 0.9999997615814209
}
],
"page_info": {
"page_no": 0,
"height": 2339,
"width": 1654
}
},
{
"layout_dets": [
{
"category_id": 5,
"poly": [
99.13092803955078,
2210.680419921875,
497.3183898925781,
2210.680419921875,
497.3183898925781,
2264.78076171875,
99.13092803955078,
2264.78076171875
],
"score": 0.9999997019767761
}
],
"page_info": {
"page_no": 1,
"height": 2339,
"width": 1654
}
}
]
some_pdf_middle.json
~~~~~~~~~~~~~~~~~~~~
+-----------+----------------------------------------------------------+
| 字段名 | 解释 |
+===========+==========================================================+
| pdf_info | list,每个 |
| | 元素都是一个dict,这个dict是每一页pdf的解析结果,详见下表 |
+-----------+----------------------------------------------------------+
| \_p | ocr \| txt,用来标识本次解析的中间态使用的模式 |
| arse_type | |
+-----------+----------------------------------------------------------+
| \_ver | string, 表示本次解析使用的 magic-pdf 的版本号 |
| sion_name | |
+-----------+----------------------------------------------------------+
**pdf_info** 字段结构说明
+--------------+-------------------------------------------------------+
| 字段名 | 解释 |
+==============+=======================================================+
| pr | pdf预处理后,未分段的中间结果 |
| eproc_blocks | |
+--------------+-------------------------------------------------------+
| l | 布局分割的结果, |
| ayout_bboxes | 含有布局的方向(垂直、水平),和bbox,按阅读顺序排序 |
+--------------+-------------------------------------------------------+
| page_idx | 页码,从0开始 |
+--------------+-------------------------------------------------------+
| page_size | 页面的宽度和高度 |
+--------------+-------------------------------------------------------+
| \ | 布局树状结构 |
| _layout_tree | |
+--------------+-------------------------------------------------------+
| images | list,每个元素是一个dict,每个dict表示一个img_block |
+--------------+-------------------------------------------------------+
| tables | list,每个元素是一个dict,每个dict表示一个table_block |
+--------------+-------------------------------------------------------+
| interli | list,每个元素 |
| ne_equations | 是一个dict,每个dict表示一个interline_equation_block |
+--------------+-------------------------------------------------------+
| disc | List, 模型返回的需要drop的block信息 |
| arded_blocks | |
+--------------+-------------------------------------------------------+
| para_blocks | 将preproc_blocks进行分段之后的结果 |
+--------------+-------------------------------------------------------+
上表中 ``para_blocks``
是个dict的数组,每个dict是一个block结构,block最多支持一次嵌套
**block**
外层block被称为一级block,一级block中的字段包括
====== ===============================================
字段名 解释
====== ===============================================
type block类型(table|image)
bbox block矩形框坐标
blocks list,里面的每个元素都是一个dict格式的二级block
====== ===============================================
一级block只有”table”和”image”两种类型,其余block均为二级block
二级block中的字段包括
+-----+----------------------------------------------------------------+
| 字 | 解释 |
| 段 | |
| 名 | |
+=====+================================================================+
| t | block类型 |
| ype | |
+-----+----------------------------------------------------------------+
| b | block矩形框坐标 |
| box | |
+-----+----------------------------------------------------------------+
| li | list,每个元素都是一个dict表示的line,用来描述一行信息的构成 |
| nes | |
+-----+----------------------------------------------------------------+
二级block的类型详解
================== ==============
type desc
================== ==============
image_body 图像的本体
image_caption 图像的描述文本
image_footnote 图像的脚注
table_body 表格本体
table_caption 表格的描述文本
table_footnote 表格的脚注
text 文本块
title 标题块
index 目录块
list 列表块
interline_equation 行间公式块
================== ==============
**line**
line 的 字段格式如下
+----+-----------------------------------------------------------------+
| 字 | 解释 |
| 段 | |
| 名 | |
+====+=================================================================+
| bb | line的矩形框坐标 |
| ox | |
+----+-----------------------------------------------------------------+
| s | list, |
| pa | 每个元素都是一个dict表示的span,用来描述一个最小组成单元的构成 |
| ns | |
+----+-----------------------------------------------------------------+
**span**
+------------+---------------------------------------------------------+
| 字段名 | 解释 |
+============+=========================================================+
| bbox | span的矩形框坐标 |
+------------+---------------------------------------------------------+
| type | span的类型 |
+------------+---------------------------------------------------------+
| content \| | 文本类型的span使用content,图表类使用img_path |
| img_path | 用来存储实际的文本或者截图路径信息 |
+------------+---------------------------------------------------------+
span 的类型有如下几种
================== ========
type desc
================== ========
image 图片
table 表格
text 文本
inline_equation 行内公式
interline_equation 行间公式
================== ========
**总结**
span是所有元素的最小存储单元
para_blocks内存储的元素为区块信息
区块结构为
一级block(如有)->二级block->line->span
.. _示例数据-1:
示例数据
^^^^^^^^
.. code:: json
{
"pdf_info": [
{
"preproc_blocks": [
{
"type": "text",
"bbox": [
52,
61.956024169921875,
294,
82.99800872802734
],
"lines": [
{
"bbox": [
52,
61.956024169921875,
294,
72.0000228881836
],
"spans": [
{
"bbox": [
54.0,
61.956024169921875,
296.2261657714844,
72.0000228881836
],
"content": "dependent on the service headway and the reliability of the departure ",
"type": "text",
"score": 1.0
}
]
}
]
}
],
"layout_bboxes": [
{
"layout_bbox": [
52,
61,
294,
731
],
"layout_label": "V",
"sub_layout": []
}
],
"page_idx": 0,
"page_size": [
612.0,
792.0
],
"_layout_tree": [],
"images": [],
"tables": [],
"interline_equations": [],
"discarded_blocks": [],
"para_blocks": [
{
"type": "text",
"bbox": [
52,
61.956024169921875,
294,
82.99800872802734
],
"lines": [
{
"bbox": [
52,
61.956024169921875,
294,
72.0000228881836
],
"spans": [
{
"bbox": [
54.0,
61.956024169921875,
296.2261657714844,
72.0000228881836
],
"content": "dependent on the service headway and the reliability of the departure ",
"type": "text",
"score": 1.0
}
]
}
]
}
]
}
],
"_parse_type": "txt",
"_version_name": "0.6.1"
}
.. |poly 坐标示意图| image:: ../../_static/image/poly.png
Prev
1
2
3
Next
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
.
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment