wangsen / MinerU
Commit 132c2089 (Unverified)
Authored Nov 27, 2024 by Xiaomeng Zhao; committed by GitHub on Nov 27, 2024

Merge pull request #1117 from icecraft/feat/add_s3_read_write_example

Feat/add s3 read write example

Parents: b8fdab11, 81529317
Changes: 4

Showing 4 changed files with 223 additions and 128 deletions (+223 -128)

next_docs/en/user_guide/quick_start/to_markdown.rst (+53 -3)
next_docs/en/user_guide/tutorial/output_file_description.rst (+95 -99)
next_docs/zh_cn/user_guide/quick_start/to_markdown.rst (+53 -3)
next_docs/zh_cn/user_guide/tutorial/output_file_description.rst (+22 -23)

next_docs/en/user_guide/quick_start/to_markdown.rst
View file @ 132c2089

@@ -3,12 +3,16 @@
 Convert To Markdown
 ========================

+Local File Example
+^^^^^^^^^^^^^^^^^^
+
 .. code:: python

     import os

     from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
+    from magic_pdf.config.make_content_config import DropMode, MakeMode
     from magic_pdf.pipe.OCRPipe import OCRPipe
@@ -23,7 +27,7 @@ Convert To Markdown
     image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
         local_md_dir
-    )  # create 00
+    )
     image_dir = str(os.path.basename(local_image_dir))

     reader1 = FileBasedDataReader("")
@@ -49,4 +53,50 @@ Convert To Markdown
         md_writer.write_string(f"{pdf_file_name}.md", md_content)

+
+Check :doc:`../data/data_reader_writer` for more [reader | writer] examples
+
+S3 File Example
+^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    import os
+
+    from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
+    from magic_pdf.config.make_content_config import DropMode, MakeMode
+    from magic_pdf.pipe.OCRPipe import OCRPipe
+
+    bucket_name = "{Your S3 Bucket Name}"  # replace with real bucket name
+    ak = "{Your S3 access key}"  # replace with real s3 access key
+    sk = "{Your S3 secret key}"  # replace with real s3 secret key
+    endpoint_url = "{Your S3 endpoint_url}"  # replace with real s3 endpoint_url
+
+    reader = S3DataReader('unittest/tmp/', bucket_name, ak, sk, endpoint_url)  # replace `unittest/tmp` with the real s3 prefix
+    writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
+    image_writer = S3DataWriter('unittest/tmp/images', bucket_name, ak, sk, endpoint_url)
+
+    ## args
+    model_list = []
+    # the doubled braces keep the placeholder literal inside the f-string
+    pdf_file_name = f"s3://{bucket_name}/{{fake pdf path}}"  # replace with the real s3 path
+
+    pdf_bytes = reader.read(pdf_file_name)  # read the pdf content
+
+    pipe = OCRPipe(pdf_bytes, model_list, image_writer)
+
+    pipe.pipe_classify()
+    pipe.pipe_analyze()
+    pipe.pipe_parse()
+
+    pdf_info = pipe.pdf_mid_data["pdf_info"]
+
+    md_content = pipe.pipe_mk_markdown(
+        "unittest/tmp/images", drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
+    )
+
+    if isinstance(md_content, list):
+        writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
+    else:
+        writer.write_string(f"{pdf_file_name}.md", md_content)
+
+
+Check :doc:`../data/data_reader_writer` for more [reader | writer] examples
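
Note on the S3 example: it names the output `f"{pdf_file_name}.md"`, where `pdf_file_name` is a full `s3://` URL, and joins `md_content` when it is a list. A minimal sketch of those two steps as plain helpers follows. `markdown_key` and `to_markdown_text` are hypothetical names, not part of `magic_pdf`, and whether `S3DataWriter` prefers a key relative to its prefix over a full URL is an assumption to check against the library.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def markdown_key(pdf_path: str) -> str:
    """Return '<stem>.md' for a local path or an s3:// URL (hypothetical helper)."""
    # For s3://bucket/key URLs, urlparse keeps the object key in .path
    path = urlparse(pdf_path).path if pdf_path.startswith("s3://") else pdf_path
    return PurePosixPath(path).stem + ".md"


def to_markdown_text(md_content) -> str:
    """Join a list of markdown fragments with newlines; pass a string through."""
    if isinstance(md_content, list):
        return "\n".join(md_content)
    return md_content


print(markdown_key("s3://my-bucket/unittest/tmp/demo.pdf"))
print(to_markdown_text(["# Title", "body"]))
```

With helpers like these, the final branch of the example could collapse to a single `writer.write_string(markdown_key(pdf_file_name), to_markdown_text(md_content))` call, assuming a relative key is what the writer expects.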
next_docs/en/user_guide/tutorial/output_file_description.rst
View file @ 132c2089

@@ -141,60 +141,60 @@ example
 some_pdf_middle.json
 ~~~~~~~~~~~~~~~~~~~~

-+-------+--------------------------------------------------------------+
-| Field | Description                                                  |
-| Name  |                                                              |
-+=======+==============================================================+
-| pdf   | list, each element is a dict representing the parsing result |
-| _info | of each PDF page, see the table below for details            |
-+-------+--------------------------------------------------------------+
-| \_    | ocr \| txt, used to indicate the mode used in this           |
-| parse | intermediate parsing state                                   |
-| _type |                                                              |
-+-------+--------------------------------------------------------------+
-| \_ve  | string, indicates the version of magic-pdf used in this      |
-| rsion | parsing                                                      |
-| _name |                                                              |
-+-------+--------------------------------------------------------------+
++----------------+--------------------------------------------------------------+
+| Field Name     | Description                                                  |
+|                |                                                              |
++================+==============================================================+
+| pdf_info       | list, each element is a dict representing the parsing result |
+|                | of each PDF page, see the table below for details            |
++----------------+--------------------------------------------------------------+
+| \_             | ocr \| txt, used to indicate the mode used in this           |
+| parse_type     | intermediate parsing state                                   |
+|                |                                                              |
++----------------+--------------------------------------------------------------+
+| \_version_name | string, indicates the version of magic-pdf used in this      |
+|                | parsing                                                      |
+|                |                                                              |
++----------------+--------------------------------------------------------------+

 **pdf_info**

 Field structure description

-+---------+------------------------------------------------------------+
-| Field   | Description                                                |
-| Name    |                                                            |
-+=========+============================================================+
-| preproc | Intermediate result after PDF preprocessing, not yet       |
-| _blocks | segmented                                                  |
-+---------+------------------------------------------------------------+
-| layout  | Layout segmentation results, containing layout direction   |
-| _bboxes | (vertical, horizontal), and bbox, sorted by reading order  |
-+---------+------------------------------------------------------------+
-| p       | Page number, starting from 0                               |
-| age_idx |                                                            |
-+---------+------------------------------------------------------------+
-| pa      | Page width and height                                      |
-| ge_size |                                                            |
-+---------+------------------------------------------------------------+
-| \_layo  | Layout tree structure                                      |
-| ut_tree |                                                            |
-+---------+------------------------------------------------------------+
-| images  | list, each element is a dict representing an img_block     |
-+---------+------------------------------------------------------------+
-| tables  | list, each element is a dict representing a table_block    |
-+---------+------------------------------------------------------------+
-| inter   | list, each element is a dict representing an               |
-| line_eq | interline_equation_block                                   |
-| uations |                                                            |
-+---------+------------------------------------------------------------+
-| di      | List, block information returned by the model that needs   |
-| scarded | to be dropped                                              |
-| _blocks |                                                            |
-+---------+------------------------------------------------------------+
-| para    | Result after segmenting preproc_blocks                     |
-| _blocks |                                                            |
-+---------+------------------------------------------------------------+
++-------------------------+------------------------------------------------------------+
+| Field                   | Description                                                |
+| Name                    |                                                            |
++=========================+============================================================+
+| preproc_blocks          | Intermediate result after PDF preprocessing, not yet       |
+|                         | segmented                                                  |
++-------------------------+------------------------------------------------------------+
+| layout_bboxes           | Layout segmentation results, containing layout direction   |
+|                         | (vertical, horizontal), and bbox, sorted by reading order  |
++-------------------------+------------------------------------------------------------+
+| page_idx                | Page number, starting from 0                               |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| page_size               | Page width and height                                      |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| \_layout_tree           | Layout tree structure                                      |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| images                  | list, each element is a dict representing an img_block     |
++-------------------------+------------------------------------------------------------+
+| tables                  | list, each element is a dict representing a table_block    |
++-------------------------+------------------------------------------------------------+
+| interline_equations     | list, each element is a dict representing an               |
+|                         | interline_equation_block                                   |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| discarded_blocks        | List, block information returned by the model that needs   |
+|                         | to be dropped                                              |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| para_blocks             | Result after segmenting preproc_blocks                     |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+

 In the above table, ``para_blocks`` is an array of dicts, each dict
 representing a block structure. A block can support up to one level of
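
Note on the two tables above: a minimal sketch of reading the top-level fields and the per-page entries of a `some_pdf_middle.json` document. `summarize_middle` is a hypothetical helper, and the sample document is hand-written to mimic the documented shape only (its values, including the version string, are made up).

```python
import json


def summarize_middle(middle: dict) -> dict:
    """Collect the top-level fields plus per-page size and block counts."""
    pages = middle.get("pdf_info", [])
    return {
        "parse_type": middle.get("_parse_type"),
        "version": middle.get("_version_name"),
        "page_sizes": {p["page_idx"]: p.get("page_size") for p in pages},
        "para_block_counts": {
            p["page_idx"]: len(p.get("para_blocks", [])) for p in pages
        },
    }


# Hand-written sample; real files carry many more fields per page.
SAMPLE = json.loads("""
{
  "pdf_info": [
    {"page_idx": 0, "page_size": [612, 792], "para_blocks": [{"type": "text"}]},
    {"page_idx": 1, "page_size": [612, 792], "para_blocks": []}
  ],
  "_parse_type": "ocr",
  "_version_name": "0.0.0"
}
""")

summary = summarize_middle(SAMPLE)
print(summary)
```

The backslash in `\_parse_type` and `\_version_name` is reStructuredText escaping; the sketch assumes the JSON keys themselves are `_parse_type` and `_version_name`.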
@@ -205,38 +205,36 @@ nesting.
 The outer block is referred to as a first-level block, and the fields in
 the first-level block include:

-+---------+-------------------------------------------------------------+
-| Field   | Description                                                 |
-| Name    |                                                             |
-+=========+=============================================================+
-| type    | Block type (table|image)                                    |
-+---------+-------------------------------------------------------------+
-| bbox    | Block bounding box coordinates                              |
-+---------+-------------------------------------------------------------+
-| blocks  | list, each element is a dict representing a second-level    |
-|         | block                                                       |
-+---------+-------------------------------------------------------------+
++------------------------+-------------------------------------------------------------+
+| Field                  | Description                                                 |
+| Name                   |                                                             |
++========================+=============================================================+
+| type                   | Block type (table|image)                                    |
++------------------------+-------------------------------------------------------------+
+| bbox                   | Block bounding box coordinates                              |
++------------------------+-------------------------------------------------------------+
+| blocks                 | list, each element is a dict representing a second-level    |
+|                        | block                                                       |
++------------------------+-------------------------------------------------------------+

 There are only two types of first-level blocks: “table” and “image”. All
 other blocks are second-level blocks.

 The fields in a second-level block include:

-+-----+----------------------------------------------------------------+
-| Fi  | Description                                                    |
-| eld |                                                                |
-| N   |                                                                |
-| ame |                                                                |
-+=====+================================================================+
-| t   | Block type                                                     |
-| ype |                                                                |
-+-----+----------------------------------------------------------------+
-| b   | Block bounding box coordinates                                 |
-| box |                                                                |
-+-----+----------------------------------------------------------------+
-| li  | list, each element is a dict representing a line, used to      |
-| nes | describe the composition of a line of information              |
-+-----+----------------------------------------------------------------+
++----------------------+----------------------------------------------------------------+
+| Field                | Description                                                    |
+| Name                 |                                                                |
++======================+================================================================+
+|                      | Block type                                                     |
+| type                 |                                                                |
++----------------------+----------------------------------------------------------------+
+|                      | Block bounding box coordinates                                 |
+| bbox                 |                                                                |
++----------------------+----------------------------------------------------------------+
+|                      | list, each element is a dict representing a line, used to      |
+| lines                | describe the composition of a line of information              |
++----------------------+----------------------------------------------------------------+

 Detailed explanation of second-level block types
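
Note on the nesting rule in this hunk: since only "table" and "image" blocks are first-level containers, a list of `para_blocks` can be flattened into second-level blocks with one check on `type`. The sketch below is illustrative only; `iter_second_level_blocks` is a hypothetical helper, and the child type names `image_body` and `image_caption` in the sample are assumptions, because the list of second-level types is elided from this diff.

```python
FIRST_LEVEL_TYPES = {"table", "image"}


def iter_second_level_blocks(para_blocks):
    """Yield second-level blocks in order, unwrapping table/image containers."""
    for block in para_blocks:
        if block.get("type") in FIRST_LEVEL_TYPES:
            # First-level block: its children live under "blocks"
            yield from block.get("blocks", [])
        else:
            # Already a second-level block
            yield block


# Hand-written sample; the nested type names are assumed for illustration.
SAMPLE_PARA_BLOCKS = [
    {"type": "text", "bbox": [0, 0, 100, 20], "lines": []},
    {
        "type": "image",
        "bbox": [0, 30, 100, 130],
        "blocks": [
            {"type": "image_body", "bbox": [0, 30, 100, 110], "lines": []},
            {"type": "image_caption", "bbox": [0, 112, 100, 130], "lines": []},
        ],
    },
]

flat = list(iter_second_level_blocks(SAMPLE_PARA_BLOCKS))
print([b["type"] for b in flat])
```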
@@ -257,33 +255,31 @@ interline_equation Block formula
 The field format of a line is as follows:

-+-----+----------------------------------------------------------------+
-| Fi  | Description                                                    |
-| eld |                                                                |
-| N   |                                                                |
-| ame |                                                                |
-+=====+================================================================+
-| b   | Bounding box coordinates of the line                           |
-| box |                                                                |
-+-----+----------------------------------------------------------------+
-| sp  | list, each element is a dict representing a span, used to      |
-| ans | describe the composition of the smallest unit                  |
-+-----+----------------------------------------------------------------+
++---------------------+----------------------------------------------------------------+
+| Field               | Description                                                    |
+| Name                |                                                                |
++=====================+================================================================+
+|                     | Bounding box coordinates of the line                           |
+| bbox                |                                                                |
++---------------------+----------------------------------------------------------------+
+| spans               | list, each element is a dict representing a span, used to      |
+|                     | describe the composition of the smallest unit                  |
++---------------------+----------------------------------------------------------------+

 **span**

-+----------+-----------------------------------------------------------+
-| Field    | Description                                               |
-| Name     |                                                           |
-+==========+===========================================================+
-| bbox     | Bounding box coordinates of the span                      |
-+----------+-----------------------------------------------------------+
-| type     | Type of the span                                          |
-+----------+-----------------------------------------------------------+
-| content  | Text spans use content, chart spans use img_path to store |
-| \|       | the actual text or screenshot path information            |
-| img_path |                                                           |
-+----------+-----------------------------------------------------------+
++---------------------+-----------------------------------------------------------+
+| Field               | Description                                               |
+| Name                |                                                           |
++=====================+===========================================================+
+| bbox                | Bounding box coordinates of the span                      |
++---------------------+-----------------------------------------------------------+
+| type                | Type of the span                                          |
++---------------------+-----------------------------------------------------------+
+| content             | Text spans use content, chart spans use img_path to store |
+| \|                  | the actual text or screenshot path information            |
+| img_path            |                                                           |
++---------------------+-----------------------------------------------------------+

 The types of spans are as follows:
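
Note on the line and span tables in this hunk: a second-level block holds `lines`, each line holds `spans`, and each span stores either `content` or `img_path`. A minimal sketch of walking that structure follows. `iter_spans` and `span_payload` are hypothetical helpers, and the span `type` values in the sample are assumptions, since the list of span types is elided from this diff.

```python
def iter_spans(block):
    """Yield every span of a second-level block, line by line."""
    for line in block.get("lines", []):
        yield from line.get("spans", [])


def span_payload(span):
    """Return the text of a text span, or the image path of a chart span."""
    if "content" in span:
        return span["content"]
    return span.get("img_path")


# Hand-written sample; the span type names and the image path are made up.
SAMPLE_BLOCK = {
    "type": "text",
    "bbox": [0, 0, 200, 40],
    "lines": [
        {
            "bbox": [0, 0, 200, 20],
            "spans": [
                {"bbox": [0, 0, 90, 20], "type": "text", "content": "Hello"},
                {"bbox": [95, 0, 200, 20], "type": "text", "content": "world"},
            ],
        },
        {
            "bbox": [0, 20, 200, 40],
            "spans": [
                {"bbox": [0, 20, 200, 40], "type": "image", "img_path": "images/example.jpg"},
            ],
        },
    ],
}

payloads = [span_payload(s) for s in iter_spans(SAMPLE_BLOCK)]
print(payloads)
```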
next_docs/zh_cn/user_guide/quick_start/to_markdown.rst
View file @ 132c2089

@@ -3,12 +3,16 @@
 Convert to Markdown File
 ========================

+Local File Example
+^^^^^^^^^^^^^^^^^^
+
 .. code:: python

     import os

     from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
+    from magic_pdf.config.make_content_config import DropMode, MakeMode
     from magic_pdf.pipe.OCRPipe import OCRPipe
@@ -23,7 +27,7 @@
     image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
         local_md_dir
-    )  # create 00
+    )
     image_dir = str(os.path.basename(local_image_dir))

     reader1 = FileBasedDataReader("")
@@ -49,5 +53,51 @@
         md_writer.write_string(f"{pdf_file_name}.md", md_content)

+
+See :doc:`../data/data_reader_writer` for more **read/write** examples
+
+Object Storage Usage Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    import os
+
+    from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
+    from magic_pdf.config.make_content_config import DropMode, MakeMode
+    from magic_pdf.pipe.OCRPipe import OCRPipe
+
+    bucket_name = "{Your S3 Bucket Name}"  # replace with real bucket name
+    ak = "{Your S3 access key}"  # replace with real s3 access key
+    sk = "{Your S3 secret key}"  # replace with real s3 secret key
+    endpoint_url = "{Your S3 endpoint_url}"  # replace with real s3 endpoint_url
+
+    reader = S3DataReader('unittest/tmp/', bucket_name, ak, sk, endpoint_url)  # replace `unittest/tmp` with the real s3 prefix
+    writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
+    image_writer = S3DataWriter('unittest/tmp/images', bucket_name, ak, sk, endpoint_url)
+
+    ## args
+    model_list = []
+    # the doubled braces keep the placeholder literal inside the f-string
+    pdf_file_name = f"s3://{bucket_name}/{{fake pdf path}}"  # replace with the real s3 path
+
+    pdf_bytes = reader.read(pdf_file_name)  # read the pdf content
+
+    pipe = OCRPipe(pdf_bytes, model_list, image_writer)
+
+    pipe.pipe_classify()
+    pipe.pipe_analyze()
+    pipe.pipe_parse()
+
+    pdf_info = pipe.pdf_mid_data["pdf_info"]
+
+    md_content = pipe.pipe_mk_markdown(
+        "unittest/tmp/images", drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
+    )
+
+    if isinstance(md_content, list):
+        writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
+    else:
+        writer.write_string(f"{pdf_file_name}.md", md_content)
+
+
+See :doc:`../data/data_reader_writer` for more **read/write** examples
next_docs/zh_cn/user_guide/tutorial/output_file_description.rst
View file @ 132c2089

@@ -143,11 +143,11 @@ some_pdf_middle.json
 | pdf_info       | list, each element is a dict representing the parsing result |
 |                | of each PDF page, see the table below for details            |
 +----------------+--------------------------------------------------------------+
-| \_p            | ocr \| txt, used to indicate the mode used in this           |
-| arse_type      | intermediate parsing state                                   |
+|                | ocr \| txt, used to indicate the mode used in this           |
+| \_parse_type   | intermediate parsing state                                   |
 +----------------+--------------------------------------------------------------+
-| \_ver          | string, indicates the version of magic-pdf used in this      |
-| sion_name      | parsing                                                      |
+|                | string, indicates the version of magic-pdf used in this      |
+| \_version_name | parsing                                                      |
 +----------------+--------------------------------------------------------------+

 **pdf_info** field structure description
@@ -155,11 +155,11 @@ some_pdf_middle.json
 +---------------------+------------------------------------------------------------+
 | Field name          | Description                                                |
 +=====================+============================================================+
-| pr                  | Intermediate result after PDF preprocessing, not yet       |
-| eproc_blocks        | segmented                                                  |
+|                     | Intermediate result after PDF preprocessing, not yet       |
+| preproc_blocks      | segmented                                                  |
 +---------------------+------------------------------------------------------------+
-| l                   | Layout segmentation results, containing layout direction   |
-| ayout_bboxes        | (vertical, horizontal), and bbox, sorted by reading order  |
+|                     | Layout segmentation results, containing layout direction   |
+| layout_bboxes       | (vertical, horizontal), and bbox, sorted by reading order  |
 +---------------------+------------------------------------------------------------+
 | page_idx            | Page number, starting from 0                               |
 +---------------------+------------------------------------------------------------+
@@ -172,11 +172,11 @@ some_pdf_middle.json
 +---------------------+------------------------------------------------------------+
 | tables              | list, each element is a dict representing a table_block    |
 +---------------------+------------------------------------------------------------+
-| interli             | list, each element is a dict representing an               |
-| ne_equations        | interline_equation_block                                   |
+|                     | list, each element is a dict representing an               |
+| interline_equations | interline_equation_block                                   |
 +---------------------+------------------------------------------------------------+
-| disc                | List, block information returned by the model that needs   |
-| arded_blocks        | to be dropped                                              |
+|                     | List, block information returned by the model that needs   |
+| discarded_blocks    | to be dropped                                              |
 +---------------------+------------------------------------------------------------+
 | para_blocks         | Result after segmenting preproc_blocks                     |
 +---------------------+------------------------------------------------------------+
@@ -205,14 +205,14 @@ blocks list, each element is a second-level block in dict format
 | name  |                                                                |
 |       |                                                                |
 +=======+================================================================+
-| t     | Block type                                                     |
-| ype   |                                                                |
+|       | Block type                                                     |
+| type  |                                                                |
 +-------+----------------------------------------------------------------+
-| b     | Block bounding box coordinates                                 |
-| box   |                                                                |
+|       | Block bounding box coordinates                                 |
+| bbox  |                                                                |
 +-------+----------------------------------------------------------------+
-| li    | list, each element is a dict representing a line, used to      |
-| nes   | describe the composition of a line of information              |
+|       | list, each element is a dict representing a line, used to      |
+| lines | describe the composition of a line of information              |
 +-------+----------------------------------------------------------------+

 Detailed explanation of second-level block types
@@ -242,12 +242,11 @@ The field format of a line is as follows
 | name  |                                                                |
 |       |                                                                |
 +=======+================================================================+
-| bb    | Bounding box coordinates of the line                           |
-| ox    |                                                                |
+| bbox  | Bounding box coordinates of the line                           |
+|       |                                                                |
 +-------+----------------------------------------------------------------+
-| s     | list, each element is a dict representing a span, used to      |
-| pa    | describe the composition of the smallest unit                  |
-| ns    |                                                                |
+| spans | list, each element is a dict representing a span, used to      |
+|       | describe the composition of the smallest unit                  |
 +-------+----------------------------------------------------------------+

 **span**