wangsen / MinerU
Commit 81529317
authored Nov 27, 2024 by xu rui
fix: table format
parent a4b29f89
Showing 2 changed files with 117 additions and 122 deletions

next_docs/en/user_guide/tutorial/output_file_description.rst (+95 -99)
next_docs/zh_cn/user_guide/tutorial/output_file_description.rst (+22 -23)
next_docs/en/user_guide/tutorial/output_file_description.rst
...
@@ -141,60 +141,60 @@ example
some_pdf_middle.json
~~~~~~~~~~~~~~~~~~~~

+----------------+--------------------------------------------------------------+
| Field Name     | Description                                                  |
+================+==============================================================+
| pdf_info       | list, each element is a dict representing the parsing result |
|                | of each PDF page, see the table below for details            |
+----------------+--------------------------------------------------------------+
| \_parse_type   | ocr \| txt, used to indicate the mode used in this           |
|                | intermediate parsing state                                   |
+----------------+--------------------------------------------------------------+
| \_version_name | string, indicates the version of magic-pdf used in this      |
|                | parsing                                                      |
+----------------+--------------------------------------------------------------+
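
As a rough illustration of the three top-level fields, the sketch below reads them from an already-loaded dict. It assumes the backslash in ``\_parse_type`` and ``\_version_name`` is only reStructuredText escaping, so the JSON keys are ``_parse_type`` and ``_version_name``. The helper name ``summarize_middle_json`` and the sample values are hypothetical and not part of magic-pdf.

```python
import json


def summarize_middle_json(data: dict) -> dict:
    """Collect the three top-level fields into a small summary dict."""
    return {
        "parse_type": data.get("_parse_type"),        # "ocr" or "txt"
        "version": data.get("_version_name"),         # magic-pdf version string
        "page_count": len(data.get("pdf_info", [])),  # one dict per PDF page
    }


# Hypothetical stand-in for the contents of some_pdf_middle.json.
sample_text = (
    '{"pdf_info": [{"page_idx": 0}], '
    '"_parse_type": "txt", "_version_name": "0.0.0"}'
)
summary = summarize_middle_json(json.loads(sample_text))
print(summary)
```

With a real file, the same function could be applied to the result of ``json.load`` on the opened ``some_pdf_middle.json``.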
**pdf_info**

Field structure description

+-------------------------+------------------------------------------------------------+
| Field Name              | Description                                                |
+=========================+============================================================+
| preproc_blocks          | Intermediate result after PDF preprocessing, not yet       |
|                         | segmented                                                  |
+-------------------------+------------------------------------------------------------+
| layout_bboxes           | Layout segmentation results, containing layout direction   |
|                         | (vertical, horizontal), and bbox, sorted by reading order  |
+-------------------------+------------------------------------------------------------+
| page_idx                | Page number, starting from 0                               |
+-------------------------+------------------------------------------------------------+
| page_size               | Page width and height                                      |
+-------------------------+------------------------------------------------------------+
| \_layout_tree           | Layout tree structure                                      |
+-------------------------+------------------------------------------------------------+
| images                  | list, each element is a dict representing an img_block     |
+-------------------------+------------------------------------------------------------+
| tables                  | list, each element is a dict representing a table_block    |
+-------------------------+------------------------------------------------------------+
| interline_equations     | list, each element is a dict representing an               |
|                         | interline_equation_block                                   |
+-------------------------+------------------------------------------------------------+
| discarded_blocks        | List, block information returned by the model that needs   |
|                         | to be dropped                                              |
+-------------------------+------------------------------------------------------------+
| para_blocks             | Result after segmenting preproc_blocks                     |
+-------------------------+------------------------------------------------------------+
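
The per-page fields listed for ``pdf_info`` can be inspected with a short helper like the one below. This is only a sketch: ``page_summary`` is a hypothetical name, the sample page is invented, and the ``[width, height]`` ordering of ``page_size`` is an assumption, since the table only says "Page width and height".

```python
def page_summary(page: dict) -> dict:
    """Count the per-page lists described in the pdf_info table."""
    list_fields = [
        "images",
        "tables",
        "interline_equations",
        "discarded_blocks",
        "para_blocks",
    ]
    summary = {
        "page_idx": page.get("page_idx"),    # page number, starting from 0
        "page_size": page.get("page_size"),  # page width and height
    }
    for name in list_fields:
        # Missing lists are treated as empty rather than raising KeyError.
        summary[name] = len(page.get(name, []))
    return summary


# Hypothetical page entry; real entries carry full block dicts.
sample_page = {
    "page_idx": 0,
    "page_size": [612, 792],  # assumed [width, height] ordering
    "images": [{}],
    "tables": [],
    "interline_equations": [{}, {}],
    "discarded_blocks": [],
    "para_blocks": [{}, {}, {}],
}
result = page_summary(sample_page)
print(result)
```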
In the above table, ``para_blocks`` is an array of dicts, each dict
representing a block structure. A block can support up to one level of
...
@@ -205,38 +205,36 @@ nesting.
The outer block is referred to as a first-level block, and the fields in
the first-level block include:

+------------------------+-------------------------------------------------------------+
| Field Name             | Description                                                 |
+========================+=============================================================+
| type                   | Block type (table \| image)                                 |
+------------------------+-------------------------------------------------------------+
| bbox                   | Block bounding box coordinates                              |
+------------------------+-------------------------------------------------------------+
| blocks                 | list, each element is a dict representing a second-level    |
|                        | block                                                       |
+------------------------+-------------------------------------------------------------+

There are only two types of first-level blocks: “table” and “image”. All
other blocks are second-level blocks.

The fields in a second-level block include:

+----------------------+----------------------------------------------------------------+
| Field Name           | Description                                                    |
+======================+================================================================+
| type                 | Block type                                                     |
+----------------------+----------------------------------------------------------------+
| bbox                 | Block bounding box coordinates                                 |
+----------------------+----------------------------------------------------------------+
| lines                | list, each element is a dict representing a line, used to      |
|                      | describe the composition of a line of information              |
+----------------------+----------------------------------------------------------------+
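
The one-level nesting described above can be flattened with a small generator. The sketch assumes that a first-level block is any entry of ``para_blocks`` whose ``type`` is ``table`` or ``image``, and that every other entry is already a second-level block. The function name and the sample type names (``text``, ``image_body``, ``image_caption``) are placeholders for illustration, not taken from this diff.

```python
from typing import Iterator

# Per the text, only these two types are first-level (container) blocks.
FIRST_LEVEL_TYPES = ("table", "image")


def iter_second_level_blocks(para_blocks: list) -> Iterator[dict]:
    """Yield second-level blocks in order, unwrapping table/image containers."""
    for block in para_blocks:
        if block.get("type") in FIRST_LEVEL_TYPES:
            # First-level block: its children live under "blocks".
            yield from block.get("blocks", [])
        else:
            # Already a second-level block with "lines".
            yield block


# Hypothetical sample; the type names are placeholders.
sample_blocks = [
    {"type": "text", "bbox": [0, 0, 10, 10], "lines": []},
    {
        "type": "image",
        "bbox": [0, 20, 10, 40],
        "blocks": [
            {"type": "image_body", "bbox": [0, 20, 10, 35], "lines": []},
            {"type": "image_caption", "bbox": [0, 35, 10, 40], "lines": []},
        ],
    },
]
types = [b["type"] for b in iter_second_level_blocks(sample_blocks)]
print(types)
```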

Detailed explanation of second-level block types
...
@@ -257,33 +255,31 @@ interline_equation Block formula
The field format of a line is as follows:

+---------------------+----------------------------------------------------------------+
| Field Name          | Description                                                    |
+=====================+================================================================+
| bbox                | Bounding box coordinates of the line                           |
+---------------------+----------------------------------------------------------------+
| spans               | list, each element is a dict representing a span, used to      |
|                     | describe the composition of the smallest unit                  |
+---------------------+----------------------------------------------------------------+

**span**

+---------------------+-----------------------------------------------------------+
| Field Name          | Description                                               |
+=====================+===========================================================+
| bbox                | Bounding box coordinates of the span                      |
+---------------------+-----------------------------------------------------------+
| type                | Type of the span                                          |
+---------------------+-----------------------------------------------------------+
| content \| img_path | Text spans use content, chart spans use img_path to store |
|                     | the actual text or screenshot path information            |
+---------------------+-----------------------------------------------------------+
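
To show how the block, line, and span levels fit together, the sketch below walks one second-level block and collects whichever of ``content`` or ``img_path`` each span carries. The helper name ``span_payloads``, the sample span types, and the sample path are hypothetical. The only assumption taken from the table is that a span stores its payload under one of those two keys.

```python
def span_payloads(block: dict) -> list:
    """Return (key, value) pairs for every span in a second-level block."""
    payloads = []
    for line in block.get("lines", []):
        for span in line.get("spans", []):
            # A span stores either text ("content") or a path ("img_path").
            for key in ("content", "img_path"):
                if key in span:
                    payloads.append((key, span[key]))
                    break
    return payloads


# Hypothetical block; span types and the image path are placeholders.
sample_block = {
    "type": "text",
    "bbox": [0, 0, 100, 20],
    "lines": [
        {
            "bbox": [0, 0, 100, 10],
            "spans": [
                {"bbox": [0, 0, 50, 10], "type": "text", "content": "Hello"},
                {
                    "bbox": [50, 0, 100, 10],
                    "type": "image",
                    "img_path": "images/example.jpg",
                },
            ],
        },
    ],
}
payloads = span_payloads(sample_block)
print(payloads)
```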

The types of spans are as follows:
...
next_docs/zh_cn/user_guide/tutorial/output_file_description.rst
...
@@ -143,11 +143,11 @@ some_pdf_middle.json
| pdf_info       | list, each element is a dict representing the parsing result |
|                | of each PDF page, see the table below for details            |
+----------------+--------------------------------------------------------------+
| \_parse_type   | ocr \| txt, used to indicate the mode used in this           |
|                | intermediate parsing state                                   |
+----------------+--------------------------------------------------------------+
| \_version_name | string, indicates the version of magic-pdf used in this      |
|                | parsing                                                      |
+----------------+--------------------------------------------------------------+

**pdf_info** field structure description
...
@@ -155,11 +155,11 @@ some_pdf_middle.json
+-------------------------+------------------------------------------------------------+
| Field Name              | Description                                                |
+=========================+============================================================+
| preproc_blocks          | Intermediate result after PDF preprocessing, not yet       |
|                         | segmented                                                  |
+-------------------------+------------------------------------------------------------+
| layout_bboxes           | Layout segmentation results, containing layout direction   |
|                         | (vertical, horizontal), and bbox, sorted by reading order  |
+-------------------------+------------------------------------------------------------+
| page_idx                | Page number, starting from 0                               |
+-------------------------+------------------------------------------------------------+
...
@@ -172,11 +172,11 @@ some_pdf_middle.json
+-------------------------+------------------------------------------------------------+
| tables                  | list, each element is a dict representing a table_block    |
+-------------------------+------------------------------------------------------------+
| interline_equations     | list, each element is a dict representing an               |
|                         | interline_equation_block                                   |
+-------------------------+------------------------------------------------------------+
| discarded_blocks        | List, block information returned by the model that needs   |
|                         | to be dropped                                              |
+-------------------------+------------------------------------------------------------+
| para_blocks             | Result after segmenting preproc_blocks                     |
+-------------------------+------------------------------------------------------------+
...
@@ -205,14 +205,14 @@ blocks list, each element is a second-level block in dict format
| Field Name           | Description                                                    |
+======================+================================================================+
| type                 | Block type                                                     |
+----------------------+----------------------------------------------------------------+
| bbox                 | Block bounding box coordinates                                 |
+----------------------+----------------------------------------------------------------+
| lines                | list, each element is a dict representing a line, used to      |
|                      | describe the composition of a line of information              |
+----------------------+----------------------------------------------------------------+

Detailed explanation of second-level block types
...
@@ -242,12 +242,11 @@ The field format of a line is as follows
| Field Name          | Description                                                    |
+=====================+================================================================+
| bbox                | Bounding box coordinates of the line                           |
+---------------------+----------------------------------------------------------------+
| spans               | list, each element is a dict representing a span, used to      |
|                     | describe the composition of the smallest unit                  |
+---------------------+----------------------------------------------------------------+

**span**
...