Commit 81529317 authored by xu rui

fix: table format

parent a4b29f89
...@@ -141,60 +141,60 @@ example ...@@ -141,60 +141,60 @@ example
some_pdf_middle.json some_pdf_middle.json
~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~
+-------+--------------------------------------------------------------+ +----------------+--------------------------------------------------------------+
| Field | Description | | Field Name | Description |
| Name | | | | |
+=======+==============================================================+ +================+==============================================================+
| pdf | list, each element is a dict representing the parsing result | | pdf_info | list, each element is a dict representing the parsing result |
| _info | of each PDF page, see the table below for details | | | of each PDF page, see the table below for details |
+-------+--------------------------------------------------------------+ +----------------+--------------------------------------------------------------+
| \_ | ocr \| txt, used to indicate the mode used in this | | \_ | ocr \| txt, used to indicate the mode used in this |
| parse | intermediate parsing state | | parse_type | intermediate parsing state |
| _type | | | | |
+-------+--------------------------------------------------------------+ +----------------+--------------------------------------------------------------+
| \_ve | string, indicates the version of magic-pdf used in this | | \_version_name | string, indicates the version of magic-pdf used in this |
| rsion | parsing | | | parsing |
| _name | | | | |
+-------+--------------------------------------------------------------+ +----------------+--------------------------------------------------------------+
**pdf_info** **pdf_info**
Field structure description Field structure description
+---------+------------------------------------------------------------+ +-------------------------+------------------------------------------------------------+
| Field | Description | | Field | Description |
| Name | | | Name | |
+=========+============================================================+ +=========================+============================================================+
| preproc | Intermediate result after PDF preprocessing, not yet | | preproc_blocks | Intermediate result after PDF preprocessing, not yet |
| _blocks | segmented | | | segmented |
+---------+------------------------------------------------------------+ +-------------------------+------------------------------------------------------------+
| layout | Layout segmentation results, containing layout direction | | layout_bboxes | Layout segmentation results, containing layout direction |
| _bboxes | (vertical, horizontal), and bbox, sorted by reading order | | | (vertical, horizontal), and bbox, sorted by reading order |
+---------+------------------------------------------------------------+ +-------------------------+------------------------------------------------------------+
| p | Page number, starting from 0 | | page_idx | Page number, starting from 0 |
| age_idx | | | | |
+---------+------------------------------------------------------------+ +-------------------------+------------------------------------------------------------+
| pa | Page width and height | | page_size | Page width and height |
| ge_size | | | | |
+---------+------------------------------------------------------------+ +-------------------------+------------------------------------------------------------+
| \_layo | Layout tree structure | | \_layout_tree | Layout tree structure |
| ut_tree | | | | |
+---------+------------------------------------------------------------+ +-------------------------+------------------------------------------------------------+
| images | list, each element is a dict representing an img_block | | images | list, each element is a dict representing an img_block |
+---------+------------------------------------------------------------+ +-------------------------+------------------------------------------------------------+
| tables | list, each element is a dict representing a table_block | | tables | list, each element is a dict representing a table_block |
+---------+------------------------------------------------------------+ +-------------------------+------------------------------------------------------------+
| inter | list, each element is a dict representing an | | interline_equation | list, each element is a dict representing an |
| line_eq | interline_equation_block | | | interline_equation_block |
| uations | | | | |
+---------+------------------------------------------------------------+ +-------------------------+------------------------------------------------------------+
| di | List, block information returned by the model that needs | | discarded_blocks | List, block information returned by the model that needs |
| scarded | to be dropped | | | to be dropped |
| _blocks | | | | |
+---------+------------------------------------------------------------+ +-------------------------+------------------------------------------------------------+
| para | Result after segmenting preproc_blocks | | para_blocks | Result after segmenting preproc_blocks |
| _blocks | | | | |
+---------+------------------------------------------------------------+ +-------------------------+------------------------------------------------------------+
In the above table, ``para_blocks`` is an array of dicts, each dict In the above table, ``para_blocks`` is an array of dicts, each dict
representing a block structure. A block can support up to one level of representing a block structure. A block can support up to one level of
...@@ -205,38 +205,36 @@ nesting. ...@@ -205,38 +205,36 @@ nesting.
The outer block is referred to as a first-level block, and the fields in The outer block is referred to as a first-level block, and the fields in
the first-level block include: the first-level block include:
+---------+-------------------------------------------------------------+ +------------------------+-------------------------------------------------------------+
| Field | Description | | Field | Description |
| Name | | | Name | |
+=========+=============================================================+ +========================+=============================================================+
| type | Block type (table|image) | | type | Block type (table|image) |
+---------+-------------------------------------------------------------+ +------------------------+-------------------------------------------------------------+
| bbox | Block bounding box coordinates | | bbox | Block bounding box coordinates |
+---------+-------------------------------------------------------------+ +------------------------+-------------------------------------------------------------+
| blocks | list, each element is a dict representing a second-level | | blocks | list, each element is a dict representing a second-level |
| | block | | | block |
+---------+-------------------------------------------------------------+ +------------------------+-------------------------------------------------------------+
There are only two types of first-level blocks: “table” and “image”. All There are only two types of first-level blocks: “table” and “image”. All
other blocks are second-level blocks. other blocks are second-level blocks.
The fields in a second-level block include: The fields in a second-level block include:
+-----+----------------------------------------------------------------+ +----------------------+----------------------------------------------------------------+
| Fi | Description | | Field | Description |
| eld | | | Name | |
| N | | +======================+================================================================+
| ame | | | | Block type |
+=====+================================================================+ | type | |
| t | Block type | +----------------------+----------------------------------------------------------------+
| ype | | | | Block bounding box coordinates |
+-----+----------------------------------------------------------------+ | bbox | |
| b | Block bounding box coordinates | +----------------------+----------------------------------------------------------------+
| box | | | | list, each element is a dict representing a line, used to |
+-----+----------------------------------------------------------------+ | lines | describe the composition of a line of information |
| li | list, each element is a dict representing a line, used to | +----------------------+----------------------------------------------------------------+
| nes | describe the composition of a line of information |
+-----+----------------------------------------------------------------+
Detailed explanation of second-level block types Detailed explanation of second-level block types
...@@ -257,33 +255,31 @@ interline_equation Block formula ...@@ -257,33 +255,31 @@ interline_equation Block formula
The field format of a line is as follows: The field format of a line is as follows:
+-----+----------------------------------------------------------------+ +---------------------+----------------------------------------------------------------+
| Fi | Description | | Field | Description |
| eld | | | Name | |
| N | | +=====================+================================================================+
| ame | | | | Bounding box coordinates of the line |
+=====+================================================================+ | bbox | |
| b | Bounding box coordinates of the line | +---------------------+----------------------------------------------------------------+
| box | | | spans | list, each element is a dict representing a span, used to |
+-----+----------------------------------------------------------------+ | | describe the composition of the smallest unit |
| sp | list, each element is a dict representing a span, used to | +---------------------+----------------------------------------------------------------+
| ans | describe the composition of the smallest unit |
+-----+----------------------------------------------------------------+
**span** **span**
+----------+-----------------------------------------------------------+ +---------------------+-----------------------------------------------------------+
| Field | Description | | Field | Description |
| Name | | | Name | |
+==========+===========================================================+ +=====================+===========================================================+
| bbox | Bounding box coordinates of the span | | bbox | Bounding box coordinates of the span |
+----------+-----------------------------------------------------------+ +---------------------+-----------------------------------------------------------+
| type | Type of the span | | type | Type of the span |
+----------+-----------------------------------------------------------+ +---------------------+-----------------------------------------------------------+
| content | Text spans use content, chart spans use img_path to store | | content | Text spans use content, chart spans use img_path to store |
| \| | the actual text or screenshot path information | | \| | the actual text or screenshot path information |
| img_path | | | img_path | |
+----------+-----------------------------------------------------------+ +---------------------+-----------------------------------------------------------+
The types of spans are as follows: The types of spans are as follows:
......
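The tables in this hunk describe a nesting of `pdf_info` → `para_blocks` → (optional second-level `blocks`) → `lines` → `spans`. The following is a minimal sketch of walking that nesting, not code from magic-pdf itself. The helper names `iter_spans` and `span_payload`, the sample values, and the block and span type strings `"text"` and `"image_body"` are assumptions made for illustration; only the field names come from the tables above.

```python
from typing import Any, Dict, Iterator

# Hypothetical sample shaped like some_pdf_middle.json; all values are invented.
SAMPLE: Dict[str, Any] = {
    "pdf_info": [
        {
            "page_idx": 0,
            "page_size": [612, 792],
            "para_blocks": [
                {   # a block that carries its lines directly
                    "type": "text",  # assumed type name
                    "bbox": [10, 10, 200, 30],
                    "lines": [
                        {"bbox": [10, 10, 200, 30],
                         "spans": [{"bbox": [10, 10, 60, 30],
                                    "type": "text",
                                    "content": "Hello"}]},
                    ],
                },
                {   # first-level block ("image") wrapping second-level blocks
                    "type": "image",
                    "bbox": [10, 40, 200, 300],
                    "blocks": [
                        {"type": "image_body",  # assumed type name
                         "bbox": [10, 40, 200, 300],
                         "lines": [
                             {"bbox": [10, 40, 200, 300],
                              "spans": [{"bbox": [10, 40, 200, 300],
                                         "type": "image",
                                         "img_path": "images/fig1.jpg"}]},
                         ]},
                    ],
                },
            ],
        }
    ],
    "_parse_type": "txt",
    "_version_name": "0.0.0",
}


def iter_spans(middle: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every span in reading order, flattening the one level of nesting."""
    for page in middle.get("pdf_info", []):
        for block in page.get("para_blocks", []):
            # A first-level block keeps its children under "blocks";
            # any other block is treated as its own only child.
            for sub in block.get("blocks") or [block]:
                for line in sub.get("lines", []):
                    yield from line.get("spans", [])


def span_payload(span: Dict[str, Any]) -> str:
    """Return the text of a span, or its image path if it has no text."""
    return span.get("content", span.get("img_path", ""))


if __name__ == "__main__":
    for span in iter_spans(SAMPLE):
        print(span["type"], span_payload(span))
```

To run it against a real file, load the file with `json.load` and pass the resulting dict to `iter_spans` in place of `SAMPLE`.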
...@@ -143,11 +143,11 @@ some_pdf_middle.json ...@@ -143,11 +143,11 @@ some_pdf_middle.json
| pdf_info | list, each | | pdf_info | list, each |
| | element is a dict representing the parsing result of one PDF page, see the table below for details | | | element is a dict representing the parsing result of one PDF page, see the table below for details |
+-----------+----------------------------------------------------------+ +-----------+----------------------------------------------------------+
| \_p | ocr \| txt, used to indicate the mode used in this intermediate parsing state | | | ocr \| txt, used to indicate the mode used in this intermediate parsing state |
| arse_type | | | \_parse_type | |
+-----------+----------------------------------------------------------+ +-----------+----------------------------------------------------------+
| \_ver | string, indicates the version of magic-pdf used in this parsing | | | string, indicates the version of magic-pdf used in this parsing |
| sion_name | | | \_version_name | |
+-----------+----------------------------------------------------------+ +-----------+----------------------------------------------------------+
**pdf_info** field structure description **pdf_info** field structure description
...@@ -155,11 +155,11 @@ some_pdf_middle.json ...@@ -155,11 +155,11 @@ some_pdf_middle.json
+--------------+-------------------------------------------------------+ +--------------+-------------------------------------------------------+
| Field Name | Description | | Field Name | Description |
+==============+=======================================================+ +==============+=======================================================+
| pr | Intermediate result after PDF preprocessing, not yet segmented | | | Intermediate result after PDF preprocessing, not yet segmented |
| eproc_blocks | | | preeproc_blocks | |
+--------------+-------------------------------------------------------+ +--------------+-------------------------------------------------------+
| l | Layout segmentation results, | | | Layout segmentation results, |
| ayout_bboxes | containing layout direction (vertical, horizontal) and bbox, sorted by reading order | | layout_bboxes | containing layout direction (vertical, horizontal) and bbox, sorted by reading order |
+--------------+-------------------------------------------------------+ +--------------+-------------------------------------------------------+
| page_idx | Page number, starting from 0 | | page_idx | Page number, starting from 0 |
+--------------+-------------------------------------------------------+ +--------------+-------------------------------------------------------+
...@@ -172,11 +172,11 @@ some_pdf_middle.json ...@@ -172,11 +172,11 @@ some_pdf_middle.json
+--------------+-------------------------------------------------------+ +--------------+-------------------------------------------------------+
| tables | list, each element is a dict representing a table_block | | tables | list, each element is a dict representing a table_block |
+--------------+-------------------------------------------------------+ +--------------+-------------------------------------------------------+
| interli | list, each element | | | list, each element |
| ne_equations | is a dict representing an interline_equation_block | | interline_equations | is a dict representing an interline_equation_block |
+--------------+-------------------------------------------------------+ +--------------+-------------------------------------------------------+
| disc | List, block information returned by the model that needs to be dropped | | | List, block information returned by the model that needs to be dropped |
| arded_blocks | | | discarded_blocks | |
+--------------+-------------------------------------------------------+ +--------------+-------------------------------------------------------+
| para_blocks | Result after segmenting preproc_blocks | | para_blocks | Result after segmenting preproc_blocks |
+--------------+-------------------------------------------------------+ +--------------+-------------------------------------------------------+
...@@ -205,14 +205,14 @@ blocks list, each element is a second-level block in dict format ...@@ -205,14 +205,14 @@ blocks list, each element is a second-level block in dict format
| Field | | | Field | |
| Name | | | Name | |
+=====+================================================================+ +=====+================================================================+
| t | Block type | | | Block type |
| ype | | | type | |
+-----+----------------------------------------------------------------+ +-----+----------------------------------------------------------------+
| b | Block bounding box coordinates | | | Block bounding box coordinates |
| box | | | bbox | |
+-----+----------------------------------------------------------------+ +-----+----------------------------------------------------------------+
| li | list, each element is a dict representing a line, used to describe the composition of a line of information | | | list, each element is a dict representing a line, used to describe the composition of a line of information |
| nes | | | lines | |
+-----+----------------------------------------------------------------+ +-----+----------------------------------------------------------------+
Detailed explanation of second-level block types Detailed explanation of second-level block types
...@@ -242,12 +242,11 @@ The field format of a line is as follows ...@@ -242,12 +242,11 @@ The field format of a line is as follows
| Field | | | Field | |
| Name | | | Name | |
+====+=================================================================+ +====+=================================================================+
| bb | Bounding box coordinates of the line | | bbox | Bounding box coordinates of the line |
| ox | | | | |
+----+-----------------------------------------------------------------+ +----+-----------------------------------------------------------------+
| s | list, | | spans | list, |
| pa | each element is a dict representing a span, used to describe the composition of the smallest unit | | | each element is a dict representing a span, used to describe the composition of the smallest unit |
| ns | |
+----+-----------------------------------------------------------------+ +----+-----------------------------------------------------------------+
**span** **span**
......