Unverified Commit 132c2089 authored by Xiaomeng Zhao, committed by GitHub

Merge pull request #1117 from icecraft/feat/add_s3_read_write_example

Feat/add s3 read write example
parents b8fdab11 81529317
......@@ -3,12 +3,16 @@
Convert To Markdown
========================
Local File Example
^^^^^^^^^^^^^^^^^^
.. code:: python
import os
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.config.make_content_config import DropMode, MakeMode
from magic_pdf.pipe.OCRPipe import OCRPipe
......@@ -23,7 +27,7 @@ Convert To Markdown
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
local_md_dir
) # create 00
)
image_dir = str(os.path.basename(local_image_dir))
reader1 = FileBasedDataReader("")
......@@ -49,4 +53,50 @@ Convert To Markdown
md_writer.write_string(f"{pdf_file_name}.md", md_content)
Check :doc:`../data/data_reader_writer` for more [reader | writer] examples
S3 File Example
^^^^^^^^^^^^^^^^
.. code:: python

    import os
    from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
    from magic_pdf.config.make_content_config import DropMode, MakeMode
    from magic_pdf.pipe.OCRPipe import OCRPipe

    bucket_name = "{Your S3 Bucket Name}"  # replace with real bucket name
    ak = "{Your S3 access key}"  # replace with real s3 access key
    sk = "{Your S3 secret key}"  # replace with real s3 secret key
    endpoint_url = "{Your S3 endpoint_url}"  # replace with real s3 endpoint_url

    reader = S3DataReader('unittest/tmp/', bucket_name, ak, sk, endpoint_url)  # replace `unittest/tmp` with the real s3 prefix
    writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
    image_writer = S3DataWriter('unittest/tmp/images', bucket_name, ak, sk, endpoint_url)

    ## args
    model_list = []
    # doubled braces keep the literal placeholder text {fake pdf path} valid inside the f-string
    pdf_file_name = f"s3://{bucket_name}/{{fake pdf path}}"  # replace with the real s3 path
    pdf_bytes = reader.read(pdf_file_name)  # read the pdf content

    pipe = OCRPipe(pdf_bytes, model_list, image_writer)
    pipe.pipe_classify()
    pipe.pipe_analyze()
    pipe.pipe_parse()
    pdf_info = pipe.pdf_mid_data["pdf_info"]

    md_content = pipe.pipe_mk_markdown(
        "unittest/tmp/images", drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
    )

    # pipe_mk_markdown may return a list of strings or a single string
    if isinstance(md_content, list):
        writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
    else:
        writer.write_string(f"{pdf_file_name}.md", md_content)
Check :doc:`../data/data_reader_writer` for more [reader | writer] examples
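The S3 example above ends by handling two possible shapes of ``md_content``: a list of strings or a single string. A minimal sketch of that normalization step follows. The helper name ``to_markdown_text`` is an assumption made for illustration and is not part of magic_pdf.

```python
from typing import List, Union


def to_markdown_text(md_content: Union[str, List[str]]) -> str:
    """Return one Markdown string, joining list items with newlines.

    Hypothetical helper: it mirrors the isinstance check used in the
    S3 example before calling write_string.
    """
    if isinstance(md_content, list):
        return "\n".join(md_content)
    return md_content


# Both input shapes end up as a single string that a writer can store.
print(to_markdown_text(["# Title", "Body text"]))
print(to_markdown_text("# Already a string"))
```

With such a helper, the final branch in the example could collapse to a single ``write_string`` call, at the cost of one extra function.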
......@@ -141,60 +141,60 @@ example
some_pdf_middle.json
~~~~~~~~~~~~~~~~~~~~
+-------+--------------------------------------------------------------+
| Field | Description |
| Name | |
+=======+==============================================================+
| pdf | list, each element is a dict representing the parsing result |
| _info | of each PDF page, see the table below for details |
+-------+--------------------------------------------------------------+
| \_ | ocr \| txt, used to indicate the mode used in this |
| parse | intermediate parsing state |
| _type | |
+-------+--------------------------------------------------------------+
| \_ve | string, indicates the version of magic-pdf used in this |
| rsion | parsing |
| _name | |
+-------+--------------------------------------------------------------+
+----------------+--------------------------------------------------------------+
| Field Name     | Description                                                  |
+================+==============================================================+
| pdf_info       | list, each element is a dict representing the parsing result |
|                | of each PDF page, see the table below for details            |
+----------------+--------------------------------------------------------------+
| \_parse_type   | ocr \| txt, used to indicate the mode used in this           |
|                | intermediate parsing state                                   |
+----------------+--------------------------------------------------------------+
| \_version_name | string, indicates the version of magic-pdf used in this      |
|                | parsing                                                      |
+----------------+--------------------------------------------------------------+
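As a rough illustration of the three top-level fields, the sketch below parses a tiny hand-written JSON string shaped like the table above. The sample values (including the version string ``0.0.0``) and the helper ``summarize_middle`` are assumptions for illustration only; a real ``some_pdf_middle.json`` is produced by magic-pdf and holds far more data.

```python
import json

# Hand-written sample with the three documented top-level keys.
# The values are placeholders, not real magic-pdf output.
SAMPLE = """
{
  "pdf_info": [{"page_idx": 0}, {"page_idx": 1}],
  "_parse_type": "ocr",
  "_version_name": "0.0.0"
}
"""


def summarize_middle(doc: dict) -> dict:
    """Collect the top-level facts described in the table (hypothetical helper)."""
    return {
        "pages": len(doc["pdf_info"]),      # one dict per PDF page
        "parse_type": doc["_parse_type"],   # "ocr" or "txt"
        "version": doc["_version_name"],    # magic-pdf version string
    }


middle = json.loads(SAMPLE)
print(summarize_middle(middle))
```

For a file on disk, the same function would apply to the result of ``json.load`` on the opened file.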
**pdf_info**
Field structure description
+---------+------------------------------------------------------------+
| Field | Description |
| Name | |
+=========+============================================================+
| preproc | Intermediate result after PDF preprocessing, not yet |
| _blocks | segmented |
+---------+------------------------------------------------------------+
| layout | Layout segmentation results, containing layout direction |
| _bboxes | (vertical, horizontal), and bbox, sorted by reading order |
+---------+------------------------------------------------------------+
| p | Page number, starting from 0 |
| age_idx | |
+---------+------------------------------------------------------------+
| pa | Page width and height |
| ge_size | |
+---------+------------------------------------------------------------+
| \_layo | Layout tree structure |
| ut_tree | |
+---------+------------------------------------------------------------+
| images | list, each element is a dict representing an img_block |
+---------+------------------------------------------------------------+
| tables | list, each element is a dict representing a table_block |
+---------+------------------------------------------------------------+
| inter | list, each element is a dict representing an |
| line_eq | interline_equation_block |
| uations | |
+---------+------------------------------------------------------------+
| di | List, block information returned by the model that needs |
| scarded | to be dropped |
| _blocks | |
+---------+------------------------------------------------------------+
| para | Result after segmenting preproc_blocks |
| _blocks | |
+---------+------------------------------------------------------------+
+-------------------------+------------------------------------------------------------+
| Field Name              | Description                                                |
+=========================+============================================================+
| preproc_blocks          | Intermediate result after PDF preprocessing, not yet       |
|                         | segmented                                                  |
+-------------------------+------------------------------------------------------------+
| layout_bboxes           | Layout segmentation results, containing layout direction   |
|                         | (vertical, horizontal), and bbox, sorted by reading order  |
+-------------------------+------------------------------------------------------------+
| page_idx                | Page number, starting from 0                               |
+-------------------------+------------------------------------------------------------+
| page_size               | Page width and height                                      |
+-------------------------+------------------------------------------------------------+
| \_layout_tree           | Layout tree structure                                      |
+-------------------------+------------------------------------------------------------+
| images                  | list, each element is a dict representing an img_block     |
+-------------------------+------------------------------------------------------------+
| tables                  | list, each element is a dict representing a table_block    |
+-------------------------+------------------------------------------------------------+
| interline_equations     | list, each element is a dict representing an               |
|                         | interline_equation_block                                   |
+-------------------------+------------------------------------------------------------+
| discarded_blocks        | list, block information returned by the model that needs   |
|                         | to be dropped                                              |
+-------------------------+------------------------------------------------------------+
| para_blocks             | Result after segmenting preproc_blocks                     |
+-------------------------+------------------------------------------------------------+
In the above table, ``para_blocks`` is an array of dicts, each dict
representing a block structure. A block can support up to one level of
......@@ -205,38 +205,36 @@ nesting.
The outer block is referred to as a first-level block, and the fields in
the first-level block include:
+---------+-------------------------------------------------------------+
| Field | Description |
| Name | |
+=========+=============================================================+
| type | Block type (table|image) |
+---------+-------------------------------------------------------------+
| bbox | Block bounding box coordinates |
+---------+-------------------------------------------------------------+
| blocks | list, each element is a dict representing a second-level |
| | block |
+---------+-------------------------------------------------------------+
+------------------------+-------------------------------------------------------------+
| Field Name             | Description                                                 |
+========================+=============================================================+
| type                   | Block type (table \| image)                                 |
+------------------------+-------------------------------------------------------------+
| bbox                   | Block bounding box coordinates                              |
+------------------------+-------------------------------------------------------------+
| blocks                 | list, each element is a dict representing a second-level    |
|                        | block                                                       |
+------------------------+-------------------------------------------------------------+
There are only two types of first-level blocks: “table” and “image”. All
other blocks are second-level blocks.
The fields in a second-level block include:
+-----+----------------------------------------------------------------+
| Fi | Description |
| eld | |
| N | |
| ame | |
+=====+================================================================+
| t | Block type |
| ype | |
+-----+----------------------------------------------------------------+
| b | Block bounding box coordinates |
| box | |
+-----+----------------------------------------------------------------+
| li | list, each element is a dict representing a line, used to |
| nes | describe the composition of a line of information |
+-----+----------------------------------------------------------------+
+----------------------+----------------------------------------------------------------+
| Field Name           | Description                                                    |
+======================+================================================================+
| type                 | Block type                                                     |
+----------------------+----------------------------------------------------------------+
| bbox                 | Block bounding box coordinates                                 |
+----------------------+----------------------------------------------------------------+
| lines                | list, each element is a dict representing a line, used to      |
|                      | describe the composition of a line of information              |
+----------------------+----------------------------------------------------------------+
Detailed explanation of second-level block types
......@@ -257,33 +255,31 @@ interline_equation Block formula
The field format of a line is as follows:
+-----+----------------------------------------------------------------+
| Fi | Description |
| eld | |
| N | |
| ame | |
+=====+================================================================+
| b | Bounding box coordinates of the line |
| box | |
+-----+----------------------------------------------------------------+
| sp | list, each element is a dict representing a span, used to |
| ans | describe the composition of the smallest unit |
+-----+----------------------------------------------------------------+
+---------------------+----------------------------------------------------------------+
| Field Name          | Description                                                    |
+=====================+================================================================+
| bbox                | Bounding box coordinates of the line                           |
+---------------------+----------------------------------------------------------------+
| spans               | list, each element is a dict representing a span, used to      |
|                     | describe the composition of the smallest unit                  |
+---------------------+----------------------------------------------------------------+
**span**
+----------+-----------------------------------------------------------+
| Field | Description |
| Name | |
+==========+===========================================================+
| bbox | Bounding box coordinates of the span |
+----------+-----------------------------------------------------------+
| type | Type of the span |
+----------+-----------------------------------------------------------+
| content | Text spans use content, chart spans use img_path to store |
| \| | the actual text or screenshot path information |
| img_path | |
+----------+-----------------------------------------------------------+
+---------------------+-----------------------------------------------------------+
| Field Name          | Description                                               |
+=====================+===========================================================+
| bbox                | Bounding box coordinates of the span                      |
+---------------------+-----------------------------------------------------------+
| type                | Type of the span                                          |
+---------------------+-----------------------------------------------------------+
| content \| img_path | Text spans use content, chart spans use img_path to store |
|                     | the actual text or screenshot path information            |
+---------------------+-----------------------------------------------------------+
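The nesting described above (first-level ``table`` or ``image`` blocks wrapping second-level blocks, which hold lines, which hold spans) can be walked with a few small generators. This is a sketch under stated assumptions: the helper names are hypothetical, and the sample ``type`` values ``text``, ``image_body`` and the span type ``image`` are illustrative placeholders, since the full type lists are elided in this diff.

```python
from typing import Dict, Iterator, List

# Per the text above, only these two types are first-level blocks.
FIRST_LEVEL_TYPES = {"table", "image"}


def iter_second_level_blocks(para_blocks: List[Dict]) -> Iterator[Dict]:
    """Yield second-level blocks, unwrapping first-level containers."""
    for block in para_blocks:
        if block.get("type") in FIRST_LEVEL_TYPES:
            # A first-level block keeps its children under "blocks".
            yield from block.get("blocks", [])
        else:
            yield block


def iter_spans(para_blocks: List[Dict]) -> Iterator[Dict]:
    """Yield every span in reading order: block -> lines -> spans."""
    for block in iter_second_level_blocks(para_blocks):
        for line in block.get("lines", []):
            yield from line.get("spans", [])


def span_payload(span: Dict) -> str:
    """Return the span's text if present, otherwise its image path."""
    return span.get("content", span.get("img_path", ""))


# Hand-written sample data; type names here are illustrative only.
box = [0, 0, 10, 10]
para_blocks = [
    {"type": "text", "bbox": box, "lines": [
        {"bbox": box, "spans": [{"bbox": box, "type": "text", "content": "hello"}]},
    ]},
    {"type": "image", "bbox": box, "blocks": [
        {"type": "image_body", "bbox": box, "lines": [
            {"bbox": box, "spans": [
                {"bbox": box, "type": "image", "img_path": "images/a.jpg"},
            ]},
        ]},
    ]},
]

payloads = [span_payload(span) for span in iter_spans(para_blocks)]
print(payloads)
```

Because a block supports at most one level of nesting, a single unwrapping step is enough and no recursion is needed.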
The types of spans are as follows:
......
......@@ -3,12 +3,16 @@
Convert To Markdown
========================
Local File Example
^^^^^^^^^^^^^^^^^^
.. code:: python
import os
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.config.make_content_config import DropMode, MakeMode
from magic_pdf.pipe.OCRPipe import OCRPipe
......@@ -23,7 +27,7 @@
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
local_md_dir
) # create 00
)
image_dir = str(os.path.basename(local_image_dir))
reader1 = FileBasedDataReader("")
......@@ -49,5 +53,51 @@
md_writer.write_string(f"{pdf_file_name}.md", md_content)
Check :doc:`../data/data_reader_writer` for more **reader and writer** examples
Object Storage Example
^^^^^^^^^^^^^^^^^^^^^^
.. code:: python

    import os
    from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
    from magic_pdf.config.make_content_config import DropMode, MakeMode
    from magic_pdf.pipe.OCRPipe import OCRPipe

    bucket_name = "{Your S3 Bucket Name}"  # replace with real bucket name
    ak = "{Your S3 access key}"  # replace with real s3 access key
    sk = "{Your S3 secret key}"  # replace with real s3 secret key
    endpoint_url = "{Your S3 endpoint_url}"  # replace with real s3 endpoint_url

    reader = S3DataReader('unittest/tmp/', bucket_name, ak, sk, endpoint_url)  # replace `unittest/tmp` with the real s3 prefix
    writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
    image_writer = S3DataWriter('unittest/tmp/images', bucket_name, ak, sk, endpoint_url)

    ## args
    model_list = []
    # doubled braces keep the literal placeholder text {fake pdf path} valid inside the f-string
    pdf_file_name = f"s3://{bucket_name}/{{fake pdf path}}"  # replace with the real s3 path
    pdf_bytes = reader.read(pdf_file_name)  # read the pdf content

    pipe = OCRPipe(pdf_bytes, model_list, image_writer)
    pipe.pipe_classify()
    pipe.pipe_analyze()
    pipe.pipe_parse()
    pdf_info = pipe.pdf_mid_data["pdf_info"]

    md_content = pipe.pipe_mk_markdown(
        "unittest/tmp/images", drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
    )

    # pipe_mk_markdown may return a list of strings or a single string
    if isinstance(md_content, list):
        writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
    else:
        writer.write_string(f"{pdf_file_name}.md", md_content)
Check :doc:`../data/data_reader_writer` for more **reader and writer** examples
......@@ -143,11 +143,11 @@ some_pdf_middle.json
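| pdf_info       | list, each element is a dict representing the parsing result |
|                | of each PDF page, see the table below for details            |
+----------------+--------------------------------------------------------------+
| \_p            | ocr \| txt, used to indicate the mode used in this           |
| arse_type      | intermediate parsing state                                   |
| \_parse_type   | ocr \| txt, used to indicate the mode used in this           |
|                | intermediate parsing state                                   |
+----------------+--------------------------------------------------------------+
| \_ver          | string, indicates the version of magic-pdf used in this      |
| sion_name      | parsing                                                      |
| \_version_name | string, indicates the version of magic-pdf used in this      |
|                | parsing                                                      |
+----------------+--------------------------------------------------------------+
**pdf_info** field structure description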
......@@ -155,11 +155,11 @@ some_pdf_middle.json
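+-------------------------+------------------------------------------------------------+
| Field Name              | Description                                                |
+=========================+============================================================+
| pr                      | Intermediate result after PDF preprocessing, not yet       |
| eproc_blocks            | segmented                                                  |
| preproc_blocks          | Intermediate result after PDF preprocessing, not yet       |
|                         | segmented                                                  |
+-------------------------+------------------------------------------------------------+
| l                       | Layout segmentation results, containing layout direction   |
| ayout_bboxes            | (vertical, horizontal), and bbox, sorted by reading order  |
| layout_bboxes           | Layout segmentation results, containing layout direction   |
|                         | (vertical, horizontal), and bbox, sorted by reading order  |
+-------------------------+------------------------------------------------------------+
| page_idx                | Page number, starting from 0                               |
+-------------------------+------------------------------------------------------------+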
......@@ -172,11 +172,11 @@ some_pdf_middle.json
+-------------------------+------------------------------------------------------------+
| tables                  | list, each element is a dict representing a table_block    |
+-------------------------+------------------------------------------------------------+
| interli                 | list, each element is a dict representing an               |
| ne_equations            | interline_equation_block                                   |
| interline_equations     | list, each element is a dict representing an               |
|                         | interline_equation_block                                   |
+-------------------------+------------------------------------------------------------+
| disc                    | list, block information returned by the model that needs   |
| arded_blocks            | to be dropped                                              |
| discarded_blocks        | list, block information returned by the model that needs   |
|                         | to be dropped                                              |
+-------------------------+------------------------------------------------------------+
| para_blocks             | Result after segmenting preproc_blocks                     |
+-------------------------+------------------------------------------------------------+
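......@@ -205,14 +205,14 @@ blocks list, each element is a dict representing a second-level block
| Name                 |                                                                |
+======================+================================================================+
| t                    | Block type                                                     |
| ype                  |                                                                |
| type                 | Block type                                                     |
+----------------------+----------------------------------------------------------------+
| b                    | Block bounding box coordinates                                 |
| box                  |                                                                |
| bbox                 | Block bounding box coordinates                                 |
+----------------------+----------------------------------------------------------------+
| li                   | list, each element is a dict representing a line, used to      |
| nes                  | describe the composition of a line of information              |
| lines                | list, each element is a dict representing a line, used to      |
|                      | describe the composition of a line of information              |
+----------------------+----------------------------------------------------------------+
Detailed explanation of second-level block types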
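......@@ -242,12 +242,11 @@ The field format of a line is as follows
| Name                |                                                                |
+=====================+================================================================+
| bb                  | Bounding box coordinates of the line                           |
| ox                  |                                                                |
| bbox                | Bounding box coordinates of the line                           |
+---------------------+----------------------------------------------------------------+
| s                   | list, each element is a dict representing a span, used to      |
| pa                  | describe the composition of the smallest unit                  |
| ns                  |                                                                |
| spans               | list, each element is a dict representing a span, used to      |
|                     | describe the composition of the smallest unit                  |
+---------------------+----------------------------------------------------------------+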
**span**
......