Merge pull request #1117 from icecraft/feat/add_s3_read_write_example

Feat/add s3 read write example

Commit 132c2089, authored Nov 27, 2024 by Xiaomeng Zhao, committed by GitHub Nov 27, 2024
4 changed files
--- a/next_docs/en/user_guide/quick_start/to_markdown.rst
+++ b/next_docs/en/user_guide/quick_start/to_markdown.rst
@@ -3,12 +3,16 @@
 Convert To Markdown
 ========================

+
+Local File Example
+^^^^^^^^^^^^^^^^^^
+
 .. code:: python

    import os

    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
+    from magic_pdf.config.make_content_config import DropMode, MakeMode
    from magic_pdf.pipe.OCRPipe import OCRPipe


@@ -23,7 +27,7 @@ Convert To Markdown

    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
        local_md_dir
-    ) # create 00
+    )
    image_dir = str(os.path.basename(local_image_dir))

    reader1 = FileBasedDataReader("")
@@ -49,4 +53,50 @@ Convert To Markdown
        md_writer.write_string(f"{pdf_file_name}.md", md_content)


-Check :doc:`../data/data_reader_writer` for more [reader | writer] examples 
+S3 File Example
+^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    import os
+
+    from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
+    from magic_pdf.config.make_content_config import DropMode, MakeMode
+    from magic_pdf.pipe.OCRPipe import OCRPipe
+
+    bucket_name = "{Your S3 Bucket Name}"  # replace with real bucket name
+    ak = "{Your S3 access key}"  # replace with real s3 access key
+    sk = "{Your S3 secret key}"  # replace with real s3 secret key
+    endpoint_url = "{Your S3 endpoint_url}"  # replace with real s3 endpoint_url
+
+
+    reader = S3DataReader('unittest/tmp/', bucket_name, ak, sk, endpoint_url)  # replace `unittest/tmp` with the real s3 prefix
+    writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
+    image_writer = S3DataWriter('unittest/tmp/images', bucket_name, ak, sk, endpoint_url)
+
+    ## args
+    model_list = []
+    pdf_file_name = f"s3://{bucket_name}/{{fake pdf path}}"  # replace `{fake pdf path}` with the real s3 path
+
+    pdf_bytes = reader.read(pdf_file_name)  # read the pdf content
+
+
+    pipe = OCRPipe(pdf_bytes, model_list, image_writer)
+
+    pipe.pipe_classify()
+    pipe.pipe_analyze()
+    pipe.pipe_parse()
+
+    pdf_info = pipe.pdf_mid_data["pdf_info"]
+
+    md_content = pipe.pipe_mk_markdown(
+        "unittest/tmp/images", drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
+    )
+
+    if isinstance(md_content, list):
+        writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
+    else:
+        writer.write_string(f"{pdf_file_name}.md", md_content)
+
+
+See :doc:`../data/data_reader_writer` for more reader and writer examples
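
Review note on the S3 example: a minimal sketch of how the placeholder values could be supplied without editing the script. It reads the four connection values from environment variables and builds the `s3://` path with a helper. Inside an f-string, text in single braces is evaluated as an expression, so a human-readable placeholder is safer kept outside it. The environment variable names and the helpers `S3Settings`, `load_s3_settings`, `s3_uri`, and `markdown_name` are assumptions made for illustration; they are not part of `magic_pdf`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class S3Settings:
    """Values the S3 example passes to S3DataReader / S3DataWriter."""

    bucket_name: str
    access_key: str
    secret_key: str
    endpoint_url: str


# Environment variable names are assumptions made for this sketch.
_ENV_NAMES = {
    "bucket_name": "S3_BUCKET_NAME",
    "access_key": "S3_ACCESS_KEY",
    "secret_key": "S3_SECRET_KEY",
    "endpoint_url": "S3_ENDPOINT_URL",
}


def load_s3_settings(env: Mapping[str, str] = os.environ) -> S3Settings:
    """Build S3Settings from environment variables, failing early if one is unset."""
    missing = [name for name in _ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return S3Settings(**{field: env[name] for field, name in _ENV_NAMES.items()})


def s3_uri(bucket_name: str, key: str) -> str:
    """Join a bucket name and an object key into an s3:// URI."""
    return f"s3://{bucket_name}/{key.lstrip('/')}"


def markdown_name(pdf_file_name: str) -> str:
    """Mirror the example's output naming: the PDF path with ".md" appended."""
    return f"{pdf_file_name}.md"


if __name__ == "__main__":
    # Demo values only; a real run would rely on the process environment.
    demo_env = {
        "S3_BUCKET_NAME": "my-bucket",
        "S3_ACCESS_KEY": "ak",
        "S3_SECRET_KEY": "sk",
        "S3_ENDPOINT_URL": "https://s3.example.com",
    }
    settings = load_s3_settings(demo_env)
    pdf_file_name = s3_uri(settings.bucket_name, "unittest/tmp/demo.pdf")
    print(pdf_file_name)
    print(markdown_name(pdf_file_name))
```

The resulting `settings` fields would take the place of `bucket_name`, `ak`, `sk`, and `endpoint_url` in the documented example, which keeps credentials out of the source file.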
--- a/next_docs/en/user_guide/tutorial/output_file_description.rst
+++ b/next_docs/en/user_guide/tutorial/output_file_description.rst
@@ -141,60 +141,60 @@ example
 some_pdf_middle.json
 ~~~~~~~~~~~~~~~~~~~~

-+-------+--------------------------------------------------------------+
-| Field | Description                                                  |
-| Name  |                                                              |
-+=======+==============================================================+
-| pdf   | list, each element is a dict representing the parsing result |
-| _info | of each PDF page, see the table below for details            |
-+-------+--------------------------------------------------------------+
-| \_    | ocr \| txt, used to indicate the mode used in this           |
-| parse | intermediate parsing state                                   |
-| _type |                                                              |
-+-------+--------------------------------------------------------------+
-| \_ve  | string, indicates the version of magic-pdf used in this      |
-| rsion | parsing                                                      |
-| _name |                                                              |
-+-------+--------------------------------------------------------------+
+----------------+--------------------------------------------------------------+
+| Field Name     | Description                                                  |
+|                |                                                              |
+================+==============================================================+
+| pdf_info       | list, each element is a dict representing the parsing result |
+|                | of each PDF page, see the table below for details            |
+----------------+--------------------------------------------------------------+
+| \_parse_type   | ocr \| txt, used to indicate the mode used in this           |
+|                | intermediate parsing state                                   |
+|                |                                                              |
+----------------+--------------------------------------------------------------+
+| \_version_name | string, indicates the version of magic-pdf used in this      |
+|                | parsing                                                      |
+|                |                                                              |
+----------------+--------------------------------------------------------------+

 **pdf_info**

 Field structure description

-+---------+------------------------------------------------------------+
-| Field   | Description                                                |
-| Name    |                                                            |
-+=========+============================================================+
-| preproc | Intermediate result after PDF preprocessing, not yet       |
-| _blocks | segmented                                                  |
-+---------+------------------------------------------------------------+
-| layout  | Layout segmentation results, containing layout direction   |
-| _bboxes | (vertical, horizontal), and bbox, sorted by reading order  |
-+---------+------------------------------------------------------------+
-| p       | Page number, starting from 0                               |
-| age_idx |                                                            |
-+---------+------------------------------------------------------------+
-| pa      | Page width and height                                      |
-| ge_size |                                                            |
-+---------+------------------------------------------------------------+
-| \_layo  | Layout tree structure                                      |
-| ut_tree |                                                            |
-+---------+------------------------------------------------------------+
-| images  | list, each element is a dict representing an img_block     |
-+---------+------------------------------------------------------------+
-| tables  | list, each element is a dict representing a table_block    |
-+---------+------------------------------------------------------------+
-| inter   | list, each element is a dict representing an               |
-| line_eq | interline_equation_block                                   |
-| uations |                                                            |
-+---------+------------------------------------------------------------+
-| di      | List, block information returned by the model that needs   |
-| scarded | to be dropped                                              |
-| _blocks |                                                            |
-+---------+------------------------------------------------------------+
-| para    | Result after segmenting preproc_blocks                     |
-| _blocks |                                                            |
-+---------+------------------------------------------------------------+
+-------------------------+------------------------------------------------------------+
+| Field                   | Description                                                |
+| Name                    |                                                            |
+=========================+============================================================+
+| preproc_blocks          | Intermediate result after PDF preprocessing, not yet       |
+|                         | segmented                                                  |
+-------------------------+------------------------------------------------------------+
+| layout_bboxes           | Layout segmentation results, containing layout direction   |
+|                         | (vertical, horizontal), and bbox, sorted by reading order  |
+-------------------------+------------------------------------------------------------+
+| page_idx                | Page number, starting from 0                               |
+|                         |                                                            |
+-------------------------+------------------------------------------------------------+
+| page_size               | Page width and height                                      |
+|                         |                                                            |
+-------------------------+------------------------------------------------------------+
+| \_layout_tree           | Layout tree structure                                      |
+|                         |                                                            |
+-------------------------+------------------------------------------------------------+
+| images                  | list, each element is a dict representing an img_block     |
+-------------------------+------------------------------------------------------------+
+| tables                  | list, each element is a dict representing a table_block    |
+-------------------------+------------------------------------------------------------+
+| interline_equations     | list, each element is a dict representing an               |
+|                         | interline_equation_block                                   |
+|                         |                                                            |
+-------------------------+------------------------------------------------------------+
+| discarded_blocks        | List, block information returned by the model that needs   |
+|                         | to be dropped                                              |
+|                         |                                                            |
+-------------------------+------------------------------------------------------------+
+| para_blocks             | Result after segmenting preproc_blocks                     |
+|                         |                                                            |
+-------------------------+------------------------------------------------------------+

 In the above table, ``para_blocks`` is an array of dicts, each dict
 representing a block structure. A block can support up to one level of
@@ -205,38 +205,36 @@ nesting.
 The outer block is referred to as a first-level block, and the fields in
 the first-level block include:

-+---------+-------------------------------------------------------------+
-| Field   | Description                                                 |
-| Name    |                                                             |
-+=========+=============================================================+
-| type    | Block type (table|image)                                    |
-+---------+-------------------------------------------------------------+
-| bbox    | Block bounding box coordinates                              |
-+---------+-------------------------------------------------------------+
-| blocks  | list, each element is a dict representing a second-level    |
-|         | block                                                       |
-+---------+-------------------------------------------------------------+
+------------------------+-------------------------------------------------------------+
+| Field                  | Description                                                 |
+| Name                   |                                                             |
+========================+=============================================================+
+| type                   | Block type (table|image)                                    |
+------------------------+-------------------------------------------------------------+
+| bbox                   | Block bounding box coordinates                              |
+------------------------+-------------------------------------------------------------+
+| blocks                 | list, each element is a dict representing a second-level    |
+|                        | block                                                       |
+------------------------+-------------------------------------------------------------+

 There are only two types of first-level blocks: “table” and “image”. All
 other blocks are second-level blocks.

 The fields in a second-level block include:

-+-----+----------------------------------------------------------------+
-| Fi  | Description                                                    |
-| eld |                                                                |
-| N   |                                                                |
-| ame |                                                                |
-+=====+================================================================+
-| t   | Block type                                                     |
-| ype |                                                                |
-+-----+----------------------------------------------------------------+
-| b   | Block bounding box coordinates                                 |
-| box |                                                                |
-+-----+----------------------------------------------------------------+
-| li  | list, each element is a dict representing a line, used to      |
-| nes | describe the composition of a line of information              |
-+-----+----------------------------------------------------------------+
+----------------------+----------------------------------------------------------------+
+| Field                | Description                                                    |
+| Name                 |                                                                |
+======================+================================================================+
+| type                 | Block type                                                     |
+|                      |                                                                |
+----------------------+----------------------------------------------------------------+
+| bbox                 | Block bounding box coordinates                                 |
+|                      |                                                                |
+----------------------+----------------------------------------------------------------+
+| lines                | list, each element is a dict representing a line, used to      |
+|                      | describe the composition of a line of information              |
+----------------------+----------------------------------------------------------------+

 Detailed explanation of second-level block types

@@ -257,33 +255,31 @@ interline_equation Block formula

 The field format of a line is as follows:

-+-----+----------------------------------------------------------------+
-| Fi  | Description                                                    |
-| eld |                                                                |
-| N   |                                                                |
-| ame |                                                                |
-+=====+================================================================+
-| b   | Bounding box coordinates of the line                           |
-| box |                                                                |
-+-----+----------------------------------------------------------------+
-| sp  | list, each element is a dict representing a span, used to      |
-| ans | describe the composition of the smallest unit                  |
-+-----+----------------------------------------------------------------+
+---------------------+----------------------------------------------------------------+
+| Field               | Description                                                    |
+| Name                |                                                                |
+=====================+================================================================+
+| bbox                | Bounding box coordinates of the line                           |
+|                     |                                                                |
+---------------------+----------------------------------------------------------------+
+| spans               | list, each element is a dict representing a span, used to      |
+|                     | describe the composition of the smallest unit                  |
+---------------------+----------------------------------------------------------------+

 **span**

-+----------+-----------------------------------------------------------+
-| Field    | Description                                               |
-| Name     |                                                           |
-+==========+===========================================================+
-| bbox     | Bounding box coordinates of the span                      |
-+----------+-----------------------------------------------------------+
-| type     | Type of the span                                          |
-+----------+-----------------------------------------------------------+
-| content  | Text spans use content, chart spans use img_path to store |
-| \|       | the actual text or screenshot path information            |
-| img_path |                                                           |
-+----------+-----------------------------------------------------------+
+---------------------+-----------------------------------------------------------+
+| Field               | Description                                               |
+| Name                |                                                           |
+=====================+===========================================================+
+| bbox                | Bounding box coordinates of the span                      |
+---------------------+-----------------------------------------------------------+
+| type                | Type of the span                                          |
+---------------------+-----------------------------------------------------------+
+| content             | Text spans use content, chart spans use img_path to store |
+| \|                  | the actual text or screenshot path information            |
+| img_path            |                                                           |
+---------------------+-----------------------------------------------------------+

 The types of spans are as follows:


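Review note on the tables in this file: the nesting they describe (`pdf_info` → `para_blocks` → optional second-level `blocks` → `lines` → `spans`) can be sketched as a small traversal. The field names come from the tables; the sample values, the block and span type strings, and the helper names `iter_spans` and `collect_field` are assumptions for illustration, not output produced by magic-pdf.

```python
from typing import Any, Dict, Iterator, List

Block = Dict[str, Any]


def iter_spans(pdf_info: List[Block]) -> Iterator[Block]:
    """Yield every span found under para_blocks, page by page."""
    for page in pdf_info:
        for block in page.get("para_blocks", []):
            # First-level blocks (table / image) nest second-level blocks
            # under "blocks"; any other block carries "lines" directly.
            inner_blocks = block.get("blocks") or [block]
            for inner in inner_blocks:
                for line in inner.get("lines", []):
                    yield from line.get("spans", [])


def collect_field(pdf_info: List[Block], field: str) -> List[Any]:
    """Collect one span field, e.g. "content" or "img_path", in traversal order."""
    return [span[field] for span in iter_spans(pdf_info) if field in span]


# Hand-written sample shaped like the documented structure (values are made up).
SAMPLE_PDF_INFO: List[Block] = [
    {
        "page_idx": 0,
        "page_size": [612, 792],
        "para_blocks": [
            {
                "type": "text",  # illustrative second-level block type
                "bbox": [50, 60, 500, 80],
                "lines": [
                    {
                        "bbox": [50, 60, 500, 80],
                        "spans": [
                            {"bbox": [50, 60, 120, 80], "type": "text", "content": "Hello"}
                        ],
                    }
                ],
            },
            {
                "type": "image",  # first-level block
                "bbox": [50, 100, 500, 400],
                "blocks": [
                    {
                        "type": "image_body",  # illustrative second-level block type
                        "bbox": [50, 100, 500, 380],
                        "lines": [
                            {
                                "bbox": [50, 100, 500, 380],
                                "spans": [
                                    {
                                        "bbox": [50, 100, 500, 380],
                                        "type": "image",
                                        "img_path": "images/fig1.jpg",
                                    }
                                ],
                            }
                        ],
                    }
                ],
            },
        ],
    }
]


if __name__ == "__main__":
    print(collect_field(SAMPLE_PDF_INFO, "content"))
    print(collect_field(SAMPLE_PDF_INFO, "img_path"))
```

With a real `some_pdf_middle.json`, the list loaded from its `pdf_info` key would be passed in place of `SAMPLE_PDF_INFO`.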
--- a/next_docs/zh_cn/user_guide/quick_start/to_markdown.rst
+++ b/next_docs/zh_cn/user_guide/quick_start/to_markdown.rst
@@ -3,12 +3,16 @@
 Convert to Markdown File
 ========================

+
+Local File Example
+^^^^^^^^^^^^^^^^^^
+
 .. code:: python

    import os

    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
+    from magic_pdf.config.make_content_config import DropMode, MakeMode
    from magic_pdf.pipe.OCRPipe import OCRPipe


@@ -23,7 +27,7 @@

    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
        local_md_dir
-    ) # create 00
+    )
    image_dir = str(os.path.basename(local_image_dir))

    reader1 = FileBasedDataReader("")
@@ -49,5 +53,51 @@
        md_writer.write_string(f"{pdf_file_name}.md", md_content)


-See :doc:`../data/data_reader_writer` for more **reader and writer** examples
+Object Storage Usage Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    import os
+
+    from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
+    from magic_pdf.config.make_content_config import DropMode, MakeMode
+    from magic_pdf.pipe.OCRPipe import OCRPipe
+
+    bucket_name = "{Your S3 Bucket Name}"  # replace with real bucket name
+    ak = "{Your S3 access key}"  # replace with real s3 access key
+    sk = "{Your S3 secret key}"  # replace with real s3 secret key
+    endpoint_url = "{Your S3 endpoint_url}"  # replace with real s3 endpoint_url
+
+
+    reader = S3DataReader('unittest/tmp/', bucket_name, ak, sk, endpoint_url)  # replace `unittest/tmp` with the real s3 prefix
+    writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
+    image_writer = S3DataWriter('unittest/tmp/images', bucket_name, ak, sk, endpoint_url)
+
+    ## args
+    model_list = []
+    pdf_file_name = f"s3://{bucket_name}/{{fake pdf path}}"  # replace `{fake pdf path}` with the real s3 path
+
+    pdf_bytes = reader.read(pdf_file_name)  # read the pdf content
+

+    pipe = OCRPipe(pdf_bytes, model_list, image_writer)
+
+    pipe.pipe_classify()
+    pipe.pipe_analyze()
+    pipe.pipe_parse()
+
+    pdf_info = pipe.pdf_mid_data["pdf_info"]
+
+    md_content = pipe.pipe_mk_markdown(
+        "unittest/tmp/images", drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
+    )
+
+    if isinstance(md_content, list):
+        writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
+    else:
+        writer.write_string(f"{pdf_file_name}.md", md_content)
+
+
+
+See :doc:`../data/data_reader_writer` for more **reader and writer** examples
--- a/next_docs/zh_cn/user_guide/tutorial/output_file_description.rst
+++ b/next_docs/zh_cn/user_guide/tutorial/output_file_description.rst
@@ -143,11 +143,11 @@ some_pdf_middle.json
 | pdf_info  | list, each element is a dict representing the parsing    |
 |           | result of each PDF page, see the table below for details |
 +-----------+----------------------------------------------------------+
-| \_p       | ocr \| txt, used to indicate the mode used in this       |
-| arse_type | intermediate parsing state                               |
+|              | ocr \| txt, used to indicate the mode used in this       |
+| \_parse_type | intermediate parsing state                               |
 +-----------+----------------------------------------------------------+
-| \_ver     | string, indicates the version of magic-pdf used in this  |
-| sion_name | parsing                                                  |
+|                | string, indicates the version of magic-pdf used in this  |
+| \_version_name | parsing                                                  |
 +-----------+----------------------------------------------------------+

 **pdf_info** field structure description
@@ -155,11 +155,11 @@ some_pdf_middle.json
 +--------------+-------------------------------------------------------+
 | Field Name   | Description                                           |
 +==============+=======================================================+
-| pr           | Intermediate result after PDF preprocessing, not yet  |
-| eproc_blocks | segmented                                             |
+|                 | Intermediate result after PDF preprocessing, not yet  |
+| preproc_blocks  | segmented                                             |
 +--------------+-------------------------------------------------------+
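-| l            | Layout segmentation results with layout direction    |
-| ayout_bboxes | (vertical, horizontal) and bbox, in reading order     |
+|               | Layout segmentation results with layout direction    |
+| layout_bboxes | (vertical, horizontal) and bbox, in reading order     |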
 +--------------+-------------------------------------------------------+
 | page_idx     | Page number, starting from 0                          |
 +--------------+-------------------------------------------------------+
@@ -172,11 +172,11 @@ some_pdf_middle.json
 +--------------+-------------------------------------------------------+
 | tables       | list of dicts, each representing a table_block        |
 +--------------+-------------------------------------------------------+
-| interli      | list, each element is a dict representing an          |
-| ne_equations | interline_equation_block                              |
+|                     | list, each element is a dict representing an          |
+| interline_equations | interline_equation_block                              |
 +--------------+-------------------------------------------------------+
-| disc         | List, block information returned by the model that    |
-| arded_blocks | needs to be dropped                                   |
+|                  | List, block information returned by the model that    |
+| discarded_blocks | needs to be dropped                                   |
 +--------------+-------------------------------------------------------+
 | para_blocks  | Result after segmenting preproc_blocks                |
 +--------------+-------------------------------------------------------+
@@ -205,14 +205,14 @@ blocks list, each element is a second-level block in dict format
 | 段  |                                                                |
 | 名  |                                                                |
 +=====+================================================================+
-| t   | Block type                                                     |
-| ype |                                                                |
+|      | Block type                                                     |
+| type |                                                                |
 +-----+----------------------------------------------------------------+
-| b   | Block bounding box coordinates                                 |
-| box |                                                                |
+|      | Block bounding box coordinates                                 |
+| bbox |                                                                |
 +-----+----------------------------------------------------------------+
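-| li  | list, each element is a dict representing a line, used to      |
-| nes | describe the composition of a line of information              |
+|       | list, each element is a dict representing a line, used to      |
+| lines | describe the composition of a line of information              |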
 +-----+----------------------------------------------------------------+

 Detailed explanation of second-level block types
@@ -242,12 +242,11 @@ The field format of a line is as follows
 | 段 |                                                                 |
 | 名 |                                                                 |
 +====+=================================================================+
-| bb | Bounding box coordinates of the line                            |
-| ox |                                                                 |
+| bbox  | Bounding box coordinates of the line                            |
+|       |                                                                 |
 +----+-----------------------------------------------------------------+
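-| s  | list, each element is a dict representing a span, used to       |
-| pa | describe the composition of the smallest unit                   |
-| ns |                                                                 |
+| spans  | list, each element is a dict representing a span, used to       |
+|        | describe the composition of the smallest unit                   |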
 +----+-----------------------------------------------------------------+

 **span**