Merge pull request #833 from icecraft/feat/tune_docs

Feat/tune docs

Unverified Commit 8b119e22 authored Nov 01, 2024 by Xiaomeng Zhao Committed by GitHub Nov 01, 2024
14 changed files
--- a/next_docs/en/user_guide/install/download_model_weight_files.rst
+++ b/next_docs/en/user_guide/install/download_model_weight_files.rst
+
+Download Model Weight Files
+==============================
+
+This page covers two cases: downloading the model files for the first
+time, and updating a model directory that was downloaded previously.
+Follow the section that matches your situation.
+
+Initial download of model files
+-------------------------------
+
+1. Download the Model from Hugging Face
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use a Python script to download the model files from Hugging Face:
+
+.. code:: bash
+
+   pip install huggingface_hub
+   wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
+   python download_models_hf.py
+
+The Python script automatically downloads the model files and writes
+the model directory into the configuration file.
+
+The configuration file is named ``magic-pdf.json`` and is located in the
+user's home directory.
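A minimal sketch of how the result could be checked after the script has run. The helper name `load_magic_pdf_config` and its `home` parameter are hypothetical and not part of magic-pdf; the filename `magic-pdf.json` and the `models-dir` key are the ones written by the download script.

```python
import json
import os


def load_magic_pdf_config(home=None):
    """Return the parsed magic-pdf.json found in the given home directory.

    `home` defaults to the current user's home directory. This helper is
    illustrative only; it is not provided by magic-pdf.
    """
    home = home or os.path.expanduser("~")
    config_path = os.path.join(home, "magic-pdf.json")
    with open(config_path, encoding="utf-8") as f:
        return json.load(f)
```

Calling `load_magic_pdf_config()["models-dir"]` should then return the directory that the script printed as `model_dir`.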
+
+How to update previously downloaded models
+------------------------------------------
+
+1. Models downloaded via Git LFS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+   Some users reported that downloading model files with git lfs was
+   incomplete or produced corrupted model files, so this method is no
+   longer recommended.
+
+If you previously downloaded the model files via git lfs, navigate to
+the download directory and run ``git pull`` to update the models.
+
+2. Models downloaded via Hugging Face or ModelScope
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you previously downloaded the models via Hugging Face or ModelScope,
+rerun the Python script used for the initial download. This
+automatically updates the model directory to the latest version.
--- a/next_docs/en/user_guide/install/install.rst
+++ b/next_docs/en/user_guide/install/install.rst
+
+Install 
+===============================================================
+If you run into installation issues, consult the FAQ first.
+If the parsing results are not what you expect, see the Known Issues.
+There are three different ways to experience MinerU.
+
+Pre-installation Notice—Hardware and Software Environment Support
+------------------------------------------------------------------
+
+To ensure the stability and reliability of the project, we only optimize
+and test for specific hardware and software environments during
+development. This ensures that users deploying and running the project
+on recommended system configurations will get the best performance with
+the fewest compatibility issues.
+
+By focusing resources on the mainline environment, our team can more
+efficiently resolve potential bugs and develop new features.
+
+In non-mainline environments, due to the diversity of hardware and
+software configurations, as well as third-party dependency compatibility
+issues, we cannot guarantee 100% project availability. Therefore, for
+users who wish to use this project in non-recommended environments, we
+suggest carefully reading the documentation and FAQ first. Most issues
+already have corresponding solutions in the FAQ. We also encourage
+community feedback to help us gradually expand support.
+
+.. raw:: html
+
+   <style>
+      table, th, td {
+      border: 1px solid black;
+      border-collapse: collapse;
+      }
+   </style>
+   <table>
+    <tr>
+        <td colspan="3" rowspan="2">Operating System</td>
+    </tr>
+    <tr>
+        <td>Ubuntu 22.04 LTS</td>
+        <td>Windows 10 / 11</td>
+        <td>macOS 11+</td>
+    </tr>
+    <tr>
+        <td colspan="3">CPU</td>
+        <td>x86_64</td>
+        <td>x86_64</td>
+        <td>x86_64 / arm64</td>
+    </tr>
+    <tr>
+        <td colspan="3">Memory</td>
+        <td colspan="3">16GB or more, recommended 32GB+</td>
+    </tr>
+    <tr>
+        <td colspan="3">Python Version</td>
+        <td colspan="3">3.10</td>
+    </tr>
+    <tr>
+        <td colspan="3">Nvidia Driver Version</td>
+        <td>latest (Proprietary Driver)</td>
+        <td>latest</td>
+        <td>None</td>
+    </tr>
+    <tr>
+        <td colspan="3">CUDA Environment</td>
+        <td>Automatic installation [12.1 (pytorch) + 11.8 (paddle)]</td>
+        <td>11.8 (manual installation) + cuDNN v8.7.0 (manual installation)</td>
+        <td>None</td>
+    </tr>
+    <tr>
+        <td rowspan="2">GPU Hardware Support List</td>
+        <td colspan="2">Minimum Requirement 8G+ VRAM</td>
+        <td colspan="2">3060ti/3070/3080/3080ti/4060/4070/4070ti<br>
+        8G VRAM enables layout, formula recognition acceleration and OCR acceleration</td>
+        <td rowspan="2">None</td>
+    </tr>
+    <tr>
+        <td colspan="2">Recommended Configuration 16G+ VRAM</td>
+        <td colspan="2">3090/3090ti/4070ti super/4080/4090<br>
+        16G VRAM or more can enable layout, formula recognition, OCR acceleration and table recognition acceleration simultaneously
+        </td>
+    </tr>
+   </table>
+
+
+Create an environment
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: shell
+
+    conda create -n MinerU python=3.10
+    conda activate MinerU
+    pip install -U "magic-pdf[full]" --extra-index-url https://wheels.myhloli.com
+
+
+Download model weight files
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: shell
+
+    pip install huggingface_hub
+    wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
+    python download_models_hf.py    
+
+
+MinerU is now installed. Check out :doc:`../quick_start` to get started, or read :doc:`boost_with_cuda` to accelerate inference.
\ No newline at end of file
--- a/next_docs/en/user_guide/quick_start.rst
+++ b/next_docs/en/user_guide/quick_start.rst
+
+Quick Start 
+==============
+
+Eager to get started? This page gives a brief introduction to MinerU. Follow the installation guide first to set up a project and install MinerU.
+
+
+.. toctree::
+    :maxdepth: 1
+
+    quick_start/command_line
+    quick_start/to_markdown
+
--- a/next_docs/en/user_guide/quick_start/command_line.rst
+++ b/next_docs/en/user_guide/quick_start/command_line.rst
+
+
+Command Line
+===================
+
+.. code:: bash
+
+   magic-pdf --help
+   Usage: magic-pdf [OPTIONS]
+
+   Options:
+     -v, --version                display the version and exit
+     -p, --path PATH              local pdf filepath or directory  [required]
+     -o, --output-dir PATH        output local directory  [required]
+     -m, --method [ocr|txt|auto]  the method for parsing pdf. ocr: using ocr
+                                  technique to extract information from pdf. txt:
+                                  suitable for the text-based pdf only and
+                                  outperform ocr. auto: automatically choose the
+                                  best method for parsing pdf from ocr and txt.
+                                  without method specified, auto will be used by
+                                  default.
+     -l, --lang TEXT              Input the languages in the pdf (if known) to
+                                  improve OCR accuracy.  Optional. You should
+                                  input "Abbreviation" with language form url: ht
+                                  tps://paddlepaddle.github.io/PaddleOCR/en/ppocr
+                                  /blog/multi_languages.html#5-support-languages-
+                                  and-abbreviations
+     -d, --debug BOOLEAN          Enables detailed debugging information during
+                                  the execution of the CLI commands.
+     -s, --start INTEGER          The starting page for PDF parsing, beginning
+                                  from 0.
+     -e, --end INTEGER            The ending page for PDF parsing, beginning from
+                                  0.
+     --help                       Show this message and exit.
+
+
+   ## show version
+   magic-pdf -v
+
+   ## command line example
+   magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
+
+``{some_pdf}`` can be a single PDF file or a directory containing
+multiple PDFs. The results will be saved in the ``{some_output_dir}``
+directory. The output file list is as follows:
+
+.. code:: text
+
+   ├── some_pdf.md                          # markdown file
+   ├── images                               # directory for storing images
+   ├── some_pdf_layout.pdf                  # layout diagram
+   ├── some_pdf_middle.json                 # MinerU intermediate processing result
+   ├── some_pdf_model.json                  # model inference result
+   ├── some_pdf_origin.pdf                  # original PDF file
+   ├── some_pdf_spans.pdf                   # smallest granularity bbox position information diagram
+   └── some_pdf_content_list.json           # Rich text JSON arranged in reading order
+
+For more information about the output files, see :doc:`../tutorial/output_file_description`.
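A minimal sketch of driving the command line from Python. The helper names `build_magic_pdf_command` and `run_magic_pdf` are hypothetical; only the `-p`, `-o` and `-m` options listed in the help text above are used, and `magic-pdf` is assumed to be on the `PATH`.

```python
import subprocess


def build_magic_pdf_command(pdf_path, output_dir, method="auto"):
    """Build the argument list for the magic-pdf CLI.

    Only options documented in `magic-pdf --help` are used: -p, -o and -m.
    """
    if method not in ("ocr", "txt", "auto"):
        raise ValueError(f"unsupported method: {method!r}")
    return ["magic-pdf", "-p", str(pdf_path), "-o", str(output_dir), "-m", method]


def run_magic_pdf(pdf_path, output_dir, method="auto"):
    """Run magic-pdf; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_magic_pdf_command(pdf_path, output_dir, method), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a PDF path contains spaces.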
+
--- a/next_docs/en/user_guide/quick_start/extract_text.rst
+++ b/next_docs/en/user_guide/quick_start/extract_text.rst
+
+
+Extract Content from PDF
+========================
+
+.. code:: python
+
+    from magic_pdf.data.read_api import read_local_pdfs
+    from magic_pdf.pdf_parse_union_core_v2 import pdf_parse_union
+    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
--- a/next_docs/en/user_guide/quick_start/to_markdown.rst
+++ b/next_docs/en/user_guide/quick_start/to_markdown.rst
+
+
+Convert To Markdown
+========================
+
+.. code:: python
+
+    import os
+
+    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
+    from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
+    from magic_pdf.pipe.OCRPipe import OCRPipe
+
+
+    ## args
+    model_list = []
+    pdf_file_name = "abc.pdf"  # replace with the real pdf path
+
+
+    ## prepare env
+    local_image_dir, local_md_dir = "output/images", "output"
+    os.makedirs(local_image_dir, exist_ok=True)
+
+    image_writer = FileBasedDataWriter(local_image_dir)
+    md_writer = FileBasedDataWriter(local_md_dir)
+    image_dir = str(os.path.basename(local_image_dir))
+
+    reader1 = FileBasedDataReader("")
+    pdf_bytes = reader1.read(pdf_file_name)   # read the pdf content
+
+
+    pipe = OCRPipe(pdf_bytes, model_list, image_writer)
+
+    pipe.pipe_classify()
+    pipe.pipe_analyze()
+    pipe.pipe_parse()
+
+    pdf_info = pipe.pdf_mid_data["pdf_info"]
+
+
+    md_content = pipe.pipe_mk_markdown(
+        image_dir, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
+    )
+
+    if isinstance(md_content, list):
+        md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
+    else:
+        md_writer.write_string(f"{pdf_file_name}.md", md_content)
+
+
+See :doc:`../data/data_reader_writer` for more reader and writer examples.
--- a/next_docs/en/user_guide/tutorial.rst
+++ b/next_docs/en/user_guide/tutorial.rst
+
+Tutorial
+===========
+
+This tutorial shows, from start to finish, how to use MinerU in a minimal project.
+
+.. toctree::
+    :maxdepth: 1
+
+    tutorial/output_file_description
\ No newline at end of file
--- a/next_docs/en/user_guide/tutorial/output_file_description.rst
+++ b/next_docs/en/user_guide/tutorial/output_file_description.rst
+
+Output File Description
+=========================
+
+Besides the markdown output, the ``magic-pdf`` command generates several
+auxiliary files that are not part of the markdown result. Each of them
+is described below.
+
+some_pdf_layout.pdf
+~~~~~~~~~~~~~~~~~~~
+
+Each page layout consists of one or more boxes. The number at the top
+left of each box indicates its sequence number. Additionally, in
+``layout.pdf``, different content blocks are highlighted with different
+background colors.
+
+.. figure:: ../../_static/image/layout_example.png
+   :alt: layout example
+
+   layout example
+
+some_pdf_spans.pdf
+~~~~~~~~~~~~~~~~~~
+
+All spans on the page are drawn with different colored line frames
+according to the span type. This file can be used for quality control,
+allowing for quick identification of issues such as missing text or
+unrecognized inline formulas.
+
+.. figure:: ../../_static/image/spans_example.png
+   :alt: spans example
+
+   spans example
+
+some_pdf_model.json
+~~~~~~~~~~~~~~~~~~~
+
+Structure Definition
+^^^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+   from pydantic import BaseModel, Field
+   from enum import IntEnum
+
+   class CategoryType(IntEnum):
+        title = 0               # Title
+        plain_text = 1          # Text
+        abandon = 2             # Includes headers, footers, page numbers, and page annotations
+        figure = 3              # Image
+        figure_caption = 4      # Image description
+        table = 5               # Table
+        table_caption = 6       # Table description
+        table_footnote = 7      # Table footnote
+        isolate_formula = 8     # Block formula
+        formula_caption = 9     # Formula label
+
+        embedding = 13          # Inline formula
+        isolated = 14           # Block formula
+        text = 15               # OCR recognition result
+
+
+   class PageInfo(BaseModel):
+       page_no: int = Field(description="Page number, the first page is 0", ge=0)
+       height: int = Field(description="Page height", gt=0)
+       width: int = Field(description="Page width", ge=0)
+
+   class ObjectInferenceResult(BaseModel):
+       category_id: CategoryType = Field(description="Category", ge=0)
+       poly: list[float] = Field(description="Quadrilateral coordinates, representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively")
+       score: float = Field(description="Confidence of the inference result")
+       latex: str | None = Field(description="LaTeX parsing result", default=None)
+       html: str | None = Field(description="HTML parsing result", default=None)
+
+   class PageInferenceResults(BaseModel):
+       layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results")
+       page_info: PageInfo = Field(description="Page metadata")
+
+
+   # The inference results of all pages, ordered by page number, are stored in a list as the inference results of MinerU
+   inference_result: list[PageInferenceResults] = []
+
+The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3],
+representing the coordinates of the top-left, top-right, bottom-right,
+and bottom-left points respectively. |Poly Coordinate Diagram|
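As a small illustration of the ``poly`` layout, the eight numbers can be reduced to an axis-aligned bounding box. The function name `poly_to_bbox` is hypothetical and not part of magic-pdf; the sketch only assumes the `[x0, y0, x1, y1, x2, y2, x3, y3]` ordering described above.

```python
def poly_to_bbox(poly):
    """Convert [x0, y0, x1, y1, x2, y2, x3, y3] to (x_min, y_min, x_max, y_max)."""
    if len(poly) != 8:
        raise ValueError("poly must contain exactly 8 numbers")
    xs = poly[0::2]  # x0, x1, x2, x3
    ys = poly[1::2]  # y0, y1, y2, y3
    return (min(xs), min(ys), max(xs), max(ys))
```

Taking the minimum and maximum instead of reading fixed positions keeps the result correct even if a quadrilateral is slightly rotated.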
+
+example
+^^^^^^^
+
+.. code:: json
+
+   [
+       {
+           "layout_dets": [
+               {
+                   "category_id": 2,
+                   "poly": [
+                       99.1906967163086,
+                       100.3119125366211,
+                       730.3707885742188,
+                       100.3119125366211,
+                       730.3707885742188,
+                       245.81326293945312,
+                       99.1906967163086,
+                       245.81326293945312
+                   ],
+                   "score": 0.9999997615814209
+               }
+           ],
+           "page_info": {
+               "page_no": 0,
+               "height": 2339,
+               "width": 1654
+           }
+       },
+       {
+           "layout_dets": [
+               {
+                   "category_id": 5,
+                   "poly": [
+                       99.13092803955078,
+                       2210.680419921875,
+                       497.3183898925781,
+                       2210.680419921875,
+                       497.3183898925781,
+                       2264.78076171875,
+                       99.13092803955078,
+                       2264.78076171875
+                   ],
+                   "score": 0.9999997019767761
+               }
+           ],
+           "page_info": {
+               "page_no": 1,
+               "height": 2339,
+               "width": 1654
+           }
+       }
+   ]
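A minimal sketch of summarizing a loaded ``some_pdf_model.json``: it counts detections per category across all pages. `CATEGORY_NAMES` mirrors the `CategoryType` values from the structure definition; the function name `count_categories` is hypothetical, and the input is assumed to be the list obtained from `json.load`.

```python
from collections import Counter

# Mirrors the CategoryType enum from the structure definition.
CATEGORY_NAMES = {
    0: "title", 1: "plain_text", 2: "abandon", 3: "figure",
    4: "figure_caption", 5: "table", 6: "table_caption",
    7: "table_footnote", 8: "isolate_formula", 9: "formula_caption",
    13: "embedding", 14: "isolated", 15: "text",
}


def count_categories(inference_result):
    """Count layout detections per category name over all pages."""
    counts = Counter()
    for page in inference_result:
        for det in page["layout_dets"]:
            # Unknown ids are grouped rather than raising, so newer
            # model outputs do not break the summary.
            counts[CATEGORY_NAMES.get(det["category_id"], "unknown")] += 1
    return counts
```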
+
+some_pdf_middle.json
+~~~~~~~~~~~~~~~~~~~~
+
+================ ============================================================
+Field Name       Description
+================ ============================================================
+pdf_info         list, each element is a dict representing the parsing result
+                 of each PDF page, see the table below for details
+\_parse_type     ocr \| txt, used to indicate the mode used in this
+                 intermediate parsing state
+\_version_name   string, indicates the version of magic-pdf used in this
+                 parsing
+================ ============================================================
+
+**pdf_info**
+
+Field structure description
+
+==================== ==========================================================
+Field Name           Description
+==================== ==========================================================
+preproc_blocks       Intermediate result after PDF preprocessing, not yet
+                     segmented
+layout_bboxes        Layout segmentation results, containing layout direction
+                     (vertical, horizontal), and bbox, sorted by reading order
+page_idx             Page number, starting from 0
+page_size            Page width and height
+\_layout_tree        Layout tree structure
+images               list, each element is a dict representing an img_block
+tables               list, each element is a dict representing a table_block
+interline_equations  list, each element is a dict representing an
+                     interline_equation_block
+discarded_blocks     List, block information returned by the model that needs
+                     to be dropped
+para_blocks          Result after segmenting preproc_blocks
+==================== ==========================================================
+
+In the above table, ``para_blocks`` is an array of dicts, each dict
+representing a block structure. A block can support up to one level of
+nesting.
+
+**block**
+
+The outer block is referred to as a first-level block, and the fields in
+the first-level block include:
+
+========== ==========================================================
+Field Name Description
+========== ==========================================================
+type       Block type (table \| image)
+bbox       Block bounding box coordinates
+blocks     list, each element is a dict representing a second-level
+           block
+========== ==========================================================
+
+There are only two types of first-level blocks: “table” and “image”. All
+other blocks are second-level blocks.
+
+The fields in a second-level block include:
+
+========== ==========================================================
+Field Name Description
+========== ==========================================================
+type       Block type
+bbox       Block bounding box coordinates
+lines      list, each element is a dict representing a line, used to
+           describe the composition of a line of information
+========== ==========================================================
+
+Detailed explanation of second-level block types
+
+================== ======================
+type               Description
+================== ======================
+image_body         Main body of the image
+image_caption      Image description text
+table_body         Main body of the table
+table_caption      Table description text
+table_footnote     Table footnote
+text               Text block
+title              Title block
+interline_equation Block formula
+================== ======================
+
+**line**
+
+The field format of a line is as follows:
+
+========== ==========================================================
+Field Name Description
+========== ==========================================================
+bbox       Bounding box coordinates of the line
+spans      list, each element is a dict representing a span, used to
+           describe the composition of the smallest unit
+========== ==========================================================
+
+**span**
+
+==================== =========================================================
+Field Name           Description
+==================== =========================================================
+bbox                 Bounding box coordinates of the span
+type                 Type of the span
+content \| img_path  Text spans use content, chart spans use img_path to store
+                     the actual text or screenshot path information
+==================== =========================================================
+
+The types of spans are as follows:
+
+================== ==============
+type               Description
+================== ==============
+image              Image
+table              Table
+text               Text
+inline_equation    Inline formula
+interline_equation Block formula
+================== ==============
+
+**Summary**
+
+A span is the smallest storage unit for all elements.
+
+The elements stored within para_blocks are block information.
+
+The block structure is as follows:
+
+First-level block (if any) -> Second-level block -> Line -> Span
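The nesting above can be walked with a short generator. This is a sketch under one assumption: a block is treated as first-level when it carries a `blocks` key, as in the first-level field table. The names `iter_spans` and `collect_text` are hypothetical and not part of magic-pdf.

```python
def iter_spans(para_blocks):
    """Yield every span dict under the given blocks, in document order."""
    for block in para_blocks:
        # First-level blocks (table / image) nest second-level blocks
        # under "blocks"; any other block is already second-level.
        second_level = block["blocks"] if "blocks" in block else [block]
        for sub_block in second_level:
            for line in sub_block.get("lines", []):
                yield from line.get("spans", [])


def collect_text(para_blocks):
    """Concatenate the content of all text spans."""
    return "".join(
        span["content"] for span in iter_spans(para_blocks)
        if span.get("type") == "text"
    )
```

For one page of ``some_pdf_middle.json`` this could be called as `collect_text(page["para_blocks"])`, where `page` is an element of `pdf_info`.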
+
+.. _example-1:
+
+example
+^^^^^^^
+
+.. code:: json
+
+   {
+       "pdf_info": [
+           {
+               "preproc_blocks": [
+                   {
+                       "type": "text",
+                       "bbox": [
+                           52,
+                           61.956024169921875,
+                           294,
+                           82.99800872802734
+                       ],
+                       "lines": [
+                           {
+                               "bbox": [
+                                   52,
+                                   61.956024169921875,
+                                   294,
+                                   72.0000228881836
+                               ],
+                               "spans": [
+                                   {
+                                       "bbox": [
+                                           54.0,
+                                           61.956024169921875,
+                                           296.2261657714844,
+                                           72.0000228881836
+                                       ],
+                                       "content": "dependent on the service headway and the reliability of the departure ",
+                                       "type": "text",
+                                       "score": 1.0
+                                   }
+                               ]
+                           }
+                       ]
+                   }
+               ],
+               "layout_bboxes": [
+                   {
+                       "layout_bbox": [
+                           52,
+                           61,
+                           294,
+                           731
+                       ],
+                       "layout_label": "V",
+                       "sub_layout": []
+                   }
+               ],
+               "page_idx": 0,
+               "page_size": [
+                   612.0,
+                   792.0
+               ],
+               "_layout_tree": [],
+               "images": [],
+               "tables": [],
+               "interline_equations": [],
+               "discarded_blocks": [],
+               "para_blocks": [
+                   {
+                       "type": "text",
+                       "bbox": [
+                           52,
+                           61.956024169921875,
+                           294,
+                           82.99800872802734
+                       ],
+                       "lines": [
+                           {
+                               "bbox": [
+                                   52,
+                                   61.956024169921875,
+                                   294,
+                                   72.0000228881836
+                               ],
+                               "spans": [
+                                   {
+                                       "bbox": [
+                                           54.0,
+                                           61.956024169921875,
+                                           296.2261657714844,
+                                           72.0000228881836
+                                       ],
+                                       "content": "dependent on the service headway and the reliability of the departure ",
+                                       "type": "text",
+                                       "score": 1.0
+                                   }
+                               ]
+                           }
+                       ]
+                   }
+               ]
+           }
+       ],
+       "_parse_type": "txt",
+       "_version_name": "0.6.1"
+   }
+
+.. |Poly Coordinate Diagram| image:: ../../_static/image/poly.png
--- a/next_docs/requirements.txt
+++ b/next_docs/requirements.txt
@@ -5,7 +5,8 @@ Pillow==8.4.0
 pydantic>=2.7.2,<2.8.0
 PyMuPDF>=1.24.9
 sphinx
-sphinx-argparse
-sphinx-book-theme
-sphinx-copybutton
-sphinx_rtd_theme
+sphinx-argparse>=0.5.2
+sphinx-book-theme>=1.1.3
+sphinx-copybutton>=0.5.2
+sphinx_rtd_theme>=3.0.1
+autodoc_pydantic>=2.2.0
\ No newline at end of file
--- a/next_docs/zh_cn/.readthedocs.yaml
+++ b/next_docs/zh_cn/.readthedocs.yaml
@@ -10,7 +10,7 @@ formats:

 python:
  install:
-    - requirements: docs/requirements.txt
+    - requirements: next_docs/requirements.txt

 sphinx:
-  configuration: docs/zh_cn/conf.py
+  configuration: next_docs/zh_cn/conf.py
--- a/scripts/download_models.py
+++ b/scripts/download_models.py
+import json
+import os
+
+import requests
+from modelscope import snapshot_download
+
+
+def download_json(url):
+    # Download the JSON file
+    response = requests.get(url)
+    response.raise_for_status()  # Raise if the request failed
+    return response.json()
+
+
+def download_and_modify_json(url, local_filename, modifications):
+    if os.path.exists(local_filename):
+        # Use a context manager so the file handle is closed promptly
+        with open(local_filename, encoding='utf-8') as f:
+            data = json.load(f)
+        config_version = data.get('config_version', '0.0.0')
+        if config_version < '1.0.0':
+            data = download_json(url)
+    else:
+        data = download_json(url)
+
+    # Apply the modifications
+    for key, value in modifications.items():
+        data[key] = value
+
+    # Save the modified content
+    with open(local_filename, 'w', encoding='utf-8') as f:
+        json.dump(data, f, ensure_ascii=False, indent=4)
+
+
+if __name__ == '__main__':
+    mineru_patterns = [
+        "models/Layout/LayoutLMv3/*",
+        "models/Layout/YOLO/*",
+        "models/MFD/YOLO/*",
+        "models/MFR/unimernet_small/*",
+        "models/TabRec/TableMaster/*",
+        "models/TabRec/StructEqTable/*",
+    ]
+    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)
+    layoutreader_model_dir = snapshot_download('ppaanngggg/layoutreader')
+    model_dir = model_dir + '/models'
+    print(f'model_dir is: {model_dir}')
+    print(f'layoutreader_model_dir is: {layoutreader_model_dir}')
+
+    json_url = 'https://gitee.com/myhloli/MinerU/raw/dev/magic-pdf.template.json'
+    config_file_name = 'magic-pdf.json'
+    home_dir = os.path.expanduser('~')
+    config_file = os.path.join(home_dir, config_file_name)
+
+    json_mods = {
+        'models-dir': model_dir,
+        'layoutreader-model-dir': layoutreader_model_dir,
+    }
+
+    download_and_modify_json(json_url, config_file, json_mods)
+    print(f'The configuration file has been configured successfully, the path is: {config_file}')
--- a/scripts/download_models_hf.py
+++ b/scripts/download_models_hf.py
+import json
+import os
+
+import requests
+from huggingface_hub import snapshot_download
+
+
+def download_json(url):
+    # Download the JSON file
+    response = requests.get(url)
+    response.raise_for_status()  # Raise an error if the request failed
+    return response.json()
+
+
+def download_and_modify_json(url, local_filename, modifications):
+    if os.path.exists(local_filename):
+        # Use a context manager so the file handle is closed promptly
+        with open(local_filename, encoding='utf-8') as f:
+            data = json.load(f)
+        config_version = data.get('config_version', '0.0.0')
+        if config_version < '1.0.0':
+            data = download_json(url)
+    else:
+        data = download_json(url)
+
+    # Apply the modifications
+    for key, value in modifications.items():
+        data[key] = value
+
+    # Save the modified content
+    with open(local_filename, 'w', encoding='utf-8') as f:
+        json.dump(data, f, ensure_ascii=False, indent=4)
+
+
+if __name__ == '__main__':
+
+    mineru_patterns = [
+        "models/Layout/LayoutLMv3/*",
+        "models/Layout/YOLO/*",
+        "models/MFD/YOLO/*",
+        "models/MFR/unimernet_small/*",
+        "models/TabRec/TableMaster/*",
+        "models/TabRec/StructEqTable/*",
+    ]
+    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)
+
+    layoutreader_pattern = [
+        "*.json",
+        "*.safetensors",
+    ]
+    layoutreader_model_dir = snapshot_download('hantian/layoutreader', allow_patterns=layoutreader_pattern)
+
+    model_dir = model_dir + '/models'
+    print(f'model_dir is: {model_dir}')
+    print(f'layoutreader_model_dir is: {layoutreader_model_dir}')
+
+    json_url = 'https://github.com/opendatalab/MinerU/raw/dev/magic-pdf.template.json'
+    config_file_name = 'magic-pdf.json'
+    home_dir = os.path.expanduser('~')
+    config_file = os.path.join(home_dir, config_file_name)
+
+    json_mods = {
+        'models-dir': model_dir,
+        'layoutreader-model-dir': layoutreader_model_dir,
+    }
+
+    download_and_modify_json(json_url, config_file, json_mods)
+    print(f'The configuration file has been configured successfully, the path is: {config_file}')
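Review note on the config update flow shared by both download scripts: an existing `magic-pdf.json` is kept when its `config_version` is at least `'1.0.0'`, otherwise the template replaces it, and the model paths are then written over whichever copy was chosen. The offline sketch below mirrors that flow with an in-memory `template` standing in for the HTTP download. `apply_config`, the example paths, and `custom-key` are hypothetical and only serve the illustration.

```python
import json
import os
import tempfile


def apply_config(template, local_filename, modifications):
    """Mirror the script's flow, with `template` standing in for the download."""
    data = dict(template)
    if os.path.exists(local_filename):
        with open(local_filename, encoding='utf-8') as f:
            existing = json.load(f)
        # Keep the existing file only if it is already at version 1.0.0 or later
        # (same string comparison as the script).
        if existing.get('config_version', '0.0.0') >= '1.0.0':
            data = existing
    data.update(modifications)
    with open(local_filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
    return data


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'magic-pdf.json')
    template = {'config_version': '1.0.0', 'models-dir': '/tmp/models'}

    # First run: no file yet, so the template is used and then patched.
    first = apply_config(template, path, {'models-dir': '/data/models'})

    # Simulate a user adding a key by hand between runs.
    with open(path, 'w', encoding='utf-8') as f:
        json.dump({**first, 'custom-key': 'kept'}, f)

    # Second run: the file is current, so the hand-added key survives.
    second = apply_config(template, path, {'layoutreader-model-dir': '/data/lr'})
    print(second)
```

The point of the second run is that rerunning a download script on an up-to-date config only overwrites the keys listed in `json_mods`.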
--- a/tests/test_data/data_reader_writer/test_multi_bucket_s3.py
+++ b/tests/test_data/data_reader_writer/test_multi_bucket_s3.py
@@ -41,8 +41,8 @@ def test_multi_bucket_s3_reader_writer():
        ),
    ]

-    reader = MultiBucketS3DataReader(default_bucket=bucket, s3_configs=s3configs)
-    writer = MultiBucketS3DataWriter(default_bucket=bucket, s3_configs=s3configs)
+    reader = MultiBucketS3DataReader(bucket, s3configs)
+    writer = MultiBucketS3DataWriter(bucket, s3configs)

    bits = reader.read('meta-index/scihub/v001/scihub/part-66210c190659-000026.jsonl')

@@ -80,3 +80,81 @@ def test_multi_bucket_s3_reader_writer():
    assert '123'.encode() == reader.read(
        'unittest/data/data_reader_writer/multi_bucket_s3_data/test02.txt'
    )
+
+
+@pytest.mark.skipif(
+    os.getenv('S3_ACCESS_KEY_2', None) is None, reason='need s3 config!'
+)
+def test_multi_bucket_s3_reader_writer_with_prefix():
+    """Test the multi-bucket S3 reader and writer with a bucket prefix.
+
+    Requires S3 settings for both buckets in the environment:
+    export S3_BUCKET=xxx S3_ACCESS_KEY=xxx S3_SECRET_KEY=xxx S3_ENDPOINT=xxx
+    export S3_BUCKET_2=xxx S3_ACCESS_KEY_2=xxx S3_SECRET_KEY_2=xxx S3_ENDPOINT_2=xxx
+    """
+    bucket = os.getenv('S3_BUCKET', '')
+    ak = os.getenv('S3_ACCESS_KEY', '')
+    sk = os.getenv('S3_SECRET_KEY', '')
+    endpoint_url = os.getenv('S3_ENDPOINT', '')
+
+    bucket_2 = os.getenv('S3_BUCKET_2', '')
+    ak_2 = os.getenv('S3_ACCESS_KEY_2', '')
+    sk_2 = os.getenv('S3_SECRET_KEY_2', '')
+    endpoint_url_2 = os.getenv('S3_ENDPOINT_2', '')
+
+    s3configs = [
+        S3Config(
+            bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
+        ),
+        S3Config(
+            bucket_name=bucket_2,
+            access_key=ak_2,
+            secret_key=sk_2,
+            endpoint_url=endpoint_url_2,
+        ),
+    ]
+
+    prefix = 'meta-index'
+    reader = MultiBucketS3DataReader(f'{bucket}/{prefix}', s3configs)
+    writer = MultiBucketS3DataWriter(f'{bucket}/{prefix}', s3configs)
+
+    bits = reader.read('scihub/v001/scihub/part-66210c190659-000026.jsonl')
+
+    assert bits == reader.read(
+        f's3://{bucket}/{prefix}/scihub/v001/scihub/part-66210c190659-000026.jsonl'
+    )
+
+    bits = reader.read(
+        f's3://{bucket_2}/enbook-scimag/78800000/libgen.scimag78872000-78872999/10.1017/cbo9780511770425.012.pdf'
+    )
+    docs = fitz.open('pdf', bits)
+    assert len(docs) == 10
+
+    bits = reader.read(
+        'scihub/v001/scihub/part-66210c190659-000026.jsonl?bytes=566,713'
+    )
+    assert bits == reader.read_at(
+        'scihub/v001/scihub/part-66210c190659-000026.jsonl', 566, 713
+    )
+    assert len(json.loads(bits)) > 0
+
+    writer.write_string(
+        'unittest/data/data_reader_writer/multi_bucket_s3_data/test01.txt', 'abc'
+    )
+
+    assert 'abc'.encode() == reader.read(
+        'unittest/data/data_reader_writer/multi_bucket_s3_data/test01.txt'
+    )
+
+    assert 'abc'.encode() == reader.read(
+        f's3://{bucket}/{prefix}/unittest/data/data_reader_writer/multi_bucket_s3_data/test01.txt'
+    )
+
+    writer.write(
+        'unittest/data/data_reader_writer/multi_bucket_s3_data/test02.txt',
+        '123'.encode(),
+    )
+
+    assert '123'.encode() == reader.read(
+        'unittest/data/data_reader_writer/multi_bucket_s3_data/test02.txt'
+    )
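Review note on the multi-bucket test above: relative paths are read from the default bucket (passed as `bucket/prefix`), while a full `s3://` URL naming the second bucket is served with that bucket's credentials. One way such routing could work is sketched below. This is an assumption for illustration, not the `magic_pdf` implementation; `BucketConfig` and `select_config` are hypothetical names, and the bucket names and endpoints are made up.

```python
from dataclasses import dataclass


@dataclass
class BucketConfig:
    """Hypothetical stand-in for the S3Config used in the test."""
    bucket_name: str
    endpoint_url: str


def select_config(default_bucket_prefix, configs, path):
    """Pick the config whose bucket matches `path`, else the default bucket."""
    if path.startswith('s3://'):
        # Full URL: the bucket is the first component after the scheme.
        bucket = path[len('s3://'):].split('/', 1)[0]
    else:
        # Relative path: use the bucket part of 'bucket/prefix'.
        bucket = default_bucket_prefix.split('/', 1)[0]
    for config in configs:
        if config.bucket_name == bucket:
            return config
    raise ValueError(f'no S3 config for bucket {bucket!r}')


configs = [
    BucketConfig('bucket-a', 'https://a.example.com'),
    BucketConfig('bucket-b', 'https://b.example.com'),
]
print(select_config('bucket-a/meta-index', configs, 'scihub/part.jsonl').endpoint_url)
print(select_config('bucket-a/meta-index', configs, 's3://bucket-b/book.pdf').endpoint_url)
```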
--- a/tests/test_data/data_reader_writer/test_s3.py
+++ b/tests/test_data/data_reader_writer/test_s3.py
@@ -9,7 +9,7 @@ from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
 @pytest.mark.skipif(
    os.getenv('S3_ACCESS_KEY', None) is None, reason='need s3 config!'
 )
-def test_multi_bucket_s3_reader_writer():
+def test_s3_reader_writer():
    """test multi bucket s3 reader writer must config s3 config in the
    environment export S3_BUCKET=xxx export S3_ACCESS_KEY=xxx export
    S3_SECRET_KEY=xxx export S3_ENDPOINT=xxx."""
@@ -18,8 +18,8 @@ def test_multi_bucket_s3_reader_writer():
    sk = os.getenv('S3_SECRET_KEY', '')
    endpoint_url = os.getenv('S3_ENDPOINT', '')

-    reader = S3DataReader(bucket=bucket, ak=ak, sk=sk, endpoint_url=endpoint_url)
-    writer = S3DataWriter(bucket=bucket, ak=ak, sk=sk, endpoint_url=endpoint_url)
+    reader = S3DataReader('', bucket, ak, sk, endpoint_url)
+    writer = S3DataWriter('', bucket, ak, sk, endpoint_url)

    bits = reader.read('meta-index/scihub/v001/scihub/part-66210c190659-000026.jsonl')

@@ -51,3 +51,56 @@ def test_multi_bucket_s3_reader_writer():
    assert '123'.encode() == reader.read(
        'unittest/data/data_reader_writer/multi_bucket_s3_data/test02.txt'
    )
+
+
+@pytest.mark.skipif(
+    os.getenv('S3_ACCESS_KEY', None) is None, reason='need s3 config!'
+)
+def test_s3_reader_writer_with_prefix():
+    """Test the single-bucket S3 reader and writer with a key prefix.
+
+    Requires S3 settings in the environment:
+    export S3_BUCKET=xxx S3_ACCESS_KEY=xxx S3_SECRET_KEY=xxx S3_ENDPOINT=xxx
+    """
+    bucket = os.getenv('S3_BUCKET', '')
+    ak = os.getenv('S3_ACCESS_KEY', '')
+    sk = os.getenv('S3_SECRET_KEY', '')
+    endpoint_url = os.getenv('S3_ENDPOINT', '')
+
+    prefix = 'meta-index'
+
+    reader = S3DataReader(prefix, bucket, ak, sk, endpoint_url)
+    writer = S3DataWriter(prefix, bucket, ak, sk, endpoint_url)
+
+    bits = reader.read('scihub/v001/scihub/part-66210c190659-000026.jsonl')
+
+    assert bits == reader.read(
+        f's3://{bucket}/{prefix}/scihub/v001/scihub/part-66210c190659-000026.jsonl'
+    )
+
+    bits = reader.read(
+        'scihub/v001/scihub/part-66210c190659-000026.jsonl?bytes=566,713'
+    )
+    assert bits == reader.read_at(
+        'scihub/v001/scihub/part-66210c190659-000026.jsonl', 566, 713
+    )
+    assert len(json.loads(bits)) > 0
+
+    writer.write_string(
+        'unittest/data/data_reader_writer/multi_bucket_s3_data/test01.txt', 'abc'
+    )
+
+    assert 'abc'.encode() == reader.read(
+        'unittest/data/data_reader_writer/multi_bucket_s3_data/test01.txt'
+    )
+
+    assert 'abc'.encode() == reader.read(
+        f's3://{bucket}/{prefix}/unittest/data/data_reader_writer/multi_bucket_s3_data/test01.txt'
+    )
+
+    writer.write(
+        f'{bucket}/{prefix}/unittest/data/data_reader_writer/multi_bucket_s3_data/test02.txt',
+        '123'.encode(),
+    )
+
+    assert '123'.encode() == reader.read(
+        'unittest/data/data_reader_writer/multi_bucket_s3_data/test02.txt'
+    )
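Review note on the prefix tests: they assert that a relative path read through a reader built with `prefix` returns the same bytes as the full `s3://{bucket}/{prefix}/...` URL, and that a `?bytes=566,713` suffix is equivalent to `read_at(path, 566, 713)`. The sketch below shows one way a path could be resolved to satisfy both assertions. `resolve` is a hypothetical helper, not the library's code, and naming the two numbers `offset` and `limit` is an assumption.

```python
def resolve(bucket, prefix, path):
    """Return (bucket, key, offset, limit) for a relative or s3:// path."""
    offset, limit = 0, -1  # -1 is used here to mean "no range requested"
    if '?bytes=' in path:
        path, byte_range = path.split('?bytes=', 1)
        start, end = byte_range.split(',')
        offset, limit = int(start), int(end)
    if path.startswith('s3://'):
        # Full URL: bucket and key come from the URL itself.
        bucket, _, key = path[len('s3://'):].partition('/')
    else:
        # Relative path: prepend the configured prefix, if any.
        key = f'{prefix}/{path}' if prefix else path
    return bucket, key, offset, limit


relative = resolve('my-bucket', 'meta-index', 'scihub/part.jsonl')
absolute = resolve('my-bucket', 'meta-index', 's3://my-bucket/meta-index/scihub/part.jsonl')
ranged = resolve('my-bucket', 'meta-index', 'scihub/part.jsonl?bytes=566,713')
print(relative == absolute)
print(ranged)
```

With an empty prefix, as in `S3DataReader('', bucket, ak, sk, endpoint_url)`, the relative path is used as the key unchanged.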