docs: remove outdated documentation files

- Deleted .readthedocs.yaml files from multiple directories
- Removed outdated API and user guide documentation files
- Deleted command line usage examples
- Removed CUDA acceleration guide

Commit cf5c8f47 authored Jun 13, 2025 by myhloli
5 changed files
--- a/next_docs/zh_cn/user_guide/tutorial.rst
+++ b/next_docs/zh_cn/user_guide/tutorial.rst
-Tutorial
-===========
-Let's learn MinerU by building a minimal project.
-.. toctree::
-    :maxdepth: 1
-    :caption: Tutorial
-    tutorial/output_file_description
-    tutorial/pipeline
--- a/next_docs/zh_cn/user_guide/tutorial/output_file_description.rst
+++ b/next_docs/zh_cn/user_guide/tutorial/output_file_description.rst
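-Output File Formats
-===================
-Besides the markdown-related files, the ``magic-pdf`` command also
-generates several files that are unrelated to markdown.
-Each of these files is described below.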
-some_pdf_layout.pdf
-~~~~~~~~~~~~~~~~~~~
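-The layout of each page consists of one or more boxes.
-The number in the top-left corner of each box is its sequence number. In addition, layout.pdf
-uses different background colors inside the boxes to mark different content blocks.
-.. figure:: ../../_static/image/layout_example.png
-   :alt: Layout page example
-   Layout page example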
-some_pdf_spans.pdf
-~~~~~~~~~~~~~~~~~~
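-All spans on the page are drawn with outlines whose color depends on the
-span type. This file can be used for quality checks, to quickly spot problems such as missing text or unrecognized interline formulas.
-.. figure:: ../../_static/image/spans_example.png
-   :alt: Span page example
-   Span page example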
-some_pdf_model.json
-~~~~~~~~~~~~~~~~~~~
-结构定义
-^^^^^^^^
-.. code:: python
-   from pydantic import BaseModel, Field
-   from enum import IntEnum
-   class CategoryType(IntEnum):
-        title = 0               # title
-        plain_text = 1          # text
-        abandon = 2             # headers, footers, page numbers, and page annotations
-        figure = 3              # image
-        figure_caption = 4      # image caption
-        table = 5               # table
-        table_caption = 6       # table caption
-        table_footnote = 7      # table footnote
-        isolate_formula = 8     # interline formula
-        formula_caption = 9     # label of an interline formula
-        embedding = 13          # inline formula
-        isolated = 14           # interline formula
-        text = 15               # OCR recognition result
-   class PageInfo(BaseModel):
-       page_no: int = Field(description="Page index; the first page is 0", ge=0)
-       height: int = Field(description="Page height", gt=0)
-       width: int = Field(description="Page width", ge=0)
-   class ObjectInferenceResult(BaseModel):
-       category_id: CategoryType = Field(description="Category", ge=0)
-       poly: list[float] = Field(description="Quadrilateral coordinates: the top-left, top-right, bottom-right, and bottom-left points")
-       score: float = Field(description="Confidence of the inference result")
-       latex: str | None = Field(description="LaTeX parsing result", default=None)
-       html: str | None = Field(description="HTML parsing result", default=None)
-   class PageInferenceResults(BaseModel):
-        # No numeric bound here: a constraint such as ge=0 does not apply to a list field
-        layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results")
-        page_info: PageInfo = Field(description="Page metadata")
-   # The inference results of all pages, placed in a list in page order, form the MinerU inference result
-   inference_result: list[PageInferenceResults] = []
-The poly coordinate format is [x0, y0, x1, y1, x2, y2, x3, y3],
-giving the top-left, top-right, bottom-right, and bottom-left points in that order |poly 坐标示意图|
-Sample Data
-^^^^^^^^^^^
-.. code:: json
-   [
-       {
-           "layout_dets": [
-               {
-                   "category_id": 2,
-                   "poly": [
-                       99.1906967163086,
-                       100.3119125366211,
-                       730.3707885742188,
-                       100.3119125366211,
-                       730.3707885742188,
-                       245.81326293945312,
-                       99.1906967163086,
-                       245.81326293945312
-                   ],
-                   "score": 0.9999997615814209
-               }
-           ],
-           "page_info": {
-               "page_no": 0,
-               "height": 2339,
-               "width": 1654
-           }
-       },
-       {
-           "layout_dets": [
-               {
-                   "category_id": 5,
-                   "poly": [
-                       99.13092803955078,
-                       2210.680419921875,
-                       497.3183898925781,
-                       2210.680419921875,
-                       497.3183898925781,
-                       2264.78076171875,
-                       99.13092803955078,
-                       2264.78076171875
-                   ],
-                   "score": 0.9999997019767761
-               }
-           ],
-           "page_info": {
-               "page_no": 1,
-               "height": 2339,
-               "width": 1654
-           }
-       }
-   ]
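As a rough illustration of how the poly format could be consumed, the sketch below parses a cut-down version of the sample (values rounded for readability) and collapses each eight-value poly into an axis-aligned box. The helper name `poly_to_bbox` is hypothetical and is not part of magic-pdf.

```python
import json

# Cut-down stand-in for some_pdf_model.json: one page, one detection,
# with coordinates rounded from the sample data for readability.
SAMPLE = """
[
  {
    "layout_dets": [
      {
        "category_id": 2,
        "poly": [99.0, 100.0, 730.0, 100.0, 730.0, 245.0, 99.0, 245.0],
        "score": 0.99
      }
    ],
    "page_info": {"page_no": 0, "height": 2339, "width": 1654}
  }
]
"""


def poly_to_bbox(poly):
    """Hypothetical helper: turn [x0, y0, ..., x3, y3] into [xmin, ymin, xmax, ymax]."""
    xs = poly[0::2]  # x coordinates sit at the even indices
    ys = poly[1::2]  # y coordinates sit at the odd indices
    return [min(xs), min(ys), max(xs), max(ys)]


pages = json.loads(SAMPLE)
for page in pages:
    page_no = page["page_info"]["page_no"]
    for det in page["layout_dets"]:
        print(page_no, det["category_id"], poly_to_bbox(det["poly"]))
```

Taking the minimum and maximum rather than reading fixed positions keeps the helper correct even if a quadrilateral is slightly rotated.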
-some_pdf_middle.json
-~~~~~~~~~~~~~~~~~~~~
-+--------------------+----------------------------------------------------------+
-| Field              | Description                                              |
-+====================+==========================================================+
-| pdf_info           | list; each element is a dict holding the parse result    |
-|                    | of one PDF page (see the table below)                    |
-+--------------------+----------------------------------------------------------+
-| \_parse_type       | ocr \| txt; identifies the mode used for the             |
-|                    | intermediate state of this parse                         |
-+--------------------+----------------------------------------------------------+
-| \_version_name     | string; the magic-pdf version used for this parse        |
-+--------------------+----------------------------------------------------------+
-**pdf_info** field structure
-+---------------------+-------------------------------------------------------+
-| Field               | Description                                           |
-+=====================+=======================================================+
-| preproc_blocks      | Intermediate result after PDF preprocessing, not yet  |
-|                     | split into paragraphs                                 |
-+---------------------+-------------------------------------------------------+
-| layout_bboxes       | Layout segmentation result: layout direction          |
-|                     | (vertical or horizontal) and bbox, in reading order   |
-+---------------------+-------------------------------------------------------+
-| page_idx            | Page number, starting from 0                          |
-+---------------------+-------------------------------------------------------+
-| page_size           | Page width and height                                 |
-+---------------------+-------------------------------------------------------+
-| \_layout_tree       | Layout tree structure                                 |
-+---------------------+-------------------------------------------------------+
-| images              | list; each element is a dict representing an          |
-|                     | img_block                                             |
-+---------------------+-------------------------------------------------------+
-| tables              | list; each element is a dict representing a           |
-|                     | table_block                                           |
-+---------------------+-------------------------------------------------------+
-| interline_equations | list; each element is a dict representing an          |
-|                     | interline_equation_block                              |
-+---------------------+-------------------------------------------------------+
-| discarded_blocks    | list; block information that the model marked to be   |
-|                     | dropped                                               |
-+---------------------+-------------------------------------------------------+
-| para_blocks         | Result of segmenting preproc_blocks into paragraphs   |
-+---------------------+-------------------------------------------------------+
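-``para_blocks`` in the table above
-is an array of dicts. Each dict is a block structure, and blocks can be nested at most one level deep.
-**block**
-The outer block is called a first-level block. Its fields are:
-====== ===============================================
-Field  Description
-====== ===============================================
-type   block type (table|image)
-bbox   bounding-box coordinates of the block
-blocks list of second-level blocks, each a dict
-====== ===============================================
-First-level blocks have only two types, "table" and "image"; all other blocks are second-level blocks.
-A second-level block has the following fields: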
-+----------+----------------------------------------------------------------+
-| Field    | Description                                                    |
-+==========+================================================================+
-| type     | block type                                                     |
-+----------+----------------------------------------------------------------+
-| bbox     | bounding-box coordinates of the block                          |
-+----------+----------------------------------------------------------------+
-| lines    | list; each element is a dict describing one line               |
-+----------+----------------------------------------------------------------+
-Second-level block types in detail
-================== =======================
-type               desc
-================== =======================
-image_body         the image itself
-image_caption      image caption text
-image_footnote     image footnote
-table_body         the table itself
-table_caption      table caption text
-table_footnote     table footnote
-text               text block
-title              title block
-index              table-of-contents block
-list               list block
-interline_equation interline formula block
-================== =======================
-**line**
-A line has the following fields:
-+-----------+-----------------------------------------------------------------+
-| Field     | Description                                                     |
-+===========+=================================================================+
-| bbox      | bounding-box coordinates of the line                            |
-+-----------+-----------------------------------------------------------------+
-| spans     | list; each element is a dict describing one span, the smallest  |
-|           | unit of content                                                 |
-+-----------+-----------------------------------------------------------------+
-**span**
-+------------+---------------------------------------------------------+
-| Field      | Description                                             |
-+============+=========================================================+
-| bbox       | bounding-box coordinates of the span                    |
-+------------+---------------------------------------------------------+
-| type       | span type                                               |
-+------------+---------------------------------------------------------+
-| content \| | text spans use content and image/table spans use        |
-| img_path   | img_path, storing the actual text or screenshot path    |
-+------------+---------------------------------------------------------+
-Spans have the following types:
-================== =================
-type               desc
-================== =================
-image              image
-table              table
-text               text
-inline_equation    inline formula
-interline_equation interline formula
-================== =================
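-**Summary**
-A span is the smallest storage unit for all elements.
-The elements stored in para_blocks are block information.
-The block structure is
-first-level block (if any) -> second-level block -> line -> span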
-.. _示例数据-1:
-Sample Data
-^^^^^^^^^^^
-.. code:: json
-   {
-       "pdf_info": [
-           {
-               "preproc_blocks": [
-                   {
-                       "type": "text",
-                       "bbox": [
-                           52,
-                           61.956024169921875,
-                           294,
-                           82.99800872802734
-                       ],
-                       "lines": [
-                           {
-                               "bbox": [
-                                   52,
-                                   61.956024169921875,
-                                   294,
-                                   72.0000228881836
-                               ],
-                               "spans": [
-                                   {
-                                       "bbox": [
-                                           54.0,
-                                           61.956024169921875,
-                                           296.2261657714844,
-                                           72.0000228881836
-                                       ],
-                                       "content": "dependent on the service headway and the reliability of the departure ",
-                                       "type": "text",
-                                       "score": 1.0
-                                   }
-                               ]
-                           }
-                       ]
-                   }
-               ],
-               "layout_bboxes": [
-                   {
-                       "layout_bbox": [
-                           52,
-                           61,
-                           294,
-                           731
-                       ],
-                       "layout_label": "V",
-                       "sub_layout": []
-                   }
-               ],
-               "page_idx": 0,
-               "page_size": [
-                   612.0,
-                   792.0
-               ],
-               "_layout_tree": [],
-               "images": [],
-               "tables": [],
-               "interline_equations": [],
-               "discarded_blocks": [],
-               "para_blocks": [
-                   {
-                       "type": "text",
-                       "bbox": [
-                           52,
-                           61.956024169921875,
-                           294,
-                           82.99800872802734
-                       ],
-                       "lines": [
-                           {
-                               "bbox": [
-                                   52,
-                                   61.956024169921875,
-                                   294,
-                                   72.0000228881836
-                               ],
-                               "spans": [
-                                   {
-                                       "bbox": [
-                                           54.0,
-                                           61.956024169921875,
-                                           296.2261657714844,
-                                           72.0000228881836
-                                       ],
-                                       "content": "dependent on the service headway and the reliability of the departure ",
-                                       "type": "text",
-                                       "score": 1.0
-                                   }
-                               ]
-                           }
-                       ]
-                   }
-               ]
-           }
-       ],
-       "_parse_type": "txt",
-       "_version_name": "0.6.1"
-   }
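The nesting described above (first-level block, if any, then second-level block, line, span) can be walked with a small recursive generator. This is only a sketch under stated assumptions: `iter_spans` and `page_text` are hypothetical helpers, the image block in the stand-in data is invented for illustration, and bbox and score fields are left out.

```python
from typing import Iterator


def iter_spans(block: dict) -> Iterator[dict]:
    """Yield every span under a block, descending into nested blocks if present."""
    # First-level blocks (table/image) carry their children under "blocks".
    for child in block.get("blocks", []):
        yield from iter_spans(child)
    # Second-level blocks carry "lines", and each line carries "spans".
    for line in block.get("lines", []):
        yield from line.get("spans", [])


def page_text(page: dict) -> str:
    """Concatenate the content of every span in para_blocks that has a content field."""
    return "".join(
        span["content"]
        for block in page.get("para_blocks", [])
        for span in iter_spans(block)
        if "content" in span
    )


# Cut-down stand-in for a middle.json document; the image block is made up.
middle = {
    "pdf_info": [
        {
            "page_idx": 0,
            "para_blocks": [
                {
                    "type": "text",
                    "lines": [
                        {"spans": [{"type": "text", "content": "dependent on the service headway "}]}
                    ],
                },
                {
                    "type": "image",
                    "blocks": [
                        {
                            "type": "image_caption",
                            "lines": [{"spans": [{"type": "text", "content": "Figure 1"}]}],
                        }
                    ],
                },
            ],
        }
    ]
}

for page in middle["pdf_info"]:
    print(page["page_idx"], repr(page_text(page)))
```

Spans that store an `img_path` instead of `content` are skipped by the `"content" in span` filter.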
-.. |poly 坐标示意图| image:: ../../_static/image/poly.png
--- a/next_docs/zh_cn/user_guide/tutorial/pipeline.rst
+++ b/next_docs/zh_cn/user_guide/tutorial/pipeline.rst
-Pipeline
-===========
-Minimal Example
-^^^^^^^^^^^^^^^
-.. code:: python
-    import os
-    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.data.dataset import PymuDocDataset
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    # args
-    pdf_file_name = "abc.pdf"  # replace with the real pdf path
-    name_without_suff = pdf_file_name.split(".")[0]
-    # prepare env
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-    os.makedirs(local_image_dir, exist_ok=True)
-    image_writer = FileBasedDataWriter(local_image_dir)
-    md_writer = FileBasedDataWriter(local_md_dir)
-    # read bytes
-    reader1 = FileBasedDataReader("")
-    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content
-    # proc
-    ## Create Dataset Instance
-    ds = PymuDocDataset(pdf_bytes)
-    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
-Running the code above produces the following result
-.. code:: bash 
-    output/
-    ├── abc.md
-    └── images
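-Setting aside environment setup such as creating directories and importing dependencies, the code that actually converts the ``pdf`` to ``markdown`` is the following snippet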
-.. code::
-    # read bytes
-    reader1 = FileBasedDataReader("")
-    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content
-    # proc
-    ## Create Dataset Instance
-    ds = PymuDocDataset(pdf_bytes)
-    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
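-``ds.apply(doc_analyze, ocr=True)`` produces an ``InferenceResult`` object. Calling ``pipe_ocr_mode`` on the ``InferenceResult`` object produces a ``PipeResult`` object.
-Calling ``dump_md`` on the ``PipeResult`` object writes a ``markdown`` file to the specified location.
-The pipeline execution flow is shown in the figure below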
-.. image:: ../../_static/image/pipeline.drawio.svg 
-.. raw:: html 
-    <br> </br>
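-The pipeline is currently divided into three stages: data, inference, and processing. They correspond to the ``Dataset``, ``InferenceResult``, and ``PipeResult`` entities in the figure, and are linked together through methods such as ``apply``, ``doc_analyze``, and ``pipe_ocr_mode``.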
-.. admonition:: Tip
-    :class: tip
-    For more usage examples of Dataset, InferenceResult, and PipeResult, see :doc:`../quick_start/to_markdown`
-    For more details about Dataset, InferenceResult, and PipeResult, see the English version of the MinerU documentation.
-管道组合
-^^^^^^^^^
-.. code:: python
-    class Dataset(ABC):
-        @abstractmethod
-        def apply(self, proc: Callable, *args, **kwargs):
-            """Apply a callable to this dataset.
-            Args:
-                proc (Callable): invoke proc as follows:
-                    proc(self, *args, **kwargs)
-            Returns:
-                Any: return the result generated by proc
-            """
-            pass
-    class InferenceResult(InferenceResultBase):
-        def apply(self, proc: Callable, *args, **kwargs):
-            """Apply a callable to a copy of the inference result.
-            Args:
-                proc (Callable): invoke proc as follows:
-                    proc(inference_result, *args, **kwargs)
-            Returns:
-                Any: return the result generated by proc
-            """
-            return proc(copy.deepcopy(self._infer_res), *args, **kwargs)
-        def pipe_ocr_mode(
-            self,
-            imageWriter: DataWriter,
-            start_page_id=0,
-            end_page_id=None,
-            debug_mode=False,
-            lang=None,
-            ) -> PipeResult:
-            pass
-    class PipeResult:
-        def apply(self, proc: Callable, *args, **kwargs):
-            """Apply a callable to a copy of the pipeline result.
-            Args:
-                proc (Callable): invoke proc as follows:
-                    proc(pipeline_result, *args, **kwargs)
-            Returns:
-                Any: return the result generated by proc
-            """
-            return proc(copy.deepcopy(self._pipe_res), *args, **kwargs)
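-The ``Dataset``, ``InferenceResult``, and ``PipeResult`` classes all have an ``apply`` method, which can be used to combine the computations of different stages.
-As shown below, ``MinerU`` provides one way of composing these classes.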
-.. code:: python 
-    # proc
-    ## Create Dataset Instance
-    ds = PymuDocDataset(pdf_bytes)
-    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
-Users can implement their own composition functions as needed. For example, the ``apply`` method can be used to count the pages of a ``pdf`` file.
-.. code:: python 
-    from magic_pdf.data.data_reader_writer import  FileBasedDataReader
-    from magic_pdf.data.dataset import PymuDocDataset
-    # args
-    pdf_file_name = "abc.pdf"  # replace with the real pdf path
-    # read bytes
-    reader1 = FileBasedDataReader("")
-    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content
-    # proc
-    ## Create Dataset Instance
-    ds = PymuDocDataset(pdf_bytes)
-    def count_page(ds)-> int:
-        return len(ds)
-    print("page number: ", ds.apply(count_page)) # will output the page count of `abc.pdf`
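The `apply` convention shown above can be tried without installing magic-pdf. The sketch below uses toy stand-ins: `ToyDataset`, `ToyResult`, `analyze`, and `count_chars` are all hypothetical names that only mimic the calling convention (`proc(self, ...)` for the dataset, `proc(deep_copy, ...)` for a result). They are not the library's classes.

```python
import copy
from typing import Any, Callable


class ToyDataset:
    """Hypothetical stand-in for a Dataset; it just holds a list of page strings."""

    def __init__(self, pages):
        self._pages = list(pages)

    def __len__(self):
        return len(self._pages)

    def apply(self, proc: Callable, *args, **kwargs) -> Any:
        # Same convention as the Dataset listing: proc(self, *args, **kwargs)
        return proc(self, *args, **kwargs)


class ToyResult:
    """Hypothetical stand-in for a stage result that guards its internal data."""

    def __init__(self, data):
        self._data = data

    def apply(self, proc: Callable, *args, **kwargs) -> Any:
        # Hand proc a deep copy so it cannot mutate the stored result
        return proc(copy.deepcopy(self._data), *args, **kwargs)


def analyze(ds: ToyDataset, upper: bool = False) -> ToyResult:
    """Toy first stage: optionally upper-case every page."""
    pages = [p.upper() if upper else p for p in ds._pages]
    return ToyResult(pages)


def count_chars(pages) -> int:
    """Toy second stage: total number of characters across pages."""
    return sum(len(p) for p in pages)


ds = ToyDataset(["ab", "cde"])
total = ds.apply(analyze, upper=True).apply(count_chars)
print(total)
```

Because each stage returns an object with its own `apply`, stages chain left to right, and extra keyword arguments such as `upper=True` are forwarded to the callable unchanged.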
--- a/scripts/download_models.py
+++ b/scripts/download_models.py
-import json
-import shutil
-import os
-import requests
-from modelscope import snapshot_download
-def download_json(url):
-    # Download the JSON file
-    response = requests.get(url)
-    response.raise_for_status()  # Check that the request succeeded
-    return response.json()
-def download_and_modify_json(url, local_filename, modifications):
-    if os.path.exists(local_filename):
-        data = json.load(open(local_filename))
-        config_version = data.get('config_version', '0.0.0')
-        if config_version < '1.2.0':
-            data = download_json(url)
-    else:
-        data = download_json(url)
-    # Apply the modifications
-    for key, value in modifications.items():
-        data[key] = value
-    # Save the modified content
-    with open(local_filename, 'w', encoding='utf-8') as f:
-        json.dump(data, f, ensure_ascii=False, indent=4)
-if __name__ == '__main__':
-    mineru_patterns = [
-        # "models/Layout/LayoutLMv3/*",
-        "models/Layout/YOLO/*",
-        "models/MFD/YOLO/*",
-        "models/MFR/unimernet_hf_small_2503/*",
-        "models/OCR/paddleocr_torch/*",
-        # "models/TabRec/TableMaster/*",
-        # "models/TabRec/StructEqTable/*",
-    ]
-    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)
-    layoutreader_model_dir = snapshot_download('ppaanngggg/layoutreader')
-    model_dir = model_dir + '/models'
-    print(f'model_dir is: {model_dir}')
-    print(f'layoutreader_model_dir is: {layoutreader_model_dir}')
-    # paddleocr_model_dir = model_dir + '/OCR/paddleocr'
-    # user_paddleocr_dir = os.path.expanduser('~/.paddleocr')
-    # if os.path.exists(user_paddleocr_dir):
-    #     shutil.rmtree(user_paddleocr_dir)
-    # shutil.copytree(paddleocr_model_dir, user_paddleocr_dir)
-    json_url = 'https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/magic-pdf.template.json'
-    config_file_name = 'magic-pdf.json'
-    home_dir = os.path.expanduser('~')
-    config_file = os.path.join(home_dir, config_file_name)
-    json_mods = {
-        'models-dir': model_dir,
-        'layoutreader-model-dir': layoutreader_model_dir,
-    }
-    download_and_modify_json(json_url, config_file, json_mods)
-    print(f'The configuration file has been configured successfully, the path is: {config_file}')
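The version gate in `download_and_modify_json` compares `config_version` with `'1.2.0'` as strings. String comparison goes character by character, so a value such as `'1.10.0'` sorts below `'1.2.0'`. A minimal sketch of a numeric comparison follows, assuming versions consist only of dot-separated integers; `parse_version` and `needs_refresh` are hypothetical helpers, not part of the script.

```python
def parse_version(text: str) -> tuple:
    """Turn '1.2.0' into (1, 2, 0); assumes purely numeric, dot-separated parts."""
    return tuple(int(part) for part in text.split("."))


def needs_refresh(config_version: str, minimum: str = "1.2.0") -> bool:
    """Return True when config_version is numerically older than minimum."""
    return parse_version(config_version) < parse_version(minimum)


for candidate in ("0.0.0", "1.1.9", "1.2.0", "1.10.0"):
    print(candidate, needs_refresh(candidate))
```

Tuples of integers compare element by element, which matches the usual reading of dotted version numbers.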
--- a/scripts/download_models_hf.py
+++ b/scripts/download_models_hf.py
-import json
-import os
-import shutil
-import requests
-from huggingface_hub import snapshot_download
-def download_json(url):
-    # Download the JSON file
-    response = requests.get(url)
-    response.raise_for_status()  # Check that the request succeeded
-    return response.json()
-def download_and_modify_json(url, local_filename, modifications):
-    if os.path.exists(local_filename):
-        data = json.load(open(local_filename))
-        config_version = data.get('config_version', '0.0.0')
-        if config_version < '1.2.0':
-            data = download_json(url)
-    else:
-        data = download_json(url)
-    # Apply the modifications
-    for key, value in modifications.items():
-        data[key] = value
-    # Save the modified content
-    with open(local_filename, 'w', encoding='utf-8') as f:
-        json.dump(data, f, ensure_ascii=False, indent=4)
-if __name__ == '__main__':
-    mineru_patterns = [
-        # "models/Layout/LayoutLMv3/*",
-        "models/Layout/YOLO/*",
-        "models/MFD/YOLO/*",
-        "models/MFR/unimernet_hf_small_2503/*",
-        "models/OCR/paddleocr_torch/*",
-        # "models/TabRec/TableMaster/*",
-        # "models/TabRec/StructEqTable/*",
-    ]
-    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)
-    layoutreader_pattern = [
-        "*.json",
-        "*.safetensors",
-    ]
-    layoutreader_model_dir = snapshot_download('hantian/layoutreader', allow_patterns=layoutreader_pattern)
-    model_dir = model_dir + '/models'
-    print(f'model_dir is: {model_dir}')
-    print(f'layoutreader_model_dir is: {layoutreader_model_dir}')
-    # paddleocr_model_dir = model_dir + '/OCR/paddleocr'
-    # user_paddleocr_dir = os.path.expanduser('~/.paddleocr')
-    # if os.path.exists(user_paddleocr_dir):
-    #     shutil.rmtree(user_paddleocr_dir)
-    # shutil.copytree(paddleocr_model_dir, user_paddleocr_dir)
-    json_url = 'https://github.com/opendatalab/MinerU/raw/master/magic-pdf.template.json'
-    config_file_name = 'magic-pdf.json'
-    home_dir = os.path.expanduser('~')
-    config_file = os.path.join(home_dir, config_file_name)
-    json_mods = {
-        'models-dir': model_dir,
-        'layoutreader-model-dir': layoutreader_model_dir,
-    }
-    download_and_modify_json(json_url, config_file, json_mods)
-    print(f'The configuration file has been configured successfully, the path is: {config_file}')