docs: remove outdated documentation files

- Deleted .readthedocs.yaml files from multiple directories
- Removed outdated API and user guide documentation files
- Deleted command line usage examples
- Removed CUDA acceleration guide

cf5c8f47 · myhloli · cb57e84c
Commit cf5c8f47 authored Jun 13, 2025 by myhloli
20 changed files
--- a/next_docs/zh_cn/_static/image/web_demo_1.png
+++ b/next_docs/zh_cn/_static/image/web_demo_1.png
--- a/next_docs/zh_cn/additional_notes/faq.rst
+++ b/next_docs/zh_cn/additional_notes/faq.rst
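-Frequently Asked Questions
-==========================
-
-1. On newer versions of macOS, running pip install magic-pdf[full] fails with zsh: no matches found: magic-pdf[full]
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-On macOS, the default shell changed from Bash to Z shell (zsh), and zsh applies special handling to certain kinds of string matching, which can cause the no matches found error. Disable this globbing behavior on the command line, then run the install command again: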
-
-.. code:: bash
-
-   setopt no_nomatch
-   pip install magic-pdf[full]
-
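-2. _pickle.UnpicklingError: invalid load key, 'v'. is raised during use
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-This is probably caused by an incomplete model file download. Try downloading the model files again and retry. Reference: https://github.com/opendatalab/MinerU/issues/143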
-
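-3. Where should the model files be downloaded, and how should models-dir be configured?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The path to the model files is configured in "magic-pdf.json" through the following entry:
-
-.. code:: json
-
-   {
-     "models-dir": "/tmp/models"
-   }
-
-This path must be an absolute path, not a relative one. To get the absolute path, run "pwd" inside the models directory.
-Reference: https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874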
-
-4. On Ubuntu 22.04 under WSL2, ``ImportError: libGL.so.1: cannot open shared object file: No such file or directory`` is raised
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Ubuntu 22.04 under WSL2 lacks the ``libgl`` library. Install it with the following command:
-
-.. code:: bash
-
-   sudo apt-get install libgl1-mesa-glx
-
-Reference: https://github.com/opendatalab/MinerU/issues/388
-
-5. ``ModuleNotFoundError: No module named 'fairscale'`` is raised
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Uninstall the module and install it again:
-
-.. code:: bash
-
-   pip uninstall fairscale
-   pip install fairscale
-
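-Reference: https://github.com/opendatalab/MinerU/issues/411
-
-6. On some newer devices such as the H100, text parsed with CUDA-accelerated OCR is garbled
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-CUDA 11 has poor compatibility with newer GPUs. Upgrade the CUDA version used by Paddle: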
-
-.. code:: bash
-
-   pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
-
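-Reference: https://github.com/opendatalab/MinerU/issues/558
-
-7. On some Linux servers, the program fails immediately on startup with ``非法指令 (核心已转储)`` or ``Illegal instruction (core dumped)``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The server's CPU may not support the AVX/AVX2 instruction sets, or the CPU supports them but the operations team has disabled them. Try asking the operations team to lift the restriction, or switch to a different server.
-
-References: https://github.com/opendatalab/MinerU/issues/591 , https://github.com/opendatalab/MinerU/issues/736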
--- a/next_docs/zh_cn/additional_notes/glossary.rst
+++ b/next_docs/zh_cn/additional_notes/glossary.rst
-
-
-Glossary
-===========
-
-1. jsonl 
-    TODO: add description
-
-2. magic-pdf.json
-    TODO: add description
-
--- a/next_docs/zh_cn/additional_notes/known_issues.rst
+++ b/next_docs/zh_cn/additional_notes/known_issues.rst
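-Known Issues
-============
-
-  Reading order is determined by a model that sorts readable content by its spatial distribution; in extremely complex layouts, some regions may come out in the wrong order
-  Vertical text is not supported
-  Tables of contents and lists are recognized by rules, so a few uncommon list forms may not be recognized
-  There is only one heading level; heading hierarchy is not currently supported
-  Code blocks are not yet supported by the layout model
-  Comic books, art albums, primary school textbooks, and exercise sets cannot yet be parsed well
-  Table recognition may produce row/column errors on complex tables
-  On PDFs in less common languages, OCR may produce inaccurate characters (for example, diacritics in Latin scripts or easily confused characters in Arabic)
-  Some formulas may fail to render in Markdown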
-
--- a/next_docs/zh_cn/conf.py
+++ b/next_docs/zh_cn/conf.py
-# Configuration file for the Sphinx documentation builder.
-#
-# This file only contains a selection of the most common options. For a full
-# list see the documentation:
-# https://www.sphinx-doc.org/en/master/usage/configuration.html
-
-# -- Path setup --------------------------------------------------------------
-
-# If extensions (or modules to document with autodoc) are in another directory,
-# add these directories to sys.path here. If the directory is relative to the
-# documentation root, use os.path.abspath to make it absolute, like shown here.
-
-import os
-import subprocess
-import sys
-
-from sphinx.ext import autodoc
-from docutils import nodes
-from docutils.parsers.rst import Directive
-
-def install(package):
-    subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
-
-
-requirements_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'requirements.txt'))
-if os.path.exists(requirements_path):
-    with open(requirements_path) as f:
-        packages = f.readlines()
-    for package in packages:
-        install(package.strip())
-
-sys.path.insert(0, os.path.abspath('../..'))
-
-# -- Project information -----------------------------------------------------
-
-project = 'MinerU'
-copyright = '2024, MinerU Contributors'
-author = 'OpenDataLab'
-
-# The full version, including alpha/beta/rc tags
-version_file = '../../magic_pdf/libs/version.py'
-with open(version_file) as f:
-    exec(compile(f.read(), version_file, 'exec'))
-__version__ = locals()['__version__']
-# The short X.Y version
-version = __version__
-# The full version, including alpha/beta/rc tags
-release = __version__
-
-# -- General configuration ---------------------------------------------------
-
-# Add any Sphinx extension module names here, as strings. They can be
-# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
-# ones.
-extensions = [
-    'sphinx.ext.napoleon',
-    'sphinx.ext.viewcode',
-    'sphinx.ext.intersphinx',
-    'sphinx_copybutton',
-    'sphinx.ext.autodoc',
-    'sphinx.ext.autosummary',
-    'sphinx.ext.inheritance_diagram',
-    'myst_parser',
-    'sphinxarg.ext',
-    'sphinxcontrib.autodoc_pydantic',
-]
-
-# class hierarchy diagram
-inheritance_graph_attrs = dict(rankdir="LR", size='"8.0, 12.0"', fontsize=14, ratio='compress')
-inheritance_node_attrs = dict(shape='ellipse', fontsize=14, height=0.75)
-inheritance_edge_attrs = dict(arrow='vee')
-
-autodoc_pydantic_model_show_json = True
-autodoc_pydantic_model_show_config_summary = False
-
-# Add any paths that contain templates here, relative to this directory.
-templates_path = ['_templates']
-
-# List of patterns, relative to source directory, that match files and
-# directories to ignore when looking for source files.
-# This pattern also affects html_static_path and html_extra_path.
-exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
-
-# Exclude the prompt "$" when copying code
-copybutton_prompt_text = r'\$ '
-copybutton_prompt_is_regexp = True
-
-language = 'zh_CN'
-
-# -- Options for HTML output -------------------------------------------------
-
-# The theme to use for HTML and HTML Help pages.  See the documentation for
-# a list of builtin themes.
-#
-html_theme = 'sphinx_book_theme'
-html_logo = '_static/image/logo.png'
-html_theme_options = {
-    'path_to_docs': 'next_docs/zh_cn',
-    'repository_url': 'https://github.com/opendatalab/MinerU',
-    'use_repository_button': True,
-}
-# Add any paths that contain custom static files (such as style sheets) here,
-# relative to this directory. They are copied after the builtin static files,
-# so a file named "default.css" will overwrite the builtin "default.css".
-# html_static_path = ['_static']
-
-# Mock out external dependencies here.
-autodoc_mock_imports = [
-    'cpuinfo',
-    'torch',
-    'transformers',
-    'psutil',
-    'prometheus_client',
-    'sentencepiece',
-    'vllm.cuda_utils',
-    'vllm._C',
-    'numpy',
-    'tqdm',
-]
-
-
-class MockedClassDocumenter(autodoc.ClassDocumenter):
-    """Remove note about base class when a class is derived from object."""
-
-    def add_line(self, line: str, source: str, *lineno: int) -> None:
-        if line == '   Bases: :py:class:`object`':
-            return
-        super().add_line(line, source, *lineno)
-
-
-autodoc.ClassDocumenter = MockedClassDocumenter
-
-navigation_with_keys = False
-
-
-# add custom directive 
-
-
-class VideoDirective(Directive):
-    required_arguments = 1
-    optional_arguments = 0
-    final_argument_whitespace = True
-    option_spec = {}
-
-    def run(self):
-        url = self.arguments[0]
-        video_node = nodes.raw('', f'<iframe width="560" height="315" src="{url}" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>', format='html')
-        return [video_node]
-
-def setup(app):
-    app.add_directive('video', VideoDirective)
\ No newline at end of file
--- a/next_docs/zh_cn/index.rst
+++ b/next_docs/zh_cn/index.rst
-.. xtuner documentation master file, created by
-   sphinx-quickstart on Tue Jan  9 16:33:06 2024.
-   You can adapt this file completely to your liking, but it should at least
-   contain the root `toctree` directive.
-
-Welcome to the MinerU Documentation
-==============================================
-
-.. figure:: ./_static/image/logo.png
-  :align: center
-  :alt: mineru
-  :class: no-scaled-link
-
-.. raw:: html
-
-   <p style="text-align:center">
-   <strong> A one-stop, high-quality, open-source document extraction tool
-   </strong>
-   </p>
-
-   <p style="text-align:center">
-   <script async defer src="https://buttons.github.io/buttons.js"></script>
-   <a class="github-button" href="https://github.com/opendatalab/MinerU" data-show-count="true" data-size="large" aria-label="Star">Star</a>
-   <a class="github-button" href="https://github.com/opendatalab/MinerU/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
-   <a class="github-button" href="https://github.com/opendatalab/MinerU/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
-   </p>
-
-
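-Project Introduction
---------------------
-
-MinerU is a tool that converts PDFs into machine-readable formats (such as Markdown and JSON), from which content can easily be extracted into any format.
-MinerU was born during the pre-training of `InternLM <https://github.com/InternLM/InternLM>`__. We focus on solving symbol conversion problems in scientific literature and hope to contribute to technological development in the era of large models.
-Compared with well-known commercial products in China and abroad, MinerU is still young. If you run into problems or the results fall short of expectations, please report them on the `issue <https://github.com/opendatalab/MinerU/issues>`__ page and **attach the relevant PDF**.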
-
-.. video:: https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
-
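-Key Features
-------------
-
-  Removes headers, footers, footnotes, page numbers, and similar elements to keep the text semantically coherent
-  Outputs text in human reading order, for single-column, multi-column, and complex layouts
-  Preserves the structure of the original document, including headings, paragraphs, and lists
-  Extracts images, image captions, tables, table titles, and footnotes
-  Automatically recognizes formulas in the document and converts them to LaTeX
-  Automatically recognizes tables in the document and converts them to LaTeX or HTML
-  Automatically detects scanned PDFs and garbled PDFs and enables OCR for them
-  OCR supports detection and recognition of 84 languages
-  Supports multiple output formats, such as Markdown for multimodal and NLP use, JSON sorted by reading order, and an information-rich intermediate format
-  Supports multiple visualizations, including layout and span visualization, for efficient checking of output quality
-  Supports both CPU and GPU environments
-  Compatible with Windows, Linux, and Mac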
-
-
-User Guide
-------------
-.. toctree::
-   :maxdepth: 2
-   :caption: User Guide
-
-   user_guide
-
-
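-API Reference
--------------
-This chapter covers the details of functions, classes, and class methods.
-
-The API documentation is currently available only in English; please switch to the English version.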
-
-
-Appendix
------------------
-.. toctree::
-   :maxdepth: 1
-   :caption: Appendix
-
-   additional_notes/known_issues
-   additional_notes/faq
-   additional_notes/glossary
-
-
--- a/next_docs/zh_cn/make.bat
+++ b/next_docs/zh_cn/make.bat
-@ECHO OFF
-
-pushd %~dp0
-
-REM Command file for Sphinx documentation
-
-if "%SPHINXBUILD%" == "" (
-	set SPHINXBUILD=sphinx-build
-)
-set SOURCEDIR=.
-set BUILDDIR=_build
-
-%SPHINXBUILD% >NUL 2>NUL
-if errorlevel 9009 (
-	echo.
-	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
-	echo.installed, then set the SPHINXBUILD environment variable to point
-	echo.to the full path of the 'sphinx-build' executable. Alternatively you
-	echo.may add the Sphinx directory to PATH.
-	echo.
-	echo.If you don't have Sphinx installed, grab it from
-	echo.https://www.sphinx-doc.org/
-	exit /b 1
-)
-
-if "%1" == "" goto help
-
-%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
-goto end
-
-:help
-%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
-
-:end
-popd
--- a/next_docs/zh_cn/user_guide.rst
+++ b/next_docs/zh_cn/user_guide.rst
-
-
-.. toctree::
-    :maxdepth: 2
-
-    user_guide/install
-    user_guide/quick_start
-    user_guide/tutorial
-    user_guide/data
-    
--- a/next_docs/zh_cn/user_guide/data.rst
+++ b/next_docs/zh_cn/user_guide/data.rst
-
-
-Data
-=========
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Data
-
-   data/dataset
-
-   data/read_api
-
-   data/data_reader_writer 
-
-   data/io
-
-
-
-
--- a/next_docs/zh_cn/user_guide/data/data_reader_writer.rst
+++ b/next_docs/zh_cn/user_guide/data/data_reader_writer.rst
-
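-Data Reader and Writer Classes
-==============================
-
-These classes read bytes from, or write bytes to, different media. If MinerU does not provide a suitable class, you can implement a new one for your own scenario. Doing so is easy: the only requirement is to inherit from DataReader or DataWriter.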
-
-.. code:: python
-
-    class SomeReader(DataReader):
-        def read(self, path: str) -> bytes:
-            pass
-
-        def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
-            pass
-
-
-    class SomeWriter(DataWriter):
-        def write(self, path: str, data: bytes) -> None:
-            pass
-
-        def write_string(self, path: str, data: str) -> None:
-            pass
-
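-Readers may wonder how io differs from this section. At first glance the two look very similar. io provides the basic functionality, while this section focuses on the application level. Users can build their own classes to meet specific application needs, and those classes may share the same basic IO functionality. That is why io exists.
-
-Important Classes
-------------------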
-.. code:: python
-
-    class FileBasedDataReader(DataReader):
-        def __init__(self, parent_dir: str = ''):
-            pass
-
-
-    class FileBasedDataWriter(DataWriter):
-        def __init__(self, parent_dir: str = '') -> None:
-            pass
-
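-The FileBasedDataReader class is initialized with a single parameter, parent_dir. Every method that FileBasedDataReader provides therefore behaves as follows:
-
-#. When reading from an absolute path, parent_dir is ignored.
-#. When reading from a relative path, the path is first joined with parent_dir, and the content is then read from the joined path.
-
-.. note::
-
-    `FileBasedDataWriter` behaves the same way as `FileBasedDataReader`.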
-
-.. code:: python
-
-    class MultiS3Mixin:
-        def __init__(self, default_prefix: str, s3_configs: list[S3Config]):
-            pass
-
-    class MultiBucketS3DataReader(DataReader, MultiS3Mixin):
-        pass
-
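-All read-related methods that MultiBucketS3DataReader provides behave as follows:
-
-#. When reading an object from a full S3-format path, for example s3://test_bucket/test_object, default_prefix is ignored.
-#. When reading an object from a relative path, the path is first joined with default_prefix with bucket_name stripped, and the content is then read. bucket_name is the first element obtained by splitting default_prefix on the delimiter /.
-
-.. note::
-    MultiBucketS3DataWriter behaves similarly to MultiBucketS3DataReader.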
-
-.. code:: python
-
-    class S3DataReader(MultiBucketS3DataReader):
-        pass
-
-S3DataReader is built on MultiBucketS3DataReader but supports only a single bucket. The same applies to S3DataWriter.
-
-Read Examples
--------------
-.. code:: python
-
-    import os 
-    from magic_pdf.data.data_reader_writer import *
-    from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
-    from magic_pdf.data.schemas import S3Config
-
-    # Initialize the reader
-    file_based_reader1 = FileBasedDataReader('')
-
-    ## Read the local file abc
-    file_based_reader1.read('abc')
-
-    file_based_reader2 = FileBasedDataReader('/tmp')
-
-    ## Read the local file /tmp/abc
-    file_based_reader2.read('abc')
-
-    ## Read the local file /tmp/logs/message.txt
-    file_based_reader2.read('/tmp/logs/message.txt')
-
-    # Initialize the multi-bucket S3 reader
-    bucket = "bucket"               # replace with a valid bucket
-    ak = "ak"                       # replace with a valid access key
-    sk = "sk"                       # replace with a valid secret key
-    endpoint_url = "endpoint_url"   # replace with a valid endpoint_url
-
-    bucket_2 = "bucket_2"               # replace with a valid bucket
-    ak_2 = "ak_2"                       # replace with a valid access key
-    sk_2 = "sk_2"                       # replace with a valid secret key
-    endpoint_url_2 = "endpoint_url_2"   # replace with a valid endpoint_url
-
-    test_prefix = 'test/unittest'
-    multi_bucket_s3_reader1 = MultiBucketS3DataReader(f"{bucket}/{test_prefix}", [S3Config(
-            bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
-        ),
-        S3Config(
-            bucket_name=bucket_2,
-            access_key=ak_2,
-            secret_key=sk_2,
-            endpoint_url=endpoint_url_2,
-        )])
-
-    ## Read the object s3://{bucket}/{test_prefix}/abc
-    multi_bucket_s3_reader1.read('abc')
-
-    ## Read the object s3://{bucket}/{test_prefix}/efg
-    multi_bucket_s3_reader1.read(f's3://{bucket}/{test_prefix}/efg')
-
-    ## Read the object s3://{bucket_2}/{test_prefix}/abc
-    multi_bucket_s3_reader1.read(f's3://{bucket_2}/{test_prefix}/abc')
-
-    # Initialize the S3 reader
-    s3_reader1 = S3DataReader(
-        test_prefix,
-        bucket,
-        ak,
-        sk,
-        endpoint_url
-    )
-
-    ## Read the object s3://{bucket}/{test_prefix}/abc
-    s3_reader1.read('abc')
-
-    ## Read the object s3://{bucket}/efg
-    s3_reader1.read(f's3://{bucket}/efg')
-
-
-Write Examples
---------------
-.. code:: python
-
-    import os
-    from magic_pdf.data.data_reader_writer import *
-    from magic_pdf.data.data_reader_writer import MultiBucketS3DataWriter
-    from magic_pdf.data.schemas import S3Config
-
-    # Initialize the writer
-    file_based_writer1 = FileBasedDataWriter("")
-
-    ## Write the data 123 to abc
-    file_based_writer1.write("abc", "123".encode())
-
-    ## Write the data 123 to abc
-    file_based_writer1.write_string("abc", "123")
-
-    file_based_writer2 = FileBasedDataWriter("/tmp")
-
-    ## Write the data 123 to /tmp/abc
-    file_based_writer2.write_string("abc", "123")
-
-    ## Write the data 123 to /tmp/logs/message.txt
-    file_based_writer2.write_string("/tmp/logs/message.txt", "123")
-
-    # Initialize the multi-bucket S3 writer
-    bucket = "bucket"               # replace with a valid bucket
-    ak = "ak"                       # replace with a valid access key
-    sk = "sk"                       # replace with a valid secret key
-    endpoint_url = "endpoint_url"   # replace with a valid endpoint_url
-
-    bucket_2 = "bucket_2"               # replace with a valid bucket
-    ak_2 = "ak_2"                       # replace with a valid access key
-    sk_2 = "sk_2"                       # replace with a valid secret key
-    endpoint_url_2 = "endpoint_url_2"   # replace with a valid endpoint_url
-
-    test_prefix = "test/unittest"
-    multi_bucket_s3_writer1 = MultiBucketS3DataWriter(
-        f"{bucket}/{test_prefix}",
-        [
-            S3Config(
-                bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
-            ),
-            S3Config(
-                bucket_name=bucket_2,
-                access_key=ak_2,
-                secret_key=sk_2,
-                endpoint_url=endpoint_url_2,
-            ),
-        ],
-    )
-
-    ## Write the data 123 to s3://{bucket}/{test_prefix}/abc
-    multi_bucket_s3_writer1.write_string("abc", "123")
-
-    ## Write the data 123 to s3://{bucket}/{test_prefix}/abc
-    multi_bucket_s3_writer1.write("abc", "123".encode())
-
-    ## Write the data 123 to s3://{bucket}/{test_prefix}/efg
-    multi_bucket_s3_writer1.write(f"s3://{bucket}/{test_prefix}/efg", "123".encode())
-
-    ## Write the data 123 to s3://{bucket_2}/{test_prefix}/abc
-    multi_bucket_s3_writer1.write(f's3://{bucket_2}/{test_prefix}/abc', '123'.encode())
-
-    # Initialize the S3 writer
-    s3_writer1 = S3DataWriter(test_prefix, bucket, ak, sk, endpoint_url)
-
-    ## Write the data 123 to s3://{bucket}/{test_prefix}/abc
-    s3_writer1.write("abc", "123".encode())
-
-    ## Write the data 123 to s3://{bucket}/{test_prefix}/abc
-    s3_writer1.write_string("abc", "123")
-
-    ## Write the data 123 to s3://{bucket}/efg
-    s3_writer1.write(f"s3://{bucket}/efg", "123".encode())
-
--- a/next_docs/zh_cn/user_guide/data/dataset.rst
+++ b/next_docs/zh_cn/user_guide/data/dataset.rst
-
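-Dataset
-=======
-
-Importing Data Classes
-----------------------
-
-Dataset
-^^^^^^^^
-
-Each PDF or image forms one Dataset. PDFs fall into two categories, parsed with either the :ref:`TXT <digital_method_section>` or the :ref:`OCR <ocr_method_section>` method. An image yields an ImageDataset, which is a subclass of Dataset; a PDF file yields a PymuDocDataset. The difference between the two is that ImageDataset supports only the OCR parsing method, while PymuDocDataset supports both OCR and TXT.
-
-.. note::
-
-    In practice, some PDFs are generated from images, which means they do not support the `TXT` method. Currently, it is up to the user to make sure the `TXT` method is not called on image-generated PDFs.
-
-PDF Parsing Methods
--------------------
-
-.. _ocr_method_section:
-
-OCR
-^^^^
-Extracts characters using optical character recognition.
-
-.. _digital_method_section:
-
-TXT
-^^^^^^^^
-Extracts characters using a third-party library; pymupdf is currently used.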
-
--- a/next_docs/zh_cn/user_guide/data/io.rst
+++ b/next_docs/zh_cn/user_guide/data/io.rst
-
-
-IO
-====
-
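-The IO module reads bytes from, or writes bytes to, different media. Currently, S3Reader and S3Writer are provided for AWS S3-compatible media, and HttpReader and HttpWriter for remote HTTP files. If MinerU does not provide a suitable class, you can implement a new one for your own scenario. Doing so is easy: the only requirement is to inherit from IOReader or IOWriter.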
-
-.. code:: python
-
-    class SomeReader(IOReader):
-        def read(self, path: str) -> bytes:
-            pass
-
-        def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
-            pass
-
-
-    class SomeWriter(IOWriter):
-        def write(self, path: str, data: bytes) -> None:
-            pass
-        
--- a/next_docs/zh_cn/user_guide/data/read_api.rst
+++ b/next_docs/zh_cn/user_guide/data/read_api.rst
-
-
-read_api
-=========
-
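-Reads content from a file or directory to create a Dataset. Currently, several functions are provided that cover some scenarios. If you have a new scenario that most users are likely to encounter, you can post a detailed description on the official GitHub issues page. It is also easy to implement your own read-related functions.
-
-Important Functions
--------------------
-
-read_jsonl
-^^^^^^^^^^^^^^^^
-
-Reads content from a JSONL file on the local machine or on remote S3. To learn more about JSONL, see :doc:`../../additional_notes/glossary`.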
-
-.. code:: python
-
-    from magic_pdf.data.read_api import *
-    from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
-    from magic_pdf.data.schemas import S3Config
-
-    # Read a local JSONL file
-    datasets = read_jsonl("tt.jsonl", None)   # replace with a valid file
-
-    # Read a JSONL file from S3
-
-    bucket = "bucket_1"                     # replace with a valid S3 bucket
-    ak = "access_key_1"                     # replace with a valid S3 access key
-    sk = "secret_key_1"                     # replace with a valid S3 secret key
-    endpoint_url = "endpoint_url_1"         # replace with a valid S3 endpoint URL
-
-    bucket_2 = "bucket_2"                   # replace with a valid S3 bucket
-    ak_2 = "access_key_2"                   # replace with a valid S3 access key
-    sk_2 = "secret_key_2"                   # replace with a valid S3 secret key
-    endpoint_url_2 = "endpoint_url_2"       # replace with a valid S3 endpoint URL
-
-    s3configs = [
-        S3Config(
-            bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
-        ),
-        S3Config(
-            bucket_name=bucket_2,
-            access_key=ak_2,
-            secret_key=sk_2,
-            endpoint_url=endpoint_url_2,
-        ),
-    ]
-
-    s3_reader = MultiBucketS3DataReader(bucket, s3configs)
-
-    datasets = read_jsonl(f"s3://bucket_1/tt.jsonl", s3_reader)  # replace with a valid S3 JSONL file
-
-
-read_local_pdfs
-^^^^^^^^^^^^^^^^
-
-Reads PDF files from a path or directory.
-
-.. code:: python
-
-    from magic_pdf.data.read_api import *
-
-    # Read from a PDF path
-    datasets = read_local_pdfs("tt.pdf")  # replace with a valid file
-
-    # Read the PDF files in a directory
-    datasets = read_local_pdfs("pdfs/")   # replace with a valid directory
-
-read_local_images
-^^^^^^^^^^^^^^^^^^^
-
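-Reads images from a path or directory.
-
-.. code:: python
-
-    from magic_pdf.data.read_api import *
-
-    # Read from an image path
-    datasets = read_local_images("tt.png")  # replace with a valid file
-
-    # Read the files in a directory whose names end with a suffix listed in suffixes
-    datasets = read_local_images("images/", suffixes=["png", "jpg"])  # replace with a valid directory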
--- a/next_docs/zh_cn/user_guide/install.rst
+++ b/next_docs/zh_cn/user_guide/install.rst
-
-Installation
-==============
-
-.. toctree::
-   :maxdepth: 1
-   :caption: Installation Docs
-
-   install/install
-   install//boost_with_cuda
-   install/download_model_weight_files
-
-
--- a/next_docs/zh_cn/user_guide/install/boost_with_cuda.rst
+++ b/next_docs/zh_cn/user_guide/install/boost_with_cuda.rst
--- a/next_docs/zh_cn/user_guide/install/download_model_weight_files.rst
+++ b/next_docs/zh_cn/user_guide/install/download_model_weight_files.rst
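-Download Model Weight Files
-===========================
-
-Model downloads fall into two cases: the initial download, and updating an existing model directory. Refer to the corresponding section for instructions.
-
-Downloading the Model Files for the First Time
---------------------------------------------------
-
-Model files can be downloaded from Hugging Face or ModelScope. For network reasons, users in mainland China may fail to reach Hugging Face and should use ModelScope instead.
-
-
-Method 1: Download the models from Hugging Face
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Use a Python script to download the model files from Hugging Face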
-
-.. code:: bash
-
-   pip install huggingface_hub
-   wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models_hf.py -O download_models_hf.py
-   python download_models_hf.py
-
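-The Python script automatically downloads the model files and sets the model directory in the configuration file.
-
-Method 2: Download the models from ModelScope
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Use a Python script to download the model files from ModelScope
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~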
-
-.. code:: bash
-
-   pip install modelscope
-   wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py -O download_models.py
-   python download_models.py
-
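-The Python script automatically downloads the model files and sets the model directory in the configuration file.
-
-The configuration file is located in the user's home directory and is named ``magic-pdf.json``.
-
-.. admonition:: Tip
-    :class: tip
-
-    The home directory is ``C:\Users\username`` on Windows, ``/home/username`` on Linux, and ``/Users/username`` on macOS.
-
-Updating Previously Downloaded Models
-----------------------------------------
-
-1. Models previously downloaded via git lfs
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. admonition:: Important
-    :class: tip
-
-    Because some users have reported incomplete downloads and corrupted model files when downloading via git lfs, this method is no longer recommended.
-
-    In 0.9.x and later, because PDF-Extract-Kit 1.0 moved to a new repository and added a layout sorting model, the models cannot be updated with ``git pull``; use the Python script to update them in one step.
-
-With magic-pdf <= 0.8.1, if you previously downloaded the model files via git lfs, you can go to the earlier download directory and update the models with ``git pull``.
-
-2. Models previously downloaded from Hugging Face or ModelScope
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-If you previously downloaded the models from Hugging Face or ModelScope, rerun the same model download Python script; it automatically updates the model directory to the latest version.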
\ No newline at end of file
--- a/next_docs/zh_cn/user_guide/install/install.rst
+++ b/next_docs/zh_cn/user_guide/install/install.rst
-
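-Installation
-============
-
-If you run into any installation problems, first consult :doc:`../../additional_notes/faq`. If the parsing results are not as expected, see :doc:`../../additional_notes/known_issues`.
-
-.. admonition:: Warning
-    :class: tip
-
-    **Read before installing: supported hardware and software environments**
-
-    To ensure the stability and reliability of the project, we optimize and test only for specific hardware and software environments during development. Users who deploy and run the project on the recommended system configurations get the best performance and the fewest compatibility issues.
-
-    By concentrating resources and effort on the mainline environments, our team can resolve potential bugs more efficiently and deliver new features promptly.
-
-    In non-mainline environments, the diversity of hardware and software configurations and compatibility issues with third-party dependencies mean we cannot guarantee that the project is 100% usable. Users who want to run the project in a non-recommended environment should first read the documentation and :doc:`../../additional_notes/faq` carefully, since most problems already have a solution there. We also encourage the community to report issues so that we can gradually widen the range of supported environments.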
-
-.. raw:: html
-
-    <style>
-        table, th, td {
-        border: 1px solid black;
-        border-collapse: collapse;
-        }
-    </style>
-    <table>
-    <tr>
-        <td colspan="3" rowspan="2">Operating system</td>
-    </tr>
-    <tr>
-        <td>Linux after 2019</td>
-        <td>Windows 10 / 11</td>
-        <td>macOS 11+</td>
-    </tr>
-    <tr>
-        <td colspan="3">CPU</td>
-        <td>x86_64 / arm64</td>
-        <td>x86_64 (ARM Windows not yet supported)</td>
-        <td>x86_64 / arm64</td>
-    </tr>
-    <tr>
-        <td colspan="3">Memory</td>
-        <td colspan="3">At least 16 GB; 32 GB or more recommended</td>
-    </tr>
-    <tr>
-        <td colspan="3">Storage</td>
-        <td colspan="3">At least 20 GB; an SSD is recommended for best performance</td>
-    </tr>
-    <tr>
-        <td colspan="3">Python version</td>
-        <td colspan="3">>=3.9,<=3.12</td>
-    </tr>
-    <tr>
-        <td colspan="3">Nvidia driver version</td>
-        <td>latest (proprietary driver)</td>
-        <td>latest</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td colspan="3">CUDA environment</td>
-        <td>11.8/12.4/12.6</td>
-        <td>11.8/12.4/12.6</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td colspan="3">CANN environment (NPU support)</td>
-        <td>8.0+(Ascend 910b)</td>
-        <td>None</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td rowspan="2">GPU/MPS hardware support list</td>
-        <td colspan="2">6 GB of VRAM or more</td>
-        <td colspan="2">
-        All GPUs with Tensor Cores produced from Volta (2017) onward <br>
-        6 GB of VRAM or more</td>
-        <td rowspan="2">Apple silicon</td>
-    </tr>
-    </table>
-
-
-Create an Environment
-~~~~~~~~~~~~~~~~~~~~~~
-
-.. code-block:: shell
-
-    conda create -n mineru 'python<3.13' -y
-    conda activate mineru
-    pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
-
-
-Download Model Weight Files
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. code-block:: shell
-
-    pip install huggingface_hub
-    wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models_hf.py -O download_models_hf.py
-    python download_models_hf.py
-
-
-MinerU is now installed. See :doc:`../quick_start`, or read :doc:`boost_with_cuda` to speed up inference.
-
--- a/next_docs/zh_cn/user_guide/quick_start.rst
+++ b/next_docs/zh_cn/user_guide/quick_start.rst
-
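-Quick Start
-==============
-
-Start here to learn the basics of using MinerU. If you have not installed it yet, follow the installation documentation first.
-
-.. toctree::
-    :maxdepth: 1
-    :caption: Quick Start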
-
-    quick_start/command_line
-    quick_start/to_markdown
-
--- a/next_docs/zh_cn/user_guide/quick_start/command_line.rst
+++ b/next_docs/zh_cn/user_guide/quick_start/command_line.rst
-
-
-Command Line
-============
-
-.. code:: bash
-
-   magic-pdf --help
-   Usage: magic-pdf [OPTIONS]
-
-   Options:
-     -v, --version                display the version and exit
-     -p, --path PATH              local pdf filepath or directory  [required]
-     -o, --output-dir PATH        output local directory  [required]
-     -m, --method [ocr|txt|auto]  the method for parsing pdf. ocr: using ocr
-                                  technique to extract information from pdf. txt:
-                                  suitable for the text-based pdf only and
-                                  outperform ocr. auto: automatically choose the
-                                  best method for parsing pdf from ocr and txt.
-                                  without method specified, auto will be used by
-                                  default.
-     -l, --lang TEXT              Input the languages in the pdf (if known) to
-                                  improve OCR accuracy.  Optional. You should
-                                  input "Abbreviation" with language form url: ht
-                                  tps://paddlepaddle.github.io/PaddleOCR/en/ppocr
-                                  /blog/multi_languages.html#5-support-languages-
-                                  and-abbreviations
-     -d, --debug BOOLEAN          Enables detailed debugging information during
-                                  the execution of the CLI commands.
-     -s, --start INTEGER          The starting page for PDF parsing, beginning
-                                  from 0.
-     -e, --end INTEGER            The ending page for PDF parsing, beginning from
-                                  0.
-     --help                       Show this message and exit.
-
-
-   ## show version
-   magic-pdf -v
-
-   ## command line example
-   magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
-
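-``{some_pdf}`` can be a single PDF file or a directory containing multiple PDF files. The result files are stored in the directory ``{some_output_dir}``. The generated result files are listed below:
-
-.. code:: text
-
-   ├── some_pdf.md                          # Markdown file
-   ├── images                               # directory for images
-   ├── some_pdf_layout.pdf                  # layout drawing (includes the layout reading order)
-   ├── some_pdf_middle.json                 # MinerU intermediate processing result
-   ├── some_pdf_model.json                  # model inference result
-   ├── some_pdf_origin.pdf                  # original PDF file
-   ├── some_pdf_spans.pdf                   # drawing of bbox positions at the finest granularity
-   └── some_pdf_content_list.json           # rich-text JSON in reading order
-
-
-.. admonition:: Tip
-   :class: tip
-
-   For more information about the result files, see :doc:`../tutorial/output_file_description`.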
-
--- a/next_docs/zh_cn/user_guide/quick_start/to_markdown.rst
+++ b/next_docs/zh_cn/user_guide/quick_start/to_markdown.rst