Merge pull request #2625 from opendatalab/release-2.0.0

Release 2.0.0
Installation
==============
.. toctree::
:maxdepth: 1
:caption: Installation
install/install
install/boost_with_cuda
install/download_model_weight_files
Accelerating with CUDA
=======================
If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Choose the guide that matches your system:
- :ref:`ubuntu_22_04_lts_section`
- :ref:`windows_10_or_11_section`
- Quick deployment with Docker
.. admonition:: Important
:class: tip
Docker requires a GPU with at least 6 GB of VRAM, and all acceleration features are enabled by default.
Before running this Docker container, you can use the following command to check whether your device supports CUDA acceleration under Docker.
.. code-block:: sh
docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
.. code:: sh
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/china/Dockerfile -O Dockerfile
docker build -t mineru:latest .
docker run -it --name mineru --gpus=all mineru:latest /bin/bash -c "echo 'source /opt/mineru_venv/bin/activate' >> ~/.bashrc && exec bash"
magic-pdf --help
.. _ubuntu_22_04_lts_section:
Ubuntu 22.04 LTS
----------------
1. Check whether the NVIDIA driver is installed
-----------------------------------------------
.. code:: bash
nvidia-smi
If you see output similar to the following, the NVIDIA driver is already installed and you can skip step 2.
.. admonition:: Important
:class: tip
The version shown as ``CUDA Version`` should be >= 12.4; if it is lower than 12.4, please upgrade the driver.
.. code:: text
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07 Driver Version: 572.83 CUDA Version: 12.8 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Ti WDDM | 00000000:01:00.0 On | N/A |
| 0% 51C P8 12W / 200W | 1489MiB / 8192MiB | 5% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
2. Install the driver
---------------------
If no driver is installed, install the proprietary driver with the following commands:
.. code:: bash
sudo apt-get update
sudo apt-get install nvidia-driver-570-server
After the installation finishes, reboot the machine:
.. code:: bash
reboot
3. Install Anaconda
-------------------
If conda is already installed, you can skip this step.
.. code:: bash
wget -U NoSuchBrowser/1.0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
bash Anaconda3-2024.06-1-Linux-x86_64.sh
Enter ``yes`` at the final prompt, then close and reopen the terminal.
4. Create an environment with conda
-----------------------------------
.. code:: bash
conda create -n mineru 'python<3.13' -y
conda activate mineru
5. Install the application
---------------------------
.. code:: bash
pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
.. admonition:: Important
:class: tip
After the download finishes, be sure to verify that the magic-pdf version is correct with the following command:
.. code:: bash
magic-pdf --version
If the version is lower than 1.3.0, please report it to us in an issue.
6. Download the models
-----------------------
See :doc:`download_model_weight_files` for details.
7. Locate the configuration file
---------------------------------
After completing step \ `6. Download the models <#6-下载模型>`__\ , the script automatically generates a magic-pdf.json file in the user directory and configures the default model paths. You can find magic-pdf.json in your user directory.
.. admonition:: Tip
:class: tip
On Linux the user directory is "/home/<username>".
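For reference, a minimal magic-pdf.json looks roughly like the following. This is an illustrative sketch: the exact keys can vary between versions, and the download script from step 6 fills in the real model paths, so the placeholder values below are not meant to be copied literally.
.. code:: json

    {
        "models-dir": "<path-to-downloaded-models>",
        "layoutreader-model-dir": "<path-to-layoutreader-model>",
        "device-mode": "cpu"
    }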
8. First run
------------
Download a sample file from the repository and test it:
.. code:: bash
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/demo/pdfs/small_ocr.pdf
magic-pdf -p small_ocr.pdf -o ./output
9. Test CUDA acceleration
--------------------------
If your GPU has **8 GB** of VRAM or more, you can follow the steps below to test CUDA-accelerated parsing.
**1. Change the value of "device-mode" in the magic-pdf.json configuration file in your user directory**
.. code:: json
{
"device-mode":"cuda"
}
**2. Run the following command to test CUDA acceleration**
.. code:: bash
magic-pdf -p small_ocr.pdf -o ./output
.. admonition:: Tip
:class: tip
You can roughly judge whether CUDA acceleration is working from the per-stage timings printed in the log; CUDA should normally be faster than the CPU.
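As an additional quick check (a minimal sketch, run inside the activated mineru environment), you can confirm that the installed PyTorch build can see the GPU:
.. code:: python

    # Quick sanity check: verify that PyTorch was built with CUDA and can see the GPU.
    import torch

    print(torch.__version__, torch.version.cuda)  # installed version and the CUDA version it was built against
    print(torch.cuda.is_available())              # should print True when CUDA acceleration is usable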
.. _windows_10_or_11_section:
Windows 10/11
--------------
1. Install CUDA and cuDNN
-------------------------
Install a CUDA version that matches torch's requirements; torch currently supports 11.8/12.4/12.6:
- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
2. Install Anaconda
-------------------
If conda is already installed, you can skip this step.
Download link: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Windows-x86_64.exe
3. Create an environment with conda
-----------------------------------
.. code:: bash
conda create -n mineru 'python<3.13' -y
conda activate mineru
4. Install the application
---------------------------
.. code:: bash
pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
.. admonition:: Important
:class: tip
After the download finishes, be sure to verify that the magic-pdf version is correct with the following command:
.. code:: bash
magic-pdf --version
If the version is lower than 1.3.0, please report it to us in an issue.
5. Download the models
-----------------------
See :doc:`download_model_weight_files` for details.
6. Locate the configuration file
---------------------------------
After completing step \ `5. Download the models <#5-下载模型>`__\ , the script automatically generates a magic-pdf.json file in the user directory and configures the default model paths. You can find magic-pdf.json in your user directory.
.. admonition:: Tip
:class: tip
On Windows the user directory is "C:/Users/<username>".
7. First run
------------
Download a sample file from the repository and test it:
.. code:: powershell
wget https://github.com/opendatalab/MinerU/raw/master/demo/pdfs/small_ocr.pdf -O small_ocr.pdf
magic-pdf -p small_ocr.pdf -o ./output
8. Test CUDA acceleration
--------------------------
If your GPU has **8 GB** of VRAM or more, you can follow the steps below to test CUDA-accelerated parsing.
**1. Force-reinstall the CUDA-enabled torch and torchvision** (choose the index-url that matches your CUDA version; see the `torch website <https://pytorch.org/get-started/locally/>`_ for details)
.. code:: bash
pip install --force-reinstall torch==2.6.0 torchvision==0.21.1 "numpy<2.0.0" --index-url https://download.pytorch.org/whl/cu124
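To confirm that a CUDA build was actually installed (a small sketch, not part of the official steps), you can run:
.. code:: python

    # Verify that the reinstalled torch is a CUDA build and can see the GPU.
    import torch

    print(torch.cuda.is_available())       # True if the cu124 build was installed correctly
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA GeForce RTX 3060 Ti"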
**2. Change the value of "device-mode" in the magic-pdf.json configuration file in your user directory**
.. code:: json
{
"device-mode":"cuda"
}
**3. Run the following command to test CUDA acceleration**
.. code:: bash
magic-pdf -p small_ocr.pdf -o ./output
.. admonition:: Tip
:class: tip
You can roughly judge whether CUDA acceleration is working from the per-stage timings printed in the log; CUDA should normally be faster than the CPU.
Download Model Weight Files
============================
Model downloading falls into two cases: an initial download, and an update of an existing model directory. Refer to the corresponding section below for instructions.
Downloading the model files for the first time
-----------------------------------------------
The model files can be downloaded from Hugging Face or ModelScope. Users in mainland China may fail to reach Hugging Face for network reasons; in that case, please use ModelScope.
Option 1: Download the models from Hugging Face
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Use the Python script to download the model files from Hugging Face:
.. code:: bash
pip install huggingface_hub
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models_hf.py -O download_models_hf.py
python download_models_hf.py
The Python script automatically downloads the model files and configures the model directory in the configuration file.
Option 2: Download the models from ModelScope
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Use the Python script to download the model files from ModelScope
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: bash
pip install modelscope
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py -O download_models.py
python download_models.py
The Python script automatically downloads the model files and configures the model directory in the configuration file.
The configuration file can be found in the user directory and is named ``magic-pdf.json``.
.. admonition:: Tip
:class: tip
The user directory is "C:\\Users\\<username>" on Windows, "/home/<username>" on Linux, and "/Users/<username>" on macOS.
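If you want to double-check what the download script wrote, a small sketch like the following reads the generated configuration (the ``models-dir`` and ``device-mode`` key names are taken from the examples in this documentation; adjust them if your version differs):
.. code:: python

    # Sketch: print the model directory and device mode recorded in magic-pdf.json.
    import json
    import os

    config_path = os.path.join(os.path.expanduser("~"), "magic-pdf.json")
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)

    print("models-dir:", config.get("models-dir"))
    print("device-mode:", config.get("device-mode"))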
How to update models downloaded previously
-------------------------------------------
1. Models previously downloaded via git lfs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. admonition:: Important
:class: tip
Because some users reported incomplete downloads and corrupted model files when downloading via git lfs, this method is no longer recommended.
For 0.9.x and later, because PDF-Extract-Kit 1.0 moved to a new repository and added a layout reading-order model, the models cannot be updated with ``git pull``; use the Python script to update them in one step.
For magic-pdf <= 0.8.1, if you previously downloaded the model files via git lfs, you can go to the original download directory and update the models with ``git pull``.
2. Models previously downloaded from Hugging Face or ModelScope
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you previously downloaded the models from Hugging Face or ModelScope, simply rerun the download script you used before; it will automatically update the model directory to the latest version.
Installation
============
If you run into any installation problems, please check :doc:`../../additional_notes/faq` first. If the parsing results are not as expected, see :doc:`../../additional_notes/known_issues`.
.. admonition:: Warning
:class: tip
**Read before installing: supported software and hardware environments**
To ensure the stability and reliability of the project, we only optimize and test against specific software and hardware environments during development. Users who deploy and run the project on the recommended system configuration get the best performance and the fewest compatibility issues.
By focusing our resources on the mainline environment, the team can fix potential bugs more efficiently and develop new features in a timely manner.
In non-mainline environments, because of the diversity of hardware and software configurations and the compatibility issues of third-party dependencies, we cannot guarantee 100% availability of the project. Users who want to run the project in a non-recommended environment should therefore read the documentation and :doc:`../../additional_notes/faq` carefully first; most issues already have solutions in :doc:`../../additional_notes/faq`. Beyond that, we encourage community feedback so that we can gradually widen the supported range.
.. raw:: html
<style>
table, th, td {
border: 1px solid black;
border-collapse: collapse;
}
</style>
<table>
<tr>
<td colspan="3" rowspan="2">操作系统</td>
</tr>
<tr>
<td>Linux after 2019</td>
<td>Windows 10 / 11</td>
<td>macOS 11+</td>
</tr>
<tr>
<td colspan="3">CPU</td>
<td>x86_64 / arm64</td>
<td>x86_64(暂不支持ARM Windows)</td>
<td>x86_64 / arm64</td>
</tr>
<tr>
<td colspan="3">内存</td>
<td colspan="3">大于等于16GB,推荐32G以上</td>
</tr>
<tr>
<td colspan="3">存储空间</td>
<td colspan="3">大于等于20GB,推荐使用SSD以获得最佳性能</td>
</tr>
<tr>
<td colspan="3">python版本</td>
<td colspan="3">>=3.9,<=3.12</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver 版本</td>
<td>latest(专有驱动)</td>
<td>latest</td>
<td>None</td>
</tr>
<tr>
<td colspan="3">CUDA环境</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6</td>
<td>None</td>
</tr>
<tr>
<td colspan="3">CANN环境(NPU支持)</td>
<td>8.0+(Ascend 910b)</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td rowspan="2">GPU/MPS 硬件支持列表</td>
<td colspan="2">显存6G以上</td>
<td colspan="2">
Volta(2017)及之后生产的全部带Tensor Core的GPU <br>
6G显存及以上</td>
<td rowspan="2">apple slicon</td>
</tr>
</table>
Create an environment
~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: shell
conda create -n mineru 'python<3.13' -y
conda activate mineru
pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
Download the model weight files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: shell
pip install huggingface_hub
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models_hf.py -O download_models_hf.py
python download_models_hf.py
MinerU is now installed. See :doc:`../quick_start`, or read :doc:`boost_with_cuda` to speed up inference.
Quick Start
==============
Start here to learn the basics of using MinerU. If you have not installed it yet, please follow the installation guide first.
.. toctree::
:maxdepth: 1
:caption: Quick Start
quick_start/command_line
quick_start/to_markdown
Command Line
============
.. code:: bash
magic-pdf --help
Usage: magic-pdf [OPTIONS]
Options:
-v, --version display the version and exit
-p, --path PATH local pdf filepath or directory [required]
-o, --output-dir PATH output local directory [required]
-m, --method [ocr|txt|auto] the method for parsing pdf. ocr: using ocr
technique to extract information from pdf. txt:
suitable for the text-based pdf only and
outperform ocr. auto: automatically choose the
best method for parsing pdf from ocr and txt.
without method specified, auto will be used by
default.
-l, --lang TEXT Input the languages in the pdf (if known) to
improve OCR accuracy. Optional. You should
input "Abbreviation" with language form url: ht
tps://paddlepaddle.github.io/PaddleOCR/en/ppocr
/blog/multi_languages.html#5-support-languages-
and-abbreviations
-d, --debug BOOLEAN Enables detailed debugging information during
the execution of the CLI commands.
-s, --start INTEGER The starting page for PDF parsing, beginning
from 0.
-e, --end INTEGER The ending page for PDF parsing, beginning from
0.
--help Show this message and exit.
## show version
magic-pdf -v
## command line example
magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
``{some_pdf}`` can be a single PDF file or a directory containing multiple PDF files. The parsed results are written to the ``{some_output_dir}`` directory. The generated files are listed below:
.. code:: text
├── some_pdf.md # markdown file
├── images # directory for extracted images
├── some_pdf_layout.pdf # layout drawing (includes the layout reading order)
├── some_pdf_middle.json # MinerU intermediate processing result
├── some_pdf_model.json # model inference result
├── some_pdf_origin.pdf # original pdf file
├── some_pdf_spans.pdf # drawing of the finest-grained bbox positions
└── some_pdf_content_list.json # rich-text json in reading order
.. admonition:: Tip
:class: tip
For more information about the output files, see :doc:`../tutorial/output_file_description`.
Convert to Markdown
========================
Local file example
^^^^^^^^^^^^^^^^^^
.. code:: python
import os
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod
# args
pdf_file_name = "abc.pdf" # replace with the real pdf path
name_without_suff = pdf_file_name.split(".")[0]
# prepare env
local_image_dir, local_md_dir = "output/images", "output"
image_dir = str(os.path.basename(local_image_dir))
os.makedirs(local_image_dir, exist_ok=True)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
local_md_dir
)
image_dir = str(os.path.basename(local_image_dir))
# read bytes
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name) # read the pdf content
# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)
## inference
if ds.classify() == SupportedPdfParseMethod.OCR:
    infer_result = ds.apply(doc_analyze, ocr=True)
    ## pipeline
    pipe_result = infer_result.pipe_ocr_mode(image_writer)
else:
    infer_result = ds.apply(doc_analyze, ocr=False)
    ## pipeline
    pipe_result = infer_result.pipe_txt_mode(image_writer)
### draw model result on each page
infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))
### draw layout result on each page
pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))
### draw spans result on each page
pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))
### dump markdown
pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)
### dump content list
pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
Object storage example
^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import os
from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
bucket_name = "{Your S3 Bucket Name}" # replace with real bucket name
ak = "{Your S3 access key}" # replace with real s3 access key
sk = "{Your S3 secret key}" # replace with real s3 secret key
endpoint_url = "{Your S3 endpoint_url}" # replace with real s3 endpoint_url
reader = S3DataReader('unittest/tmp/', bucket_name, ak, sk, endpoint_url) # replace `unittest/tmp` with the real s3 prefix
writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
image_writer = S3DataWriter('unittest/tmp/images', bucket_name, ak, sk, endpoint_url)
# args
pdf_file_name = (
"s3://llm-pdf-text-1/unittest/tmp/bug5-11.pdf" # replace with the real s3 path
)
# prepare env
local_dir = "output"
name_without_suff = os.path.basename(pdf_file_name).split(".")[0]
# read bytes
pdf_bytes = reader.read(pdf_file_name) # read the pdf content
# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)
## inference
if ds.classify() == SupportedPdfParseMethod.OCR:
    infer_result = ds.apply(doc_analyze, ocr=True)
    ## pipeline
    pipe_result = infer_result.pipe_ocr_mode(image_writer)
else:
    infer_result = ds.apply(doc_analyze, ocr=False)
    ## pipeline
    pipe_result = infer_result.pipe_txt_mode(image_writer)
### draw model result on each page
infer_result.draw_model(os.path.join(local_dir, f'{name_without_suff}_model.pdf')) # dump to local
### draw layout result on each page
pipe_result.draw_layout(os.path.join(local_dir, f'{name_without_suff}_layout.pdf')) # dump to local
### draw spans result on each page
pipe_result.draw_span(os.path.join(local_dir, f'{name_without_suff}_spans.pdf')) # dump to local
### dump markdown
pipe_result.dump_md(writer, f'{name_without_suff}.md', "unittest/tmp/images") # dump to remote s3
### dump content list
pipe_result.dump_content_list(writer, f"{name_without_suff}_content_list.json", "unittest/tmp/images") # dump to remote s3
Go to :doc:`../data/data_reader_writer` for more **read/write** examples.
Tutorial
===========
Let's learn MinerU by building a minimal project.
.. toctree::
:maxdepth: 1
:caption: Tutorial
tutorial/output_file_description
tutorial/pipeline
Output File Formats
====================
Besides the markdown-related output, the ``magic-pdf`` command also generates several files that are not related to markdown. These files are described one by one below.
some_pdf_layout.pdf
~~~~~~~~~~~~~~~~~~~
The layout of every page consists of one or more boxes. The number at the top-left corner of each box is its reading-order index. In addition, layout.pdf marks out the different content blocks with background color patches.
.. figure:: ../../_static/image/layout_example.png
:alt: layout page example
Layout page example
some_pdf_spans.pdf
~~~~~~~~~~~~~~~~~~
All spans on a page are drawn with outline colors that vary by span type. This file can be used for quality checking, making it easy to spot problems such as missing text or unrecognized interline formulas.
.. figure:: ../../_static/image/spans_example.png
:alt: span page example
Span page example
some_pdf_model.json
~~~~~~~~~~~~~~~~~~~
Structure definition
^^^^^^^^^^^^^^^^^^^^
.. code:: python
from pydantic import BaseModel, Field
from enum import IntEnum
class CategoryType(IntEnum):
    title = 0               # title
    plain_text = 1          # text
    abandon = 2             # headers, footers, page numbers and page annotations
    figure = 3              # image
    figure_caption = 4      # image caption
    table = 5               # table
    table_caption = 6       # table caption
    table_footnote = 7      # table footnote
    isolate_formula = 8     # interline formula
    formula_caption = 9     # label of an interline formula
    embedding = 13          # inline formula
    isolated = 14           # interline formula
    text = 15               # OCR recognition result

class PageInfo(BaseModel):
    page_no: int = Field(description="page number, the first page is 0", ge=0)
    height: int = Field(description="page height", gt=0)
    width: int = Field(description="page width", ge=0)

class ObjectInferenceResult(BaseModel):
    category_id: CategoryType = Field(description="category", ge=0)
    poly: list[float] = Field(description="quadrilateral coordinates: top-left, top-right, bottom-right, bottom-left")
    score: float = Field(description="confidence of the inference result")
    latex: str | None = Field(description="latex parsing result", default=None)
    html: str | None = Field(description="html parsing result", default=None)

class PageInferenceResults(BaseModel):
    layout_dets: list[ObjectInferenceResult] = Field(description="page recognition results", ge=0)
    page_info: PageInfo = Field(description="page metadata")

# The inference results of all pages, placed in a list in page order, form the MinerU inference result
inference_result: list[PageInferenceResults] = []
The poly coordinates are in the format [x0, y0, x1, y1, x2, y2, x3, y3], giving the coordinates of the top-left, top-right, bottom-right and bottom-left corners in turn. |poly coordinate diagram|
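For example, a small helper (illustrative, not part of magic-pdf) can convert a poly into an axis-aligned bounding box:
.. code:: python

    # Convert an 8-value poly [x0, y0, ..., x3, y3] into [x_min, y_min, x_max, y_max].
    def poly_to_bbox(poly):
        xs, ys = poly[0::2], poly[1::2]
        return [min(xs), min(ys), max(xs), max(ys)]

    print(poly_to_bbox([99.19, 100.31, 730.37, 100.31, 730.37, 245.81, 99.19, 245.81]))
    # -> [99.19, 100.31, 730.37, 245.81]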
Example data
^^^^^^^^^^^^
.. code:: json
[
{
"layout_dets": [
{
"category_id": 2,
"poly": [
99.1906967163086,
100.3119125366211,
730.3707885742188,
100.3119125366211,
730.3707885742188,
245.81326293945312,
99.1906967163086,
245.81326293945312
],
"score": 0.9999997615814209
}
],
"page_info": {
"page_no": 0,
"height": 2339,
"width": 1654
}
},
{
"layout_dets": [
{
"category_id": 5,
"poly": [
99.13092803955078,
2210.680419921875,
497.3183898925781,
2210.680419921875,
497.3183898925781,
2264.78076171875,
99.13092803955078,
2264.78076171875
],
"score": 0.9999997019767761
}
],
"page_info": {
"page_no": 1,
"height": 2339,
"width": 1654
}
}
]
some_pdf_middle.json
~~~~~~~~~~~~~~~~~~~~
====================  ==========================================================================
Field                 Description
====================  ==========================================================================
pdf_info              list; each element is a dict with the parse result of one pdf page, see the next table
\_parse_type          ocr \| txt, the mode used for this parse
\_version_name        string, the version of magic-pdf used for this parse
====================  ==========================================================================
**pdf_info** field structure
====================  ==========================================================================
Field                 Description
====================  ==========================================================================
preproc_blocks        intermediate result after pdf preprocessing, not yet segmented into paragraphs
layout_bboxes         layout segmentation result, containing the layout direction (vertical, horizontal) and bbox, sorted in reading order
page_idx              page index, starting from 0
page_size             width and height of the page
\_layout_tree         layout tree structure
images                list; each element is a dict representing an img_block
tables                list; each element is a dict representing a table_block
interline_equations   list; each element is a dict representing an interline_equation_block
discarded_blocks      list; block information returned by the model that should be dropped
para_blocks           result of segmenting preproc_blocks into paragraphs
====================  ==========================================================================
In the table above, ``para_blocks`` is an array of dicts. Each dict is a block structure, and a block supports at most one level of nesting.
**block**
The outer block is called a first-level block. A first-level block contains the following fields
======  ===============================================
Field   Description
======  ===============================================
type    block type (table|image)
bbox    rectangle coordinates of the block
blocks  list; each element is a dict-format second-level block
======  ===============================================
First-level blocks only have the two types "table" and "image"; all other blocks are second-level blocks.
A second-level block contains the following fields
======  ========================================================================
Field   Description
======  ========================================================================
type    block type
bbox    rectangle coordinates of the block
lines   list; each element is a dict representing a line, describing one line of content
======  ========================================================================
Second-level block types in detail
==================  ======================================
type                desc
==================  ======================================
image_body          body of the image
image_caption       caption text of the image
image_footnote      footnote of the image
table_body          body of the table
table_caption       caption text of the table
table_footnote      footnote of the table
text                text block
title               title block
index               index (table of contents) block
list                list block
interline_equation  interline formula block
==================  ======================================
**line**
A line has the following fields
======  =========================================================================
Field   Description
======  =========================================================================
bbox    rectangle coordinates of the line
spans   list; each element is a dict representing a span, the smallest unit of composition
======  =========================================================================
**span**
===================  ==========================================================
Field                Description
===================  ==========================================================
bbox                 rectangle coordinates of the span
type                 type of the span
content \| img_path  text-type spans use content, image and table spans use img_path, storing the actual text or the path of the screenshot
===================  ==========================================================
A span has one of the following types
==================  =================
type                desc
==================  =================
image               image
table               table
text                text
inline_equation     inline formula
interline_equation  interline formula
==================  =================
**Summary**
A span is the smallest storage unit of all elements.
The elements stored in para_blocks are block information.
The block structure is
first-level block (if any) -> second-level block -> line -> span
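As an illustration of this nesting (a sketch based on the structure described above, not an official tool), the text of every span can be collected by walking block -> line -> span:
.. code:: python

    # Walk some_pdf_middle.json and print the text content of every text span.
    import json

    with open("some_pdf_middle.json", encoding="utf-8") as f:
        middle = json.load(f)

    for page in middle["pdf_info"]:
        for block in page["para_blocks"]:
            # first-level table/image blocks nest second-level blocks under "blocks"
            sub_blocks = block["blocks"] if block["type"] in ("table", "image") else [block]
            for sub in sub_blocks:
                for line in sub.get("lines", []):
                    for span in line["spans"]:
                        if span["type"] == "text":
                            print(span["content"])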
.. _示例数据-1:
Example data
^^^^^^^^^^^^
.. code:: json
{
"pdf_info": [
{
"preproc_blocks": [
{
"type": "text",
"bbox": [
52,
61.956024169921875,
294,
82.99800872802734
],
"lines": [
{
"bbox": [
52,
61.956024169921875,
294,
72.0000228881836
],
"spans": [
{
"bbox": [
54.0,
61.956024169921875,
296.2261657714844,
72.0000228881836
],
"content": "dependent on the service headway and the reliability of the departure ",
"type": "text",
"score": 1.0
}
]
}
]
}
],
"layout_bboxes": [
{
"layout_bbox": [
52,
61,
294,
731
],
"layout_label": "V",
"sub_layout": []
}
],
"page_idx": 0,
"page_size": [
612.0,
792.0
],
"_layout_tree": [],
"images": [],
"tables": [],
"interline_equations": [],
"discarded_blocks": [],
"para_blocks": [
{
"type": "text",
"bbox": [
52,
61.956024169921875,
294,
82.99800872802734
],
"lines": [
{
"bbox": [
52,
61.956024169921875,
294,
72.0000228881836
],
"spans": [
{
"bbox": [
54.0,
61.956024169921875,
296.2261657714844,
72.0000228881836
],
"content": "dependent on the service headway and the reliability of the departure ",
"type": "text",
"score": 1.0
}
]
}
]
}
]
}
],
"_parse_type": "txt",
"_version_name": "0.6.1"
}
.. |poly coordinate diagram| image:: ../../_static/image/poly.png
Pipeline
===========
Minimal example
^^^^^^^^^^^^^^^
.. code:: python
import os
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
# args
pdf_file_name = "abc.pdf" # replace with the real pdf path
name_without_suff = pdf_file_name.split(".")[0]
# prepare env
local_image_dir, local_md_dir = "output/images", "output"
image_dir = str(os.path.basename(local_image_dir))
os.makedirs(local_image_dir, exist_ok=True)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
local_md_dir
)
image_dir = str(os.path.basename(local_image_dir))
# read bytes
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name) # read the pdf content
# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
Running the code above produces the following result:
.. code:: bash
output/
├── abc.md
└── images
Setting aside the environment setup (creating directories, importing dependencies, and so on), the code that actually converts the ``pdf`` to ``markdown`` is the following snippet:
.. code::
# read bytes
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name) # read the pdf content
# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
``ds.apply(doc_analyze, ocr=True)`` produces an ``InferenceResult`` object. Calling ``pipe_ocr_mode`` on the ``InferenceResult`` object produces a ``PipeResult`` object.
Calling ``dump_md`` on the ``PipeResult`` object writes the ``markdown`` file to the specified location.
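Written out step by step, the chained call above is equivalent to:
.. code:: python

    # The same pipeline without method chaining.
    infer_result = ds.apply(doc_analyze, ocr=True)           # Dataset -> InferenceResult
    pipe_result = infer_result.pipe_ocr_mode(image_writer)   # InferenceResult -> PipeResult
    pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)  # write the markdown file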
The pipeline execution process is shown in the figure below.
.. image:: ../../_static/image/pipeline.drawio.svg
.. raw:: html
<br> </br>
The process is currently divided into three stages, data, inference, and post-processing, corresponding to the ``Dataset``, ``InferenceResult``, and ``PipeResult`` entities in the figure. They are chained together through methods such as ``apply``, ``doc_analyze``, and ``pipe_ocr_mode``.
.. admonition:: Tip
:class: tip
For more usage examples of Dataset, InferenceResult, and PipeResult, go to :doc:`../quick_start/to_markdown`.
For more detailed information about Dataset, InferenceResult, and PipeResult, please refer to the English MinerU documentation.
Pipeline composition
^^^^^^^^^^^^^^^^^^^^
.. code:: python
class Dataset(ABC):
    @abstractmethod
    def apply(self, proc: Callable, *args, **kwargs):
        """Apply callable method which.

        Args:
            proc (Callable): invoke proc as follows:
                proc(self, *args, **kwargs)

        Returns:
            Any: return the result generated by proc
        """
        pass


class InferenceResult(InferenceResultBase):
    def apply(self, proc: Callable, *args, **kwargs):
        """Apply callable method which.

        Args:
            proc (Callable): invoke proc as follows:
                proc(inference_result, *args, **kwargs)

        Returns:
            Any: return the result generated by proc
        """
        return proc(copy.deepcopy(self._infer_res), *args, **kwargs)

    def pipe_ocr_mode(
        self,
        imageWriter: DataWriter,
        start_page_id=0,
        end_page_id=None,
        debug_mode=False,
        lang=None,
    ) -> PipeResult:
        pass


class PipeResult:
    def apply(self, proc: Callable, *args, **kwargs):
        """Apply callable method which.

        Args:
            proc (Callable): invoke proc as follows:
                proc(pipeline_result, *args, **kwargs)

        Returns:
            Any: return the result generated by proc
        """
        return proc(copy.deepcopy(self._pipe_res), *args, **kwargs)
The ``Dataset``, ``InferenceResult``, and ``PipeResult`` classes all have an ``apply`` method, which can be used to compose computations across stages.
As shown below, ``MinerU`` provides one way of composing these classes into a computation.
.. code:: python
# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
Users can implement their own composition functions as needed. For example, the ``apply`` method can be used to implement a function that counts the pages of a ``pdf`` file.
.. code:: python
from magic_pdf.data.data_reader_writer import FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
# args
pdf_file_name = "abc.pdf" # replace with the real pdf path
# read bytes
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name) # read the pdf content
# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)
def count_page(ds) -> int:
    return len(ds)
print("page number: ", ds.apply(count_page)) # will output the page count of `abc.pdf`
@@ -2,8 +2,10 @@
 ## Project List
 
-- [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index
-- [gradio_app](./gradio_app/README.md): Build a web app based on gradio
-- [web_api](./web_api/README.md): Web API Based on FastAPI
-- [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
+- Projects compatible with version 2.0:
+  - [gradio_app](./gradio_app/README.md): Web application based on Gradio
+  - ~~[web_demo](./web_demo/README.md): MinerU online [demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/) localized deployment version~~ (Deprecated)
+
+- Projects not yet compatible with version 2.0:
+  - [web_api](./web_api/README.md): Web API based on FastAPI
+  - [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
@@ -2,8 +2,9 @@
 ## Project List
 
-- [llama_index_rag](./llama_index_rag/README_zh-CN.md): Build a lightweight RAG system based on llama_index
-- [gradio_app](./gradio_app/README_zh-CN.md): Web application based on Gradio
-- [web_api](./web_api/README.md): Web API based on FastAPI
-- [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
+- Projects compatible with version 2.0:
+  - [gradio_app](./gradio_app/README_zh-CN.md): Web application based on Gradio
+  - ~~[web_demo](./web_demo/README_zh-CN.md): Localized deployment version of the MinerU online [demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/)~~ (deprecated)
+
+- Projects not yet compatible with version 2.0:
+  - [web_api](./web_api/README.md): Web API based on FastAPI
+  - [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
@@ -4,30 +4,22 @@ import base64
 import os
 import re
 import time
-import uuid
 import zipfile
 from pathlib import Path
 
 import gradio as gr
-import pymupdf
 from gradio_pdf import PDF
 from loguru import logger
 
-from magic_pdf.data.data_reader_writer import FileBasedDataReader
-from magic_pdf.libs.hash_utils import compute_sha256
-from magic_pdf.tools.common import do_parse, prepare_env
+from mineru.cli.common import prepare_env, do_parse, read_fn
+from mineru.utils.hash_utils import str_sha256
 
 
-def read_fn(path):
-    disk_rw = FileBasedDataReader(os.path.dirname(path))
-    return disk_rw.read(os.path.basename(path))
-
-
-def parse_pdf(doc_path, output_dir, end_page_id, is_ocr, layout_mode, formula_enable, table_enable, language):
+def parse_pdf(doc_path, output_dir, end_page_id, is_ocr, formula_enable, table_enable, language):
     os.makedirs(output_dir, exist_ok=True)
 
     try:
-        file_name = f'{str(Path(doc_path).stem)}_{time.time()}'
+        file_name = f'{str(Path(doc_path).stem)}_{time.strftime("%y%m%d_%H%M%S")}'
         pdf_data = read_fn(doc_path)
         if is_ocr:
             parse_method = 'ocr'
@@ -35,17 +27,14 @@ def parse_pdf(doc_path, output_dir, end_page_id, is_ocr, layout_mode, formula_en
             parse_method = 'auto'
         local_image_dir, local_md_dir = prepare_env(output_dir, file_name, parse_method)
         do_parse(
-            output_dir,
-            file_name,
-            pdf_data,
-            [],
-            parse_method,
-            False,
+            output_dir=output_dir,
+            pdf_file_names=[file_name],
+            pdf_bytes_list=[pdf_data],
+            p_lang_list=[language],
+            parse_method=parse_method,
             end_page_id=end_page_id,
-            layout_model=layout_mode,
-            formula_enable=formula_enable,
-            table_enable=table_enable,
-            lang=language,
+            p_formula_enable=formula_enable,
+            p_table_enable=table_enable,
        )
         return local_md_dir, file_name
     except Exception as e:
@@ -96,12 +85,11 @@ def replace_image_with_base64(markdown_text, image_dir_path):
     return re.sub(pattern, replace, markdown_text)
 
 
-def to_markdown(file_path, end_pages, is_ocr, layout_mode, formula_enable, table_enable, language):
+def to_markdown(file_path, end_pages, is_ocr, formula_enable, table_enable, language):
     file_path = to_pdf(file_path)
     # get the recognized md file and the path of the zip archive
-    local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1, is_ocr,
-                                        layout_mode, formula_enable, table_enable, language)
-    archive_zip_path = os.path.join('./output', compute_sha256(local_md_dir) + '.zip')
+    local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1, is_ocr, formula_enable, table_enable, language)
+    archive_zip_path = os.path.join('./output', str_sha256(local_md_dir) + '.zip')
     zip_archive_success = compress_directory_to_zip(local_md_dir, archive_zip_path)
     if zip_archive_success == 0:
         logger.info('压缩成功')
@@ -125,24 +113,6 @@ latex_delimiters = [
 ]
 
 
-def init_model():
-    from magic_pdf.model.doc_analyze_by_custom_model import ModelSingleton
-    try:
-        model_manager = ModelSingleton()
-        txt_model = model_manager.get_model(False, False)  # noqa: F841
-        logger.info('txt_model init final')
-        ocr_model = model_manager.get_model(True, False)  # noqa: F841
-        logger.info('ocr_model init final')
-        return 0
-    except Exception as e:
-        logger.exception(e)
-        return -1
-
-
-model_init = init_model()
-logger.info(f'model_init: {model_init}')
-
-
 with open('header.html', 'r') as file:
     header = file.read()
 
@@ -171,24 +141,30 @@ all_lang = []
 all_lang.extend([*other_lang, *add_lang])
 
 
+def safe_stem(file_path):
+    stem = Path(file_path).stem
+    # keep only letters, digits, underscores and dots; replace everything else with underscores
+    return re.sub(r'[^\w.]', '_', stem)
+
+
 def to_pdf(file_path):
-    with pymupdf.open(file_path) as f:
-        if f.is_pdf:
-            return file_path
-        else:
-            pdf_bytes = f.convert_to_pdf()
-            # generate a unique filename and write the pdf bytes to it
-            unique_filename = f'{uuid.uuid4()}.pdf'
-            # build the full file path
-            tmp_file_path = os.path.join(os.path.dirname(file_path), unique_filename)
-            # write the byte data to the file
-            with open(tmp_file_path, 'wb') as tmp_pdf_file:
-                tmp_pdf_file.write(pdf_bytes)
-            return tmp_file_path
+    if file_path is None:
+        return None
+    pdf_bytes = read_fn(file_path)
+    # unique_filename = f'{uuid.uuid4()}.pdf'
+    unique_filename = f'{safe_stem(file_path)}.pdf'
+    # build the full file path
+    tmp_file_path = os.path.join(os.path.dirname(file_path), unique_filename)
+    # write the byte data to the file
+    with open(tmp_file_path, 'wb') as tmp_pdf_file:
+        tmp_pdf_file.write(pdf_bytes)
+    return tmp_file_path
 
 
 if __name__ == '__main__':
@@ -196,14 +172,16 @@ if __name__ == '__main__':
         gr.HTML(header)
         with gr.Row():
             with gr.Column(variant='panel', scale=5):
-                file = gr.File(label='Please upload a PDF or image', file_types=['.pdf', '.png', '.jpeg', '.jpg'])
-                max_pages = gr.Slider(1, 20, 10, step=1, label='Max convert pages')
                 with gr.Row():
-                    layout_mode = gr.Dropdown(['doclayout_yolo'], label='Layout model', value='doclayout_yolo')
-                    language = gr.Dropdown(all_lang, label='Language', value='ch')
+                    file = gr.File(label='Please upload a PDF or image', file_types=['.pdf', '.png', '.jpeg', '.jpg'])
+                with gr.Row(equal_height=True):
+                    with gr.Column(scale=4):
+                        max_pages = gr.Slider(1, 20, 10, step=1, label='Max convert pages')
+                    with gr.Column(scale=1):
+                        language = gr.Dropdown(all_lang, label='Language', value='ch')
                 with gr.Row():
+                    formula_enable = gr.Checkbox(label='Enable formula recognition', value=True)
                     is_ocr = gr.Checkbox(label='Force enable OCR', value=False)
-                    formula_enable = gr.Checkbox(label='Enable formula recognition', value=True)
                     table_enable = gr.Checkbox(label='Enable table recognition(test)', value=True)
                 with gr.Row():
                     change_bu = gr.Button('Convert')
@@ -227,7 +205,7 @@ if __name__ == '__main__':
            with gr.Tab('Markdown text'):
                 md_text = gr.TextArea(lines=45, show_copy_button=True)
     file.change(fn=to_pdf, inputs=file, outputs=pdf_show)
-    change_bu.click(fn=to_markdown, inputs=[file, max_pages, is_ocr, layout_mode, formula_enable, table_enable, language],
+    change_bu.click(fn=to_markdown, inputs=[file, max_pages, is_ocr, formula_enable, table_enable, language],
                     outputs=[md, md_text, output_file, pdf_show])
     clear_bu.add([file, md, pdf_show, md_text, output_file, is_ocr])
## Installation
MinerU
```bash
git clone https://github.com/opendatalab/MinerU.git
cd MinerU
conda create -n MinerU python=3.10
conda activate MinerU
pip install .[full] --extra-index-url https://wheels.myhloli.com
```
Third-party software
```bash
# install
pip install llama-index-vector-stores-elasticsearch==0.2.0
pip install llama-index-embeddings-dashscope==0.2.0
pip install llama-index-core==0.10.68
pip install einops==0.7.0
pip install transformers-stream-generator==0.0.5
pip install accelerate==0.33.0
# uninstall
pip uninstall transformer-engine
```
## Environment Configuration
```
export DASHSCOPE_API_KEY={some_key}
export ES_USER={some_es_user}
export ES_PASSWORD={some_es_password}
export ES_URL=http://{es_url}:9200
```
For instructions on obtaining a DASHSCOPE_API_KEY, refer to [documentation](https://help.aliyun.com/zh/dashscope/opening-service)
## Usage
### Data Ingestion
```bash
python data_ingestion.py -p some.pdf # load data from pdf
or
python data_ingestion.py -p /opt/data/some_pdf_directory/ # load data from multiple pdfs under the directory {some_pdf_directory}
```
### Query
```bash
python query.py --question '{the_question_you_want_to_ask}'
```
## Example
````bash
# Start the es service
docker compose up -d
or
docker-compose up -d
# Set environment variables
export ES_USER=elastic
export ES_PASSWORD=llama_index
export ES_URL=http://127.0.0.1:9200
export DASHSCOPE_API_KEY={some_key}
# Ingest data
python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.pdf
# Ask a question
python query.py -q 'how about the rights of men'
## outputs
Please answer the question based on the content within ```:
```
I. Men are born, and always continue, free and equal in respect of their rights. Civil distinctions, therefore, can be founded only on public utility.
```
My question is: how about the rights of men.
question: how about the rights of men
answer: The statement implies that men are born free and equal in terms of their rights. Civil distinctions should only be based on public utility. However, it does not specify what those rights are. It is up to society and individual countries to determine and protect the specific rights of their citizens.
````
## Development
`MinerU` provides a `RAG` integration interface, allowing users to specify a single input `pdf` file or a directory. `MinerU` will automatically parse the input files and return an iterable interface for retrieving the data.
### API Interface
```python
from magic_pdf.integrations.rag.type import Node
class RagPageReader:
    def get_rel_map(self) -> list[ElementRelation]:
        # Retrieve the relationships between nodes
        pass

    ...

class RagDocumentReader:
    ...

class DataReader:
    def __init__(self, path_or_directory: str, method: str, output_dir: str):
        pass

    def get_documents_count(self) -> int:
        """Get the number of pdf documents."""
        pass

    def get_document_result(self, idx: int) -> RagDocumentReader | None:
        """Retrieve the parsed content of a specific pdf."""
        pass

    def get_document_filename(self, idx: int) -> Path:
        """Retrieve the path of a specific pdf."""
        pass
```
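A minimal usage sketch of this interface (it mirrors the `data_ingestion.py` example shown later in this document):

```python
# Parse one file and iterate the resulting nodes page by page.
from magic_pdf.integrations.rag.api import DataReader

documents = DataReader("some.pdf", "ocr", "/tmp/magic_pdf/integrations/rag/")
print("documents:", documents.get_documents_count())

doc = documents.get_document_result(0)   # RagDocumentReader, or None if parsing failed
if doc is not None:
    for page in doc:                     # iterate the pages of the document
        for element in page:             # iterate the nodes on the page
            if element.text:
                print(element.category_type, element.text[:80])
```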
Type Definitions
```python
class Node(BaseModel):
    category_type: CategoryType = Field(description='Category')  # Category
    text: str | None = Field(description='Text content', default=None)
    image_path: str | None = Field(description='Path to image or table (table may be stored as an image)', default=None)
    anno_id: int = Field(description='Unique ID', default=-1)
    latex: str | None = Field(description='LaTeX output for equations or tables', default=None)
    html: str | None = Field(description='HTML output for tables', default=None)
```
Tables can be stored in one of three formats: image, LaTeX, or HTML.
`anno_id` is a globally unique ID for each Node. It can be used later to match this Node with other Nodes. The relationships between nodes can be retrieved using the `get_rel_map` method. Users can use `anno_id` to link nodes and construct a RAG index that includes node relationships.
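For instance, a sketch along the following lines could pair caption nodes with their image or table bodies. The field names on the relation objects (`source_anno_id`, `target_anno_id`) are assumptions made for illustration; check `magic_pdf.integrations.rag.type` for the exact schema.

```python
# Hypothetical sketch: link caption nodes to body nodes via anno_id.
nodes_by_anno_id = {}
for page in doc:                          # doc: a RagDocumentReader obtained from DataReader
    for node in page:
        if node.anno_id != -1:
            nodes_by_anno_id[node.anno_id] = node
    for rel in page.get_rel_map():        # relationships among nodes on this page
        src = nodes_by_anno_id.get(rel.source_anno_id)    # assumed field name
        tgt = nodes_by_anno_id.get(rel.target_anno_id)    # assumed field name
        if src and tgt:
            print(src.category_type, '<->', tgt.category_type)
```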
### Node Relationship Matrix
| | image_body | table_body |
| -------------- | ---------- | ---------- |
| image_caption | sibling | |
| table_caption | | sibling |
| table_footnote | | sibling |
<details open="open">
<summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<ol>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#installation">Installation</a></li>
<li><a href="#example">Example</a></li>
<li><a href="#development">Development</a></li>
</ol>
</details>
## Introduction
`MinerU` provides a data `API` so that users can import data into a `RAG` system. This project demonstrates how to build a lightweight `RAG` system based on `Qwen (Tongyi Qianwen)`.
<p align="center">
<img src="rag_data_api.png" width="300px" style="vertical-align:middle;">
</p>
## Installation
Environment requirements
```text
NVIDIA A100 80GB,
Centos 7 3.10.0-957.el7.x86_64
Client: Docker Engine - Community
Version: 24.0.5
API version: 1.43
Go version: go1.20.6
Git commit: ced0996
Built: Fri Jul 21 20:39:02 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 24.0.5
API version: 1.43 (minimum version 1.12)
Go version: go1.20.6
Git commit: a61e2b4
Built: Fri Jul 21 20:38:05 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.25
GitCommit: d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
runc:
Version: 1.1.10
GitCommit: v1.1.10-0-g18a0cb0
docker-init:
Version: 0.19.0
GitCommit: de40ad0
```
Please refer to the [documentation](../../README_zh-CN.md) to install MinerU.
Third-party software
```bash
# install
pip install modelscope==1.14.0
pip install llama-index-vector-stores-elasticsearch==0.2.0
pip install llama-index-embeddings-dashscope==0.2.0
pip install llama-index-core==0.10.68
pip install einops==0.7.0
pip install transformers-stream-generator==0.0.5
pip install accelerate==0.33.0
# uninstall
pip uninstall transformer-engine
```
## Example
````bash
cd projects/llama_index_rag
docker compose up -d
or
docker-compose up -d
# Set environment variables
export ES_USER=elastic
export ES_PASSWORD=llama_index
export ES_URL=http://127.0.0.1:9200
export DASHSCOPE_API_KEY={some_key}
# For instructions on obtaining a DASHSCOPE_API_KEY, refer to https://help.aliyun.com/zh/dashscope/opening-service
# Query before ingesting any data; Qwen returns its default answer
python query.py -q 'how about the rights of men'
## outputs
question: how about the rights of men
answer: The topic of men's rights often refers to discussions around legal, social, and political issues that affect men specifically or differently from women. Movements related to men's rights advocate for addressing areas where men face discrimination or unique challenges, such as:
Child Custody: Ensuring that men have equal opportunities for custody of their children following divorce or separation.
Domestic Violence: Recognizing that men can also be victims of domestic abuse and ensuring they have access to support services.
Mental Health and Suicide Rates: Addressing the higher rates of suicide among men and providing mental health resources.
Military Conscription: In some countries, only men are required to register for military service, which is seen as a gender-based obligation.
Workplace Safety: Historically, more men than women have been employed in high-risk occupations, leading to higher workplace injury and death rates.
Parental Leave: Advocating for paternity leave policies that allow men to take time off work for family care.
Men's rights activism often intersects with broader discussions on gender equality and aims to promote fairness and equity across genders. It's important to note that while advocating for these issues, it should be done in a way that does not detract from or oppose the goals of gender equality and the rights of other groups. The focus should be on creating a fair society where everyone has equal opportunities and protections under the law.
# Ingest data
python data_ingestion.py -p example/data/
or
python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.pdf
# After ingesting data, query again; Qwen answers based on the RAG retrieval results and the surrounding context.
python query.py -q 'how about the rights of men'
## outputs
Please answer the question based on the content within ```:
```
I. Men are born, and always continue, free and equal in respect of their rights. Civil distinctions, therefore, can be founded only on public utility.
```
My question is: how about the rights of men.
question: how about the rights of men
answer: The statement implies that men are born free and equal in terms of their rights. Civil distinctions should only be based on public utility. However, it does not specify what those rights are. It is up to society and individual countries to determine and protect the specific rights of their citizens.
````
## Development
`MinerU` provides a `RAG` integration interface. Users can point it at a single `pdf` file or a directory; `MinerU` automatically parses the input files and returns an iterable interface for retrieving the data.
### API Interface
```python
from magic_pdf.integrations.rag.type import Node
class RagPageReader:
    def get_rel_map(self) -> list[ElementRelation]:
        # Retrieve the relationships between nodes
        pass

    ...

class RagDocumentReader:
    ...

class DataReader:
    def __init__(self, path_or_directory: str, method: str, output_dir: str):
        pass

    def get_documents_count(self) -> int:
        """Get the number of pdf documents."""
        pass

    def get_document_result(self, idx: int) -> RagDocumentReader | None:
        """Retrieve the parsed content of a specific pdf."""
        pass

    def get_document_filename(self, idx: int) -> Path:
        """Retrieve the path of a specific pdf."""
        pass
```
Type definitions
```python
class Node(BaseModel):
    category_type: CategoryType = Field(description='category')  # category
    text: str | None = Field(description='text content', default=None)
    image_path: str | None = Field(description='storage path of the image or table (tables may be stored as images)', default=None)
    anno_id: int = Field(description='unique id', default=-1)
    latex: str | None = Field(description='latex parsing result of the formula or table', default=None)
    html: str | None = Field(description='html parsing result of the table', default=None)
```
A table may be stored in one of three forms: image, latex, or html.
anno_id is the globally unique ID of the Node. It can later be used to match this Node against other Nodes. Node relationships can be retrieved with the `get_rel_map` method. Users can use `anno_id` to match relationships between nodes and build a RAG index that includes those relationships.
### Node Relationship Matrix
| | image_body | table_body |
| -------------- | ---------- | ---------- |
| image_caption | sibling | |
| table_caption | | sibling |
| table_footnote | | sibling |
import os

import click
from llama_index.core.schema import TextNode
from llama_index.embeddings.dashscope import (DashScopeEmbedding,
                                              DashScopeTextEmbeddingModels,
                                              DashScopeTextEmbeddingType)
from llama_index.vector_stores.elasticsearch import ElasticsearchStore

from magic_pdf.integrations.rag.api import DataReader

es_vec_store = ElasticsearchStore(
    index_name='rag_index',
    es_url=os.getenv('ES_URL', 'http://127.0.0.1:9200'),
    es_user=os.getenv('ES_USER', 'elastic'),
    es_password=os.getenv('ES_PASSWORD', 'llama_index'),
)


# Create embeddings
# text_type=`document` to build index
def embed_node(node):
    embedder = DashScopeEmbedding(
        model_name=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V2,
        text_type=DashScopeTextEmbeddingType.TEXT_TYPE_DOCUMENT,
    )
    result_embeddings = embedder.get_text_embedding(node.text)
    node.embedding = result_embeddings
    return node


@click.command()
@click.option(
    '-p',
    '--path',
    'path',
    type=click.Path(exists=True),
    required=True,
    help='local pdf filepath or directory',
)
def cli(path):
    output_dir = '/tmp/magic_pdf/integrations/rag/'
    os.makedirs(output_dir, exist_ok=True)
    documents = DataReader(path, 'ocr', output_dir)

    # build nodes
    nodes = []
    for idx in range(documents.get_documents_count()):
        doc = documents.get_document_result(idx)
        if doc is None:  # something wrong happens when parse pdf !
            continue
        for page in iter(doc):  # iterate documents from initial page to last page !
            for element in iter(page):  # iterate the element from all page !
                if element.text is None:
                    continue
                nodes.append(
                    embed_node(
                        TextNode(text=element.text,
                                 metadata={'purpose': 'demo'})))
    es_vec_store.add(nodes)


if __name__ == '__main__':
    cli()
services:
  es:
    container_name: es
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.3
    volumes:
      - esdata01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    environment:
      - node.name=es
      - ELASTIC_PASSWORD=llama_index
      - bootstrap.memory_lock=false
      - discovery.type=single-node
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=false
      - xpack.security.transport.ssl.enabled=false
    ulimits:
      memlock:
        soft: -1
        hard: -1
    restart: always

volumes:
  esdata01:
    driver: local
import os

import click
from llama_index.core.vector_stores.types import VectorStoreQuery
from llama_index.embeddings.dashscope import (DashScopeEmbedding,
                                              DashScopeTextEmbeddingModels,
                                              DashScopeTextEmbeddingType)
from llama_index.vector_stores.elasticsearch import (AsyncDenseVectorStrategy,
                                                     ElasticsearchStore)

# initialize qwen 7B model
from modelscope import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

es_vector_store = ElasticsearchStore(
    index_name='rag_index',
    es_url=os.getenv('ES_URL', 'http://127.0.0.1:9200'),
    es_user=os.getenv('ES_USER', 'elastic'),
    es_password=os.getenv('ES_PASSWORD', 'llama_index'),
    retrieval_strategy=AsyncDenseVectorStrategy(),
)


def embed_text(text):
    embedder = DashScopeEmbedding(
        model_name=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V2,
        text_type=DashScopeTextEmbeddingType.TEXT_TYPE_DOCUMENT,
    )
    return embedder.get_text_embedding(text)


def search(vector_store: ElasticsearchStore, query: str):
    query_vec = VectorStoreQuery(query_embedding=embed_text(query))
    result = vector_store.query(query_vec)
    return '\n'.join([node.text for node in result.nodes])


@click.command()
@click.option(
    '-q',
    '--question',
    'question',
    required=True,
    help='ask what you want to know!',
)
def cli(question):
    tokenizer = AutoTokenizer.from_pretrained('qwen/Qwen-7B-Chat',
                                              revision='v1.0.5',
                                              trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained('qwen/Qwen-7B-Chat',
                                                 revision='v1.0.5',
                                                 device_map='auto',
                                                 trust_remote_code=True,
                                                 fp32=True).eval()
    model.generation_config = GenerationConfig.from_pretrained(
        'Qwen/Qwen-7B-Chat', revision='v1.0.5', trust_remote_code=True)

    # define a prompt template for the vectorDB-enhanced LLM generation
    def answer_question(question, context, model):
        if context == '':
            prompt = question
        else:
            prompt = f'''请基于```内的内容回答问题。"
```
{context}
```
我的问题是:{question}
'''
        history = None
        print(prompt)
        response, history = model.chat(tokenizer, prompt, history=None)
        return response

    answer = answer_question(question, search(es_vector_store, question),
                             model)
    print(f'question: {question}\n'
          f'answer: {answer}')


"""
python query.py -q 'how about the rights of men'
"""

if __name__ == '__main__':
    cli()