Merge pull request #6 from opendatalab/dev

Dev

Unverified Commit ece7f8d5 authored Oct 15, 2024 by Kaiwen Liu Committed by GitHub Oct 15, 2024
20 changed files
--- a/docs/en/index.rst
+++ b/docs/en/index.rst
+.. xtuner documentation master file, created by
+   sphinx-quickstart on Tue Jan  9 16:33:06 2024.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+Welcome to the MinerU Documentation
+==============================================
+.. figure:: ./_static/image/logo.png
+  :align: center
+  :alt: mineru
+  :class: no-scaled-link
+.. raw:: html
+   <p style="text-align:center">
+   <strong>A one-stop, open-source, high-quality data extraction tool
+   </strong>
+   </p>
+   <p style="text-align:center">
+   <script async defer src="https://buttons.github.io/buttons.js"></script>
+   <a class="github-button" href="https://github.com/opendatalab/MinerU" data-show-count="true" data-size="large" aria-label="Star">Star</a>
+   <a class="github-button" href="https://github.com/opendatalab/MinerU/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
+   <a class="github-button" href="https://github.com/opendatalab/MinerU/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
+   </p>
--- a/docs/en/make.bat
+++ b/docs/en/make.bat
+@ECHO OFF
+pushd %~dp0
+REM Command file for Sphinx documentation
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=.
+set BUILDDIR=_build
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.https://www.sphinx-doc.org/
+	exit /b 1
+)
+if "%1" == "" goto help
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+goto end
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+:end
+popd
--- a/docs/how_to_download_models_en.md
+++ b/docs/how_to_download_models_en.md
-### 1. Download the Model from Hugging Face
-Use a Python Script to Download Model Files from Hugging Face
-```bash
-pip install huggingface_hub
-wget https://github.com/opendatalab/MinerU/raw/master/docs/download_models_hf.py
-python download_models_hf.py
-```
-After the Python script finishes executing, it will output the directory where the models are downloaded.
-### 2. Additional steps
-#### 1. Check whether the model directory is downloaded completely.
-The structure of the model folder is as follows, including configuration files and weight files of different components:
-```
-../
-├── Layout
-│   ├── config.json
-│   └── model_final.pth
-├── MFD
-│   └── weights.pt
-├── MFR
-│   └── UniMERNet
-│       ├── config.json
-│       ├── preprocessor_config.json
-│       ├── pytorch_model.bin
-│       ├── README.md
-│       ├── tokenizer_config.json
-│       └── tokenizer.json
-│── TabRec
-│   └─StructEqTable
-│       ├── config.json
-│       ├── generation_config.json
-│       ├── model.safetensors
-│       ├── preprocessor_config.json
-│       ├── special_tokens_map.json
-│       ├── spiece.model
-│       ├── tokenizer.json
-│       └── tokenizer_config.json 
-│   └─ TableMaster 
-│       └─ ch_PP-OCRv3_det_infer
-│           ├── inference.pdiparams
-│           ├── inference.pdiparams.info
-│           └── inference.pdmodel
-│       └─ ch_PP-OCRv3_rec_infer
-│           ├── inference.pdiparams
-│           ├── inference.pdiparams.info
-│           └── inference.pdmodel
-│       └─ table_structure_tablemaster_infer
-│           ├── inference.pdiparams
-│           ├── inference.pdiparams.info
-│           └── inference.pdmodel
-│       ├── ppocr_keys_v1.txt
-│       └── table_master_structure_dict.txt
-└── README.md
-```
-#### 2. Check whether the model file is fully downloaded.
-Please check whether the size of the model file in the directory is consistent with the description on the web page. If possible, it is best to check whether the model is downloaded completely through sha256.
-#### 3. 
-Additionally, in `~/magic-pdf.json`, update the model directory path to the absolute path of the `models` directory output by the previous Python script. Otherwise, you will encounter an error indicating that the model cannot be loaded.
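The removed guide recommends checking downloads with sha256 but does not show how. A minimal sketch, assuming the expected digest has been copied by hand from the model's web page; `sha256_of` and `is_complete` are hypothetical helper names used only for illustration and are not part of MinerU.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large weight files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_complete(path, expected_hex):
    """True if the file exists and its digest matches the published one."""
    if not Path(path).is_file():
        return False
    return sha256_of(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat for multi-gigabyte weight files; the comparison is case-insensitive on the expected digest because published hashes are sometimes shown in upper case.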
--- a/docs/how_to_download_models_zh_cn.md
+++ b/docs/how_to_download_models_zh_cn.md
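-# How to Download Model Files
-Model files can be downloaded from Hugging Face or ModelScope. Because of network restrictions, users in mainland China may be unable to reach Hugging Face and should use ModelScope instead.
-<details>
-  <summary>Method 1: Download the models from Hugging Face</summary>
-  <p>Use a Python script to download the model files from Hugging Face</p>
-  <pre><code>pip install huggingface_hub
-wget https://gitee.com/myhloli/MinerU/raw/master/docs/download_models_hf.py
-python download_models_hf.py</code></pre>
-  <p>When the Python script finishes, it prints the directory the models were downloaded to</p>
-</details>
-## Method 2: Download the models from ModelScope
-### Use a Python script to download the model files from ModelScope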
-```bash
-pip install modelscope
-wget https://gitee.com/myhloli/MinerU/raw/master/docs/download_models.py
-python download_models.py
-```
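-When the Python script finishes, it prints the directory the models were downloaded to
-## [❗️Required❗️] Additional steps (be sure to complete these after the models finish downloading)
-### 1. Check that the model directory was downloaded completely
-The model folder has the following structure, containing the configuration and weight files of the different components: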
-```
-./
-├── Layout  # layout detection model
-│   ├── config.json
-│   └── model_final.pth
-├── MFD  # formula detection
-│   └── weights.pt
-├── MFR  # formula recognition model
-│   └── UniMERNet
-│       ├── config.json
-│       ├── preprocessor_config.json
-│       ├── pytorch_model.bin
-│       ├── README.md
-│       ├── tokenizer_config.json
-│       └── tokenizer.json
-│── TabRec # table recognition model
-│   └─StructEqTable
-│       ├── config.json
-│       ├── generation_config.json
-│       ├── model.safetensors
-│       ├── preprocessor_config.json
-│       ├── special_tokens_map.json
-│       ├── spiece.model
-│       ├── tokenizer.json
-│       └── tokenizer_config.json 
-│   └─ TableMaster 
-│       └─ ch_PP-OCRv3_det_infer
-│           ├── inference.pdiparams
-│           ├── inference.pdiparams.info
-│           └── inference.pdmodel
-│       └─ ch_PP-OCRv3_rec_infer
-│           ├── inference.pdiparams
-│           ├── inference.pdiparams.info
-│           └── inference.pdmodel
-│       └─ table_structure_tablemaster_infer
-│           ├── inference.pdiparams
-│           ├── inference.pdiparams.info
-│           └── inference.pdmodel
-│       ├── ppocr_keys_v1.txt
-│       └── table_master_structure_dict.txt
-└── README.md
-```
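-### 2. Check that the model files were downloaded completely
-Check that the sizes of the model files in the directory match those listed on the web page. If possible, verify the downloads with sha256.
-### 3. Update the model path in magic-pdf.json
-Then, in `~/magic-pdf.json`, set the model directory to the absolute path of the models directory printed by the Python script above; otherwise you will get an error that the model cannot be loaded.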
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
+myst-parser
+sphinx
+sphinx-argparse
+sphinx-book-theme
+sphinx-copybutton
+sphinx_rtd_theme
--- a/docs/zh_cn/.readthedocs.yaml
+++ b/docs/zh_cn/.readthedocs.yaml
+version: 2
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.10"
+formats:
+  - epub
+python:
+  install:
+    - requirements: docs/requirements.txt
+sphinx:
+  configuration: docs/zh_cn/conf.py
--- a/docs/zh_cn/Makefile
+++ b/docs/zh_cn/Makefile
+# Minimal makefile for Sphinx documentation
+#
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS    ?=
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = .
+BUILDDIR      = _build
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+.PHONY: help Makefile
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
--- a/docs/zh_cn/_static/image/logo.png
+++ b/docs/zh_cn/_static/image/logo.png
--- a/docs/zh_cn/conf.py
+++ b/docs/zh_cn/conf.py
+# Configuration file for the Sphinx documentation builder.
+#
+# This file only contains a selection of the most common options. For a full
+# list see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+# -- Path setup --------------------------------------------------------------
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+import os
+import subprocess
+import sys
+from sphinx.ext import autodoc
+def install(package):
+    subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
+requirements_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'requirements.txt'))
+if os.path.exists(requirements_path):
+    with open(requirements_path) as f:
+        packages = f.readlines()
+    for package in packages:
+        install(package.strip())
+sys.path.insert(0, os.path.abspath('../..'))
+# -- Project information -----------------------------------------------------
+project = 'MinerU'
+copyright = '2024, OpenDataLab'
+author = 'MinerU Contributors'
+# The full version, including alpha/beta/rc tags
+version_file = '../../magic_pdf/libs/version.py'
+with open(version_file) as f:
+    exec(compile(f.read(), version_file, 'exec'))
+__version__ = locals()['__version__']
+# The short X.Y version
+version = __version__
+# The full version, including alpha/beta/rc tags
+release = __version__
+# -- General configuration ---------------------------------------------------
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+    'sphinx.ext.napoleon',
+    'sphinx.ext.viewcode',
+    'sphinx.ext.intersphinx',
+    'sphinx_copybutton',
+    'sphinx.ext.autodoc',
+    'sphinx.ext.autosummary',
+    'myst_parser',
+    'sphinxarg.ext',
+]
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+# Exclude the prompt "$" when copying code
+copybutton_prompt_text = r'\$ '
+copybutton_prompt_is_regexp = True
+language = 'zh_CN'
+# -- Options for HTML output -------------------------------------------------
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'sphinx_book_theme'
+html_logo = '_static/image/logo.png'
+html_theme_options = {
+    'path_to_docs': 'docs/zh_cn',
+    'repository_url': 'https://github.com/opendatalab/MinerU',
+    'use_repository_button': True,
+}
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+# html_static_path = ['_static']
+# Mock out external dependencies here.
+autodoc_mock_imports = [
+    'cpuinfo',
+    'torch',
+    'transformers',
+    'psutil',
+    'prometheus_client',
+    'sentencepiece',
+    'vllm.cuda_utils',
+    'vllm._C',
+    'numpy',
+    'tqdm',
+]
+class MockedClassDocumenter(autodoc.ClassDocumenter):
+    """Remove note about base class when a class is derived from object."""
+    def add_line(self, line: str, source: str, *lineno: int) -> None:
+        if line == '   Bases: :py:class:`object`':
+            return
+        super().add_line(line, source, *lineno)
+autodoc.ClassDocumenter = MockedClassDocumenter
+navigation_with_keys = False
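conf.py obtains `__version__` by executing `version.py` and then looking the name up in `locals()`. A minimal alternative sketch, assuming the version file does nothing except define `__version__`; `read_version` is a hypothetical helper and not part of this PR.

```python
def read_version(version_file):
    """Execute a version file in an isolated namespace and return __version__."""
    namespace = {}
    with open(version_file) as f:
        # Run the file's code with `namespace` as its globals, so the names it
        # defines do not leak into the Sphinx configuration module.
        exec(compile(f.read(), version_file, 'exec'), namespace)
    return namespace['__version__']
```

Under these assumptions, `version = release = read_version('../../magic_pdf/libs/version.py')` would set both Sphinx options without relying on `locals()`.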
--- a/docs/zh_cn/index.rst
+++ b/docs/zh_cn/index.rst
+.. xtuner documentation master file, created by
+   sphinx-quickstart on Tue Jan  9 16:33:06 2024.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+Welcome to the MinerU Chinese Documentation
+==============================================
+.. figure:: ./_static/image/logo.png
+  :align: center
+  :alt: mineru
+  :class: no-scaled-link
+.. raw:: html
+   <p style="text-align:center">
+   <strong> A one-stop, open-source, high-quality data extraction tool
+   </strong>
+   </p>
+   <p style="text-align:center">
+   <script async defer src="https://buttons.github.io/buttons.js"></script>
+   <a class="github-button" href="https://github.com/opendatalab/MinerU" data-show-count="true" data-size="large" aria-label="Star">Star</a>
+   <a class="github-button" href="https://github.com/opendatalab/MinerU/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
+   <a class="github-button" href="https://github.com/opendatalab/MinerU/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
+   </p>
--- a/docs/zh_cn/make.bat
+++ b/docs/zh_cn/make.bat
+@ECHO OFF
+pushd %~dp0
+REM Command file for Sphinx documentation
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=.
+set BUILDDIR=_build
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.https://www.sphinx-doc.org/
+	exit /b 1
+)
+if "%1" == "" goto help
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+goto end
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+:end
+popd
--- a/magic-pdf.template.json
+++ b/magic-pdf.template.json
@@ -4,6 +4,7 @@
        "bucket-name-2":["ak", "sk", "endpoint"]
    },
    "models-dir":"/tmp/models",
+    "layoutreader-model-dir":"/tmp/layoutreader",
    "device-mode":"cpu",
    "table-config": {
        "model": "TableMaster",

--- a/magic_pdf/dict2md/ocr_mkcontent.py
+++ b/magic_pdf/dict2md/ocr_mkcontent.py
@@ -8,6 +8,7 @@ from magic_pdf.libs.language import detect_lang
 from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
 from magic_pdf.libs.markdown_utils import ocr_escape_special_markdown_char
 from magic_pdf.libs.ocr_content_type import BlockType, ContentType
+from magic_pdf.para.para_split_v3 import ListLineTag
 def __is_hyphen_at_line_end(line):
@@ -116,17 +117,20 @@ def ocr_mk_markdown_with_para_core(paras_of_layout, mode, img_buket_path=''):
 def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
                                      mode,
-                                      img_buket_path=''):
+                                      img_buket_path='',
+                                      parse_type="auto",
+                                      lang=None
+                                      ):
    page_markdown = []
    for para_block in paras_of_layout:
        para_text = ''
        para_type = para_block['type']
-        if para_type == BlockType.Text:
+        if para_type in [BlockType.Text, BlockType.List, BlockType.Index]:
-            para_text = merge_para_with_text(para_block)
+            para_text = merge_para_with_text(para_block, parse_type=parse_type, lang=lang)
        elif para_type == BlockType.Title:
-            para_text = f'# {merge_para_with_text(para_block)}'
+            para_text = f'# {merge_para_with_text(para_block, parse_type=parse_type, lang=lang)}'
        elif para_type == BlockType.InterlineEquation:
-            para_text = merge_para_with_text(para_block)
+            para_text = merge_para_with_text(para_block, parse_type=parse_type, lang=lang)
        elif para_type == BlockType.Image:
            if mode == 'nlp':
                continue
@@ -139,17 +143,17 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
                                    para_text += f"\n![]({join_path(img_buket_path, span['image_path'])})  \n"
                for block in para_block['blocks']:  # 2nd: append image_caption
                    if block['type'] == BlockType.ImageCaption:
-                        para_text += merge_para_with_text(block)
+                        para_text += merge_para_with_text(block, parse_type=parse_type, lang=lang)
                for block in para_block['blocks']:  # 3rd: append image_footnote
                    if block['type'] == BlockType.ImageFootnote:
-                        para_text += merge_para_with_text(block)
+                        para_text += merge_para_with_text(block, parse_type=parse_type, lang=lang)
        elif para_type == BlockType.Table:
            if mode == 'nlp':
                continue
            elif mode == 'mm':
                for block in para_block['blocks']:  # 1st: append table_caption
                    if block['type'] == BlockType.TableCaption:
-                        para_text += merge_para_with_text(block)
+                        para_text += merge_para_with_text(block, parse_type=parse_type, lang=lang)
                for block in para_block['blocks']:  # 2nd: append table_body
                    if block['type'] == BlockType.TableBody:
                        for line in block['lines']:
@@ -164,7 +168,7 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
                                        para_text += f"\n![]({join_path(img_buket_path, span['image_path'])})  \n"
                for block in para_block['blocks']:  # 3rd: append table_footnote
                    if block['type'] == BlockType.TableFootnote:
-                        para_text += merge_para_with_text(block)
+                        para_text += merge_para_with_text(block, parse_type=parse_type, lang=lang)
        if para_text.strip() == '':
            continue
@@ -174,22 +178,26 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
    return page_markdown
-def merge_para_with_text(para_block):
-    def detect_language(text):
-        en_pattern = r'[a-zA-Z]+'
-        en_matches = re.findall(en_pattern, text)
-        en_length = sum(len(match) for match in en_matches)
-        if len(text) > 0:
-            if en_length / len(text) >= 0.5:
-                return 'en'
-            else:
-                return 'unknown'
+def detect_language(text):
+    en_pattern = r'[a-zA-Z]+'
+    en_matches = re.findall(en_pattern, text)
+    en_length = sum(len(match) for match in en_matches)
+    if len(text) > 0:
+        if en_length / len(text) >= 0.5:
+            return 'en'
        else:
-            return 'empty'
+            return 'unknown'
+    else:
+        return 'empty'
+def merge_para_with_text(para_block, parse_type="auto", lang=None):
    para_text = ''
-    for line in para_block['lines']:
+    for i, line in enumerate(para_block['lines']):
+        if i >= 1 and line.get(ListLineTag.IS_LIST_START_LINE, False):
+            para_text += '  \n'
        line_text = ''
        line_lang = ''
        for span in line['spans']:
@@ -205,11 +213,15 @@ def merge_para_with_text(para_block):
                content = span['content']
                # language = detect_lang(content)
                language = detect_language(content)
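-                if language == 'en':  # Split long words only for English; word-splitting Chinese text would lose content
-                    content = ocr_escape_special_markdown_char(
-                        split_long_words(content))
-                else:
+                # Check whether a less common (non-English) language was specified
+                if lang is not None and lang != 'en':
                    content = ocr_escape_special_markdown_char(content)
+                else:  # Logic for English or an unspecified language
+                    if language == 'en' and parse_type == 'ocr':  # Split long words only for English; word-splitting Chinese text would lose content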
+                        content = ocr_escape_special_markdown_char(
+                            split_long_words(content))
+                    else:
+                        content = ocr_escape_special_markdown_char(content)
            elif span_type == ContentType.InlineEquation:
                content = f" ${span['content']}$ "
            elif span_type == ContentType.InterlineEquation:
@@ -265,41 +277,39 @@ def para_to_standard_format(para, img_buket_path):
    return para_content
-def para_to_standard_format_v2(para_block, img_buket_path, page_idx):
+def para_to_standard_format_v2(para_block, img_buket_path, page_idx, parse_type="auto", lang=None, drop_reason=None):
    para_type = para_block['type']
+    para_content = {}
    if para_type == BlockType.Text:
        para_content = {
            'type': 'text',
-            'text': merge_para_with_text(para_block),
+            'text': merge_para_with_text(para_block, parse_type=parse_type, lang=lang),
-            'page_idx': page_idx,
        }
    elif para_type == BlockType.Title:
        para_content = {
            'type': 'text',
-            'text': merge_para_with_text(para_block),
+            'text': merge_para_with_text(para_block, parse_type=parse_type, lang=lang),
            'text_level': 1,
-            'page_idx': page_idx,
        }
    elif para_type == BlockType.InterlineEquation:
        para_content = {
            'type': 'equation',
-            'text': merge_para_with_text(para_block),
+            'text': merge_para_with_text(para_block, parse_type=parse_type, lang=lang),
            'text_format': 'latex',
-            'page_idx': page_idx,
        }
    elif para_type == BlockType.Image:
-        para_content = {'type': 'image', 'page_idx': page_idx}
+        para_content = {'type': 'image'}
        for block in para_block['blocks']:
            if block['type'] == BlockType.ImageBody:
                para_content['img_path'] = join_path(
                    img_buket_path,
                    block['lines'][0]['spans'][0]['image_path'])
            if block['type'] == BlockType.ImageCaption:
-                para_content['img_caption'] = merge_para_with_text(block)
+                para_content['img_caption'] = merge_para_with_text(block, parse_type=parse_type, lang=lang)
            if block['type'] == BlockType.ImageFootnote:
-                para_content['img_footnote'] = merge_para_with_text(block)
+                para_content['img_footnote'] = merge_para_with_text(block, parse_type=parse_type, lang=lang)
    elif para_type == BlockType.Table:
-        para_content = {'type': 'table', 'page_idx': page_idx}
+        para_content = {'type': 'table'}
        for block in para_block['blocks']:
            if block['type'] == BlockType.TableBody:
                if block["lines"][0]["spans"][0].get('latex', ''):
@@ -308,9 +318,14 @@ def para_to_standard_format_v2(para_block, img_buket_path, page_idx):
                    para_content['table_body'] = f"\n\n{block['lines'][0]['spans'][0]['html']}\n\n"
                para_content['img_path'] = join_path(img_buket_path, block["lines"][0]["spans"][0]['image_path'])
            if block['type'] == BlockType.TableCaption:
-                para_content['table_caption'] = merge_para_with_text(block)
+                para_content['table_caption'] = merge_para_with_text(block, parse_type=parse_type, lang=lang)
            if block['type'] == BlockType.TableFootnote:
-                para_content['table_footnote'] = merge_para_with_text(block)
+                para_content['table_footnote'] = merge_para_with_text(block, parse_type=parse_type, lang=lang)
+    para_content['page_idx'] = page_idx
+    if drop_reason is not None:
+        para_content['drop_reason'] = drop_reason
    return para_content
@@ -394,13 +409,19 @@ def ocr_mk_mm_standard_format(pdf_info_dict: list):
 def union_make(pdf_info_dict: list,
               make_mode: str,
               drop_mode: str,
-               img_buket_path: str = ''):
+               img_buket_path: str = '',
+               parse_type: str = "auto",
+               lang=None):
    output_content = []
    for page_info in pdf_info_dict:
+        drop_reason_flag = False
+        drop_reason = None
        if page_info.get('need_drop', False):
            drop_reason = page_info.get('drop_reason')
            if drop_mode == DropMode.NONE:
                pass
+            elif drop_mode == DropMode.NONE_WITH_REASON:
+                drop_reason_flag = True
            elif drop_mode == DropMode.WHOLE_PDF:
                raise Exception((f'drop_mode is {DropMode.WHOLE_PDF} ,'
                                 f'drop_reason is {drop_reason}'))
@@ -417,16 +438,20 @@ def union_make(pdf_info_dict: list,
            continue
        if make_mode == MakeMode.MM_MD:
            page_markdown = ocr_mk_markdown_with_para_core_v2(
-                paras_of_layout, 'mm', img_buket_path)
+                paras_of_layout, 'mm', img_buket_path, parse_type=parse_type, lang=lang)
            output_content.extend(page_markdown)
        elif make_mode == MakeMode.NLP_MD:
            page_markdown = ocr_mk_markdown_with_para_core_v2(
-                paras_of_layout, 'nlp')
+                paras_of_layout, 'nlp', parse_type=parse_type, lang=lang)
            output_content.extend(page_markdown)
        elif make_mode == MakeMode.STANDARD_FORMAT:
            for para_block in paras_of_layout:
-                para_content = para_to_standard_format_v2(
-                    para_block, img_buket_path, page_idx)
+                if drop_reason_flag:
+                    para_content = para_to_standard_format_v2(
+                        para_block, img_buket_path, page_idx, parse_type=parse_type, lang=lang, drop_reason=drop_reason)
+                else:
+                    para_content = para_to_standard_format_v2(
+                        para_block, img_buket_path, page_idx, parse_type=parse_type, lang=lang)
                output_content.append(para_content)
    if make_mode in [MakeMode.MM_MD, MakeMode.NLP_MD]:
        return '\n\n'.join(output_content)
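The language check that this diff promotes to module level, and the new condition that decides when long words are split, can be exercised in isolation. A minimal sketch: `detect_language` follows the logic in the diff, while `should_split_long_words` is a hypothetical helper that only mirrors the branch inside `merge_para_with_text` and does not exist in the PR.

```python
import re


def detect_language(text):
    """Return 'en' when ASCII letters make up at least half of the text."""
    if len(text) == 0:
        return 'empty'
    en_length = sum(len(match) for match in re.findall(r'[a-zA-Z]+', text))
    return 'en' if en_length / len(text) >= 0.5 else 'unknown'


def should_split_long_words(content, parse_type="auto", lang=None):
    """Hypothetical helper: True only for English text parsed in OCR mode."""
    # An explicitly specified non-English language skips word splitting.
    if lang is not None and lang != 'en':
        return False
    return detect_language(content) == 'en' and parse_type == 'ocr'
```

With these definitions, `should_split_long_words("hello world", parse_type="ocr")` is true, whereas the same text with the default `parse_type="auto"` or with `lang="fr"` is left unsplit.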

--- a/magic_pdf/libs/MakeContentConfig.py
+++ b/magic_pdf/libs/MakeContentConfig.py
@@ -8,3 +8,4 @@ class DropMode:
    WHOLE_PDF = "whole_pdf"
    SINGLE_PAGE = "single_page"
    NONE = "none"
+    NONE_WITH_REASON = "none_with_reason"
--- a/magic_pdf/libs/__pycache__/__init__.cpython-312.pyc
+++ b/magic_pdf/libs/__pycache__/__init__.cpython-312.pyc
--- a/magic_pdf/libs/__pycache__/version.cpython-312.pyc
+++ b/magic_pdf/libs/__pycache__/version.cpython-312.pyc
--- a/magic_pdf/libs/boxbase.py
+++ b/magic_pdf/libs/boxbase.py
@@ -426,3 +426,22 @@ def bbox_distance(bbox1, bbox2):
    elif top:
        return y2 - y1b
    return 0.0
+def box_area(bbox):
+    return (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
+def get_overlap_area(bbox1, bbox2):
+    """Compute the overlap area of bbox1 and bbox2 (the absolute area, not a ratio of bbox1)."""
+    # Determine the coordinates of the intersection rectangle
+    x_left = max(bbox1[0], bbox2[0])
+    y_top = max(bbox1[1], bbox2[1])
+    x_right = min(bbox1[2], bbox2[2])
+    y_bottom = min(bbox1[3], bbox2[3])
+    if x_right < x_left or y_bottom < y_top:
+        return 0.0
+    # The area of overlap area
+    return (x_right - x_left) * (y_bottom - y_top)
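As written, `get_overlap_area` returns the absolute intersection area; dividing it by `box_area` gives the fraction of the first box that is covered. A minimal sketch, assuming boxes are `(x0, y0, x1, y1)` tuples; `overlap_ratio` is a hypothetical helper added for illustration only.

```python
def box_area(bbox):
    """Area of an (x0, y0, x1, y1) box."""
    return (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])


def get_overlap_area(bbox1, bbox2):
    """Absolute area of the intersection rectangle, or 0.0 if disjoint."""
    x_left = max(bbox1[0], bbox2[0])
    y_top = max(bbox1[1], bbox2[1])
    x_right = min(bbox1[2], bbox2[2])
    y_bottom = min(bbox1[3], bbox2[3])
    if x_right < x_left or y_bottom < y_top:
        return 0.0
    return (x_right - x_left) * (y_bottom - y_top)


def overlap_ratio(bbox1, bbox2):
    """Hypothetical helper: fraction of bbox1 covered by bbox2."""
    area = box_area(bbox1)
    if area <= 0:
        return 0.0  # Guard against degenerate boxes.
    return get_overlap_area(bbox1, bbox2) / area
```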
--- a/magic_pdf/libs/clean_memory.py
+++ b/magic_pdf/libs/clean_memory.py
+# Copyright (c) Opendatalab. All rights reserved.
+import torch
+import gc
+def clean_memory():
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+        torch.cuda.ipc_collect()
+    gc.collect()
\ No newline at end of file
--- a/magic_pdf/libs/config_reader.py
+++ b/magic_pdf/libs/config_reader.py
@@ -67,6 +67,18 @@ def get_local_models_dir():
        return models_dir
+def get_local_layoutreader_model_dir():
+    config = read_config()
+    layoutreader_model_dir = config.get("layoutreader-model-dir")
+    if layoutreader_model_dir is None or not os.path.exists(layoutreader_model_dir):
+        home_dir = os.path.expanduser("~")
+        layoutreader_at_modelscope_dir_path = os.path.join(home_dir, ".cache/modelscope/hub/ppaanngggg/layoutreader")
+        logger.warning(f"'layoutreader-model-dir' not exists, use {layoutreader_at_modelscope_dir_path} as default")
+        return layoutreader_at_modelscope_dir_path
+    else:
+        return layoutreader_model_dir
 def get_device():
    config = read_config()
    device = config.get("device-mode")
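The fallback in `get_local_layoutreader_model_dir` can be checked without a real `magic-pdf.json`. A minimal sketch, assuming the configuration is passed in as a plain dict rather than loaded via `read_config()`; `resolve_layoutreader_dir` and its `home_dir` parameter are hypothetical names introduced for illustration.

```python
import os

# Default location used by the diff when the configured directory is missing.
DEFAULT_SUBPATH = ".cache/modelscope/hub/ppaanngggg/layoutreader"


def resolve_layoutreader_dir(config, home_dir=None):
    """Return the configured directory if it exists, else the ModelScope cache path."""
    configured = config.get("layoutreader-model-dir")
    if configured is not None and os.path.exists(configured):
        return configured
    if home_dir is None:
        home_dir = os.path.expanduser("~")
    return os.path.join(home_dir, DEFAULT_SUBPATH)
```

Note that the fallback path is returned even when it does not exist on disk, which matches the behavior in the diff: the warning is logged and the caller is left to handle a missing directory.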

--- a/magic_pdf/libs/draw_bbox.py
+++ b/magic_pdf/libs/draw_bbox.py
@@ -33,7 +33,7 @@ def draw_bbox_without_number(i, bbox_list, page, rgb_config, fill_config):
            )  # Draw the rectangle
-def draw_bbox_with_number(i, bbox_list, page, rgb_config, fill_config):
+def draw_bbox_with_number(i, bbox_list, page, rgb_config, fill_config, draw_bbox=True):
    new_rgb = []
    for item in rgb_config:
        item = float(item) / 255
@@ -42,31 +42,31 @@ def draw_bbox_with_number(i, bbox_list, page, rgb_config, fill_config):
    for j, bbox in enumerate(page_data):
        x0, y0, x1, y1 = bbox
        rect_coords = fitz.Rect(x0, y0, x1, y1)  # Define the rectangle
-        if fill_config:
-            page.draw_rect(
-                rect_coords,
-                color=None,
-                fill=new_rgb,
-                fill_opacity=0.3,
-                width=0.5,
-                overlay=True,
-            )  # Draw the rectangle
-        else:
-            page.draw_rect(
-                rect_coords,
-                color=new_rgb,
-                fill=None,
-                fill_opacity=1,
-                width=0.5,
-                overlay=True,
-            )  # Draw the rectangle
+        if draw_bbox:
+            if fill_config:
+                page.draw_rect(
+                    rect_coords,
+                    color=None,
+                    fill=new_rgb,
+                    fill_opacity=0.3,
+                    width=0.5,
+                    overlay=True,
+                )  # Draw the rectangle
+            else:
+                page.draw_rect(
+                    rect_coords,
+                    color=new_rgb,
+                    fill=None,
+                    fill_opacity=1,
+                    width=0.5,
+                    overlay=True,
+                )  # Draw the rectangle
        page.insert_text(
-            (x0, y0 + 10), str(j + 1), fontsize=10, color=new_rgb
+            (x1+2, y0 + 10), str(j + 1), fontsize=10, color=new_rgb
        )  # Insert the index in the top left corner of the rectangle
 def draw_layout_bbox(pdf_info, pdf_bytes, out_path, filename):
-    layout_bbox_list = []
    dropped_bbox_list = []
    tables_list, tables_body_list = [], []
    tables_caption_list, tables_footnote_list = [], []
@@ -75,17 +75,19 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path, filename):
    titles_list = []
    texts_list = []
    interequations_list = []
+    lists_list = []
+    indexs_list = []
    for page in pdf_info:
-        page_layout_list = []
        page_dropped_list = []
        tables, tables_body, tables_caption, tables_footnote = [], [], [], []
        imgs, imgs_body, imgs_caption, imgs_footnote = [], [], [], []
        titles = []
        texts = []
        interequations = []
-        for layout in page['layout_bboxes']:
-            page_layout_list.append(layout['layout_bbox'])
-        layout_bbox_list.append(page_layout_list)
+        lists = []
+        indexs = []
        for dropped_bbox in page['discarded_blocks']:
            page_dropped_list.append(dropped_bbox['bbox'])
        dropped_bbox_list.append(page_dropped_list)
@@ -117,6 +119,11 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path, filename):
                texts.append(bbox)
            elif block['type'] == BlockType.InterlineEquation:
                interequations.append(bbox)
+            elif block['type'] == BlockType.List:
+                lists.append(bbox)
+            elif block['type'] == BlockType.Index:
+                indexs.append(bbox)
        tables_list.append(tables)
        tables_body_list.append(tables_body)
        tables_caption_list.append(tables_caption)
@@ -128,10 +135,22 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path, filename):
        titles_list.append(titles)
        texts_list.append(texts)
        interequations_list.append(interequations)
+        lists_list.append(lists)
+        indexs_list.append(indexs)
+    layout_bbox_list = []
+    for page in pdf_info:
+        page_block_list = []
+        for block in page['para_blocks']:
+            bbox = block['bbox']
+            page_block_list.append(bbox)
+        layout_bbox_list.append(page_block_list)
    pdf_docs = fitz.open('pdf', pdf_bytes)
    for i, page in enumerate(pdf_docs):
-        draw_bbox_with_number(i, layout_bbox_list, page, [255, 0, 0], False)
        draw_bbox_without_number(i, dropped_bbox_list, page, [158, 158, 158],
                                 True)
        draw_bbox_without_number(i, tables_list, page, [153, 153, 0],
@@ -146,12 +165,16 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path, filename):
        draw_bbox_without_number(i, imgs_body_list, page, [153, 255, 51], True)
        draw_bbox_without_number(i, imgs_caption_list, page, [102, 178, 255],
                                 True)
-        draw_bbox_with_number(i, imgs_footnote_list, page, [255, 178, 102],
+        draw_bbox_without_number(i, imgs_footnote_list, page, [255, 178, 102],
                              True),
        draw_bbox_without_number(i, titles_list, page, [102, 102, 255], True)
        draw_bbox_without_number(i, texts_list, page, [153, 0, 76], True)
        draw_bbox_without_number(i, interequations_list, page, [0, 255, 0],
                                 True)
+        draw_bbox_without_number(i, lists_list, page, [40, 169, 92], True)
+        draw_bbox_without_number(i, indexs_list, page, [40, 169, 92], True)
+        draw_bbox_with_number(i, layout_bbox_list, page, [255, 0, 0], False, draw_bbox=False)
    # Save the PDF
    pdf_docs.save(f'{out_path}/{filename}_layout.pdf')
@@ -211,9 +234,9 @@ def draw_span_bbox(pdf_info, pdf_bytes, out_path, filename):
        # Build the remaining useful_list
        for block in page['para_blocks']:
            if block['type'] in [
-                    BlockType.Text,
-                    BlockType.Title,
-                    BlockType.InterlineEquation,
+                BlockType.Text,
+                BlockType.Title,
+                BlockType.InterlineEquation,
            ]:
                for line in block['lines']:
                    for span in line['spans']:
@@ -232,10 +255,8 @@ def draw_span_bbox(pdf_info, pdf_bytes, out_path, filename):
    for i, page in enumerate(pdf_docs):
        # Get the data for the current page
        draw_bbox_without_number(i, text_list, page, [255, 0, 0], False)
-        draw_bbox_without_number(i, inline_equation_list, page, [0, 255, 0],
-                                 False)
-        draw_bbox_without_number(i, interline_equation_list, page, [0, 0, 255],
-                                 False)
+        draw_bbox_without_number(i, inline_equation_list, page, [0, 255, 0], False)
+        draw_bbox_without_number(i, interline_equation_list, page, [0, 0, 255], False)
        draw_bbox_without_number(i, image_list, page, [255, 204, 0], False)
        draw_bbox_without_number(i, table_list, page, [204, 0, 255], False)
        draw_bbox_without_number(i, dropped_list, page, [158, 158, 158], False)
@@ -244,7 +265,7 @@ def draw_span_bbox(pdf_info, pdf_bytes, out_path, filename):
    pdf_docs.save(f'{out_path}/{filename}_spans.pdf')
-def drow_model_bbox(model_list: list, pdf_bytes, out_path, filename):
+def draw_model_bbox(model_list: list, pdf_bytes, out_path, filename):
    dropped_bbox_list = []
    tables_body_list, tables_caption_list, tables_footnote_list = [], [], []
    imgs_body_list, imgs_caption_list, imgs_footnote_list = [], [], []
@@ -279,7 +300,7 @@ def drow_model_bbox(model_list: list, pdf_bytes, out_path, filename):
            elif layout_det['category_id'] == CategoryId.ImageCaption:
                imgs_caption.append(bbox)
            elif layout_det[
-                    'category_id'] == CategoryId.InterlineEquation_YOLO:
+                'category_id'] == CategoryId.InterlineEquation_YOLO:
                interequations.append(bbox)
            elif layout_det['category_id'] == CategoryId.Abandon:
                page_dropped_list.append(bbox)
@@ -316,3 +337,47 @@ def drow_model_bbox(model_list: list, pdf_bytes, out_path, filename):
    # Save the PDF
    pdf_docs.save(f'{out_path}/{filename}_model.pdf')
+def draw_line_sort_bbox(pdf_info, pdf_bytes, out_path, filename):
+    layout_bbox_list = []
+    for page in pdf_info:
+        page_line_list = []
+        for block in page['preproc_blocks']:
+            if block['type'] in ['text', 'title', 'interline_equation']:
+                for line in block['lines']:
+                    bbox = line['bbox']
+                    index = line['index']
+                    page_line_list.append({'index': index, 'bbox': bbox})
+            if block['type'] in ['table', 'image']:
+                bbox = block['bbox']
+                index = block['index']
+                page_line_list.append({'index': index, 'bbox': bbox})
+            # for line in block['lines']:
+            #     bbox = line['bbox']
+            #     index = line['index']
+            #     page_line_list.append({'index': index, 'bbox': bbox})
+        sorted_bboxes = sorted(page_line_list, key=lambda x: x['index'])
+        layout_bbox_list.append([sorted_bbox['bbox'] for sorted_bbox in sorted_bboxes])
+    pdf_docs = fitz.open('pdf', pdf_bytes)
+    for i, page in enumerate(pdf_docs):
+        draw_bbox_with_number(i, layout_bbox_list, page, [255, 0, 0], False)
+    pdf_docs.save(f'{out_path}/{filename}_line_sort.pdf')
+def draw_layout_sort_bbox(pdf_info, pdf_bytes, out_path, filename):
+    layout_bbox_list = []
+    for page in pdf_info:
+        page_block_list = []
+        for block in page['para_blocks']:
+            bbox = block['bbox']
+            page_block_list.append(bbox)
+        layout_bbox_list.append(page_block_list)
+    pdf_docs = fitz.open('pdf', pdf_bytes)
+    for i, page in enumerate(pdf_docs):
+        draw_bbox_with_number(i, layout_bbox_list, page, [255, 0, 0], False)
+    pdf_docs.save(f'{out_path}/{filename}_layout_sort.pdf')
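
The collection step of `draw_line_sort_bbox` above can be tried without PyMuPDF. The following is a minimal sketch, not the project's implementation: the helper name `collect_sorted_line_bboxes` and the sample data are hypothetical, and it assumes each page dict carries `preproc_blocks` with the `type`, `lines`, `index`, and `bbox` keys used in the diff. It gathers one bbox per line for text-like blocks and one per block for tables and images, then orders them by `index`, building a concrete list per page so the result can be indexed and iterated more than once.

```python
def collect_sorted_line_bboxes(pdf_info):
    """Return, for each page, a list of bboxes ordered by their 'index' field.

    Hypothetical helper that mirrors the collection loop in the diff above;
    the drawing step (fitz) is deliberately left out.
    """
    per_page = []
    for page in pdf_info:
        items = []
        for block in page['preproc_blocks']:
            if block['type'] in ('text', 'title', 'interline_equation'):
                # Text-like blocks contribute one entry per line.
                items.extend(
                    (line['index'], line['bbox']) for line in block['lines']
                )
            elif block['type'] in ('table', 'image'):
                # Tables and images contribute a single entry for the block.
                items.append((block['index'], block['bbox']))
        # Order by the reading-order index, then keep only the bboxes.
        items.sort(key=lambda item: item[0])
        per_page.append([bbox for _, bbox in items])
    return per_page


# Made-up sample input: one page with a two-line text block and an image.
sample_pdf_info = [
    {
        'preproc_blocks': [
            {
                'type': 'text',
                'lines': [
                    {'index': 2, 'bbox': [0, 20, 10, 30]},
                    {'index': 0, 'bbox': [0, 0, 10, 10]},
                ],
            },
            {'type': 'image', 'index': 1, 'bbox': [0, 10, 10, 20]},
        ]
    }
]

sorted_bboxes = collect_sorted_line_bboxes(sample_pdf_info)
print(sorted_bboxes)
```

The per-page lists returned here have the same shape as the `layout_bbox_list` that the diff passes to `draw_bbox_with_number`, so the number drawn next to each box would follow the sorted order.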