Merge pull request #838 from opendatalab/release-0.9.0

Release 0.9.0

3a42ebbf · Xiaomeng Zhao · GitHub · 765c6d77 · 14024793 · 3a42ebbf
Unverified commit 3a42ebbf, authored Nov 01, 2024 by Xiaomeng Zhao; committed by GitHub Nov 01, 2024
20 changed files
--- a/next_docs/zh_cn/index.rst
+++ b/next_docs/zh_cn/index.rst
+.. xtuner documentation master file, created by
+   sphinx-quickstart on Tue Jan  9 16:33:06 2024.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+欢迎来到 MinerU 的中文文档
+==============================================
+.. figure:: ./_static/image/logo.png
+  :align: center
+  :alt: mineru
+  :class: no-scaled-link
+.. raw:: html
+   <p style="text-align:center">
+   <strong> 一站式开源高质量数据提取工具
+   </strong>
+   </p>
+   <p style="text-align:center">
+   <script async defer src="https://buttons.github.io/buttons.js"></script>
+   <a class="github-button" href="https://github.com/opendatalab/MinerU" data-show-count="true" data-size="large" aria-label="Star">Star</a>
+   <a class="github-button" href="https://github.com/opendatalab/MinerU/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
+   <a class="github-button" href="https://github.com/opendatalab/MinerU/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
+   </p>
--- a/next_docs/zh_cn/make.bat
+++ b/next_docs/zh_cn/make.bat
+@ECHO OFF
+pushd %~dp0
+REM Command file for Sphinx documentation
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=.
+set BUILDDIR=_build
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.https://www.sphinx-doc.org/
+	exit /b 1
+)
+if "%1" == "" goto help
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+goto end
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+:end
+popd
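
Note on `make.bat`: on platforms without batch support, the script above reduces to a single `sphinx-build -M <target> <sourcedir> <builddir>` call. The Python sketch below builds the same argument list. The helper name `sphinx_command` is hypothetical and not part of this commit; it assumes `sphinx-build` is on `PATH` unless `SPHINXBUILD` is set, and it ignores the batch file's `%O%` variable.

```python
import os
import shlex


def sphinx_command(target="help", sourcedir=".", builddir="_build", env=None):
    """Return the argv list that make.bat would run for the given target.

    Hypothetical helper for illustration; mirrors the batch file's defaults.
    """
    env = os.environ if env is None else env
    # make.bat falls back to "sphinx-build" when SPHINXBUILD is unset or empty.
    builder = env.get("SPHINXBUILD") or "sphinx-build"
    # SPHINXOPTS carries extra options; split it the way a shell would.
    extra = shlex.split(env.get("SPHINXOPTS", ""))
    return [builder, "-M", target, sourcedir, builddir, *extra]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(sphinx_command("html")))
```

Passing the resulting list to `subprocess.run` would perform the actual build; the default target `help` matches what the batch file does when called without arguments.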
--- a/projects/README.md
+++ b/projects/README.md
@@ -3,3 +3,7 @@
 ## Project List
 - [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index
+- [gradio_app](./gradio_app/README.md): Build a web app based on gradio
+- [web_demo](./web_demo/README.md): MinerU online [demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/) localized deployment version
+- [web_api](./web_api/README.md): Web API Based on FastAPI
+- [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
--- a/projects/README_zh-CN.md
+++ b/projects/README_zh-CN.md
@@ -3,3 +3,7 @@
 ## 项目列表
 - [llama_index_rag](./llama_index_rag/README_zh-CN.md): 基于 llama_index 构建轻量级 RAG 系统
+- [gradio_app](./gradio_app/README_zh-CN.md): 基于 Gradio 的 Web 应用
+- [web_demo](./web_demo/README_zh-CN.md): MinerU在线[demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/)本地化部署版本
+- [web_api](./web_api/README.md): 基于 FastAPI 的 Web API
+- [multi_gpu](./multi_gpu/README.md): 基于 LitServe 的多 GPU 并行处理
--- a/projects/gradio_app/README.md
+++ b/projects/gradio_app/README.md
+## Installation
+MinerU(>=0.8.0)
+ > If you already have a functioning MinerU environment, you can skip this step.
+ > 
+[Deploy in CPU environment](https://github.com/opendatalab/MinerU?tab=readme-ov-file#quick-cpu-demo)
+[Deploy in GPU environment](https://github.com/opendatalab/MinerU?tab=readme-ov-file#using-gpu)
+Third-party Software
+```bash
+pip install gradio gradio-pdf
+```
+## Start Gradio App
+```bash
+python app.py
+```
+## Use Gradio App
+Access http://127.0.0.1:7860 in your web browser
\ No newline at end of file
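
Note on the README above: before opening http://127.0.0.1:7860 in a browser, it can help to confirm that `python app.py` is actually listening. The following is a minimal sketch using only the standard library; the function name `is_listening` is hypothetical, and the port is the one stated in the README.

```python
import socket


def is_listening(host="127.0.0.1", port=7860, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (refused, timeout, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Gradio app reachable:", is_listening())
```

This only checks that the TCP port accepts connections; it does not verify that the process behind it is the Gradio app.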
--- a/projects/gradio_app/README_zh-CN.md
+++ b/projects/gradio_app/README_zh-CN.md
+## 安装
+MinerU(>=0.8.0)
+ >如已有正常运行的MinerU环境则可以跳过此步骤
+> 
+[在CPU环境部署](https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md#%E4%BD%BF%E7%94%A8cpu%E5%BF%AB%E9%80%9F%E4%BD%93%E9%AA%8C)
+[在GPU环境部署](https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md#%E4%BD%BF%E7%94%A8gpu)
+第三方软件
+```bash
+pip install gradio gradio-pdf
+```
+## 启动gradio应用
+```bash
+python app.py
+```
+## 使用gradio应用
+在浏览器中访问 http://127.0.0.1:7860
\ No newline at end of file
--- a/app.py
+++ b/app.py
@@ -3,10 +3,12 @@
 import base64
 import os
 import time
+import uuid
 import zipfile
 from pathlib import Path
 import re
+import pymupdf
 from loguru import logger
 from magic_pdf.libs.hash_utils import compute_sha256
@@ -14,8 +16,6 @@ from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
 from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
 from magic_pdf.tools.common import do_parse, prepare_env
-os.system("pip install gradio")
-os.system("pip install gradio-pdf")
 import gradio as gr
 from gradio_pdf import PDF
@@ -25,13 +25,16 @@ def read_fn(path):
    return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
-def parse_pdf(doc_path, output_dir, end_page_id):
+def parse_pdf(doc_path, output_dir, end_page_id, is_ocr, layout_mode, formula_enable, table_enable, language):
    os.makedirs(output_dir, exist_ok=True)
    try:
        file_name = f"{str(Path(doc_path).stem)}_{time.time()}"
        pdf_data = read_fn(doc_path)
-        parse_method = "auto"
+        if is_ocr:
+            parse_method = "ocr"
+        else:
+            parse_method = "auto"
        local_image_dir, local_md_dir = prepare_env(output_dir, file_name, parse_method)
        do_parse(
            output_dir,
@@ -41,6 +44,10 @@ def parse_pdf(doc_path, output_dir, end_page_id):
            parse_method,
            False,
            end_page_id=end_page_id,
+            layout_model=layout_mode,
+            formula_enable=formula_enable,
+            table_enable=table_enable,
+            lang=language,
        )
        return local_md_dir, file_name
    except Exception as e:
@@ -92,9 +99,10 @@ def replace_image_with_base64(markdown_text, image_dir_path):
    return re.sub(pattern, replace, markdown_text)
-def to_markdown(file_path, end_pages):
+def to_markdown(file_path, end_pages, is_ocr, layout_mode, formula_enable, table_enable, language):
     # Get the recognized Markdown file and the path of the zip archive
-    local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1)
+    local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1, is_ocr,
+                                        layout_mode, formula_enable, table_enable, language)
    archive_zip_path = os.path.join("./output", compute_sha256(local_md_dir) + ".zip")
    zip_archive_success = compress_directory_to_zip(local_md_dir, archive_zip_path)
    if zip_archive_success == 0:
@@ -111,14 +119,6 @@ def to_markdown(file_path, end_pages):
    return md_content, txt_content, archive_zip_path, new_pdf_path
-# def show_pdf(file_path):
-#     with open(file_path, "rb") as f:
-#         base64_pdf = base64.b64encode(f.read()).decode('utf-8')
-#     pdf_display = f'<embed src="data:application/pdf;base64,{base64_pdf}" ' \
-#                   f'width="100%" height="1000" type="application/pdf">'
-#     return pdf_display
 latex_delimiters = [{"left": "$$", "right": "$$", "display": True},
                    {"left": '$', "right": '$', "display": False}]
@@ -141,16 +141,76 @@ model_init = init_model()
 logger.info(f"model_init: {model_init}")
+with open("header.html", "r") as file:
+    header = file.read()
+latin_lang = [
+        'af', 'az', 'bs', 'cs', 'cy', 'da', 'de', 'es', 'et', 'fr', 'ga', 'hr',
+        'hu', 'id', 'is', 'it', 'ku', 'la', 'lt', 'lv', 'mi', 'ms', 'mt', 'nl',
+        'no', 'oc', 'pi', 'pl', 'pt', 'ro', 'rs_latin', 'sk', 'sl', 'sq', 'sv',
+        'sw', 'tl', 'tr', 'uz', 'vi', 'french', 'german'
+]
+arabic_lang = ['ar', 'fa', 'ug', 'ur']
+cyrillic_lang = [
+        'ru', 'rs_cyrillic', 'be', 'bg', 'uk', 'mn', 'abq', 'ady', 'kbd', 'ava',
+        'dar', 'inh', 'che', 'lbe', 'lez', 'tab'
+]
+devanagari_lang = [
+        'hi', 'mr', 'ne', 'bh', 'mai', 'ang', 'bho', 'mah', 'sck', 'new', 'gom',
+        'sa', 'bgc'
+]
+other_lang = ['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka']
+all_lang = [""]
+all_lang.extend([*other_lang, *latin_lang, *arabic_lang, *cyrillic_lang, *devanagari_lang])
+def to_pdf(file_path):
+    with pymupdf.open(file_path) as f:
+        if f.is_pdf:
+            return file_path
+        else:
+            pdf_bytes = f.convert_to_pdf()
+            # Write the PDF bytes to <uuid>.pdf
+            # Generate a unique file name
+            unique_filename = f"{uuid.uuid4()}.pdf"
+            # Build the full file path
+            tmp_file_path = os.path.join(os.path.dirname(file_path), unique_filename)
+            # Write the bytes to the file
+            with open(tmp_file_path, 'wb') as tmp_pdf_file:
+                tmp_pdf_file.write(pdf_bytes)
+            return tmp_file_path
 if __name__ == "__main__":
    with gr.Blocks() as demo:
+        gr.HTML(header)
        with gr.Row():
            with gr.Column(variant='panel', scale=5):
-                pdf_show = gr.Markdown()
+                file = gr.File(label="Please upload a PDF or image", file_types=[".pdf", ".png", ".jpeg", ".jpg"])
                max_pages = gr.Slider(1, 10, 5, step=1, label="Max convert pages")
-                with gr.Row() as bu_flow:
+                with gr.Row():
+                    layout_mode = gr.Dropdown(["layoutlmv3", "doclayout_yolo"], label="Layout model", value="layoutlmv3")
+                    language = gr.Dropdown(all_lang, label="Language", value="")
+                with gr.Row():
+                    formula_enable = gr.Checkbox(label="Enable formula recognition", value=True)
+                    is_ocr = gr.Checkbox(label="Force enable OCR", value=False)
+                    table_enable = gr.Checkbox(label="Enable table recognition(test)", value=False)
+                with gr.Row():
                    change_bu = gr.Button("Convert")
-                    clear_bu = gr.ClearButton([pdf_show], value="Clear")
+                    clear_bu = gr.ClearButton(value="Clear")
-                pdf_show = PDF(label="Please upload pdf", interactive=True, height=800)
+                pdf_show = PDF(label="PDF preview", interactive=True, height=800)
+                with gr.Accordion("Examples:"):
+                    example_root = os.path.join(os.path.dirname(__file__), "examples")
+                    gr.Examples(
+                        examples=[os.path.join(example_root, _) for _ in os.listdir(example_root) if
+                                  _.endswith("pdf")],
+                        inputs=pdf_show
+                    )
            with gr.Column(variant='panel', scale=5):
                output_file = gr.File(label="convert result", interactive=False)
@@ -160,8 +220,9 @@ if __name__ == "__main__":
                                         latex_delimiters=latex_delimiters, line_breaks=True)
                    with gr.Tab("Markdown text"):
                        md_text = gr.TextArea(lines=45, show_copy_button=True)
-        change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages], outputs=[md, md_text, output_file, pdf_show])
+        file.upload(fn=to_pdf, inputs=file, outputs=pdf_show)
-        clear_bu.add([md, pdf_show, md_text, output_file])
+        change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages, is_ocr, layout_mode, formula_enable, table_enable, language],
+                        outputs=[md, md_text, output_file, pdf_show])
-    demo.launch()
+        clear_bu.add([file, md, pdf_show, md_text, output_file, is_ocr, table_enable, language])
+    demo.launch(server_name="0.0.0.0")
\ No newline at end of file
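
Note on the `app.py` changes above: three small decisions in this diff are easy to miss when reading the hunks. The "Force enable OCR" checkbox switches `parse_method` between `"ocr"` and `"auto"`, the slider value is passed as `end_pages - 1`, and non-PDF uploads are written to a uniquely named PDF next to the uploaded file. The sketch below isolates that logic with standard-library code only. All three function names are hypothetical, and reading `end_page_id` as a zero-based index is an inference from the `- 1` in the diff, not something the diff states.

```python
import os
import uuid


def select_parse_method(is_ocr):
    """Mirror the branch added to parse_pdf: force OCR, or use "auto"."""
    return "ocr" if is_ocr else "auto"


def end_page_id_from_slider(max_pages):
    """Convert the slider's page count to the value passed as end_page_id.

    Assumption: end_page_id is a zero-based index of the last page to parse.
    """
    return max_pages - 1


def sibling_pdf_path(file_path):
    """Build a unique .pdf path in the same directory as file_path.

    Follows the naming scheme in to_pdf; the conversion itself is omitted.
    """
    return os.path.join(os.path.dirname(file_path), f"{uuid.uuid4()}.pdf")


if __name__ == "__main__":
    print(select_parse_method(True), end_page_id_from_slider(5))
    print(sibling_pdf_path(os.path.join("uploads", "scan.png")))
```

Using `uuid.uuid4()` for the file name avoids collisions between concurrent uploads, at the cost of leaving converted files behind unless they are cleaned up separately.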
--- a/projects/gradio_app/examples/2list_1table.pdf
+++ b/projects/gradio_app/examples/2list_1table.pdf
--- a/projects/gradio_app/examples/3list_1table.pdf
+++ b/projects/gradio_app/examples/3list_1table.pdf
--- a/projects/gradio_app/examples/academic_paper_formula.pdf
+++ b/projects/gradio_app/examples/academic_paper_formula.pdf
--- a/projects/gradio_app/examples/academic_paper_img_formula.pdf
+++ b/projects/gradio_app/examples/academic_paper_img_formula.pdf
--- a/projects/gradio_app/examples/academic_paper_list.pdf
+++ b/projects/gradio_app/examples/academic_paper_list.pdf
--- a/projects/gradio_app/examples/complex_layout.pdf
+++ b/projects/gradio_app/examples/complex_layout.pdf
--- a/projects/gradio_app/examples/complex_layout_para_split_list.pdf
+++ b/projects/gradio_app/examples/complex_layout_para_split_list.pdf
--- a/projects/gradio_app/examples/garbled_formula.pdf
+++ b/projects/gradio_app/examples/garbled_formula.pdf
--- a/projects/gradio_app/examples/magazine_complex_layout_images_list.pdf
+++ b/projects/gradio_app/examples/magazine_complex_layout_images_list.pdf
--- a/projects/gradio_app/examples/scanned.pdf
+++ b/projects/gradio_app/examples/scanned.pdf
--- a/projects/gradio_app/header.html
+++ b/projects/gradio_app/header.html
+<html><head>
+  <link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.15.4/css/all.css">
+<style>
+  .link-block {
+    border: 1px solid transparent;
+    border-radius: 24px;
+    background-color: rgba(54, 54, 54, 1);
+    cursor: pointer !important;
+  }
+  .link-block:hover {
+    background-color: rgba(54, 54, 54, 0.75) !important;
+    cursor: pointer !important;
+  }
+  .external-link {
+    display: inline-flex;
+    align-items: center;
+    height: 36px;
+    line-height: 36px;
+    padding: 0 16px;
+    cursor: pointer !important;
+  }
+  .external-link,
+  .external-link:hover {
+    cursor: pointer !important;
+  }
+  a {
+    text-decoration: none;
+  }
+</style></head>
+<body>
+  <div style="
+      display: flex;
+      flex-direction: column;
+      justify-content: center;
+      align-items: center;
+      text-align: center;
+      background: linear-gradient(45deg, #007bff 0%, #0056b3 100%);
+      padding: 24px;
+      gap: 24px;
+      border-radius: 8px;
+    ">
+    <div style="
+        display: flex;
+        flex-direction: column;
+        align-items: center;
+        gap: 16px;
+      ">
+      <div style="display: flex; flex-direction: column; gap: 8px">
+        <h1 style="
+            font-size: 48px;
+            color: #fafafa;
+            margin: 0;
+            font-family: 'Trebuchet MS', 'Lucida Sans Unicode',
+              'Lucida Grande', 'Lucida Sans', Arial, sans-serif;
+          ">
+          MinerU: PDF Extraction Demo
+        </h1>
+      </div>
+    </div>
+    <p style="
+        margin: 0;
+        line-height: 1.6rem;
+        font-size: 16px;
+        color: #fafafa;
+        opacity: 0.8;
+      ">
+      A one-stop, open-source, high-quality data extraction tool, supports
+      PDF/webpage/e-book extraction.<br>
+    </p>
+    <style>
+      .link-block {
+        display: inline-block;
+      }
+      .link-block + .link-block {
+        margin-left: 20px;
+      }
+    </style>
+    <div class="column has-text-centered">
+      <div class="publication-links">
+        <!-- Code Link. -->
+        <span class="link-block">
+          <a href="https://github.com/opendatalab/MinerU" class="external-link button is-normal is-rounded is-dark" style="text-decoration: none; cursor: pointer">
+            <span class="icon" style="margin-right: 4px">
+              <i class="fab fa-github" style="color: white; margin-right: 4px"></i>
+            </span>
+            <span style="color: white">Code</span>
+          </a>
+        </span>
+        <!-- arXiv Link. -->
+        <span class="link-block">
+          <a href="https://arxiv.org/abs/2409.18839" class="external-link button is-normal is-rounded is-dark" style="text-decoration: none; cursor: pointer">
+            <span class="icon" style="margin-right: 8px">
+              <i class="fas fa-file" style="color: white"></i>
+            </span>
+            <span style="color: white">Paper</span>
+          </a>
+        </span>
+        <!-- Homepage Link. -->
+        <span class="link-block">
+          <a href="https://opendatalab.com/" class="external-link button is-normal is-rounded is-dark" style="text-decoration: none; cursor: pointer">
+            <span class="icon" style="margin-right: 8px">
+              <i class="fas fa-globe" style="color: white"></i>
+            </span>
+            <span style="color: white">Homepage</span>
+          </a>
+        </span>
+      </div>
+    </div>
+    <!-- New Demo Links -->
+  </div>
+</body></html>
\ No newline at end of file
--- a/projects/gradio_app/requirements.txt
+++ b/projects/gradio_app/requirements.txt
+magic-pdf[full]>=0.8.0
+gradio
+gradio-pdf
\ No newline at end of file
--- a/projects/llama_index_rag/README_zh-CN.md
+++ b/projects/llama_index_rag/README_zh-CN.md
@@ -70,6 +70,7 @@ pip install accelerate==0.33.0
 pip uninstall transformer-engine
 ```
 ## 示例
 ````bash
@@ -82,11 +83,14 @@ or
 docker-compose up -d
+# 配置环境变量
 export ES_USER=elastic
 export ES_PASSWORD=llama_index
 export ES_URL=http://127.0.0.1:9200
 export DASHSCOPE_API_KEY={some_key}
 DASHSCOPE_API_KEY 开通参考[文档](https://help.aliyun.com/zh/dashscope/opening-service)
 # 未导入数据，查询问题。返回通义千问默认答案
@@ -114,6 +118,7 @@ python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.p
 # 导入数据后，查询问题。通义千问模型会根据 RAG 系统的检索结果，结合上下文，给出答案。
 python query.py -q 'how about the rights of men'
 ## outputs