Commit 3a42ebbf, authored by Xiaomeng Zhao, committed by GitHub

Merge pull request #838 from opendatalab/release-0.9.0

Release 0.9.0
parents 765c6d77 14024793
.. xtuner documentation master file, created by
sphinx-quickstart on Tue Jan 9 16:33:06 2024.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to the MinerU Chinese Documentation
==============================================
.. figure:: ./_static/image/logo.png
:align: center
:alt: mineru
:class: no-scaled-link
.. raw:: html
<p style="text-align:center">
<strong> A one-stop, open-source, high-quality data extraction tool
</strong>
</p>
<p style="text-align:center">
<script async defer src="https://buttons.github.io/buttons.js"></script>
<a class="github-button" href="https://github.com/opendatalab/MinerU" data-show-count="true" data-size="large" aria-label="Star">Star</a>
<a class="github-button" href="https://github.com/opendatalab/MinerU/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/opendatalab/MinerU/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p>
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd
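The batch script above takes the executable from the `SPHINXBUILD` environment variable, falls back to `sphinx-build`, and passes its first argument as the `-M` target, or `help` when no argument is given. The following is a minimal Python sketch of the same command assembly. The helper name `sphinx_command` is hypothetical, and splitting `SPHINXOPTS` and `O` shell-style is an assumption made for simplicity. The sketch only builds the argument list and does not run Sphinx.

```python
import os
import shlex


def sphinx_command(target=None, env=None, source_dir=".", build_dir="_build"):
    """Build the argument list that make.bat would run (hypothetical helper)."""
    env = os.environ if env is None else env
    # Fall back to "sphinx-build" when SPHINXBUILD is unset or empty.
    sphinx_build = env.get("SPHINXBUILD") or "sphinx-build"
    # With no target, the script jumps to :help and runs "-M help".
    argv = [sphinx_build, "-M", target or "help", source_dir, build_dir]
    # SPHINXOPTS and O carry extra options; split them shell-style (assumption).
    for name in ("SPHINXOPTS", "O"):
        argv.extend(shlex.split(env.get(name, "")))
    return argv


if __name__ == "__main__":
    print(sphinx_command("html", env={}))
```

The resulting list could be passed to `subprocess.run` on a machine where Sphinx is installed.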
...@@ -3,3 +3,7 @@ ...@@ -3,3 +3,7 @@
## Project List ## Project List
- [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index - [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index
- [gradio_app](./gradio_app/README.md): Build a web app based on gradio
- [web_demo](./web_demo/README.md): Locally deployable version of the MinerU online [demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/)
- [web_api](./web_api/README.md): Web API based on FastAPI
- [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
...@@ -3,3 +3,7 @@ ...@@ -3,3 +3,7 @@
## Project List ## Project List
- [llama_index_rag](./llama_index_rag/README_zh-CN.md): Build a lightweight RAG system based on llama_index - [llama_index_rag](./llama_index_rag/README_zh-CN.md): Build a lightweight RAG system based on llama_index
- [gradio_app](./gradio_app/README_zh-CN.md): Web app based on Gradio
- [web_demo](./web_demo/README_zh-CN.md): Locally deployable version of the MinerU online [demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/)
- [web_api](./web_api/README.md): Web API based on FastAPI
- [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
## Installation
MinerU(>=0.8.0)
> If you already have a functioning MinerU environment, you can skip this step.
>
[Deploy in CPU environment](https://github.com/opendatalab/MinerU?tab=readme-ov-file#quick-cpu-demo)
[Deploy in GPU environment](https://github.com/opendatalab/MinerU?tab=readme-ov-file#using-gpu)
Third-party Software
```bash
pip install gradio gradio-pdf
```
## Start Gradio App
```bash
python app.py
```
## Use Gradio App
Access http://127.0.0.1:7860 in your web browser
\ No newline at end of file
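Before opening the browser, it can help to confirm that something is listening on the address given above. The following is a minimal sketch using only the standard library. The helper name `is_port_open` is hypothetical, and the check only establishes that a TCP connection succeeds, not that the listener is the Gradio app.

```python
import socket


def is_port_open(host="127.0.0.1", port=7860, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("Gradio app reachable:", is_port_open())
```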
## Installation
MinerU(>=0.8.0)
> If you already have a working MinerU environment, you can skip this step.
>
[Deploy in a CPU environment](https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md#%E4%BD%BF%E7%94%A8cpu%E5%BF%AB%E9%80%9F%E4%BD%93%E9%AA%8C)
[Deploy in a GPU environment](https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md#%E4%BD%BF%E7%94%A8gpu)
Third-party software
```bash
pip install gradio gradio-pdf
```
## Start the Gradio app
```bash
python app.py
```
## Use the Gradio app
Open http://127.0.0.1:7860 in your browser
\ No newline at end of file
...@@ -3,10 +3,12 @@ ...@@ -3,10 +3,12 @@
import base64 import base64
import os import os
import time import time
import uuid
import zipfile import zipfile
from pathlib import Path from pathlib import Path
import re import re
import pymupdf
from loguru import logger from loguru import logger
from magic_pdf.libs.hash_utils import compute_sha256 from magic_pdf.libs.hash_utils import compute_sha256
...@@ -14,8 +16,6 @@ from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter ...@@ -14,8 +16,6 @@ from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.tools.common import do_parse, prepare_env from magic_pdf.tools.common import do_parse, prepare_env
os.system("pip install gradio")
os.system("pip install gradio-pdf")
import gradio as gr import gradio as gr
from gradio_pdf import PDF from gradio_pdf import PDF
...@@ -25,13 +25,16 @@ def read_fn(path): ...@@ -25,13 +25,16 @@ def read_fn(path):
return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN) return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
def parse_pdf(doc_path, output_dir, end_page_id): def parse_pdf(doc_path, output_dir, end_page_id, is_ocr, layout_mode, formula_enable, table_enable, language):
os.makedirs(output_dir, exist_ok=True) os.makedirs(output_dir, exist_ok=True)
try: try:
file_name = f"{str(Path(doc_path).stem)}_{time.time()}" file_name = f"{str(Path(doc_path).stem)}_{time.time()}"
pdf_data = read_fn(doc_path) pdf_data = read_fn(doc_path)
parse_method = "auto" if is_ocr:
parse_method = "ocr"
else:
parse_method = "auto"
local_image_dir, local_md_dir = prepare_env(output_dir, file_name, parse_method) local_image_dir, local_md_dir = prepare_env(output_dir, file_name, parse_method)
do_parse( do_parse(
output_dir, output_dir,
...@@ -41,6 +44,10 @@ def parse_pdf(doc_path, output_dir, end_page_id): ...@@ -41,6 +44,10 @@ def parse_pdf(doc_path, output_dir, end_page_id):
parse_method, parse_method,
False, False,
end_page_id=end_page_id, end_page_id=end_page_id,
layout_model=layout_mode,
formula_enable=formula_enable,
table_enable=table_enable,
lang=language,
) )
return local_md_dir, file_name return local_md_dir, file_name
except Exception as e: except Exception as e:
...@@ -92,9 +99,10 @@ def replace_image_with_base64(markdown_text, image_dir_path): ...@@ -92,9 +99,10 @@ def replace_image_with_base64(markdown_text, image_dir_path):
return re.sub(pattern, replace, markdown_text) return re.sub(pattern, replace, markdown_text)
def to_markdown(file_path, end_pages): def to_markdown(file_path, end_pages, is_ocr, layout_mode, formula_enable, table_enable, language):
# Get the recognized md file and the zip archive path # Get the recognized md file and the zip archive path
local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1) local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1, is_ocr,
layout_mode, formula_enable, table_enable, language)
archive_zip_path = os.path.join("./output", compute_sha256(local_md_dir) + ".zip") archive_zip_path = os.path.join("./output", compute_sha256(local_md_dir) + ".zip")
zip_archive_success = compress_directory_to_zip(local_md_dir, archive_zip_path) zip_archive_success = compress_directory_to_zip(local_md_dir, archive_zip_path)
if zip_archive_success == 0: if zip_archive_success == 0:
...@@ -111,14 +119,6 @@ def to_markdown(file_path, end_pages): ...@@ -111,14 +119,6 @@ def to_markdown(file_path, end_pages):
return md_content, txt_content, archive_zip_path, new_pdf_path return md_content, txt_content, archive_zip_path, new_pdf_path
# def show_pdf(file_path):
# with open(file_path, "rb") as f:
# base64_pdf = base64.b64encode(f.read()).decode('utf-8')
# pdf_display = f'<embed src="data:application/pdf;base64,{base64_pdf}" ' \
# f'width="100%" height="1000" type="application/pdf">'
# return pdf_display
latex_delimiters = [{"left": "$$", "right": "$$", "display": True}, latex_delimiters = [{"left": "$$", "right": "$$", "display": True},
{"left": '$', "right": '$', "display": False}] {"left": '$', "right": '$', "display": False}]
...@@ -141,16 +141,76 @@ model_init = init_model() ...@@ -141,16 +141,76 @@ model_init = init_model()
logger.info(f"model_init: {model_init}") logger.info(f"model_init: {model_init}")
with open("header.html", "r") as file:
header = file.read()
latin_lang = [
'af', 'az', 'bs', 'cs', 'cy', 'da', 'de', 'es', 'et', 'fr', 'ga', 'hr',
'hu', 'id', 'is', 'it', 'ku', 'la', 'lt', 'lv', 'mi', 'ms', 'mt', 'nl',
'no', 'oc', 'pi', 'pl', 'pt', 'ro', 'rs_latin', 'sk', 'sl', 'sq', 'sv',
'sw', 'tl', 'tr', 'uz', 'vi', 'french', 'german'
]
arabic_lang = ['ar', 'fa', 'ug', 'ur']
cyrillic_lang = [
'ru', 'rs_cyrillic', 'be', 'bg', 'uk', 'mn', 'abq', 'ady', 'kbd', 'ava',
'dar', 'inh', 'che', 'lbe', 'lez', 'tab'
]
devanagari_lang = [
'hi', 'mr', 'ne', 'bh', 'mai', 'ang', 'bho', 'mah', 'sck', 'new', 'gom',
'sa', 'bgc'
]
other_lang = ['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka']
all_lang = [""]
all_lang.extend([*other_lang, *latin_lang, *arabic_lang, *cyrillic_lang, *devanagari_lang])
def to_pdf(file_path):
with pymupdf.open(file_path) as f:
if f.is_pdf:
return file_path
else:
pdf_bytes = f.convert_to_pdf()
# Write the PDF bytes to a uuid-named .pdf file
# Generate a unique file name
unique_filename = f"{uuid.uuid4()}.pdf"
# Build the full file path
tmp_file_path = os.path.join(os.path.dirname(file_path), unique_filename)
# Write the byte data to the file
with open(tmp_file_path, 'wb') as tmp_pdf_file:
tmp_pdf_file.write(pdf_bytes)
return tmp_file_path
if __name__ == "__main__": if __name__ == "__main__":
with gr.Blocks() as demo: with gr.Blocks() as demo:
gr.HTML(header)
with gr.Row(): with gr.Row():
with gr.Column(variant='panel', scale=5): with gr.Column(variant='panel', scale=5):
pdf_show = gr.Markdown() file = gr.File(label="Please upload a PDF or image", file_types=[".pdf", ".png", ".jpeg", ".jpg"])
max_pages = gr.Slider(1, 10, 5, step=1, label="Max convert pages") max_pages = gr.Slider(1, 10, 5, step=1, label="Max convert pages")
with gr.Row() as bu_flow: with gr.Row():
layout_mode = gr.Dropdown(["layoutlmv3", "doclayout_yolo"], label="Layout model", value="layoutlmv3")
language = gr.Dropdown(all_lang, label="Language", value="")
with gr.Row():
formula_enable = gr.Checkbox(label="Enable formula recognition", value=True)
is_ocr = gr.Checkbox(label="Force enable OCR", value=False)
table_enable = gr.Checkbox(label="Enable table recognition(test)", value=False)
with gr.Row():
change_bu = gr.Button("Convert") change_bu = gr.Button("Convert")
clear_bu = gr.ClearButton([pdf_show], value="Clear") clear_bu = gr.ClearButton(value="Clear")
pdf_show = PDF(label="Please upload pdf", interactive=True, height=800) pdf_show = PDF(label="PDF preview", interactive=True, height=800)
with gr.Accordion("Examples:"):
example_root = os.path.join(os.path.dirname(__file__), "examples")
gr.Examples(
examples=[os.path.join(example_root, _) for _ in os.listdir(example_root) if
_.endswith("pdf")],
inputs=pdf_show
)
with gr.Column(variant='panel', scale=5): with gr.Column(variant='panel', scale=5):
output_file = gr.File(label="convert result", interactive=False) output_file = gr.File(label="convert result", interactive=False)
...@@ -160,8 +220,9 @@ if __name__ == "__main__": ...@@ -160,8 +220,9 @@ if __name__ == "__main__":
latex_delimiters=latex_delimiters, line_breaks=True) latex_delimiters=latex_delimiters, line_breaks=True)
with gr.Tab("Markdown text"): with gr.Tab("Markdown text"):
md_text = gr.TextArea(lines=45, show_copy_button=True) md_text = gr.TextArea(lines=45, show_copy_button=True)
change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages], outputs=[md, md_text, output_file, pdf_show]) file.upload(fn=to_pdf, inputs=file, outputs=pdf_show)
clear_bu.add([md, pdf_show, md_text, output_file]) change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages, is_ocr, layout_mode, formula_enable, table_enable, language],
outputs=[md, md_text, output_file, pdf_show])
demo.launch() clear_bu.add([file, md, pdf_show, md_text, output_file, is_ocr, table_enable, language])
demo.launch(server_name="0.0.0.0")
\ No newline at end of file
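The three small decisions this diff adds to `app.py` can be sketched in isolation: choosing the parse method from the OCR checkbox, converting the 1-based page slider to `end_page_id`, and naming the converted PDF. The function names below are hypothetical and the sketch calls neither `magic_pdf` nor Gradio. It only mirrors the logic visible in the diff.

```python
import os
import uuid


def choose_parse_method(is_ocr):
    """Return "ocr" when the force-OCR checkbox is set, otherwise "auto"."""
    return "ocr" if is_ocr else "auto"


def slider_to_end_page_id(max_pages):
    """The slider counts pages from 1; the diff passes max_pages - 1."""
    return max_pages - 1


def unique_pdf_path(source_path):
    """Place a uuid-named .pdf next to the uploaded file, as to_pdf does."""
    return os.path.join(os.path.dirname(source_path), f"{uuid.uuid4()}.pdf")


if __name__ == "__main__":
    print(choose_parse_method(True), slider_to_end_page_id(5))
    print(unique_pdf_path("/tmp/uploads/scan.png"))
```

Writing the converted file next to the upload keeps it in a directory the app can already read. The uuid-based name avoids collisions between uploads.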
<html><head>
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.15.4/css/all.css">
<style>
.link-block {
border: 1px solid transparent;
border-radius: 24px;
background-color: rgba(54, 54, 54, 1);
cursor: pointer !important;
}
.link-block:hover {
background-color: rgba(54, 54, 54, 0.75) !important;
cursor: pointer !important;
}
.external-link {
display: inline-flex;
align-items: center;
height: 36px;
line-height: 36px;
padding: 0 16px;
cursor: pointer !important;
}
.external-link,
.external-link:hover {
cursor: pointer !important;
}
a {
text-decoration: none;
}
</style></head>
<body>
<div style="
display: flex;
flex-direction: column;
justify-content: center;
align-items: center;
text-align: center;
background: linear-gradient(45deg, #007bff 0%, #0056b3 100%);
padding: 24px;
gap: 24px;
border-radius: 8px;
">
<div style="
display: flex;
flex-direction: column;
align-items: center;
gap: 16px;
">
<div style="display: flex; flex-direction: column; gap: 8px">
<h1 style="
font-size: 48px;
color: #fafafa;
margin: 0;
font-family: 'Trebuchet MS', 'Lucida Sans Unicode',
'Lucida Grande', 'Lucida Sans', Arial, sans-serif;
">
MinerU: PDF Extraction Demo
</h1>
</div>
</div>
<p style="
margin: 0;
line-height: 1.6rem;
font-size: 16px;
color: #fafafa;
opacity: 0.8;
">
A one-stop, open-source, high-quality data extraction tool, supports
PDF/webpage/e-book extraction.<br>
</p>
<style>
.link-block {
display: inline-block;
}
.link-block + .link-block {
margin-left: 20px;
}
</style>
<div class="column has-text-centered">
<div class="publication-links">
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/opendatalab/MinerU" class="external-link button is-normal is-rounded is-dark" style="text-decoration: none; cursor: pointer">
<span class="icon" style="margin-right: 4px">
<i class="fab fa-github" style="color: white; margin-right: 4px"></i>
</span>
<span style="color: white">Code</span>
</a>
</span>
<!-- arXiv Link. -->
<span class="link-block">
<a href="https://arxiv.org/abs/2409.18839" class="external-link button is-normal is-rounded is-dark" style="text-decoration: none; cursor: pointer">
<span class="icon" style="margin-right: 8px">
<i class="fas fa-file" style="color: white"></i>
</span>
<span style="color: white">Paper</span>
</a>
</span>
<!-- Homepage Link. -->
<span class="link-block">
<a href="https://opendatalab.com/" class="external-link button is-normal is-rounded is-dark" style="text-decoration: none; cursor: pointer">
<span class="icon" style="margin-right: 8px">
<i class="fas fa-globe" style="color: white"></i>
</span>
<span style="color: white">Homepage</span>
</a>
</span>
</div>
</div>
<!-- New Demo Links -->
</div>
</body></html>
\ No newline at end of file
magic-pdf[full]>=0.8.0
gradio
gradio-pdf
\ No newline at end of file
...@@ -70,6 +70,7 @@ pip install accelerate==0.33.0 ...@@ -70,6 +70,7 @@ pip install accelerate==0.33.0
pip uninstall transformer-engine pip uninstall transformer-engine
``` ```
## Example ## Example
````bash ````bash
...@@ -82,11 +83,14 @@ or ...@@ -82,11 +83,14 @@ or
docker-compose up -d docker-compose up -d
# Configure environment variables
export ES_USER=elastic export ES_USER=elastic
export ES_PASSWORD=llama_index export ES_PASSWORD=llama_index
export ES_URL=http://127.0.0.1:9200 export ES_URL=http://127.0.0.1:9200
export DASHSCOPE_API_KEY={some_key} export DASHSCOPE_API_KEY={some_key}
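To obtain a DASHSCOPE_API_KEY, see the [documentation](https://help.aliyun.com/zh/dashscope/opening-service) To obtain a DASHSCOPE_API_KEY, see the [documentation](https://help.aliyun.com/zh/dashscope/opening-service)
# Query before importing any data; returns Tongyi Qianwen's default answer # Query before importing any data; returns Tongyi Qianwen's default answer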
...@@ -114,6 +118,7 @@ python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.p ...@@ -114,6 +118,7 @@ python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.p
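# Query after importing data. The Tongyi Qianwen model answers using the RAG system's retrieval results together with the context. # Query after importing data. The Tongyi Qianwen model answers using the RAG system's retrieval results together with the context.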
python query.py -q 'how about the rights of men' python query.py -q 'how about the rights of men'
## outputs ## outputs
......