Commit 0fc1daac by luopl: Initial commit
*.js linguist-vendored
*.mjs linguist-vendored
*.html linguist-documentation
*.css linguist-vendored
*.scss linguist-vendored
*.tar
*.tar.gz
*.zip
venv*/
envs/
slurm_logs/
sync1.sh
data_preprocess_pj1
data-preparation1
__pycache__
*.log
*.pyc
.vscode
debug/
*.ipynb
.idea
# vscode history
.history
.DS_Store
.env
bad_words/
bak/
app/tests/*
temp/
tmp/
tmp
ocr_demo
.coveragerc
/app/common/__init__.py
/magic_pdf/config/__init__.py
source.dev.env
projects/web/node_modules
projects/web/dist
projects/web_demo/web_demo/static/
cli_debug/
debug_utils/
# sphinx docs
_build/
output/
# MinerU Contributor License Agreement
In order to clarify the intellectual property license granted with Contributions from any person or entity, the open source project MinerU ("MinerU") must have a Contributor License Agreement (CLA) on file that has been signed by each Contributor, indicating agreement to the license terms below. This license is for your protection as a Contributor as well as the protection of MinerU and its users; it does not change your rights to use your own Contributions for any other purpose.
You accept and agree to the following terms and conditions for Your present and future Contributions submitted to MinerU. Except for the license granted herein to MinerU and recipients of software distributed by MinerU, You reserve all right, title, and interest in and to Your Contributions.
1. Definitions. "You" (or "Your") shall mean the copyright owner or legal entity authorized by the copyright owner that is making this Agreement with MinerU. For legal entities, the entity making a Contribution and all other entities that control, are controlled by, or are under common control with that entity are considered to be a single Contributor. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "Contribution" shall mean the code, documentation or any original work of authorship, including any modifications or additions to an existing work, that is intentionally submitted by You to MinerU for inclusion in, or documentation of, any of the products owned or managed by MinerU (the "Work"). For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to MinerU or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, MinerU for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by You as "Not a Contribution."
2. Grant of Copyright License. Subject to the terms and conditions of this Agreement, You hereby grant to MinerU and to recipients of software distributed by MinerU a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense, and distribute Your Contributions and such derivative works.
3. Grant of Patent License. Subject to the terms and conditions of this Agreement, You hereby grant to MinerU and to recipients of software distributed by MinerU a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by You that are necessarily infringed by Your Contribution(s) alone or by combination of Your Contribution(s) with the Work to which such Contribution(s) was submitted. If any entity institutes patent litigation against You or any other entity (including a cross-claim or counterclaim in a lawsuit) alleging that Your Contribution, or the Work to which You have contributed, constitutes direct or contributory patent infringement, then any patent licenses granted to that entity under this Agreement for that Contribution or Work shall terminate as of the date such litigation is filed.
4. You represent that You are legally entitled to grant the above license. If You are an entity, You represent further that each of Your employees designated by You is authorized to submit Contributions on behalf of You. If You are an individual and Your employer(s) has rights to intellectual property that You create that includes Your Contributions, You represent further that You have received permission to make Contributions on behalf of that employer, that Your employer has waived such rights for Your Contributions to MinerU, or that Your employer has executed a separate CLA with MinerU.
5. If you do post content or submit material on MinerU and unless we indicate otherwise, you grant MinerU a nonexclusive, royalty-free, perpetual, irrevocable, and fully sublicensable right to use, reproduce, modify, adapt, publish, perform, translate, create derivative works from, distribute, and display such content throughout the world in any media. You grant MinerU and sublicensees the right to use your GitHub Public Profile, including but not limited to name, that you submit in connection with such content. You represent and warrant that you own or otherwise control all of the rights to the content that you post; that the content is accurate; that use of the content you supply does not violate this policy and will not cause injury to any person or entity; and that you will indemnify MinerU for all claims resulting from content you supply. MinerU has the right but not the obligation to monitor and edit or remove any activity or content. MinerU takes no responsibility and assumes no liability for any content posted by you or any third party.
6. You represent that each of Your Contributions is Your original creation. Should You wish to submit work that is not Your original creation, You may submit it to MinerU separately from any Contribution, identifying the complete details of its source and of any license or other restriction (including, but not limited to, related patents, trademarks, and license agreements) of which You are personally aware, and conspicuously marking the work as "Submitted on behalf of a third-party: [named here]".
7. You are not expected to provide support for Your Contributions, except to the extent You desire to provide support. You may provide support for free, for a fee, or not at all. Unless required by applicable law or agreed to in writing, You provide Your Contributions on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE.
8. You agree to notify MinerU of any facts or circumstances of which You become aware that would make these representations inaccurate in any respect.
9. MinerU reserves the right to update or change this Agreement at any time, by posting the most current version of the Agreement on MinerU with a new Effective Date shown (currently Jul. 24th, 2024). All such changes in the Agreement are effective from the Effective Date. Your continued use of MinerU after we post any such changes signifies your agreement to those changes. If you do not agree to the then-current Agreement, you must immediately discontinue using MinerU.
# MinerU
## Paper
`MinerU: An Open-Source Solution for Precise Document Content Extraction`
- https://arxiv.org/abs/2409.18839
## Model Architecture
MinerU is a powerful PDF document content extraction tool. Built on the advanced PDF-Extract-Kit model library, it can extract content effectively from a wide variety of documents.
MinerU's framework is simple and efficient, consisting of four stages: document preprocessing, document content parsing, document content post-processing, and format conversion.
<div align=center>
<img src="./assets/workflow.png"/>
</div>
## Algorithm
MinerU's processing pipeline is as follows:
(1) Document preprocessing
Document preprocessing has two main goals. The first is to filter out PDF files that cannot be processed, such as non-PDF files, encrypted documents, and password-protected files, ensuring the rest of the pipeline runs smoothly.
The second is to extract the PDF document's metadata, which plays an important role in later processing stages.
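For illustration only, a minimal version of this filtering step might look like the sketch below. It uses pypdfium2, the library behind the `convert_pdf_bytes_to_bytes_by_pypdfium2` helper in the bundled demo; the function here is hypothetical and is not MinerU's actual preprocessing code.
```python
# Hypothetical sketch of the preprocessing stage -- not MinerU's real implementation.
import pypdfium2 as pdfium

def preprocess(pdf_bytes: bytes):
    """Reject unprocessable input and collect basic metadata."""
    try:
        # Raises PdfiumError for non-PDF input or password-protected files
        doc = pdfium.PdfDocument(pdf_bytes)
    except pdfium.PdfiumError as exc:
        return None, f"unprocessable file: {exc}"
    # Minimal metadata for later stages (page count as an example)
    return {"page_count": len(doc)}, None
```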
(2) Document content parsing
PDF-Extract-Kit is the core model library MinerU uses to parse documents. It contains a range of advanced open-source PDF parsing algorithms and, unlike other open-source algorithm libraries, is dedicated to delivering both accuracy and speed on diverse real-world data.
When existing open-source algorithms in a particular domain cannot meet practical needs, PDF-Extract-Kit builds high-quality, diverse datasets through data engineering and uses them to further fine-tune the models, significantly improving their robustness across different kinds of data.
(3) Document content post-processing
The post-processing stage mainly addresses content ordering. The text, image, table, and formula boxes output by the models may overlap,
and text lines obtained via OCR or APIs frequently overlap as well, which makes ordering the text and other elements a significant challenge.
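As a simplified illustration of the ordering problem, the sketch below drops boxes that are mostly contained in another box and then sorts the rest into a rough top-to-bottom, left-to-right reading order. This is a toy heuristic, not MinerU's actual post-processing algorithm.
```python
# Toy reading-order heuristic -- MinerU's real post-processing is far more involved.
def overlap_ratio(a, b):
    """Fraction of box a's area covered by box b; boxes are (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = max(1e-6, (a[2] - a[0]) * (a[3] - a[1]))
    return ix * iy / area_a

def order_blocks(blocks, contain_thresh=0.8):
    """blocks: dicts with a 'bbox' key; returns deduplicated blocks in reading order."""
    kept = []
    for blk in blocks:
        # Drop a box that is mostly contained in one we already kept
        if any(overlap_ratio(blk["bbox"], k["bbox"]) > contain_thresh for k in kept):
            continue
        kept.append(blk)
    # Rough reading order: bucket y into 10-unit bands, then sort left to right
    return sorted(kept, key=lambda b: (round(b["bbox"][1], -1), b["bbox"][0]))
```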
(4) Format conversion
Finally, in the format conversion stage, MinerU converts the processed PDF data into the machine-readable format the user requires (such as Markdown or JSON).
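For intuition, a minimal converter from a parsed content list to Markdown could look like the sketch below. The block schema used here (`type`, `text`, `img_path`) is assumed for illustration; in MinerU the actual conversion is performed by the `union_make` functions used in the demo script.
```python
# Assumed block schema for illustration; MinerU's real conversion is union_make.
def content_list_to_markdown(content_list: list[dict]) -> str:
    lines = []
    for block in content_list:
        kind = block.get("type")
        if kind == "text":
            lines.append(block.get("text", ""))
        elif kind == "image":
            # Emit a Markdown image reference for extracted figures
            lines.append(f"![]({block.get('img_path', '')})")
        elif kind == "equation":
            # Render formulas as display math
            lines.append(f"$$\n{block.get('text', '')}\n$$")
    return "\n\n".join(lines)
```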
## Environment Setup
### Docker (Method 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
# Replace <your IMAGE ID> with the image ID of the Docker image pulled above
docker run -it --name mineru --shm-size=1024G --device=/dev/kfd --device=/dev/dri/ --privileged --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v $PWD/MinerU_pytorch:/home/MinerU_pytorch <your IMAGE ID> /bin/bash
cd /home/MinerU_pytorch
pip install -e .[core] -i https://mirrors.aliyun.com/pypi/simple/
pip install numpy==1.24.3
pip install torchvision-0.19.1+das.opt2.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
pip install torch-2.4.1+das.opt2.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
pip install triton-3.0.0+das.opt4.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
cd sglang-v0.4.6.post5.dev/sgl-kernel
python setup_hip.py install
cd ..
pip install -e "python[all_hip]"
```
### Dockerfile (Method 2)
```
cd /home/MinerU_pytorch
docker build --no-cache -t mineru:latest .
docker run -it --name MinerU_test --shm-size=1024G --device=/dev/kfd --device=/dev/dri/ --privileged --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v $PWD/MinerU_pytorch:/home/MinerU_pytorch mineru:latest /bin/bash
cd /home/MinerU_pytorch
pip install -e .[core] -i https://mirrors.aliyun.com/pypi/simple/
pip install numpy==1.24.3
pip install torchvision-0.19.1+das.opt2.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
pip install torch-2.4.1+das.opt2.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
pip install triton-3.0.0+das.opt4.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
cd sglang-v0.4.6.post5.dev/sgl-kernel
python setup_hip.py install
cd ..
pip install -e "python[all_hip]"
```
### Anaconda (Method 3)
1. The special deep-learning libraries this project requires for DCU GPUs can be downloaded and installed from the SourceFind developer community (光合开发者社区):
- https://developer.sourcefind.cn/tool/
```
DTK驱动:dtk25.04
python:python3.10
torch:2.4.1
torchvision:0.19.1
triton:3.0.0
flash-attn:2.6.1
vllm:0.7.2
```
`Tip: the DTK driver, Python, torch, and other DCU-related tool versions listed above must correspond to one another exactly.`
2. Install the remaining, non-special libraries according to requirements.txt:
```
cd /home/MinerU_pytorch
pip install -e .[core] -i https://mirrors.aliyun.com/pypi/simple/
pip install torchvision-0.19.1+das.opt2.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
pip install torch-2.4.1+das.opt2.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
pip install triton-3.0.0+das.opt4.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
pip install lmslim-0.2.1+das.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
pip install flash_attn-2.6.1+das.opt4.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
pip install vllm-0.7.2+das.opt1.dtk2504-cp310-cp310-manylinux_2_28_x86_64.whl
cd sglang-v0.4.6.post5.dev/sgl-kernel
python setup_hip.py install
cd ..
pip install -e "python[all_hip]"
pip install pytest==8.3.5 pytest-asyncio==0.26.0 numpy==1.24.3
pip install amdsmi-24.5.3+02cbffb.dirty-py3-none-any.whl
```
## Dataset
`None`
## Training
`None`
## Inference
Model source configuration:
```
# Add an HF mirror to simplify model downloads
#export HF_ENDPOINT=https://hf-mirror.com
# By default, the required models are downloaded automatically from HuggingFace on the first run.
# If HuggingFace is not reachable, switch the model source as follows:
mineru -p <input_path> -o <output_path> --source modelscope
# Or set an environment variable:
export MINERU_MODEL_SOURCE=modelscope
# To use local models, select and download them with the interactive command-line tool:
mineru-models-download --help
# After the download completes, the model path is printed in the current terminal and written automatically to mineru.json in the user's home directory
```
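The same model-source switch can also be made from Python before calling MinerU, which is exactly what the bundled demo script does:
```python
import os

# Use ModelScope instead of HuggingFace for model downloads,
# as demo.py does for proxy-free access.
os.environ["MINERU_MODEL_SOURCE"] = "modelscope"
```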
### Single Node, Single GPU
```
# Run the pipeline backend
cd /home/MinerU_pytorch
HIP_VISIBLE_DEVICES=0 python demo/demo.py
# Use sglang to accelerate VLM model inference
# via the sglang server/client mode
# Note: the --attention-backend triton option is required at runtime; other kernels are not supported
HIP_VISIBLE_DEVICES=0 mineru-sglang-server --attention-backend triton --port 30000
# Use the client in another terminal:
mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1:30000
```
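Before pointing the client at the server, it can help to confirm that the server is up. The sketch below probes the `/health` endpoint, the same one used by the docker-compose healthcheck in this repository; the helper itself is illustrative.
```python
# Illustrative liveness probe for the sglang server
# (endpoint taken from the docker-compose healthcheck).
import urllib.request

def server_ready(url: str = "http://127.0.0.1:30000/health") -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    print("sglang server ready:", server_ready())
```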
For more information, see [`README_ori`](./README_orgin.md) in the upstream project.
## Results
Parsing example:
Layout:
<div align=center>
<img src="./assets/layout.png"/>
</div>
Parsing result:
<div align=center>
<img src="./assets/result.png"/>
</div>
### Accuracy
DCU accuracy matches GPU accuracy; inference framework: PyTorch.
## Application Scenarios
### Algorithm Category
`OCR`
### Key Application Industries
`Research, education, government, broadcast media`
## Pretrained Weights
ModelScope download: [OpenDataLab/PDF-Extract-Kit-1.0](https://modelscope.cn/models/OpenDataLab/PDF-Extract-Kit-1.0)
Hugging Face download: [OpenDataLab/PDF-Extract-Kit-1.0](https://huggingface.co/opendatalab/PDF-Extract-Kit-1.0)
Note: `for automatic model downloads, adding a mirror source is recommended: export HF_ENDPOINT=https://hf-mirror.com`
## Source Repository and Issue Reporting
- https://developer.sourcefind.cn/codes/modelzoo/mineru_pytorch
## References
- https://github.com/opendatalab/MinerU
# Security Policy
## Supported Versions
latest
## Reporting a Vulnerability
Please do not report security vulnerabilities through public GitHub issues.
Instead, please report them at https://github.com/opendatalab/MinerU/security.
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
## Preferred Languages
We prefer all communications to be in English or Chinese.
## Policy
We will fix security issues in the project's own code as quickly as possible. Until the fix is complete, please do not disclose the vulnerability on any public platform.
# Copyright (c) Opendatalab. All rights reserved.
import copy
import json
import os
from pathlib import Path
from loguru import logger
from mineru.cli.common import convert_pdf_bytes_to_bytes_by_pypdfium2, prepare_env, read_fn
from mineru.data.data_reader_writer import FileBasedDataWriter
from mineru.utils.draw_bbox import draw_layout_bbox, draw_span_bbox
from mineru.utils.enum_class import MakeMode
from mineru.backend.vlm.vlm_analyze import doc_analyze as vlm_doc_analyze
from mineru.backend.pipeline.pipeline_analyze import doc_analyze as pipeline_doc_analyze
from mineru.backend.pipeline.pipeline_middle_json_mkcontent import union_make as pipeline_union_make
from mineru.backend.pipeline.model_json_to_middle_json import result_to_middle_json as pipeline_result_to_middle_json
from mineru.backend.vlm.vlm_middle_json_mkcontent import union_make as vlm_union_make
from mineru.utils.models_download_utils import auto_download_and_get_model_root_path
def do_parse(
output_dir, # Output directory for storing parsing results
pdf_file_names: list[str], # List of PDF file names to be parsed
pdf_bytes_list: list[bytes], # List of PDF bytes to be parsed
p_lang_list: list[str], # List of languages for each PDF, default is 'ch' (Chinese)
backend="pipeline", # The backend for parsing PDF, default is 'pipeline'
parse_method="auto", # The method for parsing PDF, default is 'auto'
p_formula_enable=True, # Enable formula parsing
p_table_enable=True, # Enable table parsing
server_url=None, # Server URL for vlm-sglang-client backend
f_draw_layout_bbox=True, # Whether to draw layout bounding boxes
f_draw_span_bbox=True, # Whether to draw span bounding boxes
f_dump_md=True, # Whether to dump markdown files
f_dump_middle_json=True, # Whether to dump middle JSON files
f_dump_model_output=True, # Whether to dump model output files
f_dump_orig_pdf=True, # Whether to dump original PDF files
f_dump_content_list=True, # Whether to dump content list files
f_make_md_mode=MakeMode.MM_MD, # The mode for making markdown content, default is MM_MD
start_page_id=0, # Start page ID for parsing, default is 0
end_page_id=None, # End page ID for parsing, default is None (parse all pages until the end of the document)
):
if backend == "pipeline":
for idx, pdf_bytes in enumerate(pdf_bytes_list):
new_pdf_bytes = convert_pdf_bytes_to_bytes_by_pypdfium2(pdf_bytes, start_page_id, end_page_id)
pdf_bytes_list[idx] = new_pdf_bytes
infer_results, all_image_lists, all_pdf_docs, lang_list, ocr_enabled_list = pipeline_doc_analyze(pdf_bytes_list, p_lang_list, parse_method=parse_method, formula_enable=p_formula_enable, table_enable=p_table_enable)
for idx, model_list in enumerate(infer_results):
model_json = copy.deepcopy(model_list)
pdf_file_name = pdf_file_names[idx]
local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name, parse_method)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(local_md_dir)
images_list = all_image_lists[idx]
pdf_doc = all_pdf_docs[idx]
_lang = lang_list[idx]
_ocr_enable = ocr_enabled_list[idx]
middle_json = pipeline_result_to_middle_json(model_list, images_list, pdf_doc, image_writer, _lang, _ocr_enable, p_formula_enable)
pdf_info = middle_json["pdf_info"]
pdf_bytes = pdf_bytes_list[idx]
if f_draw_layout_bbox:
draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir, f"{pdf_file_name}_layout.pdf")
if f_draw_span_bbox:
draw_span_bbox(pdf_info, pdf_bytes, local_md_dir, f"{pdf_file_name}_span.pdf")
if f_dump_orig_pdf:
md_writer.write(
f"{pdf_file_name}_origin.pdf",
pdf_bytes,
)
if f_dump_md:
image_dir = str(os.path.basename(local_image_dir))
md_content_str = pipeline_union_make(pdf_info, f_make_md_mode, image_dir)
md_writer.write_string(
f"{pdf_file_name}.md",
md_content_str,
)
if f_dump_content_list:
image_dir = str(os.path.basename(local_image_dir))
content_list = pipeline_union_make(pdf_info, MakeMode.CONTENT_LIST, image_dir)
md_writer.write_string(
f"{pdf_file_name}_content_list.json",
json.dumps(content_list, ensure_ascii=False, indent=4),
)
if f_dump_middle_json:
md_writer.write_string(
f"{pdf_file_name}_middle.json",
json.dumps(middle_json, ensure_ascii=False, indent=4),
)
if f_dump_model_output:
md_writer.write_string(
f"{pdf_file_name}_model.json",
json.dumps(model_json, ensure_ascii=False, indent=4),
)
logger.info(f"local output dir is {local_md_dir}")
else:
if backend.startswith("vlm-"):
backend = backend[4:]
f_draw_span_bbox = False
parse_method = "vlm"
for idx, pdf_bytes in enumerate(pdf_bytes_list):
pdf_file_name = pdf_file_names[idx]
pdf_bytes = convert_pdf_bytes_to_bytes_by_pypdfium2(pdf_bytes, start_page_id, end_page_id)
local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name, parse_method)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(local_md_dir)
middle_json, infer_result = vlm_doc_analyze(pdf_bytes, image_writer=image_writer, backend=backend, server_url=server_url)
pdf_info = middle_json["pdf_info"]
if f_draw_layout_bbox:
draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir, f"{pdf_file_name}_layout.pdf")
if f_draw_span_bbox:
draw_span_bbox(pdf_info, pdf_bytes, local_md_dir, f"{pdf_file_name}_span.pdf")
if f_dump_orig_pdf:
md_writer.write(
f"{pdf_file_name}_origin.pdf",
pdf_bytes,
)
if f_dump_md:
image_dir = str(os.path.basename(local_image_dir))
md_content_str = vlm_union_make(pdf_info, f_make_md_mode, image_dir)
md_writer.write_string(
f"{pdf_file_name}.md",
md_content_str,
)
if f_dump_content_list:
image_dir = str(os.path.basename(local_image_dir))
content_list = vlm_union_make(pdf_info, MakeMode.CONTENT_LIST, image_dir)
md_writer.write_string(
f"{pdf_file_name}_content_list.json",
json.dumps(content_list, ensure_ascii=False, indent=4),
)
if f_dump_middle_json:
md_writer.write_string(
f"{pdf_file_name}_middle.json",
json.dumps(middle_json, ensure_ascii=False, indent=4),
)
if f_dump_model_output:
model_output = ("\n" + "-" * 50 + "\n").join(infer_result)
md_writer.write_string(
f"{pdf_file_name}_model_output.txt",
model_output,
)
logger.info(f"local output dir is {local_md_dir}")
def parse_doc(
path_list: list[Path],
output_dir,
lang="ch",
backend="pipeline",
method="auto",
server_url=None,
start_page_id=0, # Start page ID for parsing, default is 0
end_page_id=None # End page ID for parsing, default is None (parse all pages until the end of the document)
):
"""
Parameter description:
path_list: List of document paths to be parsed, can be PDF or image files.
output_dir: Output directory for storing parsing results.
lang: Language option, default is 'ch'; optional values include ['ch', 'ch_server', 'ch_lite', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka'].
    Specify the language of the pdf (if known) to improve OCR accuracy. Optional.
    Applies only when the backend is set to "pipeline".
backend: the backend for parsing pdf:
    pipeline: More general.
    vlm-transformers: More general.
    vlm-sglang-engine: Faster (engine).
    vlm-sglang-client: Faster (client).
    Without a backend specified, 'pipeline' is used by default.
method: the method for parsing pdf:
    auto: Automatically determine the method based on the file type.
    txt: Use text extraction method.
    ocr: Use OCR method for image-based PDFs.
    Without a method specified, 'auto' is used by default.
    Applies only when the backend is set to "pipeline".
server_url: When the backend is `vlm-sglang-client`, you need to specify the server_url, for example: `http://127.0.0.1:30000`
"""
try:
file_name_list = []
pdf_bytes_list = []
lang_list = []
for path in path_list:
file_name = str(Path(path).stem)
pdf_bytes = read_fn(path)
file_name_list.append(file_name)
pdf_bytes_list.append(pdf_bytes)
lang_list.append(lang)
do_parse(
output_dir=output_dir,
pdf_file_names=file_name_list,
pdf_bytes_list=pdf_bytes_list,
p_lang_list=lang_list,
backend=backend,
parse_method=method,
server_url=server_url,
start_page_id=start_page_id,
end_page_id=end_page_id
)
except Exception as e:
logger.exception(e)
if __name__ == '__main__':
# args
__dir__ = os.path.dirname(os.path.abspath(__file__))
pdf_files_dir = os.path.join(__dir__, "pdfs")
output_dir = os.path.join(__dir__, "output")
pdf_suffixes = [".pdf"]
image_suffixes = [".png", ".jpeg", ".jpg"]
doc_path_list = []
for doc_path in Path(pdf_files_dir).glob('*'):
if doc_path.suffix in pdf_suffixes + image_suffixes:
doc_path_list.append(doc_path)
"""如果您由于网络问题无法下载模型,可以设置环境变量MINERU_MODEL_SOURCE为modelscope使用免代理仓库下载模型"""
os.environ['MINERU_MODEL_SOURCE'] = "modelscope"
"""Use pipeline mode if your environment does not support VLM"""
parse_doc(doc_path_list, output_dir, backend="pipeline")
"""To enable VLM mode, change the backend to 'vlm-xxx'"""
# parse_doc(doc_path_list, output_dir, backend="vlm-transformers") # more general.
# parse_doc(doc_path_list, output_dir, backend="vlm-sglang-engine") # faster(engine).
# parse_doc(doc_path_list, output_dir, backend="vlm-sglang-client", server_url="http://127.0.0.1:30000") # faster(client).
# Use the official sglang image
FROM lmsysorg/sglang:v0.4.7-cu124
# install mineru latest
RUN python3 -m pip install -U 'mineru[core]' -i https://mirrors.aliyun.com/pypi/simple --break-system-packages
# Download models and update the configuration file
RUN /bin/bash -c "mineru-models-download -s modelscope -m all"
# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "export MINERU_MODEL_SOURCE=local && exec \"$@\"", "--"]
# Documentation:
# https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands
services:
mineru-sglang:
image: mineru-sglang:latest
container_name: mineru-sglang
restart: always
ports:
- 30000:30000
environment:
MINERU_MODEL_SOURCE: local
entrypoint: mineru-sglang-server
command:
--host 0.0.0.0
--port 30000
# --enable-torch-compile # You can also enable torch.compile to accelerate inference speed by approximately 15%
# --dp 2 # If you have more than two GPUs with 24GB VRAM or above, you can use sglang's multi-GPU parallel mode to increase throughput
# --tp 2 # If you have two GPUs with 12GB or 16GB VRAM, you can use the Tensor Parallel (TP) mode
# --mem-fraction-static 0.7 # If you have two GPUs with 11GB VRAM, in addition to Tensor Parallel mode, you need to reduce the KV cache size
ulimits:
memlock: -1
stack: 67108864
ipc: host
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:30000/health || exit 1"]
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["0"]
capabilities: [gpu]
# Use the official sglang image
FROM lmsysorg/sglang:v0.4.7-cu124
# install mineru latest
RUN python3 -m pip install -U 'mineru[core]' --break-system-packages
# Download models and update the configuration file
RUN /bin/bash -c "mineru-models-download -s huggingface -m all"
# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "export MINERU_MODEL_SOURCE=local && exec \"$@\"", "--"]
# Frequently Asked Questions
### 1. When using the command `pip install magic-pdf[full]` on newer versions of macOS, the error `zsh: no matches found: magic-pdf[full]` occurs.
On macOS, the default shell has switched from Bash to Z shell, which has special handling logic for certain types of string matching. This can lead to the "no matches found" error. You can try disabling the globbing feature in the command line and then run the installation command again.
```bash
setopt no_nomatch
pip install magic-pdf[full]
```
### 2. Encountering the error `pickle.UnpicklingError: invalid load key, 'v'.` during use
This might be due to an incomplete download of the model file. You can try re-downloading the model file and then run again.
Reference: https://github.com/opendatalab/MinerU/issues/143
### 3. Where should the model files be downloaded and how should the `/models-dir` configuration be set?
The path for the model files is configured in "magic-pdf.json", like this:
```json
{
"models-dir": "/tmp/models"
}
```
This path is an absolute path, not a relative path. You can obtain the absolute path by running the "pwd" command in the models directory.
Reference: https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874
### 4. Encountered the error `ImportError: libGL.so.1: cannot open shared object file: No such file or directory` in Ubuntu 22.04 on WSL2
The `libgl` library is missing in Ubuntu 22.04 on WSL2. You can install the `libgl` library with the following command to resolve the issue:
```bash
sudo apt-get install libgl1-mesa-glx
```
Reference: https://github.com/opendatalab/MinerU/issues/388
### 5. Encountered error `ModuleNotFoundError: No module named 'fairscale'`
You need to uninstall the module and reinstall it:
```bash
pip uninstall fairscale
pip install fairscale
```
Reference: https://github.com/opendatalab/MinerU/issues/411
### 6. On some newer devices like the H100, the text parsed during OCR using CUDA acceleration is garbled.
CUDA 11 has poor compatibility with newer graphics cards, so the CUDA version used by Paddle needs to be upgraded:
```bash
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
```
Reference: https://github.com/opendatalab/MinerU/issues/558
### 7. On some Linux servers, the program immediately reports an error `Illegal instruction (core dumped)`
This might be because the server's CPU does not support the AVX/AVX2 instruction set, or the CPU itself supports it but has been disabled by the system administrator. You can try contacting the system administrator to remove the restriction or change to a different server.
References: https://github.com/opendatalab/MinerU/issues/591 , https://github.com/opendatalab/MinerU/issues/736
### 8. Error when installing MinerU on CentOS 7 or Ubuntu 18: `ERROR: Failed building wheel for simsimd`
The new version of albumentations (1.4.21) introduces a dependency on simsimd. Since the pre-built package of simsimd for Linux requires a glibc version greater than or equal to 2.28, this causes installation issues on some Linux distributions released before 2019. You can resolve this issue by using the following command:
```
pip install -U magic-pdf[full,old_linux] --extra-index-url https://wheels.myhloli.com
```
Reference: https://github.com/opendatalab/MinerU/issues/1004
### 9. Old Graphics Cards Such as M40 Encounter "RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED"
The following error occurs at runtime (CUDA):
```
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
```
Because BF16 precision is not supported on graphics cards older than the Turing architecture, and some such cards are not correctly recognized by torch, BF16 precision must be disabled manually.
Modify the code at lines 287-290 of the "pdf_parse_union_core_v2.py" file (note that the location may vary between versions):
```
if torch.cuda.is_bf16_supported():
    supports_bfloat16 = True
else:
    supports_bfloat16 = False
```
Change it to:
```
supports_bfloat16 = False
```
Reference: https://github.com/opendatalab/MinerU/issues/1508