"vscode:/vscode.git/clone" did not exist on "7aed6b3a84479a076b6bda74dbd7b49cb85a8e40"
Commit 1fa55b76 authored by myhloli

Merge remote-tracking branch 'origin/dev' into dev

parents 98b8c4a9 f1997b49
@@ -31,9 +31,9 @@ jobs:
 conda env list
 pip show coverage
 cd $GITHUB_WORKSPACE && sh tests/retry_env.sh
-cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
-cd $GITHUB_WORKSPACE && coverage run -m pytest tests/unittest/ --cov=magic_pdf/ --cov-report html --cov-report term-missing
-cd $GITHUB_WORKSPACE && python tests/get_coverage.py
+# cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
+# cd $GITHUB_WORKSPACE && coverage run -m pytest tests/unittest/ --cov=magic_pdf/ --cov-report html --cov-report term-missing
+# cd $GITHUB_WORKSPACE && python tests/get_coverage.py
 cd $GITHUB_WORKSPACE && pytest -m P0 -s -v tests/test_cli/test_cli_sdk.py
 notify_to_feishu:
......
@@ -30,9 +30,9 @@ jobs:
 conda env list
 pip show coverage
 cd $GITHUB_WORKSPACE && sh tests/retry_env.sh
-cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
-cd $GITHUB_WORKSPACE && coverage run -m pytest tests/unittest/ --cov=magic_pdf/ --cov-report html --cov-report term-missing
-cd $GITHUB_WORKSPACE && python tests/get_coverage.py
+# cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
+# cd $GITHUB_WORKSPACE && coverage run -m pytest tests/unittest/ --cov=magic_pdf/ --cov-report html --cov-report term-missing
+# cd $GITHUB_WORKSPACE && python tests/get_coverage.py
 cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli_sdk.py
 notify_to_feishu:
......
@@ -82,7 +82,7 @@ jobs:
 - name: Install mineru
 run: |
 python -m pip install --upgrade pip
-pip install -e .[all]
+pip install -e .[core]
 build:
 needs: [ check-install ]
......
@@ -10,14 +10,17 @@
 [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
 [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
 [![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
-[![PyPI version](https://img.shields.io/pypi/v/magic-pdf)](https://pypi.org/project/magic-pdf/)
-[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/magic-pdf)](https://pypi.org/project/magic-pdf/)
-[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
-[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
+[![PyPI version](https://img.shields.io/pypi/v/mineru)](https://pypi.org/project/mineru/)
+[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mineru)](https://pypi.org/project/mineru/)
+[![Downloads](https://static.pepy.tech/badge/mineru)](https://pepy.tech/project/mineru)
+[![Downloads](https://static.pepy.tech/badge/mineru/month)](https://pepy.tech/project/mineru)
 [![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github)
 [![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
 [![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
+[![HuggingFace](https://img.shields.io/badge/VLM_Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/mineru2)
 [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/myhloli/3b3a00a4a0a61577b6c30f989092d20d/mineru_demo.ipynb)
 [![Paper](https://img.shields.io/badge/Paper-arXiv-green)](https://arxiv.org/abs/2409.18839)
@@ -48,213 +51,306 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
</div> </div>
 # Changelog
+- 2025/06/13 2.0.0 Released
+- MinerU 2.0 represents a comprehensive reconstruction and upgrade from architecture to functionality, delivering a more streamlined design, enhanced performance, and more flexible user experience.
+- **New Architecture**: MinerU 2.0 has been deeply restructured in code organization and interaction methods, significantly improving system usability, maintainability, and extensibility.
+- **Removal of Third-party Dependency Limitations**: Completely eliminated the dependency on `pymupdf`, moving the project toward a more open and compliant open-source direction.
+- **Ready-to-use, Easy Configuration**: No need to manually edit JSON configuration files; most parameters can now be set directly via command line or API.
+- **Automatic Model Management**: Added automatic model download and update mechanisms, allowing users to complete model deployment without manual intervention.
+- **Offline Deployment Friendly**: Provides built-in model download commands, supporting deployment requirements in completely offline environments.
+- **Streamlined Code Structure**: Removed thousands of lines of redundant code, simplified class inheritance logic, significantly improving code readability and development efficiency.
+- **Unified Intermediate Format Output**: Adopted standardized `middle_json` format, compatible with most secondary development scenarios based on this format, ensuring seamless ecosystem business migration.
+- **New Model**: MinerU 2.0 integrates our latest small-parameter, high-performance multimodal document parsing model, achieving end-to-end high-speed, high-precision document understanding.
+- **Small Model, Big Capabilities**: With parameters under 1B, yet surpassing traditional 72B-level vision-language models (VLMs) in parsing accuracy.
+- **Multiple Functions in One**: A single model covers multilingual recognition, handwriting recognition, layout analysis, table parsing, formula recognition, reading order sorting, and other core tasks.
+- **Ultimate Inference Speed**: Achieves peak throughput exceeding 10,000 tokens/s through `sglang` acceleration on a single NVIDIA 4090 card, easily handling large-scale document processing requirements.
+- **Online Experience**: You can experience this model online on our Hugging Face demo: [![HuggingFace](https://img.shields.io/badge/VLM_Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/mineru2)
+- **Incompatible Changes Notice**: To improve overall architectural rationality and long-term maintainability, this version contains some incompatible changes:
+- Python package name changed from `magic-pdf` to `mineru`, and the command-line tool changed from `magic-pdf` to `mineru`. Please update your scripts and command calls accordingly.
+- For modular system design and ecosystem consistency considerations, MinerU 2.0 no longer includes the LibreOffice document conversion module. If you need to process Office documents, we recommend converting them to PDF format through an independently deployed LibreOffice service before proceeding with subsequent parsing operations.
-- 2025/05/24 1.3.12 Released
-- Added support for ppocrv5 model, updated `ch_server` model to `PP-OCRv5_rec_server` and `ch_lite` model to `PP-OCRv5_rec_mobile` (model update required)
-- In testing, we found that ppocrv5(server) shows some improvement for handwritten documents, but slightly lower accuracy than v4_server_doc for other document types. Therefore, the default ch model remains unchanged as `PP-OCRv4_server_rec_doc`.
-- Since ppocrv5 enhances recognition capabilities for handwritten text and special characters, you can manually select ppocrv5 models for Japanese, traditional Chinese mixed scenarios and handwritten document scenarios
-- You can select the appropriate model through the lang parameter `lang='ch_server'` (python api) or `--lang ch_server` (command line):
-- `ch`: `PP-OCRv4_rec_server_doc` (default) (Chinese, English, Japanese, Traditional Chinese mixed/15k dictionary)
-- `ch_server`: `PP-OCRv5_rec_server` (Chinese, English, Japanese, Traditional Chinese mixed + handwriting/18k dictionary)
-- `ch_lite`: `PP-OCRv5_rec_mobile` (Chinese, English, Japanese, Traditional Chinese mixed + handwriting/18k dictionary)
-- `ch_server_v4`: `PP-OCRv4_rec_server` (Chinese, English mixed/6k dictionary)
-- `ch_lite_v4`: `PP-OCRv4_rec_mobile` (Chinese, English mixed/6k dictionary)
-- Added support for handwritten documents by optimizing layout recognition of handwritten text areas
-- This feature is supported by default, no additional configuration needed
-- You can refer to the instructions above to manually select ppocrv5 model for better handwritten document parsing
-- The demos on `huggingface` and `modelscope` have been updated to support handwriting recognition and ppocrv5 models, which you can experience online
-- 2025/04/29 1.3.10 Released
-- Support for custom formula delimiters can be achieved by modifying the `latex-delimiter-config` item in the `magic-pdf.json` file under the user directory.
-- 2025/04/27 1.3.9 Released
- Optimized the formula parsing function to improve the success rate of formula rendering
- 2025/04/23 1.3.8 Released
- The default `ocr` model (`ch`) has been updated to `PP-OCRv4_server_rec_doc` (model update required)
- `PP-OCRv4_server_rec_doc` is trained on a mix of more Chinese document data and PP-OCR training data, enhancing recognition capabilities for some traditional Chinese characters, Japanese, and special characters. It supports over 15,000 recognizable characters, improving text recognition in documents while also boosting general text recognition.
- [Performance comparison between PP-OCRv4_server_rec_doc, PP-OCRv4_server_rec, and PP-OCRv4_mobile_rec](https://paddlepaddle.github.io/PaddleX/latest/en/module_usage/tutorials/ocr_modules/text_recognition.html#ii-supported-model-list)
- Verified results show that the `PP-OCRv4_server_rec_doc` model significantly improves accuracy in both single-language (`Chinese`, `English`, `Japanese`, `Traditional Chinese`) and mixed-language scenarios, with speed comparable to `PP-OCRv4_server_rec`, making it suitable for most use cases.
- In a small number of pure English scenarios, the `PP-OCRv4_server_rec_doc` model may encounter word concatenation issues, whereas `PP-OCRv4_server_rec` performs better in such cases. Therefore, we have retained the `PP-OCRv4_server_rec` model, which users can invoke by passing the parameter `lang='ch_server'`(python api) or `--lang ch_server`(cli).
- 2025/04/22 1.3.7 Released
- Fixed the issue where the `lang` parameter was ineffective during table parsing model initialization.
- Fixed the significant slowdown in OCR and table parsing speed in `cpu` mode.
- 2025/04/16 1.3.4 Released
- Slightly improved the speed of OCR detection by removing some unused blocks.
- Fixed page-level sorting errors caused by footnotes in certain cases.
- 2025/04/12 1.3.2 released
- Fixed the issue of incompatible dependency package versions when installing in Python 3.13 environment on Windows systems.
- Optimized memory usage during batch inference.
- Improved the parsing effect of tables rotated by 90 degrees.
- Enhanced the parsing accuracy for large tables in financial report samples.
- Fixed the occasional word concatenation issue in English text areas when OCR language is not specified.(The model needs to be updated)
- 2025/04/08 1.3.1 released, fixed some compatibility issues
- Supported Python 3.13
- Made the final adaptation for some outdated Linux systems (e.g., CentOS 7), and no further support will be guaranteed for subsequent versions. [Installation Instructions](https://github.com/opendatalab/MinerU/issues/1004)
- 2025/04/03 1.3.0 released, in this version we made many optimizations and improvements:
- Installation and compatibility optimization
- By removing the use of `layoutlmv3` in layout, resolved compatibility issues caused by `detectron2`.
- Torch version compatibility extended to 2.2~2.6 (excluding 2.5).
- CUDA compatibility supports 11.8/12.4/12.6/12.8 (CUDA version determined by torch), resolving compatibility issues for some users with 50-series and H-series GPUs.
- Python compatible versions expanded to 3.10~3.12, solving the problem of automatic downgrade to 0.6.1 during installation in non-3.10 environments.
- Offline deployment process optimized; no internet connection required after successful deployment to download any model files.
- Performance optimization
- By supporting batch processing of multiple PDF files ([script example](demo/batch_demo.py)), improved parsing speed for small files in batches (compared to version 1.0.1, formula parsing speed increased by over 1400%, overall parsing speed increased by over 500%).
- Optimized loading and usage of the mfr model, reducing GPU memory usage and improving parsing speed (requires re-execution of the [model download process](docs/how_to_download_models_en.md) to obtain incremental updates of model files).
- Optimized GPU memory usage, requiring only a minimum of 6GB to run this project.
- Improved running speed on MPS devices.
- Parsing effect optimization
- Updated the mfr model to `unimernet(2503)`, solving the issue of lost line breaks in multi-line formulas.
- Usability Optimization
- By using `paddleocr2torch`, completely replaced the use of the `paddle` framework and `paddleocr` in the project, resolving conflicts between `paddle` and `torch`, as well as thread safety issues caused by the `paddle` framework.
- Added a real-time progress bar during the parsing process to accurately track progress, making the wait less painful.
<details>
<summary>2025/03/03 1.2.1 released</summary>
<ul>
<li>Fixed the impact on punctuation marks during full-width to half-width conversion of letters and numbers</li>
<li>Fixed caption matching inaccuracies in certain scenarios</li>
<li>Fixed formula span loss issues in certain scenarios</li>
</ul>
</details>
<details>
<summary>2025/02/24 1.2.0 released</summary>
<p>This version includes several fixes and improvements to enhance parsing efficiency and accuracy:</p>
<ul>
<li><strong>Performance Optimization</strong>
<ul>
<li>Increased classification speed for PDF documents in auto mode.</li>
</ul>
</li>
<li><strong>Parsing Optimization</strong>
<ul>
<li>Improved parsing logic for documents containing watermarks, significantly enhancing the parsing results for such documents.</li>
<li>Enhanced the matching logic for multiple images/tables and captions within a single page, improving the accuracy of image-text matching in complex layouts.</li>
</ul>
</li>
<li><strong>Bug Fixes</strong>
<ul>
<li>Fixed an issue where image/table spans were incorrectly filled into text blocks under certain conditions.</li>
<li>Resolved an issue where title blocks were empty in some cases.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/01/22 1.1.0 released</summary>
<p>In this version we have focused on improving parsing accuracy and efficiency:</p>
<ul>
<li><strong>Model capability upgrade</strong> (requires re-executing the <a href="https://github.com/opendatalab/MinerU/blob/master/docs/how_to_download_models_en.md">model download process</a> to obtain incremental updates of model files)
<ul>
<li>The layout recognition model has been upgraded to the latest <code>doclayout_yolo(2501)</code> model, improving layout recognition accuracy.</li>
<li>The formula parsing model has been upgraded to the latest <code>unimernet(2501)</code> model, improving formula recognition accuracy.</li>
</ul>
</li>
<li><strong>Performance optimization</strong>
<ul>
<li>On devices that meet certain configuration requirements (16GB+ VRAM), by optimizing resource usage and restructuring the processing pipeline, overall parsing speed has been increased by more than 50%.</li>
</ul>
</li>
<li><strong>Parsing effect optimization</strong>
<ul>
<li>Added a new heading classification feature (testing version, enabled by default) to the online demo (<a href="https://mineru.net/OpenSourceTools/Extractor">mineru.net</a>/<a href="https://huggingface.co/spaces/opendatalab/MinerU">huggingface</a>/<a href="https://www.modelscope.cn/studios/OpenDataLab/MinerU">modelscope</a>), which supports hierarchical classification of headings, thereby enhancing document structuring.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/01/10 1.0.1 released</summary>
<p>This is our first official release, where we have introduced a completely new API interface and enhanced compatibility through extensive refactoring, as well as a brand new automatic language identification feature:</p>
<ul>
<li><strong>New API Interface</strong>
<ul>
<li>For the data-side API, we have introduced the Dataset class, designed to provide a robust and flexible data processing framework. This framework currently supports a variety of document formats, including images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx). It ensures effective support for data processing tasks ranging from simple to complex.</li>
<li>For the user-side API, we have meticulously designed the MinerU processing workflow as a series of composable Stages. Each Stage represents a specific processing step, allowing users to define new Stages according to their needs and creatively combine these stages to customize their data processing workflows.</li>
</ul>
</li>
<li><strong>Enhanced Compatibility</strong>
<ul>
<li>By optimizing the dependency environment and configuration items, we ensure stable and efficient operation on ARM architecture Linux systems.</li>
<li>We have deeply integrated with Huawei Ascend NPU acceleration, providing autonomous and controllable high-performance computing capabilities. This supports the localization and development of AI application platforms in China. <a href="https://github.com/opendatalab/MinerU/blob/master/docs/README_Ascend_NPU_Acceleration_zh_CN.md">Ascend NPU Acceleration</a></li>
</ul>
</li>
<li><strong>Automatic Language Identification</strong>
<ul>
<li>By introducing a new language recognition model, setting the <code>lang</code> configuration to <code>auto</code> during document parsing will automatically select the appropriate OCR language model, improving the accuracy of scanned document parsing.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2024/11/22 0.10.0 released</summary>
<p>Introducing hybrid OCR text extraction capabilities:</p>
<ul>
<li>Significantly improved parsing performance in complex text distribution scenarios such as dense formulas, irregular span regions, and text represented by images.</li>
<li>Combines the dual advantages of accurate content extraction and faster speed in text mode, and more precise span/line region recognition in OCR mode.</li>
</ul>
</details>
<details>
<summary>2024/11/15 0.9.3 released</summary>
<p>Integrated <a href="https://github.com/RapidAI/RapidTable">RapidTable</a> for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.</p>
</details>
<details>
<summary>2024/11/06 0.9.2 released</summary>
<p>Integrated the <a href="https://huggingface.co/U4R/StructTable-InternVL2-1B">StructTable-InternVL2-1B</a> model for table recognition functionality.</p>
</details>
<details>
<summary>2024/10/31 0.9.0 released</summary>
<p>This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:</p>
<ul>
<li>Refactored the sorting module code to use <a href="https://github.com/ppaanngggg/layoutreader">layoutreader</a> for reading order sorting, ensuring high accuracy in various layouts.</li>
<li>Refactored the paragraph concatenation module to achieve good results in cross-column, cross-page, cross-figure, and cross-table scenarios.</li>
<li>Refactored the list and table of contents recognition functions, significantly improving the accuracy of list blocks and table of contents blocks, as well as the parsing of corresponding text paragraphs.</li>
<li>Refactored the matching logic for figures, tables, and descriptive text, greatly enhancing the accuracy of matching captions and footnotes to figures and tables, and reducing the loss rate of descriptive text to near zero.</li>
<li>Added multi-language support for OCR, supporting detection and recognition of 84 languages. For the list of supported languages, see <a href="https://paddlepaddle.github.io/PaddleOCR/latest/en/ppocr/blog/multi_languages.html#5-support-languages-and-abbreviations">OCR Language Support List</a>.</li>
<li>Added memory recycling logic and other memory optimization measures, significantly reducing memory usage. The memory requirement for enabling all acceleration features except table acceleration (layout/formula/OCR) has been reduced from 16GB to 8GB, and the memory requirement for enabling all acceleration features has been reduced from 24GB to 10GB.</li>
<li>Optimized configuration file feature switches, adding an independent formula detection switch to significantly improve speed and parsing results when formula detection is not needed.</li>
<li>Integrated <a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit 1.0</a>:
<ul>
<li>Added the self-developed <code>doclayout_yolo</code> model, which speeds up processing by more than 10 times compared to the original solution while maintaining similar parsing effects, and can be freely switched with <code>layoutlmv3</code> via the configuration file.</li>
<li>Upgraded formula parsing to <code>unimernet 0.2.1</code>, improving formula parsing accuracy while significantly reducing memory usage.</li>
<li>Due to the repository change for <code>PDF-Extract-Kit 1.0</code>, you need to re-download the model. Please refer to <a href="https://github.com/opendatalab/MinerU/blob/master/docs/how_to_download_models_en.md">How to Download Models</a> for detailed steps.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2024/09/27 Version 0.8.1 released</summary>
<p>Fixed some bugs, and providing a <a href="https://github.com/opendatalab/MinerU/blob/master/projects/web_demo/README.md">localized deployment version</a> of the <a href="https://opendatalab.com/OpenSourceTools/Extractor/PDF/">online demo</a> and the <a href="https://github.com/opendatalab/MinerU/blob/master/projects/web/README.md">front-end interface</a>.</p>
</details>
<details>
<summary>2024/09/09 Version 0.8.0 released</summary>
<p>Supporting fast deployment with Dockerfile, and launching demos on Huggingface and Modelscope.</p>
</details>
<details>
<summary>2024/08/30 Version 0.7.1 released</summary>
<p>Add paddle tablemaster table recognition option</p>
</details>
<details>
<summary>2024/08/09 Version 0.7.0b1 released</summary>
<p>Simplified installation process, added table recognition functionality</p>
</details>
<details>
<summary>2024/08/01 Version 0.6.2b1 released</summary>
<p>Optimized dependency conflict issues and installation documentation</p>
</details>
<details> <details>
-<summary>2024/07/05 Initial open-source release</summary>
+<summary>History Log</summary>
<details>
<summary>2025/05/24 Release 1.3.12</summary>
<ul>
<li>Added support for PPOCRv5 models, updated <code>ch_server</code> model to <code>PP-OCRv5_rec_server</code>, and <code>ch_lite</code> model to <code>PP-OCRv5_rec_mobile</code> (model update required)
<ul>
<li>In testing, we found that PPOCRv5(server) has some improvement for handwritten documents, but has slightly lower accuracy than v4_server_doc for other document types, so the default ch model remains unchanged as <code>PP-OCRv4_server_rec_doc</code>.</li>
<li>Since PPOCRv5 has enhanced recognition capabilities for handwriting and special characters, you can manually choose the PPOCRv5 model for Japanese-Traditional Chinese mixed scenarios and handwritten documents</li>
<li>You can select the appropriate model through the lang parameter <code>lang='ch_server'</code> (Python API) or <code>--lang ch_server</code> (command line):
<ul>
<li><code>ch</code>: <code>PP-OCRv4_server_rec_doc</code> (default) (Chinese/English/Japanese/Traditional Chinese mixed/15K dictionary)</li>
<li><code>ch_server</code>: <code>PP-OCRv5_rec_server</code> (Chinese/English/Japanese/Traditional Chinese mixed + handwriting/18K dictionary)</li>
<li><code>ch_lite</code>: <code>PP-OCRv5_rec_mobile</code> (Chinese/English/Japanese/Traditional Chinese mixed + handwriting/18K dictionary)</li>
<li><code>ch_server_v4</code>: <code>PP-OCRv4_rec_server</code> (Chinese/English mixed/6K dictionary)</li>
<li><code>ch_lite_v4</code>: <code>PP-OCRv4_rec_mobile</code> (Chinese/English mixed/6K dictionary)</li>
</ul>
</li>
</ul>
</li>
<li>Added support for handwritten documents through optimized layout recognition of handwritten text areas
<ul>
<li>This feature is supported by default, no additional configuration required</li>
<li>You can refer to the instructions above to manually select the PPOCRv5 model for better handwritten document parsing results</li>
</ul>
</li>
<li>The <code>huggingface</code> and <code>modelscope</code> demos have been updated to versions that support handwriting recognition and PPOCRv5 models, which you can experience online</li>
</ul>
</details>
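
The `lang` values listed in the 1.3.12 notes above can be summarized as a lookup table. The following is a minimal sketch and not MinerU code: the dictionary only restates the mapping given in the notes, and `pick_lang` is a hypothetical helper that encodes the stated guidance (keep the default `ch`, switch to `ch_server` for handwritten documents).

```python
# Mapping of `lang` values to OCR recognition models, restated from the
# 1.3.12 release notes. This table is illustrative, not read from MinerU.
LANG_TO_REC_MODEL = {
    "ch": "PP-OCRv4_server_rec_doc",       # default, 15K dictionary
    "ch_server": "PP-OCRv5_rec_server",    # adds handwriting, 18K dictionary
    "ch_lite": "PP-OCRv5_rec_mobile",      # adds handwriting, 18K dictionary
    "ch_server_v4": "PP-OCRv4_rec_server", # Chinese/English, 6K dictionary
    "ch_lite_v4": "PP-OCRv4_rec_mobile",   # Chinese/English, 6K dictionary
}


def pick_lang(handwritten: bool = False) -> str:
    """Hypothetical helper: choose a `lang` value per the release notes.

    The notes keep `ch` as the default and suggest the PPOCRv5 server
    model (`ch_server`) for handwritten documents.
    """
    return "ch_server" if handwritten else "ch"


if __name__ == "__main__":
    lang = pick_lang(handwritten=True)
    print(lang, "->", LANG_TO_REC_MODEL[lang])
```

The returned string would then be passed as `lang='...'` (Python API) or `--lang ...` (command line), as the notes describe.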
<details>
<summary>2025/04/29 Release 1.3.10</summary>
<ul>
<li>Added support for custom formula delimiters, which can be configured by modifying the <code>latex-delimiter-config</code> section in the <code>magic-pdf.json</code> file in your user directory.</li>
</ul>
</details>
<details>
<summary>2025/04/27 Release 1.3.9</summary>
<ul>
<li>Optimized formula parsing functionality, improved formula rendering success rate</li>
</ul>
</details>
<details>
<summary>2025/04/23 Release 1.3.8</summary>
<ul>
<li>The default <code>ocr</code> model (<code>ch</code>) has been updated to <code>PP-OCRv4_server_rec_doc</code> (model update required)
<ul>
<li><code>PP-OCRv4_server_rec_doc</code> is trained on a mixture of more Chinese document data and PP-OCR training data based on <code>PP-OCRv4_server_rec</code>, adding recognition capabilities for some traditional Chinese characters, Japanese, and special characters. It can recognize over 15,000 characters and improves both document-specific and general text recognition abilities.</li>
<li><a href="https://paddlepaddle.github.io/PaddleX/latest/module_usage/tutorials/ocr_modules/text_recognition.html#_3">Performance comparison of PP-OCRv4_server_rec_doc/PP-OCRv4_server_rec/PP-OCRv4_mobile_rec</a></li>
<li>After verification, the <code>PP-OCRv4_server_rec_doc</code> model shows significant accuracy improvements in Chinese/English/Japanese/Traditional Chinese in both single language and mixed language scenarios, with comparable speed to <code>PP-OCRv4_server_rec</code>, making it suitable for most use cases.</li>
<li>In some pure English scenarios, <code>PP-OCRv4_server_rec_doc</code> may have word adhesion issues, while <code>PP-OCRv4_server_rec</code> performs better in these cases. Therefore, we've kept the <code>PP-OCRv4_server_rec</code> model, which users can access by adding the parameter <code>lang='ch_server'</code> (Python API) or <code>--lang ch_server</code> (command line).</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/04/22 Release 1.3.7</summary>
<ul>
<li>Fixed the issue where the lang parameter was ineffective during table parsing model initialization</li>
<li>Fixed the significant speed reduction of OCR and table parsing in <code>cpu</code> mode</li>
</ul>
</details>
<details>
<summary>2025/04/16 Release 1.3.4</summary>
<ul>
<li>Slightly improved OCR-det speed by removing some unnecessary blocks</li>
<li>Fixed page-internal sorting errors caused by footnotes in certain cases</li>
</ul>
</details>
<details>
<summary>2025/04/12 Release 1.3.2</summary>
<ul>
<li>Fixed dependency version incompatibility issues when installing on Windows with Python 3.13</li>
<li>Optimized memory usage during batch inference</li>
<li>Improved parsing of tables rotated 90 degrees</li>
<li>Enhanced parsing of oversized tables in financial report samples</li>
<li>Fixed the occasional word adhesion issue in English text areas when OCR language is not specified (model update required)</li>
</ul>
</details>
<details>
<summary>2025/04/08 Release 1.3.1</summary>
<ul>
<li>Fixed several compatibility issues
<ul>
<li>Added support for Python 3.13</li>
<li>Made final adaptations for outdated Linux systems (such as CentOS 7) with no guarantee of continued support in future versions, <a href="https://github.com/opendatalab/MinerU/issues/1004">installation instructions</a></li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/04/03 Release 1.3.0</summary>
<ul>
<li>Installation and compatibility optimizations
<ul>
<li>Resolved compatibility issues caused by <code>detectron2</code> by removing <code>layoutlmv3</code> usage in layout</li>
<li>Extended torch version compatibility to 2.2~2.6 (excluding 2.5)</li>
<li>Added CUDA compatibility for versions 11.8/12.4/12.6/12.8 (CUDA version determined by torch), solving compatibility issues for users with 50-series and H-series GPUs</li>
<li>Extended Python compatibility to versions 3.10~3.12, fixing the issue of automatic downgrade to version 0.6.1 when installing in non-3.10 environments</li>
<li>Optimized offline deployment process, eliminating the need to download any model files after successful deployment</li>
</ul>
</li>
<li>Performance optimizations
<ul>
<li>Enhanced parsing speed for batches of small files by supporting batch processing of multiple PDF files (<a href="demo/batch_demo.py">script example</a>), with formula parsing speed improved by up to 1400% and overall parsing speed improved by up to 500% compared to version 1.0.1</li>
<li>Reduced memory usage and improved parsing speed by optimizing MFR model loading and usage (requires re-running the <a href="docs/how_to_download_models_zh_cn.md">model download process</a> to get incremental updates to model files)</li>
<li>Optimized GPU memory usage, requiring only 6GB minimum to run this project</li>
<li>Improved running speed on MPS devices</li>
</ul>
</li>
<li>Parsing effect optimizations
<ul>
<li>Updated MFR model to <code>unimernet(2503)</code>, fixing line break loss issues in multi-line formulas</li>
</ul>
</li>
<li>Usability optimizations
<ul>
<li>Completely replaced the <code>paddle</code> framework and <code>paddleocr</code> in the project by using <code>paddleocr2torch</code>, resolving conflicts between <code>paddle</code> and <code>torch</code>, as well as thread safety issues caused by the <code>paddle</code> framework</li>
<li>Added real-time progress bar display during parsing, allowing precise tracking of parsing progress and making the waiting process more bearable</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/03/03 1.2.1 released</summary>
<ul>
<li>Fixed punctuation marks being altered during full-width to half-width conversion of letters and numbers</li>
<li>Fixed caption matching inaccuracies in certain scenarios</li>
<li>Fixed formula span loss issues in certain scenarios</li>
</ul>
</details>
<details>
<summary>2025/02/24 1.2.0 released</summary>
<p>This version includes several fixes and improvements to enhance parsing efficiency and accuracy:</p>
<ul>
<li><strong>Performance Optimization</strong>
<ul>
<li>Increased classification speed for PDF documents in auto mode.</li>
</ul>
</li>
<li><strong>Parsing Optimization</strong>
<ul>
<li>Improved parsing logic for documents containing watermarks, significantly enhancing the parsing results for such documents.</li>
<li>Enhanced the matching logic for multiple images/tables and captions within a single page, improving the accuracy of image-text matching in complex layouts.</li>
</ul>
</li>
<li><strong>Bug Fixes</strong>
<ul>
<li>Fixed an issue where image/table spans were incorrectly filled into text blocks under certain conditions.</li>
<li>Resolved an issue where title blocks were empty in some cases.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/01/22 1.1.0 released</summary>
<p>In this version we have focused on improving parsing accuracy and efficiency:</p>
<ul>
<li><strong>Model capability upgrade</strong> (requires re-executing the <a href="https://github.com/opendatalab/MinerU/blob/master/docs/how_to_download_models_en.md">model download process</a> to obtain incremental updates of model files)
<ul>
<li>The layout recognition model has been upgraded to the latest <code>doclayout_yolo(2501)</code> model, improving layout recognition accuracy.</li>
<li>The formula parsing model has been upgraded to the latest <code>unimernet(2501)</code> model, improving formula recognition accuracy.</li>
</ul>
</li>
<li><strong>Performance optimization</strong>
<ul>
<li>On devices that meet certain configuration requirements (16GB+ VRAM), by optimizing resource usage and restructuring the processing pipeline, overall parsing speed has been increased by more than 50%.</li>
</ul>
</li>
<li><strong>Parsing effect optimization</strong>
<ul>
<li>Added a new heading classification feature (testing version, enabled by default) to the online demo (<a href="https://mineru.net/OpenSourceTools/Extractor">mineru.net</a>/<a href="https://huggingface.co/spaces/opendatalab/MinerU">huggingface</a>/<a href="https://www.modelscope.cn/studios/OpenDataLab/MinerU">modelscope</a>), which supports hierarchical classification of headings, thereby enhancing document structuring.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/01/10 1.0.1 released</summary>
<p>This is our first official release, where we have introduced a completely new API interface and enhanced compatibility through extensive refactoring, as well as a brand new automatic language identification feature:</p>
<ul>
<li><strong>New API Interface</strong>
<ul>
<li>For the data-side API, we have introduced the Dataset class, designed to provide a robust and flexible data processing framework. This framework currently supports a variety of document formats, including images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx). It ensures effective support for data processing tasks ranging from simple to complex.</li>
<li>For the user-side API, we have meticulously designed the MinerU processing workflow as a series of composable Stages. Each Stage represents a specific processing step, allowing users to define new Stages according to their needs and creatively combine these stages to customize their data processing workflows.</li>
</ul>
</li>
<li><strong>Enhanced Compatibility</strong>
<ul>
<li>By optimizing the dependency environment and configuration items, we ensure stable and efficient operation on ARM architecture Linux systems.</li>
<li>We have deeply integrated with Huawei Ascend NPU acceleration, providing autonomous and controllable high-performance computing capabilities. This supports the localization and development of AI application platforms in China. <a href="https://github.com/opendatalab/MinerU/blob/master/docs/README_Ascend_NPU_Acceleration_zh_CN.md">Ascend NPU Acceleration</a></li>
</ul>
</li>
<li><strong>Automatic Language Identification</strong>
<ul>
<li>By introducing a new language recognition model, setting the <code>lang</code> configuration to <code>auto</code> during document parsing will automatically select the appropriate OCR language model, improving the accuracy of scanned document parsing.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2024/11/22 0.10.0 released</summary>
<p>Introducing hybrid OCR text extraction capabilities:</p>
<ul>
<li>Significantly improved parsing performance in complex text distribution scenarios such as dense formulas, irregular span regions, and text represented by images.</li>
<li>Combines the accurate content extraction and higher speed of text mode with the more precise span/line region recognition of OCR mode.</li>
</ul>
</details>
<details>
<summary>2024/11/15 0.9.3 released</summary>
<p>Integrated <a href="https://github.com/RapidAI/RapidTable">RapidTable</a> for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.</p>
</details>
<details>
<summary>2024/11/06 0.9.2 released</summary>
<p>Integrated the <a href="https://huggingface.co/U4R/StructTable-InternVL2-1B">StructTable-InternVL2-1B</a> model for table recognition functionality.</p>
</details>
<details>
<summary>2024/10/31 0.9.0 released</summary>
<p>This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:</p>
<ul>
<li>Refactored the sorting module code to use <a href="https://github.com/ppaanngggg/layoutreader">layoutreader</a> for reading order sorting, ensuring high accuracy in various layouts.</li>
<li>Refactored the paragraph concatenation module to achieve good results in cross-column, cross-page, cross-figure, and cross-table scenarios.</li>
<li>Refactored the list and table of contents recognition functions, significantly improving the accuracy of list blocks and table of contents blocks, as well as the parsing of corresponding text paragraphs.</li>
<li>Refactored the matching logic for figures, tables, and descriptive text, greatly enhancing the accuracy of matching captions and footnotes to figures and tables, and reducing the loss rate of descriptive text to near zero.</li>
<li>Added multi-language support for OCR, supporting detection and recognition of 84 languages. For the list of supported languages, see <a href="https://paddlepaddle.github.io/PaddleOCR/latest/en/ppocr/blog/multi_languages.html#5-support-languages-and-abbreviations">OCR Language Support List</a>.</li>
<li>Added memory recycling logic and other memory optimization measures, significantly reducing memory usage. The memory requirement for enabling all acceleration features except table acceleration (layout/formula/OCR) has been reduced from 16GB to 8GB, and the memory requirement for enabling all acceleration features has been reduced from 24GB to 10GB.</li>
<li>Optimized configuration file feature switches, adding an independent formula detection switch to significantly improve speed and parsing results when formula detection is not needed.</li>
<li>Integrated <a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit 1.0</a>:
<ul>
<li>Added the self-developed <code>doclayout_yolo</code> model, which speeds up processing by more than 10 times compared to the original solution while maintaining similar parsing effects, and can be freely switched with <code>layoutlmv3</code> via the configuration file.</li>
<li>Upgraded formula parsing to <code>unimernet 0.2.1</code>, improving formula parsing accuracy while significantly reducing memory usage.</li>
<li>Due to the repository change for <code>PDF-Extract-Kit 1.0</code>, you need to re-download the model. Please refer to <a href="https://github.com/opendatalab/MinerU/blob/master/docs/how_to_download_models_en.md">How to Download Models</a> for detailed steps.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2024/09/27 Version 0.8.1 released</summary>
<p>Fixed some bugs and provided a <a href="https://github.com/opendatalab/MinerU/blob/master/projects/web_demo/README.md">localized deployment version</a> of the <a href="https://opendatalab.com/OpenSourceTools/Extractor/PDF/">online demo</a> and the <a href="https://github.com/opendatalab/MinerU/blob/master/projects/web/README.md">front-end interface</a>.</p>
</details>
<details>
<summary>2024/09/09 Version 0.8.0 released</summary>
<p>Added support for fast deployment with a Dockerfile and launched demos on Hugging Face and ModelScope.</p>
</details>
<details>
<summary>2024/08/30 Version 0.7.1 released</summary>
<p>Added the paddle tablemaster table recognition option</p>
</details>
<details>
<summary>2024/08/09 Version 0.7.0b1 released</summary>
<p>Simplified installation process, added table recognition functionality</p>
</details>
<details>
<summary>2024/08/01 Version 0.6.2b1 released</summary>
<p>Optimized dependency conflict issues and installation documentation</p>
</details>
<details>
<summary>2024/07/05 Initial open-source release</summary>
</details>
</details> </details>
<!-- TABLE OF CONTENT --> <!-- TABLE OF CONTENT -->
<details open="open"> <details open="open">
<summary><h2 style="display: inline-block">Table of Contents</h2></summary> <summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<ol> <ol>
...@@ -266,17 +362,7 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte ...@@ -266,17 +362,7 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
<li><a href="#quick-start">Quick Start</a> <li><a href="#quick-start">Quick Start</a>
<ul> <ul>
<li><a href="#online-demo">Online Demo</a></li> <li><a href="#online-demo">Online Demo</a></li>
<li><a href="#quick-cpu-demo">Quick CPU Demo</a></li> <li><a href="#quick-cpu-demo">Local Deployment</a></li>
<li><a href="#using-gpu">Using GPU</a></li>
<li><a href="#using-npu">Using NPU</a></li>
</ul>
</li>
<li><a href="#usage">Usage</a>
<ul>
<li><a href="#command-line">Command Line</a></li>
<li><a href="#api">API</a></li>
<li><a href="#deploy-derived-projects">Deploy Derived Projects</a></li>
<li><a href="#development-guide">Development Guide</a></li>
</ul> </ul>
</li> </li>
</ul> </ul>
...@@ -326,12 +412,9 @@ If you encounter any installation issues, please first consult the <a href="#faq ...@@ -326,12 +412,9 @@ If you encounter any installation issues, please first consult the <a href="#faq
If the parsing results are not as expected, refer to the <a href="#known-issues">Known Issues</a>. </br> If the parsing results are not as expected, refer to the <a href="#known-issues">Known Issues</a>. </br>
There are three different ways to experience MinerU: There are three different ways to experience MinerU:
- [Online Demo (No Installation Required)](#online-demo) - [Online Demo](#online-demo)
- [Quick CPU Demo (Windows, Linux, Mac)](#quick-cpu-demo) - [Local Deployment](#local-deployment)
- Accelerate inference by using CUDA/CANN/MPS
- [Linux/Windows + CUDA](#Using-GPU)
- [Linux + CANN](#using-npu)
- [MacOS + MPS](#using-mps)
> [!WARNING] > [!WARNING]
> **Pre-installation Notice—Hardware and Software Environment Support** > **Pre-installation Notice—Hardware and Software Environment Support**
...@@ -342,182 +425,236 @@ There are three different ways to experience MinerU: ...@@ -342,182 +425,236 @@ There are three different ways to experience MinerU:
> >
> In non-mainline environments, due to the diversity of hardware and software configurations, as well as third-party dependency compatibility issues, we cannot guarantee 100% project availability. Therefore, for users who wish to use this project in non-recommended environments, we suggest carefully reading the documentation and FAQ first. Most issues already have corresponding solutions in the FAQ. We also encourage community feedback to help us gradually expand support. > In non-mainline environments, due to the diversity of hardware and software configurations, as well as third-party dependency compatibility issues, we cannot guarantee 100% project availability. Therefore, for users who wish to use this project in non-recommended environments, we suggest carefully reading the documentation and FAQ first. Most issues already have corresponding solutions in the FAQ. We also encourage community feedback to help us gradually expand support.
<table> <table border="1">
<tr>
<td colspan="3" rowspan="2">Operating System</td>
</tr>
<tr> <tr>
<td>Linux after 2019</td> <td>Parsing Backend</td>
<td>Windows 10 / 11</td> <td>pipeline</td>
<td>macOS 11+</td> <td>vlm-transformers</td>
<td>vlm-sglang</td>
</tr> </tr>
<tr> <tr>
<td colspan="3">CPU</td> <td>Operating System</td>
<td>x86_64 / arm64</td> <td>windows/linux/mac</td>
<td>x86_64(unsupported ARM Windows)</td> <td>windows/linux</td>
<td>x86_64 / arm64</td> <td>windows(wsl2)/linux</td>
</tr> </tr>
<tr> <tr>
<td colspan="3">Memory Requirements</td> <td>Memory Requirements</td>
<td colspan="3">16GB or more, recommended 32GB+</td> <td colspan="3">Minimum 16GB+, 32GB+ recommended</td>
</tr> </tr>
<tr> <tr>
<td colspan="3">Storage Requirements</td> <td>Disk Space Requirements</td>
<td colspan="3">20GB or more, with a preference for SSD</td> <td colspan="3">20GB+, SSD recommended</td>
</tr> </tr>
<tr> <tr>
<td colspan="3">Python Version</td> <td>Python Version</td>
<td colspan="3">3.10~3.13</td> <td colspan="3">3.10-3.13</td>
</tr> </tr>
<tr> <tr>
<td colspan="3">Nvidia Driver Version</td> <td>CPU Inference Support</td>
<td>latest (Proprietary Driver)</td> <td></td>
<td>latest</td> <td></td>
<td>None</td> <td></td>
</tr> </tr>
<tr> <tr>
<td colspan="3">CUDA Environment</td> <td>GPU Requirements</td>
<td colspan="2"><a href="https://pytorch.org/get-started/locally/">Refer to the PyTorch official website</a></td> <td>Turing architecture or later, 6GB+ VRAM or Apple Silicon</td>
<td>None</td> <td>Ampere architecture or later, 8GB+ VRAM</td>
</tr> <td>Ampere architecture or later, 24GB+ VRAM</td>
<tr>
<td colspan="3">CANN Environment(NPU support)</td>
<td>8.0+(Ascend 910b)</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td rowspan="2">GPU/MPS Hardware Support List</td>
<td colspan="2">GPU VRAM 6GB or more</td>
<td colspan="2">All GPUs with Tensor Cores produced from Volta(2017) onwards.<br>
More than 6GB VRAM </td>
<td rowspan="2">Apple silicon</td>
</tr> </tr>
</table> </table>
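
The Python version row of the table above can be checked before installation. The following is a minimal sketch, assuming only the 3.10-3.13 range stated in the table; the function name `python_supported` is hypothetical and is not part of MinerU.

```python
import sys
from typing import Sequence

# Bounds taken from the "Python Version" row of the requirements table.
MIN_PY = (3, 10)
MAX_PY = (3, 13)


def python_supported(version: Sequence[int] = sys.version_info) -> bool:
    """Return True if the (major, minor) pair falls inside 3.10-3.13.

    Hypothetical helper for illustration; MinerU does not ship it.
    """
    return MIN_PY <= tuple(version[:2]) <= MAX_PY


if __name__ == "__main__":
    print("Supported Python:", python_supported())
```

Memory, disk, and GPU requirements are not covered here because they cannot be checked reliably with the standard library alone.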
### Online Demo ## Online Demo
Synced with dev branch updates:
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github) [![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github)
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU) [![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU) [![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
### Quick CPU Demo ### 🚀🚀🚀VLM demo
[![HuggingFace](https://img.shields.io/badge/VLM_Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/mineru2)
#### 1. Install magic-pdf ## Local Deployment
### 1. Install MinerU
#### 1.1 Install via pip or uv
```bash ```bash
conda create -n mineru 'python=3.12' -y pip install --upgrade pip
conda activate mineru pip install uv
pip install -U "magic-pdf[full]" uv pip install "mineru[core]>=2.0.0"
``` ```
#### 2. Download model weight files #### 1.2 Install from source
Refer to [How to Download Model Files](docs/how_to_download_models_en.md) for detailed instructions. ```bash
git clone https://github.com/opendatalab/MinerU.git
cd MinerU
uv pip install -e .[core]
```
#### 3. Modify the Configuration File for Additional Configuration #### 1.3 Install full version (with sglang acceleration)
After completing the [2. Download model weight files](#2-download-model-weight-files) step, the script will automatically generate a `magic-pdf.json` file in the user directory and configure the default model path. To use **sglang acceleration for VLM model inference**, install the full version:
You can find the `magic-pdf.json` file in your 【user directory】.
> [!TIP] ```bash
> The user directory for Windows is "C:\\Users\\username", for Linux it is "/home/username", and for macOS it is "/Users/username". uv pip install "mineru[all]>=2.0.0"
```
You can modify certain configurations in this file to enable or disable features, such as table recognition: Or install from source:
```bash
uv pip install -e .[all]
```
> [!NOTE] ---
> If the following items are not present in the JSON, please manually add the required items and remove the comment content (standard JSON does not support comments).
```json ### 2. Using MinerU
{
// other config #### 2.1 Command Line Usage
"layout-config": {
"model": "doclayout_yolo" ##### Basic Usage
},
"formula-config": { The simplest command line invocation is:
"mfd_model": "yolo_v8_mfd",
"mfr_model": "unimernet_small", ```bash
"enable": true // The formula recognition feature is enabled by default. If you need to disable it, please change the value here to "false". mineru -p <input_path> -o <output_path>
},
"table-config": {
"model": "rapid_table",
"sub_model": "slanet_plus",
"enable": true, // The table recognition feature is enabled by default. If you need to disable it, please change the value here to "false".
"max_time": 400
}
}
``` ```
### Using GPU - `<input_path>`: Local PDF file or directory (supports pdf/png/jpg/jpeg)
- `<output_path>`: Output directory
If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Please select the appropriate guide based on your system: ##### View Help Information
- [Ubuntu 22.04 LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_en_US.md) Get all available parameter descriptions:
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
- Quick Deployment with Docker
> [!IMPORTANT]
> Docker requires a GPU with at least 6GB of VRAM, and all acceleration features are enabled by default.
>
> Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker.
>
> ```bash
> docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
> ```
```bash
wget https://github.com/opendatalab/MinerU/raw/master/docker/global/Dockerfile -O Dockerfile
docker build -t mineru:latest .
docker run -it --name mineru --gpus=all mineru:latest /bin/bash -c "echo 'source /opt/mineru_venv/bin/activate' >> ~/.bashrc && exec bash"
magic-pdf --help
```
### Using NPU ```bash
mineru --help
```
If your device has NPU acceleration hardware, you can follow the tutorial below to use NPU acceleration: ##### Parameter Details
```text
Usage: mineru [OPTIONS]
Options:
-v, --version Show version and exit
-p, --path PATH Input file path or directory (required)
-o, --output PATH Output directory (required)
-m, --method [auto|txt|ocr] Parsing method: auto (default), txt, ocr (pipeline backend only)
-b, --backend [pipeline|vlm-transformers|vlm-sglang-engine|vlm-sglang-client]
Parsing backend (default: pipeline)
-l, --lang [ch|ch_server|... ] Specify document language (improves OCR accuracy, pipeline backend only)
-u, --url TEXT Service address when using sglang-client
-s, --start INTEGER Starting page number (0-based)
-e, --end INTEGER Ending page number (0-based)
-f, --formula BOOLEAN Enable formula parsing (default: on, pipeline backend only)
-t, --table BOOLEAN Enable table parsing (default: on, pipeline backend only)
-d, --device TEXT Inference device (e.g., cpu/cuda/cuda:0/npu/mps, pipeline backend only)
--vram INTEGER Maximum GPU VRAM usage per process (pipeline backend only)
--source [huggingface|modelscope|local]
Model source, default: huggingface
--help Show help information
```
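
When MinerU is driven from a script, the options above can be assembled programmatically. The following is a minimal sketch that uses only flags listed in the help text; the helper name `build_mineru_command` and its keyword arguments are hypothetical, and the pipeline-only check simply mirrors the "pipeline backend only" notes rather than any behavior confirmed in the CLI itself.

```python
import shlex
from typing import List, Optional


def build_mineru_command(
    input_path: str,
    output_path: str,
    backend: str = "pipeline",
    method: Optional[str] = None,   # -m, pipeline backend only
    lang: Optional[str] = None,     # -l, pipeline backend only
    device: Optional[str] = None,   # -d, pipeline backend only
    start: Optional[int] = None,    # -s, 0-based page number
    end: Optional[int] = None,      # -e, 0-based page number
    source: Optional[str] = None,   # --source
) -> List[str]:
    """Build an argument list for the `mineru` CLI (illustrative helper)."""
    cmd = ["mineru", "-p", input_path, "-o", output_path, "-b", backend]

    # Options documented as "pipeline backend only" are rejected early
    # for other backends instead of being passed through silently.
    pipeline_only = {"-m": method, "-l": lang, "-d": device}
    for flag, value in pipeline_only.items():
        if value is None:
            continue
        if backend != "pipeline":
            raise ValueError(f"{flag} is documented for the pipeline backend only")
        cmd += [flag, value]

    if start is not None:
        cmd += ["-s", str(start)]
    if end is not None:
        cmd += ["-e", str(end)]
    if source is not None:
        cmd += ["--source", source]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # without MinerU installed.
    print(shlex.join(build_mineru_command("demo.pdf", "out", method="ocr", start=0, end=4)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `mineru` is installed.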
[Ascend NPU Acceleration](docs/README_Ascend_NPU_Acceleration_zh_CN.md) ---
### Using MPS #### 2.2 Model Source Configuration
If your device uses Apple silicon chips, you can enable MPS acceleration for your tasks. MinerU automatically downloads required models from HuggingFace on first run. If HuggingFace is inaccessible, you can switch model sources:
You can enable MPS acceleration by setting the `device-mode` parameter to `mps` in the `magic-pdf.json` configuration file. ##### Switch to ModelScope Source
```json ```bash
{ mineru -p <input_path> -o <output_path> --source modelscope
// other config
"device-mode": "mps"
}
``` ```
Or set environment variable:
## Usage ```bash
export MINERU_MODEL_SOURCE=modelscope
mineru -p <input_path> -o <output_path>
```
##### Using Local Models
###### 1. Download Models Locally
```bash
mineru-models-download --help
```
Or use the interactive command-line tool to select models:
```bash
mineru-models-download
```
After the download completes, the model paths are displayed in the current terminal and automatically written to `mineru.json` in the user directory.
###### 2. Parse Using Local Models
```bash
mineru -p <input_path> -o <output_path> --source local
```
Or enable via environment variable:
```bash
export MINERU_MODEL_SOURCE=local
mineru -p <input_path> -o <output_path>
```
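
The same environment variable can be set for a single child process from Python instead of exporting it in the shell. The following is a minimal sketch; the helper name `env_with_model_source` is hypothetical, and it assumes that `MINERU_MODEL_SOURCE` accepts the same three values as `--source` (the text above only shows `modelscope` and `local`).

```python
import os
from typing import Dict, Mapping, Optional

# Values listed for --source in the help text; assumed valid for the env var too.
VALID_SOURCES = ("huggingface", "modelscope", "local")


def env_with_model_source(
    source: str, base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the environment with MINERU_MODEL_SOURCE set."""
    if source not in VALID_SOURCES:
        raise ValueError(f"unknown model source: {source!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["MINERU_MODEL_SOURCE"] = source
    return env
```

A call such as `subprocess.run(["mineru", "-p", "demo.pdf", "-o", "out"], env=env_with_model_source("local"), check=True)` would then apply the setting to that one run, leaving the parent shell unchanged.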
---
#### 2.3 Using sglang to Accelerate VLM Model Inference
##### Start sglang-engine Mode
```bash
mineru -p <input_path> -o <output_path> -b vlm-sglang-engine
```
##### Start sglang-server/client Mode
1. Start Server:
```bash
mineru-sglang-server --port 30000
```
2. Use Client in another terminal:
```bash
mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1:30000
```
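
In automated setups the client can be started only after the server port accepts connections. The following is a minimal sketch, assuming the server from step 1 listens on TCP at `127.0.0.1:30000`; the helper `wait_for_port` is hypothetical, and an open port does not guarantee that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    if wait_for_port("127.0.0.1", 30000, timeout=5.0):
        print("Port is open; the vlm-sglang-client command can be started.")
    else:
        print("Server did not become reachable in time.")
```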
### Command Line > 💡 For more information about output files, please refer to [Output File Documentation](docs/output_file_en_us.md)
[Using MinerU via Command Line](https://mineru.readthedocs.io/en/latest/user_guide/usage/command_line.html) ---
> [!TIP] ### 3. API Usage
> For more information about the output files, please refer to the [Output File Description](docs/output_file_en_us.md).
### API You can also call MinerU through Python code; see the example code at:
👉 [Python Usage Example](demo/demo.py)
[Using MinerU via Python API](https://mineru.readthedocs.io/en/latest/user_guide/usage/api.html) ---
### 4. Deploy Derivative Projects
### Deploy Derived Projects Community developers have created various extensions based on MinerU, including:
Derived projects include secondary development projects based on MinerU by project developers and community developers, - Graphical interface based on Gradio
such as application interfaces based on Gradio, RAG based on llama, web demos similar to the official website, lightweight multi-GPU load balancing client/server ends, etc. - Web API based on FastAPI
These projects may offer more features and a better user experience. - Client/server architecture with multi-GPU load balancing
For specific deployment methods, please refer to the [Derived Project README](projects/README.md) - MCP Server based on the official API
These projects typically offer better user experience and additional features.
### Development Guide For detailed deployment instructions, please refer to:
👉 [Derivative Projects Documentation](projects/README.md)
TODO ---
# TODO # TODO
...@@ -556,21 +693,23 @@ TODO ...@@ -556,21 +693,23 @@ TODO
[LICENSE.md](LICENSE.md) [LICENSE.md](LICENSE.md)
This project currently uses PyMuPDF to achieve advanced functionality. However, since it adheres to the AGPL license, it may impose restrictions on certain usage scenarios. In future iterations, we plan to explore and replace it with a more permissive PDF processing library to enhance user-friendliness and flexibility. Currently, some models in this project are trained based on YOLO. However, since YOLO follows the AGPL license, it may impose restrictions on certain use cases. In future iterations, we plan to explore and replace these with models under more permissive licenses to enhance user-friendliness and flexibility.
# Acknowledgments # Acknowledgments
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO) - [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) - [UniMERNet](https://github.com/opendatalab/UniMERNet)
- [RapidTable](https://github.com/RapidAI/RapidTable) - [RapidTable](https://github.com/RapidAI/RapidTable)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [RapidOCR](https://github.com/RapidAI/RapidOCR)
- [PaddleOCR2Pytorch](https://github.com/frotms/PaddleOCR2Pytorch) - [PaddleOCR2Pytorch](https://github.com/frotms/PaddleOCR2Pytorch)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [layoutreader](https://github.com/ppaanngggg/layoutreader) - [layoutreader](https://github.com/ppaanngggg/layoutreader)
- [xy-cut](https://github.com/Sanster/xy-cut)
- [fast-langdetect](https://github.com/LlmKira/fast-langdetect) - [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
- [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
- [pdftext](https://github.com/datalab-to/pdftext)
- [pdfminer.six](https://github.com/pdfminer/pdfminer.six) - [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
- [pypdf](https://github.com/py-pdf/pypdf)
# Citation # Citation
......
...@@ -10,14 +10,17 @@ ...@@ -10,14 +10,17 @@
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU) [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues) [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues) [![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![PyPI version](https://img.shields.io/pypi/v/magic-pdf)](https://pypi.org/project/magic-pdf/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/magic-pdf)](https://pypi.org/project/magic-pdf/) [![PyPI version](https://img.shields.io/pypi/v/mineru)](https://pypi.org/project/mineru/)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mineru)](https://pypi.org/project/mineru/)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf) [![Downloads](https://static.pepy.tech/badge/mineru)](https://pepy.tech/project/mineru)
[![Downloads](https://static.pepy.tech/badge/mineru/month)](https://pepy.tech/project/mineru)
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github) [![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU) [![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU) [![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![HuggingFace](https://img.shields.io/badge/VLM_Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/mineru2)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/myhloli/3b3a00a4a0a61577b6c30f989092d20d/mineru_demo.ipynb) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/myhloli/3b3a00a4a0a61577b6c30f989092d20d/mineru_demo.ipynb)
[![Paper](https://img.shields.io/badge/Paper-arXiv-green)](https://arxiv.org/abs/2409.18839) [![Paper](https://img.shields.io/badge/Paper-arXiv-green)](https://arxiv.org/abs/2409.18839)
...@@ -59,7 +62,8 @@ ...@@ -59,7 +62,8 @@
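- **New model**: MinerU 2.0 integrates our newly developed small-parameter, high-performance multimodal document parsing model, delivering end-to-end, high-speed, high-accuracy document understanding. - **New model**: MinerU 2.0 integrates our newly developed small-parameter, high-performance multimodal document parsing model, delivering end-to-end, high-speed, high-accuracy document understanding.
- **Small model, big capability**: With fewer than 1B parameters, the model surpasses traditional 72B-class vision-language models (VLMs) in parsing accuracy. - **Small model, big capability**: With fewer than 1B parameters, the model surpasses traditional 72B-class vision-language models (VLMs) in parsing accuracy.
- **Multiple functions in one**: A single model covers core tasks including multilingual recognition, handwriting recognition, layout analysis, table parsing, formula recognition, and reading-order sorting. - **Multiple functions in one**: A single model covers core tasks including multilingual recognition, handwriting recognition, layout analysis, table parsing, formula recognition, and reading-order sorting.
- **Extreme inference speed**: With `sglang` acceleration on a single NVIDIA 4090, peak throughput exceeds 10,000 tokens/s, easily handling large-scale document processing. - **Extreme inference speed**: With `sglang` acceleration on a single NVIDIA 4090, peak throughput exceeds 10,000 tokens/s, easily handling large-scale document processing.
- **Online demo**: You can try the model online in our Hugging Face demo: [![HuggingFace](https://img.shields.io/badge/VLM_Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/mineru2)
- **Incompatible changes**: To improve the overall architecture and long-term maintainability, this version includes some incompatible changes: - **Incompatible changes**: To improve the overall architecture and long-term maintainability, this version includes some incompatible changes:
- The Python package name has changed from `magic-pdf` to `mineru`, and the command-line tool has also changed from `magic-pdf` to `mineru`; please update your scripts and commands accordingly. - The Python package name has changed from `magic-pdf` to `mineru`, and the command-line tool has also changed from `magic-pdf` to `mineru`; please update your scripts and commands accordingly.
- For the sake of modular system design and ecosystem consistency, MinerU 2.0 no longer bundles the LibreOffice document conversion module. To process Office documents, we recommend first converting them to PDF through an independently deployed LibreOffice service and then parsing them. - For the sake of modular system design and ecosystem consistency, MinerU 2.0 no longer bundles the LibreOffice document conversion module. To process Office documents, we recommend first converting them to PDF through an independently deployed LibreOffice service and then parsing them.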
...@@ -332,49 +336,38 @@ ...@@ -332,49 +336,38 @@
<details> <details>
<summary>2024/07/05 Initial open-source release</summary> <summary>2024/07/05 Initial open-source release</summary>
</details> </details>
</details>
<!-- TABLE OF CONTENT -->
<!-- TABLE OF CONTENT -->
<details open="open">
<summary><h2 style="display: inline-block">Table of Contents</h2></summary> <details open="open">
<ol> <summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<li> <ol>
<a href="#mineru">MinerU</a> <li>
<ul> <a href="#mineru">MinerU</a>
<li><a href="#项目简介">Project Introduction</a></li> <ul>
<li><a href="#主要功能">Key Features</a></li> <li><a href="#项目简介">Project Introduction</a></li>
<li><a href="#快速开始">Quick Start</a> <li><a href="#主要功能">Key Features</a></li>
<ul> <li><a href="#快速开始">Quick Start</a>
<li><a href="#在线体验">Online Demo</a></li> <ul>
<li><a href="#使用CPU快速体验">Quick CPU Demo</a></li> <li><a href="#在线体验">Online Demo</a></li>
<li><a href="#使用GPU">Using GPU</a></li> <li><a href="#本地部署">Local Deployment</a></li>
<li><a href="#使用NPU">Using NPU</a></li> </ul>
</ul> </ul>
</li> </li>
<li><a href="#使用">使用方式</a> <li><a href="#todo">TODO</a></li>
<ul> <li><a href="#known-issues">Known Issues</a></li>
<li><a href="#命令行">命令行</a></li> <li><a href="#faq">FAQ</a></li>
<li><a href="#api">API</a></li> <li><a href="#all-thanks-to-our-contributors">Contributors</a></li>
<li><a href="#部署衍生项目">部署衍生项目</a></li> <li><a href="#license-information">License Information</a></li>
<li><a href="#二次开发">二次开发</a></li> <li><a href="#acknowledgments">Acknowledgements</a></li>
</ul> <li><a href="#citation">Citation</a></li>
</li> <li><a href="#star-history">Star History</a></li>
</ul> <li><a href="#magic-doc">magic-doc快速提取PPT/DOC/PDF</a></li>
</li> <li><a href="#magic-html">magic-html提取混合网页内容</a></li>
<li><a href="#todo">TODO</a></li> <li><a href="#links">Links</a></li>
<li><a href="#known-issues">Known Issues</a></li> </ol>
<li><a href="#faq">FAQ</a></li>
<li><a href="#all-thanks-to-our-contributors">Contributors</a></li>
<li><a href="#license-information">License Information</a></li>
<li><a href="#acknowledgments">Acknowledgements</a></li>
<li><a href="#citation">Citation</a></li>
<li><a href="#star-history">Star History</a></li>
<li><a href="#magic-doc">magic-doc快速提取PPT/DOC/PDF</a></li>
<li><a href="#magic-html">magic-html提取混合网页内容</a></li>
<li><a href="#links">Links</a></li>
</ol>
</details>
</details> </details>
...@@ -409,7 +402,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c ...@@ -409,7 +402,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
如果遇到解析效果不及预期,参考 <a href="#known-issues">Known Issues</a></br> 如果遇到解析效果不及预期,参考 <a href="#known-issues">Known Issues</a></br>
有2种不同方式可以体验MinerU的效果: 有2种不同方式可以体验MinerU的效果:
- [在线体验(无需任何安装)](#在线体验) - [在线体验](#在线体验)
- [本地部署](#本地部署) - [本地部署](#本地部署)
...@@ -467,16 +460,19 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c ...@@ -467,16 +460,19 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU) [![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU) [![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
## 本地部署 MinerU ### 🚀🚀🚀VLM demo
[![HuggingFace](https://img.shields.io/badge/VLM_Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/mineru2)
## 本地部署
### 1. 安装 MinerU ### 1. 安装 MinerU
#### 1.1 使用 pip 或 uv 安装 #### 1.1 使用 pip 或 uv 安装
```bash ```bash
pip install --upgrade pip pip install --upgrade pip -i https://mirrors.aliyun.com/pypi/simple
pip install uv pip install uv -i https://mirrors.aliyun.com/pypi/simple
uv pip install "mineru[core]>=2.0.0" uv pip install "mineru[core]>=2.0.0" -i https://mirrors.aliyun.com/pypi/simple
``` ```
#### 1.2 源码安装 #### 1.2 源码安装
...@@ -484,7 +480,7 @@ uv pip install "mineru[core]>=2.0.0" ...@@ -484,7 +480,7 @@ uv pip install "mineru[core]>=2.0.0"
```bash ```bash
git clone https://github.com/opendatalab/MinerU.git git clone https://github.com/opendatalab/MinerU.git
cd MinerU cd MinerU
uv pip install -e .[core] uv pip install -e .[core] -i https://mirrors.aliyun.com/pypi/simple
``` ```
#### 1.3 安装完整版(支持 sglang 加速) #### 1.3 安装完整版(支持 sglang 加速)
...@@ -492,13 +488,13 @@ uv pip install -e .[core] ...@@ -492,13 +488,13 @@ uv pip install -e .[core]
如需使用 **sglang 加速 VLM 模型推理**,请安装完整版本: 如需使用 **sglang 加速 VLM 模型推理**,请安装完整版本:
```bash ```bash
uv pip install "mineru[all]>=2.0.0" uv pip install "mineru[all]>=2.0.0" -i https://mirrors.aliyun.com/pypi/simple
``` ```
或从源码安装: 或从源码安装:
```bash ```bash
uv pip install -e .[all] uv pip install -e .[all] -i https://mirrors.aliyun.com/pypi/simple
``` ```
--- ---
...@@ -640,7 +636,8 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1 ...@@ -640,7 +636,8 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1
- 基于 Gradio 的图形界面 - 基于 Gradio 的图形界面
- 基于 FastAPI 的 Web API - 基于 FastAPI 的 Web API
- 多卡负载均衡的客户端/服务端架构等 - 多卡负载均衡的客户端/服务端架构
- 基于官网API的MCP Server
这些项目通常提供更好的用户体验和更多功能。 这些项目通常提供更好的用户体验和更多功能。
...@@ -702,6 +699,7 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1 ...@@ -702,6 +699,7 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1
- [xy-cut](https://github.com/Sanster/xy-cut) - [xy-cut](https://github.com/Sanster/xy-cut)
- [fast-langdetect](https://github.com/LlmKira/fast-langdetect) - [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
- [pypdfium2](https://github.com/pypdfium2-team/pypdfium2) - [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
- [pdftext](https://github.com/datalab-to/pdftext)
- [pdfminer.six](https://github.com/pdfminer/pdfminer.six) - [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
- [pypdf](https://github.com/py-pdf/pypdf) - [pypdf](https://github.com/py-pdf/pypdf)
......
# Security Policy
## Supported Versions
latest
## Reporting a Vulnerability
Please do not report security vulnerabilities through public GitHub issues.
Instead, please report them at https://github.com/opendatalab/MinerU/security.
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
## Preferred Languages
We prefer all communications to be in English or Chinese.
## Policy
We will fix security issues in the project's own code as quickly as possible. Please do not disclose vulnerability details on any public platform until the fix has been released.
...@@ -42,9 +42,8 @@ def download_and_modify_json(url, local_filename, modifications): ...@@ -42,9 +42,8 @@ def download_and_modify_json(url, local_filename, modifications):
def configure_model(model_dir, model_type): def configure_model(model_dir, model_type):
"""配置模型""" """配置模型"""
# json_url = 'https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/mineru.template.json' json_url = 'https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/mineru.template.json'
json_url = 'https://gcore.jsdelivr.net/gh/myhloli/Magic-PDF@dev/mineru.template.json' config_file_name = os.getenv('MINERU_TOOLS_CONFIG_JSON', 'mineru.json')
config_file_name = 'mineru.json'
home_dir = os.path.expanduser('~') home_dir = os.path.expanduser('~')
config_file = os.path.join(home_dir, config_file_name) config_file = os.path.join(home_dir, config_file_name)
...@@ -120,13 +119,13 @@ def download_models(model_source, model_type): ...@@ -120,13 +119,13 @@ def download_models(model_source, model_type):
click.echo(f"Downloading model: {model_path}") click.echo(f"Downloading model: {model_path}")
download_finish_path = auto_download_and_get_model_root_path(model_path, repo_mode='pipeline') download_finish_path = auto_download_and_get_model_root_path(model_path, repo_mode='pipeline')
click.echo(f"Pipeline models downloaded successfully to: {download_finish_path}") click.echo(f"Pipeline models downloaded successfully to: {download_finish_path}")
configure_model(download_finish_path, model_type) configure_model(download_finish_path, "pipeline")
def download_vlm_models(): def download_vlm_models():
"""下载VLM模型""" """下载VLM模型"""
download_finish_path = auto_download_and_get_model_root_path("/", repo_mode='vlm') download_finish_path = auto_download_and_get_model_root_path("/", repo_mode='vlm')
click.echo(f"VLM models downloaded successfully to: {download_finish_path}") click.echo(f"VLM models downloaded successfully to: {download_finish_path}")
configure_model(download_finish_path, model_type) configure_model(download_finish_path, "vlm")
try: try:
if model_type == 'pipeline': if model_type == 'pipeline':
......
__version__ = "2.0.0" __version__ = "2.0.1"
\ No newline at end of file
...@@ -8,4 +8,4 @@ ...@@ -8,4 +8,4 @@
- Projects not yet compatible with version 2.0: - Projects not yet compatible with version 2.0:
- [web_api](./web_api/README.md): Web API based on FastAPI - [web_api](./web_api/README.md): Web API based on FastAPI
- [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe - [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
- [mcp](./mcp/README.md): MCP server based on the official API
...@@ -8,3 +8,4 @@ ...@@ -8,3 +8,4 @@
- 未兼容2.0版本的项目列表 - 未兼容2.0版本的项目列表
- [web_api](./web_api/README.md): 基于 FastAPI 的 Web API - [web_api](./web_api/README.md): 基于 FastAPI 的 Web API
- [multi_gpu](./multi_gpu/README.md): 基于 LitServe 的多 GPU 并行处理 - [multi_gpu](./multi_gpu/README.md): 基于 LitServe 的多 GPU 并行处理
- [mcp](./mcp/README.md): 基于官方api的mcp server
MINERU_API_BASE=https://mineru.net
MINERU_API_KEY=eyJ0eXB...
OUTPUT_DIR=./downloads
USE_LOCAL_API=false
LOCAL_MINERU_API_BASE=http://localhost:8888
\ No newline at end of file
downloads
.env
uv.lock
.venv
src/mineru/__pycache__
dist
.DS_Store
.cursor
build
*.lock
src/mineru_mcp.egg-info
test
\ No newline at end of file
# MinerU MCP-Server Docker 部署指南
## 1. 简介
本文档提供了使用 Docker 部署 MinerU MCP-Server 的详细指南。通过 Docker 部署,你可以在任何支持 Docker 的环境中快速启动 MinerU MCP 服务器,无需考虑复杂的环境配置和依赖管理。
Docker 部署的主要优势:
- **一致的运行环境**:确保在任何平台上都有相同的运行环境
- **简化部署流程**:一键启动,无需手动安装依赖
- **易于扩展和迁移**:便于在不同环境间迁移和扩展服务
- **资源隔离**:避免与宿主机其他服务产生冲突
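## 2. Prerequisites
Before you begin, make sure the following software is installed on your system:
- [Docker](https://www.docker.com/get-started) (version 19.03 or later)
- [Docker Compose](https://docs.docker.com/compose/install/) (version 1.27.0 or later)
Check that both are installed correctly with:
```bash
docker --version
docker-compose --version
```
You also need:
- An API key from the [MinerU website](https://mineru.net) (only if you use the remote API)
- Enough disk space to store the converted files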
## 3. 使用 Docker Compose 部署(推荐)
Docker Compose 提供了最简单的部署方式,特别适合快速开始使用或开发环境。
### 3.1 准备配置文件
1. 克隆仓库(如果尚未克隆):
```bash
git clone <repository-url>
cd mineru-mcp
```
2. 创建环境变量文件:
```bash
cp .env.example .env
```
3. 编辑 `.env` 文件,设置必要的环境变量:
```
MINERU_API_BASE=https://mineru.net
MINERU_API_KEY=你的API密钥
OUTPUT_DIR=./downloads
USE_LOCAL_API=false
LOCAL_MINERU_API_BASE=http://localhost:8080
```
如果你计划使用本地 API,请将 `USE_LOCAL_API` 设置为 `true`,并确保 `LOCAL_MINERU_API_BASE` 指向你的本地 API 服务地址。
### 3.2 启动服务
在项目根目录下运行:
```bash
docker-compose up -d
```
这将会:
- 构建 Docker 镜像(如果尚未构建)
- 创建并启动容器
- 在后台运行服务 (`-d` 参数)
服务将在 `http://localhost:8001` 上启动。你可以通过 MCP 客户端连接此地址。
### 3.3 View the logs
To view the service logs, run:
```bash
docker-compose logs -f
```
Press `Ctrl+C` to stop following the logs.
### 3.4 停止服务
要停止服务,运行:
```bash
docker-compose down
```
如果你想同时删除构建的镜像,可以使用:
```bash
docker-compose down --rmi local
```
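## 4. Build and run the Docker image manually
If you need more control or customization, you can build and run the Docker image by hand.
### 4.1 Build the image
From the project root, run:
```bash
docker build -t mineru-mcp:latest .
```
This builds a Docker image named `mineru-mcp` with the tag `latest` from the Dockerfile.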
### 4.2 运行容器
使用环境变量文件运行容器:
```bash
docker run -p 8001:8001 --env-file .env mineru-mcp:latest
```
或者直接指定环境变量:
```bash
docker run -p 8001:8001 \
-e MINERU_API_BASE=https://mineru.net \
-e MINERU_API_KEY=你的API密钥 \
-e OUTPUT_DIR=/app/downloads \
-v $(pwd)/downloads:/app/downloads \
mineru-mcp:latest
```
### 4.3 挂载卷
为了持久化存储转换后的文件,你应该挂载宿主机目录到容器的输出目录:
```bash
docker run -p 8001:8001 --env-file .env \
-v $(pwd)/downloads:/app/downloads \
mineru-mcp:latest
```
这将挂载当前工作目录下的 `downloads` 文件夹到容器内的 `/app/downloads` 目录。
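## 5. Environment variables
The Docker environment supports the same environment variables as a standard installation:
| Variable | Description | Default |
| ------------------------- | -------------------------------------------------------------- | ------------------------- |
| `MINERU_API_BASE` | Base URL of the remote MinerU API | `https://mineru.net` |
| `MINERU_API_KEY` | MinerU API key, requested from the official website | - |
| `OUTPUT_DIR` | Directory where converted files are saved | `/app/downloads` |
| `USE_LOCAL_API` | Whether to parse through the local API (applies only to the `local_parse_pdf` tool) | `false` |
| `LOCAL_MINERU_API_BASE` | Base URL of the local API (used when `USE_LOCAL_API=true`) | `http://localhost:8080` |
In a Docker environment, you can set them in any of these ways:
- Pass an environment file with `--env-file`
- Set individual variables with `-e`
- Configure them in the `environment` section of `docker-compose.yml`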
FROM python:3.12-slim
# Set working directory
WORKDIR /app
# Configure pip to use Alibaba Cloud mirror
RUN pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install dependencies
RUN pip install --no-cache-dir poetry
# Copy project files
COPY pyproject.toml .
COPY README.md .
COPY src/ ./src/
# Install the package
RUN poetry config virtualenvs.create false && \
poetry install
# Create downloads directory
RUN mkdir -p /app/downloads
# Set environment variables
ENV OUTPUT_DIR=/app/downloads
# MINERU_API_KEY should be provided at runtime
ENV MINERU_API_BASE=https://mineru.net
ENV USE_LOCAL_API=false
ENV LOCAL_MINERU_API_BASE=""
# Expose the port that SSE will run on
EXPOSE 8001
# Set command to start the service with SSE transport
CMD ["mineru-mcp", "--transport", "sse", "--output-dir", "/app/downloads"]
\ No newline at end of file
# MinerU MCP-Server
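## 1. Overview
This project provides a **MinerU MCP server** (`mineru-mcp`) built on the **FastMCP** framework. It acts as an interface to the **MinerU API** and converts documents to Markdown.
The server exposes the following tools over the MCP protocol:
1. `parse_documents`: a unified interface that accepts local files and URLs, picks the processing mode from the configuration, and reads back the converted content automatically
2. `get_ocr_languages`: returns the list of languages supported by OCR
Other applications and MCP clients can use these tools to integrate MinerU's document-to-Markdown conversion.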
## 2. 核心功能
* **文档提取**: 接收文档文件输入(单个或多个 URL、单个或多个本地路径,支持doc、ppt、pdf、图片多种格式),调用 MinerU API 进行内容提取和格式转换,最终生成 Markdown 文件。
* **批量处理**: 支持同时处理多个文档文件(通过提供由空格、逗号或换行符分隔的 URL 列表或本地文件路径列表)。
* **OCR 支持**: 可选启用 OCR 功能(默认不开启),以处理扫描版或图片型文档。
* **多语言支持**: 支持多种语言的识别,可以自动检测文档语言或手动指定。
* **自动化流程**: 自动处理与 MinerU API 的交互,包括任务提交、状态轮询、结果下载解压、结果文件读取。
* **本地解析**: 支持调用本地部署的mineru模型直接解析文档,不依赖远程 API,适用于隐私敏感场景或离线环境。
* **智能路径处理**: 自动识别URL和本地文件路径,根据USE_LOCAL_API配置选择最合适的处理方式。
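The automated flow listed above (submit a task, poll its status, then download the result) can be sketched as a small polling helper. This is only an illustration under stated assumptions, not the server's actual code: `poll_until_done` and `fetch_status` are hypothetical names, and `"failed"` is an assumed terminal state alongside `"done"`.

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[Dict[str, str]]],
    max_retries: int = 180,
    retry_interval: float = 10.0,
) -> Dict[str, str]:
    """Poll until every file reaches a terminal state or the retries run out.

    fetch_status is a hypothetical coroutine returning {file_name: state}.
    """
    terminal = {"done", "failed"}  # "failed" is an assumed state name
    states: Dict[str, str] = {}
    for _ in range(max_retries):
        states = await fetch_status()
        if states and all(state in terminal for state in states.values()):
            return states
        await asyncio.sleep(retry_interval)
    raise TimeoutError(f"tasks still pending after {max_retries} checks: {states}")
```

Passing the status function in as an argument keeps the loop independent of any particular HTTP client, which also makes it easy to exercise with a fake.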
## 3. 安装
在开始安装之前,请确保您的系统满足以下基本要求:
* Python >= 3.10
### 3.1 使用 pip 安装 (推荐)
If the package is available from PyPI or another Python package index, you can install it directly with pip:
```bash
pip install mineru-mcp==1.0.0
```
目前版本:1.0.0
这种方式适用于不需要修改源代码的普通用户。
### 3.2 从源码安装
如果你需要修改源代码或进行开发,可以从源码安装。
克隆仓库并进入项目目录:
```bash
git clone <repository-url> # 替换为你的仓库 URL
cd mineru-mcp
```
We recommend installing with `uv` or `pip` inside a virtual environment:
**使用 uv (推荐):**
```bash
# 安装 uv (如果尚未安装)
# pip install uv
# 创建并激活虚拟环境
uv venv
# Linux/macOS
source .venv/bin/activate
# Windows
# .venv\Scripts\activate
# 安装依赖和项目
uv pip install -e .
```
**使用 pip:**
```bash
# 创建并激活虚拟环境
python -m venv .venv
# Linux/macOS
source .venv/bin/activate
# Windows
# .venv\Scripts\activate
# 安装依赖和项目
pip install -e .
```
## 4. 环境变量配置
本项目支持通过环境变量进行配置。你可以选择直接设置系统环境变量,或者在项目根目录创建 `.env` 文件(参考 `.env.example` 模板)。
### 4.1 支持的环境变量
| 环境变量 | 说明 | 默认值 |
| ------------------------- | --------------------------------------------------------------- | ------------------------- |
| `MINERU_API_BASE` | MinerU 远程 API 的基础 URL | `https://mineru.net` |
| `MINERU_API_KEY` | MinerU API 密钥,需要从[官网](https://mineru.net)申请 | - |
| `OUTPUT_DIR` | 转换后文件的保存路径 | `./downloads` |
| `USE_LOCAL_API` | 是否使用本地 API 进行解析 | `false` |
| `LOCAL_MINERU_API_BASE` | 本地 API 的基础 URL(当 `USE_LOCAL_API=true` 时有效) | `http://localhost:8080` |
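As a rough illustration of how the variables in this table could be read with their documented defaults, here is a minimal sketch. The `Settings` dataclass and `load_settings` function are hypothetical names for this example, and treating only the string `true` (case-insensitive) as enabling the local API is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_base: str
    api_key: Optional[str]
    output_dir: str
    use_local_api: bool
    local_api_base: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    return Settings(
        api_base=env.get("MINERU_API_BASE", "https://mineru.net"),
        api_key=env.get("MINERU_API_KEY") or None,
        output_dir=env.get("OUTPUT_DIR", "./downloads"),
        # Assumption: only "true" (case-insensitive) enables the local API.
        use_local_api=env.get("USE_LOCAL_API", "false").strip().lower() == "true",
        local_api_base=env.get("LOCAL_MINERU_API_BASE", "http://localhost:8080"),
    )
```

Accepting any mapping rather than reading `os.environ` directly lets you check a configuration without changing the process environment.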
### 4.2 远程 API 与本地 API
本项目支持两种 API 模式:
* **远程 API**:默认模式,通过 MinerU 官方提供的云服务进行文档解析。优点是无需本地部署复杂的模型和环境,但需要网络连接和 API 密钥。
* **本地 API**:在本地部署 MinerU 引擎进行文档解析,适用于对数据隐私有高要求或需要离线使用的场景。设置 `USE_LOCAL_API=true` 时生效。
### 4.3 获取 API 密钥
要获取 `MINERU_API_KEY`,请访问 [MinerU 官网](https://mineru.net) 注册账号并申请 API 密钥。
## 5. 使用方法
### 5.1 工具概览
本项目通过 MCP 协议提供以下工具:
1. **parse_documents**:统一接口,支持处理本地文件和URL,根据 `USE_LOCAL_API` 配置自动选择合适的处理方式,并自动读取转换后的文件内容
2. **get_ocr_languages**:获取 OCR 支持的语言列表
### 5.2 参数说明
#### 5.2.1 parse_documents
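| Parameter | Type | Description | Default | Applies to |
| ------------------- | ------- | ------------------------------------------------------------------- | -------- | -------- |
| `file_sources` | string | File paths or URLs; separate multiple entries with commas or newlines (supports pdf, ppt, pptx, doc, docx, and the image formats jpg, jpeg, png) | - | all |
| `enable_ocr` | boolean | Whether to enable OCR | `false` | all |
| `language` | string | Document language; defaults to `"ch"` (Chinese), with options such as `"en"` (English) | `ch` | all |
| `page_ranges` | string (optional) | Page ranges as a comma-separated string. For example, `"2,4-6"` selects page 2 and pages 4 through 6; `"2--2"` selects from page 2 through the second-to-last page. | `None` | remote API |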
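To make the `page_ranges` syntax in the table concrete, the sketch below expands a spec into explicit page numbers. The remote API interprets the string on the server side, so this is only an illustration of the documented examples; `expand_page_ranges` is a hypothetical helper, and it assumes 1-based page numbers with negative values counting from the end.

```python
from typing import List


def expand_page_ranges(spec: str, total_pages: int) -> List[int]:
    """Expand a spec such as "2,4-6" or "2--2" into 1-based page numbers."""
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part[1:]:
            # Split on the first dash only, so "2--2" becomes ("2", "-2").
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Negative values count from the back: -1 is the last page.
        if start < 0:
            start = total_pages + start + 1
        if end < 0:
            end = total_pages + end + 1
        pages.extend(range(start, end + 1))
    return pages
```

For a 10-page document, `"2,4-6"` expands to pages 2, 4, 5, 6, and `"2--2"` expands to pages 2 through 9.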
> **Note**:
> - When `USE_LOCAL_API=true`, any URLs you provide are filtered out and only local file paths are processed
> - When `USE_LOCAL_API=false`, both URLs and local file paths are processed
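The filtering rule in the note can be sketched as follows. This is a simplified illustration, not the server's implementation: `split_sources` and `select_sources` are hypothetical names, a source counts as a URL only if it starts with `http://` or `https://`, and entries are split on commas and newlines only so that paths containing spaces stay intact.

```python
import re
from typing import List, Tuple


def split_sources(file_sources: str) -> Tuple[List[str], List[str]]:
    """Split a comma/newline separated string into (urls, local_paths)."""
    items = [s.strip() for s in re.split(r"[,\n]+", file_sources) if s.strip()]
    urls = [s for s in items if s.lower().startswith(("http://", "https://"))]
    paths = [s for s in items if s not in urls]
    return urls, paths


def select_sources(file_sources: str, use_local_api: bool) -> List[str]:
    """Local mode drops URLs; remote mode keeps both URLs and paths."""
    urls, paths = split_sources(file_sources)
    return paths if use_local_api else urls + paths
```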
#### 5.2.2 get_ocr_languages
无需参数
## 6. MCP 客户端集成
你可以在任何支持 MCP 协议的客户端中使用 MinerU MCP 服务器。
### 6.1 在 Claude 中使用
将 MinerU MCP 服务器配置为 Claude 的工具,即可在 Claude 中直接使用文档转 Markdown 功能。配置工具时详情请参考 MCP 工具配置文档。根据不同的安装和使用场景,你可以选择以下两种配置方式:
#### 6.1.1 源码运行方式
如果你是从源码安装并运行 MinerU MCP,可以使用以下配置。这种方式适合你需要修改源码或者进行开发调试的场景:
```json
{
"mcpServers": {
"mineru-mcp": {
"command": "uv",
"args": ["--directory", "/Users/adrianwang/Documents/minerU-mcp", "run", "-m", "mineru.cli"],
"env": {
"MINERU_API_BASE": "https://mineru.net",
"MINERU_API_KEY": "ey...",
"OUTPUT_DIR": "./downloads",
"USE_LOCAL_API": "true",
"LOCAL_MINERU_API_BASE": "http://localhost:8080"
}
}
}
}
```
这种配置的特点:
- 使用 `uv` 命令
- 通过 `--directory` 参数指定源码所在目录
- 使用 `-m mineru.cli` 运行模块
- 适合开发调试和定制化需求
#### 6.1.2 安装包运行方式
如果你是通过 pip 或 uv 安装了 mineru-mcp 包,可以使用以下更简洁的配置。这种方式适合生产环境或日常使用:
```json
{
"mcpServers": {
"mineru-mcp": {
"command": "uvx",
"args": ["mineru-mcp"],
"env": {
"MINERU_API_BASE": "https://mineru.net",
"MINERU_API_KEY": "ey...",
"OUTPUT_DIR": "./downloads",
"USE_LOCAL_API": "true",
"LOCAL_MINERU_API_BASE": "http://localhost:8080"
}
}
}
}
```
这种配置的特点:
- 使用 `uvx` 命令直接运行已安装的包
- 配置更加简洁
- 不需要指定源码目录
- 适合稳定的生产环境使用
### 6.2 在 FastMCP 客户端中使用
```python
import asyncio

from fastmcp import Client


async def main() -> None:
    # Connect to the SSE endpoint of a running mineru-mcp server
    async with Client("http://localhost:8001/sse") as client:
        # Parse a single document with the parse_documents tool
        result = await client.call_tool(
            "parse_documents",
            {"file_sources": "/path/to/document.pdf"},
        )

        # Mix URLs and local files in one call
        result = await client.call_tool(
            "parse_documents",
            {"file_sources": "/path/to/file.pdf, https://example.com/document.pdf"},
        )

        # Enable OCR
        result = await client.call_tool(
            "parse_documents",
            {"file_sources": "/path/to/file.pdf", "enable_ocr": True},
        )
        print(result)


asyncio.run(main())
```
### 6.3 直接运行服务
你可以通过设置环境变量并直接运行命令的方式启动 MinerU MCP 服务器,这种方式特别适合快速测试和开发环境。
#### 6.3.1 设置环境变量
首先,确保设置了必要的环境变量。你可以通过创建 `.env` 文件(参考 `.env.example`)或直接在命令行中设置:
```bash
# Linux/macOS
export MINERU_API_BASE="https://mineru.net"
export MINERU_API_KEY="your-api-key"
export OUTPUT_DIR="./downloads"
export USE_LOCAL_API="true" # 可选,如果需要本地解析
export LOCAL_MINERU_API_BASE="http://localhost:8080" # 可选,如果启用本地 API
# Windows
set MINERU_API_BASE=https://mineru.net
set MINERU_API_KEY=your-api-key
set OUTPUT_DIR=./downloads
set USE_LOCAL_API=true
set LOCAL_MINERU_API_BASE=http://localhost:8080
```
#### 6.3.2 启动服务
使用以下命令启动 MinerU MCP 服务器,支持多种传输模式:
**SSE 传输模式**
```bash
uv run mineru-mcp --transport sse
```
**Streamable HTTP 传输模式**
```bash
uv run mineru-mcp --transport streamable-http
```
或者,如果你使用全局安装:
```bash
mineru-mcp --transport sse
# 或
mineru-mcp --transport streamable-http
```
服务默认在 `http://localhost:8001` 启动,使用的传输协议取决于你指定的 `--transport` 参数。
> **注意**:不同传输模式使用不同的路由路径:
> - SSE 模式:`/sse`(例如:`http://localhost:8001/sse`)
> - Streamable HTTP 模式:`/mcp`(例如:`http://localhost:8001/mcp`)
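A client needs the route that matches the chosen transport. The mapping from the note can be written as a small helper; `endpoint_url` is a hypothetical name for this sketch, and the host and port defaults simply repeat the `http://localhost:8001` address mentioned above.

```python
def endpoint_url(transport: str, host: str = "localhost", port: int = 8001) -> str:
    """Return the client URL for a transport, using the routes documented above."""
    routes = {"sse": "/sse", "streamable-http": "/mcp"}
    if transport not in routes:
        raise ValueError(f"unsupported transport: {transport!r}")
    return f"http://{host}:{port}{routes[transport]}"
```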
## 7. Docker 部署
本项目支持使用 Docker 进行部署,使你能在任何支持 Docker 的环境中快速启动 MinerU MCP 服务器。
### 7.1 使用 Docker Compose
1. 确保你已经安装了 Docker 和 Docker Compose
2. 复制项目根目录中的 `.env.example` 文件为 `.env`,并根据你的需求修改环境变量
3. 运行以下命令启动服务:
```bash
docker-compose up -d
```
服务默认会在 `http://localhost:8001` 启动。
### 7.2 手动构建 Docker 镜像
如果需要手动构建 Docker 镜像,可以使用以下命令:
```bash
docker build -t mineru-mcp:latest .
```
然后启动容器:
```bash
docker run -p 8001:8001 --env-file .env mineru-mcp:latest
```
更多 Docker 相关信息,请参考 `DOCKER_README.md` 文件。
## 8. 常见问题
### 8.1 API 密钥问题
**问题**:无法连接 MinerU API 或返回 401 错误。
**解决方案**:检查你的 API 密钥是否正确设置。在 `.env` 文件中确保 `MINERU_API_KEY` 环境变量包含有效的密钥。
### 8.2 如何优雅退出服务
**问题**:如何正确地停止 MinerU MCP 服务?
**解决方案**:服务运行时,可以通过按 `Ctrl+C` 来优雅地退出。系统会自动处理正在进行的操作,并确保所有资源得到正确释放。如果一次 `Ctrl+C` 没有响应,可以再次按下 `Ctrl+C` 强制退出。
### 8.3 文件路径问题
**问题**:使用 `parse_documents` 工具处理本地文件时报找不到文件错误。
**解决方案**:请确保使用绝对路径,或者相对于服务器运行目录的正确相对路径。
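One way to avoid this error on the client side is to resolve each path to an absolute one before passing it to `parse_documents`. The helper below is a minimal sketch with a hypothetical name (`to_absolute`). It resolves against the client's current working directory, so the result only helps when the client and the server see the same filesystem.

```python
from pathlib import Path


def to_absolute(path_text: str) -> Path:
    """Expand ~, resolve to an absolute path, and fail early if the file is missing."""
    path = Path(path_text).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"file not found: {path}")
    return path
```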
### 8.4 MCP 服务调用超时问题
**问题**:调用 `parse_documents` 工具时出现 `Error calling tool 'parse_documents': MCP error -32001: Request timed out` 错误。
**解决方案**:这个问题常见于处理大型文档或网络不稳定的情况。在某些 MCP 客户端(如 Cursor)中,超时后可能导致无法再次调用 MCP 服务,需要重启客户端。最新版本的 Cursor 中可能会显示正在调用 MCP,但实际上没有真正调用成功。建议:
1. **等待官方修复**:这是Cursor客户端的已知问题,建议等待Cursor官方修复
2. **处理小文件**:尽量只处理少量小文件,避免处理大型文档导致超时
3. **分批处理**:将多个文件分成多次请求处理,每次只处理一两个文件
4. 增加超时时间设置(如果客户端支持)
5. 对于超时后无法再次调用的问题,需要重启 MCP 客户端
6. 如果反复出现超时,请检查网络连接或考虑使用本地 API 模式
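The batching advice above (one or two files per request) can be sketched with a small chunking helper. `batched` is a hypothetical name used for illustration, and the default batch size of 2 just mirrors the suggestion in the list.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], size: int = 2) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

Each group can then be joined with commas and sent as the `file_sources` argument of a separate `parse_documents` call.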
version: '3'
services:
  mineru-mcp:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8001:8001"
    environment:
      - MINERU_API_KEY=${MINERU_API_KEY}
    volumes:
      - ./downloads:/app/downloads
    restart: unless-stopped
\ No newline at end of file
[project]
name = "mineru-mcp"
version = "1.0.0"
description = "MinerU MCP Server for PDF to Markdown conversion"
authors = [
{name = "minerU",email = "OpenDataLab@pjlab.org.cn"}
]
readme = "README.md"
license = {text = "MIT"}
requires-python = ">=3.10,<4.0"
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
]
dependencies = [
"fastmcp>=2.5.2",
"python-dotenv>=1.0.0",
"requests>=2.31.0",
"aiohttp>=3.9.0",
"httpx>=0.24.0",
"uvicorn>=0.20.0",
"starlette>=0.27.0",
]
[project.scripts]
mineru-mcp = "mineru.cli:main"
[tool.poetry]
packages = [{include = "mineru", from = "src"}]
[[tool.poetry.source]]
name = "aliyun"
url = "https://mirrors.aliyun.com/pypi/simple/"
priority = "primary"
[build-system]
requires = ["setuptools>=42.0", "wheel"]
build-backend = "setuptools.build_meta"
"""MinerU File转Markdown转换的API客户端。"""
import asyncio
import os
import zipfile
from pathlib import Path
from typing import Any, Dict, List, Optional, Union
import aiohttp
import requests
from . import config
def singleton_func(cls):
    instance = {}

    def _singleton(*args, **kwargs):
        if cls not in instance:
            instance[cls] = cls(*args, **kwargs)
        return instance[cls]

    return _singleton
@singleton_func
class MinerUClient:
    """
    Client for the MinerU API that converts files to Markdown.
    """

    def __init__(self, api_base: Optional[str] = None, api_key: Optional[str] = None):
        """
        Initialize the MinerU API client.

        Args:
            api_base: Base URL of the MinerU API (default: read from the environment)
            api_key: API key used to authenticate with MinerU (default: read from the environment)
        """
        self.api_base = api_base or config.MINERU_API_BASE
        self.api_key = api_key or config.MINERU_API_KEY
        if not self.api_key:
            # Raise a friendlier error message
            raise ValueError(
                "错误: MinerU API 密钥 (MINERU_API_KEY) 未设置或为空。\n"
                "请确保已设置 MINERU_API_KEY 环境变量,例如:\n"
                " export MINERU_API_KEY='your_actual_api_key'\n"
                "或者,在项目根目录的 `.env` 文件中定义该变量。"
            )

    async def _request(self, method: str, endpoint: str, **kwargs) -> Dict[str, Any]:
        """
        Send a request to the MinerU API.

        Args:
            method: HTTP method (GET, POST, etc.)
            endpoint: API endpoint path (without the base URL)
            **kwargs: Additional arguments passed to the aiohttp request

        Returns:
            dict: API response (JSON)
        """
        url = f"{self.api_base}{endpoint}"
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Accept": "application/json",
        }
        if "headers" in kwargs:
            kwargs["headers"].update(headers)
        else:
            kwargs["headers"] = headers

        # Build a copy of the arguments with the credentials masked, for logging
        log_kwargs = kwargs.copy()
        if "headers" in log_kwargs and "Authorization" in log_kwargs["headers"]:
            log_kwargs["headers"] = log_kwargs["headers"].copy()
            log_kwargs["headers"]["Authorization"] = "Bearer ****"  # hide the API key

        config.logger.debug(f"API请求: {method} {url}")
        config.logger.debug(f"请求参数: {log_kwargs}")

        async with aiohttp.ClientSession() as session:
            async with session.request(method, url, **kwargs) as response:
                response.raise_for_status()
                response_json = await response.json()
                config.logger.debug(f"API响应: {response_json}")
                return response_json
async def submit_file_url_task(
self,
urls: Union[str, List[Union[str, Dict[str, Any]]], Dict[str, Any]],
enable_ocr: bool = True,
language: str = "ch",
page_ranges: Optional[str] = None,
) -> Dict[str, Any]:
"""
提交 File URL 以转换为 Markdown。支持单个URL或多个URL批量处理。
Args:
urls: 可以是以下形式之一:
1. 单个URL字符串
2. 多个URL的列表
3. 包含URL配置的字典列表,每个字典包含:
- url: File文件URL (必需)
- is_ocr: 是否启用OCR (可选)
- data_id: 文件数据ID (可选)
- page_ranges: 页码范围 (可选)
enable_ocr: 是否为转换启用 OCR(所有文件的默认值)
language: 指定文档语言,默认 ch,中文
page_ranges: 指定页码范围,格式为逗号分隔的字符串。例如:"2,4-6"表示选取第2页、第4页至第6页;"2--2"表示从第2页到倒数第2页。
Returns:
dict: 任务信息,包括batch_id
"""
# 统计URL数量
url_count = 1
if isinstance(urls, list):
url_count = len(urls)
config.logger.debug(
f"调用submit_file_url_task: {url_count}个URL, "
+ f"ocr={enable_ocr}, "
+ f"language={language}"
)
# 处理输入,确保我们有一个URL配置列表
urls_config = []
# 转换输入为标准格式
if isinstance(urls, str):
urls_config.append(
{"url": urls, "is_ocr": enable_ocr, "page_ranges": page_ranges}
)
elif isinstance(urls, list):
# 处理URL列表或URL配置列表
for i, url_item in enumerate(urls):
if isinstance(url_item, str):
# 简单的URL字符串
urls_config.append(
{
"url": url_item,
"is_ocr": enable_ocr,
"page_ranges": page_ranges,
}
)
elif isinstance(url_item, dict):
# 含有详细配置的URL字典
if "url" not in url_item:
raise ValueError(f"URL配置必须包含 'url' 字段: {url_item}")
url_is_ocr = url_item.get("is_ocr", enable_ocr)
url_page_ranges = url_item.get("page_ranges", page_ranges)
url_config = {"url": url_item["url"], "is_ocr": url_is_ocr}
if url_page_ranges is not None:
url_config["page_ranges"] = url_page_ranges
urls_config.append(url_config)
else:
raise TypeError(f"不支持的URL配置类型: {type(url_item)}")
elif isinstance(urls, dict):
# 单个URL配置字典
if "url" not in urls:
raise ValueError(f"URL配置必须包含 'url' 字段: {urls}")
url_is_ocr = urls.get("is_ocr", enable_ocr)
url_page_ranges = urls.get("page_ranges", page_ranges)
url_config = {"url": urls["url"], "is_ocr": url_is_ocr}
if url_page_ranges is not None:
url_config["page_ranges"] = url_page_ranges
urls_config.append(url_config)
else:
raise TypeError(f"urls 必须是字符串、列表或字典,而不是 {type(urls)}")
# 构建API请求payload
files_payload = urls_config # 与submit_file_task不同,这里直接使用URLs配置
payload = {
"language": language,
"files": files_payload,
}
# 调用批量API
response = await self._request(
"POST", "/api/v4/extract/task/batch", json=payload
)
# 检查响应
if "data" not in response or "batch_id" not in response["data"]:
raise ValueError(f"提交批量URL任务失败: {response}")
batch_id = response["data"]["batch_id"]
config.logger.info(f"开始处理 {len(urls_config)} 个文件URL")
config.logger.debug(f"批量URL任务提交成功,批次ID: {batch_id}")
# 返回包含batch_id的响应和URLs信息
result = {
"data": {
"batch_id": batch_id,
"uploaded_files": [url_config.get("url") for url_config in urls_config],
}
}
# 对于单个URL的情况,设置file_name以保持与原来返回格式的兼容性
if len(urls_config) == 1:
url = urls_config[0]["url"]
# 从URL中提取文件名
file_name = url.split("/")[-1]
result["data"]["file_name"] = file_name
return result
async def submit_file_task(
self,
files: Union[str, List[Union[str, Dict[str, Any]]], Dict[str, Any]],
enable_ocr: bool = True,
language: str = "ch",
page_ranges: Optional[str] = None,
) -> Dict[str, Any]:
"""
提交本地 File 文件以转换为 Markdown。支持单个文件路径或多个文件配置。
Args:
files: 可以是以下形式之一:
1. 单个文件路径字符串
2. 多个文件路径的列表
3. 包含文件配置的字典列表,每个字典包含:
- path/name: 文件路径或文件名
- is_ocr: 是否启用OCR (可选)
- data_id: 文件数据ID (可选)
- page_ranges: 页码范围 (可选)
enable_ocr: 是否为转换启用 OCR(所有文件的默认值)
language: 指定文档语言,默认 ch,中文
page_ranges: 指定页码范围,格式为逗号分隔的字符串。例如:"2,4-6"表示选取第2页、第4页至第6页;"2--2"表示从第2页到倒数第2页。
Returns:
dict: 任务信息,包括batch_id
"""
# 统计文件数量
file_count = 1
if isinstance(files, list):
file_count = len(files)
config.logger.debug(
f"调用submit_file_task: {file_count}个文件, "
+ f"ocr={enable_ocr}, "
+ f"language={language}"
)
# 处理输入,确保我们有一个文件配置列表
files_config = []
# 转换输入为标准格式
if isinstance(files, str):
# 单个文件路径
file_path = Path(files)
if not file_path.exists():
raise FileNotFoundError(f"未找到 File 文件: {file_path}")
files_config.append(
{
"path": file_path,
"name": file_path.name,
"is_ocr": enable_ocr,
"page_ranges": page_ranges,
}
)
elif isinstance(files, list):
# 处理文件路径列表或文件配置列表
for i, file_item in enumerate(files):
if isinstance(file_item, str):
# 简单的文件路径
file_path = Path(file_item)
if not file_path.exists():
raise FileNotFoundError(f"未找到 File 文件: {file_path}")
files_config.append(
{
"path": file_path,
"name": file_path.name,
"is_ocr": enable_ocr,
"page_ranges": page_ranges,
}
)
elif isinstance(file_item, dict):
# 含有详细配置的文件字典
if "path" not in file_item and "name" not in file_item:
raise ValueError(
f"文件配置必须包含 'path' 或 'name' 字段: {file_item}"
)
if "path" in file_item:
file_path = Path(file_item["path"])
if not file_path.exists():
raise FileNotFoundError(f"未找到 File 文件: {file_path}")
file_name = file_path.name
else:
file_name = file_item["name"]
file_path = None
file_is_ocr = file_item.get("is_ocr", enable_ocr)
file_page_ranges = file_item.get("page_ranges", page_ranges)
file_config = {
"path": file_path,
"name": file_name,
"is_ocr": file_is_ocr,
}
if file_page_ranges is not None:
file_config["page_ranges"] = file_page_ranges
files_config.append(file_config)
else:
raise TypeError(f"不支持的文件配置类型: {type(file_item)}")
elif isinstance(files, dict):
# 单个文件配置字典
if "path" not in files and "name" not in files:
raise ValueError(f"文件配置必须包含 'path' 或 'name' 字段: {files}")
if "path" in files:
file_path = Path(files["path"])
if not file_path.exists():
raise FileNotFoundError(f"未找到 File 文件: {file_path}")
file_name = file_path.name
else:
file_name = files["name"]
file_path = None
file_is_ocr = files.get("is_ocr", enable_ocr)
file_page_ranges = files.get("page_ranges", page_ranges)
file_config = {
"path": file_path,
"name": file_name,
"is_ocr": file_is_ocr,
}
if file_page_ranges is not None:
file_config["page_ranges"] = file_page_ranges
files_config.append(file_config)
else:
raise TypeError(f"files 必须是字符串、列表或字典,而不是 {type(files)}")
# 步骤1: 构建API请求payload
files_payload = []
for file_config in files_config:
file_payload = {
"name": file_config["name"],
"is_ocr": file_config["is_ocr"],
}
if "page_ranges" in file_config and file_config["page_ranges"] is not None:
file_payload["page_ranges"] = file_config["page_ranges"]
files_payload.append(file_payload)
payload = {
"language": language,
"files": files_payload,
}
# 步骤2: 获取文件上传URL
response = await self._request("POST", "/api/v4/file-urls/batch", json=payload)
# 检查响应
if (
"data" not in response
or "batch_id" not in response["data"]
or "file_urls" not in response["data"]
):
raise ValueError(f"获取上传URL失败: {response}")
batch_id = response["data"]["batch_id"]
file_urls = response["data"]["file_urls"]
if len(file_urls) != len(files_config):
raise ValueError(
f"上传URL数量 ({len(file_urls)}) 与文件数量 ({len(files_config)}) 不匹配"
)
config.logger.info(f"开始上传 {len(file_urls)} 个本地文件")
config.logger.debug(f"获取上传URL成功,批次ID: {batch_id}")
# 步骤3: 上传所有文件
uploaded_files = []
for i, (file_config, upload_url) in enumerate(zip(files_config, file_urls)):
file_path = file_config["path"]
if file_path is None:
raise ValueError(f"文件 {file_config['name']} 没有有效的路径")
try:
with open(file_path, "rb") as f:
# 重要:不设置Content-Type,让OSS自动处理
response = requests.put(upload_url, data=f)
if response.status_code != 200:
raise ValueError(
f"文件上传失败,状态码: {response.status_code}, 响应: {response.text}"
)
config.logger.debug(f"文件 {file_path.name} 上传成功")
uploaded_files.append(file_path.name)
except Exception as e:
raise ValueError(f"文件 {file_path.name} 上传失败: {str(e)}")
config.logger.info(f"文件上传完成,共 {len(uploaded_files)} 个文件")
# 返回包含batch_id的响应和已上传的文件信息
result = {"data": {"batch_id": batch_id, "uploaded_files": uploaded_files}}
# 对于单个文件的情况,保持与原来返回格式的兼容性
if len(uploaded_files) == 1:
result["data"]["file_name"] = uploaded_files[0]
return result
async def get_batch_task_status(self, batch_id: str) -> Dict[str, Any]:
"""
获取批量转换任务的状态。
Args:
batch_id: 批量任务的ID
Returns:
dict: 批量任务状态信息
"""
response = await self._request(
"GET", f"/api/v4/extract-results/batch/{batch_id}"
)
return response
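    async def process_file_to_markdown(
        self,
        task_fn,
        task_arg: Union[str, List[Dict[str, Any]], Dict[str, Any]],
        enable_ocr: bool = True,
        output_dir: Optional[str] = None,
        max_retries: int = 180,
        retry_interval: int = 10,
    ) -> Union[str, Dict[str, Any]]:
        """
        Run a File-to-Markdown conversion from start to finish.

        Args:
            task_fn: Function that submits the task
                (submit_file_url_task or submit_file_task)
            task_arg: Argument for the task function, one of:
                - a URL string
                - a file path string
                - a dict holding one file configuration
                - a list of dicts holding several file configurations
            enable_ocr: Whether to enable OCR
            output_dir: Output directory for the results
            max_retries: Maximum number of status checks
            retry_interval: Interval between status checks, in seconds

        Returns:
            Dict[str, Any]: as written, the method always returns a dict:
                {
                    "results": [
                        {
                            "filename": str,
                            "status": str,         # "success" or "error"
                            "content": str,        # success entries only
                            "extract_path": str,   # success entries only
                            "error_message": str,  # error entries only
                        }
                    ],
                    "extract_dir": str,
                    "success_count": int,
                    "fail_count": int,
                    "total_count": int,
                }
        """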
        try:
            # Submit the task; call with positional rather than keyword arguments
            task_info = await task_fn(task_arg, enable_ocr)

            # Batch task handling
            batch_id = task_info["data"]["batch_id"]

            # Collect the names of all uploaded files
            uploaded_files = task_info["data"].get("uploaded_files", [])
            if not uploaded_files and "file_name" in task_info["data"]:
                uploaded_files = [task_info["data"]["file_name"]]

            if not uploaded_files:
                raise ValueError("Could not determine which files were uploaded")

            config.logger.debug(f"Batch task submitted. Batch ID: {batch_id}")

            # Track the processing state of every file
            files_status = {}  # keyed by file_name
            files_download_urls = {}
            failed_files = {}  # failed files and their error messages

            # Prepare the output path
            output_path = config.ensure_output_dir(output_dir)

            # Poll until the task completes
            for i in range(max_retries):
                status_info = await self.get_batch_task_status(batch_id)

                config.logger.debug(f"Polling result: {status_info}")

                if (
                    "data" not in status_info
                    or "extract_result" not in status_info["data"]
                ):
                    config.logger.error(f"Failed to get batch task status: {status_info}")
                    await asyncio.sleep(retry_interval)
                    continue

                # Check the state of every file
                all_done = True
                has_progress = False

                for result in status_info["data"]["extract_result"]:
                    file_name = result.get("file_name")

                    if not file_name:
                        continue

                    # Initialize the state if this file has not been seen before
                    if file_name not in files_status:
                        files_status[file_name] = "pending"

                    state = result.get("state")
                    files_status[file_name] = state

                    if state == "done":
                        # Save the download link
                        full_zip_url = result.get("full_zip_url")
                        if full_zip_url:
                            files_download_urls[file_name] = full_zip_url
                            config.logger.info(f"File {file_name} finished processing")
                        else:
                            config.logger.debug(
                                f"File {file_name} is marked done but has no download link"
                            )
                            all_done = False
                    elif state in ["failed", "error"]:
                        err_msg = result.get("err_msg", "unknown error")
                        failed_files[file_name] = err_msg
                        config.logger.warning(f"File {file_name} failed: {err_msg}")
                        # Do not raise; keep processing the other files
                    else:
                        all_done = False
                        # Show progress information
                        if state == "running" and "extract_progress" in result:
                            has_progress = True
                            progress = result["extract_progress"]
                            extracted = progress.get("extracted_pages", 0)
                            total = progress.get("total_pages", 0)
                            if total > 0:
                                percent = (extracted / total) * 100
                                config.logger.info(
                                    f"Progress: {file_name} "
                                    f"{extracted}/{total} pages "
                                    f"({percent:.1f}%)"
                                )

                # Check whether every file has been processed
                expected_file_count = len(uploaded_files)
                processed_file_count = len(files_status)
                completed_file_count = len(files_download_urls) + len(failed_files)

                # Log the current state
                config.logger.debug(
                    f"File processing state: all_done={all_done}, "
                    f"files_status count={processed_file_count}, "
                    f"uploaded file count={expected_file_count}, "
                    f"download link count={len(files_download_urls)}, "
                    f"failed file count={len(failed_files)}"
                )

                # Decide whether every file is finished (succeeded or failed)
                if (
                    processed_file_count > 0
                    and processed_file_count >= expected_file_count
                    and completed_file_count >= processed_file_count
                ):
                    if files_download_urls or failed_files:
                        config.logger.info("File processing finished")
                        if failed_files:
                            config.logger.warning(
                                f"{len(failed_files)} file(s) failed to process"
                            )
                        break
                    else:
                        # This should not happen; kept as a safeguard
                        all_done = False

                # Without progress information, show a simple waiting message
                if not has_progress:
                    config.logger.info(
                        f"Waiting for file processing to finish... ({i+1}/{max_retries})"
                    )

                await asyncio.sleep(retry_interval)
            else:
                # Retries exhausted (loop ended without break): check for partial results
                if not files_download_urls and not failed_files:
                    raise TimeoutError(
                        f"Batch task {batch_id} did not finish within the allowed time"
                    )
                else:
                    config.logger.warning(
                        "Warning: some files did not finish within the allowed time; "
                        "continuing with the files that did"
                    )

            # Create the main extraction directory
            extract_dir = output_path / batch_id
            extract_dir.mkdir(exist_ok=True)

            # Prepare the result list
            results = []

            # Download and unpack the result of each successfully processed file
            for file_name, download_url in files_download_urls.items():
                try:
                    # Fixed: the call and its argument were split across two lines,
                    # so nothing was logged
                    config.logger.debug(f"Downloading result for file: {file_name}")

                    # Use the zip file name from the download URL as the subdirectory name
                    zip_file_name = download_url.split("/")[-1]
                    # Strip the .zip extension
                    zip_dir_name = os.path.splitext(zip_file_name)[0]
                    file_extract_dir = extract_dir / zip_dir_name
                    file_extract_dir.mkdir(exist_ok=True)

                    # Download the ZIP file
                    zip_path = output_path / f"{batch_id}_{zip_file_name}"

                    async with aiohttp.ClientSession() as session:
                        async with session.get(
                            download_url,
                            headers={"Authorization": f"Bearer {self.api_key}"},
                        ) as response:
                            response.raise_for_status()
                            with open(zip_path, "wb") as f:
                                f.write(await response.read())

                    # Extract into the subdirectory
                    with zipfile.ZipFile(zip_path, "r") as zip_ref:
                        zip_ref.extractall(file_extract_dir)

                    # Delete the ZIP file after extraction
                    zip_path.unlink()

                    # Try to read the Markdown content
                    markdown_content = ""
                    markdown_files = list(file_extract_dir.glob("*.md"))
                    if markdown_files:
                        with open(markdown_files[0], "r", encoding="utf-8") as f:
                            markdown_content = f.read()

                    # Record the successful result
                    results.append(
                        {
                            "filename": file_name,
                            "status": "success",
                            "content": markdown_content,
                            "extract_path": str(file_extract_dir),
                        }
                    )

                    config.logger.debug(
                        f"Result for file {file_name} extracted to: {file_extract_dir}"
                    )

                except Exception as e:
                    # Download failed; record an error result
                    error_msg = f"failed to download result: {str(e)}"
                    config.logger.error(f"File {file_name} {error_msg}")
                    results.append(
                        {
                            "filename": file_name,
                            "status": "error",
                            "error_message": error_msg,
                        }
                    )

            # Add the files that failed processing to the results
            for file_name, error_msg in failed_files.items():
                results.append(
                    {
                        "filename": file_name,
                        "status": "error",
                        "error_message": f"processing failed: {error_msg}",
                    }
                )

            # Summarize the outcome. Counts are derived from `results` so that a
            # file whose result download failed is not reported as a success.
            success_count = sum(1 for r in results if r["status"] == "success")
            fail_count = len(results) - success_count
            total_count = len(results)

            config.logger.info("\n=== File processing summary ===")
            config.logger.info(f"Total files: {total_count}")
            config.logger.info(f"Succeeded: {success_count}")
            config.logger.info(f"Failed: {fail_count}")

            if failed_files:
                config.logger.info("\nFailed file details:")
                for file_name, error_msg in failed_files.items():
                    config.logger.info(f"  - {file_name}: {error_msg}")

            if success_count > 0:
                config.logger.info(f"\nResults saved to: {extract_dir}")
            else:
                config.logger.info(f"\nOutput directory: {extract_dir}")

            # Return the detailed results
            return {
                "results": results,
                "extract_dir": str(extract_dir),
                "success_count": success_count,
                "fail_count": fail_count,
                "total_count": total_count,
            }

        except Exception as e:
            config.logger.error(f"File-to-Markdown processing failed: {str(e)}")
            raise
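
The polling loop above stops once every uploaded file has appeared in the status response and each one has either a download link or a recorded failure. As a minimal sketch, that stop condition can be isolated into a pure function; `batch_is_finished` is a hypothetical name for illustration and is not part of the module.

```python
from typing import Dict


def batch_is_finished(
    expected_file_count: int,
    files_status: Dict[str, str],
    files_download_urls: Dict[str, str],
    failed_files: Dict[str, str],
) -> bool:
    """Mirror the three-part stop condition used in the polling loop.

    A batch counts as finished when at least one file has been seen, at
    least as many files have been seen as were uploaded, and every seen
    file has either a download URL or a failure entry.
    """
    processed = len(files_status)
    completed = len(files_download_urls) + len(failed_files)
    return (
        processed > 0
        and processed >= expected_file_count
        and completed >= processed
    )
```

Keeping the predicate free of I/O makes the edge cases easy to check, for example a file reported as `done` without a download link, which is deliberately not counted as completed.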
"""MinerU File转Markdown服务的命令行界面。"""

import sys
import argparse

from . import config
from . import server


def main():
    """Entry point for the command-line interface."""
    parser = argparse.ArgumentParser(
        description="MinerU File-to-Markdown conversion service"
    )

    parser.add_argument(
        "--output-dir",
        "-o",
        type=str,
        help="Directory for converted files (default: ./downloads)",
    )

    parser.add_argument(
        "--transport",
        "-t",
        type=str,
        default="stdio",
        help="Transport type (default: stdio; options: sse, streamable-http)",
    )

    parser.add_argument(
        "--port",
        "-p",
        type=int,
        default=8001,
        help="Server port (default: 8001; only used with an HTTP transport)",
    )

    parser.add_argument(
        "--host",
        type=str,
        default="127.0.0.1",
        help="Server host address (default: 127.0.0.1; only used with an HTTP transport)",
    )

    args = parser.parse_args()

    # Check that the arguments make sense together
    if args.transport == "stdio" and (args.host != "127.0.0.1" or args.port != 8001):
        print("Warning: --host and --port are ignored in STDIO mode", file=sys.stderr)

    # Validate the API key here, after parsing, so --help works without a key
    if not config.MINERU_API_KEY:
        print(
            "Error: the MINERU_API_KEY environment variable is required to start the service."
            "\nCheck that it is set, for example:"
            "\n  export MINERU_API_KEY='your_actual_api_key'"
            "\nAlternatively, define it in the `.env` file in the project root."
            "\n\nUse --help to see the available command-line options.",
            file=sys.stderr,  # send the error message to stderr
        )
        sys.exit(1)

    # Apply the output directory if one was given
    if args.output_dir:
        server.set_output_dir(args.output_dir)

    # Print the configuration
    print("Starting the MinerU File-to-Markdown conversion service...")
    if args.transport in ["sse", "streamable-http"]:
        print(f"Server address: {args.host}:{args.port}")
    print("Press Ctrl+C to stop the service")

    server.run_server(mode=args.transport, port=args.port, host=args.host)


if __name__ == "__main__":
    main()
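
To see how the four options interact without starting a server, the parser can be rebuilt on its own and fed an explicit argument list. This is a sketch only: `build_parser` and `host_port_ignored` are hypothetical helpers that reproduce the option definitions and the STDIO check from `main()`; they do not exist in the module.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same four options and defaults as main()."""
    parser = argparse.ArgumentParser(description="MinerU File-to-Markdown service")
    parser.add_argument("--output-dir", "-o", type=str)
    parser.add_argument("--transport", "-t", type=str, default="stdio")
    parser.add_argument("--port", "-p", type=int, default=8001)
    parser.add_argument("--host", type=str, default="127.0.0.1")
    return parser


def host_port_ignored(args: argparse.Namespace) -> bool:
    """Return True when --host or --port was changed but the transport is stdio."""
    return args.transport == "stdio" and (
        args.host != "127.0.0.1" or args.port != 8001
    )


# Parse an explicit argument list instead of sys.argv
example_args = build_parser().parse_args(["-t", "sse", "-p", "9000"])
```

Because the check compares against the defaults, passing `--port 8001` explicitly in STDIO mode does not trigger the warning; only a non-default value does.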
"""MinerU File转Markdown转换服务的配置工具。"""

import os
import logging
from pathlib import Path

from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

# API configuration
MINERU_API_BASE = os.getenv("MINERU_API_BASE", "https://mineru.net")
MINERU_API_KEY = os.getenv("MINERU_API_KEY", "")

# Local API configuration
USE_LOCAL_API = os.getenv("USE_LOCAL_API", "").lower() in ["true", "1", "yes"]
LOCAL_MINERU_API_BASE = os.getenv("LOCAL_MINERU_API_BASE", "http://localhost:8080")

# Default output directory for converted files
DEFAULT_OUTPUT_DIR = os.getenv("OUTPUT_DIR", "./downloads")


# Set up logging
def setup_logging():
    """
    Set up logging, taking the log level from environment variables.

    Returns:
        logging.Logger: The configured logger.
    """
    # Read the log level from the environment
    log_level = os.getenv("MINERU_LOG_LEVEL", "INFO").upper()
    debug_mode = os.getenv("MINERU_DEBUG", "").lower() in ["true", "1", "yes"]

    # debug_mode overrides log_level
    if debug_mode:
        log_level = "DEBUG"

    # Fall back to INFO if log_level is not a valid level name
    valid_levels = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
    if log_level not in valid_levels:
        log_level = "INFO"

    # Log format
    log_format = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

    # Configure logging
    logging.basicConfig(level=getattr(logging, log_level), format=log_format)

    logger = logging.getLogger("mineru")
    logger.setLevel(getattr(logging, log_level))

    # Report the effective log level
    logger.info(f"Log level set to: {log_level}")

    return logger


# Create the default logger
logger = setup_logging()


# Create the output directory if it does not exist
def ensure_output_dir(output_dir=None):
    """
    Make sure the output directory exists.

    Args:
        output_dir: Optional path to the output directory.
            If None, DEFAULT_OUTPUT_DIR is used.

    Returns:
        A Path object for the output directory.
    """
    output_path = Path(output_dir or DEFAULT_OUTPUT_DIR)
    output_path.mkdir(parents=True, exist_ok=True)
    return output_path


# Validate the API configuration
def validate_api_config():
    """
    Report whether the required API configuration is set.

    Returns:
        dict: Configuration status.
    """
    return {
        "api_base": MINERU_API_BASE,
        "api_key_set": bool(MINERU_API_KEY),
        "output_dir": DEFAULT_OUTPUT_DIR,
    }
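
The log-level rules in `setup_logging` (debug flag wins, unknown names fall back to INFO) and the truthy-string test used for `USE_LOCAL_API` are easy to exercise if they take the environment as a plain mapping. The sketch below assumes two hypothetical helpers, `env_flag` and `resolve_log_level`; they restate the module's logic and are not part of it.

```python
import logging
from typing import Mapping

TRUTHY = {"true", "1", "yes"}
VALID_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def env_flag(env: Mapping[str, str], name: str) -> bool:
    """Treat "true", "1" and "yes" (any letter case) as enabled; all else is off."""
    return env.get(name, "").lower() in TRUTHY


def resolve_log_level(env: Mapping[str, str]) -> int:
    """Apply the same precedence as setup_logging and return a logging constant."""
    level = env.get("MINERU_LOG_LEVEL", "INFO").upper()
    if env_flag(env, "MINERU_DEBUG"):
        level = "DEBUG"  # the debug flag overrides MINERU_LOG_LEVEL
    if level not in VALID_LEVELS:
        level = "INFO"  # unknown names fall back to INFO
    return getattr(logging, level)
```

In real use the mapping would be `os.environ`; passing a dict keeps the function independent of process state.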