Commit 66e616bd authored by Xiaomeng Zhao, committed by GitHub

Merge pull request #2895 from opendatalab/release-2.1.0

Release 2.1.0
parents 592b659e a4c9a07b
......@@ -18,7 +18,8 @@
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/myhloli/3b3a00a4a0a61577b6c30f989092d20d/mineru_demo.ipynb)
[![Paper](https://img.shields.io/badge/Paper-arXiv-green)](https://arxiv.org/abs/2409.18839)
[![arXiv](https://img.shields.io/badge/arXiv-2409.18839-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2409.18839)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/opendatalab/MinerU)
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
......@@ -33,9 +34,7 @@
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: High-Quality PDF Extraction Toolkit</a>🔥🔥🔥
<br>
<br>
<a href="https://mineru.net/client?source=github">
Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple interface and smooth interactions. Enjoy it without any fuss!</a>🚀🚀🚀
🚀<a href="https://mineru.net/?source=github">Access MinerU Now→✅ Zero-Install Web Version ✅ Full-Featured Desktop Client ✅ Instant API Access; Skip deployment headaches – get all product formats in one click. Developers, dive in!</a>
</p>
<!-- join us -->
......@@ -47,38 +46,80 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
</div>
# Changelog
- 2025/06/20 2.0.6 Released
- Fixed occasional parsing interruptions caused by invalid block content in `vlm` mode
- Fixed parsing interruptions caused by incomplete table structures in `vlm` mode
- 2025/06/17 2.0.5 Released
- Fixed the issue where models were still required to be downloaded in the `sglang-client` mode
- Fixed the issue where the `sglang-client` mode unnecessarily depended on packages like `torch` during runtime.
- Fixed the issue where only the first instance would take effect when attempting to launch multiple `sglang-client` instances via multiple URLs within the same process
- 2025/06/15 2.0.3 Released
- Fixed a configuration file key-value update error that occurred when the model download type was set to `all`
- Fixed the issue where the formula and table feature toggle switches were not working in `command line mode`, causing the features to remain enabled.
- Fixed compatibility issues with sglang version 0.4.7 in the `sglang-engine` mode.
- Updated Dockerfile and installation documentation for deploying the full version of MinerU in sglang environment
- 2025/06/13 2.0.0 Released
- MinerU 2.0 represents a comprehensive reconstruction and upgrade from architecture to functionality, delivering a more streamlined design, enhanced performance, and more flexible user experience.
- **New Architecture**: MinerU 2.0 has been deeply restructured in code organization and interaction methods, significantly improving system usability, maintainability, and extensibility.
- **Removal of Third-party Dependency Limitations**: Completely eliminated the dependency on `pymupdf`, moving the project toward a more open and compliant open-source direction.
- **Ready-to-use, Easy Configuration**: No need to manually edit JSON configuration files; most parameters can now be set directly via command line or API.
- **Automatic Model Management**: Added automatic model download and update mechanisms, allowing users to complete model deployment without manual intervention.
- **Offline Deployment Friendly**: Provides built-in model download commands, supporting deployment requirements in completely offline environments.
- **Streamlined Code Structure**: Removed thousands of lines of redundant code, simplified class inheritance logic, significantly improving code readability and development efficiency.
- **Unified Intermediate Format Output**: Adopted standardized `middle_json` format, compatible with most secondary development scenarios based on this format, ensuring seamless ecosystem business migration.
- **New Model**: MinerU 2.0 integrates our latest small-parameter, high-performance multimodal document parsing model, achieving end-to-end high-speed, high-precision document understanding.
- **Small Model, Big Capabilities**: With parameters under 1B, yet surpassing traditional 72B-level vision-language models (VLMs) in parsing accuracy.
- **Multiple Functions in One**: A single model covers multilingual recognition, handwriting recognition, layout analysis, table parsing, formula recognition, reading order sorting, and other core tasks.
- **Ultimate Inference Speed**: Achieves peak throughput exceeding 10,000 tokens/s through `sglang` acceleration on a single NVIDIA 4090 card, easily handling large-scale document processing requirements.
- **Online Experience**: You can experience our brand-new VLM model on [MinerU.net](https://mineru.net/OpenSourceTools/Extractor), [Hugging Face](https://huggingface.co/spaces/opendatalab/MinerU), and [ModelScope](https://www.modelscope.cn/studios/OpenDataLab/MinerU).
- **Incompatible Changes Notice**: To improve overall architectural rationality and long-term maintainability, this version contains some incompatible changes:
- Python package name changed from `magic-pdf` to `mineru`, and the command-line tool changed from `magic-pdf` to `mineru`. Please update your scripts and command calls accordingly.
- For modular system design and ecosystem consistency considerations, MinerU 2.0 no longer includes the LibreOffice document conversion module. If you need to process Office documents, we recommend converting them to PDF format through an independently deployed LibreOffice service before proceeding with subsequent parsing operations.
- 2025/07/05 Version 2.1.0 Released
- This is the first major update of MinerU 2. It includes a large number of new features and improvements, covering significant performance optimizations, user experience enhancements, and bug fixes. The details are as follows:
- **Performance Optimizations:**
- Significantly improved preprocessing speed for documents with specific resolutions (around 2000 pixels on the long side).
- Greatly enhanced post-processing speed when the `pipeline` backend handles batch processing of documents with fewer pages (<10 pages).
- Layout analysis speed of the `pipeline` backend has been increased by approximately 20%.
- **Experience Enhancements:**
- Built-in ready-to-use `fastapi service` and `gradio webui`. For detailed usage instructions, please refer to [Documentation](#3-api-calls-or-visual-invocation).
- Adapted to `sglang` version `0.4.8`, significantly reducing the GPU memory requirements for the `vlm-sglang` backend. It can now run on graphics cards with as little as `8GB GPU memory` (Turing architecture or newer).
- Added transparent parameter passing for all `sglang`-related commands, so the `sglang-engine` backend accepts the same `sglang` parameters as `sglang-server`.
- Supports feature extensions based on configuration files, including `custom formula delimiters`, `enabling heading classification`, and `customizing local model directories`. For detailed usage instructions, please refer to [Documentation](#4-extending-mineru-functionality-through-configuration-files).
- **New Features:**
- Updated the `pipeline` backend with the PP-OCRv5 multilingual text recognition model, supporting text recognition in 37 languages such as French, Spanish, Portuguese, Russian, and Korean, with an average accuracy improvement of over 30%. [Details](https://paddlepaddle.github.io/PaddleOCR/latest/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html)
- Introduced limited support for vertical text layout in the `pipeline` backend.
<details>
<summary>History Log</summary>
<details>
<summary>2025/06/20 2.0.6 Released</summary>
<ul>
<li>Fixed occasional parsing interruptions caused by invalid block content in <code>vlm</code> mode</li>
<li>Fixed parsing interruptions caused by incomplete table structures in <code>vlm</code> mode</li>
</ul>
</details>
<details>
<summary>2025/06/17 2.0.5 Released</summary>
<ul>
<li>Fixed the issue where models were still required to be downloaded in the <code>sglang-client</code> mode</li>
<li>Fixed the issue where the <code>sglang-client</code> mode unnecessarily depended on packages like <code>torch</code> during runtime.</li>
<li>Fixed the issue where only the first instance would take effect when attempting to launch multiple <code>sglang-client</code> instances via multiple URLs within the same process</li>
</ul>
</details>
<details>
<summary>2025/06/15 2.0.3 Released</summary>
<ul>
<li>Fixed a configuration file key-value update error that occurred when the model download type was set to <code>all</code></li>
<li>Fixed the issue where the formula and table feature toggle switches were not working in <code>command line mode</code>, causing the features to remain enabled.</li>
<li>Fixed compatibility issues with sglang version 0.4.7 in the <code>sglang-engine</code> mode.</li>
<li>Updated Dockerfile and installation documentation for deploying the full version of MinerU in sglang environment</li>
</ul>
</details>
<details>
<summary>2025/06/13 2.0.0 Released</summary>
<ul>
<li><strong>New Architecture</strong>: MinerU 2.0 has been deeply restructured in code organization and interaction methods, significantly improving system usability, maintainability, and extensibility.
<ul>
<li><strong>Removal of Third-party Dependency Limitations</strong>: Completely eliminated the dependency on <code>pymupdf</code>, moving the project toward a more open and compliant open-source direction.</li>
<li><strong>Ready-to-use, Easy Configuration</strong>: No need to manually edit JSON configuration files; most parameters can now be set directly via command line or API.</li>
<li><strong>Automatic Model Management</strong>: Added automatic model download and update mechanisms, allowing users to complete model deployment without manual intervention.</li>
<li><strong>Offline Deployment Friendly</strong>: Provides built-in model download commands, supporting deployment requirements in completely offline environments.</li>
<li><strong>Streamlined Code Structure</strong>: Removed thousands of lines of redundant code, simplified class inheritance logic, significantly improving code readability and development efficiency.</li>
<li><strong>Unified Intermediate Format Output</strong>: Adopted standardized <code>middle_json</code> format, compatible with most secondary development scenarios based on this format, ensuring seamless ecosystem business migration.</li>
</ul>
</li>
<li><strong>New Model</strong>: MinerU 2.0 integrates our latest small-parameter, high-performance multimodal document parsing model, achieving end-to-end high-speed, high-precision document understanding.
<ul>
<li><strong>Small Model, Big Capabilities</strong>: With parameters under 1B, yet surpassing traditional 72B-level vision-language models (VLMs) in parsing accuracy.</li>
<li><strong>Multiple Functions in One</strong>: A single model covers multilingual recognition, handwriting recognition, layout analysis, table parsing, formula recognition, reading order sorting, and other core tasks.</li>
<li><strong>Ultimate Inference Speed</strong>: Achieves peak throughput exceeding 10,000 tokens/s through <code>sglang</code> acceleration on a single NVIDIA 4090 card, easily handling large-scale document processing requirements.</li>
<li><strong>Online Experience</strong>: You can experience our brand-new VLM model on <a href="https://mineru.net/OpenSourceTools/Extractor">MinerU.net</a>, <a href="https://huggingface.co/spaces/opendatalab/MinerU">Hugging Face</a>, and <a href="https://www.modelscope.cn/studios/OpenDataLab/MinerU">ModelScope</a>.</li>
</ul>
</li>
<li><strong>Incompatible Changes Notice</strong>: To improve overall architectural rationality and long-term maintainability, this version contains some incompatible changes:
<ul>
<li>Python package name changed from <code>magic-pdf</code> to <code>mineru</code>, and the command-line tool changed from <code>magic-pdf</code> to <code>mineru</code>. Please update your scripts and command calls accordingly.</li>
<li>For modular system design and ecosystem consistency considerations, MinerU 2.0 no longer includes the LibreOffice document conversion module. If you need to process Office documents, we recommend converting them to PDF format through an independently deployed LibreOffice service before proceeding with subsequent parsing operations.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/05/24 Release 1.3.12</summary>
<ul>
......@@ -383,8 +424,6 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
<li><a href="#acknowledgments">Acknowledgments</a></li>
<li><a href="#citation">Citation</a></li>
<li><a href="#star-history">Star History</a></li>
<li><a href="#magic-doc">Magic-doc</a></li>
<li><a href="#magic-html">Magic-html</a></li>
<li><a href="#links">Links</a></li>
</ol>
</details>
......@@ -433,7 +472,7 @@ There are three different ways to experience MinerU:
>
> In non-mainline environments, due to the diversity of hardware and software configurations, as well as third-party dependency compatibility issues, we cannot guarantee 100% project availability. Therefore, for users who wish to use this project in non-recommended environments, we suggest carefully reading the documentation and FAQ first. Most issues already have corresponding solutions in the FAQ. We also encourage community feedback to help us gradually expand support.
<table border="1">
<table>
<tr>
<td>Parsing Backend</td>
<td>pipeline</td>
......@@ -446,6 +485,16 @@ There are three different ways to experience MinerU:
<td>windows/linux</td>
<td>windows(wsl2)/linux</td>
</tr>
<tr>
<td>CPU Inference Support</td>
<td></td>
<td colspan="2"></td>
</tr>
<tr>
<td>GPU Requirements</td>
<td>Turing architecture or later, 6GB+ VRAM or Apple Silicon</td>
<td colspan="2">Turing architecture or later, 8GB+ VRAM</td>
</tr>
<tr>
<td>Memory Requirements</td>
<td colspan="3">Minimum 16GB+, 32GB+ recommended</td>
......@@ -458,18 +507,6 @@ There are three different ways to experience MinerU:
<td>Python Version</td>
<td colspan="3">3.10-3.13</td>
</tr>
<tr>
<td>CPU Inference Support</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>GPU Requirements</td>
<td>Turing architecture or later, 6GB+ VRAM or Apple Silicon</td>
<td>Ampere architecture or later, 8GB+ VRAM</td>
<td>Ampere architecture or later, 24GB+ VRAM</td>
</tr>
</table>
## Online Demo
......@@ -502,7 +539,7 @@ uv pip install -e .[core]
> Linux and macOS systems automatically support CUDA/MPS acceleration after installation. For Windows users who want to use CUDA acceleration,
> please visit the [PyTorch official website](https://pytorch.org/get-started/locally/) to install PyTorch with the appropriate CUDA version.
#### 1.3 Install Full Version (supports sglang acceleration) (requires device with Ampere or newer architecture and at least 24GB GPU memory)
#### 1.3 Install Full Version (supports sglang acceleration) (requires device with Turing or newer architecture and at least 8GB GPU memory)
If you need to use **sglang to accelerate VLM model inference**, you can choose any of the following methods to install the full version:
......@@ -514,6 +551,10 @@ If you need to use **sglang to accelerate VLM model inference**, you can choose
```bash
uv pip install -e .[all]
```
> [!TIP]
> If any exceptions occur during the installation of `sglang`, please refer to the [official sglang documentation](https://docs.sglang.ai/start/install.html) for troubleshooting and solutions, or directly use Docker-based installation.
- Build image using Dockerfile:
```bash
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/global/Dockerfile
......@@ -535,8 +576,8 @@ If you need to use **sglang to accelerate VLM model inference**, you can choose
```
> [!TIP]
> The Dockerfile uses `lmsysorg/sglang:v0.4.7-cu124` as the default base image. If necessary, you can modify it to another platform version.
> The Dockerfile uses `lmsysorg/sglang:v0.4.8.post1-cu126` as the default base image, which supports the Turing/Ampere/Ada Lovelace/Hopper platforms.
> If you are using the newer Blackwell platform, please change the base image to `lmsysorg/sglang:v0.4.8.post1-cu128-b200`.
#### 1.4 Install client (for connecting to sglang-server on edge devices that require only CPU and network connectivity)
......@@ -559,7 +600,7 @@ The simplest command line invocation is:
mineru -p <input_path> -o <output_path>
```
- `<input_path>`: Local PDF file or directory (supports pdf/png/jpg/jpeg)
- `<input_path>`: Local PDF/Image file or directory (supports pdf/png/jpg/jpeg/webp/gif)
- `<output_path>`: Output directory
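When MinerU is driven from a script rather than typed by hand, it can help to assemble the command as an argument list. The following is a minimal sketch, not part of MinerU itself: `build_mineru_command` is a hypothetical helper, and it only uses the `-p`, `-o`, `-b`, `-l`, `-s`, and `-e` options shown in the CLI help text of this README.

```python
import shlex
from typing import List, Optional


def build_mineru_command(
    input_path: str,
    output_path: str,
    backend: str = "pipeline",
    lang: Optional[str] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for the `mineru` CLI (hypothetical helper)."""
    cmd = ["mineru", "-p", input_path, "-o", output_path, "-b", backend]
    if lang is not None:
        # Per the help text, --lang only applies to the pipeline backend.
        cmd += ["-l", lang]
    if start_page is not None:
        # Page numbers are 0-based according to the help text.
        cmd += ["-s", str(start_page)]
    if end_page is not None:
        cmd += ["-e", str(end_page)]
    return cmd


def format_for_shell(cmd: List[str]) -> str:
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(cmd)
```

To actually run the command, the list can be passed to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell quoting problems with paths that contain spaces.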
##### View Help Information
......@@ -582,14 +623,15 @@ Options:
-m, --method [auto|txt|ocr] Parsing method: auto (default), txt, ocr (pipeline backend only)
-b, --backend [pipeline|vlm-transformers|vlm-sglang-engine|vlm-sglang-client]
Parsing backend (default: pipeline)
-l, --lang [ch|ch_server|... ] Specify document language (improves OCR accuracy, pipeline backend only)
-l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|latin|arabic|east_slavic|cyrillic|devanagari]
Specify document language (improves OCR accuracy, pipeline backend only)
-u, --url TEXT Service address when using sglang-client
-s, --start INTEGER Starting page number (0-based)
-e, --end INTEGER Ending page number (0-based)
-f, --formula BOOLEAN Enable formula parsing (default: on, pipeline backend only)
-t, --table BOOLEAN Enable table parsing (default: on, pipeline backend only)
-f, --formula BOOLEAN Enable formula parsing (default: on)
-t, --table BOOLEAN Enable table parsing (default: on)
-d, --device TEXT Inference device (e.g., cpu/cuda/cuda:0/npu/mps, pipeline backend only)
--vram INTEGER Maximum GPU VRAM usage per process (pipeline backend only)
--vram INTEGER Maximum GPU VRAM usage per process (GB)(pipeline backend only)
--source [huggingface|modelscope|local]
Model source, default: huggingface
--help Show help information
......@@ -661,15 +703,6 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-engine
mineru-sglang-server --port 30000
```
> [!TIP]
> sglang-server has some commonly used parameters for configuration:
> - If you have two GPUs with `12GB` or `16GB` VRAM, you can use the Tensor Parallel (TP) mode: `--tp 2`
> - If you have two GPUs with `11GB` VRAM, in addition to Tensor Parallel mode, you need to reduce the KV cache size: `--tp 2 --mem-fraction-static 0.7`
> - If you have more than two GPUs with `24GB` VRAM or above, you can use sglang's multi-GPU parallel mode to increase throughput: `--dp 2`
> - You can also enable `torch.compile` to accelerate inference speed by approximately 15%: `--enable-torch-compile`
> - If you want to learn more about the usage of `sglang` parameters, please refer to the [official sglang documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)
2. Use Client in another terminal:
```bash
......@@ -681,26 +714,73 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1
---
### 3. API Usage
You can also call MinerU through Python code, see example code at:
👉 [Python Usage Example](demo/demo.py)
### 3. API Calls or Visual Invocation
1. Directly invoke using Python API: [Python Invocation Example](demo/demo.py)
2. Invoke using FastAPI:
```bash
mineru-api --host 127.0.0.1 --port 8000
```
Visit http://127.0.0.1:8000/docs in your browser to view the API documentation.
3. Use Gradio WebUI or Gradio API:
```bash
# Using pipeline/vlm-transformers/vlm-sglang-client backend
mineru-gradio --server-name 127.0.0.1 --server-port 7860
# Or using vlm-sglang-engine/pipeline backend
mineru-gradio --server-name 127.0.0.1 --server-port 7860 --enable-sglang-engine
```
Access http://127.0.0.1:7860 in your browser to use the Gradio WebUI, or visit http://127.0.0.1:7860/?view=api to use the Gradio API.
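As a rough illustration of exploring the FastAPI service programmatically, the sketch below lists the routes a running `mineru-api` instance exposes. It assumes the service keeps FastAPI's default `/openapi.json` schema route enabled, which this README does not state; the helper names are hypothetical and no specific MinerU endpoint is assumed.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen

# Keys under an OpenAPI path item that denote HTTP operations.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def openapi_url(host: str = "127.0.0.1", port: int = 8000) -> str:
    """Schema URL, assuming FastAPI's default `/openapi.json` route is enabled."""
    return f"http://{host}:{port}/openapi.json"


def fetch_schema(url: str, timeout: float = 10.0) -> Dict[str, Any]:
    """Download and decode the OpenAPI schema (requires a running service)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def list_endpoints(schema: Dict[str, Any]) -> List[str]:
    """Return sorted 'METHOD /path' strings found in an OpenAPI schema dict."""
    endpoints = []
    for path, operations in sorted(schema.get("paths", {}).items()):
        for method in sorted(operations):
            if method.lower() in HTTP_METHODS:
                endpoints.append(f"{method.upper()} {path}")
    return endpoints
```

With the service started as shown above, `list_endpoints(fetch_schema(openapi_url()))` should print the same routes that the `/docs` page displays.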
> [!TIP]
> Below are some suggestions and notes for using the sglang acceleration mode:
> - The sglang acceleration mode currently supports operation on Turing architecture GPUs with a minimum of 8GB VRAM, but you may encounter VRAM shortages on GPUs with less than 24GB VRAM. You can optimize VRAM usage with the following parameters:
> - If running on a single GPU and encountering VRAM shortage, reduce the KV cache size by setting `--mem-fraction-static 0.5`. If VRAM issues persist, try lowering it further to `0.4` or below.
> - If you have more than one GPU, you can expand available VRAM using tensor parallelism (TP) mode: `--tp 2`
> - If you are already successfully using sglang to accelerate VLM inference but wish to further improve inference speed, consider the following parameters:
> - If using multiple GPUs, increase throughput using sglang's multi-GPU parallel mode: `--dp 2`
> - You can also enable `torch.compile` to accelerate inference speed by about 15%: `--enable-torch-compile`
> - For more information on using sglang parameters, please refer to the [sglang official documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)
> - All sglang-supported parameters can be passed to MinerU via command-line arguments, including those used with the following commands: `mineru`, `mineru-sglang-server`, `mineru-gradio`, `mineru-api`
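The tuning advice above can be read as a small decision procedure. The sketch below is one possible interpretation, not an official recommendation: `suggest_sglang_args` is a hypothetical helper, the ordering of the checks is an assumption, and the flag values (`--tp 2`, `--dp 2`, `--mem-fraction-static 0.5`) are simply the examples quoted in the tip.

```python
from typing import List


def suggest_sglang_args(
    gpu_count: int,
    vram_shortage: bool = False,
    maximize_throughput: bool = False,
    torch_compile: bool = False,
    mem_fraction: float = 0.5,
) -> List[str]:
    """Map the tip's scenarios to extra sglang arguments (hypothetical helper)."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    args: List[str] = []
    if vram_shortage:
        if gpu_count > 1:
            # Several GPUs: pool VRAM with tensor parallelism.
            args += ["--tp", "2"]
        else:
            # Single GPU: shrink the KV cache; lower the fraction further if needed.
            args += ["--mem-fraction-static", str(mem_fraction)]
    elif maximize_throughput and gpu_count > 1:
        # Enough VRAM already: use data parallelism for higher throughput.
        args += ["--dp", "2"]
    if torch_compile:
        args.append("--enable-torch-compile")
    return args
```

The returned list can be appended to any of the commands named in the tip, for example `["mineru-sglang-server", "--port", "30000"] + suggest_sglang_args(2, maximize_throughput=True)`.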
> [!TIP]
> - In all cases, you can choose which GPUs are visible by setting the `CUDA_VISIBLE_DEVICES` environment variable at the start of the command line. For example:
> ```bash
> CUDA_VISIBLE_DEVICES=1 mineru -p <input_path> -o <output_path>
> ```
> - This method works for all command-line calls, including `mineru`, `mineru-sglang-server`, `mineru-gradio`, and `mineru-api`, and applies to both `pipeline` and `vlm` backends.
> - Below are some common `CUDA_VISIBLE_DEVICES` settings:
> ```bash
> CUDA_VISIBLE_DEVICES=1       # Only device 1 is visible
> CUDA_VISIBLE_DEVICES=0,1     # Devices 0 and 1 are visible
> CUDA_VISIBLE_DEVICES="0,1"   # Same as above; the quotation marks are optional
> CUDA_VISIBLE_DEVICES=0,2,3   # Devices 0, 2, and 3 are visible; device 1 is masked
> CUDA_VISIBLE_DEVICES=""      # No GPU is visible
> ```
> - Below are some possible use cases:
> - If you have multiple GPUs and need to specify GPU 0 and GPU 1 to launch `sglang-server` in multi-GPU mode, you can use the following command:
> ```bash
> CUDA_VISIBLE_DEVICES=0,1 mineru-sglang-server --port 30000 --dp 2
> ```
> - If you have multiple GPUs and need to launch two `fastapi` services on GPU 0 and GPU 1 respectively, listening on different ports, you can use the following commands:
> ```bash
> # In terminal 1
> CUDA_VISIBLE_DEVICES=0 mineru-api --host 127.0.0.1 --port 8000
> # In terminal 2
> CUDA_VISIBLE_DEVICES=1 mineru-api --host 127.0.0.1 --port 8001
> ```
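The two-terminal setup above can also be scripted. The following is a minimal sketch under the assumption that one `mineru-api` process per GPU on consecutive ports is what you want; `per_gpu_api_launches` is a hypothetical helper, and only the `--host` and `--port` options shown in this README are used.

```python
import os
from typing import Dict, Iterable, List, Tuple


def per_gpu_api_launches(
    gpu_ids: Iterable[int],
    host: str = "127.0.0.1",
    base_port: int = 8000,
) -> List[Tuple[Dict[str, str], List[str]]]:
    """Return (environment, command) pairs, one `mineru-api` service per GPU."""
    launches = []
    for offset, gpu_id in enumerate(gpu_ids):
        # Copy the current environment and expose exactly one GPU to the child.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
        cmd = ["mineru-api", "--host", host, "--port", str(base_port + offset)]
        launches.append((env, cmd))
    return launches
```

Each pair can then be started with `subprocess.Popen(cmd, env=env)`. Setting the variable per child process keeps the parent's own environment unchanged.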
---
### 4. Deploy Derivative Projects
Community developers have created various extensions based on MinerU, including:
### 4. Extending MinerU Functionality Through Configuration Files
- Graphical interface based on Gradio
- Web API based on FastAPI
- Client/server architecture with multi-GPU load balancing
- MCP Server based on the official API
These projects typically offer better user experience and additional features.
For detailed deployment instructions, please refer to:
👉 [Derivative Projects Documentation](projects/README.md)
- MinerU is designed to work out-of-the-box, but also supports extending functionality through configuration files. You can create a `mineru.json` file in your home directory and add custom configurations.
- The `mineru.json` file will be automatically generated when you use the built-in model download command `mineru-models-download`. Alternatively, you can create it by copying the [configuration template file](./mineru.template.json) to your home directory and renaming it to `mineru.json`.
- Below are some available configuration options:
- `latex-delimiter-config`: Used to configure LaTeX formula delimiters, defaults to the `$` symbol, and can be modified to other symbols or strings as needed.
- `llm-aided-config`: Used to configure related parameters for LLM-assisted heading level detection, compatible with all LLM models supporting the `OpenAI protocol`. It defaults to Alibaba Cloud Qwen's `qwen2.5-32b-instruct` model. You need to configure an API key yourself and set `enable` to `true` to activate this feature.
- `models-dir`: Used to specify local model storage directories. Please specify separate model directories for the `pipeline` and `vlm` backends. After specifying these directories, you can use local models by setting the environment variable `export MINERU_MODEL_SOURCE=local`.
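As an illustration of editing the configuration file from code, the sketch below updates the `models-dir` entry of `mineru.json` in the home directory. The top-level key name comes from the list above; the `pipeline` and `vlm` sub-keys and the helper names are assumptions made for this example, so check them against the configuration template file before relying on them.

```python
import copy
import json
from pathlib import Path
from typing import Any, Dict

CONFIG_PATH = Path.home() / "mineru.json"


def load_config(path: Path = CONFIG_PATH) -> Dict[str, Any]:
    """Read the config file, or return an empty dict if it does not exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def with_local_models(config: Dict[str, Any], pipeline_dir: str, vlm_dir: str) -> Dict[str, Any]:
    """Return a copy of the config with per-backend model directories set."""
    updated = copy.deepcopy(config)
    # Assumed layout: one directory per backend under "models-dir".
    updated["models-dir"] = {"pipeline": pipeline_dir, "vlm": vlm_dir}
    return updated


def save_config(config: Dict[str, Any], path: Path = CONFIG_PATH) -> None:
    """Write the config back as indented JSON."""
    path.write_text(json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8")
```

A typical use would be `save_config(with_local_models(load_config(), "/models/pipeline", "/models/vlm"))`, followed by `export MINERU_MODEL_SOURCE=local` in the shell that runs MinerU. The directory paths here are placeholders.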
---
......@@ -717,7 +797,7 @@ For detailed deployment instructions, please refer to:
# Known Issues
- Reading order is determined by the model based on the spatial distribution of readable content, and may be out of order in some areas under extremely complex layouts.
- Vertical text is not supported.
- Limited support for vertical text.
- Tables of contents and lists are recognized through rules, and some uncommon list formats may not be recognized.
- Code blocks are not yet supported in the layout model.
- Comic books, art albums, primary school textbooks, and exercises cannot be parsed well.
......@@ -727,9 +807,9 @@ For detailed deployment instructions, please refer to:
# FAQ
[FAQ in Chinese](docs/FAQ_zh_cn.md)
[FAQ in English](docs/FAQ_en_us.md)
- If you encounter any issues during usage, you can first check the [FAQ](docs/FAQ_en_us.md) for solutions.
- If your issue remains unresolved, you may also use [DeepWiki](https://deepwiki.com/opendatalab/MinerU) to interact with an AI assistant, which can address most common problems.
- If you still cannot resolve the issue, you are welcome to join our community via [Discord](https://discord.gg/Tdedn9GTXq) or [WeChat](http://mineru.space/s/V85Yl) to discuss with other users and developers.
# All Thanks To Our Contributors
......@@ -790,16 +870,13 @@ Currently, some models in this project are trained based on YOLO. However, since
</picture>
</a>
# Magic-doc
[Magic-Doc](https://github.com/InternLM/magic-doc) Fast speed ppt/pptx/doc/docx/pdf extraction tool
# Magic-html
[Magic-HTML](https://github.com/opendatalab/magic-html) Mixed web page extraction tool
# Links
- [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
- [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
- [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)
- [Vis3 (OSS browser based on s3)](https://github.com/opendatalab/Vis3)
- [OmniDocBench (A Comprehensive Benchmark for Document Parsing and Evaluation)](https://github.com/opendatalab/OmniDocBench)
- [Magic-HTML (Mixed web page extraction tool)](https://github.com/opendatalab/magic-html)
- [Magic-Doc (Fast speed ppt/pptx/doc/docx/pdf extraction tool)](https://github.com/InternLM/magic-doc)
\ No newline at end of file
......@@ -18,7 +18,8 @@
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/myhloli/3b3a00a4a0a61577b6c30f989092d20d/mineru_demo.ipynb)
[![Paper](https://img.shields.io/badge/Paper-arXiv-green)](https://arxiv.org/abs/2409.18839)
[![arXiv](https://img.shields.io/badge/arXiv-2409.18839-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2409.18839)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/opendatalab/MinerU)
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
......@@ -33,8 +34,7 @@
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: High-Quality PDF Extraction Toolkit</a>🔥🔥🔥
<br>
<br>
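<a href="https://mineru.net/client?source=github">An easier way to use it: MinerU Desktop. No coding, no login, a graphical interface with simple interactions, and hassle-free use.</a>🚀🚀🚀
🚀<a href="https://mineru.net/?source=github">MinerU official site→✅ Install-free web version ✅ Full-featured desktop client ✅ Online API access for developers. Skip the deployment hassle and get every product format in one place!</a>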
</p>
<!-- join us -->
......@@ -46,39 +46,79 @@
</div>
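# Changelog
- 2025/06/20 2.0.6 Released
- Fixed occasional parsing interruptions caused by invalid block content in `vlm` mode
- Fixed parsing interruptions caused by incomplete table structures in `vlm` mode
- 2025/06/17 2.0.5 Released
- Fixed the issue where models were still required to be downloaded in `sglang-client` mode
- Fixed the issue where `sglang-client` mode depended on packages such as `torch` that are not needed at runtime
- Fixed the issue where only the first instance took effect when launching multiple `sglang-client` instances via multiple URLs within the same process
- 2025/06/15 2.0.3 Released
- Fixed a configuration file key-value update error that occurred when the model download type was set to `all`
- Fixed the issue where the formula and table feature switches did not take effect in command line mode, so the features could not be disabled
- Fixed compatibility issues with sglang version 0.4.7 in `sglang-engine` mode
- Updated the Dockerfile and installation documentation for deploying the full version of MinerU in an sglang environment
- 2025/06/13 2.0.0 Released
- MinerU 2.0 is a comprehensive reconstruction and upgrade from architecture to functionality, delivering a more streamlined design, stronger performance, and a more flexible user experience.
- **New Architecture**: MinerU 2.0 has been deeply restructured in code organization and interaction methods, significantly improving system usability, maintainability, and extensibility.
- **Removal of Third-party Dependency Limitations**: Completely eliminated the dependency on `pymupdf`, moving the project toward a more open and compliant open-source direction.
- **Ready-to-use, Easy Configuration**: No need to manually edit JSON configuration files; most parameters can now be set directly via command line or API.
- **Automatic Model Management**: Added automatic model download and update mechanisms, allowing users to complete model deployment without manual intervention.
- **Offline Deployment Friendly**: Provides built-in model download commands, supporting deployment in completely offline environments.
- **Streamlined Code Structure**: Removed thousands of lines of redundant code and simplified class inheritance logic, significantly improving code readability and development efficiency.
- **Unified Intermediate Format Output**: Adopted the standardized `middle_json` format, compatible with most secondary development scenarios based on this format, ensuring seamless migration for ecosystem projects.
- **New Model**: MinerU 2.0 integrates our latest small-parameter, high-performance multimodal document parsing model, achieving end-to-end high-speed, high-precision document understanding.
- **Small Model, Big Capabilities**: With fewer than 1B parameters, it surpasses traditional 72B-level vision-language models (VLMs) in parsing accuracy.
- **Multiple Functions in One**: A single model covers multilingual recognition, handwriting recognition, layout analysis, table parsing, formula recognition, reading order sorting, and other core tasks.
- **Ultimate Inference Speed**: Achieves peak throughput exceeding 10,000 tokens/s through `sglang` acceleration on a single NVIDIA 4090 card, easily handling large-scale document processing.
- **Online Experience**: You can try our brand-new VLM model on [MinerU.net](https://mineru.net/OpenSourceTools/Extractor), [Hugging Face](https://huggingface.co/spaces/opendatalab/MinerU), and [ModelScope](https://www.modelscope.cn/studios/OpenDataLab/MinerU).
- **Incompatible Changes Notice**: To improve overall architectural rationality and long-term maintainability, this version contains some incompatible changes:
- The Python package name changed from `magic-pdf` to `mineru`, and the command-line tool changed from `magic-pdf` to `mineru`. Please update your scripts and command calls accordingly.
- For reasons of modular system design and ecosystem consistency, MinerU 2.0 no longer includes the LibreOffice document conversion module. To process Office documents, we recommend converting them to PDF through an independently deployed LibreOffice service before parsing.
- 2025/07/05 2.1.0 Released
- This is the first major update of MinerU 2. It includes a large number of new features and improvements, covering many performance optimizations, experience enhancements, and bug fixes. The details are as follows:
- Performance optimizations:
- Significantly improved preprocessing speed for documents with certain resolutions (around 2000 pixels on the long side)
- Significantly improved post-processing speed when the `pipeline` backend batch-processes large numbers of documents with few pages (<10)
- Layout analysis speed of the `pipeline` backend increased by approximately 20%
- Experience enhancements:
- Built-in, ready-to-use `fastapi service` and `gradio webui`. For detailed usage, see the [documentation](#3-api-调用-或-可视化调用)
- Adapted `sglang` to version `0.4.8`, significantly reducing the GPU memory requirements of the `vlm-sglang` backend; it can now run on graphics cards with as little as `8GB GPU memory` (Turing architecture or newer)
- Added transparent passing of `sglang` parameters for all commands, so the `sglang-engine` backend accepts all `sglang` parameters in the same way as `sglang-server`
- Supports feature extensions based on configuration files, including `custom formula delimiters`, `enabling heading classification`, and `customizing local model directories`. For detailed usage, see the [documentation](#4-基于配置文件扩展-mineru-功能)
- New features:
- Updated the `pipeline` backend with the PP-OCRv5 multilingual text recognition model, supporting text recognition in 37 languages such as French, Spanish, Portuguese, Russian, and Korean, with an average accuracy improvement of over 30%. [Details](https://paddlepaddle.github.io/PaddleOCR/latest/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html)
- Added limited support for vertical text in the `pipeline` backend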
<details>
<summary>History Log</summary>
<details>
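<summary>2025/06/20 2.0.6 Released</summary>
<ul>
<li>Fixed occasional parsing interruptions caused by invalid block content in <code>vlm</code> mode</li>
<li>Fixed parsing interruptions caused by incomplete table structures in <code>vlm</code> mode</li>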
</ul>
</details>
<details>
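<summary>2025/06/17 2.0.5 Released</summary>
<ul>
<li>Fixed the issue where models were still required to be downloaded in <code>sglang-client</code> mode</li>
<li>Fixed the issue where <code>sglang-client</code> mode depended on packages such as <code>torch</code> that are not needed at runtime</li>
<li>Fixed the issue where only the first instance took effect when launching multiple <code>sglang-client</code> instances via multiple URLs within the same process</li>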
</ul>
</details>
<details>
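<summary>2025/06/15 2.0.3 Released</summary>
<ul>
<li>Fixed a key-value update error in the configuration file when the model download type was set to <code>all</code></li>
<li>Fixed the issue where the formula and table switches had no effect in command-line mode, so those features could not be disabled</li>
<li>Fixed compatibility issues with sglang 0.4.7 in <code>sglang-engine</code> mode</li>
<li>Updated the Dockerfile and installation documentation for deploying the full version of MinerU in an sglang environment</li>
</ul>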
</details>
<details>
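<summary>2025/06/13 2.0.0 Released</summary>
<ul>
<li><strong>New architecture</strong>: MinerU 2.0 has been deeply refactored in both code structure and interaction design, significantly improving usability, maintainability, and extensibility.
<ul>
<li><strong>Removal of third-party dependency restrictions</strong>: The dependency on <code>pymupdf</code> has been removed entirely, moving the project toward a more open and license-compliant open-source direction.</li>
<li><strong>Ready out of the box, easy to configure</strong>: There is no need to edit JSON configuration files by hand; most parameters can be set directly from the command line or the API.</li>
<li><strong>Automatic model management</strong>: Adds automatic model download and update, so models are deployed without manual intervention.</li>
<li><strong>Offline deployment friendly</strong>: Provides a built-in model download command and supports deployment in fully offline environments.</li>
<li><strong>Streamlined code structure</strong>: Thousands of lines of redundant code were removed and class inheritance was simplified, significantly improving readability and development efficiency.</li>
<li><strong>Unified intermediate format output</strong>: Adopts the standardized <code>middle_json</code> format, which is compatible with most secondary-development scenarios built on it, so existing ecosystem projects can migrate seamlessly.</li>
</ul>
</li>
<li><strong>New model</strong>: MinerU 2.0 integrates our latest small-parameter, high-performance multimodal document parsing model, enabling fast, high-accuracy, end-to-end document understanding.
<ul>
<li><strong>Small model, big capabilities</strong>: With fewer than 1B parameters, the model surpasses traditional 72B-class vision-language models (VLMs) in parsing accuracy.</li>
<li><strong>Multiple functions in one</strong>: A single model covers multilingual recognition, handwriting recognition, layout analysis, table parsing, formula recognition, reading-order sorting, and other core tasks.</li>
<li><strong>Fast inference</strong>: With <code>sglang</code> acceleration on a single NVIDIA 4090, peak throughput exceeds 10,000 tokens/s, enough to handle large-scale document processing.</li>
<li><strong>Online demo</strong>: You can try the new VLM model on <a href="https://mineru.net/OpenSourceTools/Extractor">MinerU.net</a>, <a href="https://huggingface.co/spaces/opendatalab/MinerU">Hugging Face</a>, and <a href="https://www.modelscope.cn/studios/OpenDataLab/MinerU">ModelScope</a>.</li>
</ul>
</li>
<li><strong>Incompatible changes</strong>: To improve the overall architecture and long-term maintainability, this release includes some incompatible changes:
<ul>
<li>The Python package name has changed from <code>magic-pdf</code> to <code>mineru</code>, and the command-line tool has also changed from <code>magic-pdf</code> to <code>mineru</code>; please update your scripts and commands accordingly.</li>
<li>For the sake of modular design and ecosystem consistency, MinerU 2.0 no longer bundles the LibreOffice document conversion module. To process Office documents, first convert them to PDF with a separately deployed LibreOffice service, then parse them.</li>
</ul>
</li>
</ul>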
</details>
<details>
<summary>2025/05/24 1.3.12 Released</summary>
<ul>
......@@ -372,8 +412,6 @@
<li><a href="#acknowledgments">Acknowledgements</a></li>
<li><a href="#citation">Citation</a></li>
<li><a href="#star-history">Star History</a></li>
<li><a href="#magic-doc">magic-doc: fast PPT/DOC/PDF extraction</a></li>
<li><a href="#magic-html">magic-html: mixed web page content extraction</a></li>
<li><a href="#links">Links</a></li>
</ol>
</details>
......@@ -423,7 +461,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
>
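> In non-mainline environments, because of the diversity of hardware and software configurations and compatibility issues with third-party dependencies, we cannot guarantee that the project will be 100% usable. Users who want to run it in a non-recommended environment should first read the documentation and the FAQ carefully, since most problems already have a solution in the FAQ. We also encourage community feedback so that we can gradually expand the range of supported environments.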
<table border="1">
<table>
<tr>
<td>Parsing backend</td>
<td>pipeline</td>
......@@ -436,6 +474,16 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
<td>windows/linux</td>
<td>windows(wsl2)/linux</td>
</tr>
<tr>
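<td>CPU inference support</td>
<td></td>
<td colspan="2"></td>
</tr>
<tr>
<td>GPU requirements</td>
<td>Turing architecture or later with 6 GB or more of VRAM, or Apple Silicon</td>
<td colspan="2">Turing architecture or later with 8 GB or more of VRAM</td>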
</tr>
<tr>
<td>Memory requirements</td>
<td colspan="3">16 GB minimum, 32 GB or more recommended</td>
......@@ -448,18 +496,6 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
<td>Python version</td>
<td colspan="3">3.10-3.13</td>
</tr>
<tr>
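<td>CPU inference support</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>GPU requirements</td>
<td>Turing architecture or later with 6 GB or more of VRAM, or Apple Silicon</td>
<td>Ampere architecture or later with 8 GB or more of VRAM</td>
<td>Ampere architecture or later with 24 GB or more of VRAM</td>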
</tr>
</table>
## Online Demo
......@@ -492,7 +528,7 @@ uv pip install -e .[core] -i https://mirrors.aliyun.com/pypi/simple
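> On Linux and macOS, cuda/mps acceleration is supported automatically after installation. Windows users who need cuda acceleration
> should visit the [PyTorch website](https://pytorch.org/get-started/locally/) and install the PyTorch build that matches their cuda version.
#### 1.3 Install the full version (with sglang acceleration) (requires a GPU with Ampere or later architecture and 24 GB or more of VRAM)
#### 1.3 Install the full version (with sglang acceleration) (requires a GPU with Turing or later architecture and 8 GB or more of VRAM)
To use **sglang to accelerate VLM model inference**, install the full version with whichever method suits you: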
......@@ -504,6 +540,10 @@ uv pip install -e .[core] -i https://mirrors.aliyun.com/pypi/simple
```bash
uv pip install -e .[all] -i https://mirrors.aliyun.com/pypi/simple
```
> [!TIP]
> If errors occur while installing sglang, consult the [official sglang documentation](https://docs.sglang.ai/start/install.html) or install via Docker instead.
- Build the image with the Dockerfile:
```bash
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/china/Dockerfile
......@@ -525,7 +565,8 @@ uv pip install -e .[core] -i https://mirrors.aliyun.com/pypi/simple
```
> [!TIP]
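> By default the Dockerfile uses `lmsysorg/sglang:v0.4.7-cu124` as the base image; you can change it to a version for another platform if needed.
> By default the Dockerfile uses `lmsysorg/sglang:v0.4.8.post1-cu126` as the base image, which supports the Turing/Ampere/Ada Lovelace/Hopper platforms.
> If you are using the newer Blackwell platform, change the base image to `lmsysorg/sglang:v0.4.8.post1-cu128-b200`.
#### 1.4 Install the client (for connecting to sglang-server from edge devices that need only a CPU and a network connection)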
......@@ -548,7 +589,7 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://<host_ip>
mineru -p <input_path> -o <output_path>
```
- `<input_path>`: local PDF file or directory (supports pdf/png/jpg/jpeg)
- `<input_path>`: local PDF/image file or directory (supports pdf/png/jpg/jpeg/webp/gif)
- `<output_path>`: output directory
##### View help information
......@@ -571,14 +612,15 @@ Options:
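-m, --method [auto|txt|ocr] Parsing method: auto (default), txt, ocr (pipeline backend only)
-b, --backend [pipeline|vlm-transformers|vlm-sglang-engine|vlm-sglang-client]
Parsing backend (default: pipeline)
-l, --lang [ch|ch_server|... ] Document language (can improve OCR accuracy, pipeline backend only)
-l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|latin|arabic|east_slavic|cyrillic|devanagari]
Document language (can improve OCR accuracy, pipeline backend only)
-u, --url TEXT Server address, required when using sglang-client
-s, --start INTEGER Page number to start parsing from (0-based)
-e, --end INTEGER Page number to stop parsing at (0-based)
-f, --formula BOOLEAN Whether to enable formula parsing (enabled by default, pipeline backend only)
-t, --table BOOLEAN Whether to enable table parsing (enabled by default, pipeline backend only)
-f, --formula BOOLEAN Whether to enable formula parsing (enabled by default)
-t, --table BOOLEAN Whether to enable table parsing (enabled by default)
-d, --device TEXT Inference device (e.g. cpu/cuda/cuda:0/npu/mps, pipeline backend only)
--vram INTEGER Maximum GPU VRAM usage per process (pipeline backend only)
--vram INTEGER Maximum GPU VRAM usage per process, in GB (pipeline backend only)
--source [huggingface|modelscope|local]
Model source, default huggingface
--help Show help information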
......@@ -650,14 +692,6 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-engine
mineru-sglang-server --port 30000
```
> [!TIP]
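> sglang-server has some commonly used parameters:
> - If you have two GPUs with `12G` or `16G` of VRAM, you can use tensor parallel (TP) mode: `--tp 2`
> - If you have two `11G` GPUs, in addition to tensor parallelism you also need to reduce the KV cache size: `--tp 2 --mem-fraction-static 0.7`
> - If you have multiple GPUs with `24G` or more of VRAM, you can use sglang's multi-GPU parallel mode to increase throughput: `--dp 2`
> - You can also enable `torch.compile` to speed up inference by about 15%: `--enable-torch-compile`
> - For more on `sglang` parameters, see the [official sglang documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)
2. In another terminal, call it with the client: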
```bash
......@@ -669,29 +703,75 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1
---
### 3. API 调用方式
### 3. API 调用 或 可视化调用
You can also call MinerU from Python code; for sample code, see:
👉 [Python usage example](demo/demo.py)
1. Call the Python API directly: [Python usage example](demo/demo.py)
2. Call via FastAPI:
```bash
mineru-api --host 127.0.0.1 --port 8000
```
Open http://127.0.0.1:8000/docs in a browser to view the API documentation.
---
3. Call via the Gradio WebUI or the Gradio API:
```bash
# Use the pipeline/vlm-transformers/vlm-sglang-client backends
mineru-gradio --server-name 127.0.0.1 --server-port 7860
# Or use the vlm-sglang-engine/pipeline backends
mineru-gradio --server-name 127.0.0.1 --server-port 7860 --enable-sglang-engine
```
Open http://127.0.0.1:7860 in a browser to use the Gradio WebUI, or open http://127.0.0.1:7860/?view=api to use the Gradio API.
### 4. Deploying Derivative Projects
Community developers have built a variety of extensions on top of MinerU, including:
> [!TIP]
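> Some suggestions and notes on using sglang acceleration mode:
> - sglang acceleration mode currently runs on Turing-architecture GPUs with as little as 8G of VRAM, but GPUs with less than 24G of VRAM may run out of memory. The following parameters can reduce VRAM usage:
> - If you run out of VRAM on a single GPU, you may need to reduce the KV cache size with `--mem-fraction-static 0.5`; if the problem persists, try lowering the value further to `0.4` or below.
> - If you have two or more GPUs, you can try tensor parallel (TP) mode to extend the available VRAM: `--tp 2`
> - If sglang already accelerates VLM inference correctly for you and you want even more speed, try the following parameters:
> - If you have multiple GPUs, you can use sglang's multi-GPU parallel mode to increase throughput: `--dp 2`
> - You can also enable `torch.compile` to speed up inference by about 15%: `--enable-torch-compile`
> - For more on `sglang` parameters, see the [official sglang documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)
> - All parameters officially supported by sglang can be passed to MinerU on the command line, for the following commands: `mineru`, `mineru-sglang-server`, `mineru-gradio`, `mineru-api`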
- A Gradio-based graphical interface
- A FastAPI-based Web API
- A client/server architecture with multi-GPU load balancing
- An MCP Server based on the official website API
> [!TIP]
> - In any case, you can specify the visible GPU devices by adding the `CUDA_VISIBLE_DEVICES` environment variable at the start of the command line. For example:
> ```bash
> CUDA_VISIBLE_DEVICES=1 mineru -p <input_path> -o <output_path>
> ```
> - This works for all command-line invocations, including `mineru`, `mineru-sglang-server`, `mineru-gradio`, and `mineru-api`, and applies to both the `pipeline` and `vlm` backends.
> - Some common `CUDA_VISIBLE_DEVICES` settings:
> ```bash
> CUDA_VISIBLE_DEVICES=1       # Only device 1 will be seen
> CUDA_VISIBLE_DEVICES=0,1     # Devices 0 and 1 will be visible
> CUDA_VISIBLE_DEVICES="0,1"   # Same as above, quotation marks are optional
> CUDA_VISIBLE_DEVICES=0,2,3   # Devices 0, 2, 3 will be visible; device 1 is masked
> CUDA_VISIBLE_DEVICES=""      # No GPU will be visible
> ```
> - Some possible usage scenarios:
> - If you have multiple GPUs and want to start `sglang-server` on GPU 0 and GPU 1 with multi-GPU parallelism, use:
> ```bash
> CUDA_VISIBLE_DEVICES=0,1 mineru-sglang-server --port 30000 --dp 2
> ```
> - If you have multiple GPUs and want to start two `fastapi` services on GPU 0 and GPU 1, each listening on a different port, use:
> ```bash
> # In terminal 1
> CUDA_VISIBLE_DEVICES=0 mineru-api --host 127.0.0.1 --port 8000
> # In terminal 2
> CUDA_VISIBLE_DEVICES=1 mineru-api --host 127.0.0.1 --port 8001
> ```
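The one-service-per-GPU setup described in the tip can also be scripted. The following is a minimal sketch, assuming `mineru-api` is on the `PATH` and accepts the `--host` and `--port` flags shown in this document; the helper names `gpu_env` and `launch_api_servers` are hypothetical and not part of MinerU.

```python
import os
import subprocess


def gpu_env(device_ids):
    """Copy the current environment and restrict the visible GPUs."""
    env = dict(os.environ)
    # An empty list yields "", which hides every GPU from the child process.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env


def launch_api_servers(device_ids, host="127.0.0.1", base_port=8000):
    """Start one `mineru-api` process per GPU, each on its own port."""
    procs = []
    for offset, device in enumerate(device_ids):
        cmd = ["mineru-api", "--host", host, "--port", str(base_port + offset)]
        procs.append(subprocess.Popen(cmd, env=gpu_env([device])))
    return procs
```

Calling `launch_api_servers([0, 1])` would mirror the two-terminal example: GPU 0 serves port 8000 and GPU 1 serves port 8001. Setting the variable per child process, rather than globally, keeps each service pinned to its own device.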
These projects generally offer a better user experience and more features.
---
For detailed deployment instructions, see:
👉 [Derivative projects documentation](projects/README_zh-CN.md)
### 4. 基于配置文件扩展 MinerU 功能
---
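- MinerU now works out of the box, but it also supports extending its functionality through a configuration file. You can create a `mineru.json` file in your user directory and add custom configuration to it.
- The `mineru.json` file is generated automatically when you run the built-in model download command `mineru-models-download`. You can also create it by copying the [configuration template file](./mineru.template.json) to your user directory and renaming it `mineru.json`.
- Some of the available configuration options:
- `latex-delimiter-config`: configures the delimiters for LaTeX formulas. The default is the `$` symbol, which can be changed to other symbols or strings as needed.
- `llm-aided-config`: configures the parameters for LLM-aided heading classification. It is compatible with any LLM that supports the `OpenAI protocol`; the default is the `qwen2.5-32b-instruct` model on `Alibaba Cloud Bailian`. You need to supply your own API key and set `enable` to `true` to turn this feature on.
- `models-dir`: specifies the local model storage directory. Specify separate directories for the `pipeline` and `vlm` backends; after that, you can use local models by setting the environment variable `export MINERU_MODEL_SOURCE=local`.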
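To check which of these options a given `mineru.json` actually sets, a small inspection script can help. This is only a sketch: it assumes the file sits directly in the user's home directory and that the three option names above are top-level keys. The helper name `summarize_config` is hypothetical, and the internal structure of each option is deliberately not interpreted.

```python
import json
from pathlib import Path

# Option names taken from the list above; assumed to be top-level keys.
KNOWN_KEYS = ("latex-delimiter-config", "llm-aided-config", "models-dir")


def summarize_config(path=None):
    """Return the documented options that are present in mineru.json."""
    path = Path(path) if path else Path.home() / "mineru.json"
    if not path.exists():
        return {}
    config = json.loads(path.read_text(encoding="utf-8"))
    return {key: config[key] for key in KNOWN_KEYS if key in config}


if __name__ == "__main__":
    for key, value in summarize_config().items():
        print(f"{key}: {value!r}")
```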
---
# TODO
......@@ -706,7 +786,7 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1
# Known Issues
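- Reading order is determined by the model from the spatial distribution of readable content, so some regions may come out of order in extremely complex layouts
- Vertical text is not supported
- Support for vertical text is limited
- Tables of contents and lists are recognized by rules, so a few uncommon list formats may not be recognized
- Code blocks are not yet supported by the layout model
- Comic books, art albums, primary school textbooks, and exercise sheets cannot be parsed well yet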
......@@ -715,11 +795,10 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1
- Some formulas may fail to render in Markdown
# FAQ
[FAQ (Chinese)](docs/FAQ_zh_cn.md)
[FAQ](docs/FAQ_en_us.md)
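- If you run into problems, first check whether the [FAQ](docs/FAQ_zh_cn.md) has an answer.
- If that does not solve your problem, you can also talk to the AI assistant on [DeepWiki](https://deepwiki.com/opendatalab/MinerU), which can resolve most common issues.
- If you still cannot solve the problem, join the community through [Discord](https://discord.gg/Tdedn9GTXq) or [WeChat](http://mineru.space/s/V85Yl) to talk with other users and developers.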
# All Thanks To Our Contributors
......@@ -780,16 +859,13 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1
</picture>
</a>
# Magic-doc
[Magic-Doc](https://github.com/InternLM/magic-doc) Fast speed ppt/pptx/doc/docx/pdf extraction tool
# Magic-html
[Magic-HTML](https://github.com/opendatalab/magic-html) Mixed web page extraction tool
# Links
- [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
- [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
- [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)
- [Vis3 (OSS browser based on s3)](https://github.com/opendatalab/Vis3)
- [OmniDocBench (A Comprehensive Benchmark for Document Parsing and Evaluation)](https://github.com/opendatalab/OmniDocBench)
- [Magic-HTML (Mixed web page extraction tool)](https://github.com/opendatalab/magic-html)
- [Magic-Doc (Fast speed ppt/pptx/doc/docx/pdf extraction tool)](https://github.com/InternLM/magic-doc)
\ No newline at end of file
......@@ -25,8 +25,8 @@ def do_parse(
p_lang_list: list[str], # List of languages for each PDF, default is 'ch' (Chinese)
backend="pipeline", # The backend for parsing PDF, default is 'pipeline'
parse_method="auto", # The method for parsing PDF, default is 'auto'
p_formula_enable=True, # Enable formula parsing
p_table_enable=True, # Enable table parsing
formula_enable=True, # Enable formula parsing
table_enable=True, # Enable table parsing
server_url=None, # Server URL for vlm-sglang-client backend
f_draw_layout_bbox=True, # Whether to draw layout bounding boxes
f_draw_span_bbox=True, # Whether to draw span bounding boxes
......@@ -45,7 +45,7 @@ def do_parse(
new_pdf_bytes = convert_pdf_bytes_to_bytes_by_pypdfium2(pdf_bytes, start_page_id, end_page_id)
pdf_bytes_list[idx] = new_pdf_bytes
infer_results, all_image_lists, all_pdf_docs, lang_list, ocr_enabled_list = pipeline_doc_analyze(pdf_bytes_list, p_lang_list, parse_method=parse_method, formula_enable=p_formula_enable,table_enable=p_table_enable)
infer_results, all_image_lists, all_pdf_docs, lang_list, ocr_enabled_list = pipeline_doc_analyze(pdf_bytes_list, p_lang_list, parse_method=parse_method, formula_enable=formula_enable,table_enable=table_enable)
for idx, model_list in enumerate(infer_results):
model_json = copy.deepcopy(model_list)
......@@ -57,7 +57,7 @@ def do_parse(
pdf_doc = all_pdf_docs[idx]
_lang = lang_list[idx]
_ocr_enable = ocr_enabled_list[idx]
middle_json = pipeline_result_to_middle_json(model_list, images_list, pdf_doc, image_writer, _lang, _ocr_enable, p_formula_enable)
middle_json = pipeline_result_to_middle_json(model_list, images_list, pdf_doc, image_writer, _lang, _ocr_enable, formula_enable)
pdf_info = middle_json["pdf_info"]
......@@ -169,8 +169,8 @@ def parse_doc(
backend="pipeline",
method="auto",
server_url=None,
start_page_id=0, # Start page ID for parsing, default is 0
end_page_id=None # End page ID for parsing, default is None (parse all pages until the end of the document)
start_page_id=0,
end_page_id=None
):
"""
Parameter description:
......@@ -192,6 +192,8 @@ def parse_doc(
Without method specified, 'auto' will be used by default.
Adapted only for the case where the backend is set to "pipeline".
server_url: When the backend is `sglang-client`, you need to specify the server_url, for example:`http://127.0.0.1:30000`
start_page_id: Start page ID for parsing, default is 0
end_page_id: End page ID for parsing, default is None (parse all pages until the end of the document)
"""
try:
file_name_list = []
......
# Use the official sglang image
FROM lmsysorg/sglang:v0.4.7-cu124
FROM lmsysorg/sglang:v0.4.8.post1-cu126
# install mineru latest
RUN python3 -m pip install -U 'mineru[core]' -i https://mirrors.aliyun.com/pypi/simple --break-system-packages
......
# Use the official sglang image
FROM lmsysorg/sglang:v0.4.7-cu124
FROM lmsysorg/sglang:v0.4.8.post1-cu126
# install mineru latest
RUN python3 -m pip install -U 'mineru[core]' --break-system-packages
......
# Frequently Asked Questions
### 1. When using the command `pip install magic-pdf[full]` on newer versions of macOS, the error `zsh: no matches found: magic-pdf[full]` occurs.
On macOS, the default shell has switched from Bash to Z shell, which has special handling logic for certain types of string matching. This can lead to the "no matches found" error. You can try disabling the globbing feature in the command line and then run the installation command again.
```bash
setopt no_nomatch
pip install magic-pdf[full]
```
### 2. Encountering the error `pickle.UnpicklingError: invalid load key, 'v'.` during use
This might be due to an incomplete download of the model file. You can try re-downloading the model file and then try again.
Reference: https://github.com/opendatalab/MinerU/issues/143
### 3. Where should the model files be downloaded and how should the `/models-dir` configuration be set?
The path for the model files is configured in "magic-pdf.json", for example:
```json
{
"models-dir": "/tmp/models"
}
```
This path is an absolute path, not a relative path. You can obtain the absolute path in the models directory using the "pwd" command.
Reference: https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874
### 4. Encountered the error `ImportError: libGL.so.1: cannot open shared object file: No such file or directory` in Ubuntu 22.04 on WSL2
### 1. Encountered the error `ImportError: libGL.so.1: cannot open shared object file: No such file or directory` in Ubuntu 22.04 on WSL2
The `libgl` library is missing in Ubuntu 22.04 on WSL2. You can install the `libgl` library with the following command to resolve the issue:
......@@ -37,59 +10,14 @@ sudo apt-get install libgl1-mesa-glx
Reference: https://github.com/opendatalab/MinerU/issues/388
### 5. Encountered error `ModuleNotFoundError: No module named 'fairscale'`
You need to uninstall the module and reinstall it:
```bash
pip uninstall fairscale
pip install fairscale
```
Reference: https://github.com/opendatalab/MinerU/issues/411
### 6. On some newer devices like the H100, the text parsed during OCR using CUDA acceleration is garbled.
The compatibility of cuda11 with new graphics cards is poor, and the CUDA version used by Paddle needs to be upgraded.
```bash
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
```
Reference: https://github.com/opendatalab/MinerU/issues/558
### 7. On some Linux servers, the program immediately reports an error `Illegal instruction (core dumped)`
This might be because the server's CPU does not support the AVX/AVX2 instruction set, or the CPU itself supports it but has been disabled by the system administrator. You can try contacting the system administrator to remove the restriction or change to a different server.
References: https://github.com/opendatalab/MinerU/issues/591 , https://github.com/opendatalab/MinerU/issues/736
### 8. Error when installing MinerU on CentOS 7 or Ubuntu 18: `ERROR: Failed building wheel for simsimd`
### 2. Error when installing MinerU on CentOS 7 or Ubuntu 18: `ERROR: Failed building wheel for simsimd`
The new version of albumentations (1.4.21) introduces a dependency on simsimd. Since the pre-built package of simsimd for Linux requires a glibc version greater than or equal to 2.28, this causes installation issues on some Linux distributions released before 2019. You can resolve this issue by using the following command:
```
pip install -U magic-pdf[full,old_linux] --extra-index-url https://wheels.myhloli.com
conda create -n mineru python=3.11 -y
conda activate mineru
pip install -U "mineru[pipeline_old_linux]"
```
Reference: https://github.com/opendatalab/MinerU/issues/1004
### 9. Old Graphics Cards Such as M40 Encounter "RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED"
An error occurs during operation (cuda):
```
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
```
Because BF16 precision is not supported on graphics cards before the Turing architecture and some graphics cards are not recognized by torch, it is necessary to manually disable BF16 precision.
Modify the code in lines 287-290 of the "pdf_parse_union_core_v2.py" file (note that the location may vary in different versions):
```
if torch.cuda.is_bf16_supported():
supports_bfloat16 = True
else:
supports_bfloat16 = False
```
Change it to:
```
supports_bfloat16 = False
```
Reference: https://github.com/opendatalab/MinerU/issues/1508
\ No newline at end of file
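# Frequently Asked Questions
### 1. On newer versions of macOS, installing with pip install magic-pdf\[full\] fails with zsh: no matches found: magic-pdf\[full\]
On macOS, the default shell has switched from Bash to Z shell, and Z shell has special handling for certain kinds of string matching, which can cause the "no matches found" error.
You can disable the globbing feature in the command line and then run the installation command again.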
```bash
setopt no_nomatch
pip install magic-pdf[full]
```
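### 2. Encountering the error _pickle.UnpicklingError: invalid load key, 'v'. during use
This may be caused by an incomplete download of the model files. Try downloading the model files again and then retry.
Reference: https://github.com/opendatalab/MinerU/issues/143
### 3. Where should the model files be downloaded, and how should models-dir be configured?
The path for the model files is configured in "magic-pdf.json", for example:
```json
{
"models-dir": "/tmp/models"
}
```
This path must be absolute, not relative. You can get the absolute path by running "pwd" in the models directory.
Reference: https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874
### 4. Encountered the error `ImportError: libGL.so.1: cannot open shared object file: No such file or directory` in Ubuntu 22.04 on WSL2
### 1. Encountered the error `ImportError: libGL.so.1: cannot open shared object file: No such file or directory` in Ubuntu 22.04 on WSL2
The `libgl` library is missing in Ubuntu 22.04 on WSL2. Install it with the following command to resolve the issue: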
......@@ -39,59 +10,14 @@ sudo apt-get install libgl1-mesa-glx
Reference: https://github.com/opendatalab/MinerU/issues/388
### 5. Encountered the error `ModuleNotFoundError: No module named 'fairscale'`
You need to uninstall the module and reinstall it:
```bash
pip uninstall fairscale
pip install fairscale
```
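Reference: https://github.com/opendatalab/MinerU/issues/411
### 6. On some newer devices such as the H100, text parsed by OCR with CUDA acceleration is garbled.
cuda11 has poor compatibility with newer GPUs, so the CUDA version used by Paddle needs to be upgraded: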
```bash
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
```
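Reference: https://github.com/opendatalab/MinerU/issues/558
### 7. On some Linux servers, the program fails immediately with `非法指令 (核心已转储)` or `Illegal instruction (core dumped)`
This may be because the server's CPU does not support the AVX/AVX2 instruction set, or because the CPU supports it but the system administrator has disabled it. Try asking the administrator to lift the restriction, or switch to a different server.
References: https://github.com/opendatalab/MinerU/issues/591 , https://github.com/opendatalab/MinerU/issues/736
### 8. Error when installing MinerU on CentOS 7 or Ubuntu 18: `ERROR: Failed building wheel for simsimd`
### 2. Error when installing MinerU on CentOS 7 or Ubuntu 18: `ERROR: Failed building wheel for simsimd`
The new version of albumentations (1.4.21) introduces a dependency on simsimd. Because the pre-built simsimd package for Linux requires glibc 2.28 or later, installation fails on some Linux distributions released before 2019. You can install with the following command instead: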
```
pip install -U magic-pdf[full,old_linux] --extra-index-url https://wheels.myhloli.com
conda create -n mineru python=3.11 -y
conda activate mineru
pip install -U "mineru[pipeline_old_linux]"
```
Reference: https://github.com/opendatalab/MinerU/issues/1004
### 9. Old graphics cards such as the M40 raise "RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED"
The following error occurs at runtime (when using CUDA):
```
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
```
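Graphics cards older than the Turing architecture do not support BF16 precision, and some cards are not correctly recognized by PyTorch, so BF16 precision must be disabled manually.
Find and modify lines 287-290 of `pdf_parse_union_core_v2.py` (note that the location may differ between versions). The original code is: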
```python
if torch.cuda.is_bf16_supported():
supports_bfloat16 = True
else:
supports_bfloat16 = False
```
Change it to:
```python
supports_bfloat16 = False
```
Reference: https://github.com/opendatalab/MinerU/issues/1508
docs/images/layout_example.png (binary image replaced: 559 KB → 626 KB)
## Overview
After executing the `magic-pdf` command, in addition to outputting files related to markdown, several other files unrelated to markdown will also be generated. These files will be introduced one by one.
After executing the `mineru` command, in addition to outputting files related to markdown, several other files unrelated to markdown will also be generated. These files will be introduced one by one.
### some_pdf_layout.pdf
Each page layout consists of one or more boxes. The number at the top left of each box indicates its sequence number. Additionally, in `layout.pdf`, different content blocks are highlighted with different background colors.
Each page's layout consists of one or more bounding boxes. The number in the top-right corner of each box indicates its position in the reading order. Additionally, `layout.pdf` highlights different content blocks with distinct background colors.
![layout example](images/layout_example.png)
### some_pdf_spans.pdf
### some_pdf_spans.pdf (Applicable only to the pipeline backend)
All spans on the page are drawn with different colored line frames according to the span type. This file can be used for quality control, allowing for quick identification of issues such as missing text or unrecognized inline formulas.
![spans example](images/spans_example.png)
### some_pdf_model.json
### some_pdf_model.json (Applicable only to the pipeline backend)
#### Structure Definition
......@@ -117,13 +116,39 @@ The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], repres
]
```
### some_pdf_model_output.txt (Applicable only to the VLM backend)
This file contains the output of the VLM model, with each page's output separated by `----`.
Each page's output consists of text blocks starting with `<|box_start|>` and ending with `<|md_end|>`.
The meaning of each field is as follows:
- `<|box_start|>x0 y0 x1 y1<|box_end|>`
x0 y0 x1 y1 are the coordinates of the block's bounding rectangle: (x0, y0) is the top-left point and (x1, y1) is the bottom-right point. The values are based on a page size normalized to 1000x1000.
- `<|ref_start|>type<|ref_end|>`
`type` indicates the block type. Possible values are:
```json
{
"text": "Text",
"title": "Title",
"image": "Image",
"image_caption": "Image Caption",
"image_footnote": "Image Footnote",
"table": "Table",
"table_caption": "Table Caption",
"table_footnote": "Table Footnote",
"equation": "Interline Equation"
}
```
- `<|md_start|>Markdown content<|md_end|>`
This field contains the Markdown content of the block. If `type` is `text`, the end of the text may contain the `<|txt_contd|>` tag, indicating that this block can be connected with the following `text` block(s).
If `type` is `table`, the content is in `otsl` format and needs to be converted into HTML for rendering in Markdown.
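The tag layout described above can be read back with a regular expression. The following is a rough sketch rather than MinerU's own parser: it assumes coordinates are whitespace-separated integers, that the three tag groups follow each other directly, and that one page size applies to every page. The names `parse_page` and `parse_model_output` are hypothetical.

```python
import re

# One block: box coordinates, block type, then the Markdown body.
BLOCK_RE = re.compile(
    r"<\|box_start\|>\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*<\|box_end\|>\s*"
    r"<\|ref_start\|>\s*(\w+)\s*<\|ref_end\|>\s*"
    r"<\|md_start\|>(.*?)<\|md_end\|>",
    re.DOTALL,
)
CONTINUED = "<|txt_contd|>"


def parse_page(page_text, page_width, page_height):
    """Parse one page of VLM output into a list of block dicts."""
    blocks = []
    for m in BLOCK_RE.finditer(page_text):
        x0, y0, x1, y1 = (int(v) for v in m.group(1, 2, 3, 4))
        content = m.group(6)
        # A trailing <|txt_contd|> means the text continues in a later block.
        continued = content.endswith(CONTINUED)
        if continued:
            content = content[: -len(CONTINUED)]
        blocks.append({
            "type": m.group(5),
            # Rescale from the normalized 1000x1000 page to real page units.
            "bbox": [x0 * page_width / 1000, y0 * page_height / 1000,
                     x1 * page_width / 1000, y1 * page_height / 1000],
            "content": content,
            "continued": continued,
        })
    return blocks


def parse_model_output(text, page_width, page_height):
    """Split the file on the `----` page separator and parse each page."""
    return [parse_page(p, page_width, page_height) for p in text.split("----")]
```

Splitting on `----` is the simplest reading of the format; a block whose Markdown body itself contains `----` would be cut in two, so a production parser would need a stricter page delimiter. Converting `otsl` table content to HTML is out of scope for this sketch.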
### some_pdf_middle.json
| Field Name | Description |
| :------------- | :------------------------------------------------------------------------------------------------------------- |
|:---------------| :------------------------------------------------------------------------------------------------------------- |
| pdf_info | list, each element is a dict representing the parsing result of each PDF page, see the table below for details |
| \_parse_type | ocr \| txt, used to indicate the mode used in this intermediate parsing state |
| \_version_name | string, indicates the version of magic-pdf used in this parsing |
| \_backend | pipeline \| vlm, used to indicate the mode used in this intermediate parsing state |
| \_version_name | string, indicates the version of mineru used in this parsing |
<br>
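Since this change replaces the `_parse_type` field with `_backend`, code that reads `middle.json` files produced by different versions may want to accept either field. A minimal sketch under that assumption follows; the helper name `describe_middle_json` is hypothetical.

```python
import json


def describe_middle_json(middle):
    """Summarize the top-level fields of a parsed middle.json dict."""
    # Newer files carry `_backend`; older ones carry `_parse_type`.
    if "_backend" in middle:
        field, mode = "_backend", middle["_backend"]
    elif "_parse_type" in middle:
        field, mode = "_parse_type", middle["_parse_type"]
    else:
        field, mode = None, None
    return {
        "mode_field": field,
        "mode": mode,
        "version": middle.get("_version_name"),
        "pages": len(middle.get("pdf_info", [])),
    }


if __name__ == "__main__":
    sample = json.loads(
        '{"pdf_info": [{}], "_backend": "pipeline", "_version_name": "0.6.1"}'
    )
    print(describe_middle_json(sample))
```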
......@@ -324,7 +349,92 @@ First-level block (if any) -> Second-level block -> Line -> Span
]
}
],
"_parse_type": "txt",
"_backend": "pipeline",
"_version_name": "0.6.1"
}
```
### some_pdf_content_list.json
This file is a JSON array. Each element is a dict describing one readable content block, and the blocks are stored in reading order.
`content_list` can be viewed as a simplified version of `middle.json`. The content block types are mostly consistent with those in `middle.json`, but layout information is not included.
The content has the following types:
| type | desc |
|:---------|:--------------|
| image | Image |
| table | Table |
| text | Text / Title |
| equation | Block formula |
Please note that both `title` and text blocks in `content_list` are uniformly represented using the text type. The `text_level` field is used to distinguish the hierarchy of text blocks:
- A block without the `text_level` field or with `text_level=0` represents body text.
- A block with `text_level=1` represents a level-1 heading.
- A block with `text_level=2` represents a level-2 heading, and so on.
Each content block contains the `page_idx` field, indicating the page number (starting from 0) where the block resides.
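The `text_level` convention above is enough to rebuild a heading outline from a `content_list`. The snippet below is a small sketch of that idea; the function names are hypothetical, and it only relies on the `type`, `text`, `text_level`, and `page_idx` fields described here.

```python
def heading_outline(content_list):
    """Return (page_idx, level, title) for every heading block."""
    outline = []
    for block in content_list:
        level = block.get("text_level", 0)
        # Body text has no text_level or text_level=0; headings start at 1.
        if block.get("type") == "text" and level >= 1:
            outline.append((block["page_idx"], level, block["text"].strip()))
    return outline


def render_text_block(block):
    """Render a text block as Markdown, using `#` marks for headings."""
    level = block.get("text_level", 0)
    text = block["text"].strip()
    return f"{'#' * level} {text}" if level >= 1 else text
```

Reading the level with `block.get("text_level", 0)` treats a missing field and an explicit `0` the same way, which matches the rule that both represent body text.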
#### example
```json
[
{
"type": "text",
"text": "The response of flow duration curves to afforestation ",
"text_level": 1,
"page_idx": 0
},
{
"type": "text",
"text": "Received 1 October 2003; revised 22 December 2004; accepted 3 January 2005 ",
"page_idx": 0
},
{
"type": "text",
"text": "Abstract ",
"text_level": 2,
"page_idx": 0
},
{
"type": "text",
"text": "The hydrologic effect of replacing pasture or other short crops with trees is reasonably well understood on a mean annual basis. The impact on flow regime, as described by the annual flow duration curve (FDC) is less certain. A method to assess the impact of plantation establishment on FDCs was developed. The starting point for the analyses was the assumption that rainfall and vegetation age are the principal drivers of evapotranspiration. A key objective was to remove the variability in the rainfall signal, leaving changes in streamflow solely attributable to the evapotranspiration of the plantation. A method was developed to (1) fit a model to the observed annual time series of FDC percentiles; i.e. 10th percentile for each year of record with annual rainfall and plantation age as parameters, (2) replace the annual rainfall variation with the long term mean to obtain climate adjusted FDCs, and (3) quantify changes in FDC percentiles as plantations age. Data from 10 catchments from Australia, South Africa and New Zealand were used. The model was able to represent flow variation for the majority of percentiles at eight of the 10 catchments, particularly for the 10–50th percentiles. The adjusted FDCs revealed variable patterns in flow reductions with two types of responses (groups) being identified. Group 1 catchments show a substantial increase in the number of zero flow days, with low flows being more affected than high flows. Group 2 catchments show a more uniform reduction in flows across all percentiles. The differences may be partly explained by storage characteristics. The modelled flow reductions were in accord with published results of paired catchment experiments. An additional analysis was performed to characterise the impact of afforestation on the number of zero flow days $( N _ { \\mathrm { z e r o } } )$ for the catchments in group 1. This model performed particularly well, and when adjusted for climate, indicated a significant increase in $N _ { \\mathrm { z e r o } }$ . The zero flow day method could be used to determine change in the occurrence of any given flow in response to afforestation. The methods used in this study proved satisfactory in removing the rainfall variability, and have added useful insight into the hydrologic impacts of plantation establishment. This approach provides a methodology for understanding catchment response to afforestation, where paired catchment data is not available. ",
"page_idx": 0
},
{
"type": "text",
"text": "1. Introduction ",
"text_level": 2,
"page_idx": 1
},
{
"type": "image",
"img_path": "images/a8ecda1c69b27e4f79fce1589175a9d721cbdc1cf78b4cc06a015f3746f6b9d8.jpg",
"img_caption": [
"Fig. 1. Annual flow duration curves of daily flows from Pine Creek, Australia, 1989–2000. "
],
"img_footnote": [],
"page_idx": 1
},
{
"type": "equation",
"img_path": "images/181ea56ef185060d04bf4e274685f3e072e922e7b839f093d482c29bf89b71e8.jpg",
"text": "$$\nQ _ { \\% } = f ( P ) + g ( T )\n$$",
"text_format": "latex",
"page_idx": 2
},
{
"type": "table",
"img_path": "images/e3cb413394a475e555807ffdad913435940ec637873d673ee1b039e3bc3496d0.jpg",
"table_caption": [
"Table 2 Significance of the rainfall and time terms "
],
"table_footnote": [
"indicates that the rainfall term was significant at the $5 \\%$ level, $T$ indicates that the time term was significant at the $5 \\%$ level, \\* represents significance at the $10 \\%$ level, and na denotes too few data points for meaningful analysis. "
],
"table_body": "<html><body><table><tr><td rowspan=\"2\">Site</td><td colspan=\"10\">Percentile</td></tr><tr><td>10</td><td>20</td><td>30</td><td>40</td><td>50</td><td>60</td><td>70</td><td>80</td><td>90</td><td>100</td></tr><tr><td>Traralgon Ck</td><td>P</td><td>P,*</td><td>P</td><td>P</td><td>P,</td><td>P,</td><td>P,</td><td>P,</td><td>P</td><td>P</td></tr><tr><td>Redhill</td><td>P,T</td><td>P,T</td><td>,*</td><td>**</td><td>P.T</td><td>P,*</td><td>P*</td><td>P*</td><td>*</td><td>,*</td></tr><tr><td>Pine Ck</td><td></td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>T</td><td>T</td><td>na</td><td>na</td></tr><tr><td>Stewarts Ck 5</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P.T</td><td>P.T</td><td>P,T</td><td>na</td><td>na</td><td>na</td></tr><tr><td>Glendhu 2</td><td>P</td><td>P,T</td><td>P,*</td><td>P,T</td><td>P.T</td><td>P,ns</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td></tr><tr><td>Cathedral Peak 2</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>*,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td></tr><tr><td>Cathedral Peak 3</td><td>P.T</td><td>P.T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td></tr><tr><td>Lambrechtsbos A</td><td>P,T</td><td>P</td><td>P</td><td>P,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>T</td></tr><tr><td>Lambrechtsbos B</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>T</td></tr><tr><td>Biesievlei</td><td>P,T</td><td>P.T</td><td>P,T</td><td>P,T</td><td>*,T</td><td>*,T</td><td>T</td><td>T</td><td>P,T</td><td>P,T</td></tr></table></body></html>",
"page_idx": 5
}
]
```
\ No newline at end of file
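## Overview
After the `magic-pdf` command runs, in addition to the markdown-related files, it also generates several files unrelated to markdown. These files are described one by one below.
After the `mineru` command runs, in addition to the markdown file, it may also generate several files unrelated to markdown. These files are described one by one below.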
### some_pdf_layout.pdf
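Each page's layout consists of one or more boxes. The number at the top-left corner of each box indicates its order. In addition, layout.pdf uses different background colors inside the boxes to mark the different content blocks.
Each page's layout consists of one or more boxes. The number at the top-right corner of each box indicates its reading order. In addition, layout.pdf uses different background colors inside the boxes to mark the different content blocks.
![layout page example](images/layout_example.png)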
### some_pdf_spans.pdf
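### some_pdf_spans.pdf (pipeline backend only)
All spans on the page are drawn with line frames whose color depends on the span type. This file can be used for quality inspection to quickly spot problems such as missing text or unrecognized inline formulas.
All spans on the page are drawn with line frames whose color depends on the span type. This file can be used for quality inspection to quickly spot problems such as missing text or unrecognized inline formulas.
![span page example](images/spans_example.png)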
### some_pdf_model.json
### some_pdf_model.json (pipeline backend only)
#### Structure definition
......@@ -117,13 +117,39 @@ The poly coordinate format is \[x0, y0, x1, y1, x2, y2, x3, y3\], representing the top-left,
]
```
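### some_pdf_model_output.txt (vlm backend only)
This file is the output of the vlm model; `----` separates the output of each page.
Each page's output consists of text blocks that begin with `<|box_start|>` and end with `<|md_end|>`.
The fields have the following meanings:
- `<|box_start|>x0 y0 x1 y1<|box_end|>`
Here x0 y0 x1 y1 are the coordinates of the quadrilateral, giving its top-left and bottom-right corners; the values are the coordinates after the page is scaled to 1000x1000.
- `<|ref_start|>type<|ref_end|>`
type is the type of the block; possible values are: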
```json
{
"text": "Text",
"title": "Title",
"image": "Image",
"image_caption": "Image caption",
"image_footnote": "Image footnote",
"table": "Table",
"table_caption": "Table caption",
"table_footnote": "Table footnote",
"equation": "Interline equation"
}
```
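- `<|md_start|>markdown content<|md_end|>`
This field holds the block's markdown content. If type is text, the text may end with a `<|txt_contd|>` marker, indicating that this text block can be joined with the following text block.
If type is table, the content is the table expressed in `otsl` format, which must be converted to html before it can be rendered in markdown.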
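
The token layout described above can be read back with a small parser. The following is only a sketch under stated assumptions: the three fields are assumed to appear in the order box, ref, md; the `----` separator is assumed to sit on a line of its own; and the names `parse_model_output` and `scale_bbox` are hypothetical helpers, not part of mineru.

```python
import re

# One block: box coordinates, block type, markdown content.
# Assumption: the fields appear in this order, optionally separated by whitespace.
BLOCK_RE = re.compile(
    r"<\|box_start\|>\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*<\|box_end\|>\s*"
    r"<\|ref_start\|>\s*(\w+)\s*<\|ref_end\|>\s*"
    r"<\|md_start\|>(.*?)<\|md_end\|>",
    re.DOTALL,
)
CONTD = "<|txt_contd|>"


def parse_model_output(raw):
    """Split the raw text into pages and each page into block dicts."""
    # Assumption: a page separator is a line consisting only of "----".
    chunks = re.split(r"^-{4}\s*$", raw, flags=re.MULTILINE)
    if chunks and not chunks[-1].strip():
        chunks.pop()  # ignore a trailing separator
    pages = []
    for chunk in chunks:
        blocks = []
        for m in BLOCK_RE.finditer(chunk):
            content = m.group(6).strip()
            continued = content.endswith(CONTD)
            if continued:
                content = content[: -len(CONTD)].rstrip()
            blocks.append({
                "type": m.group(5),
                # Coordinates are still in the 1000x1000 page space.
                "bbox": [int(v) for v in m.group(1, 2, 3, 4)],
                "content": content,
                "continued": continued,
            })
        pages.append(blocks)
    return pages


def scale_bbox(bbox, page_w, page_h):
    """Map a bbox from the 1000x1000 space back to real page dimensions."""
    x0, y0, x1, y1 = bbox
    return [x0 * page_w / 1000, y0 * page_h / 1000,
            x1 * page_w / 1000, y1 * page_h / 1000]
```

Table blocks still carry `otsl` content after this step; converting that to html is outside the scope of the sketch.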
### some_pdf_middle.json
| Field name     | Description                                                        |
| :------------- | :----------------------------------------------------------------- |
| Field name     | Description                               |
|:---------------|:------------------------------------------|
| pdf_info | list; each element is a dict holding the parse result of one pdf page, see the table below |
| \_parse_type | ocr \| txt, identifies the mode used for the intermediate state of this parse |
| \_version_name | string, the magic-pdf version used for this parse |
| \_backend | pipeline \| vlm, identifies the mode used for the intermediate state of this parse |
| \_version_name | string, the mineru version used for this parse |
<br>
......@@ -323,7 +349,86 @@ The elements stored in para_blocks are block information
]
}
],
"_parse_type": "txt",
"_backend": "pipeline",
"_version_name": "0.6.1"
}
```
### some_pdf_content_list.json
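This file is a JSON array in which each element is a dict; it stores all readable content blocks of the document, flattened in reading order.
content_list can be seen as a simplified middle.json: the content block types are basically the same as in middle.json, but layout information is not included.
The content types are as follows:
| type | desc |
|:---------|:------|
| image | Image |
| table | Table |
| text | Text/title |
| equation | Interline equation |
Note that title and text blocks in content_list are both represented with the text type, and the `text_level` field distinguishes their level: a text block with no `text_level` field or with `text_level` 0 is body text, a block with `text_level` 1 is a first-level heading, a block with `text_level` 2 is a second-level heading, and so on.
Each content item contains a `page_idx` field, the number of the page the block is on, starting from 0.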
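
As an illustration of the `text_level` convention, the sketch below extracts a heading outline from a content_list. The function name `heading_outline` is hypothetical and the sample items are shortened for illustration; a real list would normally be loaded from `some_pdf_content_list.json` with `json.load`.

```python
def heading_outline(content_list):
    """Return (level, page_idx, text) for every heading block."""
    outline = []
    for item in content_list:
        if item.get("type") != "text":
            continue  # images, tables and equations carry no text_level
        # A missing text_level or a value of 0 means body text.
        level = item.get("text_level", 0)
        if level:
            outline.append((level, item["page_idx"], item["text"].strip()))
    return outline


sample = [
    {"type": "text", "text": "The response of flow duration curves to afforestation ",
     "text_level": 1, "page_idx": 0},
    {"type": "text", "text": "Received 1 October 2003 ", "page_idx": 0},
    {"type": "text", "text": "Abstract ", "text_level": 2, "page_idx": 0},
    {"type": "equation", "text": "$$\nQ = f(P)\n$$", "text_format": "latex", "page_idx": 2},
    {"type": "text", "text": "1. Introduction ", "text_level": 2, "page_idx": 1},
]

for level, page_idx, text in heading_outline(sample):
    print(f"{'#' * level} {text} (page {page_idx})")
```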
#### Example data
```json
[
{
"type": "text",
"text": "The response of flow duration curves to afforestation ",
"text_level": 1,
"page_idx": 0
},
{
"type": "text",
"text": "Received 1 October 2003; revised 22 December 2004; accepted 3 January 2005 ",
"page_idx": 0
},
{
"type": "text",
"text": "Abstract ",
"text_level": 2,
"page_idx": 0
},
{
"type": "text",
"text": "The hydrologic effect of replacing pasture or other short crops with trees is reasonably well understood on a mean annual basis. The impact on flow regime, as described by the annual flow duration curve (FDC) is less certain. A method to assess the impact of plantation establishment on FDCs was developed. The starting point for the analyses was the assumption that rainfall and vegetation age are the principal drivers of evapotranspiration. A key objective was to remove the variability in the rainfall signal, leaving changes in streamflow solely attributable to the evapotranspiration of the plantation. A method was developed to (1) fit a model to the observed annual time series of FDC percentiles; i.e. 10th percentile for each year of record with annual rainfall and plantation age as parameters, (2) replace the annual rainfall variation with the long term mean to obtain climate adjusted FDCs, and (3) quantify changes in FDC percentiles as plantations age. Data from 10 catchments from Australia, South Africa and New Zealand were used. The model was able to represent flow variation for the majority of percentiles at eight of the 10 catchments, particularly for the 10–50th percentiles. The adjusted FDCs revealed variable patterns in flow reductions with two types of responses (groups) being identified. Group 1 catchments show a substantial increase in the number of zero flow days, with low flows being more affected than high flows. Group 2 catchments show a more uniform reduction in flows across all percentiles. The differences may be partly explained by storage characteristics. The modelled flow reductions were in accord with published results of paired catchment experiments. An additional analysis was performed to characterise the impact of afforestation on the number of zero flow days $( N _ { \\mathrm { z e r o } } )$ for the catchments in group 1. 
This model performed particularly well, and when adjusted for climate, indicated a significant increase in $N _ { \\mathrm { z e r o } }$ . The zero flow day method could be used to determine change in the occurrence of any given flow in response to afforestation. The methods used in this study proved satisfactory in removing the rainfall variability, and have added useful insight into the hydrologic impacts of plantation establishment. This approach provides a methodology for understanding catchment response to afforestation, where paired catchment data is not available. ",
"page_idx": 0
},
{
"type": "text",
"text": "1. Introduction ",
"text_level": 2,
"page_idx": 1
},
{
"type": "image",
"img_path": "images/a8ecda1c69b27e4f79fce1589175a9d721cbdc1cf78b4cc06a015f3746f6b9d8.jpg",
"img_caption": [
"Fig. 1. Annual flow duration curves of daily flows from Pine Creek, Australia, 1989–2000. "
],
"img_footnote": [],
"page_idx": 1
},
{
"type": "equation",
"img_path": "images/181ea56ef185060d04bf4e274685f3e072e922e7b839f093d482c29bf89b71e8.jpg",
"text": "$$\nQ _ { \\% } = f ( P ) + g ( T )\n$$",
"text_format": "latex",
"page_idx": 2
},
{
"type": "table",
"img_path": "images/e3cb413394a475e555807ffdad913435940ec637873d673ee1b039e3bc3496d0.jpg",
"table_caption": [
"Table 2 Significance of the rainfall and time terms "
],
"table_footnote": [
"indicates that the rainfall term was significant at the $5 \\%$ level, $T$ indicates that the time term was significant at the $5 \\%$ level, \\* represents significance at the $10 \\%$ level, and na denotes too few data points for meaningful analysis. "
],
"table_body": "<html><body><table><tr><td rowspan=\"2\">Site</td><td colspan=\"10\">Percentile</td></tr><tr><td>10</td><td>20</td><td>30</td><td>40</td><td>50</td><td>60</td><td>70</td><td>80</td><td>90</td><td>100</td></tr><tr><td>Traralgon Ck</td><td>P</td><td>P,*</td><td>P</td><td>P</td><td>P,</td><td>P,</td><td>P,</td><td>P,</td><td>P</td><td>P</td></tr><tr><td>Redhill</td><td>P,T</td><td>P,T</td><td>,*</td><td>**</td><td>P.T</td><td>P,*</td><td>P*</td><td>P*</td><td>*</td><td>,*</td></tr><tr><td>Pine Ck</td><td></td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>T</td><td>T</td><td>na</td><td>na</td></tr><tr><td>Stewarts Ck 5</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P.T</td><td>P.T</td><td>P,T</td><td>na</td><td>na</td><td>na</td></tr><tr><td>Glendhu 2</td><td>P</td><td>P,T</td><td>P,*</td><td>P,T</td><td>P.T</td><td>P,ns</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td></tr><tr><td>Cathedral Peak 2</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>*,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td></tr><tr><td>Cathedral Peak 3</td><td>P.T</td><td>P.T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td></tr><tr><td>Lambrechtsbos A</td><td>P,T</td><td>P</td><td>P</td><td>P,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>T</td></tr><tr><td>Lambrechtsbos B</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>T</td></tr><tr><td>Biesievlei</td><td>P,T</td><td>P.T</td><td>P,T</td><td>P,T</td><td>*,T</td><td>*,T</td><td>T</td><td>T</td><td>P,T</td><td>P,T</td></tr></table></body></html>",
"page_idx": 5
}
]
```
\ No newline at end of file
......@@ -318,6 +318,13 @@ class BatchAnalyze:
layout_res_item['score'] = float(f"{ocr_score:.3f}")
if ocr_score < OcrConfidence.min_confidence:
layout_res_item['category_id'] = 16
else:
layout_res_bbox = [layout_res_item['poly'][0], layout_res_item['poly'][1],
layout_res_item['poly'][4], layout_res_item['poly'][5]]
layout_res_width = layout_res_bbox[2] - layout_res_bbox[0]
layout_res_height = layout_res_bbox[3] - layout_res_bbox[1]
if ocr_text in ['(204号', '(20', '(2', '(2号', '(20号'] and ocr_score < 0.8 and layout_res_width < layout_res_height:
layout_res_item['category_id'] = 16
total_processed += len(img_crop_list)
......
# Copyright (c) Opendatalab. All rights reserved.
import os
import time
from loguru import logger
......@@ -151,9 +152,6 @@ def page_model_info_to_page_info(page_model_info, image_dict, page, image_writer
"""Fix the blocks"""
fix_blocks = fix_block_spans(block_with_spans)
"""Merge titles that were split within the same line"""
# merge_title_blocks(fix_blocks)
"""Sort the blocks"""
sorted_blocks = sort_blocks_by_bbox(fix_blocks, page_w, page_h, footnote_blocks)
......@@ -235,7 +233,8 @@ def result_to_middle_json(model_list, images_list, pdf_doc, image_writer, lang=N
"""Free memory"""
pdf_doc.close()
clean_memory(get_device())
if os.getenv('MINERU_DONOT_CLEAN_MEM') is None and len(model_list) >= 10:
clean_memory(get_device())
return middle_json
......
......@@ -365,8 +365,12 @@ def para_split(page_info_list):
for page_info in page_info_list:
page_info['para_blocks'] = []
for block in all_blocks:
if block['page_num'] == page_info['page_idx']:
page_info['para_blocks'].append(block)
if 'page_num' in block:
if block['page_num'] == page_info['page_idx']:
page_info['para_blocks'].append(block)
# Remove the page_num and page_size fields from the block, as they are no longer needed
del block['page_num']
del block['page_size']
if __name__ == '__main__':
......
......@@ -75,9 +75,9 @@ def doc_analyze(
):
"""
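Increasing MIN_BATCH_INFERENCE_SIZE moderately can improve performance but may increase GPU memory usage.
It can be set via the environment variable MINERU_MIN_BATCH_INFERENCE_SIZE; the default is 100
It can be set via the environment variable MINERU_MIN_BATCH_INFERENCE_SIZE; the default is 128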
"""
min_batch_inference_size = int(os.environ.get('MINERU_MIN_BATCH_INFERENCE_SIZE', 100))
min_batch_inference_size = int(os.environ.get('MINERU_MIN_BATCH_INFERENCE_SIZE', 128))
# Collect information for all pages
all_pages_info = [] # stores (dataset_index, page_index, img, ocr, lang, width, height)
......
from mineru.utils.boxbase import bbox_relative_pos, calculate_iou, bbox_distance, is_in
from mineru.utils.boxbase import bbox_relative_pos, calculate_iou, bbox_distance, is_in, get_minbox_if_overlap_by_ratio
from mineru.utils.enum_class import CategoryId, ContentType
......@@ -13,7 +13,62 @@ class MagicModel:
self.__fix_by_remove_low_confidence()
"""Among detections with high IoU (>0.9), remove the one with the lower confidence"""
self.__fix_by_remove_high_iou_and_low_confidence()
"""Reclassify some table_footnote items as image_footnote"""
self.__fix_footnote()
"""Handle overlapping image_body and table_body blocks"""
self.__fix_by_remove_overlap_image_table_body()
def __fix_by_remove_overlap_image_table_body(self):
need_remove_list = []
layout_dets = self.__page_model_info['layout_dets']
image_blocks = list(filter(
lambda x: x['category_id'] == CategoryId.ImageBody, layout_dets
))
table_blocks = list(filter(
lambda x: x['category_id'] == CategoryId.TableBody, layout_dets
))
def add_need_remove_block(blocks):
for i in range(len(blocks)):
for j in range(i + 1, len(blocks)):
block1 = blocks[i]
block2 = blocks[j]
overlap_box = get_minbox_if_overlap_by_ratio(
block1['bbox'], block2['bbox'], 0.8
)
if overlap_box is not None:
# Determine which block has the smaller area and remove the smaller one
area1 = (block1['bbox'][2] - block1['bbox'][0]) * (block1['bbox'][3] - block1['bbox'][1])
area2 = (block2['bbox'][2] - block2['bbox'][0]) * (block2['bbox'][3] - block2['bbox'][1])
if area1 <= area2:
block_to_remove = block1
large_block = block2
else:
block_to_remove = block2
large_block = block1
if block_to_remove not in need_remove_list:
# Expand the larger block's bounding box
x1, y1, x2, y2 = large_block['bbox']
sx1, sy1, sx2, sy2 = block_to_remove['bbox']
x1 = min(x1, sx1)
y1 = min(y1, sy1)
x2 = max(x2, sx2)
y2 = max(y2, sy2)
large_block['bbox'] = [x1, y1, x2, y2]
need_remove_list.append(block_to_remove)
# Handle image-image overlaps
add_need_remove_block(image_blocks)
# Handle table-table overlaps
add_need_remove_block(table_blocks)
# Remove the marked blocks from the layout
for need_remove in need_remove_list:
if need_remove in layout_dets:
layout_dets.remove(need_remove)
def __fix_axis(self):
need_remove_list = []
......@@ -46,42 +101,46 @@ class MagicModel:
def __fix_by_remove_high_iou_and_low_confidence(self):
need_remove_list = []
layout_dets = self.__page_model_info['layout_dets']
for layout_det1 in layout_dets:
for layout_det2 in layout_dets:
if layout_det1 == layout_det2:
continue
if layout_det1['category_id'] in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] and layout_det2['category_id'] in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]:
if (
calculate_iou(layout_det1['bbox'], layout_det2['bbox'])
> 0.9
):
if layout_det1['score'] < layout_det2['score']:
layout_det_need_remove = layout_det1
else:
layout_det_need_remove = layout_det2
layout_dets = list(filter(
lambda x: x['category_id'] in [
CategoryId.Title,
CategoryId.Text,
CategoryId.ImageBody,
CategoryId.ImageCaption,
CategoryId.TableBody,
CategoryId.TableCaption,
CategoryId.TableFootnote,
CategoryId.InterlineEquation_Layout,
CategoryId.InterlineEquationNumber_Layout,
], self.__page_model_info['layout_dets']
)
)
for i in range(len(layout_dets)):
for j in range(i + 1, len(layout_dets)):
layout_det1 = layout_dets[i]
layout_det2 = layout_dets[j]
if calculate_iou(layout_det1['bbox'], layout_det2['bbox']) > 0.9:
layout_det_need_remove = layout_det1 if layout_det1['score'] < layout_det2['score'] else layout_det2
if layout_det_need_remove not in need_remove_list:
need_remove_list.append(layout_det_need_remove)
if layout_det_need_remove not in need_remove_list:
need_remove_list.append(layout_det_need_remove)
else:
continue
else:
continue
for need_remove in need_remove_list:
layout_dets.remove(need_remove)
self.__page_model_info['layout_dets'].remove(need_remove)
def __fix_footnote(self):
# 3: figure, 5: table, 7: footnote
footnotes = []
figures = []
tables = []
for obj in self.__page_model_info['layout_dets']:
if obj['category_id'] == 7:
if obj['category_id'] == CategoryId.TableFootnote:
footnotes.append(obj)
elif obj['category_id'] == 3:
elif obj['category_id'] == CategoryId.ImageBody:
figures.append(obj)
elif obj['category_id'] == 5:
elif obj['category_id'] == CategoryId.TableBody:
tables.append(obj)
if len(footnotes) * len(figures) == 0:
continue
......@@ -314,10 +373,10 @@ class MagicModel:
def get_imgs(self):
with_captions = self.__tie_up_category_by_distance_v3(
3, 4
CategoryId.ImageBody, CategoryId.ImageCaption
)
with_footnotes = self.__tie_up_category_by_distance_v3(
3, CategoryId.ImageFootnote
CategoryId.ImageBody, CategoryId.ImageFootnote
)
ret = []
for v in with_captions:
......@@ -333,10 +392,10 @@ class MagicModel:
def get_tables(self) -> list:
with_captions = self.__tie_up_category_by_distance_v3(
5, 6
CategoryId.TableBody, CategoryId.TableCaption
)
with_footnotes = self.__tie_up_category_by_distance_v3(
5, 7
CategoryId.TableBody, CategoryId.TableFootnote
)
ret = []
for v in with_captions:
......@@ -385,20 +444,21 @@ class MagicModel:
all_spans = []
layout_dets = self.__page_model_info['layout_dets']
allow_category_id_list = [3, 5, 13, 14, 15]
allow_category_id_list = [
CategoryId.ImageBody,
CategoryId.TableBody,
CategoryId.InlineEquation,
CategoryId.InterlineEquation_YOLO,
CategoryId.OcrText,
]
"""Categories that are assembled as spans"""
# 3: 'image', # image
# 5: 'table', # table
# 13: 'inline_equation', # inline equation
# 14: 'interline_equation', # interline equation
# 15: 'text', # OCR-recognized text
for layout_det in layout_dets:
category_id = layout_det['category_id']
if category_id in allow_category_id_list:
span = {'bbox': layout_det['bbox'], 'score': layout_det['score']}
if category_id == 3:
if category_id == CategoryId.ImageBody:
span['type'] = ContentType.IMAGE
elif category_id == 5:
elif category_id == CategoryId.TableBody:
# Get the table model result
latex = layout_det.get('latex', None)
html = layout_det.get('html', None)
......@@ -407,13 +467,13 @@ class MagicModel:
elif html:
span['html'] = html
span['type'] = ContentType.TABLE
elif category_id == 13:
elif category_id == CategoryId.InlineEquation:
span['content'] = layout_det['latex']
span['type'] = ContentType.INLINE_EQUATION
elif category_id == 14:
elif category_id == CategoryId.InterlineEquation_YOLO:
span['content'] = layout_det['latex']
span['type'] = ContentType.INTERLINE_EQUATION
elif category_id == 15:
elif category_id == CategoryId.OcrText:
span['content'] = layout_det['text']
span['type'] = ContentType.TEXT
all_spans.append(span)
......@@ -438,4 +498,4 @@ class MagicModel:
for col in extra_col:
block[col] = item.get(col, None)
blocks.append(block)
return blocks
return blocks
\ No newline at end of file
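
The overlap handling added to `MagicModel` in this file can be sketched in isolation. This is a minimal, self-contained approximation rather than the project's code: `overlap_over_smaller` is a hypothetical stand-in for `get_minbox_if_overlap_by_ratio` (whose body is not shown in this diff), assumed here to compare the intersection area with the area of the smaller box, and blocks are compared by identity instead of equality.

```python
def area(bbox):
    """Area of an [x0, y0, x1, y1] box; degenerate boxes count as 0."""
    return max(0, bbox[2] - bbox[0]) * max(0, bbox[3] - bbox[1])


def overlap_over_smaller(a, b):
    """Intersection area divided by the area of the smaller box (assumed metric)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min(area(a), area(b))
    return (iw * ih) / smaller if smaller > 0 else 0.0


def merge_overlapping(blocks, ratio=0.8):
    """Drop the smaller block of each heavily overlapping pair and grow the
    larger block's bbox to cover both. Mutates the kept blocks' 'bbox'."""
    removed = []
    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            b1, b2 = blocks[i], blocks[j]
            if overlap_over_smaller(b1["bbox"], b2["bbox"]) <= ratio:
                continue
            if area(b1["bbox"]) <= area(b2["bbox"]):
                small, large = b1, b2
            else:
                small, large = b2, b1
            if not any(small is r for r in removed):
                lx0, ly0, lx1, ly1 = large["bbox"]
                sx0, sy0, sx1, sy1 = small["bbox"]
                large["bbox"] = [min(lx0, sx0), min(ly0, sy0),
                                 max(lx1, sx1), max(ly1, sy1)]
                removed.append(small)
    return [b for b in blocks if not any(b is r for r in removed)]
```

As in the diff, the function would be applied separately to image blocks and to table blocks, so an image never absorbs a table or the reverse.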
......@@ -157,9 +157,11 @@ def merge_para_with_text(para_block):
if span_type == ContentType.TEXT:
content = escape_special_markdown_char(span['content'])
elif span_type == ContentType.INLINE_EQUATION:
content = f"{inline_left_delimiter}{span['content']}{inline_right_delimiter}"
if span.get('content', ''):
content = f"{inline_left_delimiter}{span['content']}{inline_right_delimiter}"
elif span_type == ContentType.INTERLINE_EQUATION:
content = f"\n{display_left_delimiter}\n{span['content']}\n{display_right_delimiter}\n"
if span.get('content', ''):
content = f"\n{display_left_delimiter}\n{span['content']}\n{display_right_delimiter}\n"
content = content.strip()
......@@ -191,12 +193,12 @@ def make_blocks_to_content_list(para_block, img_buket_path, page_idx):
para_content = {}
if para_type in [BlockType.TEXT, BlockType.LIST, BlockType.INDEX]:
para_content = {
'type': 'text',
'type': ContentType.TEXT,
'text': merge_para_with_text(para_block),
}
elif para_type == BlockType.TITLE:
para_content = {
'type': 'text',
'type': ContentType.TEXT,
'text': merge_para_with_text(para_block),
}
title_level = get_title_level(para_block)
......@@ -206,14 +208,14 @@ def make_blocks_to_content_list(para_block, img_buket_path, page_idx):
if len(para_block['lines']) == 0 or len(para_block['lines'][0]['spans']) == 0:
return None
para_content = {
'type': 'equation',
'type': ContentType.EQUATION,
'img_path': f"{img_buket_path}/{para_block['lines'][0]['spans'][0].get('image_path', '')}",
}
if para_block['lines'][0]['spans'][0].get('content', ''):
para_content['text'] = merge_para_with_text(para_block)
para_content['text_format'] = 'latex'
elif para_type == BlockType.IMAGE:
para_content = {'type': 'image', 'img_path': '', 'img_caption': [], 'img_footnote': []}
para_content = {'type': ContentType.IMAGE, 'img_path': '', BlockType.IMAGE_CAPTION: [], BlockType.IMAGE_FOOTNOTE: []}
for block in para_block['blocks']:
if block['type'] == BlockType.IMAGE_BODY:
for line in block['lines']:
......@@ -222,29 +224,26 @@ def make_blocks_to_content_list(para_block, img_buket_path, page_idx):
if span.get('image_path', ''):
para_content['img_path'] = f"{img_buket_path}/{span['image_path']}"
if block['type'] == BlockType.IMAGE_CAPTION:
para_content['img_caption'].append(merge_para_with_text(block))
para_content[BlockType.IMAGE_CAPTION].append(merge_para_with_text(block))
if block['type'] == BlockType.IMAGE_FOOTNOTE:
para_content['img_footnote'].append(merge_para_with_text(block))
para_content[BlockType.IMAGE_FOOTNOTE].append(merge_para_with_text(block))
elif para_type == BlockType.TABLE:
para_content = {'type': 'table', 'img_path': '', 'table_caption': [], 'table_footnote': []}
para_content = {'type': ContentType.TABLE, 'img_path': '', BlockType.TABLE_CAPTION: [], BlockType.TABLE_FOOTNOTE: []}
for block in para_block['blocks']:
if block['type'] == BlockType.TABLE_BODY:
for line in block['lines']:
for span in line['spans']:
if span['type'] == ContentType.TABLE:
if span.get('latex', ''):
para_content['table_body'] = f"{span['latex']}"
elif span.get('html', ''):
para_content['table_body'] = f"{span['html']}"
if span.get('html', ''):
para_content[BlockType.TABLE_BODY] = f"{span['html']}"
if span.get('image_path', ''):
para_content['img_path'] = f"{img_buket_path}/{span['image_path']}"
if block['type'] == BlockType.TABLE_CAPTION:
para_content['table_caption'].append(merge_para_with_text(block))
para_content[BlockType.TABLE_CAPTION].append(merge_para_with_text(block))
if block['type'] == BlockType.TABLE_FOOTNOTE:
para_content['table_footnote'].append(merge_para_with_text(block))
para_content[BlockType.TABLE_FOOTNOTE].append(merge_para_with_text(block))
para_content['page_idx'] = page_idx
......
......@@ -77,7 +77,7 @@ def get_predictor(
raise ImportError(
"sglang is not installed, so sglang-engine backend cannot be used. "
"If you need to use sglang-engine backend for inference, "
"please install sglang[all]==0.4.7 or a newer version."
"please install sglang[all]==0.4.8 or a newer version."
)
predictor = SglangEnginePredictor(
server_args=ServerArgs(model_path, **kwargs),
......
import re
import time
import cv2
import numpy as np
from loguru import logger
from mineru.backend.pipeline.model_init import AtomModelSingleton
from mineru.utils.config_reader import get_llm_aided_config
from mineru.utils.cut_image import cut_image_and_table
from mineru.utils.enum_class import BlockType, ContentType
from mineru.utils.enum_class import ContentType
from mineru.utils.hash_utils import str_md5
from mineru.backend.vlm.vlm_magic_model import MagicModel
from mineru.utils.llm_aided import llm_aided_title
from mineru.utils.pdf_image_tools import get_crop_img
from mineru.version import __version__
......@@ -23,6 +30,34 @@ def token_to_page_info(token, image_dict, page, image_writer, page_index) -> dic
image_blocks = magic_model.get_image_blocks()
table_blocks = magic_model.get_table_blocks()
title_blocks = magic_model.get_title_blocks()
# If title optimization is enabled, crop each title block and run OCR detection on it
llm_aided_config = get_llm_aided_config()
if llm_aided_config is not None:
title_aided_config = llm_aided_config.get('title_aided', None)
if title_aided_config is not None:
if title_aided_config.get('enable', False):
atom_model_manager = AtomModelSingleton()
ocr_model = atom_model_manager.get_atom_model(
atom_model_name='ocr',
ocr_show_log=False,
det_db_box_thresh=0.3,
lang='ch_lite'
)
for title_block in title_blocks:
title_pil_img = get_crop_img(title_block['bbox'], page_pil_img, scale)
title_np_img = np.array(title_pil_img)
# Add 50 pixels of white padding on every side of the title image
title_np_img = cv2.copyMakeBorder(
title_np_img, 50, 50, 50, 50, cv2.BORDER_CONSTANT, value=[255, 255, 255]
)
title_img = cv2.cvtColor(title_np_img, cv2.COLOR_RGB2BGR)
ocr_det_res = ocr_model.ocr(title_img, rec=False)[0]
if len(ocr_det_res) > 0:
# Compute the average height of all detected boxes
avg_height = np.mean([box[2][1] - box[0][1] for box in ocr_det_res])
title_block['line_avg_height'] = round(avg_height/scale)
text_blocks = magic_model.get_text_blocks()
interline_equation_blocks = magic_model.get_interline_equation_blocks()
......@@ -48,6 +83,19 @@ def result_to_middle_json(token_list, images_list, pdf_doc, image_writer):
image_dict = images_list[index]
page_info = token_to_page_info(token, image_dict, page, image_writer, index)
middle_json["pdf_info"].append(page_info)
"""LLM-aided optimization"""
llm_aided_config = get_llm_aided_config()
if llm_aided_config is not None:
"""Title optimization"""
title_aided_config = llm_aided_config.get('title_aided', None)
if title_aided_config is not None:
if title_aided_config.get('enable', False):
llm_aided_title_start_time = time.time()
llm_aided_title(middle_json["pdf_info"], title_aided_config)
logger.info(f'llm aided title time: {round(time.time() - llm_aided_title_start_time, 2)}')
# Close the pdf document
pdf_doc.close()
return middle_json
......
......@@ -25,6 +25,7 @@ class ModelSingleton:
backend: str,
model_path: str | None,
server_url: str | None,
**kwargs,
) -> BasePredictor:
key = (backend, model_path, server_url)
if key not in self._models:
......@@ -34,6 +35,7 @@ class ModelSingleton:
backend=backend,
model_path=model_path,
server_url=server_url,
**kwargs,
)
return self._models[key]
......@@ -45,9 +47,10 @@ def doc_analyze(
backend="transformers",
model_path: str | None = None,
server_url: str | None = None,
**kwargs,
):
if predictor is None:
predictor = ModelSingleton().get_model(backend, model_path, server_url)
predictor = ModelSingleton().get_model(backend, model_path, server_url, **kwargs)
# load_images_start = time.time()
images_list, pdf_doc = load_images_from_pdf(pdf_bytes)
......@@ -71,19 +74,20 @@ async def aio_doc_analyze(
backend="transformers",
model_path: str | None = None,
server_url: str | None = None,
**kwargs,
):
if predictor is None:
predictor = ModelSingleton().get_model(backend, model_path, server_url)
predictor = ModelSingleton().get_model(backend, model_path, server_url, **kwargs)
load_images_start = time.time()
# load_images_start = time.time()
images_list, pdf_doc = load_images_from_pdf(pdf_bytes)
images_base64_list = [image_dict["img_base64"] for image_dict in images_list]
load_images_time = round(time.time() - load_images_start, 2)
logger.info(f"load images cost: {load_images_time}, speed: {round(len(images_base64_list)/load_images_time, 3)} images/s")
# load_images_time = round(time.time() - load_images_start, 2)
# logger.info(f"load images cost: {load_images_time}, speed: {round(len(images_base64_list)/load_images_time, 3)} images/s")
infer_start = time.time()
# infer_start = time.time()
results = await predictor.aio_batch_predict(images=images_base64_list)
infer_time = round(time.time() - infer_start, 2)
logger.info(f"infer finished, cost: {infer_time}, speed: {round(len(results)/infer_time, 3)} page/s")
# infer_time = round(time.time() - infer_start, 2)
# logger.info(f"infer finished, cost: {infer_time}, speed: {round(len(results)/infer_time, 3)} page/s")
middle_json = result_to_middle_json(results, images_list, pdf_doc, image_writer)
return middle_json
return middle_json, results