Commit 7d2dfc80 authored by liukaiwen

Merge branch 'dev' into dev-table-model-update

parents a0eff3be 6d571e2e
*.tar
*.tar.gz
*.zip
venv*/
envs/
slurm_logs/
sync1.sh
data_preprocess_pj1
data-preparation1
__pycache__
*.log
*.pyc
.vscode
debug/
*.ipynb
.idea
# vscode history
.history
.DS_Store
.env
bad_words/
bak/
app/tests/*
temp/
tmp/
tmp
.vscode
.vscode/
ocr_demo
.coveragerc
/app/common/__init__.py
/magic_pdf/config/__init__.py
source.dev.env
tmp
projects/web/node_modules
projects/web/dist
projects/web_demo/web_demo/static/
cli_debug/
debug_utils/
# sphinx docs
_build/
...@@ -3,7 +3,7 @@ repos:
    rev: 5.0.4
    hooks:
      - id: flake8
        args: ["--max-line-length=150", "--ignore=E131,E125,W503,W504,E203"]
  - repo: https://github.com/PyCQA/isort
    rev: 5.11.5
    hooks:
...@@ -12,11 +12,12 @@ repos:
    rev: v0.32.0
    hooks:
      - id: yapf
        args: ["--style={based_on_style: google, column_limit: 150, indent_width: 4}"]
  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.1
    hooks:
      - id: codespell
        args: ['--skip', '*.json']
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.3.0
    hooks:
......
...@@ -41,6 +41,17 @@
</div>
# Changelog
- 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
- Refactored the sorting module code to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading order sorting, ensuring high accuracy in various layouts.
- Refactored the paragraph concatenation module to achieve good results in cross-column, cross-page, cross-figure, and cross-table scenarios.
- Refactored the list and table of contents recognition functions, significantly improving the accuracy of list blocks and table of contents blocks, as well as the parsing of corresponding text paragraphs.
- Refactored the matching logic for figures, tables, and descriptive text, greatly enhancing the accuracy of matching captions and footnotes to figures and tables, and reducing the loss rate of descriptive text to zero.
  - Added multi-language support for OCR, supporting detection and recognition of 84 languages. For the list of supported languages, see [OCR Language Support List](https://paddlepaddle.github.io/PaddleOCR/latest/en/ppocr/blog/multi_languages.html#5-support-languages-and-abbreviations).
- Added memory recycling logic and other memory optimization measures, significantly reducing memory usage. The memory requirement for enabling all acceleration features except table acceleration (layout/formula/OCR) has been reduced from 16GB to 8GB, and the memory requirement for enabling all acceleration features has been reduced from 24GB to 10GB.
- Optimized configuration file feature switches, adding an independent formula detection switch to significantly improve speed and parsing results when formula detection is not needed.
- Integrated [PDF-Extract-Kit 1.0](https://github.com/opendatalab/PDF-Extract-Kit):
- Added the self-developed `doclayout_yolo` model, which speeds up processing by more than 10 times compared to the original solution while maintaining similar parsing effects, and can be freely switched with `layoutlmv3` via the configuration file.
- Upgraded formula parsing to `unimernet 0.2.1`, improving formula parsing accuracy while significantly reducing memory usage.
- 2024/09/27 Version 0.8.1 released, fixing some bugs and providing a [localized deployment version](projects/web_demo/README.md) of the [online demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/) and the [front-end interface](projects/web/README.md).
- 2024/09/09: Version 0.8.0 released, supporting fast deployment with Dockerfile, and launching demos on Huggingface and Modelscope.
- 2024/08/30: Version 0.7.1 released, adding the paddle tablemaster table recognition option.
...@@ -69,6 +80,7 @@
<ul>
<li><a href="#command-line">Command Line</a></li>
<li><a href="#api">API</a></li>
<li><a href="#deploy-derived-projects">Deploy Derived Projects</a></li>
<li><a href="#development-guide">Development Guide</a></li>
</ul>
</li>
...@@ -100,15 +112,18 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
## Key Features
- Remove headers, footers, footnotes, page numbers, etc., to ensure semantic coherence.
- Output text in human-readable order, suitable for single-column, multi-column, and complex layouts.
- Preserve the structure of the original document, including headings, paragraphs, lists, etc.
- Extract images, image descriptions, tables, table titles, and footnotes.
- Automatically recognize and convert formulas in the document to LaTeX format.
- Automatically recognize and convert tables in the document to LaTeX or HTML format.
- Automatically detect scanned PDFs and garbled PDFs and enable OCR functionality.
- OCR supports detection and recognition of 84 languages.
- Supports multiple output formats, such as multimodal and NLP Markdown, JSON sorted by reading order, and rich intermediate formats.
- Supports various visualization results, including layout visualization and span visualization, for efficient confirmation of output quality.
- Supports both CPU and GPU environments.
- Compatible with Windows, Linux, and Mac platforms.
## Quick Start
...@@ -139,8 +154,8 @@ In non-mainline environments, due to the diversity of hardware and software conf
</tr>
<tr>
<td colspan="3">CPU</td>
<td>x86_64 (ARM Linux not supported)</td>
<td>x86_64 (ARM Windows not supported)</td>
<td>x86_64 / arm64</td>
</tr>
<tr>
...@@ -149,7 +164,7 @@ In non-mainline environments, due to the diversity of hardware and software conf
</tr>
<tr>
<td colspan="3">Python Version</td>
<td colspan="3">3.10 (Please make sure to create a Python 3.10 virtual environment using conda)</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver Version</td>
...@@ -166,21 +181,24 @@ In non-mainline environments, due to the diversity of hardware and software conf
<tr>
<td rowspan="2">GPU Hardware Support List</td>
<td colspan="2">Minimum Requirement 8G+ VRAM</td>
<td colspan="2">3060ti/3070/4060<br>
8G VRAM enables layout, formula recognition, and OCR acceleration</td>
<td rowspan="2">None</td>
</tr>
<tr>
<td colspan="2">Recommended Configuration 10G+ VRAM</td>
<td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070ti super/4080/4090<br>
10G VRAM or more can enable layout, formula recognition, OCR, and table recognition acceleration simultaneously
</td>
</tr>
</table>
### Online Demo
Stable Version (Stable version verified by QA):
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMzAiIGhlaWdodD0iMzAiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgZmlsbD0ibm9uZSI+CiA8ZGVmcz4KICA8bGluZWFyR3JhZGllbnQgeTI9IjAuNTMzNjciIHgyPSIxLjAwMDQiIHkxPSIwLjI5MjE5IiB4MT0iLTAuMTEyNjgiIGlkPSJhIj4KICAgPHN0b3Agc3RvcC1jb2xvcj0iIzE1NDNGRSIvPgogICA8c3RvcCBzdG9wLWNvbG9yPSIjOEM0NkZGIiBvZmZzZXQ9IjEiLz4KICA8L2xpbmVhckdyYWRpZW50PgogIDxsaW5lYXJHcmFkaWVudCB5Mj0iMC41OTc1NyIgeDI9IjEuMDExMzciIHkxPSIwLjExMDIzIiB4MT0iLTAuMDg0NzQiIGlkPSJiIj4KICAgPHN0b3Agc3RvcC1jb2xvcj0iIzE1NDNGRSIvPgogICA8c3RvcCBzdG9wLWNvbG9yPSIjOEM0NkZGIiBvZmZzZXQ9IjEiLz4KICA8L2xpbmVhckdyYWRpZW50PgogPC9kZWZzPgogPGc+CiAgPHRpdGxlPkxheWVyIDE8L3RpdGxlPgogIDxwYXRoIGlkPSJzdmdfMSIgZmlsbD0idXJsKCNhKSIgZD0ibTEuNjIzLDEyLjA2N2EwLjQ4NCwwLjQ4NCAwIDAgMSAwLjA3LC0wLjM4NGw1LjMxLC03Ljg5NWMwLjA2OCwtMC4xIDAuMTcsLTAuMTcyIDAuMjg4LC0wLjJsMTQuMzc3LC0zLjQ3NGEwLjQ4NCwwLjQ4NCAwIDAgMSAwLjU4NCwwLjM1N2wzLjY2MiwxNS4xNTJjMS40NzcsNi4xMTQgLTIuMjgxLDEyLjI2NyAtOC4zOTQsMTMuNzQ1Yy02LjExNCwxLjQ3NyAtMTIuMjY3LC0yLjI4MSAtMTMuNzQ1LC04LjM5NWwtMi4xNTIsLTguOTA2eiIgb3BhY2l0eT0iMC40Ii8+CiAgPHBhdGggaWQ9InN2Z18yIiBmaWxsPSJ1cmwoI2IpIiBkPSJtNS44MjYsOC42NzNjMCwtMC4xMzYgMC4wNTcsLTAuMjY2IDAuMTU3LC0wLjM1OGw3LjAxNywtNi40MjVhMC40ODQsMC40ODQgMCAwIDEgMC4zMjcsLTAuMTI3bDE0Ljc5LDBjMC4yNjgsMCAwLjQ4NSwwLjIxNiAwLjQ4NSwwLjQ4NGwwLDE1LjU4OWMwLDYuMjkgLTUuMDk5LDExLjM4OCAtMTEuMzg4LDExLjM4OGMtNi4yOSwwIC0xMS4zODgsLTUuMDk5IC0xMS4zODgsLTExLjM4OGwwLC05LjE2M3oiLz4KICA8cGF0aCBpZD0ic3ZnXzMiIGZpbGw9IiM1RDc2RkYiIGQ9Im0xMi4zMzEsOC43NTNsLTYuMzgzLC0wLjM5OGw3LjEyMiwtNi41MmwwLjI5OSw1Ljg5MmEwLjk3OCwwLjk3OCAwIDAgMSAtMS4wMzgsMS4wMjZ6Ii8+CiAgPHBhdGggaWQ9InN2Z180IiBmaWxsPSIjMDAyOEZEIiBkPSJtMjAuNDE2LDE1LjAyMmwwLDEuNzExYTIuNDA0LDIuNDA0IDAgMCAxIC00LjgwOCwwbDAsLTQuMjc4bC0yLjgxLDBsMCw0LjY4NmE1LjIxNSw1LjIxNSAwIDEgMCAxMC40MywwbDAsLTQuNjg2bDAsMi41NjdsLTIuODEyLDB6IiBjbGlwLXJ1bGU9ImV2ZW5vZGQiIGZpbGwtcnVsZT0iZXZlbm9kZCIvPgogIDxwYXRoIGlkPSJzdmdfNSIgZmlsbD0iIzAwMjhGRCIgZD0ibTIzLjIyOCwxMy44ODFsMS4xNCwwbDAsMS4xNDFsLTEuMTQsMGwwLC0xLjE0bDAsLTAuMDAxem0tMi44MTIsLTAuNjkybDEuODM0LDBsMCwxLjgzM2wtMS44MzQsMGwwLC0xLjgzMmwwLC0wLjAwMXptMS44MzQsLTAuOTc5bDAuOTc4LDBsMCwwLjk3OWwtMC45NzgsMGwwLC0wLjk3OGwwLC0wLjAwMXptMS41NDgsLTEuNjI5bDAuNjExLDBsMCwwLjYxMWwtMC42MTEsMGwwLC0wLjYxMXoiLz4KICA8cGF0aCBpZD0ic3ZnXzYiIGZpbGw9IiNmZmYiIGQ9Im0yMC4wODYsMTQuOTEybDAsMS43MTFhMi40MDQsMi40MDQgMCAxIDEgLTQuODA3LDBsMCwtNC4yNzhsLTIuODEyLDBsMCw0LjY4NmE1LjIxNSw1LjIxNSAwIDAgMCAxMC40MywwbDAsLTQuNjg2bDAsMi41NjdsLTIuODEsMGwtMC4wMDEsMHoiIGNsaXAtcnVsZT0iZXZlbm9kZCIgZmlsbC1ydWxlPSJldmVub2RkIi8+CiAgPHBhdGggaWQ9InN2Z183IiBmaWxsPSIjZmZmIiBkPSJtMjIuODk4LDEzLjc3MWwxLjE0LDBsMCwxLjE0MWwtMS4xNCwwbDAsLTEuMTRsMCwtMC4wMDF6bS0yLjgxMiwtMC42OTJsMS44MzQsMGwwLDEuODMzbC0xLjgzNCwwbDAsLTEuODMybDAsLTAuMDAxem0xLjgzNCwtMC45NzlsMC45NzgsMGwwLDAuOTc5bC0wLjk3OCwwbDAsLTAuOTc5em0xLjU0OCwtMS42MjlsMC42MTEsMGwwLDAuNjExbC0wLjYxLDBsMCwtMC42MWwtMC4wMDEsLTAuMDAxeiIvPgogPC9nPgo8L3N2Zz4=&labelColor=white)](https://opendatalab.com/OpenSourceTools/Extractor/PDF) 
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMzAiIGhlaWdodD0iMzAiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgZmlsbD0ibm9uZSI+CiA8ZGVmcz4KICA8bGluZWFyR3JhZGllbnQgeTI9IjAuNTMzNjciIHgyPSIxLjAwMDQiIHkxPSIwLjI5MjE5IiB4MT0iLTAuMTEyNjgiIGlkPSJhIj4KICAgPHN0b3Agc3RvcC1jb2xvcj0iIzE1NDNGRSIvPgogICA8c3RvcCBzdG9wLWNvbG9yPSIjOEM0NkZGIiBvZmZzZXQ9IjEiLz4KICA8L2xpbmVhckdyYWRpZW50PgogIDxsaW5lYXJHcmFkaWVudCB5Mj0iMC41OTc1NyIgeDI9IjEuMDExMzciIHkxPSIwLjExMDIzIiB4MT0iLTAuMDg0NzQiIGlkPSJiIj4KICAgPHN0b3Agc3RvcC1jb2xvcj0iIzE1NDNGRSIvPgogICA8c3RvcCBzdG9wLWNvbG9yPSIjOEM0NkZGIiBvZmZzZXQ9IjEiLz4KICA8L2xpbmVhckdyYWRpZW50PgogPC9kZWZzPgogPGc+CiAgPHRpdGxlPkxheWVyIDE8L3RpdGxlPgogIDxwYXRoIGlkPSJzdmdfMSIgZmlsbD0idXJsKCNhKSIgZD0ibTEuNjIzLDEyLjA2N2EwLjQ4NCwwLjQ4NCAwIDAgMSAwLjA3LC0wLjM4NGw1LjMxLC03Ljg5NWMwLjA2OCwtMC4xIDAuMTcsLTAuMTcyIDAuMjg4LC0wLjJsMTQuMzc3LC0zLjQ3NGEwLjQ4NCwwLjQ4NCAwIDAgMSAwLjU4NCwwLjM1N2wzLjY2MiwxNS4xNTJjMS40NzcsNi4xMTQgLTIuMjgxLDEyLjI2NyAtOC4zOTQsMTMuNzQ1Yy02LjExNCwxLjQ3NyAtMTIuMjY3LC0yLjI4MSAtMTMuNzQ1LC04LjM5NWwtMi4xNTIsLTguOTA2eiIgb3BhY2l0eT0iMC40Ii8+CiAgPHBhdGggaWQ9InN2Z18yIiBmaWxsPSJ1cmwoI2IpIiBkPSJtNS44MjYsOC42NzNjMCwtMC4xMzYgMC4wNTcsLTAuMjY2IDAuMTU3LC0wLjM1OGw3LjAxNywtNi40MjVhMC40ODQsMC40ODQgMCAwIDEgMC4zMjcsLTAuMTI3bDE0Ljc5LDBjMC4yNjgsMCAwLjQ4NSwwLjIxNiAwLjQ4NSwwLjQ4NGwwLDE1LjU4OWMwLDYuMjkgLTUuMDk5LDExLjM4OCAtMTEuMzg4LDExLjM4OGMtNi4yOSwwIC0xMS4zODgsLTUuMDk5IC0xMS4zODgsLTExLjM4OGwwLC05LjE2M3oiLz4KICA8cGF0aCBpZD0ic3ZnXzMiIGZpbGw9IiM1RDc2RkYiIGQ9Im0xMi4zMzEsOC43NTNsLTYuMzgzLC0wLjM5OGw3LjEyMiwtNi41MmwwLjI5OSw1Ljg5MmEwLjk3OCwwLjk3OCAwIDAgMSAtMS4wMzgsMS4wMjZ6Ii8+CiAgPHBhdGggaWQ9InN2Z180IiBmaWxsPSIjMDAyOEZEIiBkPSJtMjAuNDE2LDE1LjAyMmwwLDEuNzExYTIuNDA0LDIuNDA0IDAgMCAxIC00LjgwOCwwbDAsLTQuMjc4bC0yLjgxLDBsMCw0LjY4NmE1LjIxNSw1LjIxNSAwIDEgMCAxMC40MywwbDAsLTQuNjg2bDAsMi41NjdsLTIuODEyLDB6IiBjbGlwLXJ1bGU9ImV2ZW5vZGQiIGZpbGwtcnVsZT0iZXZlbm9kZCIvPgogIDxwYXRoIGlkPSJzdmdfNSIgZmlsbD0iIzAwMjhGRCIgZD0ibTIzLjIyOCwxMy44ODFsMS4xNCwwbDAsMS4xNDFsLTEuMTQsMGwwLC0xLjE0bDAsLTAuMDAxem0tMi44MTIsLTAuNjkybDEuODM0LDBsMCwxLjgzM2wtMS44MzQsMGwwLC0xLjgzMmwwLC0wLjAwMXptMS44MzQsLTAuOTc5bDAuOTc4LDBsMCwwLjk3OWwtMC45NzgsMGwwLC0wLjk3OGwwLC0wLjAwMXptMS41NDgsLTEuNjI5bDAuNjExLDBsMCwwLjYxMWwtMC42MTEsMGwwLC0wLjYxMXoiLz4KICA8cGF0aCBpZD0ic3ZnXzYiIGZpbGw9IiNmZmYiIGQ9Im0yMC4wODYsMTQuOTEybDAsMS43MTFhMi40MDQsMi40MDQgMCAxIDEgLTQuODA3LDBsMCwtNC4yNzhsLTIuODEyLDBsMCw0LjY4NmE1LjIxNSw1LjIxNSAwIDAgMCAxMC40MywwbDAsLTQuNjg2bDAsMi41NjdsLTIuODEsMGwtMC4wMDEsMHoiIGNsaXAtcnVsZT0iZXZlbm9kZCIgZmlsbC1ydWxlPSJldmVub2RkIi8+CiAgPHBhdGggaWQ9InN2Z183IiBmaWxsPSIjZmZmIiBkPSJtMjIuODk4LDEzLjc3MWwxLjE0LDBsMCwxLjE0MWwtMS4xNCwwbDAsLTEuMTRsMCwtMC4wMDF6bS0yLjgxMiwtMC42OTJsMS44MzQsMGwwLDEuODMzbC0xLjgzNCwwbDAsLTEuODMybDAsLTAuMDAxem0xLjgzNCwtMC45NzlsMC45NzgsMGwwLDAuOTc5bC0wLjk3OCwwbDAsLTAuOTc5em0xLjU0OCwtMS42MjlsMC42MTEsMGwwLDAuNjExbC0wLjYxLDBsMCwtMC42MWwtMC4wMDEsLTAuMDAxeiIvPgogPC9nPgo8L3N2Zz4=&labelColor=white)](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
Test Version (Synced with dev branch updates, testing new features):
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAF8AAABYCAMAAACkl9t/AAAAk1BMVEVHcEz/nQv/nQv/nQr/nQv/nQr/nQv/nQv/nQr/wRf/txT/pg7/yRr/rBD/zRz/ngv/oAz/zhz/nwv/txT/ngv/0B3+zBz/nQv/0h7/wxn/vRb/thXkuiT/rxH/pxD/ogzcqyf/nQvTlSz/czCxky7/SjifdjT/Mj3+Mj3wMj15aTnDNz+DSD9RTUBsP0FRO0Q6O0WyIxEIAAAAGHRSTlMADB8zSWF3krDDw8TJ1NbX5efv8ff9/fxKDJ9uAAAGKklEQVR42u2Z63qjOAyGC4RwCOfB2JAGqrSb2WnTw/1f3UaWcSGYNKTdf/P+mOkTrE+yJBulvfvLT2A5ruenaVHyIks33npl/6C4s/ZLAM45SOi/1FtZPyFur1OYofBX3w7d54Bxm+E8db+nDr12ttmESZ4zludJEG5S7TO72YPlKZFyE+YCYUJTBZsMiNS5Sd7NlDmKM2Eg2JQg8awbglfqgbhArjxkS7dgp2RH6hc9AMLdZYUtZN5DJr4molC8BfKrEkPKEnEVjLbgW1fLy77ZVOJagoIcLIl+IxaQZGjiX597HopF5CkaXVMDO9Pyix3AFV3kw4lQLCbHuMovz8FallbcQIJ5Ta0vks9RnolbCK84BtjKRS5uA43hYoZcOBGIG2Epbv6CvFVQ8m8loh66WNySsnN7htL58LNp+NXT8/PhXiBXPMjLSxtwp8W9f/1AngRierBkA+kk/IpUSOeKByzn8y3kAAAfh//0oXgV4roHm/kz4E2z//zRc3/lgwBzbM2mJxQEa5pqgX7d1L0htrhx7LKxOZlKbwcAWyEOWqYSI8YPtgDQVjpB5nvaHaSnBaQSD6hweDi8PosxD6/PT09YY3xQA7LTCTKfYX+QHpA0GCcqmEHvr/cyfKQTEuwgbs2kPxJEB0iNjfJcCTPyocx+A0griHSmADiC91oNGVwJ69RudYe65vJmoqfpul0lrqXadW0jFKH5BKwAeCq+Den7s+3zfRJzA61/Uj/9H/VzLKTx9jFPPdXeeP+L7WEvDLAKAIoF8bPTKT0+TM7W8ePj3Rz/Yn3kOAp2f1Kf0Weony7pn/cPydvhQYV+eFOfmOu7VB/ViPe34/EN3RFHY/yRuT8ddCtMPH/McBAT5s+vRde/gf2c/sPsjLK+m5IBQF5tO+h2tTlBGnP6693JdsvofjOPnnEHkh2TnV/X1fBl9S5zrwuwF8NFrAVJVwCAPTe8gaJlomqlp0pv4Pjn98tJ/t/fL++6unpR1YGC2n/KCoa0tTLoKiEeUPDl94nj+5/Tv3/eT5vBQ60X1S0oZr+IWRR8Ldhu7AlLjPISlJcO9vrFotky9SpzDequlwEir5beYAc0R7D9KS1DXva0jhYRDXoExPdc6yw5GShkZXe9QdO/uOvHofxjrV/TNS6iMJS+4TcSTgk9n5agJdBQbB//IfF/HpvPt3Tbi7b6I6K0R72p6ajryEJrENW2bbeVUGjfgoals4L443c7BEE4mJO2SpbRngxQrAKRudRzGQ8jVOL2qDVjjI8K1gc3TIJ5KiFZ1q+gdsARPB4NQS4AjwVSt72DSoXNyOWUrU5mQ9nRYyjp89Xo7oRI6Bga9QNT1mQ/ptaJq5T/7WcgAZywR/XlPGAUDdet3LE+qS0TI+g+aJU8MIqjo0Kx8Ly+maxLjJmjQ18rA0YCkxLQbUZP1WqdmyQGJLUm7VnQFqodmXSqmRrdVpqdzk5LvmvgtEcW8PMGdaS23EOWyDVbACZzUJPaqMbjDxpA3Qrgl0AikimGDbqmyT8P8NOYiqrldF8rX+YN7TopX4UoHuSCYY7cgX4gHwclQKl1zhx0THf+tCAUValzjI7Wg9EhptrkIcfIJjA94evOn8B2eHaVzvBrnl2ig0So6hvPaz0IGcOvTHvUIlE2+prqAxLSQxZlU2stql1NqCCLdIiIN/i1DBEHUoElM9dBravbiAnKqgpi4IBkw+utSPIoBijDXJipSVV7MpOEJUAc5Qmm3BnUN+w3hteEieYKfRZSIUcXKMVf0u5wD4EwsUNVvZOtUT7A2GkffHjByWpHqvRBYrTV72a6j8zZ6W0DTE86Hn04bmyWX3Ri9WH7ZU6Q7h+ZHo0nHUAcsQvVhXRDZHChwiyi/hnPuOsSEF6Exk3o6Y9DT1eZ+6cASXk2Y9k+6EOQMDGm6WBK10wOQJCBwren86cPPWUcRAnTVjGcU1LBgs9FURiX/e6479yZcLwCBmTxiawEwrOcleuu12t3tbLv/N4RLYIBhYexm7Fcn4OJcn0+zc+s8/VfPeddZHAGN6TT8eGczHdR/Gts1/MzDkThr23zqrVfAMFT33Nx1RJsx1k5zuWILLnG/vsH+Fv5D4NTVcp1Gzo8AAAAAElFTkSuQmCC&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU) 
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAF8AAABYCAMAAACkl9t/AAAAk1BMVEVHcEz/nQv/nQv/nQr/nQv/nQr/nQv/nQv/nQr/wRf/txT/pg7/yRr/rBD/zRz/ngv/oAz/zhz/nwv/txT/ngv/0B3+zBz/nQv/0h7/wxn/vRb/thXkuiT/rxH/pxD/ogzcqyf/nQvTlSz/czCxky7/SjifdjT/Mj3+Mj3wMj15aTnDNz+DSD9RTUBsP0FRO0Q6O0WyIxEIAAAAGHRSTlMADB8zSWF3krDDw8TJ1NbX5efv8ff9/fxKDJ9uAAAGKklEQVR42u2Z63qjOAyGC4RwCOfB2JAGqrSb2WnTw/1f3UaWcSGYNKTdf/P+mOkTrE+yJBulvfvLT2A5ruenaVHyIks33npl/6C4s/ZLAM45SOi/1FtZPyFur1OYofBX3w7d54Bxm+E8db+nDr12ttmESZ4zludJEG5S7TO72YPlKZFyE+YCYUJTBZsMiNS5Sd7NlDmKM2Eg2JQg8awbglfqgbhArjxkS7dgp2RH6hc9AMLdZYUtZN5DJr4molC8BfKrEkPKEnEVjLbgW1fLy77ZVOJagoIcLIl+IxaQZGjiX597HopF5CkaXVMDO9Pyix3AFV3kw4lQLCbHuMovz8FallbcQIJ5Ta0vks9RnolbCK84BtjKRS5uA43hYoZcOBGIG2Epbv6CvFVQ8m8loh66WNySsnN7htL58LNp+NXT8/PhXiBXPMjLSxtwp8W9f/1AngRierBkA+kk/IpUSOeKByzn8y3kAAAfh//0oXgV4roHm/kz4E2z//zRc3/lgwBzbM2mJxQEa5pqgX7d1L0htrhx7LKxOZlKbwcAWyEOWqYSI8YPtgDQVjpB5nvaHaSnBaQSD6hweDi8PosxD6/PT09YY3xQA7LTCTKfYX+QHpA0GCcqmEHvr/cyfKQTEuwgbs2kPxJEB0iNjfJcCTPyocx+A0griHSmADiC91oNGVwJ69RudYe65vJmoqfpul0lrqXadW0jFKH5BKwAeCq+Den7s+3zfRJzA61/Uj/9H/VzLKTx9jFPPdXeeP+L7WEvDLAKAIoF8bPTKT0+TM7W8ePj3Rz/Yn3kOAp2f1Kf0Weony7pn/cPydvhQYV+eFOfmOu7VB/ViPe34/EN3RFHY/yRuT8ddCtMPH/McBAT5s+vRde/gf2c/sPsjLK+m5IBQF5tO+h2tTlBGnP6693JdsvofjOPnnEHkh2TnV/X1fBl9S5zrwuwF8NFrAVJVwCAPTe8gaJlomqlp0pv4Pjn98tJ/t/fL++6unpR1YGC2n/KCoa0tTLoKiEeUPDl94nj+5/Tv3/eT5vBQ60X1S0oZr+IWRR8Ldhu7AlLjPISlJcO9vrFotky9SpzDequlwEir5beYAc0R7D9KS1DXva0jhYRDXoExPdc6yw5GShkZXe9QdO/uOvHofxjrV/TNS6iMJS+4TcSTgk9n5agJdBQbB//IfF/HpvPt3Tbi7b6I6K0R72p6ajryEJrENW2bbeVUGjfgoals4L443c7BEE4mJO2SpbRngxQrAKRudRzGQ8jVOL2qDVjjI8K1gc3TIJ5KiFZ1q+gdsARPB4NQS4AjwVSt72DSoXNyOWUrU5mQ9nRYyjp89Xo7oRI6Bga9QNT1mQ/ptaJq5T/7WcgAZywR/XlPGAUDdet3LE+qS0TI+g+aJU8MIqjo0Kx8Ly+maxLjJmjQ18rA0YCkxLQbUZP1WqdmyQGJLUm7VnQFqodmXSqmRrdVpqdzk5LvmvgtEcW8PMGdaS23EOWyDVbACZzUJPaqMbjDxpA3Qrgl0AikimGDbqmyT8P8NOYiqrldF8rX+YN7TopX4UoHuSCYY7cgX4gHwclQKl1zhx0THf+tCAUValzjI7Wg9EhptrkIcfIJjA94evOn8B2eHaVzvBrnl2ig0So6hvPaz0IGcOvTHvUIlE2+prqAxLSQxZlU2stql1NqCCLdIiIN/i1DBEHUoElM9dBravbiAnKqgpi4IBkw+utSPIoBijDXJipSVV7MpOEJUAc5Qmm3BnUN+w3hteEieYKfRZSIUcXKMVf0u5wD4EwsUNVvZOtUT7A2GkffHjByWpHqvRBYrTV72a6j8zZ6W0DTE86Hn04bmyWX3Ri9WH7ZU6Q7h+ZHo0nHUAcsQvVhXRDZHChwiyi/hnPuOsSEF6Exk3o6Y9DT1eZ+6cASXk2Y9k+6EOQMDGm6WBK10wOQJCBwren86cPPWUcRAnTVjGcU1LBgs9FURiX/e6479yZcLwCBmTxiawEwrOcleuu12t3tbLv/N4RLYIBhYexm7Fcn4OJcn0+zc+s8/VfPeddZHAGN6TT8eGczHdR/Gts1/MzDkThr23zqrVfAMFT33Nx1RJsx1k5zuWILLnG/vsH+Fv5D4NTVcp1Gzo8AAAAAElFTkSuQmCC&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU) [![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
...@@ -211,10 +229,18 @@ You can modify certain configurations in this file to enable or disable features
```json
{
    // other config
    "layout-config": {
        "model": "layoutlmv3" // Please change to "doclayout_yolo" when using doclayout_yolo.
    },
    "formula-config": {
        "mfd_model": "yolo_v8_mfd",
        "mfr_model": "unimernet_small",
        "enable": true // The formula recognition feature is enabled by default. If you need to disable it, please change the value here to "false".
    },
    "table-config": {
        "model": "tablemaster", // When using structEqTable, please change to "struct_eqtable".
        "enable": false, // The table recognition feature is disabled by default. If you need to enable it, please change the value here to "true".
        "max_time": 400
    }
}
```
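If you prefer to flip these switches programmatically rather than by hand, the snippet below is a minimal sketch. It assumes the file being edited is the `magic-pdf.json` in your home directory that the CLI reads, that the real file is plain JSON without the explanatory `//` comments shown above, and that the key names match the snippet.

```python
import json
import os

# Minimal sketch (assumption: magic-pdf.json lives in the home directory and is plain JSON).
config_path = os.path.expanduser("~/magic-pdf.json")
with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

config["layout-config"]["model"] = "doclayout_yolo"  # or "layoutlmv3"
config["formula-config"]["enable"] = True            # formula recognition (on by default)
config["table-config"]["enable"] = True              # table recognition (off by default)
config["table-config"]["max_time"] = 400

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=4)
```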
...@@ -263,8 +289,8 @@ Options:
-l, --lang TEXT Input the languages in the pdf (if known) to
improve OCR accuracy. Optional. You should
input "Abbreviation" with language form url: ht
tps://paddlepaddle.github.io/PaddleOCR/latest/en
/ppocr/blog/multi_languages.html#5-support-languages-
and-abbreviations
-d, --debug BOOLEAN Enables detailed debugging information during
the execution of the CLI commands.
...@@ -288,7 +314,7 @@ The results will be saved in the `{some_output_dir}` directory. The output file
```text
├── some_pdf.md # markdown file
├── images # directory for storing images
├── some_pdf_layout.pdf # layout diagram (includes the layout reading order)
├── some_pdf_middle.json # MinerU intermediate processing result
├── some_pdf_model.json # model inference result
├── some_pdf_origin.pdf # original PDF file
...@@ -333,29 +359,38 @@ For detailed implementation, refer to:
- [demo.py Simplest Processing Method](demo/demo.py)
- [magic_pdf_parse_main.py More Detailed Processing Workflow](demo/magic_pdf_parse_main.py)
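As a quick orientation before opening those files, here is a condensed sketch of the same API calls used in `demo/demo.py` (the full script appears further down in this commit); the file names are placeholders.

```python
import os

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

pdf_bytes = open("some_pdf.pdf", "rb").read()          # placeholder input path
local_image_dir = os.path.join(os.getcwd(), "images")
image_writer = DiskReaderWriter(local_image_dir)       # extracted images are written here

# An empty model_list tells the pipeline to run the built-in models itself.
pipe = UNIPipe(pdf_bytes, {"_pdf_type": "", "model_list": []}, image_writer)
pipe.pipe_classify()   # decide whether the PDF needs OCR or plain text extraction
pipe.pipe_analyze()    # run the layout / formula / OCR (and optional table) models
pipe.pipe_parse()      # build the intermediate representation
md_content = pipe.pipe_mk_markdown(os.path.basename(local_image_dir), drop_mode="none")

with open("some_pdf.md", "w", encoding="utf-8") as f:  # placeholder output path
    f.write(md_content)
```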
### Deploy Derived Projects
Derived projects are secondary development projects built on MinerU by the project team and community developers,
such as Gradio-based application interfaces, llama-based RAG, web demos similar to the official website, and lightweight multi-GPU load-balancing client/server setups.
These projects may offer more features and a better user experience.
For specific deployment methods, please refer to the [Derived Project README](projects/README.md).
### Development Guide
TODO
# TODO
- 🗹 Reading order based on the model
- 🗹 Recognition of `index` and `list` in the main text
- 🗹 Table recognition
- ☐ Code block recognition in the main text
- ☐ [Chemical formula recognition](docs/chemical_knowledge_introduction/introduction.pdf)
- ☐ Geometric shape recognition
# Known Issues
- Reading order is determined by the model based on the spatial distribution of readable content, and may be out of order in some areas under extremely complex layouts.
- Vertical text is not supported.
- Tables of contents and lists are recognized through rules, and some uncommon list formats may not be recognized.
- Only one level of headings is supported; hierarchical headings are not currently supported.
- Code blocks are not yet supported in the layout model.
- Comic books, art albums, primary school textbooks, and exercise books cannot be parsed well.
- Table recognition may result in row/column recognition errors in complex tables.
- OCR recognition may produce inaccurate characters in PDFs of lesser-known languages (e.g., diacritical marks in Latin script, easily confused characters in Arabic script).
- Some formulas may not render correctly in Markdown.
# FAQ
......
...@@ -41,6 +41,18 @@
</div>
# 更新记录
- 2024/10/31 0.9.0发布,这是我们进行了大量代码重构的全新版本,解决了众多问题,提升了性能,降低了硬件需求,并提供了更丰富的易用性:
- 重构排序模块代码,使用 [layoutreader](https://github.com/ppaanngggg/layoutreader) 进行阅读顺序排序,确保在各种排版下都能实现极高准确率
- 重构段落拼接模块,在跨栏、跨页、跨图、跨表情况下均能实现良好的段落拼接效果
- 重构列表和目录识别功能,极大提升列表块和目录块识别的准确率及对应文本段落的解析效果
- 重构图、表与描述性文本的匹配逻辑,大幅提升 caption 和 footnote 与图表的匹配准确率,并将描述性文本的丢失率降至零
- 增加 OCR 的多语言支持,支持 84 种语言的检测与识别,语言支持列表详见 [OCR 语言支持列表](https://paddlepaddle.github.io/PaddleOCR/latest/ppocr/blog/multi_languages.html#5)
- 增加显存回收逻辑及其他显存优化措施,大幅降低显存使用需求。开启除表格加速外的全部加速功能(layout/公式/OCR)的显存需求从16GB降至8GB,开启全部加速功能的显存需求从24GB降至10GB
- 优化配置文件的功能开关,增加独立的公式检测开关,无需公式检测时可大幅提升速度和解析效果
- 集成 [PDF-Extract-Kit 1.0](https://github.com/opendatalab/PDF-Extract-Kit)
- 加入自研的 `doclayout_yolo` 模型,在相近解析效果情况下比原方案提速10倍以上,可通过配置文件与 `layoutlmv3` 自由切换
- 公式解析升级至 `unimernet 0.2.1`,在提升公式解析准确率的同时,大幅降低显存需求
- 2024/09/27 0.8.1发布,修复了一些bug,同时提供了[在线demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/)的[本地化部署版本](projects/web_demo/README_zh-CN.md)和[前端界面](projects/web/README_zh-CN.md)
- 2024/09/09 0.8.0发布,支持Dockerfile快速部署,同时上线了huggingface、modelscope demo
- 2024/08/30 0.7.1发布,集成了paddle tablemaster表格识别功能
...@@ -69,6 +81,7 @@
<ul>
<li><a href="#命令行">命令行</a></li>
<li><a href="#api">API</a></li>
<li><a href="#部署衍生项目">部署衍生项目</a></li>
<li><a href="#二次开发">二次开发</a></li>
</ul>
</li>
...@@ -100,15 +113,18 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
## 主要功能
- 删除页眉、页脚、脚注、页码等元素,确保语义连贯
- 输出符合人类阅读顺序的文本,适用于单栏、多栏及复杂排版
- 保留原文档的结构,包括标题、段落、列表等
- 提取图像、图片描述、表格、表格标题及脚注
- 自动识别并转换文档中的公式为LaTeX格式
- 自动识别并转换文档中的表格为LaTeX或HTML格式
- 自动检测扫描版PDF和乱码PDF,并启用OCR功能
- OCR支持84种语言的检测与识别
- 支持多种输出格式,如多模态与NLP的Markdown、按阅读顺序排序的JSON、含有丰富信息的中间格式等
- 支持多种可视化结果,包括layout可视化、span可视化等,便于高效确认输出效果与质检
- 支持CPU和GPU环境
- 兼容Windows、Linux和Mac平台
## 快速开始
...@@ -139,8 +155,8 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
</tr>
<tr>
<td colspan="3">CPU</td>
<td>x86_64(暂不支持ARM Linux)</td>
<td>x86_64(暂不支持ARM Windows)</td>
<td>x86_64 / arm64</td>
</tr>
<tr>
...@@ -149,7 +165,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
</tr>
<tr>
<td colspan="3">python版本</td>
<td colspan="3">3.10 (请务必通过conda创建3.10虚拟环境)</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver 版本</td>
...@@ -166,23 +182,27 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
<tr>
<td rowspan="2">GPU硬件支持列表</td>
<td colspan="2">最低要求 8G+显存</td>
<td colspan="2">3060ti/3070/4060<br>
8G显存可开启layout、公式识别和ocr加速</td>
<td rowspan="2">None</td>
</tr>
<tr>
<td colspan="2">推荐配置 10G+显存</td>
<td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070ti super/4080/4090<br>
10G显存及以上可以同时开启layout、公式识别和ocr加速和表格识别加速<br>
</td>
</tr>
</table>
### 在线体验
稳定版(经过QA验证的稳定版本):
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMzAiIGhlaWdodD0iMzAiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgZmlsbD0ibm9uZSI+CiA8ZGVmcz4KICA8bGluZWFyR3JhZGllbnQgeTI9IjAuNTMzNjciIHgyPSIxLjAwMDQiIHkxPSIwLjI5MjE5IiB4MT0iLTAuMTEyNjgiIGlkPSJhIj4KICAgPHN0b3Agc3RvcC1jb2xvcj0iIzE1NDNGRSIvPgogICA8c3RvcCBzdG9wLWNvbG9yPSIjOEM0NkZGIiBvZmZzZXQ9IjEiLz4KICA8L2xpbmVhckdyYWRpZW50PgogIDxsaW5lYXJHcmFkaWVudCB5Mj0iMC41OTc1NyIgeDI9IjEuMDExMzciIHkxPSIwLjExMDIzIiB4MT0iLTAuMDg0NzQiIGlkPSJiIj4KICAgPHN0b3Agc3RvcC1jb2xvcj0iIzE1NDNGRSIvPgogICA8c3RvcCBzdG9wLWNvbG9yPSIjOEM0NkZGIiBvZmZzZXQ9IjEiLz4KICA8L2xpbmVhckdyYWRpZW50PgogPC9kZWZzPgogPGc+CiAgPHRpdGxlPkxheWVyIDE8L3RpdGxlPgogIDxwYXRoIGlkPSJzdmdfMSIgZmlsbD0idXJsKCNhKSIgZD0ibTEuNjIzLDEyLjA2N2EwLjQ4NCwwLjQ4NCAwIDAgMSAwLjA3LC0wLjM4NGw1LjMxLC03Ljg5NWMwLjA2OCwtMC4xIDAuMTcsLTAuMTcyIDAuMjg4LC0wLjJsMTQuMzc3LC0zLjQ3NGEwLjQ4NCwwLjQ4NCAwIDAgMSAwLjU4NCwwLjM1N2wzLjY2MiwxNS4xNTJjMS40NzcsNi4xMTQgLTIuMjgxLDEyLjI2NyAtOC4zOTQsMTMuNzQ1Yy02LjExNCwxLjQ3NyAtMTIuMjY3LC0yLjI4MSAtMTMuNzQ1LC04LjM5NWwtMi4xNTIsLTguOTA2eiIgb3BhY2l0eT0iMC40Ii8+CiAgPHBhdGggaWQ9InN2Z18yIiBmaWxsPSJ1cmwoI2IpIiBkPSJtNS44MjYsOC42NzNjMCwtMC4xMzYgMC4wNTcsLTAuMjY2IDAuMTU3LC0wLjM1OGw3LjAxNywtNi40MjVhMC40ODQsMC40ODQgMCAwIDEgMC4zMjcsLTAuMTI3bDE0Ljc5LDBjMC4yNjgsMCAwLjQ4NSwwLjIxNiAwLjQ4NSwwLjQ4NGwwLDE1LjU4OWMwLDYuMjkgLTUuMDk5LDExLjM4OCAtMTEuMzg4LDExLjM4OGMtNi4yOSwwIC0xMS4zODgsLTUuMDk5IC0xMS4zODgsLTExLjM4OGwwLC05LjE2M3oiLz4KICA8cGF0aCBpZD0ic3ZnXzMiIGZpbGw9IiM1RDc2RkYiIGQ9Im0xMi4zMzEsOC43NTNsLTYuMzgzLC0wLjM5OGw3LjEyMiwtNi41MmwwLjI5OSw1Ljg5MmEwLjk3OCwwLjk3OCAwIDAgMSAtMS4wMzgsMS4wMjZ6Ii8+CiAgPHBhdGggaWQ9InN2Z180IiBmaWxsPSIjMDAyOEZEIiBkPSJtMjAuNDE2LDE1LjAyMmwwLDEuNzExYTIuNDA0LDIuNDA0IDAgMCAxIC00LjgwOCwwbDAsLTQuMjc4bC0yLjgxLDBsMCw0LjY4NmE1LjIxNSw1LjIxNSAwIDEgMCAxMC40MywwbDAsLTQuNjg2bDAsMi41NjdsLTIuODEyLDB6IiBjbGlwLXJ1bGU9ImV2ZW5vZGQiIGZpbGwtcnVsZT0iZXZlbm9kZCIvPgogIDxwYXRoIGlkPSJzdmdfNSIgZmlsbD0iIzAwMjhGRCIgZD0ibTIzLjIyOCwxMy44ODFsMS4xNCwwbDAsMS4xNDFsLTEuMTQsMGwwLC0xLjE0bDAsLTAuMDAxem0tMi44MTIsLTAuNjkybDEuODM0LDBsMCwxLjgzM2wtMS44MzQsMGwwLC0xLjgzMmwwLC0wLjAwMXptMS44MzQsLTAuOTc5bDAuOTc4LDBsMCwwLjk3OWwtMC45NzgsMGwwLC0wLjk3OGwwLC0wLjAwMXptMS41NDgsLTEuNjI5bDAuNjExLDBsMCwwLjYxMWwtMC42MTEsMGwwLC0wLjYxMXoiLz4KICA8cGF0aCBpZD0ic3ZnXzYiIGZpbGw9IiNmZmYiIGQ9Im0yMC4wODYsMTQuOTEybDAsMS43MTFhMi40MDQsMi40MDQgMCAxIDEgLTQuODA3LDBsMCwtNC4yNzhsLTIuODEyLDBsMCw0LjY4NmE1LjIxNSw1LjIxNSAwIDAgMCAxMC40MywwbDAsLTQuNjg2bDAsMi41NjdsLTIuODEsMGwtMC4wMDEsMHoiIGNsaXAtcnVsZT0iZXZlbm9kZCIgZmlsbC1ydWxlPSJldmVub2RkIi8+CiAgPHBhdGggaWQ9InN2Z183IiBmaWxsPSIjZmZmIiBkPSJtMjIuODk4LDEzLjc3MWwxLjE0LDBsMCwxLjE0MWwtMS4xNCwwbDAsLTEuMTRsMCwtMC4wMDF6bS0yLjgxMiwtMC42OTJsMS44MzQsMGwwLDEuODMzbC0xLjgzNCwwbDAsLTEuODMybDAsLTAuMDAxem0xLjgzNCwtMC45NzlsMC45NzgsMGwwLDAuOTc5bC0wLjk3OCwwbDAsLTAuOTc5em0xLjU0OCwtMS42MjlsMC42MTEsMGwwLDAuNjExbC0wLjYxLDBsMCwtMC42MWwtMC4wMDEsLTAuMDAxeiIvPgogPC9nPgo8L3N2Zz4=&labelColor=white)](https://opendatalab.com/OpenSourceTools/Extractor/PDF) 
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMzAiIGhlaWdodD0iMzAiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgZmlsbD0ibm9uZSI+CiA8ZGVmcz4KICA8bGluZWFyR3JhZGllbnQgeTI9IjAuNTMzNjciIHgyPSIxLjAwMDQiIHkxPSIwLjI5MjE5IiB4MT0iLTAuMTEyNjgiIGlkPSJhIj4KICAgPHN0b3Agc3RvcC1jb2xvcj0iIzE1NDNGRSIvPgogICA8c3RvcCBzdG9wLWNvbG9yPSIjOEM0NkZGIiBvZmZzZXQ9IjEiLz4KICA8L2xpbmVhckdyYWRpZW50PgogIDxsaW5lYXJHcmFkaWVudCB5Mj0iMC41OTc1NyIgeDI9IjEuMDExMzciIHkxPSIwLjExMDIzIiB4MT0iLTAuMDg0NzQiIGlkPSJiIj4KICAgPHN0b3Agc3RvcC1jb2xvcj0iIzE1NDNGRSIvPgogICA8c3RvcCBzdG9wLWNvbG9yPSIjOEM0NkZGIiBvZmZzZXQ9IjEiLz4KICA8L2xpbmVhckdyYWRpZW50PgogPC9kZWZzPgogPGc+CiAgPHRpdGxlPkxheWVyIDE8L3RpdGxlPgogIDxwYXRoIGlkPSJzdmdfMSIgZmlsbD0idXJsKCNhKSIgZD0ibTEuNjIzLDEyLjA2N2EwLjQ4NCwwLjQ4NCAwIDAgMSAwLjA3LC0wLjM4NGw1LjMxLC03Ljg5NWMwLjA2OCwtMC4xIDAuMTcsLTAuMTcyIDAuMjg4LC0wLjJsMTQuMzc3LC0zLjQ3NGEwLjQ4NCwwLjQ4NCAwIDAgMSAwLjU4NCwwLjM1N2wzLjY2MiwxNS4xNTJjMS40NzcsNi4xMTQgLTIuMjgxLDEyLjI2NyAtOC4zOTQsMTMuNzQ1Yy02LjExNCwxLjQ3NyAtMTIuMjY3LC0yLjI4MSAtMTMuNzQ1LC04LjM5NWwtMi4xNTIsLTguOTA2eiIgb3BhY2l0eT0iMC40Ii8+CiAgPHBhdGggaWQ9InN2Z18yIiBmaWxsPSJ1cmwoI2IpIiBkPSJtNS44MjYsOC42NzNjMCwtMC4xMzYgMC4wNTcsLTAuMjY2IDAuMTU3LC0wLjM1OGw3LjAxNywtNi40MjVhMC40ODQsMC40ODQgMCAwIDEgMC4zMjcsLTAuMTI3bDE0Ljc5LDBjMC4yNjgsMCAwLjQ4NSwwLjIxNiAwLjQ4NSwwLjQ4NGwwLDE1LjU4OWMwLDYuMjkgLTUuMDk5LDExLjM4OCAtMTEuMzg4LDExLjM4OGMtNi4yOSwwIC0xMS4zODgsLTUuMDk5IC0xMS4zODgsLTExLjM4OGwwLC05LjE2M3oiLz4KICA8cGF0aCBpZD0ic3ZnXzMiIGZpbGw9IiM1RDc2RkYiIGQ9Im0xMi4zMzEsOC43NTNsLTYuMzgzLC0wLjM5OGw3LjEyMiwtNi41MmwwLjI5OSw1Ljg5MmEwLjk3OCwwLjk3OCAwIDAgMSAtMS4wMzgsMS4wMjZ6Ii8+CiAgPHBhdGggaWQ9InN2Z180IiBmaWxsPSIjMDAyOEZEIiBkPSJtMjAuNDE2LDE1LjAyMmwwLDEuNzExYTIuNDA0LDIuNDA0IDAgMCAxIC00LjgwOCwwbDAsLTQuMjc4bC0yLjgxLDBsMCw0LjY4NmE1LjIxNSw1LjIxNSAwIDEgMCAxMC40MywwbDAsLTQuNjg2bDAsMi41NjdsLTIuODEyLDB6IiBjbGlwLXJ1bGU9ImV2ZW5vZGQiIGZpbGwtcnVsZT0iZXZlbm9kZCIvPgogIDxwYXRoIGlkPSJzdmdfNSIgZmlsbD0iIzAwMjhGRCIgZD0ibTIzLjIyOCwxMy44ODFsMS4xNCwwbDAsMS4xNDFsLTEuMTQsMGwwLC0xLjE0bDAsLTAuMDAxem0tMi44MTIsLTAuNjkybDEuODM0LDBsMCwxLjgzM2wtMS44MzQsMGwwLC0xLjgzMmwwLC0wLjAwMXptMS44MzQsLTAuOTc5bDAuOTc4LDBsMCwwLjk3OWwtMC45NzgsMGwwLC0wLjk3OGwwLC0wLjAwMXptMS41NDgsLTEuNjI5bDAuNjExLDBsMCwwLjYxMWwtMC42MTEsMGwwLC0wLjYxMXoiLz4KICA8cGF0aCBpZD0ic3ZnXzYiIGZpbGw9IiNmZmYiIGQ9Im0yMC4wODYsMTQuOTEybDAsMS43MTFhMi40MDQsMi40MDQgMCAxIDEgLTQuODA3LDBsMCwtNC4yNzhsLTIuODEyLDBsMCw0LjY4NmE1LjIxNSw1LjIxNSAwIDAgMCAxMC40MywwbDAsLTQuNjg2bDAsMi41NjdsLTIuODEsMGwtMC4wMDEsMHoiIGNsaXAtcnVsZT0iZXZlbm9kZCIgZmlsbC1ydWxlPSJldmVub2RkIi8+CiAgPHBhdGggaWQ9InN2Z183IiBmaWxsPSIjZmZmIiBkPSJtMjIuODk4LDEzLjc3MWwxLjE0LDBsMCwxLjE0MWwtMS4xNCwwbDAsLTEuMTRsMCwtMC4wMDF6bS0yLjgxMiwtMC42OTJsMS44MzQsMGwwLDEuODMzbC0xLjgzNCwwbDAsLTEuODMybDAsLTAuMDAxem0xLjgzNCwtMC45NzlsMC45NzgsMGwwLDAuOTc5bC0wLjk3OCwwbDAsLTAuOTc5em0xLjU0OCwtMS42MjlsMC42MTEsMGwwLDAuNjExbC0wLjYxLDBsMCwtMC42MWwtMC4wMDEsLTAuMDAxeiIvPgogPC9nPgo8L3N2Zz4=&labelColor=white)](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
测试版(同步dev分支更新,测试新特性):
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAF8AAABYCAMAAACkl9t/AAAAk1BMVEVHcEz/nQv/nQv/nQr/nQv/nQr/nQv/nQv/nQr/wRf/txT/pg7/yRr/rBD/zRz/ngv/oAz/zhz/nwv/txT/ngv/0B3+zBz/nQv/0h7/wxn/vRb/thXkuiT/rxH/pxD/ogzcqyf/nQvTlSz/czCxky7/SjifdjT/Mj3+Mj3wMj15aTnDNz+DSD9RTUBsP0FRO0Q6O0WyIxEIAAAAGHRSTlMADB8zSWF3krDDw8TJ1NbX5efv8ff9/fxKDJ9uAAAGKklEQVR42u2Z63qjOAyGC4RwCOfB2JAGqrSb2WnTw/1f3UaWcSGYNKTdf/P+mOkTrE+yJBulvfvLT2A5ruenaVHyIks33npl/6C4s/ZLAM45SOi/1FtZPyFur1OYofBX3w7d54Bxm+E8db+nDr12ttmESZ4zludJEG5S7TO72YPlKZFyE+YCYUJTBZsMiNS5Sd7NlDmKM2Eg2JQg8awbglfqgbhArjxkS7dgp2RH6hc9AMLdZYUtZN5DJr4molC8BfKrEkPKEnEVjLbgW1fLy77ZVOJagoIcLIl+IxaQZGjiX597HopF5CkaXVMDO9Pyix3AFV3kw4lQLCbHuMovz8FallbcQIJ5Ta0vks9RnolbCK84BtjKRS5uA43hYoZcOBGIG2Epbv6CvFVQ8m8loh66WNySsnN7htL58LNp+NXT8/PhXiBXPMjLSxtwp8W9f/1AngRierBkA+kk/IpUSOeKByzn8y3kAAAfh//0oXgV4roHm/kz4E2z//zRc3/lgwBzbM2mJxQEa5pqgX7d1L0htrhx7LKxOZlKbwcAWyEOWqYSI8YPtgDQVjpB5nvaHaSnBaQSD6hweDi8PosxD6/PT09YY3xQA7LTCTKfYX+QHpA0GCcqmEHvr/cyfKQTEuwgbs2kPxJEB0iNjfJcCTPyocx+A0griHSmADiC91oNGVwJ69RudYe65vJmoqfpul0lrqXadW0jFKH5BKwAeCq+Den7s+3zfRJzA61/Uj/9H/VzLKTx9jFPPdXeeP+L7WEvDLAKAIoF8bPTKT0+TM7W8ePj3Rz/Yn3kOAp2f1Kf0Weony7pn/cPydvhQYV+eFOfmOu7VB/ViPe34/EN3RFHY/yRuT8ddCtMPH/McBAT5s+vRde/gf2c/sPsjLK+m5IBQF5tO+h2tTlBGnP6693JdsvofjOPnnEHkh2TnV/X1fBl9S5zrwuwF8NFrAVJVwCAPTe8gaJlomqlp0pv4Pjn98tJ/t/fL++6unpR1YGC2n/KCoa0tTLoKiEeUPDl94nj+5/Tv3/eT5vBQ60X1S0oZr+IWRR8Ldhu7AlLjPISlJcO9vrFotky9SpzDequlwEir5beYAc0R7D9KS1DXva0jhYRDXoExPdc6yw5GShkZXe9QdO/uOvHofxjrV/TNS6iMJS+4TcSTgk9n5agJdBQbB//IfF/HpvPt3Tbi7b6I6K0R72p6ajryEJrENW2bbeVUGjfgoals4L443c7BEE4mJO2SpbRngxQrAKRudRzGQ8jVOL2qDVjjI8K1gc3TIJ5KiFZ1q+gdsARPB4NQS4AjwVSt72DSoXNyOWUrU5mQ9nRYyjp89Xo7oRI6Bga9QNT1mQ/ptaJq5T/7WcgAZywR/XlPGAUDdet3LE+qS0TI+g+aJU8MIqjo0Kx8Ly+maxLjJmjQ18rA0YCkxLQbUZP1WqdmyQGJLUm7VnQFqodmXSqmRrdVpqdzk5LvmvgtEcW8PMGdaS23EOWyDVbACZzUJPaqMbjDxpA3Qrgl0AikimGDbqmyT8P8NOYiqrldF8rX+YN7TopX4UoHuSCYY7cgX4gHwclQKl1zhx0THf+tCAUValzjI7Wg9EhptrkIcfIJjA94evOn8B2eHaVzvBrnl2ig0So6hvPaz0IGcOvTHvUIlE2+prqAxLSQxZlU2stql1NqCCLdIiIN/i1DBEHUoElM9dBravbiAnKqgpi4IBkw+utSPIoBijDXJipSVV7MpOEJUAc5Qmm3BnUN+w3hteEieYKfRZSIUcXKMVf0u5wD4EwsUNVvZOtUT7A2GkffHjByWpHqvRBYrTV72a6j8zZ6W0DTE86Hn04bmyWX3Ri9WH7ZU6Q7h+ZHo0nHUAcsQvVhXRDZHChwiyi/hnPuOsSEF6Exk3o6Y9DT1eZ+6cASXk2Y9k+6EOQMDGm6WBK10wOQJCBwren86cPPWUcRAnTVjGcU1LBgs9FURiX/e6479yZcLwCBmTxiawEwrOcleuu12t3tbLv/N4RLYIBhYexm7Fcn4OJcn0+zc+s8/VfPeddZHAGN6TT8eGczHdR/Gts1/MzDkThr23zqrVfAMFT33Nx1RJsx1k5zuWILLnG/vsH+Fv5D4NTVcp1Gzo8AAAAAElFTkSuQmCC&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU) 
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAF8AAABYCAMAAACkl9t/AAAAk1BMVEVHcEz/nQv/nQv/nQr/nQv/nQr/nQv/nQv/nQr/wRf/txT/pg7/yRr/rBD/zRz/ngv/oAz/zhz/nwv/txT/ngv/0B3+zBz/nQv/0h7/wxn/vRb/thXkuiT/rxH/pxD/ogzcqyf/nQvTlSz/czCxky7/SjifdjT/Mj3+Mj3wMj15aTnDNz+DSD9RTUBsP0FRO0Q6O0WyIxEIAAAAGHRSTlMADB8zSWF3krDDw8TJ1NbX5efv8ff9/fxKDJ9uAAAGKklEQVR42u2Z63qjOAyGC4RwCOfB2JAGqrSb2WnTw/1f3UaWcSGYNKTdf/P+mOkTrE+yJBulvfvLT2A5ruenaVHyIks33npl/6C4s/ZLAM45SOi/1FtZPyFur1OYofBX3w7d54Bxm+E8db+nDr12ttmESZ4zludJEG5S7TO72YPlKZFyE+YCYUJTBZsMiNS5Sd7NlDmKM2Eg2JQg8awbglfqgbhArjxkS7dgp2RH6hc9AMLdZYUtZN5DJr4molC8BfKrEkPKEnEVjLbgW1fLy77ZVOJagoIcLIl+IxaQZGjiX597HopF5CkaXVMDO9Pyix3AFV3kw4lQLCbHuMovz8FallbcQIJ5Ta0vks9RnolbCK84BtjKRS5uA43hYoZcOBGIG2Epbv6CvFVQ8m8loh66WNySsnN7htL58LNp+NXT8/PhXiBXPMjLSxtwp8W9f/1AngRierBkA+kk/IpUSOeKByzn8y3kAAAfh//0oXgV4roHm/kz4E2z//zRc3/lgwBzbM2mJxQEa5pqgX7d1L0htrhx7LKxOZlKbwcAWyEOWqYSI8YPtgDQVjpB5nvaHaSnBaQSD6hweDi8PosxD6/PT09YY3xQA7LTCTKfYX+QHpA0GCcqmEHvr/cyfKQTEuwgbs2kPxJEB0iNjfJcCTPyocx+A0griHSmADiC91oNGVwJ69RudYe65vJmoqfpul0lrqXadW0jFKH5BKwAeCq+Den7s+3zfRJzA61/Uj/9H/VzLKTx9jFPPdXeeP+L7WEvDLAKAIoF8bPTKT0+TM7W8ePj3Rz/Yn3kOAp2f1Kf0Weony7pn/cPydvhQYV+eFOfmOu7VB/ViPe34/EN3RFHY/yRuT8ddCtMPH/McBAT5s+vRde/gf2c/sPsjLK+m5IBQF5tO+h2tTlBGnP6693JdsvofjOPnnEHkh2TnV/X1fBl9S5zrwuwF8NFrAVJVwCAPTe8gaJlomqlp0pv4Pjn98tJ/t/fL++6unpR1YGC2n/KCoa0tTLoKiEeUPDl94nj+5/Tv3/eT5vBQ60X1S0oZr+IWRR8Ldhu7AlLjPISlJcO9vrFotky9SpzDequlwEir5beYAc0R7D9KS1DXva0jhYRDXoExPdc6yw5GShkZXe9QdO/uOvHofxjrV/TNS6iMJS+4TcSTgk9n5agJdBQbB//IfF/HpvPt3Tbi7b6I6K0R72p6ajryEJrENW2bbeVUGjfgoals4L443c7BEE4mJO2SpbRngxQrAKRudRzGQ8jVOL2qDVjjI8K1gc3TIJ5KiFZ1q+gdsARPB4NQS4AjwVSt72DSoXNyOWUrU5mQ9nRYyjp89Xo7oRI6Bga9QNT1mQ/ptaJq5T/7WcgAZywR/XlPGAUDdet3LE+qS0TI+g+aJU8MIqjo0Kx8Ly+maxLjJmjQ18rA0YCkxLQbUZP1WqdmyQGJLUm7VnQFqodmXSqmRrdVpqdzk5LvmvgtEcW8PMGdaS23EOWyDVbACZzUJPaqMbjDxpA3Qrgl0AikimGDbqmyT8P8NOYiqrldF8rX+YN7TopX4UoHuSCYY7cgX4gHwclQKl1zhx0THf+tCAUValzjI7Wg9EhptrkIcfIJjA94evOn8B2eHaVzvBrnl2ig0So6hvPaz0IGcOvTHvUIlE2+prqAxLSQxZlU2stql1NqCCLdIiIN/i1DBEHUoElM9dBravbiAnKqgpi4IBkw+utSPIoBijDXJipSVV7MpOEJUAc5Qmm3BnUN+w3hteEieYKfRZSIUcXKMVf0u5wD4EwsUNVvZOtUT7A2GkffHjByWpHqvRBYrTV72a6j8zZ6W0DTE86Hn04bmyWX3Ri9WH7ZU6Q7h+ZHo0nHUAcsQvVhXRDZHChwiyi/hnPuOsSEF6Exk3o6Y9DT1eZ+6cASXk2Y9k+6EOQMDGm6WBK10wOQJCBwren86cPPWUcRAnTVjGcU1LBgs9FURiX/e6479yZcLwCBmTxiawEwrOcleuu12t3tbLv/N4RLYIBhYexm7Fcn4OJcn0+zc+s8/VfPeddZHAGN6TT8eGczHdR/Gts1/MzDkThr23zqrVfAMFT33Nx1RJsx1k5zuWILLnG/vsH+Fv5D4NTVcp1Gzo8AAAAAElFTkSuQmCC&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
### 使用CPU快速体验
...@@ -212,10 +232,18 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
```json
{
    // other config
    "layout-config": {
        "model": "layoutlmv3" // 使用doclayout_yolo请修改为"doclayout_yolo"
    },
    "formula-config": {
        "mfd_model": "yolo_v8_mfd",
        "mfr_model": "unimernet_small",
        "enable": true // 公式识别功能默认是开启的,如果需要关闭请修改此处的值为"false"
    },
    "table-config": {
        "model": "tablemaster", // 使用structEqTable请修改为"struct_eqtable"
        "enable": false, // 表格识别功能默认是关闭的,如果需要开启请修改此处的值为"true"
        "max_time": 400
    }
}
```
...@@ -265,8 +293,8 @@ Options:
-l, --lang TEXT Input the languages in the pdf (if known) to
improve OCR accuracy. Optional. You should
input "Abbreviation" with language form url: ht
tps://paddlepaddle.github.io/PaddleOCR/latest/en
/ppocr/blog/multi_languages.html#5-support-languages-
and-abbreviations
-d, --debug BOOLEAN Enables detailed debugging information during
the execution of the CLI commands.
...@@ -290,7 +318,7 @@ magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
```text
├── some_pdf.md # markdown 文件
├── images # 存放图片目录
├── some_pdf_layout.pdf # layout 绘图 (包含layout阅读顺序)
├── some_pdf_middle.json # minerU 中间处理结果
├── some_pdf_model.json # 模型推理结果
├── some_pdf_origin.pdf # 原 pdf 文件
...@@ -335,29 +363,38 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
- [demo.py 最简单的处理方式](demo/demo.py)
- [magic_pdf_parse_main.py 能够更清晰看到处理流程](demo/magic_pdf_parse_main.py)
### 部署衍生项目
衍生项目包含项目开发者和社群开发者们基于MinerU的二次开发项目,
例如基于Gradio的应用界面、基于llama的RAG、官网同款web demo、轻量级的多卡负载均衡c/s端等,
这些项目可能会提供更多的功能和更好的用户体验。
具体部署方式请参考 [衍生项目readme](projects/README_zh-CN.md)
### 二次开发
TODO
# TODO
- 🗹 基于模型的阅读顺序
- 🗹 正文中目录、列表识别
- 🗹 表格识别
- ☐ 正文中代码块识别
- ☐ [化学式识别](docs/chemical_knowledge_introduction/introduction.pdf)
- ☐ 几何图形识别
# Known Issues
- 阅读顺序基于模型对可阅读内容在空间中的分布进行排序,在极端复杂的排版下可能会部分区域乱序
- 不支持竖排文字
- 目录和列表通过规则进行识别,少部分不常见的列表形式可能无法识别
- 标题只有一级,目前不支持标题分级
- 代码块在layout模型里还没有支持
- 漫画书、艺术图册、小学教材、习题尚不能很好解析
- 表格识别在复杂表格上可能会出现行/列识别错误
- 在小语种PDF上,OCR识别可能会出现字符不准确的情况(如拉丁文的重音符号、阿拉伯文易混淆字符等)
- 部分公式可能会无法在markdown中渲染
# FAQ
......
import os
from loguru import logger
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

try:
    current_script_dir = os.path.dirname(os.path.abspath(__file__))
    demo_name = "demo1"
    pdf_path = os.path.join(current_script_dir, f"{demo_name}.pdf")
    pdf_bytes = open(pdf_path, "rb").read()
    jso_useful_key = {"_pdf_type": "", "model_list": []}  # 空 model_list 时使用内置模型解析
    local_image_dir = os.path.join(current_script_dir, 'images')
    image_dir = str(os.path.basename(local_image_dir))
    image_writer = DiskReaderWriter(local_image_dir)
    pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
    pipe.pipe_classify()
    pipe.pipe_analyze()
    pipe.pipe_parse()
    md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
    with open(f"{demo_name}.md", "w", encoding="utf-8") as f:
......
...@@ -4,13 +4,12 @@ import copy
from loguru import logger
from magic_pdf.libs.draw_bbox import draw_layout_bbox, draw_span_bbox
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.pipe.OCRPipe import OCRPipe
from magic_pdf.pipe.TXTPipe import TXTPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# todo: 设备类型选择 (?)
...@@ -47,11 +46,20 @@ def json_md_dump(
)
# 可视化
def draw_visualization_bbox(pdf_info, pdf_bytes, local_md_dir, pdf_file_name):
    # 画布局框,附带排序结果
    draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir, pdf_file_name)
    # 画 span 框
    draw_span_bbox(pdf_info, pdf_bytes, local_md_dir, pdf_file_name)
def pdf_parse_main(
        pdf_path: str,
        parse_method: str = 'auto',
        model_json_path: str = None,
        is_json_md_dump: bool = True,
        is_draw_visualization_bbox: bool = True,
        output_dir: str = None
):
    """
...@@ -108,11 +116,7 @@ def pdf_parse_main(
        # 如果没有传入模型数据,则使用内置模型解析
        if not model_json:
            pipe.pipe_analyze()  # 解析

        # 执行解析
        pipe.pipe_parse()
...@@ -121,10 +125,11 @@ def pdf_parse_main( ...@@ -121,10 +125,11 @@ def pdf_parse_main(
content_list = pipe.pipe_mk_uni_format(image_path_parent, drop_mode="none") content_list = pipe.pipe_mk_uni_format(image_path_parent, drop_mode="none")
md_content = pipe.pipe_mk_markdown(image_path_parent, drop_mode="none") md_content = pipe.pipe_mk_markdown(image_path_parent, drop_mode="none")
if is_json_md_dump: if is_json_md_dump:
json_md_dump(pipe, md_writer, pdf_name, content_list, md_content) json_md_dump(pipe, md_writer, pdf_name, content_list, md_content)
if is_draw_visualization_bbox:
draw_visualization_bbox(pipe.pdf_mid_data['pdf_info'], pdf_bytes, output_path, pdf_name)
except Exception as e: except Exception as e:
logger.exception(e) logger.exception(e)
...@@ -132,5 +137,5 @@ def pdf_parse_main( ...@@ -132,5 +137,5 @@ def pdf_parse_main(
# 测试 # 测试
if __name__ == '__main__': if __name__ == '__main__':
pdf_path = r"C:\Users\XYTK2\Desktop\2024-2016-gb-cd-300.pdf" pdf_path = r"D:\project\20240617magicpdf\Magic-PDF\demo\demo1.pdf"
pdf_parse_main(pdf_path) pdf_parse_main(pdf_path)
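For illustration only (not part of the diff), this is how the updated entry point could be called with the new flag; the argument values are placeholders, while the parameter names are taken from the signature shown above:

```python
# Hypothetical invocation, e.g. at the bottom of the script above; paths are placeholders.
pdf_parse_main(
    pdf_path="demo1.pdf",
    parse_method="auto",               # "auto" / "txt" / "ocr"
    model_json_path=None,              # None -> analyze with the built-in models
    is_json_md_dump=True,              # write the JSON / Markdown outputs
    is_draw_visualization_bbox=True,   # new flag: draw layout and span bounding boxes
    output_dir="./output",
)
```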
@@ -8,6 +8,8 @@ nvidia-smi

If you see information similar to the following, it means that the NVIDIA drivers are already installed, and you can skip Step 2.

+Notice: `CUDA Version` should be >= 12.1. If the displayed version number is lower than 12.1, please upgrade the driver.

```plaintext
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.34 Driver Version: 537.34 CUDA Version: 12.2 |

@@ -95,8 +97,6 @@ magic-pdf -p small_ocr.pdf

If your graphics card has at least **8GB** of VRAM, follow these steps to test CUDA acceleration:

-> ❗ Due to the extremely limited nature of 8GB VRAM for running this application, you need to close all other programs using VRAM to ensure that 8GB of VRAM is available when running this application.

1. Modify the value of `"device-mode"` in the `magic-pdf.json` configuration file located in your home directory.

```json
{
...
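As an aside (not part of the diff), step 1 above can also be done programmatically. A minimal sketch, assuming `magic-pdf.json` already exists in the home directory and that `"cuda"` is the accepted value for GPU mode:

```python
import json
import os

# Patch the user-level magic-pdf.json so that parsing runs on the GPU.
config_path = os.path.join(os.path.expanduser("~"), "magic-pdf.json")

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

config["device-mode"] = "cuda"  # assumption: "cpu" is the default, "cuda" enables GPU acceleration

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=4)
```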
@@ -8,6 +8,9 @@ nvidia-smi

If you see information similar to the following, the NVIDIA driver is already installed and you can skip Step 2.

+Note: the `CUDA Version` shown should be >= 12.1; if the displayed version number is lower than 12.1, please upgrade the driver.

```plaintext
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.34 Driver Version: 537.34 CUDA Version: 12.2 |

@@ -95,8 +98,6 @@ magic-pdf -p small_ocr.pdf

If your graphics card has at least **8GB** of VRAM, you can follow the steps below to test the CUDA parsing acceleration.

-> ❗️ Because 8GB of VRAM is extremely tight for running this application, you need to close all other programs that use VRAM to make sure a full 8GB is available while it runs.

**1. Modify the value of "device-mode" in the magic-pdf.json configuration file in your user directory**

```json
...
@@ -60,8 +60,6 @@ Download a sample file from the repository and test it.

If your graphics card has at least 8GB of VRAM, follow these steps to test CUDA-accelerated parsing performance.

-> ❗ Due to the extremely limited nature of 8GB VRAM for running this application, you need to close all other programs using VRAM to ensure that 8GB of VRAM is available when running this application.

1. **Overwrite the installation of torch and torchvision** supporting CUDA.

```
...
@@ -61,8 +61,6 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h

If your graphics card has at least **8GB** of VRAM, you can follow the steps below to test the CUDA parsing acceleration.

-> ❗️ Because 8GB of VRAM is extremely tight for running this application, you need to close all other programs that use VRAM to make sure a full 8GB is available while it runs.

**1. Overwrite-install the CUDA-enabled torch and torchvision**

```bash
...
@@ -5,16 +5,21 @@ import requests

from modelscope import snapshot_download


+def download_json(url):
+    # download the JSON file
+    response = requests.get(url)
+    response.raise_for_status()  # check whether the request succeeded
+    return response.json()


def download_and_modify_json(url, local_filename, modifications):
    if os.path.exists(local_filename):
        data = json.load(open(local_filename))
+        config_version = data.get('config_version', '0.0.0')
+        if config_version < '1.0.0':
+            data = download_json(url)
    else:
-        # download the JSON file
-        response = requests.get(url)
-        response.raise_for_status()  # check whether the request succeeded
-        # parse the JSON content
-        data = response.json()
+        data = download_json(url)

    # apply the modifications
    for key, value in modifications.items():

@@ -26,13 +31,21 @@ def download_and_modify_json(url, local_filename, modifications):

if __name__ == '__main__':
-    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit')
+    mineru_patterns = [
+        "models/Layout/LayoutLMv3/*",
+        "models/Layout/YOLO/*",
+        "models/MFD/YOLO/*",
+        "models/MFR/unimernet_small/*",
+        "models/TabRec/TableMaster/*",
+        "models/TabRec/StructEqTable/*",
+    ]
+    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)
    layoutreader_model_dir = snapshot_download('ppaanngggg/layoutreader')
    model_dir = model_dir + '/models'
    print(f'model_dir is: {model_dir}')
    print(f'layoutreader_model_dir is: {layoutreader_model_dir}')

-    json_url = 'https://gitee.com/myhloli/MinerU/raw/master/magic-pdf.template.json'
+    json_url = 'https://gitee.com/myhloli/MinerU/raw/dev/magic-pdf.template.json'
    config_file_name = 'magic-pdf.json'
    home_dir = os.path.expanduser('~')
    config_file = os.path.join(home_dir, config_file_name)
...
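For context (the call itself falls in the truncated part of the hunk above), the `modifications` passed to `download_and_modify_json` point the configuration at the freshly downloaded directories. The key names below (`models-dir`, `layoutreader-model-dir`) are assumptions based on the `magic-pdf.json` template, not confirmed by this diff:

```python
# Hypothetical tail of the script: write the downloaded paths into ~/magic-pdf.json.
modifications = {
    'models-dir': model_dir,                           # assumed key name
    'layoutreader-model-dir': layoutreader_model_dir,  # assumed key name
}
download_and_modify_json(json_url, config_file, modifications)
print(f'config written to: {config_file}')
```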
@@ -5,16 +5,21 @@ import requests

from huggingface_hub import snapshot_download


+def download_json(url):
+    # download the JSON file
+    response = requests.get(url)
+    response.raise_for_status()  # check whether the request succeeded
+    return response.json()


def download_and_modify_json(url, local_filename, modifications):
    if os.path.exists(local_filename):
        data = json.load(open(local_filename))
+        config_version = data.get('config_version', '0.0.0')
+        if config_version < '1.0.0':
+            data = download_json(url)
    else:
-        # download the JSON file
-        response = requests.get(url)
-        response.raise_for_status()  # check whether the request succeeded
-        # parse the JSON content
-        data = response.json()
+        data = download_json(url)

    # apply the modifications
    for key, value in modifications.items():

@@ -26,13 +31,28 @@ def download_and_modify_json(url, local_filename, modifications):

if __name__ == '__main__':
-    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit')
-    layoutreader_model_dir = snapshot_download('hantian/layoutreader')
+    mineru_patterns = [
+        "models/Layout/LayoutLMv3/*",
+        "models/Layout/YOLO/*",
+        "models/MFD/YOLO/*",
+        "models/MFR/unimernet_small/*",
+        "models/TabRec/TableMaster/*",
+        "models/TabRec/StructEqTable/*",
+    ]
+    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)
+
+    layoutreader_pattern = [
+        "*.json",
+        "*.safetensors",
+    ]
+    layoutreader_model_dir = snapshot_download('hantian/layoutreader', allow_patterns=layoutreader_pattern)
+
    model_dir = model_dir + '/models'
    print(f'model_dir is: {model_dir}')
    print(f'layoutreader_model_dir is: {layoutreader_model_dir}')

-    json_url = 'https://github.com/opendatalab/MinerU/raw/master/magic-pdf.template.json'
+    json_url = 'https://github.com/opendatalab/MinerU/raw/dev/magic-pdf.template.json'
    config_file_name = 'magic-pdf.json'
    home_dir = os.path.expanduser('~')
    config_file = os.path.join(home_dir, config_file_name)
...
@@ -22,7 +22,9 @@ The configuration file can be found in the user directory, with the filename `ma

> Due to feedback from some users that downloading model files using git lfs was incomplete or resulted in corrupted model files, this method is no longer recommended.

-If you previously downloaded model files via git lfs, you can navigate to the previous download directory and use the `git pull` command to update the model.
+For magic-pdf <= 0.8.1: if you previously downloaded the model files via git lfs, you can navigate to the previous download directory and update the models with the `git pull` command.

+> For versions 0.9.x and later, because PDF-Extract-Kit 1.0 moved to a new repository and added a layout sorting model, the models can no longer be updated with `git pull`; use the Python download script for a one-click update instead.

## 2. Models downloaded via Hugging Face or Model Scope
...
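Purely as an illustrative note (not part of the diff): the one-click update for 0.9.x amounts to re-running the download script shown earlier in this commit; because `snapshot_download` caches previously fetched files, only new or changed model files are downloaded. A minimal sketch, assuming the Hugging Face repositories named in this commit:

```python
from huggingface_hub import snapshot_download

# Re-running snapshot_download refreshes the local cache incrementally:
# files already present are reused, new files (e.g. the layout sorting model) are fetched.
model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0')
layoutreader_model_dir = snapshot_download('hantian/layoutreader')
print(model_dir, layoutreader_model_dir)
```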
@@ -34,14 +34,10 @@ The Python script automatically downloads the model files and sets the model directory in the configuration file

> Since some users reported incomplete downloads and corrupted model files when downloading via git lfs, that download method is no longer recommended.

-If you previously downloaded the model files via git lfs, you can go to the previous download directory and update the models with the `git pull` command.
+For magic-pdf <= 0.8.1: if you previously downloaded the model files via git lfs, you can go to the previous download directory and update the models with the `git pull` command.

-> For 0.9.x and later, because a layout sorting model was added and it does not live in the same repository as the earlier models, it cannot be updated with `git pull` and has to be downloaded separately.
->
-> ```
-> from modelscope import snapshot_download
-> snapshot_download('ppaanngggg/layoutreader')
-> ```
+> For 0.9.x and later, because PDF-Extract-Kit 1.0 moved to a new repository and added a layout sorting model, the models cannot be updated with `git pull`; use the Python script for a one-click update instead.

## 2. Models previously downloaded via Hugging Face or Model Scope
...