<div align="center" xmlns="http://www.w3.org/1999/html">
<!-- logo -->
<p align="center">
  <img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
</p>

<!-- icon -->

[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![PyPI version](https://img.shields.io/pypi/v/mineru)](https://pypi.org/project/mineru/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mineru)](https://pypi.org/project/mineru/)
[![Downloads](https://static.pepy.tech/badge/mineru)](https://pepy.tech/project/mineru)
[![Downloads](https://static.pepy.tech/badge/mineru/month)](https://pepy.tech/project/mineru)
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github)
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/myhloli/3b3a00a4a0a61577b6c30f989092d20d/mineru_demo.ipynb)
[![arXiv](https://img.shields.io/badge/arXiv-2409.18839-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2409.18839)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/opendatalab/MinerU)


<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

<!-- language -->

[English](README.md) | [简体中文](README_zh-CN.md)

<!-- hot link -->

<p align="center">
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: High-Quality PDF Extraction Toolkit</a>🔥🔥🔥
<br>
<br>
🚀<a href="https://mineru.net/?source=github">Access MinerU Now→✅ Zero-Install Web Version ✅ Full-Featured Desktop Client ✅ Instant API Access; Skip deployment headaches – get all product formats in one click. Developers, dive in!</a>
</p>

<!-- join us -->

<p align="center">
    👋 Join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="http://mineru.space/s/V85Yl" target="_blank">WeChat</a>
</p>

</div>

# Changelog

- 2025/07/05 Version 2.1.0 Released
  - This is the first major update to MinerU 2. It brings many new features and improvements, including significant performance optimizations, user experience enhancements, and bug fixes. Details are as follows:
  - **Performance Optimizations:**
    - Significantly improved preprocessing speed for documents with specific resolutions (around 2000 pixels on the long side).
    - Greatly improved post-processing speed when the `pipeline` backend batch-processes documents with few pages (<10 pages).
    - Increased the layout analysis speed of the `pipeline` backend by approximately 20%.
  - **Experience Enhancements:**
    - Added a built-in, ready-to-use `fastapi service` and `gradio webui`. For detailed usage instructions, please refer to [Documentation](#3-api-calls-or-visual-invocation).
    - Adapted to `sglang` version `0.4.8`, significantly reducing the GPU memory requirements for the `vlm-sglang` backend. It can now run on graphics cards with as little as `8GB GPU memory` (Turing architecture or newer).
    - Added transparent parameter passing for all `sglang`-related commands, so the `sglang-engine` backend accepts the same `sglang` parameters as the `sglang-server`.
    - Added support for feature extensions based on configuration files, including `custom formula delimiters`, `enabling heading classification`, and `customizing local model directories`. For detailed usage instructions, please refer to [Documentation](#4-extending-mineru-functionality-through-configuration-files).
  - **New Features:**
    - Updated the `pipeline` backend with the PP-OCRv5 multilingual text recognition model, supporting text recognition in 37 languages such as French, Spanish, Portuguese, Russian, and Korean, with an average accuracy improvement of over 30%. [Details](https://paddlepaddle.github.io/PaddleOCR/latest/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html)
    - Introduced limited support for vertical text layout in the `pipeline` backend.
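
As an illustration of the configuration-file extensions listed above, the following Python sketch fills in a custom formula delimiter section of a configuration dictionary. The section name `latex-delimiter-config` is taken from the 1.3.10 release notes; the nested `display`/`inline` keys with `left`/`right` fields and the helper name `with_formula_delimiters` are assumptions made for illustration, so check the linked documentation for the actual schema before relying on them.

```python
import copy
import json


def with_formula_delimiters(config, display=("$$", "$$"), inline=("$", "$")):
    """Return a copy of `config` with a formula-delimiter section set.

    The nested key layout is an assumption, not a confirmed MinerU schema.
    """
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    updated["latex-delimiter-config"] = {
        "display": {"left": display[0], "right": display[1]},
        "inline": {"left": inline[0], "right": inline[1]},
    }
    return updated


if __name__ == "__main__":
    # Example: switch to \[ ... \] for display and \( ... \) for inline formulas.
    base = {"some-existing-option": True}  # hypothetical existing setting
    new_config = with_formula_delimiters(base, ("\\[", "\\]"), ("\\(", "\\)"))
    print(json.dumps(new_config, indent=2))
```

The helper returns a new dictionary instead of editing a file in place, so the result can be reviewed before it is written to the configuration file.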

<details>
  <summary>History Log</summary>
  <details>
    <summary>2025/06/20 2.0.6 Released</summary>
    <ul>
      <li>Fixed occasional parsing interruptions caused by invalid block content in <code>vlm</code> mode</li>
      <li>Fixed parsing interruptions caused by incomplete table structures in <code>vlm</code> mode</li>
    </ul>
  </details>
  
  <details>
    <summary>2025/06/17 2.0.5 Released</summary>
    <ul>
      <li>Fixed the issue where models were still required to be downloaded in the <code>sglang-client</code> mode</li>
      <li>Fixed the issue where the <code>sglang-client</code> mode unnecessarily depended on packages like <code>torch</code> during runtime.</li>
      <li>Fixed the issue where only the first instance would take effect when attempting to launch multiple <code>sglang-client</code> instances via multiple URLs within the same process</li>
    </ul>
  </details>
  
  <details>
    <summary>2025/06/15 2.0.3 released</summary>
    <ul>
      <li>Fixed a configuration file key-value update error that occurred when the model download type was set to <code>all</code></li>
      <li>Fixed the issue where the formula and table feature toggle switches were not working in <code>command line mode</code>, causing the features to remain enabled.</li>
      <li>Fixed compatibility issues with sglang version 0.4.7 in the <code>sglang-engine</code> mode.</li>
      <li>Updated the Dockerfile and installation documentation for deploying the full version of MinerU in an sglang environment</li>
    </ul>
  </details>
  
  <details>
    <summary>2025/06/13 2.0.0 Released</summary>
    <ul>
      <li><strong>New Architecture</strong>: MinerU 2.0 has been deeply restructured in code organization and interaction methods, significantly improving system usability, maintainability, and extensibility.
        <ul>
          <li><strong>Removal of Third-party Dependency Limitations</strong>: Completely eliminated the dependency on <code>pymupdf</code>, moving the project toward a more open and compliant open-source direction.</li>
          <li><strong>Ready-to-use, Easy Configuration</strong>: No need to manually edit JSON configuration files; most parameters can now be set directly via command line or API.</li>
          <li><strong>Automatic Model Management</strong>: Added automatic model download and update mechanisms, allowing users to complete model deployment without manual intervention.</li>
          <li><strong>Offline Deployment Friendly</strong>: Provides built-in model download commands, supporting deployment requirements in completely offline environments.</li>
          <li><strong>Streamlined Code Structure</strong>: Removed thousands of lines of redundant code, simplified class inheritance logic, significantly improving code readability and development efficiency.</li>
          <li><strong>Unified Intermediate Format Output</strong>: Adopted a standardized <code>middle_json</code> format that stays compatible with most secondary development built on this format, so existing ecosystem integrations can migrate smoothly.</li>
        </ul>
      </li>
      <li><strong>New Model</strong>: MinerU 2.0 integrates our latest small-parameter, high-performance multimodal document parsing model, achieving end-to-end high-speed, high-precision document understanding.
        <ul>
          <li><strong>Small Model, Big Capabilities</strong>: With fewer than 1B parameters, the model surpasses traditional 72B-level vision-language models (VLMs) in parsing accuracy.</li>
          <li><strong>Multiple Functions in One</strong>: A single model covers multilingual recognition, handwriting recognition, layout analysis, table parsing, formula recognition, reading order sorting, and other core tasks.</li>
          <li><strong>Ultimate Inference Speed</strong>: Achieves peak throughput exceeding 10,000 tokens/s through <code>sglang</code> acceleration on a single NVIDIA 4090 card, easily handling large-scale document processing requirements.</li>
          <li><strong>Online Experience</strong>: You can experience our brand-new VLM model on <a href="https://mineru.net/OpenSourceTools/Extractor">MinerU.net</a>, <a href="https://huggingface.co/spaces/opendatalab/MinerU">Hugging Face</a>, and <a href="https://www.modelscope.cn/studios/OpenDataLab/MinerU">ModelScope</a>.</li>
        </ul>
      </li>
      <li><strong>Incompatible Changes Notice</strong>: To improve overall architectural rationality and long-term maintainability, this version contains some incompatible changes:
        <ul>
          <li>Python package name changed from <code>magic-pdf</code> to <code>mineru</code>, and the command-line tool changed from <code>magic-pdf</code> to <code>mineru</code>. Please update your scripts and command calls accordingly.</li>
          <li>For the sake of modular system design and ecosystem consistency, MinerU 2.0 no longer includes the LibreOffice document conversion module. To process Office documents, we recommend converting them to PDF through an independently deployed LibreOffice service before parsing.</li>
        </ul>
      </li>
    </ul>
  </details>
  <details>
  <summary>2025/05/24 Release 1.3.12</summary>
  <ul>
      <li>Added support for PPOCRv5 models, updated <code>ch_server</code> model to <code>PP-OCRv5_rec_server</code>, and <code>ch_lite</code> model to <code>PP-OCRv5_rec_mobile</code> (model update required)
        <ul>
          <li>In testing, we found that PPOCRv5(server) has some improvement for handwritten documents, but has slightly lower accuracy than v4_server_doc for other document types, so the default ch model remains unchanged as <code>PP-OCRv4_server_rec_doc</code>.</li>
          <li>Since PPOCRv5 has enhanced recognition capabilities for handwriting and special characters, you can manually choose the PPOCRv5 model for Japanese-Traditional Chinese mixed scenarios and handwritten documents</li>
          <li>You can select the appropriate model through the lang parameter <code>lang='ch_server'</code> (Python API) or <code>--lang ch_server</code> (command line):
            <ul>
              <li><code>ch</code>: <code>PP-OCRv4_server_rec_doc</code> (default) (Chinese/English/Japanese/Traditional Chinese mixed/15K dictionary)</li>
              <li><code>ch_server</code>: <code>PP-OCRv5_rec_server</code> (Chinese/English/Japanese/Traditional Chinese mixed + handwriting/18K dictionary)</li>
              <li><code>ch_lite</code>: <code>PP-OCRv5_rec_mobile</code> (Chinese/English/Japanese/Traditional Chinese mixed + handwriting/18K dictionary)</li>
              <li><code>ch_server_v4</code>: <code>PP-OCRv4_rec_server</code> (Chinese/English mixed/6K dictionary)</li>
              <li><code>ch_lite_v4</code>: <code>PP-OCRv4_rec_mobile</code> (Chinese/English mixed/6K dictionary)</li>
            </ul>
          </li>
        </ul>
      </li>
      <li>Added support for handwritten documents through optimized layout recognition of handwritten text areas
        <ul>
          <li>This feature is supported by default, no additional configuration required</li>
          <li>You can refer to the instructions above to manually select the PPOCRv5 model for better handwritten document parsing results</li>
        </ul>
      </li>
      <li>The <code>huggingface</code> and <code>modelscope</code> demos have been updated to versions that support handwriting recognition and PPOCRv5 models, which you can experience online</li>
  </ul>
  </details>
  
  <details>
  <summary>2025/04/29 Release 1.3.10</summary>
  <ul>
      <li>Added support for custom formula delimiters, which can be configured by modifying the <code>latex-delimiter-config</code> section in the <code>magic-pdf.json</code> file in your user directory.</li>
  </ul>
  </details>
  
  <details>
  <summary>2025/04/27 Release 1.3.9</summary>
  <ul>
      <li>Optimized formula parsing functionality, improved formula rendering success rate</li>
  </ul>
  </details>
  
  <details>
  <summary>2025/04/23 Release 1.3.8</summary>
  <ul>
      <li>The default <code>ocr</code> model (<code>ch</code>) has been updated to <code>PP-OCRv4_server_rec_doc</code> (model update required)
        <ul>
          <li><code>PP-OCRv4_server_rec_doc</code> builds on <code>PP-OCRv4_server_rec</code> and is trained on a mixture of additional Chinese document data and PP-OCR training data, adding recognition capabilities for some traditional Chinese characters, Japanese, and special characters. It can recognize over 15,000 characters and improves both document-specific and general text recognition abilities.</li>
          <li><a href="https://paddlepaddle.github.io/PaddleX/latest/module_usage/tutorials/ocr_modules/text_recognition.html#_3">Performance comparison of PP-OCRv4_server_rec_doc/PP-OCRv4_server_rec/PP-OCRv4_mobile_rec</a></li>
          <li>After verification, the <code>PP-OCRv4_server_rec_doc</code> model shows significant accuracy improvements in Chinese/English/Japanese/Traditional Chinese in both single language and mixed language scenarios, with comparable speed to <code>PP-OCRv4_server_rec</code>, making it suitable for most use cases.</li>
          <li>In some pure English scenarios, <code>PP-OCRv4_server_rec_doc</code> may have word adhesion issues, while <code>PP-OCRv4_server_rec</code> performs better in these cases. Therefore, we've kept the <code>PP-OCRv4_server_rec</code> model, which users can access by adding the parameter <code>lang='ch_server'</code> (Python API) or <code>--lang ch_server</code> (command line).</li>
        </ul>
      </li>
  </ul>
  </details>
  
  <details>
  <summary>2025/04/22 Release 1.3.7</summary>
  <ul>
      <li>Fixed the issue where the lang parameter was ineffective during table parsing model initialization</li>
      <li>Fixed the significant speed reduction of OCR and table parsing in <code>cpu</code> mode</li>
  </ul>
  </details>
  
  <details>
  <summary>2025/04/16 Release 1.3.4</summary>
  <ul>
      <li>Slightly improved OCR-det speed by removing some unnecessary blocks</li>
      <li>Fixed page-internal sorting errors caused by footnotes in certain cases</li>
  </ul>
  </details>
  
  <details>
  <summary>2025/04/12 Release 1.3.2</summary>
  <ul>
      <li>Fixed dependency version incompatibility issues when installing on Windows with Python 3.13</li>
      <li>Optimized memory usage during batch inference</li>
      <li>Improved parsing of tables rotated 90 degrees</li>
      <li>Enhanced parsing of oversized tables in financial report samples</li>
      <li>Fixed the occasional word adhesion issue in English text areas when OCR language is not specified (model update required)</li>
  </ul>
  </details>
  
  <details>
  <summary>2025/04/08 Release 1.3.1</summary>
  <ul>
      <li>Fixed several compatibility issues
        <ul>
          <li>Added support for Python 3.13</li>
          <li>Made final adaptations for outdated Linux systems (such as CentOS 7); continued support in future versions is not guaranteed. See the <a href="https://github.com/opendatalab/MinerU/issues/1004">installation instructions</a>.</li>
        </ul>
      </li>
  </ul>
  </details>
  
  <details>
  <summary>2025/04/03 Release 1.3.0</summary>
  <ul>
      <li>Installation and compatibility optimizations
        <ul>
          <li>Resolved compatibility issues caused by <code>detectron2</code> by removing <code>layoutlmv3</code> usage in layout</li>
          <li>Extended torch version compatibility to 2.2~2.6 (excluding 2.5)</li>
          <li>Added CUDA compatibility for versions 11.8/12.4/12.6/12.8 (CUDA version determined by torch), solving compatibility issues for users with 50-series and H-series GPUs</li>
          <li>Extended Python compatibility to versions 3.10~3.12, fixing the issue of automatic downgrade to version 0.6.1 when installing in non-3.10 environments</li>
          <li>Optimized offline deployment process, eliminating the need to download any model files after successful deployment</li>
        </ul>
      </li>
      <li>Performance optimizations
        <ul>
          <li>Enhanced parsing speed for batches of small files by supporting batch processing of multiple PDF files (<a href="demo/batch_demo.py">script example</a>), with formula parsing speed improved by up to 1400% and overall parsing speed improved by up to 500% compared to version 1.0.1</li>
          <li>Reduced memory usage and improved parsing speed by optimizing MFR model loading and usage (requires re-running the <a href="docs/how_to_download_models_zh_cn.md">model download process</a> to get incremental updates to model files)</li>
          <li>Optimized GPU memory usage, requiring only 6GB minimum to run this project</li>
          <li>Improved running speed on MPS devices</li>
        </ul>
      </li>
      <li>Parsing effect optimizations
        <ul>
          <li>Updated MFR model to <code>unimernet(2503)</code>, fixing line break loss issues in multi-line formulas</li>
        </ul>
      </li>
      <li>Usability optimizations
        <ul>
          <li>Completely replaced the <code>paddle</code> framework and <code>paddleocr</code> in the project by using <code>paddleocr2torch</code>, resolving conflicts between <code>paddle</code> and <code>torch</code>, as well as thread safety issues caused by the <code>paddle</code> framework</li>
          <li>Added real-time progress bar display during parsing, allowing precise tracking of parsing progress and making the waiting process more bearable</li>
        </ul>
      </li>
  </ul>
  </details>
  <details>
  <summary>2025/03/03 1.2.1 released</summary>
  <ul>
    <li>Fixed the impact on punctuation marks during full-width to half-width conversion of letters and numbers</li>
    <li>Fixed caption matching inaccuracies in certain scenarios</li>
    <li>Fixed formula span loss issues in certain scenarios</li>
  </ul>
  </details>
  
  <details>
  <summary>2025/02/24 1.2.0 released</summary>
  <p>This version includes several fixes and improvements to enhance parsing efficiency and accuracy:</p>
  <ul>
    <li><strong>Performance Optimization</strong>
      <ul>
        <li>Increased classification speed for PDF documents in auto mode.</li>
      </ul>
    </li>
    <li><strong>Parsing Optimization</strong>
      <ul>
        <li>Improved parsing logic for documents containing watermarks, significantly enhancing the parsing results for such documents.</li>
        <li>Enhanced the matching logic for multiple images/tables and captions within a single page, improving the accuracy of image-text matching in complex layouts.</li>
      </ul>
    </li>
    <li><strong>Bug Fixes</strong>
      <ul>
        <li>Fixed an issue where image/table spans were incorrectly filled into text blocks under certain conditions.</li>
        <li>Resolved an issue where title blocks were empty in some cases.</li>
      </ul>
    </li>
  </ul>
  </details>
  
  <details>
  <summary>2025/01/22 1.1.0 released</summary>
  <p>In this version we have focused on improving parsing accuracy and efficiency:</p>
  <ul>
    <li><strong>Model capability upgrade</strong> (requires re-executing the <a href="https://github.com/opendatalab/MinerU/blob/master/docs/how_to_download_models_en.md">model download process</a> to obtain incremental updates of model files)
      <ul>
        <li>The layout recognition model has been upgraded to the latest <code>doclayout_yolo(2501)</code> model, improving layout recognition accuracy.</li>
        <li>The formula parsing model has been upgraded to the latest <code>unimernet(2501)</code> model, improving formula recognition accuracy.</li>
      </ul>
    </li>
    <li><strong>Performance optimization</strong>
      <ul>
        <li>On devices that meet certain configuration requirements (16GB+ VRAM), by optimizing resource usage and restructuring the processing pipeline, overall parsing speed has been increased by more than 50%.</li>
      </ul>
    </li>
    <li><strong>Parsing effect optimization</strong>
      <ul>
        <li>Added a new heading classification feature (testing version, enabled by default) to the online demo (<a href="https://mineru.net/OpenSourceTools/Extractor">mineru.net</a>/<a href="https://huggingface.co/spaces/opendatalab/MinerU">huggingface</a>/<a href="https://www.modelscope.cn/studios/OpenDataLab/MinerU">modelscope</a>), which supports hierarchical classification of headings, thereby enhancing document structuring.</li>
      </ul>
    </li>
  </ul>
  </details>
  
  <details>
  <summary>2025/01/10 1.0.1 released</summary>
  <p>This is our first official release, introducing a completely new API, enhanced compatibility through extensive refactoring, and a brand-new automatic language identification feature:</p>
  <ul>
    <li><strong>New API Interface</strong>
      <ul>
        <li>For the data-side API, we have introduced the Dataset class, designed to provide a robust and flexible data processing framework. This framework currently supports a variety of document formats, including images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx). It ensures effective support for data processing tasks ranging from simple to complex.</li>
        <li>For the user-side API, we have meticulously designed the MinerU processing workflow as a series of composable Stages. Each Stage represents a specific processing step, allowing users to define new Stages according to their needs and creatively combine these stages to customize their data processing workflows.</li>
      </ul>
    </li>
    <li><strong>Enhanced Compatibility</strong>
      <ul>
        <li>By optimizing the dependency environment and configuration items, we ensure stable and efficient operation on ARM architecture Linux systems.</li>
        <li>We have deeply integrated with Huawei Ascend NPU acceleration, providing autonomous and controllable high-performance computing capabilities. This supports the localization and development of AI application platforms in China. <a href="https://github.com/opendatalab/MinerU/blob/master/docs/README_Ascend_NPU_Acceleration_zh_CN.md">Ascend NPU Acceleration</a></li>
      </ul>
    </li>
    <li><strong>Automatic Language Identification</strong>
      <ul>
        <li>Introduced a new language recognition model: setting the <code>lang</code> configuration to <code>auto</code> during document parsing automatically selects the appropriate OCR language model, improving the accuracy of scanned document parsing.</li>
      </ul>
    </li>
  </ul>
  </details>
  
  <details>
  <summary>2024/11/22 0.10.0 released</summary>
  <p>Introducing hybrid OCR text extraction capabilities:</p>
  <ul>
    <li>Significantly improved parsing performance in complex text distribution scenarios such as dense formulas, irregular span regions, and text represented by images.</li>
    <li>Combines the dual advantages of accurate content extraction and faster speed in text mode, and more precise span/line region recognition in OCR mode.</li>
  </ul>
  </details>
  
  <details>
  <summary>2024/11/15 0.9.3 released</summary>
  <p>Integrated <a href="https://github.com/RapidAI/RapidTable">RapidTable</a> for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.</p>
  </details>
  
  <details>
  <summary>2024/11/06 0.9.2 released</summary>
  <p>Integrated the <a href="https://huggingface.co/U4R/StructTable-InternVL2-1B">StructTable-InternVL2-1B</a> model for table recognition functionality.</p>
  </details>
  
  <details>
  <summary>2024/10/31 0.9.0 released</summary>
  <p>This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:</p>
  <ul>
    <li>Refactored the sorting module code to use <a href="https://github.com/ppaanngggg/layoutreader">layoutreader</a> for reading order sorting, ensuring high accuracy in various layouts.</li>
    <li>Refactored the paragraph concatenation module to achieve good results in cross-column, cross-page, cross-figure, and cross-table scenarios.</li>
    <li>Refactored the list and table of contents recognition functions, significantly improving the accuracy of list blocks and table of contents blocks, as well as the parsing of corresponding text paragraphs.</li>
    <li>Refactored the matching logic for figures, tables, and descriptive text, greatly enhancing the accuracy of matching captions and footnotes to figures and tables, and reducing the loss rate of descriptive text to near zero.</li>
    <li>Added multi-language support for OCR, supporting detection and recognition of 84 languages. For the list of supported languages, see <a href="https://paddlepaddle.github.io/PaddleOCR/latest/en/ppocr/blog/multi_languages.html#5-support-languages-and-abbreviations">OCR Language Support List</a>.</li>
    <li>Added memory recycling logic and other memory optimization measures, significantly reducing memory usage. The memory requirement for enabling all acceleration features except table acceleration (layout/formula/OCR) has been reduced from 16GB to 8GB, and the memory requirement for enabling all acceleration features has been reduced from 24GB to 10GB.</li>
    <li>Optimized configuration file feature switches, adding an independent formula detection switch to significantly improve speed and parsing results when formula detection is not needed.</li>
    <li>Integrated <a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit 1.0</a>:
      <ul>
        <li>Added the self-developed <code>doclayout_yolo</code> model, which speeds up processing by more than 10 times compared to the original solution while maintaining similar parsing effects, and can be freely switched with <code>layoutlmv3</code> via the configuration file.</li>
        <li>Upgraded formula parsing to <code>unimernet 0.2.1</code>, improving formula parsing accuracy while significantly reducing memory usage.</li>
        <li>Due to the repository change for <code>PDF-Extract-Kit 1.0</code>, you need to re-download the model. Please refer to <a href="https://github.com/opendatalab/MinerU/blob/master/docs/how_to_download_models_en.md">How to Download Models</a> for detailed steps.</li>
      </ul>
    </li>
  </ul>
  </details>
  
  <details>
  <summary>2024/09/27 Version 0.8.1 released</summary>
  <p>Fixed some bugs and provided a <a href="https://github.com/opendatalab/MinerU/blob/master/projects/web_demo/README.md">localized deployment version</a> of the <a href="https://opendatalab.com/OpenSourceTools/Extractor/PDF/">online demo</a> and the <a href="https://github.com/opendatalab/MinerU/blob/master/projects/web/README.md">front-end interface</a>.</p>
  </details>
  
  <details>
  <summary>2024/09/09 Version 0.8.0 released</summary>
  <p>Added support for fast deployment with a Dockerfile and launched demos on Hugging Face and ModelScope.</p>
  </details>
  
  <details>
  <summary>2024/08/30 Version 0.7.1 released</summary>
  <p>Added a paddle tablemaster table recognition option</p>
  </details>
  
  <details>
  <summary>2024/08/09 Version 0.7.0b1 released</summary>
  <p>Simplified installation process, added table recognition functionality</p>
  </details>
  
  <details>
  <summary>2024/08/01 Version 0.6.2b1 released</summary>
  <p>Optimized dependency conflict issues and installation documentation</p>
  </details>
  
  <details>
  <summary>2024/07/05 Initial open-source release</summary>
  </details>
</details>

<!-- TABLE OF CONTENT -->
  
<details open="open">
  <summary><h2 style="display: inline-block">Table of Contents</h2></summary>
  <ol>
    <li>
      <a href="#mineru">MinerU</a>
      <ul>
        <li><a href="#project-introduction">Project Introduction</a></li>
        <li><a href="#key-features">Key Features</a></li>
        <li><a href="#quick-start">Quick Start</a>
            <ul>
            <li><a href="#online-demo">Online Demo</a></li>
            <li><a href="#local-deployment">Local Deployment</a></li>
            </ul>
        </li>
      </ul>
    </li>
    <li><a href="#todo">TODO</a></li>
    <li><a href="#known-issues">Known Issues</a></li>
    <li><a href="#faq">FAQ</a></li>
    <li><a href="#all-thanks-to-our-contributors">All Thanks To Our Contributors</a></li>
    <li><a href="#license-information">License Information</a></li>
    <li><a href="#acknowledgments">Acknowledgments</a></li>
    <li><a href="#citation">Citation</a></li>
    <li><a href="#star-history">Star History</a></li>
    <li><a href="#links">Links</a></li>
  </ol>
</details>

# MinerU

## Project Introduction

MinerU is a tool that converts PDFs into machine-readable formats (e.g., Markdown, JSON), allowing for easy extraction into any format.
MinerU was born during the pre-training process of [InternLM](https://github.com/InternLM/InternLM). We focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models.
Compared to well-known commercial products, MinerU is still young. If you encounter any issues or if the results are not as expected, please submit an [issue](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**.

https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c

## Key Features

- Remove headers, footers, footnotes, page numbers, etc., to ensure semantic coherence.
- Output text in human-readable order, suitable for single-column, multi-column, and complex layouts.
- Preserve the structure of the original document, including headings, paragraphs, lists, etc.
- Extract images, image descriptions, tables, table titles, and footnotes.
- Automatically recognize and convert formulas in the document to LaTeX format.
- Automatically recognize and convert tables in the document to HTML format.
- Automatically detect scanned PDFs and garbled PDFs and enable OCR functionality.
- OCR supports detection and recognition of 84 languages.
- Supports multiple output formats, such as multimodal and NLP Markdown, JSON sorted by reading order, and rich intermediate formats.
- Supports various visualization results, including layout visualization and span visualization, for efficient confirmation of output quality.
- Supports running in a pure CPU environment, and also supports GPU (CUDA), NPU (CANN), and MPS acceleration.
- Compatible with Windows, Linux, and Mac platforms.

## Quick Start

If you encounter any installation issues, please first consult the <a href="#faq">FAQ</a>. <br>
If the parsing results are not as expected, refer to the <a href="#known-issues">Known Issues</a>. <br>
There are two ways to experience MinerU:

- [Online Demo](#online-demo)
- [Local Deployment](#local-deployment)

> [!WARNING]
> **Pre-installation Notice—Hardware and Software Environment Support**
>
> To ensure the stability and reliability of the project, we only optimize and test for specific hardware and software environments during development. This ensures that users deploying and running the project on recommended system configurations will get the best performance with the fewest compatibility issues.
>
> By focusing resources on the mainline environment, our team can more efficiently resolve potential bugs and develop new features.
>
> In non-mainline environments, due to the diversity of hardware and software configurations, as well as third-party dependency compatibility issues, we cannot guarantee 100% project availability. Therefore, for users who wish to use this project in non-recommended environments, we suggest carefully reading the documentation and FAQ first. Most issues already have corresponding solutions in the FAQ. We also encourage community feedback to help us gradually expand support.

<table>
    <tr>
        <td>Parsing Backend</td>
        <td>pipeline</td>
        <td>vlm-transformers</td>
        <td>vlm-sglang</td>
    </tr>
    <tr>
        <td>Operating System</td>
        <td>Windows/Linux/macOS</td>
        <td>Windows/Linux</td>
        <td>Windows (WSL2)/Linux</td>
    </tr>
    <tr>
        <td>CPU Inference Support</td>
        <td>✅</td>
        <td colspan="2">❌</td>
    </tr>
    <tr>
        <td>GPU Requirements</td>
        <td>Turing architecture or later, 6GB+ VRAM or Apple Silicon</td>
        <td colspan="2">Turing architecture or later, 8GB+ VRAM</td>
    </tr>
    <tr>
        <td>Memory Requirements</td>
        <td colspan="3">16GB minimum, 32GB+ recommended</td>
    </tr>
    <tr>
        <td>Disk Space Requirements</td>
        <td colspan="3">20GB+, SSD recommended</td>
    </tr>
    <tr>
        <td>Python Version</td>
        <td colspan="3">3.10-3.13</td>
    </tr>
</table>
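
As a quick pre-installation check, the Python version row of the table above can be verified with a few lines of Python. This is only a minimal sketch: the helper name `python_version_supported` is hypothetical and not part of MinerU, and it checks nothing beyond the interpreter version.

```python
import sys

# Supported range taken from the table above: Python 3.10 through 3.13.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 13)


def python_version_supported(version=None):
    """Return True if (major, minor) falls inside the supported range.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION


if __name__ == "__main__":
    status = "supported" if python_version_supported() else "not supported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```

Running the script before installing gives an early warning if the interpreter is outside the tested range.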

## Online Demo

[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github)
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)

## Local Deployment

### 1. Install MinerU

#### 1.1 Install via pip or uv

```bash
pip install --upgrade pip
pip install uv
uv pip install -U "mineru[core]"
```

#### 1.2 Install from source

```bash
git clone https://github.com/opendatalab/MinerU.git
cd MinerU
uv pip install -e ".[core]"
```

> [!NOTE]  
> Linux and macOS systems automatically support CUDA/MPS acceleration after installation. For Windows users who want to use CUDA acceleration, 
> please visit the [PyTorch official website](https://pytorch.org/get-started/locally/) to install PyTorch with the appropriate CUDA version.

#### 1.3 Install Full Version (supports sglang acceleration; requires a GPU with Turing or newer architecture and at least 8GB of VRAM)

If you need to use **sglang to accelerate VLM model inference**, you can choose any of the following methods to install the full version:

- Install using uv or pip:
  ```bash
  uv pip install -U "mineru[all]"
  ```
- Install from source:
  ```bash
  uv pip install -e ".[all]"
  ```

> [!TIP]  
> If any exceptions occur during the installation of `sglang`, please refer to the [official sglang documentation](https://docs.sglang.ai/start/install.html) for troubleshooting and solutions, or directly use Docker-based installation.

- Build the image using the Dockerfile:
  ```bash
  wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/global/Dockerfile
  docker build -t mineru-sglang:latest -f Dockerfile .
  ```
  Start the Docker container:
  ```bash
  docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    --ipc=host \
    mineru-sglang:latest \
    mineru-sglang-server --host 0.0.0.0 --port 30000
  ```
  Or start it using Docker Compose:
  ```bash
  wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/compose.yaml
  docker compose -f compose.yaml up -d
  ```

> [!TIP]
> The Dockerfile uses `lmsysorg/sglang:v0.4.8.post1-cu126` as the default base image, which supports the Turing/Ampere/Ada Lovelace/Hopper platforms.  
> If you are using the newer Blackwell platform, please change the base image to `lmsysorg/sglang:v0.4.8.post1-cu128-b200`.

#### 1.4 Install client (for connecting to sglang-server from edge devices that need only a CPU and network connectivity)

```bash
uv pip install -U mineru
mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://<host_ip>:<port>
```

---
### 2. Using MinerU

#### 2.1 Command Line Usage

##### Basic Usage

The simplest command line invocation is:

```bash
mineru -p <input_path> -o <output_path>
```

- `<input_path>`: Local PDF or image file, or a directory (supports pdf/png/jpg/jpeg/webp/gif)
- `<output_path>`: Output directory

##### View Help Information

Get all available parameter descriptions:

```bash
mineru --help
```

##### Parameter Details

```text
Usage: mineru [OPTIONS]

Options:
  -v, --version                   Show version and exit
  -p, --path PATH                 Input file path or directory (required)
  -o, --output PATH               Output directory (required)
  -m, --method [auto|txt|ocr]     Parsing method: auto (default), txt, ocr (pipeline backend only)
  -b, --backend [pipeline|vlm-transformers|vlm-sglang-engine|vlm-sglang-client]
                                  Parsing backend (default: pipeline)
  -l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|latin|arabic|east_slavic|cyrillic|devanagari]
                                  Specify document language (improves OCR accuracy, pipeline backend only)
  -u, --url TEXT                  Service address when using sglang-client
  -s, --start INTEGER             Starting page number (0-based)
  -e, --end INTEGER               Ending page number (0-based)
  -f, --formula BOOLEAN           Enable formula parsing (default: on)
  -t, --table BOOLEAN             Enable table parsing (default: on)
  -d, --device TEXT               Inference device (e.g., cpu/cuda/cuda:0/npu/mps, pipeline backend only)
  --vram INTEGER                  Maximum GPU VRAM usage per process (GB) (pipeline backend only)
  --source [huggingface|modelscope|local]
                                  Model source, default: huggingface
  --help                          Show help information
```
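
When MinerU is driven from a script, the options above can be assembled into an argument list. The following is a minimal sketch, not an official API: `build_mineru_command` is a hypothetical helper, the file names in the usage note are placeholders, and only flags from the option list above are used.

```python
def build_mineru_command(input_path, output_path, backend="pipeline",
                         lang=None, start=None, end=None):
    """Assemble an argv list for the `mineru` CLI (hypothetical helper)."""
    cmd = ["mineru", "-p", str(input_path), "-o", str(output_path), "-b", backend]
    if lang is not None:
        cmd += ["-l", lang]          # document language, pipeline backend only
    if start is not None:
        cmd += ["-s", str(start)]    # starting page number (0-based)
    if end is not None:
        cmd += ["-e", str(end)]      # ending page number (0-based)
    return cmd
```

For example, `build_mineru_command("demo.pdf", "output", lang="en", start=0, end=4)` yields a list that can be passed to `subprocess.run(cmd, check=True)` on a machine where MinerU is installed. Building a list rather than a shell string avoids quoting problems with paths that contain spaces.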

---

#### 2.2 Model Source Configuration

MinerU automatically downloads the required models from HuggingFace on first run. If HuggingFace is inaccessible, you can switch model sources:

##### Switch to ModelScope Source

```bash
mineru -p <input_path> -o <output_path> --source modelscope
```

Or set the environment variable:

```bash
export MINERU_MODEL_SOURCE=modelscope
mineru -p <input_path> -o <output_path>
```

##### Using Local Models

###### 1. Download Models Locally
```bash
mineru-models-download --help
```

Or use the interactive command-line tool to select models:

```bash
mineru-models-download
```

After the download completes, the model paths are printed in the current terminal and automatically written to `mineru.json` in your home directory.

###### 2. Parse Using Local Models

```bash
mineru -p <input_path> -o <output_path> --source local
```

Or enable via environment variable:

```bash
export MINERU_MODEL_SOURCE=local
mineru -p <input_path> -o <output_path>
```
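
The environment-variable route can also be used when launching MinerU from Python. The sketch below is a hedged illustration: `env_with_model_source` is a hypothetical helper (not part of MinerU), and it assumes the variable accepts the same three values as `--source`; the text above only demonstrates `modelscope` and `local`.

```python
import os

# Assumption: MINERU_MODEL_SOURCE accepts the same values as --source.
VALID_SOURCES = ("huggingface", "modelscope", "local")


def env_with_model_source(source, base_env=None):
    """Return a copy of the environment with MINERU_MODEL_SOURCE set."""
    if source not in VALID_SOURCES:
        raise ValueError(f"unknown model source: {source!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["MINERU_MODEL_SOURCE"] = source
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run(["mineru", "-p", input_path, "-o", output_path], env=..., check=True)`, which scopes the setting to that one child process instead of the whole shell session.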

---

#### 2.3 Using sglang to Accelerate VLM Model Inference

##### Through the sglang-engine Mode

```bash
mineru -p <input_path> -o <output_path> -b vlm-sglang-engine
```

##### Through the sglang-server/client Mode

1. Start the server:

```bash
mineru-sglang-server --port 30000
```

2. Use the client in another terminal:

```bash
mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1:30000
```
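
When the two steps above are automated, the client should not start before the server is listening. A minimal sketch of a readiness check follows; `wait_for_port` is a hypothetical helper, and it only confirms that the TCP port accepts connections, which does not necessarily mean the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A script could call `wait_for_port("127.0.0.1", 30000)` after starting `mineru-sglang-server --port 30000` and run the `vlm-sglang-client` command only once it returns `True`.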

> [!TIP]
> For more information about output files, please refer to [Output File Documentation](docs/output_file_en_us.md)

---
### 3. API Calls or Visual Invocation

1. Directly invoke using Python API: [Python Invocation Example](demo/demo.py)
2. Invoke using FastAPI:
   ```bash
   mineru-api --host 127.0.0.1 --port 8000
   ```
   Visit http://127.0.0.1:8000/docs in your browser to view the API documentation.

3. Use Gradio WebUI or Gradio API:
   ```bash
   # Using pipeline/vlm-transformers/vlm-sglang-client backend
   mineru-gradio --server-name 127.0.0.1 --server-port 7860
   # Or using vlm-sglang-engine/pipeline backend
   mineru-gradio --server-name 127.0.0.1 --server-port 7860 --enable-sglang-engine true
   ```
   Access http://127.0.0.1:7860 in your browser to use the Gradio WebUI, or visit http://127.0.0.1:7860/?view=api to use the Gradio API.


> [!TIP]  
> Below are some suggestions and notes for using the sglang acceleration mode:  
> - The sglang acceleration mode currently runs on GPUs with Turing or newer architecture and at least 8GB of VRAM, but you may run out of VRAM on GPUs with less than 24GB. You can reduce VRAM usage with the following parameters:
>   - If running on a single GPU and encountering VRAM shortage, reduce the KV cache size by setting `--mem-fraction-static 0.5`. If VRAM issues persist, try lowering it further to `0.4` or below.
>   - If you have more than one GPU, you can expand available VRAM using tensor parallelism (TP) mode: `--tp-size 2`
> - If sglang acceleration already works and you want to further improve inference speed, consider the following parameters:
>   - If using multiple GPUs, increase throughput using sglang's multi-GPU data parallelism (DP) mode: `--dp-size 2`
>   - You can also enable `torch.compile` to speed up inference by about 15%: `--enable-torch-compile`
> - For more information on sglang parameters, please refer to the [sglang official documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)
> - All sglang-supported parameters can be passed on the command line to `mineru`, `mineru-sglang-server`, `mineru-gradio`, and `mineru-api`.

> [!TIP]  
> - You can always restrict which GPUs are visible to a command by prefixing it with the `CUDA_VISIBLE_DEVICES` environment variable. For example:  
>   ```bash
>   CUDA_VISIBLE_DEVICES=1 mineru -p <input_path> -o <output_path>
>   ```
> - This method works for all command-line calls, including `mineru`, `mineru-sglang-server`, `mineru-gradio`, and `mineru-api`, and applies to both `pipeline` and `vlm` backends.  
> - Below are some common `CUDA_VISIBLE_DEVICES` settings:  
>   ```bash
>   CUDA_VISIBLE_DEVICES=1        # Only device 1 will be visible
>   CUDA_VISIBLE_DEVICES=0,1      # Devices 0 and 1 will be visible
>   CUDA_VISIBLE_DEVICES="0,1"    # Same as above; quotation marks are optional
>   CUDA_VISIBLE_DEVICES=0,2,3    # Devices 0, 2, 3 will be visible; device 1 is masked
>   CUDA_VISIBLE_DEVICES=""       # No GPU will be visible
>   ```
> - Below are some possible use cases:  
>   - If you have multiple GPUs and need to specify GPU 0 and GPU 1 to launch `sglang-server` in multi-GPU mode, you can use the following command:  
>   ```bash
>   CUDA_VISIBLE_DEVICES=0,1 mineru-sglang-server --port 30000 --dp-size 2
>   ```
>   - If you have multiple GPUs and need to launch two `fastapi` services on GPU 0 and GPU 1 respectively, listening on different ports, you can use the following commands:  
>   ```bash
>   # In terminal 1
>   CUDA_VISIBLE_DEVICES=0 mineru-api --host 127.0.0.1 --port 8000
>   # In terminal 2
>   CUDA_VISIBLE_DEVICES=1 mineru-api --host 127.0.0.1 --port 8001
>   ```
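
The one-service-per-GPU pattern from the tip above can be generated programmatically. This is a sketch under stated assumptions: `per_gpu_launch_plan` is a hypothetical helper, and it only prepares the environment and arguments; it does not start any process by itself.

```python
import os


def per_gpu_launch_plan(gpu_ids, base_port=8000, host="127.0.0.1"):
    """Pair each GPU with its own port: one (env, argv) tuple per mineru-api service."""
    plan = []
    for offset, gpu in enumerate(gpu_ids):
        # Each service sees exactly one GPU through CUDA_VISIBLE_DEVICES.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        argv = ["mineru-api", "--host", host, "--port", str(base_port + offset)]
        plan.append((env, argv))
    return plan
```

Each pair can then be started with `subprocess.Popen(argv, env=env)`. With `gpu_ids=[0, 1]` this reproduces the two-terminal example above: GPU 0 on port 8000 and GPU 1 on port 8001.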

---
### 4. Extending MinerU Functionality Through Configuration Files

- MinerU is designed to work out-of-the-box, but also supports extending functionality through configuration files. You can create a `mineru.json` file in your home directory and add custom configurations.
- The `mineru.json` file will be automatically generated when you use the built-in model download command `mineru-models-download`. Alternatively, you can create it by copying the [configuration template file](./mineru.template.json) to your home directory and renaming it to `mineru.json`.
- Below are some available configuration options:
  - `latex-delimiter-config`: Used to configure LaTeX formula delimiters, defaults to the `$` symbol, and can be modified to other symbols or strings as needed.
  - `llm-aided-config`: Used to configure parameters for LLM-assisted heading level detection, compatible with any LLM that supports the OpenAI protocol. It defaults to Alibaba Cloud Qwen's `qwen2.5-32b-instruct` model. You need to configure an API key yourself and set `enable` to `true` to activate this feature.
  - `models-dir`: Used to specify local model storage directories. Please specify separate model directories for the `pipeline` and `vlm` backends. After specifying these directories, you can use local models by setting the environment variable `export MINERU_MODEL_SOURCE=local`.
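
To see which of these options a given `mineru.json` actually sets, a short script is enough. The sketch below is illustrative only: `summarize_config` is a hypothetical helper, and it checks just the three top-level keys named above without assuming anything about their inner structure (see the configuration template file for that).

```python
import json
from pathlib import Path

# Top-level keys named in the list above; their inner structure is not assumed here.
KNOWN_KEYS = ("latex-delimiter-config", "llm-aided-config", "models-dir")


def summarize_config(path=None):
    """Return {key: bool} telling which documented options the config file sets."""
    config_path = Path(path) if path is not None else Path.home() / "mineru.json"
    if not config_path.exists():
        return {key: False for key in KNOWN_KEYS}
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    return {key: key in config for key in KNOWN_KEYS}
```

Calling `summarize_config()` with no argument inspects `mineru.json` in the home directory, which is where the text above says the file lives.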

---
# TODO

- [x] Reading order based on the model  
- [x] Recognition of `index` and `list` in the main text  
- [x] Table recognition
- [x] Heading Classification
- [ ] Code block recognition in the main text
- [ ] [Chemical formula recognition](docs/chemical_knowledge_introduction/introduction.pdf)
- [ ] Geometric shape recognition

# Known Issues

- Reading order is determined by the model based on the spatial distribution of readable content, and may be out of order in some areas under extremely complex layouts.
- Limited support for vertical text.
- Tables of contents and lists are recognized through rules, and some uncommon list formats may not be recognized.
- Code blocks are not yet supported in the layout model.
- Comic books, art albums, primary school textbooks, and exercises cannot be parsed well.
- Table recognition may result in row/column recognition errors in complex tables.
- OCR recognition may produce inaccurate characters in PDFs of lesser-known languages (e.g., diacritical marks in Latin script, easily confused characters in Arabic script).
- Some formulas may not render correctly in Markdown.

# FAQ

- If you encounter any issues during usage, you can first check the [FAQ](docs/FAQ_en_us.md) for solutions.  
- If your issue remains unresolved, you may also use [DeepWiki](https://deepwiki.com/opendatalab/MinerU) to interact with an AI assistant, which can address most common problems.  
- If you still cannot resolve the issue, you are welcome to join our community via [Discord](https://discord.gg/Tdedn9GTXq) or [WeChat](http://mineru.space/s/V85Yl) to discuss with other users and developers.

# All Thanks To Our Contributors

<a href="https://github.com/opendatalab/MinerU/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
</a>

# License Information

[LICENSE.md](LICENSE.md)

Currently, some models in this project are trained based on YOLO. However, since YOLO follows the AGPL license, it may impose restrictions on certain use cases. In future iterations, we plan to explore and replace these with models under more permissive licenses to enhance user-friendliness and flexibility.

# Acknowledgments

- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [UniMERNet](https://github.com/opendatalab/UniMERNet)
- [RapidTable](https://github.com/RapidAI/RapidTable)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PaddleOCR2Pytorch](https://github.com/frotms/PaddleOCR2Pytorch)
- [layoutreader](https://github.com/ppaanngggg/layoutreader)
- [xy-cut](https://github.com/Sanster/xy-cut)
- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
- [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
- [pdftext](https://github.com/datalab-to/pdftext)
- [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
- [pypdf](https://github.com/py-pdf/pypdf)

# Citation

```bibtex
@misc{wang2024mineruopensourcesolutionprecise,
      title={MinerU: An Open-Source Solution for Precise Document Content Extraction}, 
      author={Bin Wang and Chao Xu and Xiaomeng Zhao and Linke Ouyang and Fan Wu and Zhiyuan Zhao and Rui Xu and Kaiwen Liu and Yuan Qu and Fukai Shang and Bo Zhang and Liqun Wei and Zhihao Sui and Wei Li and Botian Shi and Yu Qiao and Dahua Lin and Conghui He},
      year={2024},
      eprint={2409.18839},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.18839}, 
}

@article{he2024opendatalab,
  title={Opendatalab: Empowering general artificial intelligence with open datasets},
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
  journal={arXiv preprint arXiv:2407.13773},
  year={2024}
}
```

# Star History

<a>
 <picture>
   <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date&theme=dark" />
   <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
   <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
 </picture>
</a>

# Links

- [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
- [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
- [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)
- [Vis3 (OSS browser based on s3)](https://github.com/opendatalab/Vis3)
- [OmniDocBench (A Comprehensive Benchmark for Document Parsing and Evaluation)](https://github.com/opendatalab/OmniDocBench)
- [Magic-HTML (Mixed web page extraction tool)](https://github.com/opendatalab/magic-html)
- [Magic-Doc (Fast speed ppt/pptx/doc/docx/pdf extraction tool)](https://github.com/InternLM/magic-doc)