<div id="top">

<p align="center">
  <img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
</p>

</div>
<div align="center">

[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)

<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 200px; height: 55px;"/></a>




[English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)

</div>

<div align="center">
<p align="center">
<a href="https://github.com/opendatalab/MinerU">MinerU: An end-to-end PDF parsing tool based on PDF-Extract-Kit, supporting conversion from PDF to Markdown.</a>🚀🚀🚀<br>
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction</a>🔥🔥🔥
</p>

<p align="center">
    👋 join us on <a href="https://discord.gg/gPxmVeGC" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
</p>
</div>

# MinerU 


## Introduction

MinerU is a one-stop, open-source, high-quality data extraction tool that includes the following primary features:

- [Magic-PDF](#Magic-PDF)  PDF Document Extraction  
- [Magic-Doc](#Magic-Doc)  Webpage & E-book Extraction


# Magic-PDF


## Introduction

Magic-PDF is a tool designed to convert PDF documents into Markdown format. It can process files stored locally or on object storage that supports the S3 protocol.

Key features include:

- Support for multiple front-end model inputs
- Removal of headers, footers, footnotes, and page numbers
- Human-readable layout formatting
- Retention of the original document's structure and formatting, including headings, paragraphs, and lists
- Extraction and display of images and tables within markdown
- Conversion of equations into LaTeX format
- Automatic detection and conversion of garbled PDFs
- Compatibility with CPU and GPU environments
- Available for Windows, Linux, and macOS platforms


https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c



## Project Panorama

![Project Panorama](docs/images/project_panorama_en.png)


## Flowchart

![Flowchart](docs/images/flowchart_en.png)

### Dependency repositories

- [PDF-Extract-Kit : A Comprehensive Toolkit for High-Quality PDF Content Extraction](https://github.com/opendatalab/PDF-Extract-Kit) 🚀🚀🚀

## Getting Started

### Requirements

- Python >= 3.9

Using a virtual environment is recommended to avoid potential dependency conflicts; both venv and conda are suitable. 
For example:
```bash
conda create -n MinerU python=3.10
conda activate MinerU
```
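Before installing, it can help to confirm that the interpreter inside the new environment meets the minimum version listed above. The following is a minimal sketch; the names `MIN_PYTHON` and `python_is_supported` are illustrative and not part of Magic-PDF.

```python
import sys

# Minimum version listed under Requirements (Python >= 3.9)
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=None):
    """Return True if the given (or current) interpreter meets the minimum version."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only the (major, minor) pair against the minimum
    return tuple(version_info[:2]) >= MIN_PYTHON
```

Calling `python_is_supported()` with no arguments checks the interpreter that is currently running, so it should be run after `conda activate MinerU`.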

### Installation and Configuration

#### 1. Install Magic-PDF

**1. Install dependencies**

The full-feature package depends on detectron2, which must be compiled during installation.   
If you need to compile it yourself, please refer to https://github.com/facebookresearch/detectron2/issues/5114  
Alternatively, you can install our precompiled whl package directly (limited to Python 3.10):

```bash
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
```

**2. Install the full-feature package with pip**

> Note: The pip-installed package supports CPU only and is best suited to quick tests.
>
> For CUDA/MPS acceleration in production, see [Acceleration Using CUDA or MPS](#4-Acceleration-Using-CUDA-or-MPS).

```bash
pip install "magic-pdf[full]==0.6.2b1"
```
> ❗️❗️❗️
> We have pre-released the 0.6.2 beta version, which addresses numerous issues mentioned in our logs. However, this build has not undergone full QA testing and does not represent final release quality. If you encounter any problems, please report them to us via issues or revert to version 0.6.1.
> ```bash
> pip install "magic-pdf[full-cpu]==0.6.1"
> ```



#### 2. Download model weight files

For details, see [how_to_download_models](docs/how_to_download_models_en.md).

After downloading the model weights, move the 'models' directory to a disk with ample free space, preferably an SSD.


#### 3. Copy the Configuration File and Make Configurations
You can find the [magic-pdf.template.json](magic-pdf.template.json) file in the repository root directory. Copy it to your home directory:
```bash
cp magic-pdf.template.json ~/magic-pdf.json
```
In magic-pdf.json, set "models-dir" to the directory that contains the model weight files.

```json
{
  "models-dir": "/tmp/models"
}
```
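To catch a wrong path early, the configuration can be checked before running a conversion. The following is a minimal sketch that assumes the file was copied to `~/magic-pdf.json` as shown above; the helper name `load_models_dir` is illustrative and not part of Magic-PDF.

```python
import json
from pathlib import Path


def load_models_dir(config_path=None):
    """Read "models-dir" from the config file and check that it is a directory."""
    if config_path is None:
        # Default location used by the cp command above
        config_path = Path.home() / "magic-pdf.json"
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    models_dir = Path(config["models-dir"])
    if not models_dir.is_dir():
        raise FileNotFoundError(f"models-dir does not exist: {models_dir}")
    return models_dir
```

A missing "models-dir" key raises `KeyError`, and a path that is not a directory raises `FileNotFoundError`, so both mistakes surface before any model is loaded.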


#### 4. Acceleration Using CUDA or MPS
If you have an NVIDIA GPU or a Mac with Apple Silicon, you can use CUDA or MPS acceleration, respectively.
##### CUDA

Install the PyTorch build that matches your CUDA version.  
This example installs the CUDA 11.8 build. For more information, see https://pytorch.org/get-started/locally/
```bash
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
```
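To check afterwards that the installed packages are not newer than the versions pinned in the command above, a small sketch like the following may help. The helper names `release_tuple` and `unsupported_packages` are illustrative and not part of Magic-PDF, and only plain release versions such as `2.3.1+cu118` are handled (pre-release tags are not).

```python
from importlib import metadata

# Versions pinned in the install command above
MAX_SUPPORTED = {"torch": (2, 3, 1), "torchvision": (0, 18, 1)}


def release_tuple(version):
    """Turn '2.3.1+cu118' into (2, 3, 1), ignoring the local build suffix."""
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in public.split(".")[:3])


def unsupported_packages(installed=None):
    """Return the names of packages installed at a version above the pinned one."""
    if installed is None:
        installed = {}
        for name in MAX_SUPPORTED:
            try:
                installed[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                continue  # not installed, nothing to compare
    return [
        name
        for name, version in installed.items()
        if release_tuple(version) > MAX_SUPPORTED[name]
    ]
```

Calling `unsupported_packages()` with no arguments inspects the current environment; an empty list means neither package exceeds its pinned version.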
> ❗️ Make sure to specify the versions
> ```bash
> torch==2.3.1 torchvision==0.18.1
> ```
> in the command, as these are the highest versions we support. If you omit them, newer versions may be installed automatically, which can cause the program to fail.

You also need to set "device-mode" to "cuda" in the configuration file magic-pdf.json.  
```json
{
  "device-mode":"cuda"
}
```

##### MPS

On macOS devices with M-series chips, you can use MPS for inference acceleration.  
You also need to set "device-mode" to "mps" in the configuration file magic-pdf.json.  
```json
{
  "device-mode":"mps"
}
```
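If you switch between devices often, the "device-mode" value can be updated with a short script instead of by hand. The following is a minimal sketch: the helper name `set_device_mode` is illustrative, the default path assumes the config was copied to `~/magic-pdf.json`, and "cpu" is an assumed value for the non-accelerated mode (only "cuda" and "mps" are named in this section).

```python
import json
from pathlib import Path

# "cuda" and "mps" come from the sections above; "cpu" is an assumption
VALID_MODES = {"cpu", "cuda", "mps"}


def set_device_mode(mode, config_path=None):
    """Set "device-mode" in the config file, leaving the other keys unchanged."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported device-mode: {mode!r}")
    path = Path(config_path) if config_path else Path.home() / "magic-pdf.json"
    config = json.loads(path.read_text(encoding="utf-8"))
    config["device-mode"] = mode
    # Rewrite the whole file with the updated value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `set_device_mode("mps")` on an Apple Silicon Mac rewrites the file with the new value and returns the updated configuration as a dictionary.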


### Usage

#### 1. Usage via Command Line

###### Simple

```bash
magic-pdf pdf-command --pdf "pdf_path" --inside_model true
```
When the program finishes, the generated Markdown files are in the directory "/tmp/magic-pdf".  
The corresponding xxx_model.json file is in the same directory as the Markdown output.   
If you are doing secondary development on the post-processing pipeline, you can run:  
```bash
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
```
This way, the model does not need to be re-run, which makes debugging more convenient.
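
When converting many files from a script, the two command forms above can be wrapped in a small helper. The following is a minimal sketch that uses only the flags shown above; the function names `build_command` and `convert` are illustrative, and it assumes `magic-pdf` is on the `PATH`.

```python
import subprocess


def build_command(pdf_path, model_json_path=None):
    """Build the magic-pdf argument list for the two invocations shown above."""
    command = ["magic-pdf", "pdf-command", "--pdf", str(pdf_path)]
    if model_json_path is None:
        # First form: --inside_model true
        command += ["--inside_model", "true"]
    else:
        # Second form: reuse an existing xxx_model.json file
        command += ["--model", str(model_json_path)]
    return command


def convert(pdf_path, model_json_path=None):
    """Run the CLI and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_command(pdf_path, model_json_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.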


###### More

```bash
magic-pdf --help
```


#### 2. Usage via API

###### Local
```python
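import os

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# Import paths assume the magic-pdf 0.6.x package layout.
# pdf_bytes (the PDF content) and local_image_dir (the image output directory)
# are assumed to be defined by the caller.
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")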
```

###### Object Storage
```python
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Import paths assume the magic-pdf 0.6.x package layout.
# The access keys, endpoints, and s3_pdf_path are assumed to be defined by the caller.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
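
Both snippets end with `md_content` held in memory. The following is a minimal sketch for writing it to disk; it assumes `md_content` is either a string or a list of strings, which the text above does not confirm, and the helper name `save_markdown` is illustrative.

```python
from pathlib import Path


def save_markdown(md_content, out_path):
    """Write the Markdown to out_path; accepts a string or a list of strings."""
    if isinstance(md_content, (list, tuple)):
        # Join list items into a single document, one item per line
        md_content = "\n".join(md_content)
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(md_content, encoding="utf-8")
    return out_path
```

For example, `save_markdown(md_content, "output/demo.md")` creates the `output` directory if needed and returns the path that was written.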

For a complete example, see [demo.py](demo/demo.py).


# Magic-Doc


## Introduction

Magic-Doc is a tool designed to convert web pages or multi-format e-books into Markdown format.

Key features include:

- Web Page Extraction
  - Cross-modal precise parsing of text, images, tables, and formula information.

- E-Book Document Extraction
  - Supports various document formats, including epub and mobi, with full adaptation for text and images.

- Language Type Identification
  - Accurate recognition of 176 languages.

https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca



https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d



https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2




## Project Repository

- [Magic-Doc](https://github.com/InternLM/magic-doc)
  Outstanding Webpage and E-book Extraction Tool


# All Thanks To Our Contributors

<a href="https://github.com/opendatalab/MinerU/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
</a>


# License Information

[LICENSE.md](LICENSE.md)

The project currently uses PyMuPDF to deliver advanced functionality. However, PyMuPDF is distributed under the AGPL license, which may limit certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to improve user-friendliness and flexibility.


# Acknowledgments

- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
- [pdfminer.six](https://github.com/pdfminer/pdfminer.six)


# Citation

```bibtex
@article{he2024opendatalab,
  title={Opendatalab: Empowering general artificial intelligence with open datasets},
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
  journal={arXiv preprint arXiv:2407.13773},
  year={2024}
}

@misc{2024mineru,
    title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
    author={MinerU Contributors},
    howpublished = {\url{https://github.com/opendatalab/MinerU}},
    year={2024}
}
```


# Star History

<a>
 <picture>
   <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date&theme=dark" />
   <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
   <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
 </picture>
</a>

# Links
- [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
- [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
- [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)