<div id="top">

<p align="center">
  <img src="docs/images/MinerU-logo.png" width="160px" style="vertical-align:middle;">
</p>

</div>
<div align="center">

[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

[English](README.md) | [简体中文](README_zh-CN.md)

</div>

<div align="center">
<p align="center">
<a href="https://github.com/opendatalab/MinerU">MinerU: An end-to-end PDF parsing tool based on PDF-Extract-Kit, supporting conversion from PDF to Markdown.</a>🚀🚀🚀<br>
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction</a>🔥🔥🔥
</p>

<p align="center">
    👋 join us on <a href="https://discord.gg/AsQMhuMN" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
</p>
</div>

# MinerU 

## Introduction

MinerU is a one-stop, open-source, high-quality data extraction tool that includes the following primary features:

- [Magic-PDF](#Magic-PDF)  PDF Document Extraction  
- [Magic-Doc](#Magic-Doc)  Webpage & E-book Extraction

# Magic-PDF

## Introduction

Magic-PDF converts PDF documents into Markdown. It can process files stored locally or on object storage that supports the S3 protocol.

Key features include:

- Support for multiple front-end model inputs
- Removal of headers, footers, footnotes, and page numbers
- Human-readable layout formatting
- Retention of the original document's structure and formatting, including headings, paragraphs, and lists
- Extraction and display of images and tables within Markdown
- Conversion of equations into LaTeX format
- Automatic detection and conversion of garbled PDFs
- Compatibility with CPU and GPU environments
- Availability on Windows, Linux, and macOS

https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c



## Project Panorama

![Project Panorama](docs/images/project_panorama_en.png)

## Flowchart

![Flowchart](docs/images/flowchart_en.png)

### Dependency repositories

- [PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction](https://github.com/opendatalab/PDF-Extract-Kit) 🚀🚀🚀

## Getting Started

### Requirements

- Python >= 3.9

Using a virtual environment is recommended to avoid potential dependency conflicts; both venv and conda are suitable. 
For example:
```bash
conda create -n MinerU python=3.10
conda activate MinerU
```
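For readers who prefer venv to conda, a comparable environment can be created from Python itself. The following is a minimal sketch using the standard-library `venv` module; the helper name `create_env` and the directory name `mineru-env` are illustrative assumptions, not part of MinerU.

```python
import venv
from pathlib import Path


def create_env(env_dir: str, with_pip: bool = True) -> Path:
    """Create a virtual environment at env_dir and return its path."""
    path = Path(env_dir).expanduser()
    # with_pip=True bootstraps pip inside the environment so that
    # packages can be installed into it afterwards.
    venv.create(path, with_pip=with_pip)
    return path


# Example call (hypothetical directory name):
# create_env("mineru-env")
```

After creation, activate the environment with `source mineru-env/bin/activate` on Linux or macOS, or `mineru-env\Scripts\activate` on Windows.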

### Installation and Configuration

#### 1. Install Magic-PDF

Install the full-feature package with pip:
>Note: The pip-installed package supports CPU only and is best suited to quick tests.
>
>For CUDA/MPS acceleration in production, see [Acceleration Using CUDA or MPS](#4-Acceleration-Using-CUDA-or-MPS).

```bash
pip install "magic-pdf[full-cpu]"
```
The full-feature package depends on detectron2, which has to be compiled during installation.   
If you need to compile it yourself, see https://github.com/facebookresearch/detectron2/issues/5114  
Alternatively, you can install our precompiled wheel directly (Python 3.10 only):

```bash
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
```
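The two version constraints above (Python >= 3.9 in general, Python 3.10 for the precompiled detectron2 wheel) can be checked before installing. This is a small sketch; the function names `meets_minimum` and `can_use_prebuilt_detectron2` are hypothetical helpers, not part of magic-pdf.

```python
import sys


def meets_minimum(version=None) -> bool:
    """Return True if the interpreter satisfies the Python >= 3.9 requirement."""
    major_minor = tuple((version or sys.version_info)[:2])
    return major_minor >= (3, 9)


def can_use_prebuilt_detectron2(version=None) -> bool:
    """Return True for Python 3.10, the only version the precompiled wheel targets."""
    major_minor = tuple((version or sys.version_info)[:2])
    return major_minor == (3, 10)


# Called without arguments, both helpers inspect the running interpreter:
# meets_minimum(); can_use_prebuilt_detectron2()
```

If `can_use_prebuilt_detectron2()` is false, detectron2 has to be compiled locally as described in the linked issue.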


#### 2. Download the model weight files

For details, see [how_to_download_models](docs/how_to_download_models_en.md).

After downloading the model weights, move the 'models' directory to a disk with ample free space, preferably an SSD.


#### 3. Copy and Edit the Configuration File
The [magic-pdf.template.json](magic-pdf.template.json) file is in the repository root directory. Copy it to your home directory:
```bash
cp magic-pdf.template.json ~/magic-pdf.json
```
In magic-pdf.json, set "models-dir" to the directory that contains the model weight files.

```json
{
  "models-dir": "/tmp/models"
}
```
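The copy-and-edit step can also be scripted. The sketch below reads the template, sets "models-dir", and writes the result to the target path; `write_config` is a hypothetical helper, and the only assumption about the template is that it is a JSON object.

```python
import json
from pathlib import Path


def write_config(template_path, config_path, models_dir):
    """Copy the template to config_path with "models-dir" set to models_dir."""
    config = json.loads(Path(template_path).read_text(encoding="utf-8"))
    # Expand "~" so the stored value is an absolute path.
    config["models-dir"] = str(Path(models_dir).expanduser())
    out = Path(config_path).expanduser()
    out.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Example call, mirroring the manual steps above:
# write_config("magic-pdf.template.json", "~/magic-pdf.json", "/tmp/models")
```

All other keys in the template are written back unchanged.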


#### 4. Acceleration Using CUDA or MPS
If you have an NVIDIA GPU or a Mac with Apple Silicon, you can accelerate inference with CUDA or MPS, respectively.
##### CUDA

Install the PyTorch build that matches your CUDA version.  
This example installs the build for CUDA 11.8. For other versions, see https://pytorch.org/get-started/locally/  
```bash
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
```
Then set "device-mode" to "cuda" in the configuration file magic-pdf.json.  
```json
{
  "device-mode":"cuda"
}
```

##### MPS

macOS users with M-series chips can use MPS for inference acceleration.  
Set "device-mode" to "mps" in the configuration file magic-pdf.json.  
```json
{
  "device-mode":"mps"
}
```
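Which "device-mode" value to use can be decided programmatically. The sketch below prefers CUDA, then MPS, and otherwise falls back to "cpu"; the "cpu" value and both function names are assumptions made for illustration, so check them against your magic-pdf.template.json.

```python
def choose_device_mode(cuda_available: bool, mps_available: bool) -> str:
    """Pick a "device-mode" value: CUDA first, then MPS, else CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"  # assumed fallback value


def detect_device_mode() -> str:
    """Query PyTorch for available backends; report "cpu" if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent in older PyTorch releases.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend and mps_backend.is_available())
    return choose_device_mode(torch.cuda.is_available(), mps_ok)
```

The returned string can then be written to the "device-mode" key of magic-pdf.json.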


### Usage

#### 1. Usage via Command Line

###### simple

```bash
magic-pdf pdf-command --pdf "pdf_path" --inside_model true
```
When the program finishes, the generated Markdown files are in the "/tmp/magic-pdf" directory.  
The corresponding xxx_model.json file is in the same output directory as the Markdown.   
If you are doing secondary development on the post-processing pipeline, you can pass that file back in with:  
```bash
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
```
This way the model does not need to be run again, which makes debugging more convenient.
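The two command-line variants above can be driven from a script. This sketch only assembles the flags shown in this section (`--pdf`, `--inside_model`, `--model`); the helper names `build_command` and `run_magic_pdf` are hypothetical.

```python
import subprocess


def build_command(pdf_path, model_json_path=None):
    """Return the magic-pdf argument list for one PDF."""
    cmd = ["magic-pdf", "pdf-command", "--pdf", str(pdf_path)]
    if model_json_path is None:
        # First run: let the built-in model analyse the PDF.
        cmd += ["--inside_model", "true"]
    else:
        # Later runs: reuse a previously generated xxx_model.json.
        cmd += ["--model", str(model_json_path)]
    return cmd


def run_magic_pdf(pdf_path, model_json_path=None):
    """Run magic-pdf and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(pdf_path, model_json_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.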


###### more 

```bash
magic-pdf --help
```


#### 2. Usage via API

###### Local
```python
import os

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# pdf_bytes (the PDF content) and local_image_dir (where extracted images
# are written) must be provided by the caller.
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```

###### Object Storage
```python
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# The access keys, secret keys, endpoints, and s3_pdf_path are placeholders
# to be provided by the caller.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
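Both examples end with `md_content` held in memory. A possible way to persist it is sketched below; `save_markdown` is a hypothetical helper, and it assumes `md_content` is a string (a list of strings is joined defensively in case the return type differs in your version).

```python
from pathlib import Path


def save_markdown(md_content, output_path):
    """Write md_content to output_path as UTF-8 and return the path."""
    # Assumption: md_content is a str; join list-like results line by line.
    if isinstance(md_content, (list, tuple)):
        md_content = "\n".join(str(part) for part in md_content)
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(md_content, encoding="utf-8")
    return path


# Example call (hypothetical output location):
# save_markdown(md_content, "output/result.md")
```

Writing next to the image directory keeps the relative image links in the Markdown valid.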

See [demo.py](demo/demo.py) for a complete demo.

# Magic-Doc

## Introduction

Magic-Doc converts web pages and multi-format e-books into Markdown.

Key features include:

- Web Page Extraction
  - Precise cross-modal parsing of text, images, tables, and formulas.

- E-Book Document Extraction
  - Support for various document formats, including EPUB and MOBI, with full handling of text and images.

- Language Type Identification
  - Accurate recognition of 176 languages.

https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca



https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d



https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2



## Project Repository

- [Magic-Doc](https://github.com/InternLM/magic-doc)
  Outstanding Webpage and E-book Extraction Tool


# All Thanks To Our Contributors

<a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
</a>


# License Information

[LICENSE.md](LICENSE.md)

The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility.


# Acknowledgments

- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
- [pdfminer.six](https://github.com/pdfminer/pdfminer.six)


# Citation

```bibtex
@misc{2024mineru,
    title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
    author={MinerU Contributors},
    howpublished = {\url{https://github.com/opendatalab/MinerU}},
    year={2024}
}
```


# Star History

<a>
 <picture>
   <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date&theme=dark" />
   <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
   <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
 </picture>
</a>

# Links
- [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
- [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
- [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)