<div id="top"></div>
<div align="center">

[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)




[English](README.md) | [简体中文](README_zh-CN.md)

</div>

<div align="center">

</div>

# MinerU 

## Introduction

MinerU is a one-stop, open-source, high-quality data extraction tool. It includes the following primary features:

- [Magic-PDF](#magic-pdf): PDF document extraction
- [Magic-Doc](#magic-doc): webpage and e-book extraction

# Magic-PDF

## Introduction

Magic-PDF converts PDF documents into Markdown. It can process files stored locally or on object storage that supports the S3 protocol.

Key features include:

- Support for multiple front-end model inputs
- Removal of headers, footers, footnotes, and page numbers
- Human-readable layout formatting
- Preservation of the original document's structure and formatting, including headings, paragraphs, lists, and more
- Extraction and display of images and tables within Markdown
- Conversion of equations into LaTeX format
- Automatic detection and conversion of garbled PDFs
- Compatibility with CPU and GPU environments
- Availability on Windows, Linux, and macOS

https://github.com/opendatalab/MinerU/assets/11393164/618937cb-dc6a-4646-b433-e3131a5f4070



## Project Panorama

![Project Panorama](docs/images/project_panorama_en.png)

## Flowchart

![Flowchart](docs/images/flowchart_en.png)

### Submodule Repositories

- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
  - A Comprehensive Toolkit for High-Quality PDF Content Extraction

## Getting Started

### Requirements

- Python >= 3.9

Using a virtual environment, created with either venv or conda, is recommended.
Development is based on Python 3.10. If you encounter problems with other Python versions, please switch to Python 3.10.
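
Before installing, it can help to confirm that the interpreter meets these requirements. The following is a minimal sketch that uses only the standard library; the helper names `python_is_supported` and `in_virtual_env` are illustrative and are not part of Magic-PDF.

```python
import sys

MIN_VERSION = (3, 9)  # minimum Python version listed in the requirements


def python_is_supported(version_info=sys.version_info):
    """Return True if the given version is at least Python 3.9."""
    return tuple(version_info[:2]) >= MIN_VERSION


def in_virtual_env():
    """Return True when running inside a venv-style virtual environment.

    This compares sys.prefix with sys.base_prefix, which covers venv.
    Conda environments are not detected by this check.
    """
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Python version OK:", python_is_supported())
    print("Inside a venv:", in_virtual_env())
```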


### Usage Instructions

#### 1. Install Magic-PDF

```bash
# If you only need the basic features (without built-in model parsing functionality)
pip install magic-pdf
# or
# For complete parsing capabilities (including high-precision model parsing)
# (quoted so that shells such as zsh do not treat the brackets as a glob pattern)
pip install "magic-pdf[full-cpu]"

# High-precision model parsing also requires the detectron2 dependency.
# Either compile detectron2 yourself as per https://github.com/facebookresearch/detectron2/issues/5114
# or use one of our precompiled wheels (built for CPython 3.10):

# Windows
pip install https://github.com/opendatalab/MinerU/raw/master/assets/whl/detectron2-0.6-cp310-cp310-win_amd64.whl

# Linux
pip install https://github.com/opendatalab/MinerU/raw/master/assets/whl/detectron2-0.6-cp310-cp310-linux_x86_64.whl

# macOS (Intel)
pip install https://github.com/opendatalab/MinerU/raw/master/assets/whl/detectron2-0.6-cp310-cp310-macosx_10_9_universal2.whl

# macOS (M1/M2/M3)
pip install https://github.com/opendatalab/MinerU/raw/master/assets/whl/detectron2-0.6-cp310-cp310-macosx_11_0_arm64.whl
```
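
If you script the installation, the choice of wheel can be automated. The sketch below maps the values reported by `platform.system()` and `platform.machine()` to the wheel file names listed above. The mapping and the helper name `detectron2_wheel_url` are assumptions made for illustration; only the URLs come from this README.

```python
import platform

# Base URL and file names are taken from the install commands in this README.
WHEEL_BASE = "https://github.com/opendatalab/MinerU/raw/master/assets/whl/"
WHEELS = {
    ("Windows", "AMD64"): "detectron2-0.6-cp310-cp310-win_amd64.whl",
    ("Linux", "x86_64"): "detectron2-0.6-cp310-cp310-linux_x86_64.whl",
    ("Darwin", "x86_64"): "detectron2-0.6-cp310-cp310-macosx_10_9_universal2.whl",
    ("Darwin", "arm64"): "detectron2-0.6-cp310-cp310-macosx_11_0_arm64.whl",
}


def detectron2_wheel_url(system=None, machine=None):
    """Return the precompiled wheel URL for a platform, or None if none is listed."""
    system = system or platform.system()
    machine = machine or platform.machine()
    name = WHEELS.get((system, machine))
    return WHEEL_BASE + name if name else None


if __name__ == "__main__":
    print(detectron2_wheel_url() or "No precompiled wheel listed; compile detectron2 yourself.")
```

The printed URL can be passed to `pip install`. Because the wheels carry the `cp310` tag, they only apply to Python 3.10.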


#### 2. Download the Model Weight Files

For details, see [how_to_download_models](docs/how_to_download_models.md).

After downloading the model weights, move the 'models' directory to a disk with plenty of free space, preferably an SSD.
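
The move can be scripted so that it fails early when the target disk is too small. This is a minimal sketch using the standard library; `directory_size` and `move_models` are hypothetical helpers, not part of Magic-PDF, and the paths you pass in are your own.

```python
import shutil
from pathlib import Path


def directory_size(path):
    """Return the total size in bytes of all regular files under path."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


def move_models(src, dest_parent):
    """Move the directory src into dest_parent if there is enough free space.

    Returns the new location of the directory as a Path.
    """
    src, dest_parent = Path(src), Path(dest_parent)
    needed = directory_size(src)
    free = shutil.disk_usage(dest_parent).free
    if free < needed:
        raise OSError(f"Not enough free space in {dest_parent}: need {needed} bytes, have {free}")
    target = dest_parent / src.name
    shutil.move(str(src), str(target))
    return target
```

The returned path is the value to use for "models-dir" in the next step.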


#### 3. Copy the Configuration File and Make Configurations

```bash
# Copy the configuration file to your home directory
cp magic-pdf.template.json ~/magic-pdf.json
```
In magic-pdf.json, set "models-dir" to the directory that contains the model weight files.

```json
{
  "models-dir": "/tmp/models"
}
```
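
The same edit can be made programmatically. The sketch below updates a single top-level key and leaves the rest of the file untouched; `set_config_value` is an illustrative helper, not part of Magic-PDF, and the default path assumes the file was copied to your home directory as shown above.

```python
import json
from pathlib import Path


def set_config_value(key, value, config_path=Path.home() / "magic-pdf.json"):
    """Set one top-level key in the JSON config file, keeping the other keys."""
    config_path = Path(config_path)
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config[key] = value
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `set_config_value("models-dir", "/tmp/models")` mirrors the JSON fragment above; replace the value with your own models directory.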


#### 4. Usage via Command Line

###### Simple

```bash
# If the full version is installed, you can invoke the built-in models for parsing.
magic-pdf pdf-command --pdf "pdf_path" --inside_model true
```
After the program finishes, the generated Markdown files are in the "/tmp/magic-pdf" directory.
The corresponding xxx_model.json file is in the same directory as the Markdown output.
If you intend to do secondary development on the post-processing pipeline, you can pass that file back in with:
```bash
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
```
This way, the models do not need to run again, which makes debugging more convenient.
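
To call the command line from a script, the two invocations above can be wrapped as follows. This sketch only uses the flags shown in this section; the function names `build_command` and `convert` are illustrative.

```python
import subprocess


def build_command(pdf_path, model_json_path=None):
    """Build the argument list for one `magic-pdf pdf-command` run.

    Without model_json_path, the built-in models are used (--inside_model true).
    With it, the saved model output is reused (--model).
    """
    cmd = ["magic-pdf", "pdf-command", "--pdf", str(pdf_path)]
    if model_json_path is None:
        cmd += ["--inside_model", "true"]
    else:
        cmd += ["--model", str(model_json_path)]
    return cmd


def convert(pdf_path, model_json_path=None):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(pdf_path, model_json_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.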

###### More

```bash
magic-pdf --help
```


#### 5. Acceleration Using CUDA or MPS

##### CUDA

You need to install the corresponding PyTorch version according to your CUDA version.
```bash
# When using the GPU solution, you need to reinstall PyTorch for the corresponding CUDA version. This example installs the CUDA 11.8 version.
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
```
Also, you need to modify the value of "device-mode" in the configuration file magic-pdf.json.
```json
{
  "device-mode":"cuda"
}
```

##### MPS

For macOS users with M-series chip devices, you can use MPS for inference acceleration.
You also need to modify the value of "device-mode" in the configuration file magic-pdf.json.

```json
{
  "device-mode":"mps"
}
```
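
To decide which value to use, you can ask PyTorch what is available on the machine. This is a minimal sketch, not part of Magic-PDF: the helper name `detect_device_mode` is illustrative, and the `"cpu"` fallback value is an assumption, since this section only shows `"cuda"` and `"mps"`.

```python
def detect_device_mode():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed: fall back to CPU (assumed value)
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(detect_device_mode())
```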

#### 6. Usage via API

###### Local
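```python
import os

# Import paths as used in demo/demo.py; verify them against your installed version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# You provide: pdf_bytes (PDF content), model_json (model output list), local_image_dir.
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()  # classify the PDF type
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```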

###### Object Storage
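```python
# Import paths are assumed to follow the package layout; verify them against your installed version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# You provide: the access keys, secret keys, and endpoints, plus s3_pdf_path and model_json.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()  # classify the PDF type
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```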

For a complete example, see [demo.py](demo/demo.py).
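
The snippets above start from `pdf_bytes` and end with `md_content`. The sketch below fills in the file handling around them with the standard library only. `read_pdf_bytes` and `save_markdown` are hypothetical helpers, and the handling of a list value is an assumption about what `pipe_mk_markdown` may return.

```python
from pathlib import Path


def read_pdf_bytes(pdf_path):
    """Read a local PDF into bytes, the input type passed to UNIPipe."""
    return Path(pdf_path).read_bytes()


def save_markdown(md_content, output_dir, name):
    """Write md_content to <output_dir>/<name>.md and return the file path.

    Assumption: md_content is a string or a list of strings; a list is
    joined with newlines.
    """
    if isinstance(md_content, list):
        md_content = "\n".join(md_content)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    md_path = output_dir / f"{name}.md"
    md_path.write_text(md_content, encoding="utf-8")
    return md_path
```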

# Magic-Doc

## Introduction

Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.

Key Features Include:

- Web Page Extraction
  - Cross-modal precise parsing of text, images, tables, and formula information.

- E-Book Document Extraction
  - Supports various document formats, including EPUB and MOBI, with full support for text and images.

- Language Type Identification
  - Accurate recognition of 176 languages.

https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca



https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d



https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2



## Project Repository

- [Magic-Doc](https://github.com/InternLM/magic-doc)
  Outstanding Webpage and E-book Extraction Tool


# All Thanks To Our Contributors

<a href="https://github.com/opendatalab/MinerU/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
</a>


# License Information

[LICENSE.md](LICENSE.md)

The project currently uses PyMuPDF to deliver advanced functionality. However, PyMuPDF is licensed under the AGPL, which may limit certain use cases. In future iterations, we plan to explore and switch to a PDF processing library with a more permissive license to improve user-friendliness and flexibility.


# Acknowledgments

- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
- [pdfminer.six](https://github.com/pdfminer/pdfminer.six)


# Citation

```bibtex
@misc{2024mineru,
    title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
    author={MinerU Contributors},
    howpublished = {\url{https://github.com/opendatalab/MinerU}},
    year={2024}
}
```


# Star History

<a>
 <picture>
   <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date&theme=dark" />
   <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
   <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
 </picture>
</a>