<div id="top"></div>
<div align="center">

[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![license](https://img.shields.io/github/license/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU/tree/main/LICENSE)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)




[English](README.md) | [简体中文](README_zh-CN.md)

</div>

<div align="center">

</div>

# MinerU 

## Introduction

MinerU is a one-stop, open-source, high-quality data extraction tool that includes the following primary features:

- [Magic-PDF](#Magic-PDF)  PDF Document Extraction  
- [Magic-Doc](#Magic-Doc)  Webpage & E-book Extraction

# Magic-PDF

## Introduction

Magic-PDF is a tool that converts PDF documents into Markdown format. It can process files stored locally or on object storage that supports the S3 protocol.

Key features include:

- Support for multiple front-end model inputs
- Removal of headers, footers, footnotes, and page numbers
- Human-readable layout formatting
- Retention of the original document's structure and formatting, including headings, paragraphs, and lists
- Extraction and display of images and tables within markdown
- Conversion of equations into LaTeX format
- Automatic detection and conversion of garbled PDFs
- Compatibility with CPU and GPU environments
- Available for Windows, Linux, and macOS platforms

https://github.com/opendatalab/MinerU/assets/11393164/618937cb-dc6a-4646-b433-e3131a5f4070



## Project Panorama

![Project Panorama](docs/images/project_panorama_en.png)

## Flowchart

![Flowchart](docs/images/flowchart_en.png)

### Submodule Repositories

- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
  - A Comprehensive Toolkit for High-Quality PDF Content Extraction

## Getting Started

### Requirements

- Python >= 3.9
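
The version requirement above can be checked before installation. The following is a minimal sketch; `python_is_supported` and `MIN_PYTHON` are hypothetical names used only for illustration and are not part of Magic-PDF.

```python
import sys

# Minimum version taken from the requirement listed above.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=None):
    """Return True if the given (or the running) interpreter is new enough."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only the (major, minor) pair.
    return tuple(version_info[:2]) >= MIN_PYTHON
```

Calling `python_is_supported()` with no arguments checks the interpreter that runs the script.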

### Usage Instructions

#### 1. Install Magic-PDF

```bash
pip install magic-pdf
```

#### 2. Usage via Command Line

###### Simple

```bash
cp magic-pdf.template.json ~/magic-pdf.json
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
```
When the program finishes, the generated Markdown files are in the `/tmp/magic-pdf` directory.
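
To locate the output programmatically, a short sketch along these lines may help. The helper name `find_markdown_files` is hypothetical, and the recursive search is an assumption, since the layout of subdirectories under `/tmp/magic-pdf` is not described here.

```python
from pathlib import Path


def find_markdown_files(output_dir="/tmp/magic-pdf"):
    """Return every .md file under output_dir, searched recursively and sorted."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing has been generated yet, or the directory was cleaned up.
        return []
    return sorted(root.rglob("*.md"))
```

For example, `for path in find_markdown_files(): print(path)` lists each generated file.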

###### More

```bash
magic-pdf --help
```

#### 3. Usage via API

###### Local
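```python
import os

# Import paths are assumed from the magic_pdf package layout; check your installed version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# pdf_bytes, model_json, and local_image_dir must be defined by the caller.
image_writer = DiskReaderWriter(local_image_dir)
image_dir = os.path.basename(local_image_dir)
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```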
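
The local example uses `pdf_bytes` and `model_json` without showing where they come from. The sketch below is one way to prepare them and to save the result, using only the standard library. `load_inputs` and `save_markdown` are hypothetical helper names, and the handling of a list-valued `md_content` is an assumption about the return type rather than documented behavior.

```python
import json
from pathlib import Path


def load_inputs(pdf_path, model_json_path):
    """Read the PDF as raw bytes and the model output as parsed JSON."""
    pdf_bytes = Path(pdf_path).read_bytes()
    model_json = json.loads(Path(model_json_path).read_text(encoding="utf-8"))
    return pdf_bytes, model_json


def save_markdown(md_content, md_path):
    """Write the generated Markdown to md_path and return the path."""
    # Assumption: md_content may be a single string or a list of strings.
    if isinstance(md_content, list):
        md_content = "\n".join(md_content)
    path = Path(md_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(md_content, encoding="utf-8")
    return path
```

With these helpers, `pdf_bytes, model_json = load_inputs("pdf_path", "model_json_path")` supplies the two inputs, and `save_markdown(md_content, "output/doc.md")` stores the result.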

###### Object Storage
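```python
# Import paths are assumed from the magic_pdf package layout; check your installed version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Credentials, endpoints, s3_pdf_path, and model_json must be defined by the caller.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```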
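
Paths such as `s3://img_bucket/` combine a bucket name and an object key. If you want to validate a path before passing it on, a small helper like the one below could be used. `split_s3_path` is a hypothetical name and is not part of Magic-PDF.

```python
def split_s3_path(s3_path):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    prefix = "s3://"
    if not s3_path.startswith(prefix):
        raise ValueError(f"not an S3 path: {s3_path!r}")
    # Everything up to the first slash is the bucket; the rest is the key.
    bucket, _, key = s3_path[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {s3_path!r}")
    return bucket, key
```

A path that names only a bucket, such as the image directory above, yields an empty key.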

For a complete example, see [demo.py](demo/demo.py).

# Magic-Doc

## Introduction

Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.

Key features include:

- Web Page Extraction
  - Cross-modal precise parsing of text, images, tables, and formula information.

- E-Book Document Extraction
  - Supports various document formats, including EPUB and MOBI, with full adaptation for text and images.

- Language Type Identification
  - Accurate recognition of 176 languages.

https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca



https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d



https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2



## Project Repository

- [Magic-Doc](https://github.com/InternLM/magic-doc)
  - Outstanding Webpage and E-book Extraction Tool


# All Thanks To Our Contributors

<a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
</a>


# License Information

[LICENSE.md](LICENSE.md)

The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility.


# Acknowledgments

- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
- [pdfminer.six](https://github.com/pdfminer/pdfminer.six)


# Citation

```bibtex
@misc{2024mineru,
    title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
    author={MinerU Contributors},
    howpublished = {\url{https://github.com/opendatalab/MinerU}},
    year={2024}
}
```


# Star History

<a>
 <picture>
   <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date&theme=dark" />
   <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
   <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
 </picture>
</a>