<div id="top"></div>
<div align="center">

[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![license](https://img.shields.io/github/license/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)

[English](README.md) | [简体中文](README_zh-CN.md)

</div>

<div align="center">

</div>

# MinerU 


## Introduction

MinerU is a one-stop, open-source data extraction tool. It primarily includes the following components:

- [Magic-PDF](#Magic-PDF): PDF document extraction
- [Magic-Doc](#Magic-Doc): Webpage and e-book extraction

# Magic-PDF

## Introduction

Magic-PDF is a tool that converts PDF documents into Markdown. It can process files stored locally or on object storage that supports the S3 protocol.

Key features include:

- Support for multiple front-end model inputs
- Removal of headers, footers, footnotes, and page numbers
- Human-readable layout formatting
- Retention of the original document's structure and formatting, including headings, paragraphs, and lists
- Extraction and display of images and tables within Markdown
- Conversion of equations into LaTeX format
- Automatic detection and conversion of garbled PDFs
- Compatibility with CPU and GPU environments
- Available for Windows, Linux, and macOS platforms


https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3131a5f4070



## Project Panorama

![Project Panorama](docs/images/project_panorama_en.png)

## Flowchart

![Flowchart](docs/images/flowchart_en.png)

### Submodule Repositories

- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
  - A Comprehensive Toolkit for High-Quality PDF Content Extraction
- [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark)
  - An end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios

## Getting Started

### Requirements

- Python >= 3.9

### Usage Instructions

#### 1. Install Magic-PDF

```bash
pip install magic-pdf
```

#### 2. Usage via Command Line

###### Simple

```bash
# Copy the configuration template to your home directory
cp magic-pdf.template.json ~/magic-pdf.json
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
```
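
To convert several PDFs in one go, the command above can be wrapped in a small script. The following is a minimal sketch, not part of Magic-PDF: `build_command` and `convert_all` are hypothetical helper names, and the only CLI flags used are the `--pdf` and `--model` options shown above.

```python
import shlex
import subprocess


def build_command(pdf_path, model_json_path):
    """Build the argument list for the `magic-pdf pdf-command` call shown above."""
    return [
        "magic-pdf", "pdf-command",
        "--pdf", str(pdf_path),
        "--model", str(model_json_path),
    ]


def convert_all(jobs, dry_run=True):
    """Run (or only print) one command per (pdf_path, model_json_path) pair."""
    commands = [build_command(pdf, model) for pdf, model in jobs]
    for cmd in commands:
        if dry_run:
            # Print a shell-quoted version of the command without running it.
            print(shlex.join(cmd))
        else:
            # check=True raises CalledProcessError if the CLI exits with an error.
            subprocess.run(cmd, check=True)
    return commands
```

With `dry_run=True` the commands are only printed, which makes it easy to review them before running the conversion for real.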

###### More

```bash
magic-pdf --help
```

#### 3. Usage via API

###### Local
```python
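import os
from magic_pdf.pipe.UNIPipe import UNIPipe  # import paths assumed; check your installed version
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# Assumes pdf_bytes, model_json, and local_image_dir are already defined.
image_writer = DiskReaderWriter(local_image_dir)  # writer for extracted images
image_dir = str(os.path.basename(local_image_dir))  # directory name passed to pipe_mk_markdown
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")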
```
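
The local snippet above assumes that `pdf_bytes`, `model_json`, and `local_image_dir` already exist. A minimal sketch of preparing them with the standard library might look like this. `load_inputs` is a hypothetical helper, and the `images` subdirectory name is an assumption, not something Magic-PDF requires.

```python
import json
from pathlib import Path


def load_inputs(pdf_path, model_json_path, image_subdir="images"):
    """Return (pdf_bytes, model_json, local_image_dir) for the pipeline call."""
    pdf_path = Path(pdf_path)
    # Raw PDF content, passed to the pipeline as bytes.
    pdf_bytes = pdf_path.read_bytes()
    # Parsed model output, used as the "model_list" value.
    model_json = json.loads(Path(model_json_path).read_text(encoding="utf-8"))
    # Directory next to the PDF where extracted images will be written.
    local_image_dir = pdf_path.parent / image_subdir
    local_image_dir.mkdir(parents=True, exist_ok=True)
    return pdf_bytes, model_json, str(local_image_dir)
```

The three returned values map directly onto the names used in the snippet above.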

###### Object Storage
```python
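from magic_pdf.pipe.UNIPipe import UNIPipe  # import paths assumed; check your installed version
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Assumes the credentials (*_ak, *_sk, *_endpoint), s3_pdf_path, and model_json are already defined.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)  # client for reading the PDF
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)  # writer for extracted images
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)  # read the PDF as bytes
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")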
```
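
The object storage snippet above takes an access key, a secret key, and an endpoint for each client. One way to avoid hard-coding them is to read them from environment variables. This is only a sketch: `s3_credentials` is a hypothetical helper, and the `<PREFIX>_AK`, `<PREFIX>_SK`, and `<PREFIX>_ENDPOINT` variable names are assumptions, not names that Magic-PDF reads itself.

```python
import os


def s3_credentials(prefix, env=None):
    """Return (access_key, secret_key, endpoint) from PREFIX_AK, PREFIX_SK, PREFIX_ENDPOINT."""
    env = os.environ if env is None else env
    names = [f"{prefix}_{suffix}" for suffix in ("AK", "SK", "ENDPOINT")]
    # Fail early, naming every variable that is unset or empty.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return tuple(env[name] for name in names)
```

For example, `pdf_ak, pdf_sk, pdf_endpoint = s3_credentials("PDF")` would supply the arguments for the first client.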

For a demo, see [demo.py](demo/demo.py).
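
Both API snippets end with `md_content`, which still has to be written somewhere. A minimal sketch for the local case follows, assuming `md_content` is a string. `save_markdown` is a hypothetical helper, not a Magic-PDF function.

```python
from pathlib import Path


def save_markdown(md_content, output_dir, stem):
    """Write md_content to <output_dir>/<stem>.md and return the file path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    md_path = output_dir / f"{stem}.md"
    md_path.write_text(md_content, encoding="utf-8")
    return md_path
```

If the Markdown refers to images through the relative `image_dir` name, writing the file into the parent directory of `local_image_dir` should keep those references valid.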

## All Thanks To Our Contributors

<a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
</a>

## License Information

[LICENSE.md](LICENSE.md)

The project currently uses PyMuPDF for some of its advanced functionality. Because PyMuPDF is licensed under the AGPL, this may restrict certain use cases. In future releases, we intend to explore and move to a more permissively licensed PDF processing library.

## Acknowledgments

- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)


# Magic-Doc

## Introduction

Magic-Doc is a tool that converts web pages and multi-format e-books into Markdown.

Key Features Include:

- Web Page Extraction
  - Cross-modal precise parsing of text, images, tables, and formula information.

- E-Book Document Extraction
  - Supports various document formats, including EPUB and MOBI, with full handling of text and images.

- Language Type Identification
  - Accurate recognition of 176 languages.

https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca



https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d



https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2



## Project Repository

- [Magic-Doc](https://github.com/magicpdf/Magic-Doc)
  - Outstanding Webpage and E-book Extraction Tool