<div id="top"></div>
<div align="center">

[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![license](https://img.shields.io/github/license/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)

[English](README.md) | [简体中文](README_zh-CN.md)

</div>

<div align="center">

</div>

# MinerU 

## Introduction

MinerU is a one-stop, open-source data extraction tool. It includes the following components:

- [Magic-PDF](#Magic-PDF)  PDF Document Extraction  
- [Magic-Doc](#Magic-Doc)  Webpage & E-book Extraction

# Magic-PDF

## Introduction

Magic-PDF is a tool that converts PDF documents into Markdown. It can process files stored locally or on object storage that supports the S3 protocol.

Key features include:

- Support for multiple front-end model inputs
- Removal of headers, footers, footnotes, and page numbers
- Human-readable layout formatting
- Preservation of the original document's structure and formatting, including headings, paragraphs, and lists
- Extraction and display of images and tables within markdown
- Conversion of equations into LaTeX format
- Automatic detection and conversion of garbled PDFs
- Compatibility with CPU and GPU environments
- Available for Windows, Linux, and macOS platforms


https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3131a5f4070



## Project Panorama

![Project Panorama](docs/images/project_panorama_en.png)

## Flowchart

![Flowchart](docs/images/flowchart_en.png)

### Submodule Repositories

- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
  - A Comprehensive Toolkit for High-Quality PDF Content Extraction
- [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark)
  - An end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios

## Getting Started

### Requirements

- Python >= 3.9
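
The version requirement above can be checked before installation. The following is a minimal sketch; `MIN_PYTHON` and `python_is_supported` are hypothetical names introduced here for illustration and are not part of Magic-PDF.

```python
import sys

# Minimum interpreter version taken from the requirement listed above.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if python_is_supported():
        print("Python version is supported.")
    else:
        print("Python >= 3.9 is required.")
```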

### Usage Instructions

#### 1. Install Magic-PDF

```bash
pip install magic-pdf
```

#### 2. Usage via Command Line

###### simple

```bash
cp magic-pdf.template.json ~/magic-pdf.json
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
```
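To drive the CLI from a script, the command above can be assembled in Python. This is a minimal sketch: `build_pdf_command` and `run_pdf_command` are hypothetical helper names, and only the `pdf-command`, `--pdf`, and `--model` arguments shown above are used.

```python
import shlex
import subprocess


def build_pdf_command(pdf_path, model_json_path):
    """Build the argument list for the CLI call shown above."""
    return [
        "magic-pdf",
        "pdf-command",
        "--pdf", str(pdf_path),
        "--model", str(model_json_path),
    ]


def run_pdf_command(pdf_path, model_json_path):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    cmd = build_pdf_command(pdf_path, model_json_path)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command only; call run_pdf_command() to execute it.
    print(shlex.join(build_pdf_command("pdf_path", "model_json_path")))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.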

###### more 

```bash
magic-pdf --help
```

#### 3. Usage via API

###### Local
```python
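import os

# Assumed import paths; check demo/demo.py for the ones matching your installed version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# pdf_bytes (raw PDF bytes), model_json (model output), and local_image_dir are supplied by the caller.
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")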
```
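The local snippet above expects `pdf_bytes` and `model_json` to exist already. A minimal sketch of loading them from disk follows; `load_inputs` is a hypothetical helper, and it assumes the model output is stored as a UTF-8 JSON file, which the source does not state.

```python
import json
from pathlib import Path


def load_inputs(pdf_path, model_json_path):
    """Read the PDF as raw bytes and the model output as parsed JSON.

    Assumption: the model output is a UTF-8 encoded JSON file.
    """
    pdf_bytes = Path(pdf_path).read_bytes()
    model_json = json.loads(Path(model_json_path).read_text(encoding="utf-8"))
    return pdf_bytes, model_json
```

The returned values can then be passed to the pipeline as `pdf_bytes` and as the `model_list` entry of `jso_useful_key`.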

###### Object Storage
```python
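# Assumed import paths; check demo/demo.py for the ones matching your installed version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Access keys, secret keys, endpoints, s3_pdf_path, and model_json are supplied by the caller.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")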
```
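The object storage example uses `s3://` style paths such as `s3://img_bucket/`. As an illustration of how such a path breaks down into a bucket and an object key, here is a minimal sketch; `split_s3_path` is a hypothetical helper and is not part of the Magic-PDF API.

```python
def split_s3_path(s3_path):
    """Split 's3://bucket/key' into (bucket, key).

    Raises ValueError if the path does not use the s3:// scheme
    or has no bucket name.
    """
    prefix = "s3://"
    if not s3_path.startswith(prefix):
        raise ValueError(f"not an s3:// path: {s3_path!r}")
    # Everything up to the first "/" is the bucket; the rest is the key.
    bucket, _, key = s3_path[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {s3_path!r}")
    return bucket, key
```

A path that ends right after the bucket, like `s3://img_bucket/`, yields an empty key, which corresponds to the bucket root.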

For a complete example, see [demo.py](demo/demo.py).

## All Thanks To Our Contributors

<a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=magicpdf/Magic-PDF" />
</a>

## License Information

[LICENSE.md](LICENSE.md)

The project currently relies on PyMuPDF for advanced functionality. PyMuPDF is licensed under the AGPL, which may restrict certain use cases. In future releases, we plan to explore and switch to a PDF processing library with a more permissive license.

## Acknowledgments

- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)


# Magic-Doc

## Introduction

Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.

Key Features Include:

- Web Page Extraction
  - Cross-modal precise parsing of text, images, tables, and formula information.

- E-Book Document Extraction
  - Supports various document formats, including EPUB and MOBI, with full support for text and images.

- Language Type Identification
  - Accurate recognition of 176 languages.

https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca



https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d



https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2



## Project Repository

- [Magic-Doc](https://github.com/magicpdf/Magic-Doc)
  Outstanding Webpage and E-book Extraction Tool