<div id="top"></div>
<div align="center">

[![stars](https://img.shields.io/github/stars/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF)
[![forks](https://img.shields.io/github/forks/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF)
[![license](https://img.shields.io/github/license/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues)
[![open issues](https://img.shields.io/github/issues-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues)

[English](README.md) | [简体中文](README_zh-CN.md)

</div>

<div align="center">

</div>

# Magic-PDF

## Introduction

Magic-PDF converts PDF documents into Markdown. It can process files stored locally or on object storage that supports the S3 protocol.

Key features include:

- Support for multiple front-end model inputs
- Removal of headers, footers, footnotes, and page numbers
- Human-readable layout formatting
- Preservation of the original document's structure and formatting, including headings, paragraphs, and lists
- Extraction and display of images and tables within Markdown
- Conversion of equations into LaTeX format
- Automatic detection and conversion of garbled PDFs
- Compatibility with CPU and GPU environments
- Available for Windows, Linux, and macOS platforms

## Project Panorama

![Project Panorama](docs/images/project_panorama_en.png)

## Flowchart

![Flowchart](docs/images/flowchart_en.png)

### Submodule Repositories

- [pdf-extract-kit](https://github.com/wangbinDL/pdf-extract-kit)
- [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark)

## Getting Started

### Requirements

- Python 3.9 or newer
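
The requirement above can be checked from a script before installing. The following is a minimal sketch; the function name `python_is_supported` and the constant `MIN_PYTHON` are illustrative and not part of Magic-PDF.

```python
import sys

# Minimum interpreter version stated in the Requirements section.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)
```

Calling `python_is_supported()` with no arguments checks the running interpreter; passing a tuple such as `(3, 8, 18)` checks an arbitrary version.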

### Usage Instructions

#### 1. Install Magic-PDF
```bash
pip install magic-pdf
```

#### 2. Usage via Command Line

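###### Simple
```bash
# Copy the configuration template to your home directory
cp magic-pdf.template.json ~/magic-pdf.json
# Convert a PDF using the model output stored in a JSON file
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"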
```
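On platforms without `cp`, the configuration copy step can be done from Python instead. This is a minimal sketch using only the standard library; the helper name `install_config` and its choice to leave an existing file untouched are assumptions for illustration, not Magic-PDF behavior.

```python
import shutil
from pathlib import Path


def install_config(template_path, target_path=None):
    """Copy the config template to target_path (default: ~/magic-pdf.json).

    Returns (target, copied); copied is False when the target already exists.
    """
    template = Path(template_path)
    target = Path(target_path) if target_path is not None else Path.home() / "magic-pdf.json"
    if not template.is_file():
        raise FileNotFoundError(f"template not found: {template}")
    if target.exists():
        # Leave an existing configuration untouched.
        return target, False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template, target)
    return target, True
```

For example, `install_config("magic-pdf.template.json")` copies the template to `~/magic-pdf.json` unless that file is already present.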
###### More
```bash
magic-pdf --help
```
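To convert several PDFs, the `pdf-command` invocation shown above can be generated per file. The sketch below only builds the argument lists. It assumes, hypothetically, that each PDF's model output is stored as a JSON file with the same base name in a separate directory; the helper name `build_commands` is illustrative.

```python
from pathlib import Path


def build_commands(pdf_dir, model_dir):
    """Build one `magic-pdf pdf-command` argument list per PDF in pdf_dir.

    Assumes the model output for `name.pdf` is `model_dir/name.json`
    (a hypothetical layout); PDFs without a matching JSON file are skipped.
    """
    pdf_dir, model_dir = Path(pdf_dir), Path(model_dir)
    commands = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        model_json = model_dir / (pdf.stem + ".json")
        if not model_json.is_file():
            continue
        commands.append(
            ["magic-pdf", "pdf-command", "--pdf", str(pdf), "--model", str(model_json)]
        )
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once `magic-pdf` is installed and configured.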

#### 3. Usage via API

###### Local
```python
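import os

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# pdf_bytes, model_json, and local_image_dir must be defined beforehand.
image_writer = DiskReaderWriter(local_image_dir)  # stores extracted images on disk
image_dir = str(os.path.basename(local_image_dir))  # directory name passed to the Markdown step
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")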
```
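
The local snippet expects `pdf_bytes` and `model_json` to exist already. A minimal way to prepare them from files on disk is sketched below, assuming the model output was saved as a UTF-8 JSON file (as the `--model "model_json_path"` option of the CLI suggests); the helper name `load_inputs` is illustrative.

```python
import json
from pathlib import Path


def load_inputs(pdf_path, model_json_path):
    """Read the raw PDF bytes and the parsed model output from local files."""
    pdf_bytes = Path(pdf_path).read_bytes()
    model_json = json.loads(Path(model_json_path).read_text(encoding="utf-8"))
    return pdf_bytes, model_json
```

The two return values can be used directly as `pdf_bytes` and `model_json` in the snippet above.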

###### Object Storage
```python
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Credentials, endpoints, s3_pdf_path, and model_json must be defined beforehand.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
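Paths such as `s3://img_bucket/` combine a bucket name and an object key. When validating such paths before passing them on, they can be split with the standard library. This is an illustrative sketch only: `split_s3_uri` is a hypothetical helper and says nothing about how `S3ReaderWriter` parses paths internally.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not a valid S3 URI: {uri!r}")
    # The key is the path without its leading slash; it may be empty.
    return parsed.netloc, parsed.path.lstrip("/")
```

A local path such as `/tmp/a.pdf` raises `ValueError`, which gives a simple way to tell local inputs from object-storage inputs.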

For a complete example, see [demo.py](https://github.com/magicpdf/Magic-PDF/blob/master/demo/demo.py).

## All Thanks To Our Contributors

<a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=magicpdf/Magic-PDF" />
</a>

## License Information

See [LICENSE.md](https://github.com/magicpdf/Magic-PDF/blob/master/LICENSE.md) for details.

## Acknowledgments

- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)