<div id="top"></div>
<div align="center">

[![stars](https://img.shields.io/github/stars/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF)
[![forks](https://img.shields.io/github/forks/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF)
[![license](https://img.shields.io/github/license/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues)
[![open issues](https://img.shields.io/github/issues-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues)

[English](README.md) | [简体中文](README_zh-CN.md)

</div>

<div align="center">

</div>

# Magic-PDF

## Introduction

Magic-PDF is a tool that converts PDF documents into Markdown. It can process files stored locally or on object storage that supports the S3 protocol.

Key features include:

- Support for multiple front-end model inputs
- Removal of headers, footers, footnotes, and page numbers
- Human-readable layout formatting
- Preservation of the original document's structure and formatting, including headings, paragraphs, and lists
- Extraction and display of images and tables within Markdown
- Conversion of equations into LaTeX format
- Automatic detection and conversion of garbled PDFs
- Compatibility with CPU and GPU environments
- Available for Windows, Linux, and macOS platforms


https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3131a5f4070



## Project Panorama

![Project Panorama](docs/images/project_panorama_en.png)

## Flowchart

![Flowchart](docs/images/flowchart_en.png)

### Submodule Repositories

- [pdf-extract-kit](https://github.com/wangbinDL/pdf-extract-kit)
- [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark)

## Getting Started

### Requirements

- Python 3.9 or newer

### Usage Instructions

#### 1. Install Magic-PDF
```bash
pip install magic-pdf
```

#### 2. Usage via Command Line

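###### Simple
```bash
# Copy the configuration template to your home directory
cp magic-pdf.template.json ~/magic-pdf.json
# Convert a PDF using the model output stored in a JSON file
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
```
###### More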
```bash
magic-pdf --help
```
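
To run the same conversion from a script, the command can be assembled in Python. This is a minimal sketch that uses only the `pdf-command`, `--pdf`, and `--model` arguments shown above; the helper names `build_command` and `convert` are hypothetical and not part of Magic-PDF.

```python
import subprocess


def build_command(pdf_path, model_json_path):
    """Return the argument list for the CLI call shown in this README."""
    return [
        "magic-pdf", "pdf-command",
        "--pdf", str(pdf_path),
        "--model", str(model_json_path),
    ]


def convert(pdf_path, model_json_path):
    """Run the CLI and raise CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(build_command(pdf_path, model_json_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.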

#### 3. Usage via API

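###### Local
```python
import os
# Import paths assumed from the package layout; check demo/demo.py for your version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# pdf_bytes, model_json, and local_image_dir must be defined by the caller.
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```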
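
The local example leaves `pdf_bytes` and `model_json` undefined and does not save `md_content`. The sketch below shows one way to fill those gaps with the standard library only. The helper names `load_inputs` and `save_markdown` are hypothetical, and handling a list result is a defensive assumption, since the return type of `pipe_mk_markdown` is not stated here.

```python
import json
from pathlib import Path


def load_inputs(pdf_path, model_json_path):
    """Read the PDF as raw bytes and the model output as parsed JSON."""
    pdf_bytes = Path(pdf_path).read_bytes()
    model_json = json.loads(Path(model_json_path).read_text(encoding="utf-8"))
    return pdf_bytes, model_json


def save_markdown(md_content, out_path):
    """Write the Markdown to disk, joining it first if it arrives as a list."""
    if isinstance(md_content, list):
        md_content = "\n".join(md_content)
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(md_content, encoding="utf-8")
    return out_path
```

Call `load_inputs` before constructing `UNIPipe`, and pass the result of `pipe_mk_markdown` to `save_markdown`.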

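###### Object Storage
```python
# Import paths assumed from the package layout; check demo/demo.py for your version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Credentials, endpoints, s3_pdf_path, and model_json must be defined by the caller.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
# Extracted images are written under image_dir with a separate set of credentials.
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```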

For a complete example, see [demo.py](demo/demo.py).

## All Thanks To Our Contributors

<a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=magicpdf/Magic-PDF" />
</a>

## License Information

See [LICENSE.md](LICENSE.md) for details.

## Acknowledgments

- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)