<div id="top"></div>
<div align="center">

[![stars](https://img.shields.io/github/stars/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF)
[![forks](https://img.shields.io/github/forks/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF)
[![license](https://img.shields.io/github/license/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues)
[![open issues](https://img.shields.io/github/issues-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues)

[English](README.md) | [简体中文](README_zh-CN.md)

</div>

<div align="center">

</div>

# Magic-PDF

## Introduction

Magic-PDF converts PDF documents to Markdown. It can process files stored locally or on object storage that supports the S3 protocol.

Key features include:

- Support for multiple front-end model inputs
- Removal of headers, footers, footnotes, and page numbers
- Human-readable layout formatting
- Retention of the original document's structure and formatting, including headings, paragraphs, and lists
- Extraction and display of images and tables within Markdown
- Conversion of equations into LaTeX format
- Automatic detection and conversion of garbled PDFs
- Compatibility with CPU and GPU environments
- Available for Windows, Linux, and macOS platforms

## Project Panorama

![Project Panorama](docs/images/project_panorama_en.png)

## Getting Started

### Requirements

- Python 3.9 or newer

### Usage Instructions

#### 1. Install Magic-PDF
```bash
pip install magic-pdf
```

#### 2. Usage via Command Line

###### Simple
```bash
cp magic-pdf.template.json ~/magic-pdf.json
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
```
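
To convert several PDFs, the command above can be driven from a script. The following is a minimal sketch, not part of Magic-PDF: the helper names `build_command` and `convert_directory` are hypothetical, and it assumes each `name.pdf` has a matching model output file `name.json` in a separate directory. Only the `pdf-command`, `--pdf`, and `--model` arguments shown above are used.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, model_json_path):
    """Build the argument list for the `magic-pdf pdf-command` call shown above."""
    return [
        "magic-pdf", "pdf-command",
        "--pdf", str(pdf_path),
        "--model", str(model_json_path),
    ]


def convert_directory(pdf_dir, model_dir, dry_run=False):
    """Run the CLI once per PDF in pdf_dir and return the commands used.

    Assumption: each `name.pdf` has a matching `name.json` in model_dir.
    """
    commands = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        model_path = Path(model_dir) / (pdf_path.stem + ".json")
        if not model_path.exists():
            continue  # skip PDFs that have no model output
        cmd = build_command(pdf_path, model_path)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if the CLI exits non-zero
            subprocess.run(cmd, check=True)
    return commands
```

Calling `convert_directory("pdfs", "models", dry_run=True)` returns the commands without running them, which is a convenient way to check the file pairing first.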
###### More
```bash
magic-pdf --help
```

#### 3. Usage via API

###### Local
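```python
import os

# Assumed module paths; check demo.py against your installed version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# pdf_bytes, model_json, and local_image_dir must be provided by the caller.
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```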
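
The Local snippet above expects `pdf_bytes` and `model_json` to exist already and leaves `md_content` in memory. The sketch below shows one way to prepare those inputs and save the result using only the standard library. The helper names `load_inputs` and `save_markdown` are hypothetical, and it assumes the model output is stored as a JSON file and that `md_content` is either a string or a list of strings.

```python
import json
from pathlib import Path


def load_inputs(pdf_path, model_json_path):
    """Return (pdf_bytes, model_json) for use with the Local snippet above."""
    pdf_bytes = Path(pdf_path).read_bytes()
    # Assumption: the model output is stored as a JSON file.
    model_json = json.loads(Path(model_json_path).read_text(encoding="utf-8"))
    return pdf_bytes, model_json


def save_markdown(md_content, output_path):
    """Write the generated Markdown to disk, creating parent directories."""
    # Assumption: md_content is a string or a list of strings.
    if isinstance(md_content, list):
        md_content = "\n".join(md_content)
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(md_content, encoding="utf-8")
    return output_path
```

For example, `pdf_bytes, model_json = load_inputs("paper.pdf", "paper.json")` before building the pipe, and `save_markdown(md_content, "output/paper.md")` afterwards.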

###### Object Storage
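```python
# Assumed module paths; check demo.py against your installed version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Credentials, endpoints, s3_pdf_path, and model_json must be provided by the caller.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```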

For a complete example, see [demo.py](https://github.com/magicpdf/Magic-PDF/blob/master/demo/demo.py).

## All Thanks To Our Contributors

<a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=magicpdf/Magic-PDF" />
</a>

## License Information

See [LICENSE.md](https://github.com/magicpdf/Magic-PDF/blob/master/LICENSE.md) for details.

## Acknowledgments

- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)