<div id="top"></div>
<div align="center">

[![stars](https://img.shields.io/github/stars/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF)
[![forks](https://img.shields.io/github/forks/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF)
[![license](https://img.shields.io/github/license/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues)
[![open issues](https://img.shields.io/github/issues-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues)

[English](README.md) | [简体中文](README_zh-CN.md)

</div>

<div align="center">

</div>

# Magic-PDF

## Introduction

Magic-PDF converts PDF documents to Markdown. It can process files stored locally or in object storage that supports the S3 protocol.

Key features include:

- Support for multiple front-end model inputs
- Removal of headers, footers, footnotes, and page numbers
- Human-readable layout formatting
- Retention of the original document's structure and formatting, including headings, paragraphs, and lists
- Extraction and display of images and tables within Markdown
- Conversion of equations into LaTeX format
- Automatic detection and conversion of garbled PDFs
- Compatibility with CPU and GPU environments
- Available for Windows, Linux, and macOS platforms

## Getting Started

### Requirements

- Python 3.9 or newer

### Usage Instructions

#### 1. Install Magic-PDF
```bash
pip install magic-pdf
```

#### 2. Usage via Command Line

###### Simple
```bash
# Copy the configuration template to your home directory
cp magic-pdf.template.json ~/magic-pdf.json
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
```
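
To convert several PDFs in one go, the command above can be assembled per file and handed to a process runner. The following is a minimal sketch, not part of Magic-PDF: `build_commands` is a hypothetical helper, and pairing `<name>.pdf` with `<name>.json` is an assumed naming convention for the model output. Only the `pdf-command`, `--pdf`, and `--model` arguments come from the usage shown above.

```python
from pathlib import Path


def build_commands(pdf_dir, model_dir):
    """Return one magic-pdf argument list per PDF that has a matching model JSON."""
    commands = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # Assumed convention: <stem>.pdf pairs with <stem>.json in model_dir.
        model_path = Path(model_dir) / f"{pdf_path.stem}.json"
        if not model_path.is_file():
            continue  # skip PDFs that have no model output yet
        commands.append([
            "magic-pdf", "pdf-command",
            "--pdf", str(pdf_path),
            "--model", str(model_path),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once Magic-PDF is installed and `~/magic-pdf.json` is in place.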
###### More
```bash
magic-pdf --help
```

#### 3. Usage via API

###### Local
```python
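import os

# Import paths are assumed from the package layout; check your installed version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# pdf_bytes, model_json, and local_image_dir are supplied by the caller.
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")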
```
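
The local snippet above expects `pdf_bytes` and `model_json` to exist already and leaves `md_content` in memory. A minimal sketch of that surrounding file handling, using only the standard library, might look like the following. `load_inputs` and `save_markdown` are hypothetical helper names, and the sketch assumes that the model output is stored as a JSON file and that `md_content` is a string.

```python
import json
from pathlib import Path


def load_inputs(pdf_path, model_json_path):
    """Read the PDF as raw bytes and the model output as parsed JSON."""
    pdf_bytes = Path(pdf_path).read_bytes()
    with open(model_json_path, encoding="utf-8") as f:
        model_json = json.load(f)
    return pdf_bytes, model_json


def save_markdown(md_content, out_path):
    """Write the generated Markdown to disk as UTF-8."""
    # Assumption: md_content is a str; adapt this if your version returns another type.
    Path(out_path).write_text(md_content, encoding="utf-8")
```

Call `load_inputs` before constructing the pipe and `save_markdown` after `pipe_mk_markdown` returns.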

###### Object Storage
```python
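# Import paths are assumed from the package layout; check your installed version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Credentials, endpoints, s3_pdf_path, and model_json are supplied by the caller.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)  # read the PDF in binary mode
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")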
```

For a complete example, see [demo.py](https://github.com/magicpdf/Magic-PDF/blob/master/demo/demo.py).

## All Thanks To Our Contributors

<a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=magicpdf/Magic-PDF" />
</a>

## License Information

See [LICENSE.md](https://github.com/magicpdf/Magic-PDF/blob/master/LICENSE.md) for details.

## Acknowledgments

- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)