# magic_pdf

## I. Installing PDF document parsing

### The following demonstrates installing the PDF parsing module on node 223 (the image 1177ea7959ce can be used directly)

### 1. Download this project

`git clone http://developer.sourcefind.cn/codes/zhiAn123/magic_pdf.git`

#### Install dependencies

[Note: the provided image does not yet support the DocLayout-YOLO model. To use it, download the latest library files.]

`pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple`

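After installation, uninstall the packages that must instead come from 光源 (torch, vllm, etc.), as well as unsupported packages such as the nvidia-cuda ones.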
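
To see which installed packages the uninstall note applies to, a small helper can list the candidates. This is a minimal sketch: the exact names `torch` and `vllm` come from the note above, while the `nvidia-` prefix is an assumption about how the unsupported CUDA wheels are named in your environment.

```python
from importlib.metadata import distributions

# Names taken from the note above; the "nvidia-" prefix is an assumption.
EXACT_NAMES = {"torch", "vllm"}
PREFIXES = ("nvidia-",)


def select_packages(installed, exact=EXACT_NAMES, prefixes=PREFIXES):
    """Return the package names that match the uninstall criteria, sorted."""
    selected = []
    for name in installed:
        # Normalise so that "nvidia_cudnn_cu12" and "vLLM" are also matched.
        normalized = name.lower().replace("_", "-")
        if normalized in exact or normalized.startswith(prefixes):
            selected.append(name)
    return sorted(selected)


def installed_package_names():
    """List the distribution names visible to the current interpreter."""
    names = []
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            names.append(name)
    return names


candidates = select_packages(installed_package_names())
if candidates:
    # Only prints the command; review it before running it yourself.
    print("pip uninstall -y " + " ".join(candidates))
else:
    print("No matching packages found.")
```

The script only prints a suggested `pip uninstall` command, so nothing is removed until you run that command yourself.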

### 2. Download the required models

Download the Qwen model: [fast download channel](http://113.200.138.88:18080/aimodels/qwen/Qwen2-VL-7B-Instruct.git)

Download the models required for PDF parsing: [fast download channel](http://113.200.138.88:18080/aimodels/opendatalab/PDF-Extract-Kit)

`pip install modelscope`

`wget https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models.py -O download_models.py`

`python download_models.py`

[Note: once download_models.py finishes, the model files and layout files are placed under /root/.cache.]

### 3. Install the required dependencies

#### Enter the project root (all following steps are run from the project root)

`cd magic_pdf`

Install from the local source:

`pip install -e .`

`pip install qwen_vl_utils`

`pip install easyofd`


When running on domestic accelerator cards, installing an adapted [paddle](https://cancon.hpccube.com:65024/4/main/paddle) build can greatly speed up text parsing (choose the adapted paddle version that matches your container).

### 4. Edit magic-pdf.json

Running download_models.py in step 2 creates a file named magic-pdf.json under /root. Edit its contents:

<div align=center>
    <img src="doc/image (9).png"/>
</div>

`"models-dir": "[model path]"` must point to **the models folder inside the PDF parsing model directory downloaded in step 2**.

<div align=center>
    <img src="doc/image (10).png"/>
</div>
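
The same change to `models-dir` can also be made from a script instead of an editor. This is a minimal sketch that assumes magic-pdf.json is plain JSON with a top-level `"models-dir"` key, as the step above describes; the function name `set_models_dir` is illustrative and not part of this project.

```python
import json
from pathlib import Path


def set_models_dir(config_path, models_dir):
    """Rewrite the "models-dir" entry of magic-pdf.json, keeping all other keys."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["models-dir"] = str(models_dir)
    # ensure_ascii=False keeps any non-ASCII paths readable in the file.
    path.write_text(
        json.dumps(config, ensure_ascii=False, indent=4) + "\n",
        encoding="utf-8",
    )
    return config
```

For example, `set_models_dir("/root/magic-pdf.json", models_path)`, where `models_path` is the `models` folder inside the PDF parsing model directory from step 2.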

### 5. Configure the route addresses in config.ini

`vim magic_pdf/config.ini`

The defaults are as follows:

<div align=center>
    <img src="doc/image13.png"/>
</div>

Adjust the route addresses as needed.
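
To check what is currently configured before or after editing, the file can be read with Python's standard `configparser`. This is a sketch under the assumption that config.ini is a standard INI file with section headers; the section and option names are whatever your copy of the file contains, and `load_config` is an illustrative name.

```python
import configparser


def load_config(config_path):
    """Return config.ini as {section: {option: value}} without modifying it."""
    # interpolation=None keeps literal "%" characters (common in URLs) intact.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(config_path, encoding="utf-8"):
        raise FileNotFoundError(config_path)
    return {section: dict(parser[section]) for section in parser.sections()}
```

Calling `load_config("magic_pdf/config.ini")` from the project root returns a nested dictionary that can be printed to confirm the route addresses. Note that `configparser` lower-cases option names by default.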

### 6. Start the qwen-ocr module

1. Set `ocr_workers` in magic_pdf/config.ini to the number of processes that run OCR parsing. The default is 4.

2. Set `vllm_able` in magic_pdf/config.ini. It defaults to True, which enables vLLM acceleration for qwen_ocr. Set it to False to disable vLLM.

<div align=center>
    <img src="doc/image_vllm.png"/>
</div>

#### 6.1 Start the qwen-ocr module without vLLM

Edit the model path in magic_pdf/magic_pdf/dict2md/ocr_server.py:

<div align=center>
    <img src="doc/image11.png"/>
</div>

##### Command to start the qwen-ocr service:

`python magic_pdf/dict2md/ocr_server.py`

DCU card 0 is used by default. Use `--dcu_id` to select a card, `-c` to set the Qwen model path, and `--config_path` to set the path to config.ini.

The qwen-ocr module has started successfully:
<div align=center>
    <img src="doc/image (5).png"/>
</div>

#### 6.2 Start the qwen-ocr module with vLLM

Edit the model path in magic_pdf/magic_pdf/dict2md/ocr_vllm_server.py.


##### Command to start the qwen-ocr-vllm service:

`CUDA_VISIBLE_DEVICES=0 python magic_pdf/dict2md/ocr_vllm_server.py`

DCU card 0 is used by default. Use `-c` to set the Qwen model path and `--config_path` to set the path to config.ini.

The qwen-ocr module has started successfully:
<div align=center>
    <img src="doc/image (5).png"/>
</div>
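
The choice between the two OCR servers in 6.1 and 6.2 can be sketched as a small launcher helper. The script paths and the `--dcu_id`, `-c`, and `--config_path` flags are the ones given above. Two things are assumptions: that `vllm_able` sits under some section header in config.ini (the section name is not stated here, so every section is searched), and that the vLLM server selects its card only through `CUDA_VISIBLE_DEVICES`, as in the command above. The function names are illustrative.

```python
import configparser
import os


def read_vllm_able(config_path, default=True):
    """Look up vllm_able in any section of config.ini; fall back to the default."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path, encoding="utf-8")
    for section in parser.sections():
        if parser.has_option(section, "vllm_able"):
            return parser.getboolean(section, "vllm_able")
    return default


def build_ocr_command(use_vllm, dcu_id=0, model_path=None, config_path=None):
    """Return (argv, env) for the vLLM or non-vLLM qwen-ocr server."""
    env = dict(os.environ)
    if use_vllm:
        command = ["python", "magic_pdf/dict2md/ocr_vllm_server.py"]
        # The vLLM server command above picks the card via this variable.
        env["CUDA_VISIBLE_DEVICES"] = str(dcu_id)
    else:
        command = ["python", "magic_pdf/dict2md/ocr_server.py",
                   "--dcu_id", str(dcu_id)]
    if model_path:
        command += ["-c", str(model_path)]
    if config_path:
        command += ["--config_path", str(config_path)]
    return command, env
```

From the project root, the result can be passed to `subprocess.run(command, env=env, check=True)` to start whichever server `vllm_able` selects.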


### 7. Start the pdf-server parsing service

#### Command to start the pdf-server parsing service:

`python magic_pdf/tools/pdf_server.py`

DCU card 0 is used by default. Use `--dcu_id` to select a card and `--config_path` to set the path to config.ini.

<div align=center>
    <img src="doc/image (6).png"/>
</div>

Started successfully:
<div align=center>
    <img src="doc/image (7).png"/>
</div>
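
When the services are started from a script, it can help to wait until a service accepts connections before sending work to it. The sketch below is a generic TCP readiness check and not part of this project. The host and port are hypothetical parameters that you would take from the route addresses in your config.ini; no port is stated in this README.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A successful connection only shows that the port is open. It does not confirm that the models have finished loading, so the startup log shown above remains the authoritative signal.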

### 8. Parse a PDF

`python magic_pdf/parse/common_parse.py -p [file or directory path] -o [output path]`

`-p` sets the PDF path, `-o` sets the output path, and `--config_path` sets the path to config.ini.

<div align=center>
    <img src="doc/image12.png"/>
</div>

<div align=center>
    <img src="doc/image (8).png"/>
</div>
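
The parse command can also be assembled from Python, for example when it is called from another script. This is a minimal sketch that uses only the `-p`, `-o`, and `--config_path` flags documented above. Creating the output directory in advance is a precaution added here, not a documented requirement of `common_parse.py`, and `build_parse_command` is an illustrative name.

```python
from pathlib import Path


def build_parse_command(pdf_path, output_dir, config_path=None):
    """Validate the input path and return the argv list for common_parse.py."""
    source = Path(pdf_path)
    if not source.exists():
        raise FileNotFoundError(f"input not found: {source}")
    # Precaution: make sure the output directory exists before parsing.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    command = [
        "python", "magic_pdf/parse/common_parse.py",
        "-p", str(source),
        "-o", str(output_dir),
    ]
    if config_path:
        command += ["--config_path", str(config_path)]
    return command
```

Run the returned list from the project root with `subprocess.run(command, check=True)`.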

## II. Installing OFD document parsing

1. `pip install easyofd`

2. Install the required font files

List the fonts available in the container:

`fc-list`

Edit magic_pdf/tools/font_tools.py to set the font paths:

<div align=center>
    <img src="doc/font.png"/>
</div>

If the code fails at runtime with a font-not-found error:
<div align=center>
    <img src="doc/fonts.png"/>
</div>

Run the following:

`fc-list  # find the font files`

`vim magic_pdf/tools/font_tools.py`

If `fc-list` does not list the required font, add the font to the container first, then edit font_tools.py.
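
Finding the file path for a given font in the `fc-list` output can be scripted. The sketch below assumes the default `fc-list` line format, `path: family[,family...]:style=...`. The helper names are illustrative, and how the resulting path is written into font_tools.py depends on that file's contents, which are not shown here.

```python
import subprocess


def parse_fc_list(output):
    """Parse default fc-list output into (path, family) pairs."""
    fonts = []
    for line in output.splitlines():
        path, sep, description = line.partition(": ")
        if not sep:
            continue  # skip lines that do not follow "path: family:style"
        family = description.partition(":")[0]
        fonts.append((path.strip(), family.strip()))
    return fonts


def find_font_paths(output, name):
    """Return font file paths whose family field contains name (case-insensitive)."""
    wanted = name.lower()
    return [path for path, family in parse_fc_list(output)
            if wanted in family.lower()]


def run_fc_list():
    """Run fc-list and return its text output (requires fontconfig)."""
    result = subprocess.run(["fc-list"], capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `find_font_paths(run_fc_list(), "SimSun")` returns the matching font files, or an empty list if the font still needs to be added to the container.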
