# magic_pdf

## Installation

### The following demonstrates installing the PDF parsing module on node 223 (you can also use the image 1177ea7959ce directly)

### 1. Start the container
docker run -it --shm-size=1024G -v /parastor/home/zhougf/Qwen1.5-pytorch:/home/practice -v /opt/hyhal:/opt/hyhal --privileged=true --device=/dev/kfd --device=/dev/dri/ --network=host --group-add video --name pdf_tmp a4dd5be0ca23 bash
<div align=center>
    <img src="doc/image.png"/>
    <img src="doc/image (1).png"/>
</div>

### 2. Install the required dependencies
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple

<div align=center>
    <img src="doc/image (2).png"/>
</div>
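Note: this step also installs CUDA-related libraries (such as nvidia-cudnn) and libraries that have not been adapted (such as torchtext). Uninstall them once the installation finishes.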
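
To find the packages that the note above asks you to remove, a minimal sketch like the following could help. It assumes that the CUDA-related packages can be recognized by an `nvidia-` name prefix and that `torchtext` is the only unadapted library. Both are assumptions based on the examples in the note, so extend the lists for your environment. The helper names are hypothetical.

```python
from importlib import metadata

# Assumption: names taken from the examples in the note; adjust as needed.
UNWANTED_PREFIXES = ("nvidia-",)
UNWANTED_NAMES = {"torchtext"}


def find_unwanted(names):
    """Return the sorted, normalized names that look CUDA-specific or unadapted."""
    hits = set()
    for name in names:
        normalized = name.lower().replace("_", "-")
        if normalized in UNWANTED_NAMES or normalized.startswith(UNWANTED_PREFIXES):
            hits.add(normalized)
    return sorted(hits)


def installed_names():
    """Collect the names of all distributions visible to this interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            names.append(name)
    return names


def uninstall_commands():
    """Build one 'pip uninstall' command line per unwanted package."""
    return [f"pip uninstall -y {pkg}" for pkg in find_unwanted(installed_names())]
```

Calling `uninstall_commands()` inside the container returns the command lines to review before running them. Nothing is uninstalled automatically.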

Install the DTK builds of torch and torchvision.
Download the official project:
 git clone https://github.com/opendatalab/MinerU.git
#### Replace the magic_pdf directory in the official clone with the magic_pdf from this project
#### pip uninstall magic-pdf
#### pip install -e .

### 3. Install the required models
git clone https://www.modelscope.cn/opendatalab/PDF-Extract-Kit.git
#### Edit magic-pdf.template.json
cd MinerU
<div align=center>
    <img src="doc/image (9).png"/>
</div>
Note that "models-dir":"/home/practice/model/PDF-Extract-Kit/models" must point to the PDF-Extract-Kit/models directory.
Copy magic-pdf.template.json to the /root directory and rename it to magic-pdf.json.
<div align=center>
    <img src="doc/image (10).png"/>
</div>
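
The edit-and-copy step above can also be scripted. The sketch below loads the template, sets `"models-dir"`, and writes the result to the destination path. The function name `write_config` is hypothetical, and the sketch assumes the template is plain JSON with a top-level `"models-dir"` key, as the note above suggests.

```python
import json
from pathlib import Path


def write_config(template_path, models_dir, dest_path):
    """Write a copy of the template to dest_path with "models-dir" set to models_dir."""
    config = json.loads(Path(template_path).read_text(encoding="utf-8"))
    config["models-dir"] = str(models_dir)
    Path(dest_path).write_text(
        json.dumps(config, indent=4, ensure_ascii=False),
        encoding="utf-8",
    )
    return config
```

For the layout used in this README, the call would be `write_config("magic-pdf.template.json", "/home/practice/model/PDF-Extract-Kit/models", "/root/magic-pdf.json")`, run from the MinerU directory.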
### 4. Start the qwen-ocr module:

Install the qwen_vl_utils library, update the transformers library to version 4.45, and uninstall flash_attn:
(1) pip install qwen_vl_utils -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
(2) pip install transformers==4.45 -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
(3) pip uninstall flash_attn
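
After running the three commands, it may be worth confirming the environment before starting the module. The following is a rough sketch, not part of the project: the helper names are hypothetical, and it treats any `4.45.x` release of transformers as acceptable, which is an assumption.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(versions):
    """Return a list of problems for a {package: version-or-None} mapping."""
    problems = []
    if versions.get("qwen_vl_utils") is None:
        problems.append("qwen_vl_utils is not installed")
    transformers_version = versions.get("transformers")
    if transformers_version is None or transformers_version.split(".")[:2] != ["4", "45"]:
        problems.append(f"transformers should be 4.45.x, found {transformers_version}")
    if versions.get("flash_attn") is not None:
        problems.append("flash_attn is still installed")
    return problems


def check_environment():
    """Check the packages installed in the current interpreter."""
    packages = ("qwen_vl_utils", "transformers", "flash_attn")
    return check_versions({name: installed_version(name) for name in packages})
```

An empty list from `check_environment()` means the three conditions described above hold.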
By default the module uses port 6020 and DCU card 0. Use --dcu_id to select the card and --server_port to set the port.
<div align=center>
    <img src="doc/image (4).png"/>
</div>
The qwen-ocr module has started successfully:
<div align=center>
    <img src="doc/image (5).png"/>
</div>
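
To illustrate how the two options and their defaults fit together, here is a hypothetical `argparse` sketch. It is not the project's launch script. Only the flag names `--dcu_id` and `--server_port` and the defaults 0 and 6020 come from this README; everything else is assumed.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Hypothetical qwen-ocr launcher options")
    parser.add_argument("--dcu_id", type=int, default=0, help="index of the DCU card to use")
    parser.add_argument("--server_port", type=int, default=6020, help="port the service listens on")
    return parser


# Example: override only the card; the port keeps its default.
args = build_parser().parse_args(["--dcu_id", "1"])
print(f"DCU card {args.dcu_id}, port {args.server_port}")
```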

### 5. Start the pdf-server parsing service:
python magic_pdf/tools/pdf_server.py
<div align=center>
    <img src="doc/image (6).png"/>
</div>
The service has started successfully:
<div align=center>
    <img src="doc/image (7).png"/>
</div>
### 6. Parse a PDF
python magic_pdf/parse/common_parse.py -p other/接口人.xlsx -o other_res/
<div align=center>
    <img src="doc/image (8).png"/>
</div>
-p specifies the path of the PDF to parse; -o specifies the output path.
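
To parse several files in one go, the command above can be wrapped in a small script. This is only a sketch: it assumes `-p` takes one file per invocation and that the script is run from the repository root. The helper names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Script path taken from the command above; relative to the repository root.
PARSE_SCRIPT = "magic_pdf/parse/common_parse.py"


def build_parse_command(pdf_path, output_dir, script=PARSE_SCRIPT):
    """Build the argument list for one parser invocation."""
    return [sys.executable, script, "-p", str(pdf_path), "-o", str(output_dir)]


def parse_all(input_dir, output_dir):
    """Run the parser once per *.pdf file found directly in input_dir."""
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # check=True stops the loop at the first failing file.
        subprocess.run(build_parse_command(pdf, output_dir), check=True)
```

For example, `parse_all("other", "other_res/")` would invoke the parser for each PDF in `other/`. The pdf-server service from step 5 presumably needs to be running first.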