zhougaofeng / magic_pdf
Commit bf06d293, authored Oct 22, 2024 by zhougaofeng
Update README.md
parent 4bf3d6fd
Showing 1 changed file with 9 additions and 28 deletions
README.md
@@ -4,56 +4,37 @@
### The following demonstrates installing the PDF parsing module on node 223 (the image 1177ea7959ce can be used directly)
### 1. docker run -it --shm-size=1024G -v /parastor/home/zhougf/Qwen1.5-pytorch:/home/practice -v /opt/hyhal:/opt/hyhal --privileged=true --device=/dev/kfd --device=/dev/dri/ --network=host --group-add video --name pdf_tmp a4dd5be0ca23 bash
<div align=center>
<img src="doc/image.png"/>
<img src="doc/image (1).png"/>
</div>
### 2. Install the required dependencies
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
<div align=center>
<img src="doc/image (2).png"/>
</div>
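Note: this step also installs CUDA-related libraries (nvidia-cudnn) and libraries that have not been adapted (such as torchtext). Uninstall them once the installation finishes.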
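The cleanup described in the note above can be sketched as a small filter over the installed package list. This is a minimal sketch, not part of the original README: the function name `unwanted_packages` is hypothetical, and the match pattern (names starting with `nvidia-`, plus `torchtext`) is an assumption based on the two examples the note gives. Review the printed names before uninstalling anything.

```shell
# Hypothetical helper: reads `pip list --format=freeze` output (name==version)
# on stdin and prints the names of packages the note says to remove.
# Assumption: CUDA-related packages start with "nvidia-"; torchtext is the
# only unadapted library named in the note.
unwanted_packages() {
  grep -i -E '^(nvidia-|torchtext==)' | cut -d= -f1
}
```

A possible usage, assuming GNU xargs is available in the container: `pip list --format=freeze | unwanted_packages | xargs -r pip uninstall -y`.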
Install the DTK builds of torch and torchvision.
### 1. Install the required dependencies
Download the official project:
git clone https://github.com/opendatalab/MinerU.git
`git clone https://github.com/opendatalab/MinerU.git`
#### Replace the magic_pdf directory in the official git clone with the magic_pdf from this project
#### pip uninstall magic-pdf
#### pip install -e .
### 3. Install the required models
git clone https://www.modelscope.cn/opendatalab/PDF-Extract-Kit.git
### 2. Install the required models
`git clone https://www.modelscope.cn/opendatalab/PDF-Extract-Kit.git`
#### Edit magic-pdf.template.json
cd MinerU
<div align=center>
<img src="doc/image (9).png"/>
</div>
Note that "models-dir":"/home/practice/model/PDF-Extract-Kit/models" must point to the PDF-Extract-Kit/models directory.
Copy magic-pdf.template.json to the /root directory and rename it to magic-pdf.json.
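The two configuration steps above (set `models-dir`, then copy the template to `/root/magic-pdf.json`) can be combined into one command. The following is a minimal sketch under stated assumptions: the function name `write_config` is hypothetical, and it assumes the template stores `"models-dir"` as a quoted string on a single line and that the new path contains no `|` or `&` characters.

```shell
# Hypothetical helper: write a copy of the template with "models-dir"
# replaced by the given path. Arguments: template, target, models directory.
# The template itself is left unchanged.
write_config() {
  local template="$1" target="$2" models_dir="$3"
  sed -E "s|(\"models-dir\"[[:space:]]*:[[:space:]]*)\"[^\"]*\"|\1\"${models_dir}\"|" \
    "$template" > "$target"
}
```

With the paths used in this README, the call would be `write_config magic-pdf.template.json /root/magic-pdf.json /home/practice/model/PDF-Extract-Kit/models`.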
<div align=center>
<img src="doc/image (10).png"/>
</div>
### 4. Start the qwen-ocr module:
Install the qwen_vl_utils library, update transformers to version 4.45, and uninstall flash_attn:
(1) pip install qwen_vl_utils -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
(2) pip install transformers==4.45 -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
(3) pip uninstall flash_attn
`python magic_pdf/dict2md/ocr_server.py`
By default the server listens on port 6020 and uses DCU card 0. Use --dcu_id to select a different card and --server_port to set the port.
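To make the two options concrete, here is a small sketch that assembles the launch command with explicit values. The function name `ocr_server_cmd` is hypothetical; only the script path, the `--dcu_id` and `--server_port` flags, and the defaults (card 0, port 6020) come from this README. The function only prints the command and does not start the server.

```shell
# Hypothetical helper: print the OCR server launch command.
# Arguments: DCU card id (default 0), port (default 6020).
ocr_server_cmd() {
  local dcu_id="${1:-0}" port="${2:-6020}"
  printf '%s\n' "python magic_pdf/dict2md/ocr_server.py --dcu_id ${dcu_id} --server_port ${port}"
}
```

For example, `ocr_server_cmd 1 6021` prints the command for running on card 1 and port 6021, which can then be pasted into the container shell.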
<div align=center>
<img src="doc/image (4).png"/>
</div>
The qwen-ocr module has started successfully:
<div align=center>
<img src="doc/image (5).png"/>
</div>
### 5. Start the pdf-server parsing service:
python magic_pdf/tools/pdf_server.py
`python magic_pdf/tools/pdf_server.py`
<div align=center>
<img src="doc/image (6).png"/>
</div>
...
...
@@ -62,7 +43,7 @@ python magic_pdf/tools/pdf_server.py
<img src="doc/image (7).png"/>
</div>
### 6. Parse a PDF
python magic_pdf/parse/common_parse.py -p other/接口人.xlsx -o other_res/
`python magic_pdf/parse/common_parse.py -p [file or directory path] -o [output path]`
<div align=center>
<img src="doc/image (8).png"/>
</div>
...
...