# PaddleOCR-VL
## Paper
[PaddleOCR-VL](https://arxiv.org/abs/2510.14528)

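## Model Architecture
PaddleOCR-VL-0.9B is an ultra-lightweight vision-language model released by Baidu's PaddlePaddle team in October 2025 and optimized specifically for document parsing. It is one of the most capable derivatives of the ERNIE-4.5 series and serves as the core component of PaddleOCR-VL. This compact yet powerful vision-language model (VLM) combines a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to recognize document elements accurately. It supports 109 languages and performs well on complex elements such as text, tables, formulas, and charts, while keeping resource consumption very low.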
<div align=center>
    <img src="./doc/model.png"/>
</div>

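## Algorithm
PaddleOCR-VL splits the complex document-parsing task into two stages. In the first stage, PP-DocLayoutV2 performs layout analysis: it locates semantic regions and predicts their reading order. In the second stage, PaddleOCR-VL-0.9B uses these layout predictions to perform fine-grained recognition of diverse content such as text, tables, formulas, and charts. Finally, a lightweight post-processing module aggregates the outputs of both stages and formats the final document as structured Markdown and JSON.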
<div align=center>
    <img src="./doc/method.png"/>
</div>
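
The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the actual implementation: `Region`, `analyze_layout`, `recognize`, and `parse_document` are hypothetical names, and both stages are replaced by stubs that return canned values instead of running PP-DocLayoutV2 or PaddleOCR-VL-0.9B.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Region:
    """One layout region; every field name here is hypothetical."""
    label: str      # e.g. "text", "table", "formula", "chart"
    order: int      # predicted reading order
    bbox: tuple     # (x0, y0, x1, y1) in pixels
    content: str = ""


def analyze_layout(page):
    """Stage 1 stand-in: return regions with a reading order (canned values)."""
    return [
        Region("table", 2, (40, 300, 560, 520)),
        Region("text", 1, (40, 40, 560, 280)),
    ]


def recognize(page, region):
    """Stage 2 stand-in: recognize the content of a single region."""
    return f"<{region.label} content>"


def parse_document(page):
    """Run both stages, then aggregate the results into Markdown and JSON."""
    regions = sorted(analyze_layout(page), key=lambda r: r.order)
    for region in regions:
        region.content = recognize(page, region)
    markdown = "\n\n".join(r.content for r in regions)
    as_json = json.dumps([asdict(r) for r in regions], ensure_ascii=False)
    return markdown, as_json


markdown, as_json = parse_document(page=None)
print(markdown)
```

The point of the sketch is the ordering: regions are sorted by the reading order predicted in stage one before their recognized content is joined, so the Markdown output follows the reading order rather than the detection order.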

## Environment Setup
### Hardware Requirements
DCU model: K100AI; nodes: 1; cards: 1.

Adjust the `-v` mount path and `docker_name` to match your environment.

### Docker (Method 1)
```bash
# Pull the base image
docker pull image.sourcefind.cn:5000/dcu/admin/base/vllm:0.9.2-ubuntu22.04-dtk25.04.2-py3.10

# Start a container with the DCU devices and the code/data directory mounted
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro image.sourcefind.cn:5000/dcu/admin/base/vllm:0.9.2-ubuntu22.04-dtk25.04.2-py3.10 bash

# Inside the container: install the Python dependencies
cd /your_code_path/paddleocr-vl_paddle
python -m pip install paddlepaddle-dcu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/dcu/
python -m pip install -U "paddleocr[doc-parser]"
python -m pip install https://paddle-whl.bj.bcebos.com/nightly/cu126/safetensors/safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl
pip install paddlex==3.3.9
```

### Dockerfile (Method 2)
```bash
# Build the image from the Dockerfile in ./docker
cd docker
docker build --no-cache -t paddleocr-vl:latest .

# Start a container. The command below references the prebuilt base image;
# to run the image built above, substitute paddleocr-vl:latest for the image name.
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro image.sourcefind.cn:5000/dcu/admin/base/vllm:0.9.2-ubuntu22.04-dtk25.04.2-py3.10 bash

# Inside the container: install the Python dependencies
cd /your_code_path/paddleocr-vl_paddle
python -m pip install paddlepaddle-dcu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/dcu/
python -m pip install -U "paddleocr[doc-parser]"
python -m pip install https://paddle-whl.bj.bcebos.com/nightly/cu126/safetensors/safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl
pip install paddlex==3.3.9
```

### Anaconda (Method 3)
The DCU-specific deep learning libraries this project requires can be downloaded and installed from the [光合](https://developer.sourcefind.cn/tool/) developer community.
```bash
DTK: 25.04.2
python: 3.10.12
vllm: 0.9.2+das.opt1.dtk25042
transformers: 4.57.1
```
`Tips: the versions of the DTK driver, pytorch, and other DCU-related tools listed above must match each other exactly`. Install the remaining, non-deep-learning libraries from requirements.txt:
```bash
pip install -r requirements.txt
python -m pip install paddlepaddle-dcu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/dcu/
python -m pip install -U "paddleocr[doc-parser]"
python -m pip install https://paddle-whl.bj.bcebos.com/nightly/cu126/safetensors/safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl
pip install paddlex==3.3.9
```
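
Because the pinned versions need to line up exactly, a quick check after installation can catch a mismatch early. The sketch below compares installed distributions against the pins given in this README using the standard library's `importlib.metadata`. It assumes the version strings reported by pip are identical to the ones listed here (for example the local suffix of the vllm build); DTK is not a Python package and is not covered. `PINNED`, `installed_version`, and `find_mismatches` are illustrative names.

```python
from importlib import metadata

# Pins copied from this README; DTK is not a Python package and is not checked.
PINNED = {
    "paddlepaddle-dcu": "3.2.1",
    "paddlex": "3.3.9",
    "transformers": "4.57.1",
    "vllm": "0.9.2+das.opt1.dtk25042",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each mismatching package name to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


for name, (expected, found) in find_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The script prints one line per package whose installed version differs from the pin and prints nothing when everything matches.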

## Dataset
None

## Training
None

## Inference
> Adjust the model path, test image path, and output path to match your environment.
### Command-line inference
```bash
export PADDLE_PDX_DISABLE_DEV_MODEL_WL=1

paddleocr doc_parser -i ./doc/paddleocr_vl_demo.png --device DCU --precision fp32 --save_path ./output
```
### vllm
Server:
```bash
export PADDLE_PDX_DISABLE_DEV_MODEL_WL=1

vllm serve PaddlePaddle/PaddleOCR-VL --trust-remote-code --max-model-len 16384 --max-num-batched-tokens 16384 --gpu-memory-utilization 0.8 --served-model-name PaddleOCR-VL-0.9B
```
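
The server takes some time to load the weights before it accepts requests. A small readiness poll, sketched below, can be used before sending the first request. It assumes the default address `http://localhost:8000` and relies on the `/v1/models` route of vLLM's OpenAI-compatible server; `is_ready` and `wait_until_ready` are illustrative names, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url, timeout=2.0):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(base_url, attempts=60, delay=5.0, probe=is_ready, sleep=time.sleep):
    """Poll until the probe succeeds; return False if every attempt fails."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        sleep(delay)
    return False
```

`wait_until_ready("http://localhost:8000")` returns `True` as soon as the server responds and `False` if it is still unreachable after all attempts. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.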
Client:
```bash
curl http://localhost:8000/v1/chat/completions   \
    -H "Content-Type:application/json"  \
    -d '{
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png"}},
                    {"type": "text", "text": "OCR:"}
                ]
            }
        ],
        "temperature": 0.7
    }'
```
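
The same request can be sent from Python with only the standard library. The sketch below builds the identical JSON body and reads the reply from the OpenAI-style `choices[0].message.content` field. `build_payload` and `chat` are hypothetical helper names, and the 300-second timeout is an arbitrary choice.

```python
import json
import urllib.request

IMAGE_URL = "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png"


def build_payload(image_url, prompt="OCR:", temperature=0.7):
    """Build the same request body as the curl example above."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "temperature": temperature,
    }


def chat(base_url, payload, timeout=300):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("http://localhost:8000", build_payload(IMAGE_URL)))` prints the text recognized in the demo image.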

## Result

<div align=center>
    <img src="./doc/result-dcu.png"/>
</div>

### Accuracy
Accuracy on DCU is consistent with accuracy on GPU. Inference framework: paddle.

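## Application Scenarios
### Algorithm Category
OCR

### Popular Application Industries
`Manufacturing, finance, transportation, education, healthcare`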

## Pretrained Weights
- [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL)

## Source Repository and Issue Feedback
- https://developer.sourcefind.cn/codes/modelzoo/paddleocr-vl_paddle

## References
- https://github.com/PaddlePaddle/PaddleOCR
- https://www.paddleocr.ai/latest/version3.x/pipeline_usage/PaddleOCR-VL.html