Commit 92a367a9 authored by chenych

Add results and Update README

# Contributors
None
MIT License
Copyright (c) 2025 DeepSeek
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# PaddleOCR-VL
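## Paper
[PaddleOCR-VL](https://arxiv.org/abs/2510.14528)
## Model Architecture
PaddleOCR-VL is an ultra-lightweight vision-language solution optimized for document parsing. Baidu's PaddlePaddle team released it in October 2025, and it is one of the strongest derivative models in the ERNIE-4.5 family. Its core component, PaddleOCR-VL-0.9B, is a compact yet capable vision-language model (VLM). It pairs a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to recognize document elements accurately. The model supports 109 languages and performs well on complex elements such as text, tables, formulas, and charts, all while keeping resource consumption very low.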
<div align=center>
<img src="./doc/model.png"/>
</div>
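## Algorithm
PaddleOCR-VL splits complex document parsing into two stages. In the first stage, PP-DocLayoutV2 performs layout analysis: it locates semantic regions and predicts their reading order. In the second stage, PaddleOCR-VL-0.9B uses these layout predictions to perform fine-grained recognition of diverse content, including text, tables, formulas, and charts. Finally, a lightweight post-processing module aggregates the outputs of both stages and formats the final document as structured Markdown and JSON.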
<div align=center>
<img src="./doc/method.png"/>
</div>
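The two-stage flow described above can be sketched as follows. This is a hedged illustration, not the PaddleOCR implementation. `Region`, `parse_document`, and `format_block` are hypothetical names. The `detect_layout` and `recognize` callables stand in for PP-DocLayoutV2 and PaddleOCR-VL-0.9B. The Markdown formatting rules are simplified assumptions, and JSON output is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Region:
    """Hypothetical stage-1 output: one layout region."""
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    label: str                       # e.g. "text", "table", "formula", "chart"
    order: int                       # predicted reading-order position


def format_block(label: str, content: str) -> str:
    # Simplified, assumed formatting rule; the real post-processing differs.
    if label == "formula":
        return f"$$\n{content}\n$$"
    return content


def parse_document(
    image: object,
    detect_layout: Callable[[object], List[Region]],
    recognize: Callable[[object, Region], str],
) -> str:
    """Layout analysis, then per-region recognition, then aggregation."""
    # Stage 1: locate semantic regions and predict their reading order.
    regions = detect_layout(image)
    # Stage 2: recognize each region's content, visiting regions in reading order.
    blocks = [
        format_block(region.label, recognize(image, region))
        for region in sorted(regions, key=lambda r: r.order)
    ]
    # Post-processing: join the recognized blocks into one Markdown page.
    return "\n\n".join(blocks)
```

Stage 2 runs per region, so the recognizer always sees a cropped, already-classified element rather than the whole page. This mirrors the division of labor described above.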
## Environment Setup
### Hardware Requirements
DCU model: K100AI; nodes: 1; cards: 1.
Adjust the `-v` mount paths and `docker_name` to your environment.
### Docker (Option 1)
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.9.2-ubuntu22.04-dtk25.04.2-py3.10-paddleocr-vl
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged \
    --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root \
    -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro \
    image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.9.2-ubuntu22.04-dtk25.04.2-py3.10-paddleocr-vl bash
cd /your_code_path/paddleocr-vl_paddle
```
### Dockerfile (Option 2)
```bash
cd docker
docker build --no-cache -t paddleocr-vl:latest .
docker run -it --shm-size 200g --network=host --name {docker_name} --privileged \
    --device=/dev/kfd --device=/dev/dri --device=/dev/mkfd --group-add video \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root \
    -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro \
    paddleocr-vl:latest bash
cd /your_code_path/paddleocr-vl_paddle
```
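After entering the container, a quick sanity check is to confirm that the DCU device nodes and the `/opt/hyhal` mount are visible. Both are passed in by the `docker run` flags above. This is a minimal sketch that only checks whether the paths exist; it does not verify that the driver works. `REQUIRED_PATHS` and `check_paths` are hypothetical names.

```python
import os
from typing import Callable, Dict, Iterable

# Paths passed into the container by the docker run commands above:
# --device for the DCU nodes, -v for the read-only /opt/hyhal mount.
REQUIRED_PATHS = ("/dev/kfd", "/dev/dri", "/dev/mkfd", "/opt/hyhal")


def check_paths(paths: Iterable[str] = REQUIRED_PATHS,
                exists: Callable[[str], bool] = os.path.exists) -> Dict[str, bool]:
    """Return {path: True/False} for whether each path is visible here."""
    return {p: exists(p) for p in paths}


if __name__ == "__main__":
    for path, ok in check_paths().items():
        print(f"{path}: {'found' if ok else 'MISSING'}")
```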
### Anaconda (Option 3)
The DCU-specific deep learning libraries required by this project can be downloaded from the [光合 (Guanghe)](https://developer.sourcefind.cn/tool/) developer community.
```
DTK: 25.04.2
python: 3.10.12
vllm: 0.9.2+das.opt1.dtk25042
transformers: 4.57.1
```
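`Tip: the versions of the DTK driver and the DCU-related tools listed above must match exactly.` Install the remaining non-deep-learning libraries as described in requirements.txt: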
```bash
python -m pip install paddlepaddle-dcu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/dcu/
python -m pip install -U "paddleocr[doc-parser]"
python -m pip install https://paddle-whl.bj.bcebos.com/nightly/cu126/safetensors/safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl
```
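Because the versions must match exactly, it can help to verify the installed packages before running inference. The following is a hedged sketch. The expected versions come from the list and the `pip` command above. The distribution names (`vllm`, `transformers`, `paddlepaddle-dcu`) are assumptions about how the packages register themselves. DTK is a system toolkit and cannot be checked this way. `EXPECTED`, `installed_version`, and `find_mismatches` are hypothetical names.

```python
import platform
from importlib import metadata
from typing import Callable, Dict, Optional

# Expected versions from the list and pip command above; distribution names are assumed.
EXPECTED = {
    "vllm": "0.9.2+das.opt1.dtk25042",
    "transformers": "4.57.1",
    "paddlepaddle-dcu": "3.2.1",
}


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected: Dict[str, str],
                    lookup: Callable[[str], Optional[str]] = installed_version
                    ) -> Dict[str, Optional[str]]:
    """Map each missing or mismatched distribution to its installed version."""
    mismatches: Dict[str, Optional[str]] = {}
    for dist, want in expected.items():
        have = lookup(dist)
        if have != want:
            mismatches[dist] = have
    return mismatches


if __name__ == "__main__":
    print(f"python: {platform.python_version()} (expected 3.10.12)")
    bad = find_mismatches(EXPECTED)
    if not bad:
        print("All listed package versions match.")
    for dist, have in bad.items():
        print(f"{dist}: expected {EXPECTED[dist]}, found {have}")
```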
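## Dataset
Not available.
## Training
Not available.
## Inference
> Adjust the model path, test image path, and output path to your environment.
### Command-Line Inference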
```bash
export PADDLE_PDX_DISABLE_DEV_MODEL_WL=1
paddleocr doc_parser -i ./doc/paddleocr_vl_demo.png --device DCU --precision fp32 --save_path ./output
```
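To run the same command over several images, one option is to call the CLI from Python. This is a hypothetical sketch: `build_doc_parser_cmd` and `run_folder` are illustrative names, and the script uses only the flags and environment variable shown in the command above. Each invocation starts a new process and reloads the models. For many files, the `PaddleOCRVL` Python API is likely faster.

```python
import os
import subprocess
from pathlib import Path
from typing import List


def build_doc_parser_cmd(image: str, save_path: str = "./output",
                         device: str = "DCU", precision: str = "fp32") -> List[str]:
    """Return the `paddleocr doc_parser` command above as an argument list."""
    return ["paddleocr", "doc_parser", "-i", image,
            "--device", device, "--precision", precision,
            "--save_path", save_path]


def run_folder(folder: str, save_path: str = "./output") -> None:
    """Hypothetical helper: run the CLI once for every PNG file in `folder`."""
    # Same environment variable as the shell example above.
    env = dict(os.environ, PADDLE_PDX_DISABLE_DEV_MODEL_WL="1")
    for image in sorted(Path(folder).glob("*.png")):
        # check=True raises CalledProcessError if any run fails.
        subprocess.run(build_doc_parser_cmd(str(image), save_path),
                       check=True, env=env)


if __name__ == "__main__":
    run_folder("./doc")
```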
### vLLM
Server side:
```bash
export PADDLE_PDX_DISABLE_DEV_MODEL_WL=true
vllm serve PaddlePaddle/PaddleOCR-VL --trust-remote-code --max-model-len 16384 --max-num-batched-tokens 16384 --gpu-memory-utilization 0.8 --served-model-name PaddleOCR-VL-0.9B
```
Client side:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type:application/json" \
-d '{
"model": "PaddleOCR-VL-0.9B",
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png"}},
{"type": "text", "text": "OCR:"}
]
}
],
"temperature": 0.7
}'
```
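The same request can be sent from Python using only the standard library. This is a hedged sketch. It assumes the server started above is listening on `localhost:8000`, and it uses the `--served-model-name` value as the model name. It also assumes the server accepts base64 `data:` URLs for local images, as OpenAI-compatible chat endpoints generally do. `image_to_data_url`, `build_payload`, and `ocr` are hypothetical helper names.

```python
import base64
import json
import mimetypes
import urllib.request
from typing import Any, Dict


def image_to_data_url(path: str) -> str:
    """Encode a local image as a data URL so it can be sent inline."""
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_payload(image_url: str, prompt: str = "OCR:",
                  model: str = "PaddleOCR-VL-0.9B",
                  temperature: float = 0.7) -> Dict[str, Any]:
    """Build the same chat-completions request body as the curl example."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "temperature": temperature,
    }


def ocr(image_url: str,
        endpoint: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the request and return the generated text."""
    body = json.dumps(build_payload(image_url)).encode("utf-8")
    req = urllib.request.Request(endpoint, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # OpenAI-compatible responses carry the text in the first choice's message.
    return result["choices"][0]["message"]["content"]
```

Usage: `print(ocr(image_to_data_url("./doc/paddleocr_vl_demo.png")))` sends the local demo image. Passing the public demo URL from the curl example works the same way.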
## Results
<div align=center>
<img src="./doc/result-dcu.png"/>
</div>
### Accuracy
Accuracy on DCU is consistent with accuracy on GPU (inference framework: Paddle).
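The README does not state how DCU/GPU consistency was measured. One way to spot-check it on your own documents is to compare the Markdown each device produces, using a normalized edit-distance score. This is a hypothetical sketch: the metric and the input file paths are assumptions, not the method behind the statement above. The classic dynamic-programming edit distance runs in O(n·m) time, so it is practical only for page-sized texts.

```python
import sys
from pathlib import Path


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert/delete/substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """1 - normalized edit distance; 1.0 means identical outputs."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # Usage (hypothetical paths): python compare.py dcu_output.md gpu_output.md
    dcu_text = Path(sys.argv[1]).read_text(encoding="utf-8")
    gpu_text = Path(sys.argv[2]).read_text(encoding="utf-8")
    print(f"similarity: {similarity(dcu_text, gpu_text):.4f}")
```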
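## Application Scenarios
### Algorithm Category
OCR
### Key Application Industries
`Manufacturing, Finance, Transportation, Education, Healthcare`
## Pretrained Weights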
- [PaddleOCR-VL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL)
## Source Repository and Issue Feedback
- https://developer.sourcefind.cn/codes/modelzoo/deepseek-ocr_pytorch
## References
- https://github.com/PaddlePaddle/PaddleOCR
- https://www.paddleocr.ai/latest/version3.x/pipeline_usage/PaddleOCR-VL.html
FROM image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.9.2-ubuntu22.04-dtk25.04.2-py3.10-paddleocr-vl

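# Unique model identifier
modelCode=1791
# Model name
modelName=paddleocr-vl_paddle
# Model description
modelDescription=PaddlePaddle has officially released PaddleOCR-VL, a new-generation multimodal document parsing solution. With only 0.9B parameters, it sets new records on several authoritative document parsing benchmarks and supports document parsing in 109 languages.
# Application scenarios
appScenario=Inference,OCR,Manufacturing,Finance,Transportation,Education,Healthcare
# Framework type
frameType=paddle
# Accelerator type
accelerateType=K100AI,BW1000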
from paddleocr import PaddleOCRVL

# Create the PaddleOCR-VL pipeline on the DCU device
pipeline = PaddleOCRVL(device='DCU')
# pipeline = PaddleOCRVL(use_doc_orientation_classify=True)  # use_doc_orientation_classify: whether to use the document orientation classification model
# pipeline = PaddleOCRVL(use_doc_unwarping=True)  # use_doc_unwarping: whether to use the text image unwarping module
# pipeline = PaddleOCRVL(use_layout_detection=False)  # use_layout_detection: whether to use the layout region detection and ordering module
output = pipeline.predict("./doc/paddleocr_vl_demo.png")
for res in output:
    res.print()  # print the structured prediction output
    res.save_to_json(save_path="output")  # save the structured JSON result for this image
    res.save_to_markdown(save_path="output")  # save the result for this image in Markdown format
from pathlib import Path
from paddleocr import PaddleOCRVL

input_file = "./your_pdf_file.pdf"
output_path = Path("./output")

pipeline = PaddleOCRVL(device='DCU')
# predict() yields one result per PDF page
output = pipeline.predict(input=input_file)

markdown_list = []
markdown_images = []
for res in output:
    md_info = res.markdown
    markdown_list.append(md_info)
    # Images referenced by this page's Markdown, keyed by relative path
    markdown_images.append(md_info.get("markdown_images", {}))

# Merge the per-page Markdown into a single document
markdown_texts = pipeline.concatenate_markdown_pages(markdown_list)

mkd_file_path = output_path / f"{Path(input_file).stem}.md"
mkd_file_path.parent.mkdir(parents=True, exist_ok=True)
with open(mkd_file_path, "w", encoding="utf-8") as f:
    f.write(markdown_texts)

# Save each extracted image relative to the output directory
for item in markdown_images:
    if item:
        for path, image in item.items():
            file_path = output_path / path
            file_path.parent.mkdir(parents=True, exist_ok=True)
            image.save(file_path)