Commit 10f294ff authored by yuguo-Jack

llama_paddle

parent 7c64e6ec
**简体中文**🀄 | [English🌎](.github/CONTRIBUTING_en.md)
# Contributing to PaddleNLP
We welcome and appreciate open-source contributions to `PaddleNLP`. Before you submit a contribution, please sign the [PaddlePaddle Contributor License Agreement](https://cla-assistant.io/PaddlePaddle/PaddleNLP).
The rest of this document walks through our development and contribution workflow:
## Ways to Contribute
We welcome contributions to `PaddleNLP` in many forms, for example:
- Fixing a known issue
- Filing a new issue, such as a feature request or a bug report
- Implementing a new model architecture
If you are not sure where to start, check the `Good First Issue` label in the Issues board. It provides a beginner-friendly list of known issues that lowers the barrier to contributing. Simply comment on the issue you would like to work on to let us know you are taking it.
## Development Workflow
PaddleNLP follows the [Git branching model](http://nvie.com/posts/a-successful-git-branching-model/). For typical open-source contributions, the workflow is as follows:
#### 1. Fork
Because the PaddleNLP developer community keeps growing, it would be hard to manage if every contributor committed directly to the official repo. Please submit Pull Requests from your own fork instead. We recommend creating your fork with GitHub's ["Fork" button](https://help.github.com/articles/fork-a-repo/).
#### 2. Clone
Run the following commands to clone your fork to your local machine:
```bash
git clone https://github.com/<your-github-account>/PaddleNLP
cd PaddleNLP
```
#### 3. Create a local development branch
For day-to-day work such as adding a feature or fixing a bug, create a local development branch before you start coding:
```bash
git checkout -b my-cool-feature
```
#### 4. Set up the development environment
Before you start coding, you need to set up your development environment. We strongly recommend doing all development inside a virtual environment such as [venv](https://docs.python.org/3/library/venv.html) or [conda](https://docs.conda.io/en/latest/).
After creating and activating the virtual environment, run:
```bash
make install
```
This installs all of `PaddleNLP`'s dependencies as well as the [`pre-commit`](http://pre-commit.com/) tool.
If you work on the `examples` or `applications` modules and need to import `PaddleNLP`, make sure to install `PaddleNLP` in editable mode (`-e`).
If `PaddleNLP` is already installed in the virtual environment, remove it with `pip uninstall paddlenlp`, then reinstall it in editable mode:
`pip install -e .`
#### 5. Develop
As you develop, please make sure the code you add is covered by unit tests. All of our unit tests live in the `tests` directory.
You can either extend existing unit tests to cover the new functionality or create new tests from scratch.
When your code is ready, make sure the relevant unit tests pass. You can run the tests affected by your change like this:
```bash
pytest tests/<test_to_run>.py
```
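For example, a brand-new unit test under `tests` can be a single function; the file path and the behavior checked below are hypothetical:
```python
# tests/transformers/test_my_cool_feature.py (hypothetical path and test)
from paddlenlp.transformers import AutoTokenizer


def test_tokenizer_returns_input_ids():
    tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh")
    encoded = tokenizer("自然语言处理")
    # The tokenizer should return a non-empty list of token ids.
    assert len(encoded["input_ids"]) > 0
```
You would then run it with `pytest tests/transformers/test_my_cool_feature.py`.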
#### 6. Commit
我们使用 [`pre-commit`](http://pre-commit.com/)工具(包括[black](https://black.readthedocs.io/en/stable/)[isort](https:/ /pycqa.github.io/isort/) 和
[flake8](https://flake8.pycqa.org/en/latest/))来检查每次提交中的代码和文档的风格。当你运行 `git commit` 时,你会看到
类似于以下内容:
```
➜ (my-virtual-env) git commit -m "committing my cool feature"
black....................................................................Passed
isort....................................................................Passed
flake8...................................................................Passed
check for merge conflicts................................................Passed
check for broken symlinks............................(no files to check)Skipped
detect private key.......................................................Passed
fix end of files.....................................(no files to check)Skipped
trim trailing whitespace.............................(no files to check)Skipped
CRLF end-lines checker...............................(no files to check)Skipped
CRLF end-lines remover...............................(no files to check)Skipped
No-tabs checker......................................(no files to check)Skipped
Tabs remover.........................................(no files to check)Skipped
copyright_checker........................................................Passed
```
Most of the time, though, things are not that smooth. When your code or documentation does not meet the style standards, the `pre-commit` checks will fail.
```
➜ (my-virtual-env) git commit -m "committing my cool feature"
black....................................................................Passed
isort....................................................................Failed
- hook id: isort
- files were modified by this hook
Fixing examples/information_extraction/waybill_ie/run_ernie_crf.py
flake8...................................................................Passed
check for merge conflicts................................................Passed
check for broken symlinks............................(no files to check)Skipped
detect private key.......................................................Passed
fix end of files.....................................(no files to check)Skipped
trim trailing whitespace.............................(no files to check)Skipped
CRLF end-lines checker...............................(no files to check)Skipped
CRLF end-lines remover...............................(no files to check)Skipped
No-tabs checker......................................(no files to check)Skipped
Tabs remover.........................................(no files to check)Skipped
copyright_checker........................................................Passed
```
Our tooling fixes most style errors automatically, but some have to be resolved by hand. Fortunately, the error messages are usually plain and easy to act on.
After resolving the errors, run `git add <files>` and `git commit` again; this triggers `pre-commit` once more.
Once the `pre-commit` checks pass, you are ready to push your code.
[Google](http://google.com/) and [StackOverflow](https://stackoverflow.com/) are great resources for understanding style errors.
If you still cannot figure it out, don't worry. You can commit with `git commit -m "style error" --no-verify`, and we will be happy to help after you create a Pull Request.
#### 7. git pull and merge conflicts
Experienced Git users frequently pull from the official repo. This way they notice conflicts with other people's changes early, while the conflicts are still easy to resolve:
```bash
git remote add upstream https://github.com/PaddlePaddle/PaddleNLP
git pull upstream develop
```
#### 8. git push and opening a Pull Request
You can push the work on your local development branch to your fork:
```bash
git push origin my-cool-feature
```
After pushing, you can open a Pull Request asking the [official repo](https://github.com/PaddlePaddle/PaddleNLP) to accept your work. Please follow [these steps](https://help.github.com/articles/creating-a-pull-request/) to create the Pull Request.
#### 9. Delete merged local and remote branches
To keep your local workspace and your fork tidy, delete the leftover branches after your Pull Request is merged:
```bash
git push origin :my-cool-feature
git checkout develop
git pull upstream develop
git branch -d my-cool-feature
```
## Code Review
- Once your Pull Request passes local tests and CI, you can @ the relevant reviewers in the Pull Request to remind them to review it promptly.
- Please address every reviewer comment. If you have made the requested change, reply "Done"; otherwise, start a discussion under the comment.
- If you don't want your reviewers to be overwhelmed by email notifications, you can [reply in batch](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/).
Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# Makefile for PaddleNLP
#
# GitHub: https://github.com/PaddlePaddle/PaddleNLP
# Author: Paddle Team https://github.com/PaddlePaddle
#
.PHONY: all
all : lint test

check_dirs := applications examples model_zoo paddlenlp pipelines ppdiffusers scripts tests

# # # # # # # # # # # # # # # Format Block # # # # # # # # # # # # # # #
format:
	pre-commit run isort
	pre-commit run black
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

# # # # # # # # # # # # # # # Lint Block # # # # # # # # # # # # # # #
.PHONY: lint
lint:
	$(eval modified_py_files := $(shell python scripts/get_modified_files.py $(check_dirs)))
	@if test -n "$(modified_py_files)"; then \
		echo ${modified_py_files}; \
		pre-commit run --files ${modified_py_files}; \
	else \
		echo "No library .py files were modified"; \
	fi
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

# # # # # # # # # # # # # # # Test Block # # # # # # # # # # # # # # #
.PHONY: test
test: unit-test

unit-test:
	PYTHONPATH=$(shell pwd) pytest -v \
		-n auto \
		--durations 20 \
		--cov paddlenlp \
		--cov-report xml:coverage.xml
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

.PHONY: install
install:
	pip install -r requirements-dev.txt
	pip install -r requirements.txt
	pip install -r paddlenlp/experimental/autonlp/requirements.txt
	pre-commit install

.PHONY: deploy-ppdiffusers
deploy-ppdiffusers:
	cd ppdiffusers && make install && make

.PHONY: deploy-paddle-pipelines
deploy-paddle-pipelines:
	cd pipelines && make install && make

.PHONY: deploy-paddlenlp
deploy-paddlenlp:
	# install related package
	make install
	# build
	python3 setup.py sdist bdist_wheel
	# upload
	twine upload --skip-existing dist/*

.PHONY: release
release:
	bash ./scripts/regression/run_release.sh 0 0,1 all

.PHONY: key
key:
	bash ./scripts/regression/run_release.sh 0 0,1 p0
# LLaMA_paddle
llama-13b pretrain example for paddle
## Paper
`LLaMA: Open and Efficient Foundation Language Models`
- [https://arxiv.org/abs/2302.13971](https://arxiv.org/abs/2302.13971)
## Model Architecture
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters, trained on trillions of tokens. It shows that state-of-the-art models can be trained exclusively on publicly available datasets, without resorting to proprietary or inaccessible data. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. The LLaMA network is based on the Transformer architecture and incorporates various improvements that were proposed and used in different models, such as PaLM.
<img src="http://developer.hpccube.com/codes/modelzoo/llama_fastchat_pytorch/-/raw/main/llama%E6%A8%A1%E5%9E%8B%E7%BB%93%E6%9E%84.png" alt="llama模型结构.png" style="zoom:50%;" />
The main network configuration of llama-13B is as follows:
```json
{
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 13824,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 40,
"num_hidden_layers": 40,
"pad_token_id": 0,
"paddlenlp_version": null,
"rms_norm_eps": 1e-06,
"use_recompute": false,
"vocab_size": 32000
}
```
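As a sanity check, the configuration above roughly pins down the parameter count. Here is a back-of-envelope sketch in Python; the layer breakdown is the standard LLaMA layout, and small terms such as the norm weights are ignored:
```python
h, ffn, layers, vocab = 5120, 13824, 40, 32000  # from the config above

embed = vocab * h   # input token embeddings
attn = 4 * h * h    # Q, K, V and output projections, per layer
mlp = 3 * h * ffn   # gate, up and down projections (SwiGLU), per layer
head = vocab * h    # output projection back to the vocabulary

total = embed + layers * (attn + mlp) + head
print(f"{total / 1e9:.1f}B parameters")  # -> 13.0B
```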
## Algorithm
<img src="http://developer.hpccube.com/codes/modelzoo/llama_fastchat_pytorch/-/raw/main/llama%E7%AE%97%E6%B3%95%E5%8E%9F%E7%90%86.png" alt="llama算法原理.png" style="zoom:50%;" />
The main differences from the original Transformer architecture are the following (a short numpy sketch follows the list):
**Pre-normalization.** To improve training stability, the input of each Transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.
**SwiGLU activation [PaLM].** The ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, with a hidden dimension of 2/3 · 4d instead of the 4d used in PaLM.
**Rotary embeddings.** Absolute positional embeddings are removed; rotary position embeddings (RoPE) are added at every layer of the network instead.
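Concretely, the three components can be written out in a few lines of numpy. This is an illustrative sketch of the math only, not the PaddleNLP implementation:
```python
import numpy as np


def rms_norm(x, weight, eps=1e-6):
    # Pre-normalization: scale by the root-mean-square of the last axis
    # (no mean subtraction and no bias, unlike LayerNorm).
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * weight


def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: silu(x @ W_gate) gates (x @ W_up), then project back down.
    def silu(t):
        return t / (1.0 + np.exp(-t))

    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


def rope(x, base=10000.0):
    # Rotary position embedding: rotate each channel pair by a position-dependent
    # angle. x has shape (seq_len, head_dim); head_dim is assumed even.
    seq_len, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    angle = np.outer(np.arange(seq_len), inv_freq)
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```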
## Dataset
The full data preparation pipeline is documented [here](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/README.md); for example, preparing the OpenWebText2 pretraining data is described [here](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md).
To make it easy to run and test this model, this project provides a preprocessed training sample of 100k docs:
```bash
cd ./llm/llama/
mkdir data && cd data
wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_ids.npy
wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_idx.npz
cd .. && tree data
```
```
data
├── llama_openwebtext_100k_ids.npy
└── llama_openwebtext_100k_idx.npz
```
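To sanity-check the download, the two files can be inspected with numpy. This is a quick sketch; the exact array layout is an assumption based on the file names:
```python
import numpy as np

# Run from ./llm/llama/ after the downloads above.
ids = np.load("data/llama_openwebtext_100k_ids.npy", mmap_mode="r")  # concatenated token ids (assumed)
idx = np.load("data/llama_openwebtext_100k_idx.npz")  # per-document index arrays (assumed)
print(ids.dtype, ids.shape)
print(idx.files)
```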
## Environment Setup
### Docker
We recommend running in Docker using the image below. The newer DTK version and other components required by this project can be downloaded from the [HPC developer community](https://developer.hpccube.com/tool/). The Docker image uses dtk-23.04.1 by default:
```
docker pull registry.baidubce.com/device/paddle-dcu:dtk23.04.1-centos79-x86_64-gcc73
docker run -it --network=host --name=paddle_llama --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 -v `pwd`:/home registry.baidubce.com/device/paddle-dcu:dtk23.04.1-centos79-x86_64-gcc73 /bin/bash
# switch to DTK-23.10
pip install paddlenlp==2.6.1 -i http://mirrors.aliyun.com/pypi/simple/
wget http://10.6.10.68:8000/customized/paddle/llama/paddlepaddle_dtk2310-2.5.1-cp39-cp39-linux_x86_64.whl
pip3 install paddlepaddle_dtk2310-2.5.1-cp39-cp39-linux_x86_64.whl
pip3 install tool_helpers visualdl==2.5.3 -i http://mirrors.aliyun.com/pypi/simple/
```
## Training
Weight links:
13B: [https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-13b](https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-13b)
7B: [https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-7b](https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-7b)
The training script requires 1 node with 8x DCU-Z100L-32G cards per node.
The parallel configuration is TP 8, PP 1, fine-tuning in fp16 precision with the following settings:
```
--max_seq_length 2048 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--per_device_eval_batch_size 2 \
--use_flash_attention 0 \
--use_fused_rms_norm 0 \
--fp16 \
--fp16_opt_level "O2" \
--scale_loss 512 \
--tensor_parallel_degree 8 \
--learning_rate 0.00001 \
--min_learning_rate 0.000001 \
--max_steps 10000 \
--save_steps 5000 \
--weight_decay 0.01 \
--warmup_ratio 0.01 \
--max_grad_norm 1.0 \
--logging_steps 10 \
--dataloader_num_workers 1 \
--eval_steps 1000 \
--report_to "visualdl" \
--sharding "stage1" \
--disable_tqdm true \
--continue_training 1 \
--recompute 1 \
--recompute_granularity full \
--do_train \
--do_eval \
--device "gpu" \
--distributed_dataloader 1
```
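For reference, with `tensor_parallel_degree 8` on 8 cards there is only one data-parallel group, so the effective global batch size implied by these flags works out as follows (a quick sketch of the arithmetic):
```python
cards, tp, pp = 8, 8, 1
dp = cards // (tp * pp)  # data-parallel ways: 1
global_batch = 1 * 2 * dp  # per_device_train_batch_size * gradient_accumulation_steps * dp
print(dp, global_batch)  # 1 2
```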
Fine-tuning command:
```
cd ./llm/llama/
bash run_trainer_tp8.sh
```
## Results
### Accuracy
Training data: [https://bj.bcebos.com/paddlenlp/models/transformers/llama/data](https://bj.bcebos.com/paddlenlp/models/transformers/llama/data)
GPGPUs used: 8x DCU-Z100L-32G.
Model accuracy (max_sequence_length: 2048):
| Cards | Distributed framework | Convergence |
| :------: | :------: |:------: |
| 8 | Paddle | |
### input
```plaintext
>>>冬天,中国哪座城市最适合避寒?问题描述:能推荐一些国内适合冬天避寒的城市吗?回答用户:旅游爱好者
```
### output
```plaintext
>>>回答:避寒,当然是去海南呀!海南的冬天,阳光明媚,温度适宜,而且空气清新,没有雾霾,没有沙尘暴,没有雾霾,没有雾霾!
```
## Benchmark
### Training benchmark
The dataset is [tatsu-lab/alpaca · Datasets at Hugging Face](https://huggingface.co/datasets/tatsu-lab/alpaca); place it under ./examples/benchmark/peft/paddle:
```
$tree tatsu-lab
tatsu-lab/
└── alpaca
└── data
└── train-00000-of-00001-a09b74b3ef9c3b56.parquet
```
Training benchmark command:
```
cd ./examples/benchmark/peft/paddle
RCCL_NCHANNELS=8 HSA_FORCE_FINE_GRAIN_PCIE=1 python3 -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" benchmark.py --model_name_or_path facebook/llama-13b --english --train_data_size 1000 --intokens --intokens_length 1024 --num_train_epochs 1 --per_device_train_batch_size 2 --gradient_accumulation_steps 2 --evaluation_strategy no --save_strategy no --fp16 --fp16_opt_level O2 --recompute --tensor_parallel_degree 8 --logging_steps 50 --output_dir outputs
```
### Inference benchmark
```
cd ./examples/benchmark/peft/paddle
python3 inference_benchmark.py --model_name_or_path facebook/llama-13b --dtype float16 --do_forward --do_generate
```
### LAMBADA Evaluation
```
cd ./examples/benchmark/lambada
wget https://paddlenlp.bj.bcebos.com/data/benchmark/lambada_test.jsonl
```
To evaluate on the LAMBADA dataset, run the following script:
```
python3 eval.py \
--model_name_or_path facebook/llama-13b \
--batch_size 4 \
--eval_path lambada_test.jsonl \
--tensor_parallel_degree 1 \
--cloze_eval
```
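LAMBADA scores a model on predicting the final word of each passage. A minimal sketch of that metric, assuming a hypothetical `predict_last_word` callable that wraps the model (`eval.py` handles the actual tokenization details):
```python
import json


def lambada_accuracy(jsonl_path, predict_last_word):
    # Each line of lambada_test.jsonl holds {"text": "..."}; the target is the final word.
    correct = total = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            text = json.loads(line)["text"]
            context, target = text.rsplit(" ", 1)
            correct += int(predict_last_word(context) == target)
            total += 1
    return correct / total
```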
## Application Scenarios
### Algorithm Category
`Natural Language Processing`
### Key Application Industries
`Healthcare, education, research, finance`
## Source Repository and Issue Feedback
- [https://developer.hpccube.com/codes/modelzoo/llama_paddle](https://developer.hpccube.com/codes/modelzoo/llama_paddle)
## References
* https://huggingface.co/decapoda-research/llama-13b-hf
* [https://github.com/PaddlePaddle/PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)
**简体中文**🀄 | [English🌎](./README_en.md)
<p align="center">
<img src="https://user-images.githubusercontent.com/1371212/175816733-8ec25eb0-9af3-4380-9218-27c154518258.png" align="middle" width="500" />
</p>
------------------------------------------------------------------------------------------
<p align="center">
<a href="./LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-dfd.svg"></a>
<a href="https://github.com/PaddlePaddle/PaddleNLP/releases"><img src="https://img.shields.io/github/v/release/PaddlePaddle/PaddleNLP?color=ffa"></a>
<a href=""><img src="https://img.shields.io/badge/python-3.7+-aff.svg"></a>
<a href=""><img src="https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-pink.svg"></a>
<a href="https://github.com/PaddlePaddle/PaddleNLP/graphs/contributors"><img src="https://img.shields.io/github/contributors/PaddlePaddle/PaddleNLP?color=9ea"></a>
<a href="https://github.com/PaddlePaddle/PaddleNLP/commits"><img src="https://img.shields.io/github/commit-activity/m/PaddlePaddle/PaddleNLP?color=3af"></a>
<a href="https://pypi.org/project/paddlenlp/"><img src="https://img.shields.io/pypi/dm/paddlenlp?color=9cf"></a>
<a href="https://github.com/PaddlePaddle/PaddleNLP/issues"><img src="https://img.shields.io/github/issues/PaddlePaddle/PaddleNLP?color=9cc"></a>
<a href="https://github.com/PaddlePaddle/PaddleNLP/stargazers"><img src="https://img.shields.io/github/stars/PaddlePaddle/PaddleNLP?color=ccf"></a>
</p>
<h4 align="center">
<a href=#installation> Installation </a> |
<a href=#quick-start> Quick Start </a> |
<a href=#features> Features </a> |
<a href=#community> Community </a>
</h4>
**PaddleNLP** is an **easy-to-use** and **powerful** development library for natural language processing and large language models (LLMs). It aggregates **high-quality pretrained models** from across the industry and offers an **out-of-the-box** experience; a model zoo covering a wide range of NLP scenarios, together with **industrial practice examples**, meets developers' needs for **flexible customization**.
## News 📢
* **2023.8.15 [PaddleNLP v2.6](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.6.0)**: Released the [full-workflow LLM toolchain](./llm) covering pretraining, fine-tuning, compression, inference, and deployment, providing an end-to-end LLM solution and a one-stop development experience; built in a [4D-parallel distributed Trainer](./docs/trainer.md), [efficient fine-tuning algorithms LoRA/Prefix Tuning](./llm#33-lora), and [in-house INT8/INT4 quantization](./llm#6-量化); full support for mainstream LLMs such as [LLaMA 1/2](./llm/llama), [BLOOM](./llm/bloom), [ChatGLM 1/2](./llm/chatglm), [GLM](./llm/glm), and [OPT](./llm/opt).
## Installation
### Requirements
- python >= 3.7
- paddlepaddle >= 2.5.1
- For LLM features, please use paddlepaddle-gpu >= 2.5.1
### Installing with pip
```shell
pip install --upgrade paddlenlp
```
Alternatively, install the latest code from the develop branch:
```shell
pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html
```
For more detailed tutorials on installing PaddlePaddle and PaddleNLP, see [Installation](./docs/get_started/installation.rst).
## Quick Start
### LLM Text Generation
PaddleNLP provides an easy-to-use Auto API that quickly loads models and tokenizers. Here is an example of text generation with the `linly-ai/chinese-llama-2-7b` LLM:
```python
>>> from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("linly-ai/chinese-llama-2-7b")
>>> model = AutoModelForCausalLM.from_pretrained("linly-ai/chinese-llama-2-7b", dtype="float16")
>>> input_features = tokenizer("你好!请自我介绍一下。", return_tensors="pd")
>>> outputs = model.generate(**input_features, max_length=128)
>>> tokenizer.batch_decode(outputs[0])
['\n你好!我是一个AI语言模型,可以回答你的问题和提供帮助。']
```
### One-Line UIE Prediction
PaddleNLP provides [one-line prediction](./docs/model_zoo/taskflow.md): no training needed, just feed in data to get open-domain extraction results. Here is an example of the UIE model on named entity recognition for information extraction; a schema-switching sketch follows the example:
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction
>>> ie = Taskflow('information_extraction', schema=schema)
>>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!"))
[{'时间': [{'end': 6,
'probability': 0.9857378532924486,
'start': 0,
'text': '2月8日上午'}],
'赛事名称': [{'end': 23,
'probability': 0.8503089953268272,
'start': 6,
'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
'选手': [{'end': 31,
'probability': 0.8981548639781138,
'start': 28,
'text': '谷爱凌'}]}]
```
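The schema can also be swapped at runtime for other extraction tasks, such as relation extraction with a nested schema. A hedged sketch, assuming the `set_schema` helper available on UIE Taskflow instances (output omitted):
```python
>>> # Relation extraction: a subject entity and the relations to extract for it
>>> ie.set_schema({'竞赛名称': ['主办方', '承办方', '已举办次数']})
>>> results = ie('2022语言与智能技术竞赛由中国中文信息学会和中国计算机学会联合主办,已连续举办4届。')
```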
More PaddleNLP resources:
- [Full-workflow LLM toolchain](./llm) with end-to-end solutions for mainstream Chinese LLMs.
- [Curated model zoo](./model_zoo) with end-to-end workflows for high-quality pretrained models.
- [Multi-scenario examples](./examples) on solving a variety of NLP problems with PaddleNLP, covering basic techniques, system applications, and extended applications.
- [Interactive tutorials](https://aistudio.baidu.com/aistudio/personalcenter/thirdview/574995) to learn PaddleNLP quickly on AI Studio, the 🆓 free-compute platform.
## Features
#### <a href=#out-of-the-box-nlp-toolkit> 📦 Out-of-the-Box NLP Toolkit </a>
#### <a href=#comprehensive-chinese-model-zoo> 🤗 Comprehensive Chinese Model Zoo </a>
#### <a href=#industrial-end-to-end-system-examples> 🎛️ Industrial End-to-End System Examples </a>
#### <a href=#high-performance-distributed-training-and-inference> 🚀 High-Performance Distributed Training and Inference </a>
### Out-of-the-Box NLP Toolkit
Taskflow offers a rich set of **📦 out-of-the-box** industrial-grade NLP models covering both natural language understanding and generation, with **💪 industrial-grade quality** and **⚡️ extreme inference performance**.
![taskflow1](https://user-images.githubusercontent.com/11793384/159693816-fda35221-9751-43bb-b05c-7fc77571dd76.gif)
See the [Taskflow documentation](./docs/model_zoo/taskflow.md) for more usage.
### Comprehensive Chinese Model Zoo
#### 🀄 The Industry's Most Complete Set of Chinese Pretrained Models
A curated collection of 45+ network architectures and 500+ pretrained checkpoints covering the industry's most complete set of Chinese pretrained models: Wenxin NLP large models such as ERNIE and PLATO as well as mainstream architectures such as BERT, GPT, RoBERTa, and T5. ⚡**Download them at high speed**⚡ with one line of the `AutoModel` API.
```python
from paddlenlp.transformers import *
ernie = AutoModel.from_pretrained('ernie-3.0-medium-zh')
bert = AutoModel.from_pretrained('bert-wwm-chinese')
albert = AutoModel.from_pretrained('albert-chinese-tiny')
roberta = AutoModel.from_pretrained('roberta-wwm-ext')
electra = AutoModel.from_pretrained('chinese-electra-small')
gpt = AutoModelForPretraining.from_pretrained('gpt-cpm-large-cn')
```
To address the compute bottleneck of pretrained models, the full ERNIE-Tiny series of lightweight models can be loaded through the same one-line API, lowering the difficulty of deploying pretrained models.
```python
# 6L768H
ernie = AutoModel.from_pretrained('ernie-3.0-medium-zh')
# 6L384H
ernie = AutoModel.from_pretrained('ernie-3.0-mini-zh')
# 4L384H
ernie = AutoModel.from_pretrained('ernie-3.0-micro-zh')
# 4L312H
ernie = AutoModel.from_pretrained('ernie-3.0-nano-zh')
```
A unified API experience is provided for the common pretrained-model paradigms, such as semantic representation, text classification, sentence-pair matching, sequence labeling, and question answering.
```python
import paddle
from paddlenlp.transformers import *
tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
text = tokenizer('自然语言处理')
# Semantic representation
model = AutoModel.from_pretrained('ernie-3.0-medium-zh')
sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
# Text classification & sentence-pair matching
model = AutoModelForSequenceClassification.from_pretrained('ernie-3.0-medium-zh')
# Sequence labeling
model = AutoModelForTokenClassification.from_pretrained('ernie-3.0-medium-zh')
# Question answering
model = AutoModelForQuestionAnswering.from_pretrained('ernie-3.0-medium-zh')
```
#### 💯 Application Examples Covering All Scenarios
NLP application examples spanning academia to industry, covering NLP fundamentals, NLP system applications, and extended applications, all developed on the new 2.0 API system of the PaddlePaddle core framework, providing best practices for text-domain development with PaddlePaddle.
Curated pretrained-model examples live in the [Model Zoo](./model_zoo), and more scenario examples are documented in the [examples directory](./examples). [Interactive Notebook tutorials](https://aistudio.baidu.com/aistudio/personalcenter/thirdview/574995) on [AI Studio](https://aistudio.baidu.com), a platform with free compute, offer hands-on practice.
<details><summary> Tasks supported by PaddleNLP pretrained models (<b>click to expand</b>)</summary><div>
| Model | Sequence Classification | Token Classification | Question Answering | Text Generation | Multiple Choice |
| :----------------- | ----------------------- | -------------------- | ------------------ | --------------- | --------------- |
| ALBERT | ✅ | ✅ | ✅ | ❌ | ✅ |
| BART | ✅ | ✅ | ✅ | ✅ | ❌ |
| BERT | ✅ | ✅ | ✅ | ❌ | ✅ |
| BigBird | ✅ | ✅ | ✅ | ❌ | ✅ |
| BlenderBot | ❌ | ❌ | ❌ | ✅ | ❌ |
| ChineseBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
| ConvBERT | ✅ | ✅ | ✅ | ❌ | ✅ |
| CTRL | ✅ | ❌ | ❌ | ❌ | ❌ |
| DistilBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
| ELECTRA | ✅ | ✅ | ✅ | ❌ | ✅ |
| ERNIE | ✅ | ✅ | ✅ | ❌ | ✅ |
| ERNIE-CTM | ❌ | ✅ | ❌ | ❌ | ❌ |
| ERNIE-Doc | ✅ | ✅ | ✅ | ❌ | ❌ |
| ERNIE-GEN | ❌ | ❌ | ❌ | ✅ | ❌ |
| ERNIE-Gram | ✅ | ✅ | ✅ | ❌ | ❌ |
| ERNIE-M | ✅ | ✅ | ✅ | ❌ | ❌ |
| FNet | ✅ | ✅ | ✅ | ❌ | ✅ |
| Funnel-Transformer | ✅ | ✅ | ✅ | ❌ | ❌ |
| GPT | ✅ | ✅ | ❌ | ✅ | ❌ |
| LayoutLM | ✅ | ✅ | ❌ | ❌ | ❌ |
| LayoutLMv2 | ❌ | ✅ | ❌ | ❌ | ❌ |
| LayoutXLM | ❌ | ✅ | ❌ | ❌ | ❌ |
| LUKE | ❌ | ✅ | ✅ | ❌ | ❌ |
| mBART | ✅ | ❌ | ✅ | ❌ | ✅ |
| MegatronBERT | ✅ | ✅ | ✅ | ❌ | ✅ |
| MobileBERT | ✅ | ❌ | ✅ | ❌ | ❌ |
| MPNet | ✅ | ✅ | ✅ | ❌ | ✅ |
| NEZHA | ✅ | ✅ | ✅ | ❌ | ✅ |
| PP-MiniLM | ✅ | ❌ | ❌ | ❌ | ❌ |
| ProphetNet | ❌ | ❌ | ❌ | ✅ | ❌ |
| Reformer | ✅ | ❌ | ✅ | ❌ | ❌ |
| RemBERT | ✅ | ✅ | ✅ | ❌ | ✅ |
| RoBERTa | ✅ | ✅ | ✅ | ❌ | ✅ |
| RoFormer | ✅ | ✅ | ✅ | ❌ | ❌ |
| SKEP | ✅ | ✅ | ❌ | ❌ | ❌ |
| SqueezeBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
| T5 | ❌ | ❌ | ❌ | ✅ | ❌ |
| TinyBERT | ✅ | ❌ | ❌ | ❌ | ❌ |
| UnifiedTransformer | ❌ | ❌ | ❌ | ✅ | ❌ |
| XLNet | ✅ | ✅ | ✅ | ❌ | ✅ |
</div></details>
See the [Transformer documentation](/docs/model_zoo/index.rst) for the currently supported pretrained model architectures, checkpoints, and detailed usage.
### Industrial End-to-End System Examples
For high-frequency NLP scenarios such as information extraction, semantic retrieval, question answering, and sentiment analysis, PaddleNLP provides end-to-end system examples that connect the full *data annotation* - *model training* - *model tuning* - *inference deployment* pipeline, continuously lowering the barrier to industrial adoption of NLP. See [Applications](./applications) for detailed usage of these system-level industrial examples.
#### 🔍 Semantic Retrieval System
A state-of-the-art semantic retrieval solution for unsupervised and supervised data scenarios alike, combining SimCSE, In-batch Negatives, the ERNIE-Gram single-tower model, and more. It covers both recall and ranking stages and connects the full pipeline of training, tuning, and building/querying an efficient vector search index.
<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/168514909-8817d79a-72c4-4be1-8080-93d1f682bb46.gif" width="400">
</div>
See [Semantic Retrieval System](./applications/neural_search) for more usage.
#### ❓ Intelligent Question Answering System
A retrieval-based QA system built on [🚀RocketQA](https://github.com/PaddlePaddle/RocketQA), supporting FAQ QA, product-manual QA, and other business scenarios.
<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/168514868-1babe981-c675-4f89-9168-dd0a3eede315.gif" width="400">
</div>
See [Intelligent Question Answering System](./applications/question_answering) and [Document Intelligence QA](./applications/document_intelligence/doc_vqa) for more usage.
#### 💌 Opinion Extraction and Sentiment Analysis
Based on the sentiment-knowledge-enhanced pretrained model SKEP, extracts aspects and opinions from product reviews and performs fine-grained sentiment analysis.
<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/168407260-b7f92800-861c-4207-98f3-2291e0102bbe.png" width="400">
</div>
See [Sentiment Analysis](./applications/sentiment_analysis) for more usage.
#### 🎙️ Intelligent Voice Command Parsing
Integrates speech recognition from [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) and the [Baidu AI Open Platform](https://ai.baidu.com/) with [UIE](./model_zoo/uie) universal information extraction to build an integrated voice-command parsing system example. The solution applies to scenarios such as voice form filling, voice interaction, and voice retrieval, improving human-computer interaction efficiency.
<div align="center">
<img src="https://user-images.githubusercontent.com/16698950/168589100-a6c6f346-97bb-47b2-ac26-8d50e71fddc5.png" width="400">
</div>
See [Intelligent Voice Command Parsing](./applications/speech_cmd_analysis) for more usage.
### High-Performance Distributed Training and Inference
#### ⚡ FastTokenizer: High-Performance Text Processing Library
<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/168407921-b4395b1d-44bd-41a0-8c58-923ba2b703ef.png" width="400">
</div>
```python
AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True)
```
For the most extreme deployment performance, install FastTokenizer and simply enable the `use_fast=True` option of the `AutoTokenizer` API (as above) to invoke the high-performance C++ tokenization operator, easily obtaining text-processing speedups of over 100x compared to Python. See the [FastTokenizer documentation](./fast_tokenizer) for more usage.
#### ⚡️ FastGeneration: High-Performance Generation Acceleration Library
<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/168407831-914dced0-3a5a-40b8-8a65-ec82bf13e53c.gif" width="400">
</div>
```python
model = GPTLMHeadModel.from_pretrained('gpt-cpm-large-cn')
...
outputs, _ = model.generate(
input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search',
use_fast=True)
```
Simply enable the `use_fast=True` option of the `generate()` API to get 5x+ GPU speedups on generative pretrained models such as Transformer, GPT, BART, PLATO, and UniLM. See the [FastGeneration documentation](./fast_generation) for more usage.
#### 🚀 Fleet: PaddlePaddle's 4D Hybrid-Parallel Distributed Training Technology
<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/168515134-513f13e0-9902-40ef-98fa-528271dcccda.png" width="300">
</div>
See [GPT-3](./examples/language_model/gpt-3) for more on distributed training of AI models at the hundred-billion-parameter scale.
## Community
- Scan the QR code with WeChat and fill in the questionnaire, then reply to the assistant with the keyword (NLP) to join the community and claim the benefits below:
- In-depth discussion with many community developers and the official team.
- A 10G NLP learning gift pack!
<div align="center">
<img src="https://user-images.githubusercontent.com/11987277/245085922-0aa68d24-00ff-442e-9c53-2f1e898151ce.png" width="150" height="150" />
</div>
## Citation
If PaddleNLP helps your research, please cite it:
```
@misc{=paddlenlp,
title={PaddleNLP: An Easy-to-use and High Performance NLP Library},
author={PaddleNLP Contributors},
howpublished = {\url{https://github.com/PaddlePaddle/PaddleNLP}},
year={2021}
}
```
## Acknowledge
We have borrowed from Hugging Face's [Transformers](https://github.com/huggingface/transformers)🤗 excellent design on pretrained model usage, and we would like to express our gratitude to the authors of Hugging Face and their open-source community.
## License
PaddleNLP is released under the [Apache-2.0 license](./LICENSE).
# Industrial End-to-End System Examples
## 1. Introduction
Starting from its pretrained model zoo, PaddleNLP provides rich [application examples](../examples) of classic pretrained models on mainstream NLP tasks, meeting many developers' needs for learning, research, and basic applications.
For broader industrial deployment needs and more complex NLP scenario tasks, PaddleNLP offers an **industrial end-to-end system example library** (hereafter "industrial examples"), providing industrial solutions above the level of a single model.
- Best models and practices: for each concrete business scenario, industrial examples provide the best model (or model combination), balancing accuracy against performance and lowering developers' model-selection cost;
- Full workflow: they connect the entire data annotation - model training - model tuning - model compression - inference deployment pipeline, helping developers reach production at lower cost.
## 2. Building Industrial Examples on Pipelines to Accelerate Adoption
While building a series of industrial solutions for different scenario tasks, two observations stand out from a technical-infrastructure perspective:
(1) every NLP system can be abstracted as a pipeline assembled from multiple basic components;
(2) multiple NLP pipeline systems can share the same basic components.
PaddleNLP therefore incubated [Pipelines](../pipelines), an NLP pipeline system that abstracts the common modules of complex NLP systems into standard components. Developers combine standard components through configuration files and can build a customized intelligent system in just a few minutes, making NLP systems as convenient, flexible, and efficient to assemble as building blocks. Pipelines also ships with state-of-the-art pretrained models and algorithms, providing multiple guarantees of development efficiency, model quality, and performance, so it greatly speeds up real-world adoption of PaddlePaddle.
<div>
<img src="https://user-images.githubusercontent.com/11793384/212836991-d9132e46-b5bf-4389-80e1-4f9dee32f1fe.png" width="90%" length="90%">
</div>
<br>
**PaddleNLP provides several editions of each industrial example:**
- If you want to quickly try things out, apply them directly, or build a complete system from scratch, use the **Pipelines edition**. It integrates trained models so you don't need to worry about training details, provides a Docker environment for quick one-click end-to-end deployment, and includes a front-end demo UI for intuitively viewing, analyzing, and debugging results.
- If you want to do secondary development with your own business data, use the **customizable edition** under the `./applications` directory; the trained models can be integrated directly into Pipelines.
- You can also try the online Jupyter Notebooks on [AI Studio](https://aistudio.baidu.com/aistudio/index), with free GPU compute.
| Scenario | Pipelines edition | Customizable edition | Notebook |
| :--------------- | ------- | ------- | ------- |
| **Retrieval** | [Keyword + semantic search](../pipelines/examples/semantic-search) | [Semantic search](./neural_search) | [Build a retrieval system with Pipelines](https://aistudio.baidu.com/aistudio/projectdetail/4442670)<br>[Customize semantic search](https://aistudio.baidu.com/aistudio/projectdetail/3351784) |
| **QA** | [FAQ QA](../pipelines/examples/FAQ/)<br>[Unsupervised retrieval QA](../pipelines/examples/unsupervised-question-answering)<br>[Supervised retrieval QA](../pipelines/examples/question-answering) | [FAQ QA](./question_answering/supervised_qa)<br>[Unsupervised retrieval QA](./question_answering/unsupervised_qa) | [Build an FAQ QA system with Pipelines](https://aistudio.baidu.com/aistudio/projectdetail/4465498)<br>[Build an extractive QA system with Pipelines](https://aistudio.baidu.com/aistudio/projectdetail/4442857)<br>[FAQ government-affairs QA](https://aistudio.baidu.com/aistudio/projectdetail/3678873)<br>[FAQ insurance QA](https://aistudio.baidu.com/aistudio/projectdetail/3882519) |
| **Text classification** | N/A | [Text classification](./text_classification) | [Dialogue intent recognition](https://aistudio.baidu.com/aistudio/projectdetail/2017202)<br>[Multi-label classification of legal text](https://aistudio.baidu.com/aistudio/projectdetail/3996601)<br>[Hierarchical classification](https://aistudio.baidu.com/aistudio/projectdetail/4568985) |
| **Universal text classification** | N/A | [Universal text classification](./zero_shot_text_classification) | |
| **Universal information extraction** | N/A | [Universal information extraction](./information_extraction) | [UIE quick start](https://aistudio.baidu.com/aistudio/projectdetail/3914778)<br>[UIE fine-tuning for entity extraction](https://aistudio.baidu.com/aistudio/projectdetail/4038499)<br>[UIE fine-tuning for relation extraction](https://aistudio.baidu.com/aistudio/projectdetail/4371345)<br>[UIE-X quick start](https://aistudio.baidu.com/aistudio/projectdetail/5017442)<br>[UIE-X fine-tuning](https://aistudio.baidu.com/aistudio/projectdetail/5261592) |
| **Sentiment analysis** | [Sentiment analysis](../pipelines/examples/sentiment_analysis) | [Sentiment analysis](./sentiment_analysis) | [Sentiment analysis](https://aistudio.baidu.com/aistudio/projectdetail/5318177)|
| **Document intelligence** | [Document extractive QA](../pipelines/examples/document-intelligence) | [Cross-modal document QA](./document_intelligence/doc_vqa)| [Document extractive QA](https://aistudio.baidu.com/aistudio/projectdetail/4881278)<br>[Car manual QA](https://aistudio.baidu.com/aistudio/projectdetail/4049663) |
| **Text-to-image** | [Text-to-image system](../pipelines/examples/text_to_image) | See [PPDiffusers](../ppdiffusers) | |
| **Voice command parsing** | N/A | [Voice command parsing](./speech_cmd_analysis) | [Voice command parsing](https://aistudio.baidu.com/aistudio/projectdetail/4399703) |
| **Text summarization** | N/A | [Text summarization](./text_summarization) | [Text summarization](https://aistudio.baidu.com/aistudio/projectdetail/4903667) |
## 3. Highlighted Examples
#### 📄 Universal Information Extraction System
- UIE, the first industrial-grade universal information extraction solution: for plain text, it unifies multiple tasks in a single model and provides strong zero-shot extraction and fast few-shot transfer;
- UIE-X, the first open-domain, multilingual information extraction solution covering both text and documents: built on the [ERNIE-Layout](../model_zoo/ernie-layout) cross-modal layout-enhanced pretrained model and integrating the PP-OCR and PP-Structure layout-analysis capabilities of [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR), with leading few-shot document information extraction quality.
<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/213365046-69967745-b4a8-4435-98fb-c34f68cd22e9.png" width="60%" length="60%">
</div>
See [Universal Information Extraction System](./information_extraction) for detailed usage; more: [UIE deep dive](https://mp.weixin.qq.com/s/-hHz8knHIKKqKCBTke7i5A), [UIE-X deep dive](https://zhuanlan.zhihu.com/p/592422623).
#### 🔍 Semantic Retrieval System
- State-of-the-art algorithms: diversified solutions for unsupervised and supervised data scenarios, based on SimCSE, In-batch Negatives, ERNIE Pairwise, RocketQA Pointwise, and more;
- Full workflow: covers recall and ranking stages, integrates mainstream ANN engines while staying compatible with ElasticSearch keyword retrieval, and provides multi-path recall. Connects the full pipeline of training, tuning, and building/querying an efficient vector search index.
<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/213134465-30cae5fd-4cd1-4e5b-a1cb-fa55c72980a7.gif" width="60%" length="60%">
</div>
See [Semantic Retrieval System](./neural_search) for detailed usage.
#### ❓ Intelligent Question Answering System
- End-to-end QA technology [🚀RocketQA](https://github.com/PaddlePaddle/RocketQA), the first Chinese end-to-end QA model, trained from the knowledge-enhanced pretrained model ERNIE and the million-scale human-annotated DuReader dataset, with excellent quality;
- Covers supervised settings (e.g., FAQ QA) and unsupervised settings (automatically generated QA pairs can be used to build a retrieval QA system without supervision), fitting all kinds of business scenarios.
<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/168514868-1babe981-c675-4f89-9168-dd0a3eede315.gif" width="60%" length="60%">
</div>
See [Intelligent Question Answering System](./question_answering) and [Document Intelligence QA](./document_intelligence/doc_vqa) for detailed usage.
#### 📚 Universal Text Classification
- UTC, a universal text classification technology built on the idea of "unified task architecture, shared general capability", achieves strong zero-/few-shot transfer and unifies many tasks as open-domain classification. It supports sentiment analysis, intent recognition, semantic matching, entailment, and any other NLU task that can be cast as classification.
<div align="center">
<img src="https://user-images.githubusercontent.com/11793384/213347595-e9c08bd1-3d32-4519-9a52-31fb69b841e8.png" width="60%" length="60%">
</div>
<br>
See [Universal Text Classification](./zero_shot_text_classification) for detailed usage; more: [article](https://mp.weixin.qq.com/s/VV-nYv4y1r7oipJnURRL5w).
#### 🗂 Text Classification
- Full scenario coverage: open-source solutions based on pretrained-model fine-tuning, prompt learning, and semantic indexing, meeting different scenario needs and covering multi-class, multi-label, and hierarchical classification;
- Efficient model tuning: combines data augmentation with trustworthiness-enhancement techniques to tackle noisy data, scarce labels, and class imbalance, substantially improving model quality.
<div align="center">
<img src="https://user-images.githubusercontent.com/63761690/186378697-630d3590-4e67-49a0-8d5f-7cabd9daa894.png" width="60%" length="60%">
</div>
<br>
See [Text Classification](./text_classification) for detailed usage; more: [article](https://mp.weixin.qq.com/s/tas7yM8vapxwtlJt-MRZdg).
#### 💌 Opinion Extraction and Sentiment Analysis
- Classic solution: based on the sentiment-knowledge-enhanced pretrained model SKEP, a two-stage extract-then-classify pipeline that first locates aspect and opinion terms via sequence labeling and then runs aspect-level sentiment classification;
- Cutting-edge solution: a UIE-based approach that extracts sentiment information via prompt learning with higher accuracy. It supports sentence-level and aspect-level sentiment analysis, tackles synonymous-aspect aggregation and implicit-opinion extraction, and provides visual analysis capabilities.
<div align="center">
<img src="https://user-images.githubusercontent.com/35913314/200259473-434888f7-c0ac-4253-ab23-ede1628e6ba2.png" width="60%" length="60%">
</div>
<br>
See [Sentiment Analysis](./sentiment_analysis) for detailed usage; more: [article](https://mp.weixin.qq.com/s/QAHjIRG9zxpYfM6YPRQ-9w).
#### 🎙️ Intelligent Voice Command Parsing
- Integrates speech recognition from [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) and the [Baidu AI Open Platform](https://ai.baidu.com/) with [UIE](./model_zoo/uie) universal information extraction to build an integrated voice-command parsing system example. The solution applies to scenarios such as voice form filling, voice interaction, and voice retrieval, improving human-computer interaction efficiency.
<div align="center">
<img src="https://user-images.githubusercontent.com/16698950/168589100-a6c6f346-97bb-47b2-ac26-8d50e71fddc5.png" width="400">
</div>
See [Intelligent Voice Command Parsing](./applications/speech_cmd_analysis) for detailed usage.
# Document Intelligence Applications
**Table of Contents**
- [1. Introduction to Document Intelligence](#文档智能应用简介)
- [2. Technical Highlights](#技术特色介绍)
  - [2.1 Multilingual Cross-Modal Foundation Model](#多语言跨模态训练基座)
  - [2.2 Broad Scenario Coverage](#多场景覆盖)
- [3. Quick Start](#快速开始)
  - [3.1 Out of the Box](#开箱即用)
  - [3.2 Industrial Pipeline Solutions](#产业级流程方案)
## 1. Introduction to Document Intelligence
Document Intelligence (DI) refers to **understanding, classifying, extracting, and summarizing the text and rich layout information contained in web pages, digital documents, or scanned documents by means of AI**. It is widely used in industries such as finance, insurance, energy, logistics, and healthcare; common scenarios include key information extraction, document parsing, and document comparison over multimodal documents such as expense reports, resumes, corporate financial reports, contracts, chattel registration certificates, court judgments, and logistics receipts.
Real-world applications must cope with heterogeneous document formats, diverse layouts, multiple information modalities, open-ended requirements, and scarce business data all at once. To address these pain points and challenges of the document intelligence field, PaddleNLP will continue to open-source a series of industrial practice examples that solve developers' real application problems.
<div align="center">
<img width="1000" height="270" alt="文档智能技术一般流程" src="https://user-images.githubusercontent.com/40840292/196361583-6b1c66d1-6a9b-4193-949a-71e2d420a82a.png">
</div>
<a name="技术特色介绍"></a>
## 2. Technical Highlights
<a name="多语言跨模态训练基座"></a>
### 2.1 Multilingual Cross-Modal Foundation Model
Recently, Baidu Wenxin Document Intelligence, built on [ERNIE-Layout](http://arxiv.org/abs/2210.06155), a large document-intelligence model enhanced with multilingual cross-modal layout knowledge, refreshed the state of the art on 11 document-intelligence tasks across five categories. Backed by the Wenxin ERNIE large model and layout-knowledge-enhancement techniques, it jointly models text, image, and layout information, enabling deep understanding and analysis of multimodal documents (document images, PDF files, scans, and so on) and providing a SOTA model foundation for upper-layer applications.
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/196373896-597f6178-4c78-41a1-bb12-796546644b32.png width="600"/>
</div>
<a name="多场景覆盖"></a>
### 2.2 Broad Scenario Coverage
Below are some application scenarios of document intelligence technology:
- Invoice extractive QA
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/196118171-fd3e49a0-b9f1-4536-a904-c48f709a2dec.png height=350 width=1000 hspace='10'/>
</div>
- Poster extractive QA
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/195610368-04230855-62de-439e-b708-2c195b70461f.png height=600 width=1000 hspace='15'/>
</div>
- Web page extractive QA
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/195611613-bdbe692e-d7f2-4a2b-b548-1a933463b0b9.png height=350 width=1000 hspace='10'/>
</div>
- Table extractive QA
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/195610692-8367f1c8-32c2-4b5d-9514-a149795cf609.png height=350 width=1000 hspace='10'/>
</div>
- Exam paper extractive QA
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/195823294-d891d95a-2ef8-4519-be59-0fedb96c00de.png height=700 width=1000 hspace='10'/>
</div>
- Multilingual extractive QA on English receipts (Chinese, English, Japanese, Thai, Spanish, Russian)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/195610820-7fb88608-b317-45fc-a6ab-97bf3b20a4ac.png height=400 width=1000 hspace='15'/>
</div>
- Multilingual extractive QA on Chinese receipts (Simplified Chinese, Traditional Chinese, English, Japanese, French)
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/195611075-9323ce9f-134b-4657-ab1c-f4892075d909.png height=350 width=1000 hspace='15'/>
</div>
- Demo images can be downloaded [here](https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/demo.zip)
<a name="快速开始"></a>
## 3. Quick Start
<a name="开箱即用"></a>
### 3.1 Out of the Box
DocPrompt, an open-source open-domain document extractive QA model built on ERNIE-Layout, accurately understands combined text and image information, reasons over and learns additional knowledge, and precisely captures every detail in images, PDFs, and other multimodal documents.
🧾 Try DocPrompt on the [Hugging Face Space](https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout):
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/195749427-864d7744-1fd1-455e-99c6-53a260776483.jpg height=700 width=1100 hspace='10'/>
</div>
#### Taskflow
DocPrompt can be called in three lines of code via ``paddlenlp.Taskflow`` and provides multilingual document extractive QA. Some usage scenarios are shown below:
- Input format
```
[
{"doc": "./invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]},
{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}
]
```
PaddleOCR is used for OCR by default. You can also pass in your own OCR results via ``word_boxes``, in the format ``List[str, List[float, float, float, float]]``:
```
[
{"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes}
]
```
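As a concrete illustration, a hand-built ``word_boxes`` payload might look like the following; the tokens and pixel coordinates are made up, and the per-entry shape (one token plus one box) is an assumption based on the format above:
```python
word_boxes = [
    ["发票号码", [60.0, 30.0, 180.0, 52.0]],  # assumed entry shape: [token, [x1, y1, x2, y2]]
    ["No44527206", [190.0, 30.0, 330.0, 52.0]],
]
docs = [{"doc": "./invoice.jpg", "prompt": ["发票号码是多少?"], "word_boxes": word_boxes}]
```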
- Supports single and batch prediction
- Supports local image paths as input
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/194748579-f9e8aa86-7f65-4827-bfae-824c037228b3.png height=800 hspace='20'/>
</div>
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> docprompt = Taskflow("document_intelligence")
>>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}]))
[{'prompt': '五百丁本次想要担任的是什么职位?',
'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]},
{'prompt': '五百丁是在哪里上的大学?',
'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]},
{'prompt': '大学学的是什么专业?',
'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}]
```
- Supports HTTP image URLs as input
<div align="center">
<img src=https://user-images.githubusercontent.com/40840292/194748592-e20b2a5f-d36b-46fb-8057-86755d188af0.jpg height=400 hspace='10'/>
</div>
```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> docprompt = Taskflow("document_intelligence")
>>> pprint(docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}]))
[{'prompt': '发票号码是多少?',
'result': [{'end': 2, 'prob': 0.74, 'start': 2, 'value': 'No44527206'}]},
{'prompt': '校验码是多少?',
'result': [{'end': 233,
'prob': 1.0,
'start': 231,
'value': '01107 555427109891646'}]}]
```
- Configurable parameters
  * `batch_size`: batch size; adjust it to your machine. Default: 1.
  * `lang`: the PaddleOCR language. `ch` works on mixed Chinese-English images; `en` works better on English-only images. Default: `ch`.
  * `topn`: if the model finds multiple results, return the top-n with the highest probability. Default: 1.
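These options are passed when constructing the Taskflow instance, for example:
```python
>>> from paddlenlp import Taskflow
>>> docprompt = Taskflow("document_intelligence", lang="en", batch_size=2, topn=2)
```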
<a name="产业级流程方案"></a>
### 3.2 Industrial Pipeline Solutions
To address the pain points and challenges of the document intelligence field, PaddleNLP will continue to open-source a series of document-intelligence industrial practice examples that solve developers' real application problems.
- 👉 [Cross-modal intelligent QA over car manuals](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/document_intelligence/doc_vqa#readme)
More: Baidu's TextMind intelligent document analysis platform provides one-stop document-intelligence services covering document information extraction, text content review, enterprise document management, document format parsing, and document comparison. It has grown into a complete scenario-based enterprise solution serving document-processing needs across banking, brokerage, legal, energy, media, telecom, logistics, and other industries, using AI to help enterprises upgrade to intelligent offices and digital transformation. For in-depth discussion and business cooperation, see: https://ai.baidu.com/tech/nlp/Textanalysis
## References
- [文档智能:数据集、模型和应用](http://jcip.cipsc.org.cn/CN/abstract/abstract3331.shtml)
- [ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](http://arxiv.org/abs/2210.06155)
OCR_process/*.json
*.png
*.json
answers/*
checkpoints/*
__pycache__/*
OCR_process/demo_pics/*
Rerank/log/*
Rerank/checkpoints/*
Rerank/data/*
Rerank/output/*
Rerank/__pycache__/*
Extraction/log/*
Extraction/checkpoints/*
Extraction/data/*
Extraction/output/*
Extraction/__pycache__/*
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import sys

import numpy as np


def get_top1_from_ranker(path):
    with open(path, "r", encoding="utf-8") as f:
        scores = [float(line.strip()) for line in f.readlines()]
    top_id = np.argmax(scores)
    return top_id


def get_ocr_result_by_id(path, top_id):
    with open(path, "r", encoding="utf-8") as f:
        reses = f.readlines()
    res = reses[top_id]
    return json.loads(res)


def write_to_file(doc, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(doc, f, ensure_ascii=False)
        f.write("\n")


if __name__ == "__main__":
    question = sys.argv[1]
    ranker_result_path = "../Rerank/data/demo.score"
    ocr_result_path = "../OCR_process/demo_ocr_res.json"
    save_path = "data/demo_test.json"
    top_id = get_top1_from_ranker(ranker_result_path)
    doc = get_ocr_result_by_id(ocr_result_path, top_id)
    doc["question"] = question
    doc["img_id"] = str(top_id + 1)
    write_to_file(doc, save_path)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import collections
import json
import sys

import numpy as np
import paddle
from paddle.io import Dataset
from tqdm import tqdm

sys.path.insert(0, "../")

class DocVQAExample(object):
    def __init__(self, question, doc_tokens, doc_boxes=[], answer=None, labels=None, image=None):
        self.question = question
        self.doc_tokens = doc_tokens
        self.doc_boxes = doc_boxes
        self.image = image
        self.answer = answer
        self.labels = labels

class DocVQAFeatures(object):
    """A single set of features of data."""

    def __init__(self, example_index, input_ids, input_mask, segment_ids, boxes=None, label=None):
        self.example_index = example_index
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.boxes = boxes
        self.label = label

class DocVQA(Dataset):
    def __init__(
        self, args, tokenizer, label2id_map, max_seq_len=512, max_query_length=20, max_doc_length=512, max_span_num=1
    ):
        super(DocVQA, self).__init__()
        self.tokenizer = tokenizer
        self.label2id_map = label2id_map
        self.max_seq_len = max_seq_len
        self.max_query_length = max_query_length
        self.max_doc_length = max_doc_length
        self.max_span_num = max_span_num
        self.sample_list = None
        self.args = args
        self.docvqa_inputs = self.docvqa_input()

    def check_is_max_context(self, doc_spans, cur_span_index, position):
        """Check if this is the 'max context' doc span for the token."""
        # Because of the sliding window approach taken to scoring documents, a single
        # token can appear in multiple documents. E.g.
        #  Doc: the man went to the store and bought a gallon of milk
        #  Span A: the man went to the
        #  Span B: to the store and bought
        #  Span C: and bought a gallon of
        #  ...
        #
        # Now the word 'bought' will have two scores from spans B and C. We only
        # want to consider the score with "maximum context", which we define as
        # the *minimum* of its left and right context (the *sum* of left and
        # right context will always be the same, of course).
        #
        # In the example the maximum context for 'bought' would be span C since
        # it has 1 left context and 3 right context, while span B has 4 left context
        # and 0 right context.
        best_score = None
        best_span_index = None
        for (span_index, doc_span) in enumerate(doc_spans):
            end = doc_span.start + doc_span.length - 1
            if position < doc_span.start:
                continue
            if position > end:
                continue
            num_left_context = position - doc_span.start
            num_right_context = end - position
            score = min(num_left_context, num_right_context) + 0.01 * doc_span.length
            if best_score is None or score > best_score:
                best_score = score
                best_span_index = span_index
        return cur_span_index == best_span_index

def convert_examples_to_features(
self, examples, tokenizer, label_map, max_seq_length, max_span_num, max_doc_length, max_query_length
):
if "[CLS]" in self.tokenizer.get_vocab():
start_token = "[CLS]"
end_token = "[SEP]"
else:
start_token = "<s>"
end_token = "</s>"
features = []
for (example_index, example) in enumerate(examples):
query_tokens = tokenizer.tokenize(example.question)
if len(query_tokens) > max_query_length:
query_tokens = query_tokens[0:max_query_length]
all_doc_tokens = example.doc_tokens
all_doc_boxes_tokens = example.doc_boxes
cls_token_box = [0, 0, 0, 0]
sep_token_box = [1000, 1000, 1000, 1000]
pad_token_box = [0, 0, 0, 0]
ques_token_box = [0, 0, 0, 0]
# The -3 accounts for [CLS], [SEP] and [SEP]
max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
# We can have documents that are longer than the maximum sequence length.
# To deal with this we do a sliding window approach, where we take chunks
# of the up to our max length with a stride of `doc_stride`.
_DocSpan = collections.namedtuple("DocSpan", ["start", "length"])
doc_spans = []
start_offset = 0
while start_offset < len(all_doc_tokens):
length = len(all_doc_tokens) - start_offset
if length > max_tokens_for_doc:
length = max_tokens_for_doc
doc_spans.append(_DocSpan(start=start_offset, length=length))
if start_offset + length == len(all_doc_tokens):
break
start_offset += length
spans_input_ids = []
spans_input_mask = []
spans_segment_ids = []
spans_boxes_tokens = []
for (doc_span_index, doc_span) in enumerate(doc_spans):
if doc_span_index == max_span_num:
break
tokens = []
boxes_tokens = []
token_is_max_context = {}
segment_ids = []
tokens.append(start_token)
boxes_tokens.append(cls_token_box)
segment_ids.append(0)
for token in query_tokens:
tokens.append(token)
boxes_tokens.append(ques_token_box)
segment_ids.append(0)
tokens.append(end_token)
boxes_tokens.append(sep_token_box)
segment_ids.append(0)
for i in range(doc_span.length):
split_token_index = doc_span.start + i
is_max_context = self.check_is_max_context(doc_spans, doc_span_index, split_token_index)
token_is_max_context[len(tokens)] = is_max_context
tokens.append(all_doc_tokens[split_token_index])
boxes_tokens.append(all_doc_boxes_tokens[split_token_index])
segment_ids.append(0)
tokens.append(end_token)
boxes_tokens.append(sep_token_box)
segment_ids.append(0)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
boxes_tokens.append(pad_token_box)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
assert len(boxes_tokens) == max_seq_length
spans_input_ids.append(input_ids)
spans_input_mask.append(input_mask)
spans_segment_ids.append(segment_ids)
spans_boxes_tokens.append(boxes_tokens)
# Padding
# padding spans
# max_span_num: max_seg_num
# spans_input_ids: the tokens in each segment
if len(spans_input_ids) > max_span_num:
spans_input_ids = spans_input_ids[0:max_span_num]
spans_input_mask = spans_input_mask[0:max_span_num]
spans_segment_ids = spans_segment_ids[0:max_span_num]
spans_boxes_tokens = spans_boxes_tokens[0:max_span_num]
while len(spans_input_ids) < max_span_num:
tokens = []
boxes_tokens = []
segment_ids = []
tokens.append(start_token)
boxes_tokens.append(cls_token_box)
segment_ids.append(0)
for token in query_tokens:
tokens.append(token)
boxes_tokens.append(ques_token_box)
segment_ids.append(0)
tokens.append(end_token)
boxes_tokens.append(sep_token_box)
segment_ids.append(0)
tokens.append(end_token)
boxes_tokens.append(sep_token_box)
segment_ids.append(0)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
boxes_tokens.append(pad_token_box)
spans_input_ids.append(input_ids)
spans_input_mask.append(input_mask)
spans_segment_ids.append(segment_ids)
spans_boxes_tokens.append(boxes_tokens)
# padding labels
labels = example.labels
sep_id = tokenizer.convert_tokens_to_ids(end_token)
labels = ["O"] * (spans_input_ids[0].index(sep_id) + 1) + labels
if len(labels) > 512:
labels = labels[:512]
if len(labels) < 512:
labels += ["O"] * (512 - len(labels))
assert len(spans_input_ids[0]) == len(labels)
label_ids = [label_map.get(l, 0) for l in labels]
feature = DocVQAFeatures(
example_index=example_index,
input_ids=spans_input_ids,
input_mask=spans_input_mask,
segment_ids=spans_segment_ids,
boxes=spans_boxes_tokens,
label=label_ids,
)
features.append(feature)
return features
def create_examples(self, data, is_test=False):
"""Creates examples for the training and dev sets."""
examples = []
for sample in tqdm(data, total=len(data)):
question = sample["question"]
doc_tokens = sample["document"]
doc_boxes = sample["document_bbox"]
labels = sample["labels"] if not is_test else []
x_min, y_min = min(doc_boxes, key=lambda x: x[0])[0], min(doc_boxes, key=lambda x: x[2])[2]
x_max, y_max = max(doc_boxes, key=lambda x: x[1])[1], max(doc_boxes, key=lambda x: x[3])[3]
width = x_max - x_min
height = y_max - y_min
if max(width, height) < 1000:
scale_x = 1
scale_y = 1
else:
scale_x = 1000 / max(width, height)
scale_y = 1000 / max(width, height)
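# Normalize box coordinates into LayoutXLM's 0-1000 grid, preserving the
# aspect ratio, and reorder each box from [x_min, x_max, y_min, y_max] to
# [x_min, y_min, x_max, y_max].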
scaled_doc_boxes = [
[
round((b[0] - x_min) * scale_x),
round((b[2] - y_min) * scale_y),
round((b[1] - x_min) * scale_x),
round((b[3] - y_min) * scale_y),
]
for b in doc_boxes
]
for box, oribox in zip(scaled_doc_boxes, doc_boxes):
if box[0] < 0:
print(box, oribox)
if box[2] - box[0] < 0:
print(box, oribox)
if box[3] - box[1] < 0:
print(box, oribox)
for pos in box:
if pos > 1000:
print(width, height, box, oribox)
example = DocVQAExample(
question=question, doc_tokens=doc_tokens, doc_boxes=scaled_doc_boxes, labels=labels
)
examples.append(example)
return examples
def docvqa_input(self):
data = []
if self.args.do_train:
dataset = self.args.train_file
elif self.args.do_test:
dataset = self.args.test_file
with open(dataset, "r", encoding="utf8") as f:
for index, line in enumerate(f):
data.append(json.loads(line.strip()))
# read the examples from the train/test JSON-lines file
examples = self.create_examples(data, is_test=self.args.do_test)
features = self.convert_examples_to_features(
examples,
self.tokenizer,
self.label2id_map,
max_seq_length=self.max_seq_len,
max_doc_length=self.max_doc_length,
max_span_num=self.max_span_num,
max_query_length=self.max_query_length,
)
all_input_ids = paddle.to_tensor([f.input_ids for f in features], dtype="int64")
all_input_mask = paddle.to_tensor([f.input_mask for f in features], dtype="int64")
all_segment_ids = paddle.to_tensor([f.segment_ids for f in features], dtype="int64")
all_bboxes = paddle.to_tensor([f.boxes for f in features], dtype="int64")
all_labels = paddle.to_tensor([f.label for f in features], dtype="int64")
self.sample_list = [
np.array(all_input_ids),
np.array(all_input_mask),
np.array(all_segment_ids),
np.array(all_bboxes),
np.array(all_labels),
]
def __getitem__(self, idx):
return (
self.sample_list[0][idx],
self.sample_list[1][idx],
self.sample_list[2][idx],
self.sample_list[3][idx],
self.sample_list[4][idx],
)
def __len__(self):
return self.sample_list[0].shape[0]
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
import paddle.fluid as fluid
import paddle.nn as nn
from paddlenlp.transformers import LayoutXLMPretrainedModel
class Crf_decoding(paddle.fluid.dygraph.Layer):
def __init__(self, param_attr, size=None, is_test=True, dtype="float32"):
super(Crf_decoding, self).__init__()
self._dtype = dtype
self._size = size
self._is_test = is_test
self._param_attr = param_attr
self._transition = self.create_parameter(
attr=self._param_attr, shape=[self._size + 2, self._size], dtype=self._dtype
)
@property
def weight(self):
return self._transition
@weight.setter
def weight(self, value):
self._transition = value
def forward(self, input, label=None, length=None):
viterbi_path = self._helper.create_variable_for_type_inference(dtype=self._dtype)
this_inputs = {"Emission": [input], "Transition": self._transition, "Label": label}
if length is not None:
this_inputs["Length"] = [length]
self._helper.append_op(
type="crf_decoding",
inputs=this_inputs,
outputs={"ViterbiPath": [viterbi_path]},
attrs={
"is_test": self._is_test,
},
)
return viterbi_path
class Chunk_eval(paddle.fluid.dygraph.Layer):
def __init__(self, num_chunk_types, chunk_scheme, excluded_chunk_types=None):
super(Chunk_eval, self).__init__()
self.num_chunk_types = num_chunk_types
self.chunk_scheme = chunk_scheme
self.excluded_chunk_types = excluded_chunk_types
def forward(self, input, label, seq_length=None):
precision = self._helper.create_variable_for_type_inference(dtype="float32")
recall = self._helper.create_variable_for_type_inference(dtype="float32")
f1_score = self._helper.create_variable_for_type_inference(dtype="float32")
num_infer_chunks = self._helper.create_variable_for_type_inference(dtype="int64")
num_label_chunks = self._helper.create_variable_for_type_inference(dtype="int64")
num_correct_chunks = self._helper.create_variable_for_type_inference(dtype="int64")
this_input = {"Inference": [input], "Label": [label]}
if seq_length is not None:
this_input["SeqLength"] = [seq_length]
self._helper.append_op(
type="chunk_eval",
inputs=this_input,
outputs={
"Precision": [precision],
"Recall": [recall],
"F1-Score": [f1_score],
"NumInferChunks": [num_infer_chunks],
"NumLabelChunks": [num_label_chunks],
"NumCorrectChunks": [num_correct_chunks],
},
attrs={
"num_chunk_types": self.num_chunk_types,
"chunk_scheme": self.chunk_scheme,
"excluded_chunk_types": self.excluded_chunk_types or [],
},
)
return (precision, recall, f1_score, num_infer_chunks, num_label_chunks, num_correct_chunks)
class Linear_chain_crf(paddle.fluid.dygraph.Layer):
def __init__(self, param_attr, size=None, is_test=False, dtype="float32"):
super(Linear_chain_crf, self).__init__()
self._param_attr = param_attr
self._dtype = dtype
self._size = size
self._is_test = is_test
self._transition = self.create_parameter(
attr=self._param_attr, shape=[self._size + 2, self._size], dtype=self._dtype
)
@property
def weight(self):
return self._transition
@weight.setter
def weight(self, value):
self._transition = value
def forward(self, input, label, length=None):
alpha = self._helper.create_variable_for_type_inference(dtype=self._dtype)
emission_exps = self._helper.create_variable_for_type_inference(dtype=self._dtype)
transition_exps = self._helper.create_variable_for_type_inference(dtype=self._dtype)
log_likelihood = self._helper.create_variable_for_type_inference(dtype=self._dtype)
this_inputs = {"Emission": [input], "Transition": self._transition, "Label": [label]}
if length is not None:
this_inputs["Length"] = [length]
self._helper.append_op(
type="linear_chain_crf",
inputs=this_inputs,
outputs={
"Alpha": [alpha],
"EmissionExps": [emission_exps],
"TransitionExps": transition_exps,
"LogLikelihood": log_likelihood,
},
attrs={
"is_test": self._is_test,
},
)
return log_likelihood
class LayoutXLMForTokenClassification_with_CRF(LayoutXLMPretrainedModel):
def __init__(self, layoutxlm, num_classes, dropout=None):
super(LayoutXLMForTokenClassification_with_CRF, self).__init__()
self.num_classes = num_classes
self.layoutxlm = layoutxlm
self.dropout = nn.Dropout(dropout if dropout is not None else self.layoutxlm.config["hidden_dropout_prob"])
self.emission_classifier = nn.Linear(self.layoutxlm.config["hidden_size"], self.num_classes)
self.emission_classifier.apply(self.init_weights)
self.linear_chain_crf = Linear_chain_crf(
size=self.num_classes, param_attr=paddle.fluid.ParamAttr(name="liner_chain_crfw")
)
self.crf_decoding = Crf_decoding(param_attr=paddle.fluid.ParamAttr(name="crfw_decode"), size=self.num_classes)
self.crf_decoding.weight = self.linear_chain_crf.weight
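# Viterbi decoding must use the same transition matrix that is learned by the
# linear-chain CRF loss, so the parameter is shared between the two layers.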
self.crfw = fluid.layers.create_parameter(
shape=[self.num_classes + 2, self.num_classes], dtype="float32", name="crfw"
)
self.mask_crfw = fluid.layers.create_parameter(
shape=[self.num_classes + 2, self.num_classes], dtype="float32", name="mask_matrix"
)
def get_input_embeddings(self):
return self.layoutxlm.embeddings.word_embeddings
def forward(
self,
input_ids=None,
bbox=None,
attention_mask=None,
token_type_ids=None,
labels=None,
image=None,
position_ids=None,
head_mask=None,
is_train=False,
):
input_ids = input_ids.squeeze(axis=1)
bbox = bbox.squeeze(axis=1)
attention_mask = attention_mask.squeeze(axis=1)
token_type_ids = token_type_ids.squeeze(axis=1)
outputs = self.layoutxlm(
input_ids=input_ids,
bbox=bbox,
image=image,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
)
seq_length = input_ids.shape[1]
# split the encoder output into text-token states and trailing image-region
# states; only the text part feeds the CRF emission classifier
sequence_logits, _ = outputs[0][:, :seq_length], outputs[0][:, seq_length:]
emission = self.emission_classifier(sequence_logits)
length = paddle.sum(attention_mask, axis=1)
labels = labels.reshape([-1, seq_length, 1])
# standard crf loss
crf_cost = self.linear_chain_crf(input=emission, label=labels, length=length)
crf_decode = self.crf_decoding(input=emission, length=length)
if is_train:
return [crf_cost]
else:
return [crf_cost, crf_decode]
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import logging
import os
import random
import warnings
from collections import Counter
import numpy as np
import paddle
from docvqa import DocVQA
from model import LayoutXLMForTokenClassification_with_CRF
from paddlenlp.transformers import LayoutXLMModel, LayoutXLMTokenizer
warnings.filterwarnings("ignore")
logger = logging.getLogger(__name__)
def parse_args():
parser = argparse.ArgumentParser()
# yapf: disable
parser.add_argument("--model_name_or_path", default=None, type=str, required=True)
parser.add_argument("--do_train", default=False, type=bool, required=False)
parser.add_argument("--do_test", default=False, type=bool, required=False)
parser.add_argument("--test_file", default=None, type=str, required=False)
parser.add_argument("--train_file", default=None, type=str, required=False)
parser.add_argument("--output_dir", default=None, type=str, required=True)
parser.add_argument("--max_seq_len", default=512, type=int)
parser.add_argument("--max_query_length", default=20, type=int)
parser.add_argument("--max_doc_length", default=512, type=int)
parser.add_argument("--max_span_num", default=1, type=int)
parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for eval.")
parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.")
parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
parser.add_argument("--eval_steps", type=int, default=10, help="eval every X updates steps.")
parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.")
parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
parser.add_argument("--init_checkpoint", type=str, default=None, help="the initialized checkpoint")
parser.add_argument("--save_path", type=str, default=None, help="the initialized checkpoint")
# yapf: enable
args = parser.parse_args()
return args
def set_seed(args):
random.seed(args.seed)
np.random.seed(args.seed)
paddle.seed(args.seed)
def get_label_maps():
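# BIOE-style tagging over document tokens: B-ans/I-ans/E-ans mark the
# beginning/inside/end of the answer span, everything else is "O".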
labels = ["O", "I-ans", "B-ans", "E-ans"]
label2id_map = {label: idx for idx, label in enumerate(labels)}
id2label_map = {idx: label for idx, label in enumerate(labels)}
return label2id_map, id2label_map
def main(args):
os.makedirs(args.output_dir, exist_ok=True)
logging.basicConfig(
filename=os.path.join(args.output_dir, "train.log") if paddle.distributed.get_rank() == 0 else None,
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if paddle.distributed.get_rank() == 0 else logging.WARN,
)
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
logger.addHandler(ch)
label2id_map, id2label_map = get_label_maps()
pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index
# dist mode
if paddle.distributed.get_world_size() > 1:
paddle.distributed.init_parallel_env()
tokenizer = LayoutXLMTokenizer.from_pretrained(args.model_name_or_path)
if args.do_test:
model = LayoutXLMForTokenClassification_with_CRF.from_pretrained(args.init_checkpoint)
evaluate(args, model, tokenizer, label2id_map, id2label_map, pad_token_label_id, global_step=0)
exit(0)
if args.init_checkpoint:
logger.info("Init checkpoint from {}".format(args.init_checkpoint))
model = LayoutXLMForTokenClassification_with_CRF.from_pretrained(args.init_checkpoint)
else:
base_model = LayoutXLMModel.from_pretrained(args.model_name_or_path)
model = LayoutXLMForTokenClassification_with_CRF(base_model, num_classes=len(label2id_map), dropout=None)
# dist mode
if paddle.distributed.get_world_size() > 1:
model = paddle.DataParallel(model)
train_dataset = DocVQA(
args,
tokenizer,
label2id_map,
max_seq_len=args.max_seq_len,
max_query_length=args.max_query_length,
max_doc_length=args.max_doc_length,
max_span_num=args.max_span_num,
)
train_sampler = paddle.io.DistributedBatchSampler(
train_dataset, batch_size=args.per_gpu_train_batch_size, shuffle=False
)
args.train_batch_size = args.per_gpu_train_batch_size * max(1, paddle.distributed.get_world_size())
train_dataloader = paddle.io.DataLoader(
train_dataset, batch_sampler=train_sampler, num_workers=0, use_shared_memory=True, collate_fn=None
)
t_total = len(train_dataloader) * args.num_train_epochs
# build linear decay with warmup lr sch
lr_scheduler = paddle.optimizer.lr.PolynomialDecay(
learning_rate=args.learning_rate, decay_steps=t_total, end_lr=0.0, power=1.0
)
if args.warmup_steps > 0:
lr_scheduler = paddle.optimizer.lr.LinearWarmup(
lr_scheduler, args.warmup_steps, start_lr=0, end_lr=args.learning_rate
)
optimizer = paddle.optimizer.AdamW(
learning_rate=lr_scheduler,
parameters=model.parameters(),
epsilon=args.adam_epsilon,
weight_decay=args.weight_decay,
)
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(train_dataset))
logger.info(" Num Epochs = %d", args.num_train_epochs)
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
logger.info(
" Total train batch size (w. parallel, distributed) = %d",
args.train_batch_size * paddle.distributed.get_world_size(),
)
logger.info(" Total optimization steps = %d", t_total)
global_step = 0
tr_loss = 0.0
set_seed(args)
for epoch_id in range(args.num_train_epochs):
print("epoch id:{}".format(epoch_id))
for step, batch in enumerate(train_dataloader):
model.train()
input_ids, input_mask, segment_ids, bboxes, labels = batch
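# skip the final incomplete batch; a fixed batch size is assumed downstream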
if input_ids.shape[0] != args.per_gpu_train_batch_size:
continue
outputs = model(
input_ids=input_ids,
bbox=bboxes,
attention_mask=input_mask,
token_type_ids=segment_ids,
labels=labels,
is_train=True,
)
# the model returns a list: [crf_cost] in training mode, [crf_cost, crf_decode] otherwise
loss = outputs[0]
loss = loss.mean()
if global_step % 50 == 0:
logger.info(
"[epoch {}/{}][iter: {}/{}] lr: {:.5f}, train loss: {:.5f}, ".format(
epoch_id,
args.num_train_epochs,
step,
len(train_dataloader),
lr_scheduler.get_lr(),
float(loss),
)
)
loss.backward()
tr_loss += loss.item()
optimizer.step()
lr_scheduler.step() # Update learning rate schedule
optimizer.clear_grad()
global_step += 1
if paddle.distributed.get_rank() == 0 and args.save_steps > 0 and global_step % args.save_steps == 0:
# Save model checkpoint
output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
os.makedirs(output_dir, exist_ok=True)
if paddle.distributed.get_rank() == 0:
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
paddle.save(args, os.path.join(output_dir, "training_args.bin"))
logger.info("Saving model checkpoint to %s", output_dir)
def _tokenize_chinese_chars(text):
"""
:param text: input text, unicode string
:return:
tokenized text, list
"""
def _is_chinese_char(cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like all of the other languages.
if (
(cp >= 0x4E00 and cp <= 0x9FFF)
or (cp >= 0x3400 and cp <= 0x4DBF) #
or (cp >= 0x20000 and cp <= 0x2A6DF) #
or (cp >= 0x2A700 and cp <= 0x2B73F) #
or (cp >= 0x2B740 and cp <= 0x2B81F) #
or (cp >= 0x2B820 and cp <= 0x2CEAF) #
or (cp >= 0xF900 and cp <= 0xFAFF)
or (cp >= 0x2F800 and cp <= 0x2FA1F) #
): #
return True
return False
output = []
buff = ""
for char in text:
cp = ord(char)
if _is_chinese_char(cp) or char == "=":
if buff != "":
output.append(buff)
buff = ""
output.append(char)
else:
buff += char
if buff != "":
output.append(buff)
return output
def fast_f1(text1, text2):
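"""Character-overlap F1 between two sequences. The result is
order-independent, since F1 = 2 * common / (len(text1) + len(text2)),
e.g. fast_f1(list("abcd"), list("abed")) == 0.75 (3 common chars).
"""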
common_char = Counter(text1) & Counter(text2)
len_seq1 = len(text1)
len_seq2 = len(text2)
len_common = sum(common_char.values())
if len_common == 0:
return 0.0
precision = 1.0 * len_common / len_seq2
recall = 1.0 * len_common / len_seq1
return (2.0 * precision * recall) / (precision + recall)
def _normalize(in_str):
"""
normalize the input unicode string
"""
in_str = in_str.lower()
sp_char = [
":",
"_",
"`",
",",
"。",
":",
"?",
"!",
"(",
")",
"“",
"”",
";",
"’",
"《",
"》",
"……",
"·",
"、",
",",
"「",
"」",
"(",
")",
"-",
"~",
"『",
"』",
"|",
]
out_segs = []
for char in in_str:
if char in sp_char:
continue
else:
out_segs.append(char)
return "".join(out_segs)
def calc_f1_score(answer, prediction):
ans_segs = _tokenize_chinese_chars(_normalize(answer))
prediction_segs = _tokenize_chinese_chars(_normalize(prediction))
f1 = fast_f1(prediction_segs, ans_segs)
return f1
def decode(tokenizer, res):
sep_id = tokenizer._convert_token_to_id("</s>")
text_res = []
all_f1 = []
save_f1 = []
for i in range(len(res)):
input_ids, label_ids, predict_ids, bbox = res[i]
remove_pos = (
len(" ".join([str(x) for x in input_ids]).split("2 6 ")[0].strip(" ").split(" ")) + 2
) # remove the question bbox and sep bbox
start_pos = input_ids.index(sep_id)
query_text = []
for idx in range(1, start_pos):
input_id = input_ids[idx]
query_text.append(tokenizer._convert_id_to_token(int(input_id)))
# label texts and predict texts
text_label, text_predict = [], []
label_bbox_index, predict_bbox_index = [], []
for idx in range(start_pos + 1, len(input_ids)):
input_id, label_id, predict_id = input_ids[idx], label_ids[idx], predict_ids[idx]
if label_id in [1, 2, 3]:
text_label.append(tokenizer._convert_id_to_token(int(input_id)))
label_bbox_index.append(idx - remove_pos + 1)
if predict_id in [1, 2, 3]:
text_predict.append(tokenizer._convert_id_to_token(int(input_id)))
predict_bbox_index.append(idx - remove_pos + 1)
text_res.append(
["".join(query_text), "".join(text_label), "".join(text_predict), label_bbox_index, predict_bbox_index]
)
f1 = calc_f1_score("".join(text_label), "".join(text_predict))
save_f1.append(f1)
if len("".join(text_label)) > 10:
all_f1.append(f1)
if len(all_f1) > 0:
print("F1: ", sum(all_f1) / len(all_f1))
assert len(text_res) == len(save_f1)
return text_res
def evaluate(args, model, tokenizer, label2id_map, id2label_map, pad_token_label_id, prefix="", global_step=0):
eval_dataset = DocVQA(
args, tokenizer, label2id_map, max_seq_len=512, max_query_length=20, max_doc_length=512, max_span_num=1
)
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, paddle.distributed.get_world_size())
eval_dataloader = paddle.io.DataLoader(
eval_dataset, batch_size=args.eval_batch_size, num_workers=0, use_shared_memory=True, collate_fn=None
)
# Eval!
logger.info("***** Running evaluation %s *****", prefix)
logger.info(" Num examples = %d", len(eval_dataset))
logger.info(" Batch size = %d", args.eval_batch_size)
model.eval()
res = []
for idx, batch in enumerate(eval_dataloader):
with paddle.no_grad():
input_ids, input_mask, segment_ids, bboxes, labels = batch
if input_ids.shape[0] != args.eval_batch_size:
continue
outputs = model(
input_ids=input_ids,
bbox=bboxes,
attention_mask=input_mask,
token_type_ids=segment_ids,
labels=labels,
is_train=False,
)
labels = labels.numpy()
crf_decode = outputs[1].numpy()
bboxes = bboxes.squeeze().numpy()
input_ids = input_ids.squeeze(axis=1).numpy()
for index in range(input_ids.shape[0]):
res.append([list(input_ids[index]), list(labels[index]), list(crf_decode[index]), bboxes[index]])
origin_inputs = []
with open(args.test_file, "r", encoding="utf8") as f:
for line in f:
line = json.loads(line.strip())
origin_inputs.append(
{
"img_name": line["img_name"],
"question": line["question"],
"bboxes": line["document_bbox"],
"img_id": line["img_id"],
}
)
text_res = decode(tokenizer, res)
with open(args.save_path, "w", encoding="utf8") as f:
for line_res, line_text, line_label in zip(res, text_res, origin_inputs):
line_json = {}
line_json["img_name"] = line_label["img_name"]
line_json["img_id"] = line_label["img_id"]
line_json["question"] = line_label["question"]
line_json["label_answer"] = line_text[1]
line_json["predict_answer"] = line_text[2]
label_bbox_index, predict_bbox_index = line_text[3], line_text[4]
label_bboxes, predict_bboxes = [], []
for i in range(len(line_label["bboxes"])):
if i in label_bbox_index:
label_bboxes.append(line_label["bboxes"][i])
if i in predict_bbox_index:
predict_bboxes.append(line_label["bboxes"][i])
line_json["label_bboxes"] = label_bboxes
line_json["predict_bboxes"] = predict_bboxes
json.dump(line_json, f, ensure_ascii=False)
f.write("\n")
def print_arguments(args):
"""print arguments"""
print("----------- Configuration Arguments -----------")
for arg, value in sorted(vars(args).items()):
print("%s: %s" % (arg, value))
print("------------------------------------------------")
if __name__ == "__main__":
args = parse_args()
print_arguments(args)
main(args)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
export CUDA_VISIBLE_DEVICES=0
QUESTION="$1"
python3 change_to_mrc.py "${QUESTION}"
python3 ./run_docvqa.py \
--model_name_or_path "layoutxlm-base-uncased" \
--max_seq_len 512 \
--do_test true \
--test_file "data/demo_test.json" \
--num_train_epochs 100 \
--eval_steps 6000 \
--save_steps 6000 \
--output_dir "output/" \
--save_path "data/decode_res.json" \
--init_checkpoint "./checkpoints/layoutxlm/" \
--learning_rate 3e-5 \
--warmup_steps 12000 \
--per_gpu_train_batch_size 4 \
--per_gpu_eval_batch_size 1 \
--seed 2048
python3 view.py
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
export CUDA_VISIBLE_DEVICES=0
python3 ./run_docvqa.py \
--model_name_or_path "layoutxlm-base-uncased" \
--max_seq_len 512 \
--train_file "data/train.json" \
--init_checkpoint "checkpoints/base_model" \
--do_train true \
--num_train_epochs 50 \
--eval_steps 24000 \
--save_steps 40 \
--output_dir "output" \
--save_path "data/decode_res.json" \
--learning_rate 3e-5 \
--warmup_steps 40 \
--per_gpu_train_batch_size 4 \
--per_gpu_eval_batch_size 4 \
--seed 2048
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import cv2
import json
import numpy as np
def view_ocr_result(img_path, bboxes, opath):
image = cv2.imread(img_path)
for char_bbox in bboxes:
x_min, x_max, y_min, y_max = char_bbox
cv2.rectangle(image, (x_min, y_min), (x_max, y_max), (0, 0, 255), 1)
cv2.imwrite(opath, image)
def _highlight_bbox(img, bbox):
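# Blend a semi-transparent yellow rectangle (G and R channels in BGR) over
# the given [x_min, x_max, y_min, y_max] region to highlight the answer.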
x = bbox[0]
w = bbox[1] - x
y = bbox[2]
h = bbox[3] - y
sub_img = img[y : y + h, x : x + w]
colored_rect = np.zeros(sub_img.shape, dtype=np.uint8)
colored_rect[:, :, 2] = 255
colored_rect[:, :, 1] = 255
res = cv2.addWeighted(sub_img, 0.5, colored_rect, 0.5, 1.0)
img[y : y + h, x : x + w] = res
def highlight_ans(source_img_path, output_img_path, ans_bbox):
image = cv2.imread(source_img_path)
for bbox in ans_bbox:
_highlight_bbox(image, bbox)
cv2.imwrite(output_img_path, image)
def highlight_img(source_img_path, output_img_path):
image = cv2.imread(source_img_path)
height = image.shape[0]
width = image.shape[1]
bbox = [0, width - 1, 0, height - 1]
_highlight_bbox(image, bbox)
cv2.imwrite(output_img_path, image)
if __name__ == "__main__":
res_path = "./data/decode_res.json"
result = {}
with open(res_path, "r", encoding="utf-8") as f:
line = f.readline()
result = json.loads(line.strip())
img_path = "../OCR_process/demo_pics/demo_{}.png".format(result["img_id"])
img_save_path = "../answer.png"
highlight_ans(img_path, img_save_path, result["predict_bboxes"])
print("extraction result has been saved to answer.png")
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
import re
from paddleocr import PaddleOCR
from paddlenlp.transformers import LayoutXLMTokenizer
tokenizer = LayoutXLMTokenizer.from_pretrained("layoutxlm-base-uncased")
def get_all_chars(tokenizer):
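# Scan the first 30000 Unicode codepoints and collect those that the
# sentencepiece tokenizer does not round-trip to a single character; these
# codepoints need special handling when aligning tokens with OCR boxes.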
all_chr = []
for i in range(30000):
tok_chr = tokenizer.tokenize(chr(i))
tok_chr = [tc.replace("▁", "") for tc in tok_chr]
while "" in tok_chr:
tok_chr.remove("")
tok_chr = "".join(tok_chr)
if len(tok_chr) != 1:
all_chr.append(i)
return all_chr
def merge_bbox(tok_bboxes):
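# Merge the per-character boxes of one token into a single box. If the merged
# box is much taller or wider than the characters themselves (i.e. the token
# wraps across lines), fall back to the first character's box instead.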
min_gx = min([box[0] for box in tok_bboxes])
max_gx = max([box[1] for box in tok_bboxes])
min_gy = min([box[2] for box in tok_bboxes])
max_gy = max([box[3] for box in tok_bboxes])
height_g = max_gy - min_gy
width_g = max_gx - min_gx
height_m = 0
width_m = 0
for box in tok_bboxes:
x_min, x_max, y_min, y_max = box
height_m += y_max - y_min
width_m += x_max - x_min
height_m = height_m / len(tok_bboxes)
if (height_g - height_m) < 0.5 * height_m and width_g - width_m < 0.1 * width_m:
return False, [min_gx, max_gx, min_gy, max_gy]
else:
return True, tok_bboxes[0]
def xlm_parse(ocr_res, tokenizer):
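# Align LayoutXLM sentencepiece tokens with the per-character OCR boxes by
# walking both sequences in parallel; the long if/elif chain below handles
# multi-character tokens, "<unk>" and a number of glyph-specific edge cases.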
doc_bboxes = []
all_chr = get_all_chars(tokenizer)
try:
new_tokens, new_token_boxes = [], []
for item in ocr_res:
new_tokens.extend(item["tokens"])
new_token_boxes.extend(item["token_box"])
# get layoutxlm tokenizer results and get the final results
temp_span_text = "".join(new_tokens)
temp_span_bbox = new_token_boxes
span_text = ""
span_bbox = []
# drop blank space
for text, bbox in zip(temp_span_text, temp_span_bbox):
if text == " ":
continue
else:
span_text += text
span_bbox += [bbox]
# the first sentencepiece token carries the "▁" marker; strip it
span_tokens = tokenizer.tokenize(span_text)
span_tokens[0] = span_tokens[0].replace("▁", "")
while "" in span_tokens:
span_tokens.remove("")
doc_bboxes = []
i = 0
for tid, tok in enumerate(span_tokens):
tok = tok.replace("▁", "")
if tok == "":
doc_bboxes.append(span_bbox[i])
continue
if tok == "<unk>":
if tid + 1 == len(span_tokens):
tok_len = 1
else:
if span_tokens[tid + 1] == "<unk>":
tok_len = 1
else:
for j in range(i, len(span_text)):
if span_text[j].lower() == span_tokens[tid + 1][0]:
break
tok_len = j - i
elif ord(span_text[i]) in all_chr:
if tid + 1 == len(span_tokens):
tok_len = 1
elif "°" in tok and "C" in span_tokens[tid + 1]:
tok_len = len(tok) - 1
if tok_len == 0:
doc_bboxes.append(span_bbox[i])
continue
elif span_text[i] == "ⅱ":
if tok == "ii":
if span_text[i + 1] != "i":
tok_len = len(tok) - 1
else:
tok_len = len(tok)
elif tok == "i":
tok_len = len(tok) - 1
if tok_len == 0:
doc_bboxes.append(span_bbox[i])
continue
elif "m" in tok and "2" == span_tokens[tid + 1][0]:
tok_len = len(tok) - 1
if tok_len == 0:
doc_bboxes.append(span_bbox[i])
continue
elif ord(span_text[i + 1]) in all_chr:
tok_len = 1
else:
for j in range(i, len(span_text)):
if span_text[j].lower() == span_tokens[tid + 1][0]:
break
if span_text[j].lower() == "," and span_tokens[tid + 1][0] == ",":
break
if span_text[j].lower() == ";" and span_tokens[tid + 1][0] == ";":
break
if span_text[j].lower() == ")" and span_tokens[tid + 1][0] == ")":
break
if span_text[j].lower() == "(" and span_tokens[tid + 1][0] == "(":
break
if span_text[j].lower() == "¥" and span_tokens[tid + 1][0] == "¥":
break
tok_len = j - i
else:
if "�" == span_text[i]:
tok_len = len(tok) + 1
elif tok == "......" and "…" in span_text[i : i + 6]:
tok_len = len(tok) - 2
elif "ⅱ" in span_text[i + len(tok) - 1]:
if tok == "i":
tok_len = 1
else:
tok_len = len(tok) - 1
elif "°" in tok and "C" in span_tokens[tid + 1]:
tok_len = len(tok) - 1
else:
tok_len = len(tok)
assert i + tok_len <= len(span_bbox)
tok_bboxes = span_bbox[i : i + tok_len]
_, merged_bbox = merge_bbox(tok_bboxes)
doc_bboxes.append(merged_bbox)
i = i + tok_len
except Exception:
print("xlm_parse: failed to align tokens with OCR boxes, falling back to padding")
span_tokens = ["▁"] * 512
doc_bboxes = [[0, 0, 0, 0]] * 512
return span_tokens, doc_bboxes
def tokenize_ocr_res(ocr_reses):
"""
input:
ocr_res: the ocr result of the image
return:
new_reses: {
pid: {
"text": all text in each ocr_res,
"bounding_box": the bounding box of the ocr_res,
"tokens": all chars in ocr_res,
"token_box: bounding box of each chars in ocr_res
}
}
"""
new_reses = []
for img_name, ocr_res in ocr_reses:
new_res = []
for para in ocr_res:
text = para["text"]
text_box = para["bbox"]
x_min, y_min = [int(min(idx)) for idx in zip(*text_box)]
x_max, y_max = [int(max(idx)) for idx in zip(*text_box)]
text_chars = list(text.lower())
char_num = 0
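# Full-width (e.g. CJK) characters are roughly twice as wide as ASCII ones,
# so count them as two width units when splitting the line box per character.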
for char in text_chars:
if re.match("[^\x00-\xff]", char):
char_num += 2
else:
char_num += 1
width = x_max - x_min
shift = x_min
new_token_boxes, new_tokens = [], []
for char in text_chars:
if re.match("[^\x00-\xff]", char):
tok_x_max = shift + width / char_num * 2
else:
tok_x_max = shift + width / char_num * 1
tok_x_min = shift
tok_y_min = y_min
tok_y_max = y_max
shift = tok_x_max
new_token_boxes.append([round(tok_x_min), round(tok_x_max), tok_y_min, tok_y_max])
new_tokens.append(char)
new_res.append(
{
"text": para["text"],
"bounding_box": para["bbox"],
"tokens": new_tokens,
"token_box": new_token_boxes,
}
)
new_reses.append((img_name, new_res))
return new_reses
def process_input(ocr_reses, tokenizer, save_ocr_path):
ocr_reses = tokenize_ocr_res(ocr_reses)
examples = []
for img_name, ocr_res in ocr_reses:
doc_tokens, doc_bboxes = xlm_parse(ocr_res, tokenizer)
doc_tokens.insert(0, "▁")
doc_bboxes.insert(0, doc_bboxes[0])
example = {"img_name": img_name, "document": doc_tokens, "document_bbox": doc_bboxes}
examples.append(example)
with open(save_ocr_path, "w", encoding="utf8") as f:
for example in examples:
json.dump(example, f, ensure_ascii=False)
f.write("\n")
print(f"ocr parsing results has been save to: {save_ocr_path}")
def ocr_preprocess(img_dir):
ocr = PaddleOCR(use_angle_cls=True, lang="ch", use_gpu=True)
ocr_reses = []
img_names = sorted(os.listdir(img_dir), key=lambda x: int(x.split("_")[1].split(".")[0]))
for img_name in img_names:
img_path = os.path.join(img_dir, img_name)
parsing_res = ocr.ocr(img_path, cls=True)[0]
ocr_res = []
for para in parsing_res:
ocr_res.append({"text": para[1][0], "bbox": para[0]})
ocr_reses.append((img_name, ocr_res))
return ocr_reses
if __name__ == "__main__":
img_dir = "./demo_pics"
save_path = "./demo_ocr_res.json"
ocr_results = ocr_preprocess(img_dir)
process_input(ocr_results, tokenizer, save_path)
# Cross-Modal Intelligent QA for Automobile Manuals
## 1. Project Overview
**Cross-modal document QA** is a cross-modal document-extraction task: a document-intelligence model must extract from a document the answer to a question about that document. The model has to extract and understand the document's textual content while also making full use of visual information such as layout, fonts and colors, which makes the task more challenging than single-modality information extraction.
Built on cross-modal machine reading comprehension, this kind of intelligent QA can deeply parse the complex text/figure/table layouts of unstructured documents and directly locate the answer to a question.
This project applies cross-modal document QA to build an **automobile-manual QA system** that automatically finds and returns answers to user questions from an automobile manual.
As shown in the figure below, when a user asks "如何更换前风窗玻璃的刮水片" ("How do I replace the windshield wiper blade?"), the cross-modal document QA engine retrieves the relevant document from the collection, extracts the answer with a cross-modal reading-comprehension model, and highlights it.
<center><img width="883" alt="image" src="https://user-images.githubusercontent.com/35913314/169781111-0734729d-3c7b-400d-8e92-e56548bb7dc5.png"></center>
An automobile-manual QA system can greatly relieve the pressure on traditional after-sales service:
- Users: they rarely have the patience to search the manual, and calling customer service means waiting
- After-sales service: a large support staff is required, and training agents on domain knowledge takes a long time
- FAQ construction: building a bank of common questions takes substantial manual effort, and a fixed question bank can hardly cover flexible, varied phrasings
For users, the system can provide instant question answering through the in-car assistant, an app or a mini-program; for common questions they no longer need to consult the manual or call customer service, which relieves the load on human agents.
For service agents, the system helps them locate answers quickly and consult documents efficiently, raising their domain expertise while shortening the training cycle.
## 2. Installation
#### Requirements
- paddlepaddle == 2.3.2
- paddlenlp == 2.5.2
- paddleocr == 2.6.1.3
For installation issues, see the [PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html) and [PaddleNLP](https://paddlenlp.readthedocs.io/zh/latest/get_started/installation.html) documentation.
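A minimal environment matching the pinned versions above can, for example, be set up with pip (assuming Python 3; pick a CUDA-enabled PaddlePaddle build per the installation guide if you use a GPU):
```shell
pip install paddlepaddle==2.3.2 paddlenlp==2.5.2 paddleocr==2.6.1.3
```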
## 3. Overall Pipeline
Given a user's question about the car, the automobile-manual QA system intelligently locates the corresponding answer in the manual and returns it. The usage flow is shown in the figure below; the system consists of three modules: an OCR processing module, a ranking module and a cross-modal reading-comprehension module.
Before the QA model can answer questions, PaddleOCR is used to parse the offline automobile-manual documents, and the parsed results are saved for later use by the ranking module.
A user question is first passed to the ranking module, which scores the parsed documents against the question. The result is then passed to the cross-modal reading-comprehension module, which extracts the answer from the highest-scoring manual page and returns it to the user.
<center><img width="864" alt="image" src="https://user-images.githubusercontent.com/35913314/170222662-c438ff2a-a1df-44e5-8a83-f14dc0814b9d.png"></center>
The modules are described in detail below.
## 4. OCR Processing Module
This project ships an automobile manual consisting of 10 images. To ease later processing, PaddleOCR is first used to recognize the manual pages, recording the text and its layout so that computer-vision and NLP techniques can then be applied to the QA task.
The manual images can be downloaded [here](https://paddlenlp.bj.bcebos.com/images/applications/automobile.tar.gz). After downloading, extract them into `./OCR_process/demo_pics`, then parse the images with PaddleOCR using the following commands.
```shell
cd OCR_process/
python3 ocr_process.py
cd ..
```
The parsed results are stored in `./OCR_process/demo_ocr_res.json`.
## 5. Ranking Module
Searching every manual image for the answer to a question would be slow and wasteful of resources, so a ranking module based on [RocketQA](https://arxiv.org/pdf/2010.08191.pdf) is used. It scores the manual images against the user's question so that the most relevant image can be selected, and the cross-modal reading-comprehension module then extracts the answer from it.
This project provides 140 manual-related training samples for the ranking model, plus a pre-trained RocketQA-based baseline, base_model, which can be further fine-tuned on these samples.
The ranking training set can be downloaded [here](https://paddlenlp.bj.bcebos.com/data/automobile_rerank_train.tsv). Rename it to `train.tsv` and place it under `./Rerank/data/`.
base_model is a ranking model trained on the [Dureader retrieval](https://arxiv.org/abs/2203.10232) dataset and can be downloaded [here](https://paddlenlp.bj.bcebos.com/models/base_ranker.tar.gz). After extraction you get a directory `base_model` containing the model; place it under `./Rerank/checkpoints`.
Train it as follows:
```shell
cd Rerank
bash run_train.sh ./data/train.tsv ./checkpoints/base_model 50 1
cd ..
```
The arguments are, in order: the training-data path, the base_model path, the number of training epochs and the number of nodes.
After training, rename the model directory to `ranker` and place it under `./checkpoints/`. You can then score the manual images against a given question:
```shell
cd Rerank
bash run_test.sh 后备箱怎么开
cd ..
```
The last argument is the user question. When the command finishes, the score file is saved to `./Rerank/data/demo.score`.
## 6. Cross-Modal Reading-Comprehension Module
This module takes the image with the highest ranking score and uses the cross-modal language model LayoutXLM to extract the answer to the user's question from it. The answer is then highlighted on the image and returned to the user.
This project provides 28 manual-related training samples for the reading-comprehension model, plus a pre-trained baseline, base_model, which can be further fine-tuned on these samples to strengthen the model's understanding of the automobile-manual domain.
The reading-comprehension training set can be downloaded [here](https://paddlenlp.bj.bcebos.com/data/automobile_mrc_train.json). Rename it to `train.json` and place it under `./Extraction/data/`.
base_model is a cross-modal reading-comprehension model trained on the [Dureader VIS](https://aclanthology.org/2022.findings-acl.105.pdf) dataset and can be downloaded [here](https://paddlenlp.bj.bcebos.com/models/base_mrc.tar.gz). After extraction you get a directory `base_model`; place it under `./Extraction/checkpoints`.
Train it as follows:
```shell
cd Extraction
bash run_train.sh
cd ..
```
After training, rename the model directory to `layoutxlm` and place it under `./checkpoints/`. You can then extract the answer to a given question from the top-scoring manual image:
```shell
cd Extraction
bash run_test.sh 后备箱怎么开
cd ..
```
The last argument is the user question. When the command finishes, the final result is saved to `./answer.png`.
## 7. End-to-End Prediction
The project also supports one-command end-to-end prediction:
```shell
bash run_test.sh 后备箱怎么开
```
The last argument is the user question; the final result is saved to `./answer.png`.
**Note**: before running this command, make sure the original manual images have been parsed with the commands described in Section 4.
The figure below shows three user questions: "后备箱怎么开" ("How do I open the trunk?"), "钥匙怎么充电" ("How do I charge the key?") and "NFC解锁注意事项" ("NFC unlocking precautions"). The automobile-manual QA system locates the answers precisely and highlights them.
<center><img src="https://user-images.githubusercontent.com/35913314/169012902-1a42bd14-976f-4da8-b5b5-d8e7352b68df.png"/></center>
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import json
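# Build the Rerank module's input: one "question\t\tparagraph\t0" line per
# OCR-parsed page, with the sentencepiece marker stripped from each token.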
question = sys.argv[1]
with open("../OCR_process/demo_ocr_res.json", "r", encoding="utf8") as f:
paras = []
for line in f:
line = json.loads(line.strip())
document = line["document"]
para = []
for token in document:
token = token.replace("▁", "")
para.append(token)
paras.append("".join(para))
with open("./data/demo.tsv", "w", encoding="utf8") as f:
for para in paras:
f.write("{}\t\t{}\t0\n".format(question, para))